Pluralsight Reviews: R Data Visualization

Beginning Data Visualization with R is Matthew Renze’s second course on R for Pluralsight, and it follows from the first on exploratory data analysis (review here). In this second course, Matthew takes a look at univariate and bivariate data visualization.

The flow of the course is similar to the first: Matthew’s examples are based around a fictitious company and use the Open Movie Database for raw data. In this particular course, we spend less time learning about transformations and more about the three biggest graphics systems within R: the base R graphics library, Lattice, and ggplot2. Matthew does a good job of showing how to create various forms of plots with these different libraries and when to use each type (e.g., box plots, bar graphs, and scatter plots). He also shows cases in which certain libraries do not have particular graph types.

I think this was a great course. If I had one knock, it’s that some of the examples suffered due to the data set. For example, Matthew showed us heatmaps and density plots of the number of movies by runtime. The plots worked and showed what you can do, but I would have preferred a data set with at least two clusters so you can really see the power of these chart types.

Regardless of that minor complaint, if you are interested in learning how to perform visualizations with R, this is a great course. I’m looking forward to his forthcoming intermediate- and advanced-level courses as well.

The Marginal DBA

You Have A Performance Problem; What Do You Do?

Brent Ozar had a blog post about what hardware you can purchase with the price of two cores of Enterprise Edition and argues that you should probably spend some more money on hardware.  Gianluca Sartori has a blog post along similar lines.  By contrast, Andre Kamman had an entire SQLBits talk about not throwing hardware at the problem.  And finally, Jen McCown argues that you should spend the money on Enterprise Edition.  When you have a performance problem, who’s right?

The answer is, all of them.  And to head off any comments, my first paragraph is mostly a setup; ask any of these authors and I’m sure they’ll tell you that “it depends” is the right answer.  The purpose of this post is to dig a bit deeper and discuss when each of these points is the correct answer, looking at things from an economist’s perspective.

Detour Into Basic Economics

Before I get too deep into application of economic principles to the world of database administration, let’s cover these principles really quick so we’re all on the same page.

Scarcity

Your data center is only so big.  Your rack can only hold so many servers.  Your credit card has a limit, and so does your bank account.  At some point, your budget runs out.  And even if your budget is phenomenal, there are technological limits to how big a server can be, or even how big a cluster of servers can be.  No matter how much cash, how many top-tier experts, and how much time you have available, there is always a limit.

I’ve personally never had to worry about hitting the 128 TB RAM limit in Windows Server 2012 R2.  Even if you had a server capable of using 128 TB of RAM, I’m sure it’d be so expensive that there’s no way I’d ever be allowed near it.

This, in short, is scarcity:  you will always run out of resources before you run out of problems.

Opportunity Cost

In a world of scarcity, we can figure out the true cost of something:  what is the next-best alternative to this?  For example, let’s say we have the option to buy $100K worth of hardware.  What other alternatives exist?  We could bring in a consultant and pay that person $100K to tune queries.  We could send the current staff out for $100K worth of training.  We could hire a new person at $100K (although this is a stream of income rather than a one-time payment, so it’s a bit trickier of an answer).  Or we could buy $100K worth of software to help solve our problem.  Dollar values here simply help focus the mind; even without currencies, we can still understand the basic concept:  the actual cost of a good or service is what you forego when you decide to make the trade to obtain that good or service.

Understanding opportunity cost is critical to making good decisions.  It also helps lead into the third major topic today.

Marginal Utility

One of the toughest problems of pre-Marginal Revolution economics was the diamond paradox.  The short version of the diamond paradox is as follows:  water is more valuable than diamonds, in the sense that a person can live with water but no diamonds, but not vice versa.  So why is water so inexpensive, yet diamonds are so expensive?

The answer is in marginal utility.  The idea behind marginal utility is that your valuation of a thing changes as you obtain more of that thing.  The first glass of water for a thirsty man in the desert is worth quite a bit; the 900th glass of water, not so much.  In hardware terms, going from 8 GB of RAM to 16 GB of RAM is worth more to you than going from 192 GB to 200 GB.  The 8-16 jump allows you to do quite a bit that you might not have been able to do before; the jump from 192 to 200, even though it is the same total RAM difference, opens many fewer doors.
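
To put a rough number on that intuition, here’s a back-of-the-envelope sketch in R.  The logarithmic utility function is purely an assumption for the sake of the example; the only point is that the same 8 GB jump is worth far less the more RAM you already have.

# Back-of-the-envelope illustration of diminishing marginal utility, assuming
# (purely for this example) that the value of RAM is logarithmic in the amount installed.
utility <- function(gb) { log2(gb) }

utility(16) - utility(8)     # the 8 GB -> 16 GB jump: a gain of one full doubling
utility(200) - utility(192)  # the 192 GB -> 200 GB jump: a gain of roughly 0.06 doublings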

Detour’s Over; Back To The Main Point

Armed with these principles of economics, let’s dive into the problem.  To keep things simple, I’m going to think about three competing areas for our IT budgets:

  1. Purchasing newer, better, or more hardware
  2. Paying people to write better code or tune application performance
  3. Purchasing Enterprise Edition licenses rather than Standard Edition (or APS appliances rather than Enterprise Edition)

Applying the scarcity principle first, we know we can’t afford all of these things.  If your company can afford all of the items on the list, then this whole topic’s moot.  But honestly, I’ve never seen a company whose budget was so unlimited that they could keep hiring more and more people to fix code while buying more and more hardware and getting more and more expensive software.  At some point, you hit a limit and need to make the hard decision.

Here’s a simplistic interpretation of those limits:

LinearConstraints1.png

In this example, I have a set of hardware, queries, and SQL Server versions that I allocate to OLTP processing and OLAP processing.  I have three constraints (which I’ve made linear for the sake of simplicity):  hardware, application code, and SQL Server versions.  The idea behind this two-dimensional linear constraint picture is that you are able to pick any point on the X-Y plane which is less than or equal to ALL constraints.  In other words, you can pick any spot in the purple-lined section of the image.

In this scenario, the application is tuned reasonably well and the version of SQL Server isn’t going to help us much; we simply don’t have powerful enough hardware.  This might be something like trying to run a production SQL Server instance on 8 GB of RAM, or running on 5400 RPM hard drives.  Even if we hire people to tune queries like mad and push the red boundary out further, it doesn’t matter:  we will still have the same hardware bottleneck.

By contrast, in this case, if we purchase new hardware, we can shift the purple curve out.  Let’s say we get some fancy new hardware which solves the bottleneck.  Now our constraint problem might look something like the following:

LinearConstraints2.png

We still need to pick a spot somewhere in the purple-lined section, but notice that our constraint is no longer hardware.  In fact, we have two potential constraints:  version limitations and application limitations.  The answer to “what do we need to do?” just got a bit more difficult.  If we are in a situation in which we lean heavily toward OLTP activity, our next bottleneck is the application:  now it’s time to rewrite queries to perform a bit better.  By contrast, if we’re in a big warehouse environment, we will want to upgrade our SQL Server edition to take advantage of OLAP features we can’t get in Standard Edition (e.g., clustered columnstore indexes or partitioning).

Can We Make This Practical?

In the real world, you aren’t going to create a linear programming problem with defined constraints…probably.  Nevertheless, the important part of the above section is the principle:  solve your tightest constraint.
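
That said, if you ever did want to formalize the picture above, here’s a minimal sketch using the lpSolve package.  Every coefficient is made up purely for illustration; the payoff is that the solver tells you which constraint is binding, and that binding constraint is the “tightest constraint” you’d attack first.

install.packages("lpSolve")
library(lpSolve)

# Two decision variables: units of OLTP work and units of OLAP work.
# Three linear constraints: hardware, application code, and SQL Server version/edition.
objective   <- c(1, 1)                # value a unit of OLTP and OLAP work equally
constraints <- matrix(c(2, 3,         # hardware
                        4, 1,         # application code
                        1, 5),        # SQL Server version/edition
                      nrow = 3, byrow = TRUE)
rhs <- c(60, 80, 100)

solution <- lp("max", objective, constraints, c("<=", "<=", "<="), rhs)
solution$solution                     # the OLTP/OLAP mix at the optimum
constraints %*% solution$solution     # rows that reach their rhs value are the binding constraints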

The Easy Stuff

Here’s a bit of low-hanging fruit:

  1. If you’re running production SQL Server instances with less than 32 GB of RAM, you almost certainly have a hardware constraint.  Buy that RAM.  Going back to Brent’s post, you can buy a brand new server for $7K, but if you have a server that can hold at least 128 GB of RAM, it looks like you can buy that RAM for about a grand (and I’m guessing you can probably get it for less; that’s retail price).
  2. If you haven’t done any query tuning, you can probably find two or three costly queries and tune them easily.  They may not be the biggest, nastiest queries, but they help you reduce server load.  Similarly, find two or three missing indexes and add them; a rough sketch of that DMV query follows this list.
  3. If you query relatively small sets of static data, put up a Redis server and cache those results.  A client of mine did that and went from about 85 million SQL Server queries per day to 60,000.  Even if your queries are perfectly optimized, that’s an incredible difference.
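
For the second item, here’s a minimal R sketch of what “find two or three missing indexes” might look like, using RODBC and the missing index DMVs.  The server name and connection string are placeholders, and the DMV output is only a suggestion list, not a set of indexes you should create blindly.

library(RODBC)
conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Database=master;Trusted_Connection=yes")

missing.indexes <- sqlQuery(conn, "
SELECT TOP(10)
	mid.statement AS TableName,
	mid.equality_columns,
	mid.inequality_columns,
	mid.included_columns,
	migs.user_seeks,
	migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS RoughImprovementMeasure
FROM sys.dm_db_missing_index_group_stats migs
	INNER JOIN sys.dm_db_missing_index_groups mig
		ON migs.group_handle = mig.index_group_handle
	INNER JOIN sys.dm_db_missing_index_details mid
		ON mig.index_handle = mid.index_handle
ORDER BY
	RoughImprovementMeasure DESC;")
missing.indexes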

The nice thing about low-hanging fruit is that the opportunity cost is extremely low.  Spend a grand for RAM, 3-5 hours tuning queries, and a few hundred dollars a month on Premium Redis caching.  For a company of any reasonable size, those are rounding errors.

The Hard Stuff

Once you get beyond the low-hanging fruit, you start to see scarcity creeping back in.  You’ve got reasonable hardware specs, but maybe that Pure Storage array will help out…or maybe you need to bring in top-notch consultants to optimize queries in a way you could never do…or maybe it’s time to buy Enterprise Edition and take advantage of those sweet, sweet features like In-Memory OLTP.

Finding the right answer here is a lot harder of a problem.  One place to start looking is wait stats.  SQL Server keeps track of reasons why queries are waiting, be they hardware-related (disk, IO, CPU), scheduling related, network bandwidth related, or even because your application can’t handle how fast SQL Server’s pushing data across to it.  There’s a bit of interpretation involved with understanding wait stats, but Paul Randal is putting together the definitive resource for understanding wait stats.
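
If you want a quick starting point, here’s a minimal sketch of pulling the top waits out of sys.dm_os_wait_stats with R and RODBC.  The exclusion list below is intentionally tiny; a real version should filter out many more benign wait types, which is exactly what Paul’s resource covers.

library(RODBC)
conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Database=master;Trusted_Connection=yes")

waits <- sqlQuery(conn, "
SELECT TOP(10)
	wait_type,
	waiting_tasks_count,
	wait_time_ms,
	signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TASK_STOP',
	'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;")
waits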

The rest of my advice comes down to experience and testing.  If you’re in an environment in which you can run tests on hardware before purchase, that’s an easy way to tell if hardware is your bottleneck.  In one scenario a few years back, we got to play with a solid state disk array back when those were brand new.  We ran our production SQL Server instances on it for a week or so and saw almost no performance gain.  The reason was that our servers had 8 GB of RAM and 1-2 VCPUs, so faster disk simply exacerbated CPU and memory issues.  By contrast, at my current company, we had a chance to play with some newer servers with more powerful CPUs than what we have in production and saw a nice performance boost because we’re CPU-limited.

The nice thing is that my graphics above aren’t quite accurate in one big sense:  application tweaks and hardware work in parallel, so buying new hardware can also push out the application tweak curve a bit and vice versa.  This means that, in many cases, it’s better to find the cheaper answer, be that hardware or sinking hours into application development and tuning.

Emphasizing Sunk Costs

There’s one last thing to discuss here:  the concept of sunk costs.  If you want a long-winded, rambling discussion of sunk costs, here’s one I wrote while in grad school.  For the short version, “sunk costs are sunk.”

Why is it that a company is willing to pay you $X a year to tune queries but isn’t willing to pay 0.1 * $X to get hardware that would end up solving the constraint issue above so much more easily?  The answer is that, from an accounting standpoint, the new hardware is a new cost, whereas your salary is already factored into their budget.  From the standpoint of the in-house accountant, your salary is a fixed cost (which they have to pay regardless of whether you’re spending 80 hours a week tuning queries like a boss or 5 hours a week of work & 35 hours of Facebook + Pokemon Go…at least until they fire you…) and from the accountant’s perspective, your effort doesn’t get shifted around; you’re still a DBA or database developer or whatever it is you do.  So your salary is a sunk cost.

I should mention a huge, huge point here:  salaries are sunk costs in the short term.  In the long term, salaries are neither fixed nor sunk costs.  In the long term, your manager can re-assign you to a different department (meaning that your salary is not sunk), and your salary can change or even go away if it makes sense from the company’s perspective (meaning that salary is a variable cost rather than a fixed cost).  But for our purposes, we’re interested in the short-term ramifications here, as “long-term” could mean a year or even several years.

Also, there is one big exception to the rule that your salary is a sunk cost:  if you get paid hourly, your wages become a variable cost, and it’s a lot easier to sell the company on trading variable costs for fixed costs:  “If you buy this hardware for $X, you don’t have to pay me 3 * $X to tune queries.”  That’s an actual cost reduction, whereas in the salary case, you’re still arguing for a net increase in company cost incurred.

So what can you do in that case?  Selling the company on “reduced cost” doesn’t cut it so much for salaried employees because you generally aren’t really reducing costs from the company’s standpoint.  Instead, you have to sell on opportunity cost:  if you weren’t spending 15 hours a week trying to tune queries to get this ETL process to work adequately, you could focus on Project X, which could net the company $Y in revenue (where $Y is an expected revenue return conditional upon the probability of success of Project X).  If $Y is substantially higher than the cost of hardware, you now have a solid business case you can take to higher-ups to get newer, better hardware.

Similarly, if you’re spending all of your time on application development and the company’s throwing barrels of cash at vendors for new hardware, you could make the case that getting some time to tune queries might allow the company to save money on net by deferring the cost of hardware purchases.  This is an easier case to make because, again, your salary is a sunk cost (unless you’re a wage-earning employee), so the opportunity cost comes down to what the next-best alternative is with your time.

Conclusion

For the most part, I can’t tell you whether you’re better off buying more hardware, tuning those queries, upgrading to Enterprise Edition, or doing something entirely different (at least unless you’re willing to shovel cash in my direction and have me take a look…).  What I hoped to do in this blog post was give you some basic economics tools and let you apply them to your own situation.  These concepts give you ammunition when going to management to ease your biggest constraint:  managers and accountants are more amenable to arguments framed around opportunity cost, and your pleas are less likely to fall on deaf ears once you recognize that sunk costs shouldn’t affect future behavior.

August Presentations

This is going to be a big month for me, presentation-wise.  After taking a couple of weeks off from presenting, I’ve got a jam-packed month.  Here are the events, in order:

  1. On Saturday, August 6th, I will present at SQL Saturday Louisville in Louisville, Kentucky.  I’ll give my talk on integrating SQL Server with Hadoop.
  2. On Saturday, August 13th, I will present at SQL Saturday Indianapolis in Indianapolis, Indiana.  I’m going to give my talk on integrating SQL Server with Hadoop there as well.
  3. On Tuesday, August 16th, I will give a microsession to the Triangle SQL Server Users Group.  This will be a short demonstration of using code auto-formatters to make T-SQL code more readable.
  4. On Thursday, August 18th, I am tentatively scheduled to give a talk to the Raleigh-Durham Power BI Users Group.  The talk will be on custom visualizations with Power BI, and will follow from Devin Knight’s series on the topic.
  5. On Saturday, August 20th, I will present at SQL Saturday Spartanburg in Spartanburg, South Carolina.  I will give two talks, the first on integrating SQL Server with Hadoop and the second talk on Azure ML.
  6. On Tuesday, August 23rd, I will speak at the Polyglot .NET User Group, where we’ll cover the Scala language.
  7. On Thursday, August 25th, I am tentatively scheduled to speak at the Future of Data:  Triangle user group.  My topic there will be integrating SQL Server with Hadoop.
  8. On Saturday, August 27th, I will present at SQL Saturday Columbus in Columbus, Georgia.  My topic there is the APPLY operator.

All of these events are free, and if you’re in the area, stop on by.

TIL: RetainSameConnection

I recently had to debug a co-worker’s SSIS package, trying to figure out why it wasn’t working quite right.  We have a sharded infrastructure, meaning that our customers’ data is split out among a number of servers with equivalent schemas.  For this particular package, we wanted to, for each customer, get the relevant data from a common ETL server and push results into the appropriate server for that customer.

My co-worker had set up a dynamic connection (see Rafael Salas or Hari Bagra for details on how to do this), but something weird was happening:  the package was trying to push everything to the same server.  I confirmed that if all of the customers in a load belonged to the same server, the process would work correctly, and that I could run each server’s load one at a time, so there weren’t any problems connecting to particular servers or with parameters overriding this choice.  It’s like the connection was “sticky,” connecting successfully to the first server and then ignoring later changes.

It turns out, that’s exactly the case.  He had turned on the RetainSameConnection flag for the connection manager.  If you set the value to true, then SSIS will open a single connection for the duration of the package run.  There are a couple of cases in which you might want to set this to true:

  1. You have a complex package which hits the same connections over and over again, or you have to do lots of operations in a foreach loop.  This can remove the overhead of re-opening connections over and over again, thereby improving performance.
  2. You are using global temp tables and want multiple SSIS tasks to manipulate data in those global temp tables.

My co-worker was doing option #2, setting up a global temp table.  He turned on RetainSameConnection to get that functionality working, not realizing that it necessarily precluded the other requirement, that this package be able to modify the connection to point to a different server based on the current customer.

As a note, if you set RetainSameConnection to false, you will lose the ability to use global temp tables in multiple SSIS tasks.  The reason is that the connection will close between tasks, and once the last session which references a global temp table closes, that temp table can be marked for cleanup.
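
If you want to see that cleanup rule in action outside of SSIS, here’s a small sketch using two RODBC connections to the same instance.  The table name is made up and this isn’t the package in question; it just demonstrates why the creating connection has to stay open.

library(RODBC)
conn1 <- odbcDriverConnect("Driver=SQL Server;Server=.;Database=tempdb;Trusted_Connection=yes")
conn2 <- odbcDriverConnect("Driver=SQL Server;Server=.;Database=tempdb;Trusted_Connection=yes")

sqlQuery(conn1, "CREATE TABLE ##GlobalTempDemo (Id INT);")
sqlQuery(conn1, "INSERT INTO ##GlobalTempDemo VALUES (1);")
sqlQuery(conn2, "SELECT * FROM ##GlobalTempDemo;")   # works:  the creating session is still open

odbcClose(conn1)                                     # the creating session goes away...
sqlQuery(conn2, "SELECT * FROM ##GlobalTempDemo;")   # ...and this now fails because the table was cleaned up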

In my case, I ripped out the global temp table logic and set RetainSameConnection to False.  That allowed me to use dynamic connections as expected.

SQL Saturday Distances, A Closer Look

Last week, I put together an analysis of SQL Saturdays.  There were a lot of great comments, including one by Jack Corbett, who was interested in seeing a closer analysis of how far attendees tend to travel:

Most people from Maine aren’t going to anything outside of New England, and most aren’t going to go south of Boston. Seems to limit opportunities for attendees. Instead of analyzing the distance between locations, let’s look at how far non-speakers are traveling to go to events and limit based on that.

Fortunately(?), I happen to have just the data set.  PASS has made certain Summit and SQL Saturday data available as part of the SQL Server Launch 2016 Discovery Days.  Now, let me warn you:  this is going to be a long blog post.  But hey, it’s Friday, so cozy on up with your beverage of choice and let’s get data sciencing.

Dirty Data Done Dirt Cheap

Airing of Grievances

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

  • There are some strange data type choices here.  For example, UserGUID (the anonymized identifier for individuals) is a uniqueidentifier on PASSMembersWithGeolocation, but is a different type in SQLSatRegistrations.  There are some other data type choices which aren’t ideal (for example, the SQL Saturday number is a float?!), but that’s not a huge deal.
  • What is a huge deal is that some of the postal codes are wrong.  If you look at New Hampshire addresses, the postal codes have leading 0s lopped off.  That came back to burn me in my first analysis, requiring me to re-run my geolocation function.
  • There is no breakout by event; we only see a denormalized junction of SQL Saturday events and attendees.  That can make calculations a bit harder, so I’d like to normalize this a bit.
  • The second-biggest issue is that the data set is, in my opinion, incomplete.  There are a few reasons why:
    • We only have data for people who registered, not for people who attended.  There might be some pattern in people who registered but did not attend, but we won’t be able to see that.  There are also going to be people who attended but did not register; these day-of attendees aren’t going to be in the system, but we let them in anyhow if we have room.  I think those attendees would skew toward “local” rather than “distant” drive times:  you’re not going to travel 500 miles to sneak into an event for which you didn’t register…usually.
    • There is no separation of speakers, sponsors, and “normal” attendees.  Speakers and sponsors travel across the country to visit these things (we’re weird that way), but the big question is, how far do non-speaking, non-sponsoring attendees normally travel?  I have no way of knowing for sure who is which, so I’ll try to answer that in the data cleansing section.
    • This analysis is only through about April of 2016.  There have been several events since, which could affect our analysis, but I don’t have that data.
    • There are lots of missing fields.  For example, there are 9 American SQL Saturdays which had no postal code set.  There are also a lot of attendees with missing data—they didn’t fill in the city or the postal code or the region (state).  There are even attendees with zero information filled out, and those are just going to be useless for our examination.
  • This leads to my biggest issue:  nasty data.
    • There are no controls on data entry.  If you want to set city to “Nope, not gonna tell” then it’ll save that.
    • Furthermore, all of the entries are free-form.  There are no checks to ensure that a postal code is valid, that a city exists, etc.
    • In short, there are a lot of misunderstandings, miscodings, mistrust, and misanthropy in this data set.

Limitations of Analysis

Given the issues in this data set, I have to make a couple tough calls with the data.  First, I’m only going to look at US attendees of US SQL Saturdays.  I wanted to look at US and Canada SQL Saturdays like I did in my last analysis, but I had a lot of geocoding problems with Canadian data, so I had to throw that away.  Also, I needed to throw away a lot of attendee data because I simply didn’t have enough to get a valid latitude and longitude.  So we’ll have to go ahead knowing that our analysis is incomplete and any conclusions we draw will be uncertain—but that’s the nature of statistical analysis, so we’ll trudge on nonetheless.

Doing Data Science

Initial Preparation

Grab the PassDataIndividualTables data set.  We’re going to create a database called PASSData and load the tables which are not named “_xlnm_FilterDatabase.”  Once I had that loaded, I did some basic data cleansing work:  I made UserGUID a uniqueidentifier across the board and changed SatNum to an INT instead of a FLOAT.  Then, I built a couple of tables:  dbo.SQLSaturday and dbo.SQLSaturdayAttendee. First, dbo.SQLSaturday:

CREATE TABLE dbo.SQLSaturday
(
	EventNumber INT NOT NULL,
	EventDate DATE NULL,
	EventName NVARCHAR(50) NOT NULL,
	StreetAddress NVARCHAR(100) NOT NULL,
	City NVARCHAR(100) NOT NULL,
	Country NVARCHAR(100) NOT NULL,
	PostalCode VARCHAR(14) NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturday] PRIMARY KEY CLUSTERED(EventNumber)
);

INSERT INTO dbo.SQLSaturday
(
	EventNumber,
	EventDate,
	EventName,
	StreetAddress,
	City,
	Country,
	PostalCode
)
SELECT DISTINCT
	s.SatNum,
	ss.EventDate,
	s.EventName,
	CASE
		WHEN s.StreetAddress IS NULL AND s.VenueName = 'N/A' THEN ''
		WHEN s.StreetAddress IS NULL AND s.VenueName <> 'N/A' THEN s.VenueName
		ELSE s.StreetAddress
	END,
	s.City,
	s.Country,
	s.PostalCode
FROM dbo.SQLSatRegistrations s
	OUTER APPLY
	(
		SELECT TOP(1)
			EventDate
		FROM dbo.SQLSatSessions sess
		WHERE
			sess.SatNum = s.SatNum
	) ss
WHERE
	s.Country IN ('United States');

--Tacoma 2009 and Jacksonville 2008 (#20) did not have dates set.  Set them manually.
UPDATE dbo.SQLSaturday SET EventDate = '2008-09-12' WHERE EventNumber = 20 AND EventDate IS NULL;
UPDATE dbo.SQLSaturday SET EventDate = '2009-04-01' WHERE EventNumber = 11 AND EventDate IS NULL;

I added Latitude and Longitude so that I could geocode data, and then a Geography data type so that I can calculate distances. Notice that there were two SQL Saturdays which did not have dates. I’m pretty sure that Tacoma (event #11) never took place, but I updated it and Jacksonville’s second 2008 event regardless, just to make sure everything was accurate.

Next up, I want to create a table for attendees.

CREATE TABLE dbo.SQLSaturdayAttendee
(
	UserGUID UNIQUEIDENTIFIER NOT NULL,
	City NVARCHAR(50) NULL,
	Region NVARCHAR(50) NULL,
	Country NVARCHAR(50) NULL,
	PostalCode NVARCHAR(50) NULL,
	NumberAttended INT NOT NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturdayAttendee] PRIMARY KEY CLUSTERED(UserGUID)
);

INSERT INTO dbo.SQLSaturdayAttendee
(
	UserGUID,
	City,
	Region,
	Country,
	PostalCode,
	NumberAttended
)
SELECT
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode,
	COUNT(1)
FROM dbo.SQLSatRegistrations r 
	INNER JOIN dbo.PASSMembersWithGeolocation p 
		ON r.UserGUID = p.UserGUID 
WHERE
	r.Country in ('United States')
	AND
	(
		(p.city IS NOT NULL AND p.region IS NOT NULL)
		OR (p.postalcode IS NOT NULL)
	)
	AND ISNULL(p.Country, 'United States') = 'United States'
GROUP BY
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode;

--Fix cases when postal codes had leading 0s lopped off.
UPDATE dbo.SQLSaturdayAttendee
SET
	PostalCode = RIGHT('00000' + PostalCode, 5)
WHERE
	LEN(PostalCode) < 5;

I have the same Latitude, Longitude, and GeoPoint attributes, and we’ll make use of those later. I also put on a NumberAttended attribute, which counts the number of SQL Saturday events a person has registered for. I’m limiting the attendee set to those who registered for events in the United States (r.Country) and who have either City+Region or PostalCode set. Then, I’m going to fix postal codes, adding leading 0s to ZIP Codes missing them.

Quick Analysis

We’re far enough along that I can do some quick data set analysis and see what we’ve got.

--Basic data
SELECT COUNT(1) FROM dbo.SQLSatRegistrations;
SELECT COUNT(1) FROM dbo.SQLSatRegistrations WHERE Country IN ('United States');
SELECT COUNT(1) FROM dbo.SQLSatRegistrations r INNER JOIN dbo.SQLSaturdayAttendee a ON r.UserGUID = a.UserGUID;
SELECT COUNT(1) FROM dbo.SQLSaturdayAttendee;

BasicCounts

There are 152,000 registrants in the data set, of which 101,164 registered for United States events.  Our data set contains 33,419 distinct attendees, who make up 65,682 of the 100K United States event registrants.

Geocoding Time!

I’m going to do three separate rounds of geocoding.  First, let’s geocode the SQL Saturday events.  There are only about 300 of them, so I can use the same patterns that I used in my previous post.  So let’s move over to some R code for a bit.

install.packages('RDSTK', repos = "http://cran.us.r-project.org")
install.packages("ggmap")
install.packages("RODBC")
install.packages("dplyr")
install.packages("httr")
install.packages("rjson")
install.packages("lubridate")
library(RDSTK)
library(ggmap)
library(RODBC)
library(dplyr)
library(lubridate)

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=PASSData;Provider=SQLNCLI11.1;Integrated Security=SSPI")
sqlsats <- sqlQuery(conn, "SELECT EventNumber, StreetAddress, City, Country, ISNULL(PostalCode, '') AS PostalCode from PASSData.dbo.SQLSaturday")
sqlsats$GeoLookup <- paste(sqlsats$StreetAddress, sqlsats$City, sqlsats$Country, sqlsats$PostalCode, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is ~300 observations, so we can do that without issue.
sqlsats <- cbind(sqlsats, t(sapply(sqlsats$GeoLookup, geocode, USE.NAMES=FALSE)))

#A few records failed to get lat/long
#I will handle them as one-off updates in SQL.
sqlsats
sqlsats.good <- filter(sqlsats, !is.na(lon))
sqlsats.good
# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 300 rows, it'll do.
cmd <- paste("UPDATE PASSData.dbo.SQLSaturday SET Latitude = ",sqlsats.good$lat,", Longitude = ",sqlsats.good$lon," WHERE EventNumber = ", sqlsats.good$EventNumber, sep="", collapse=";")
result <- sqlQuery(conn, cmd, as.is=TRUE)

The first dozen lines or so are loading lots of packages. I’m not going to use all of them for this first section, but we’ll need them later. Then, we make a connection to SQL Server and grab data from the SQLSaturday table. I create a GeoLookup column to make lookups a bit easier.

The next line, where we do the geocoding, is a really cool thing you can do in R. Basically, I want to take my sqlsats data frame and append latitude and longitude to it. That’s what cbind (column bind) gives us. To understand the t(sapply(…, geocode, …)) bit, you have to work from the inside out. The geocode function is what we use to get latitude and longitude from Google Maps given an address. We want to apply that function to every observation in the sqlsats data frame (in T-SQL parlance, every row in the table), so we use the sapply variant of the apply function, and use the t() function to transpose the results so we can append them to our sqlsats data frame.

From there, I filter out any SQL Saturdays missing longitude and update my SQLSaturday table with the longitudes and latitudes we grabbed.

As mentioned, there are some SQL Saturdays missing latitude and longitude, so I had to do lookups manually. There were also a couple of SQL Saturdays which were miscoded because they didn’t have states listed.

--Manual cleanup
UPDATE dbo.SQLSaturday SET Latitude = 30.2782, Longitude = -81.511850 WHERE StreetAddress = '1 UNF Drive';
UPDATE dbo.SQLSaturday SET Latitude = 41.6618, Longitude = -91.5365 WHERE StreetAddress = '101 Jessup Hall';
UPDATE dbo.SQLSaturday SET Latitude = 45.5026, Longitude = -73.5735 WHERE StreetAddress = '2000 McGill College, 5e étage';
--Columbia was marked as W. Columbia in New York, rather than Columbia South Carolina
UPDATE dbo.SQLSaturday SET Latitude = 33.9518, Longitude = -81.1157 WHERE City = 'West Columbia';
--Richmond was miscoded as well for 2010.
UPDATE dbo.SQLSaturday SET Latitude = 37.6488, Longitude = -78.7015 WHERE EventNumber = 30;

Next up, I want to geocode the attendee data. I’m going to have to do this a different way: we’re going to have way more than 1000 attendees, so I can’t use Google Maps. Instead, I’m going to use the Data Science Toolkit to perform free geocoding. It’s slower than the Google API, but at least I don’t have to pay…

I’m going to get two subsets of attendees: those who put in a valid ZIP Code (and I can throw out everything else because typically, ZIP Codes are more distinctive than cities), and those who put in valid-ish city + region combos.

--Postal codes
SELECT DISTINCT 
	a.Country,
	a.PostalCode
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NOT NULL;

--Approximately valid cities
SELECT DISTINCT
	a.City,
	a.Region,
	a.Country
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NULL
	AND LEN(City) > 2
	AND LEN(Region) > 1
	AND City NOT LIKE '%[0-9]%'				--Some people put in street addresses
	AND Region NOT LIKE '%[0-9]%'			--Some people put in strange values
	--Some people are very regional
	AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A')
	AND City NOT LIKE '%@%'					--Some people put in e-mail addresses
	AND City <> 'N/A'						--Some people don't like listing city
	AND City <> 'NA'
ORDER BY
	a.City,
	a.Region;

There are 6544 distinct postal codes in the data set, and 2406 distinct city-region pairings, adding up to 8950 lookups we’ll need to do. These 8950 lookups will apply to upwards of 33,273 attendees (which you can get by removing the DISTINCT clauses). At this point, I want to mention that we’re down to about 30K attendees from our original 93K data set (that is, there were 93,838 unique UserGUIDs in dbo.SQLSatRegistrations). If I were trying to do predictive analysis, I’d start to get really concerned here because we’re throwing away 2/3 of the attendee set. But let’s soldier on, because 30K attendees should still at least give us some idea.

Geocoding this time, as I mentioned, will be a bit trickier. Let’s get to the R:

# From http://stackoverflow.com/questions/22887833/r-how-to-geocode-a-simple-address-using-data-science-toolbox
# My own modifications include the if-else block for lookup failures
geo.dsk <- function(addr){ # single address geocode with data sciences toolkit
  require(httr)
  require(rjson)
  url      <- "http://www.datasciencetoolkit.org/maps/api/geocode/json"
  response <- GET(url,query=list(sensor="FALSE",address=addr))
  json <- fromJSON(content(response,type="text"))
  if(length(json['results'][[1]]) > 0) {
    loc  <- json['results'][[1]][[1]]$geometry$location
    return(c(address=addr,long=loc$lng, lat= loc$lat))
  } else {
    return(c(address='',long=0,lat=0))
  }
}

zips <- sqlQuery(conn, "SELECT DISTINCT a.Country, a.PostalCode FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NOT NULL;")
zips$CheckString = paste(zips$PostalCode, zips$Country, sep = " ")
zips$lat <- 0
zips$long <- 0

result <- do.call(rbind,lapply(as.character(zips$CheckString),geo.dsk))
zips <- data.frame(result)
zips
head(zips)

sqlSave(conn, zips, tablename="ZipStage")

cities <- sqlQuery(conn, "SELECT DISTINCT a.City, a.Region, ISNULL(a.Country, '') AS Country FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NULL AND LEN(City) > 2 AND LEN(Region) > 1 AND City NOT LIKE '%[0-9]%' AND Region NOT LIKE '%[0-9]%' AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A') AND City NOT LIKE '%@%' AND City <> 'N/A' AND City <> 'NA' ORDER BY a.City, a.Region;")
cities$CheckString = paste(cities$City, cities$Region, cities$Country, sep = " ")
cities$lat <- 0
cities$long <- 0

result <- do.call(rbind,lapply(as.character(cities$CheckString),geo.dsk))
cities <- data.frame(result)
cities

sqlSave(conn, cities, tablename="CitiesStage")

I copy-pasted the geo.dsk function from Stack Overflow and added in an if-else statement to check to see if json[‘results’] has values. Basically, we’re hitting the Data Science Toolkit’s geocoding API endpoint with the partial address (postal code or city + state) we have, and then we’re shredding the returned JSON to get latitude and longitude. If the lookup failed for whatever reason, I return a latitude and longitude of 0.

The first data set I pull out is ZIP Codes. I first create a data frame from my SQL query, add a CheckString variable to paste together the address components, and then call geo.dsk to perform lookups for each CheckString in zips. The results get converted into a data frame, which becomes my new zips, and I’ll use RODBC’s sqlSave function to save the data in a staging table that I’ll call ZipStage. We do the same thing for cities.

Now that I have latitudes and longitudes, I’m going to import them from my staging table into the SQLSaturdayAttendee table. Note that I didn’t paste together a massive update statement for the 9000-ish results; instead, I put them in a staging table, as that’s going to be a lot more efficient than doing 9000 update statements.

UPDATE a
SET
	Latitude = z.lat,
	Longitude = z.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.ZIPStage z
		ON CONCAT(a.PostalCode, ' ', a.Country) = z.address

UPDATE a
SET
	Latitude = c.lat,
	Longitude = c.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.CitiesStage c
		ON CONCAT(a.City, ' ', a.Region, ' ', a.Country) = c.address
		AND a.PostalCode IS NULL

--Manual cleanup of Columbus, Ohio, which was geocoded to somewhere around Berea
UPDATE dbo.SQLSaturdayAttendee SET Latitude = 39.9612, Longitude = -82.9988 WHERE City = 'Columbus' AND Region IN ('Ohio', 'OH') AND PostalCode IS NULL;

One interesting thing I found was that the geocoding service put Columbus, Ohio up near Cleveland, which skewed Columbus results significantly. I’m not sure why it did that, but it leads me to believe that there are probably a few other geocoding errors I didn’t catch. I don’t think they’re going to radically alter the results, however, so I didn’t spend any more time looking for them.

Who Speaks?

One consideration I had with this data set is that I didn’t want to include speakers and sponsors in the attendee set.  Ideally, I’d keep them separated because their goals and motivations are different.  For this data set, I’m really interested in seeing behavior patterns of non-speaker, non-sponsor attendees.  But how do I figure out who’s an attendee and who’s a speaker when the data set is completely anonymized and there aren’t any flags for speakers or sponsors of events?

Twin Peaks

My expectation is that speaker/sponsor behavior will differ from “normal” attendee behavior.  Here’s a quick data sample to show my conjecture in action.  I’m going to query everybody in ZIP Code 27703 (which is where I live).  See if you can figure out which one of these is me…

SELECT
	*
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode = '27703'
ORDER BY
	a.NumberAttended DESC;

27703Distribution

If I try to generalize this phenomenon, I’ll look for how many people attended (or at least registered for) a particular number of SQL Saturday events.  Ideally, what I’d find is a lot of people with 1, 2, or 3 events; then a dip; then an increasing number of people at some point; and finally another dip.

I’m going to use a quick T-SQL query to get the raw data, and then analyze it in R because R is perfect for this type of quick analysis.

#Figure out the cutoff for likely speakers.
cmd <- "SELECT NumberAttended, COUNT(1) AS People FROM PASSData.dbo.SQLSaturdayAttendee GROUP BY NumberAttended ORDER BY NumberAttended;"
attendee.breakdown <- sqlQuery(conn, cmd)
#A vast majority of entrants have been to one SQL Saturday.  Let's take the log of the distribution
#to see the pattern a bit better.
barplot(xlab="Number Of SQL Saturdays Attended", attendee.breakdown$People, names=attendee.breakdown$NumberAttended)
#What we see is an exponential decay pattern.  This is unfortunate for us because we
#were hoping to find a two-peaked distribution.
barplot(xlab="Number Of SQL Saturdays Attended", log(attendee.breakdown$People), names=attendee.breakdown$NumberAttended)

We’re not looking for pretty; we’re looking for fast. If I wanted pretty, I’d pull down ggplot2 and spend a while sprucing up the graphics. Anyhow, here’s what you get when you run the first barplot:

AttendedCount

That’s really hard to read because such a high percentage of attendees have been to just one SQL Saturday. So next up, I’m going to take the log of number of attendees. This will give us a better idea of the underlying distribution of data.

AttendedCountLog

I believe the appropriate response at this point is, “Well, shoot.”  We see a fairly smooth exponential decrease as the number of SQL Saturdays attended goes up.  There’s a little blip around the 13-14 mark, but you’d have to do some major squinting to make out a second distribution in the data.  As a result, I don’t have a great bifurcation mark, so I’m picking a quick and dirty number:  9.  If you’ve been to at least 9 SQL Saturdays, you’ve probably gone from “casual attendee” to speaker (or at least active volunteer).  It’s easy to think of counter-examples in places like Jacksonville, which have had quite a few more SQL Saturdays than places like Omaha, but I think 9’s about as good a cutoff point as any.

Analyzing Attendees

So now, after about 3000 words, we get to the main event.  Who’s really going to these things?  As usual, I’m going to start off with a T-SQL statement which frames my data nicely, and then perform analysis using R.

SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;

This is my data set, which returns 55019 rows. This means that we lost approximately half of our registrants as a result of data cleansing. Again, this should make us quite circumspect about predicting results, but I think we have enough data to do some descriptive analytics without making fools of ourselves.

One stipulation I want to make is that I’m cutting off any entries over 3000 miles. My belief is that nobody is flying cross-country to visit a SQL Saturday (again, excluding speakers and sponsors). The only exception I could see is Hawaii, so I’m probably skewing that result a bit…but even then, most of the long-distance travelers are probably speakers and sponsors going on vacation. If the distance I came up with is more than 3000 miles, I’m just treating it as bad data.

#Get non-speaker attendee distance from each event.
cmd <- "SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;"
attendee <- sqlQuery(conn, cmd)

From here until I get through the results, I’m going to throw in a bunch of images and single lines of code. I think the images describe what the code is doing, so I’ll mostly add color commentary instead of function call descriptions.

boxplot(attendee$DistanceInMiles, data=attendee, main="Attendee Distance Traveled", horizontal=TRUE)

AttendeeDistanceTraveled

This box and whisker plot shows a couple of things: the bottom 75 percent of attendees don’t travel all that far, and there are a lot of outliers in this data set. As useful as this visualization is, I also want to get a summary of the results.

summary(attendee$DistanceInMiles)

TotalSummary

The median distance is 19 miles, and the mean distance is 94 miles. 75% of attendees travel less than 70 miles in this data set. So where do we get 400-600 miles? Maybe it’s the 90th or 95th percentile.

quantile(attendee$DistanceInMiles, c(0.25, 0.5, 0.75, 0.9, 0.95))

TotalQuantile

95% of attendees travel under 400 miles. But how has this changed over time? I’m going to use the year function in lubridate to group my distance-in-miles box plots by year.

boxplot(attendee$DistanceInMiles~year(attendee$EventDate), main="Attendee Distance Traveled By Year", horizontal=TRUE)

AttendeeDistanceTraveledByYear

This graphic is packed, but we can see a couple of things. First, the median has dropped a bit since 2007-2008. Second, the 75th percentile has also dropped a bit. I think both of these make a lot of sense: as we get more events, people don’t have to travel as far to attend a SQL Saturday. If the only one in your region is 300 miles away, you can justify going that far once a year for a unique event. But if there are 6 in your region over the course of a year, why not just pick the closest one or two?

Still, I want to see some hard numbers, so I’m going to call the summary function for each year.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), summary)
$`2007`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   2.896   13.410   93.150  229.600  182.700 2147.000 

$`2008`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2401   10.5600   27.3500  159.9000  118.8000 2515.0000 

$`2009`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2845    9.6410   19.2300  116.2000   73.5600 2473.0000 

$`2010`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1588   10.1300   20.5500  120.6000   95.0500 2717.0000 

$`2011`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.6631   10.5800   23.2100  123.3000  106.2000 2686.0000 

$`2012`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1913    9.7210   21.3200  113.0000   92.9900 2843.0000 

$`2013`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3650   19.7400  103.6000   72.5000 2710.0000 

$`2014`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.5420   17.3000   84.3200   60.1100 2694.0000 

$`2015`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.1410   16.5700   74.9300   49.7500 2700.0000 

$`2016`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3220   19.8800   81.8100   57.3300 2685.0000 

The mean has dropped considerably from over 110-120 miles down to 75-85 since 2014. 75% of attendees now won’t go more than about an hour for an event—or, maybe phrased differently, events are finding plenty of local attendees. Let’s see what it looks like at the 90th and 95th percentiles.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), quantile, type=8, probs=c(0.25, 0.5, 0.75, 0.9, 0.95))
$`2007`
       25%        50%        75%        90%        95% 
  13.29615   93.14978  185.26700  865.47610 1025.75327 

$`2008`
      25%       50%       75%       90%       95% 
 10.56431  27.35487 118.84406 457.74800 823.98637 

$`2009`
       25%        50%        75%        90%        95% 
  9.616829  19.226355  74.048157 204.863135 466.467323 

$`2010`
      25%       50%       75%       90%       95% 
 10.12820  20.55441  95.39654 224.65303 546.36305 

$`2011`
      25%       50%       75%       90%       95% 
 10.57865  23.21103 106.20535 235.10075 571.08523 

$`2012`
       25%        50%        75%        90%        95% 
  9.717188  21.323516  92.995248 250.098624 450.006102 

$`2013`
       25%        50%        75%        90%        95% 
  9.365244  19.742029  72.515673 215.197417 439.364723 

$`2014`
       25%        50%        75%        90%        95% 
  8.541756  17.301598  60.168109 177.018031 316.380895 

$`2015`
       25%        50%        75%        90%        95% 
  8.141225  16.571739  49.746705 155.265559 259.042488 

$`2016`
       25%        50%        75%        90%        95% 
  9.320905  19.878521  57.334977 155.104415 279.424947 

Okay, I mentioned I’d focus mostly on color commentary. Here’s an exception: I specified type=8 for my quantile function. Which type you choose can make a huge difference in calculation. The link recommends type 8 when you don’t have a reason to go with a different calculation, but you can use any of the continuous calculation types.  They will differ a bit (and you can see differences between what summary provided us and what quantiles gives us), but for this data set, I don’t think you get any crazy differences by selecting type 7 or 9.
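
If you want to see how much the type can matter, here’s a quick throwaway example on a made-up, heavily skewed vector (not the SQL Saturday data); the upper percentiles from type 7 and type 8 land in very different places.

# Made-up skewed sample to show that the quantile type is not just a rounding detail.
x <- c(1, 2, 3, 5, 8, 13, 80, 250, 900)
quantile(x, probs = c(0.9, 0.95), type = 7)  # R's default
quantile(x, probs = c(0.9, 0.95), type = 8)  # the recommended continuous estimator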

Finally, this leads me to an interesting question: are we getting more registrants per event now than before? If we’re seeing event growth, attendee growth, and distance shrinkage, that indicates that SQL Saturdays are becoming more popular, drawing in people who previously wouldn’t travel the 20 miles across town to spend a day learning. Back to SQL Server once more.

SELECT
	YEAR(s.EventDate) AS EventYear,
	COUNT(DISTINCT s.EventNumber) AS NumberOfEvents,
	COUNT(1) AS NumberOfRegistrants,
	1.0 * COUNT(1) / COUNT(DISTINCT s.EventNumber) AS AverageAttendance
FROM dbo.SQLSaturday s
	INNER JOIN dbo.SQLSatRegistrations r
		ON s.EventNumber = r.SatNum
GROUP BY
	YEAR(s.EventDate)
ORDER BY
	EventYear DESC;

RegistrantAveragePerYear

This time I included anybody who is in the list, regardless of whether we could geocode that person’s location. Given that 2016 was only about a third of the way over, I’m not sure we can put much credence into those registrant counts. But we have seen an increase over the years. It’s not a major increase, but it is an increase.

Conclusions

I’ve hedged my bets a few times during this post because I know the data’s incomplete, and there’s a risk that the points of the data set I’m missing would conclusively prove me wrong.  Nevertheless, I think the likelihood of those extra data points being radically different than the ~50% that we were able to keep is fairly low, so I’ll make a couple of cautious conclusions.

First, non-speakers and non-sponsors do not appear to travel very far to SQL Saturdays.  Looking at the yearly quantile breakout above, by the time we get to 2014, people have become a bit insular in their SQL Saturday attendance.  The median traveler (after that initial event in Orlando in 2007) has never traveled more than about 20 miles.

Also, I think the 600 mile radius is too large, based on the data set.  Looking at the 95th percentile of travelers, those people haven’t gone more than 400 miles on average since 2013, and the number has dropped down to the 260-280 range.  Also, the 95th percentile, even in a heavily skewed distribution like this one, is looking far out there.  Even if you scale it back to just the 90th percentile (which is still considered an edge case area), attendees since 2014 won’t even go 200 miles.  That means that there could be some overlap between Cleveland and Columbus (which, having visited both the past couple of years, I do see), but there’d be almost no overlap between Cleveland and Cincinnati, and you’d be stretching to see much overlap between Columbus and Louisville.
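
Purely as a sanity check on those city pairs, here’s a quick sketch using the geosphere package.  The coordinates are approximate city centers I looked up myself (they are not part of the PASS data set), so treat the results as rough straight-line distances rather than drive distances.

install.packages("geosphere")
library(geosphere)

cleveland  <- c(-81.6944, 41.4993)   # lon, lat
columbus   <- c(-82.9988, 39.9612)
cincinnati <- c(-84.5120, 39.1031)
louisville <- c(-85.7585, 38.2527)

# distHaversine returns meters; divide by 1609.344 for miles, as in the queries above.
distHaversine(cleveland, columbus) / 1609.344
distHaversine(cleveland, cincinnati) / 1609.344
distHaversine(columbus, louisville) / 1609.344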

But my real conclusion is this:  grab the data set and draw your own conclusions.  I’m working through this stuff the same as anybody else could, and we’re in a community full of people who know how to analyze data.  Maybe you’ll find something I missed which either bolsters or undercuts one of my points; either way, I want to know and I want to push you to dig just a little deeper than I did.  (But hopefully what you find won’t completely invalidate my points…)

Become A Better Speaker

That’s a big title.  Expect “(on the margin)” in tiny font after the title.

What Other People Say

Let me start off with some great resources, and then I’ll move to a few things that have helped me in the past.

Troy Hunt

Troy is a master of preparation.  Check out the prep work he did for NDC.  I don’t remember if he linked it in either of these posts, but he will schedule tweets for certain parts of his talk because he knows exactly when he’ll get to that point.  Also check out his speaker anti-patterns and fix them.  Of his anti-patterns, I most hate the Reader.  Please don’t read your slides; your audience can read faster than you can speak, and they can read the slides later.  Tell engaging stories, tie things together, and give people something they can remember when they re-read those slides.  You’ll also enjoy his checklist of the 19 things he needs to do before a talk begins.

What I get from Troy:  Practice, practice, practice.  One thing I want to get better at is taping myself and forcing myself to listen to it.  There are verbal tics which you can only find when you listen to yourself.

Brent Ozar

Brent is a top marketer in the SQL Server world.  Here’s some great advice from him on delivering technical presentations.  He’s absolutely right about giving people a link at the end telling them where to get additional materials.  He has additional information on how to handle slides, working with the audience, and getting prepped for the session.  I think his music idea is interesting; at a couple all-day cons we’ve hosted locally, the guy who hosts tends to put on some early morning music before festivities begin.  It lightens up the mood considerably and works a lot better than a bunch of introverts staring quietly at their phones for 40 minutes until someone makes a move to begin.

What I get from Brent:  I’ve cribbed a few things from him.  All of my slide decks have two pieces of information at the end:  “To learn more, go here” and “And for help, contact me.”  The first bit points to a short-form URL I bought (http://CSmore.info), which redirects you to the longer-form address.  On that page, I include slides, a link to my GitHub repo, and additional links and helpful information.  The second bit on contacting includes my e-mail address and Twitter handle.

Paul Randal

Paul has a great Pluralsight course on communication.  This course isn’t just about presentations, but there are two modules, one on writing presentations and one on giving presentations.  I highly recommend this course.

What I get from Paul:  Practice, prepare, and don’t panic.  Compare against my previous review of this course.  Things will inevitably go wrong, so take spare cables and extras of anything you can fit.  At this point, I even carry a mini projector in case of emergency.  I’ve not needed to use it but there might come a day.

Julia Evans

Julia has a fantastic blog post on improving talks and conferences.  I definitely like her point about understanding your target audience.  Her argument in favor of lightning talks is interesting, and I think for beginners, it’s a good idea.  For more experienced speakers, however, 10 minutes is barely an introduction, and sometimes I want a deep dive.  Those have to be longer talks just by their nature.

Another great point she makes is to give hard talks:  aim for beginners to grasp part of it and for experts to learn from it as well.  I absolutely love Kyle Kingsbury’s work too; he helped me get a handle on distributed systems, but in a way that I could re-read his posts several months later and still pick out points I never got before.

What I get from Julia:  Find your motivation and make your talks work better for a broader range of people.  I have one talk in particular on the APPLY operator which has the goal of making sure that pretty much anybody, regardless of how long you’ve been in the field, learns something new.  There are a couple of examples which are easier for new users to understand and a couple of examples which are definitely more advanced but still straightforward enough that a new user can get there (even if it does take a little longer).  Ideally, I’d like all of my talks to be that way.

What I Recommend

Here are a few recommendations that I’ll throw out at you.  I’m going to try not to include too much overlap with the above links, as I really want you to read those posts and watch those videos.

  • Practice!  Practice until you know the first five minutes cold.  There are some presenters who will practice a talk multiple times a day for several days in a row.  I’m not one of those people, but if you’re capable of it, go for it.
  • Record yourself.  Find all of those placeholder words and beat them out of yourself.  I don’t mean just “Uh,” “Er,” “Um,” “Ah,” and so on.  In my case, I have a bad habit of starting sentences with “So…”  I’m working on eliminating that habit.  Recordings keep you honest.
  • Tell interesting stories.  To crib from one of my 18th century main men, Edmund Burke, “Example is the best school of mankind, and they will learn at no other.”  Theory ties your work together, and stories drive the audience.  Stories about failure and recovery from failure are particularly interesting; that’s one of the core tenets of drama.
  • Prep and practice your demos.  If you’re modifying anything (databases, settings, etc.) over the course of your demo, have a revert script at the end or revert your VM.  That way, you won’t forget about it at the end of a great talk, give the talk again later (after you’ve forgotten that you never rolled everything back), and have your demos fail because you forgot.  Not that this has happened to me…
  • Speaking of failure, prepare for failure.
    • Have extra cables.  I have all kinds of adapters for different types of projectors.  I have VGA, DVI (though I’ve only seen one or two projectors which required this), and HDMI adapters for both of my laptops in my bag at all times.
    • Prepare to be offline.  If your talk can be done offline, you should do it that way.  Internet connections at conferences are pretty crappy, and a lot of demo failures can be chalked up to flaky networks.  This means having your slides available locally, having your demo scripts available locally, etc.
    • Have your slides and demo scripts on a spare drive, USB stick, or something.  If all else fails, maybe you can borrow somebody else’s laptop for the talk.  I had to do this once.  It was embarrassing, but I got through it and actually got good scores.  The trick is to adapt, improvise, and overcome.  And you do that with preparation and practice.
    • Have your slides and demo scripts available online.  I know I mentioned assuming that your internet connection will flake out, but if your system flakes out and someone lends you a laptop but can’t accept USB sticks (maybe it’s a company laptop), at least you can grab the slides and code online.
    • If you do need an internet connection, have a MiFi or phone you can tether to your laptop, just in case.  If you have two or three redundant internet sources, the chances of them all failing are much lower than any single one failing.
    • Have a spare laptop if you can.  That’s hard and potentially expensive, but sometimes a computer just goes bye-bye an hour before your presentation.
  • Install updates on your VMs regularly.  Do it two nights before your presentation; that way, if an update nukes your system, you have a day to recover.  Also, it reduces the risk that Windows 10 will pick your presentation as the perfect time to install 700 updates.  Very helpful, that Windows 10 is.
  • When in doubt, draw it out.  I have embraced touchscreens on laptops, bought a nice stylus, and love drawing on the screen.  I think it helps the audience understand where you’re going better than using a laser pointer, and sometimes you don’t have a whiteboard.  If you don’t like touchscreens, ZoomIt still works with a mouse.
  • Speaking of which, learn how to use ZoomIt or some other magnification tool.  Even if you set your fonts bigger (which yes, you need to do), you will want to focus in on certain parts of text or deal with apps like SQL Server Management Studio which have fixed-size sections.

There are tomes of useful information on this topic, so a single blog post won’t have all of the answers, but hopefully this is a start.