TIL: Database Mirroring, TDE, And Encryption

I recently had to build database mirroring for a database with Transparent Data Encryption enabled. Ahmad Yaseen had a very nice walkthrough which covered most of what I needed to know. There were a couple things I needed to do to get everything working, so I figured it was worth a blog post.

Set Up Certificates

The first thing I had to do was set up a master key and certificate on my primary instance:

USE master;
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Some Master Key Password';
GO
CREATE CERTIFICATE TDECertificate WITH SUBJECT = 'Some TDE Certificate';
GO

Then we need to turn encryption on for the database, which is a two-step process:

USE [SomeDatabase]
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECertificate;
GO
ALTER DATABASE [SomeDatabase] SET ENCRYPTION ON;

At this point, TDE is on for the primary instance.
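
If you want to confirm that the background encryption scan has finished, a quick check against sys.dm_database_encryption_keys does the trick (an encryption_state of 3 means the database is fully encrypted):

USE master;
GO
--Check TDE status; percent_complete shows progress while the scan is still running.
SELECT
	DB_NAME(dek.database_id) AS DatabaseName,
	dek.encryption_state,
	dek.percent_complete
FROM sys.dm_database_encryption_keys dek;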

Backups

Now it’s time to take some backups. First, let’s back up the various keys and certificates:

USE [master]
GO
--Back up the service master key
--Note that the password here is the FILE password and not the KEY password!
BACKUP SERVICE MASTER KEY TO FILE = 'C:\Temp\ServiceMasterKey.key' ENCRYPTION BY PASSWORD = 'Service Master Key Password';
GO
--Back up the database master key
--Again, the password here is the FILE password and not the KEY password.
BACKUP MASTER KEY TO FILE = 'C:\Temp\DatabaseMasterKey.key' ENCRYPTION BY PASSWORD = 'Database Master Key Password';
GO
--Back up the TDE certificate we created.
--We could create a private key with password here as well.
BACKUP CERTIFICATE [TDECertificate] TO FILE = 'C:\Temp\TDECertificate.cert'
	WITH PRIVATE KEY (FILE = 'C:\Temp\TDECertificatePrivateKey.key', ENCRYPTION BY PASSWORD = 'Some Private Key Password');
GO

Then we want to take a database backup and log file backup. I’ll let you take care of that part.
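
In case it helps, here's a bare-bones sketch of that step, with file names chosen to match what I restore below:

BACKUP DATABASE [SomeDatabase] TO DISK = 'C:\Temp\SomeDatabase.bak' WITH INIT;
GO
BACKUP LOG [SomeDatabase] TO DISK = 'C:\Temp\SomeDatabase.trn' WITH INIT;
GO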

Now I want to get mirroring set up.

Mirror Certificates

On the mirror instance, let’s restore the various certificates. I’m assuming that this is a true mirroring instance and that you haven’t created any keys. I also moved the keys, certificates, and backups over to the mirroring instance’s C:\Temp folder.

USE [master]
GO
--Test restoration of the keys and certificate.
RESTORE SERVICE MASTER KEY FROM FILE = 'C:\Temp\ServiceMasterKey.key' DECRYPTION BY PASSWORD = 'Service Master Key Password';
GO
--For the master key, we need to use the file decryption password and then the original password used for key encryption.  Otherwise,
--your restoration attempt will fail.
RESTORE MASTER KEY FROM FILE = 'C:\Temp\DatabaseMasterKey.key'
	DECRYPTION BY PASSWORD = 'Database Master Key Password'
	ENCRYPTION BY PASSWORD = 'Some Master Key Password' FORCE;
GO
OPEN MASTER KEY DECRYPTION BY PASSWORD = 'Some Master Key Password';
GO
CREATE CERTIFICATE [TDECertificate] FROM FILE = 'C:\Temp\TDECertificate.cert'
WITH PRIVATE KEY (FILE = 'C:\Temp\TDECertificatePrivateKey.key', DECRYPTION BY PASSWORD = 'Some Private Key Password');
GO

I needed to use the FORCE directive when restoring the master key. Otherwise, this part went smoothly.

Database And Log Restoration

Before restoring the database files, I needed to open the master key.

OPEN MASTER KEY DECRYPTION BY PASSWORD = 'Some Master Key Password';
ALTER MASTER KEY ADD ENCRYPTION BY SERVICE MASTER KEY;
RESTORE DATABASE [SomeDatabase] FROM DISK = 'C:\Temp\SomeDatabase.bak' WITH NORECOVERY, REPLACE;
RESTORE LOG [SomeDatabase] FROM DISK = 'C:\Temp\SomeDatabase.trn' WITH NORECOVERY, REPLACE;

Now we have the mirror database in Restoring mode, and it’s using our service master key, so TDE is ready to work on this database as well.

From there, I was able to finish setting up mirroring. Note that I didn’t need to do anything special to the witness server–it doesn’t need any certificates or keys to do its job.
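
For reference, the tail end of that process is just the partner and witness assignments.  Here's a rough sketch with made-up endpoint addresses; it assumes you've already created mirroring endpoints on all three instances:

--On the mirror instance first:
ALTER DATABASE [SomeDatabase] SET PARTNER = 'TCP://PrincipalServer.mydomain.com:5022';
GO
--Then on the principal instance:
ALTER DATABASE [SomeDatabase] SET PARTNER = 'TCP://MirrorServer.mydomain.com:5022';
GO
--Optionally, also on the principal, point at the witness for automatic failover:
ALTER DATABASE [SomeDatabase] SET WITNESS = 'TCP://WitnessServer.mydomain.com:5022';
GO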

Additional Links

In addition to Ahmad’s post, I read through this post on the topic. I also read Don Castelino’s post on mirroring configuration failure scenarios. I ran into a couple of these errors, so it was a helpful article.

Finally, I want to leave you with this Simon McAuliffe post on the uselessness of Transparent Data Encryption. I’m not sure I’m 100% in agreement, but it’s a thought-provoking post.

Pluralsight Reviews: Riak Fundamentals

I recently wrapped up watching Riak Fundamentals, a course on Riak that Adron Hall put together in 2013.

I’d like to see more Riak content on Pluralsight, as this course wasn’t quite what I’d hoped.  I went into it hoping for a course which would give an intro to developing a simple Riak solution.  Instead, this course is a step or two removed from that.  It explains the concept of Riak (at least as it was in November of 2013) and some of the basic administrative commands and tools.

This wasn’t a bad course, but I’d really like to see something a bit more up to date, covering development and including concepts like effective use of CRDTs.

Incidentally, I did learn from this course that Riak was developed using Erlang, which is going to be helpful in a presentation I’m giving in a couple of weeks, so that’s cool.

TIL: Slack + SQL Server

No, I don’t mean the SQL Server community on Slack—although you should definitely join up there.

I have a client whose primary method of communication is Slack.  They have e-mail accounts, but hardly use them.  They also have no real monitoring solution in place, so I wanted to be able to do some quick checks and, in the event of an error, push a message to a Slack channel that they could monitor.

It turns out that there is already a Github repo for a CLR function to push a message out to Slack.  The folks at PowerupCloud have a nice step-by-step walkthrough on how they did it.  I wish they had continued the series on integrating further, particularly around SQL Agent alerts.  If you’re a paying Slack customer, you can integrate e-mail with Slack.  You could also throw Zapier into the mix and possibly get a free integration point.

If you want to avoid the external integration process, you can read the default trace or system health XE (about which I learned a lot from Ed Leighton-Dick’s presentation on the topic) and write calls to Slack based on results there.
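
As a starting point for the default trace route, here's a minimal sketch which pulls recent error-related events out of the default trace; the idea would be to run something like this from an Agent job and push any rows it returns out to Slack:

--Find the default trace file and pull back recent error-related events.
DECLARE @TracePath NVARCHAR(260);
SELECT @TracePath = t.path FROM sys.traces t WHERE t.is_default = 1;

SELECT
	te.name AS EventName,
	ft.StartTime,
	ft.DatabaseName,
	ft.HostName,
	ft.ApplicationName
FROM sys.fn_trace_gettable(@TracePath, DEFAULT) ft
	INNER JOIN sys.trace_events te
		ON ft.EventClass = te.trace_event_id
WHERE
	te.name LIKE '%Error%'
	AND ft.StartTime > DATEADD(MINUTE, -10, GETDATE());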

The Marginal DBA

You Have A Performance Problem; What Do You Do?

Brent Ozar has a blog post about what hardware you can purchase for the price of two cores of Enterprise Edition, arguing that you should probably spend some more money on hardware.  Gianluca Sartori has a blog post along similar lines.  By contrast, Andre Kamman had an entire SQLBits talk about not throwing hardware at the problem.  And finally, Jen McCown argues that you should spend the money on Enterprise Edition.  When you have a performance problem, who’s right?

The answer is, all of them.  And to head off any comments, my first paragraph is mostly a setup; ask any of these authors and I’m sure they’ll tell you that “it depends” is the right answer.  The purpose of this post is to dig a bit deeper and discuss when each of these points is the correct answer, looking at things from an economist’s perspective.

Detour Into Basic Economics

Before I get too deep into application of economic principles to the world of database administration, let’s cover these principles really quick so we’re all on the same page.

Scarcity

Your data center is only so big.  Your rack can only hold so many servers.  Your credit card has a limit, and so does your bank account.  At some point, your budget runs out.  And even if your budget is phenomenal, there are technological limits to how big a server can be, or even how big a cluster of servers can be.  No matter how much cash, how many top-tier experts, and how much time you have available, there is always a limit.

I’ve personally never had to worry about hitting the 128 TB RAM limit in Windows Server 2012 R2.  Even if you had a server capable of using 128 TB of RAM, I’m sure it’d be so expensive that there’s no way I’d ever be allowed near it.

This, in short, is scarcity:  you will always run out of resources before you run out of problems.

Opportunity Cost

In a world of scarcity, we can figure out the true cost of something:  what is the next-best alternative to this?  For example, let’s say we have the option to buy $100K worth of hardware.  What other alternatives exist?  We could bring in a consultant and pay that person $100K to tune queries.  We could send the current staff out for $100K worth of training.  We could hire a new person at $100K (although this is a stream of income rather than a one-time payment, so it’s a bit trickier of an answer).  Or we could buy $100K worth of software to help solve our problem.  Dollar values here simply help focus the mind; even without currencies, we can still understand the basic concept:  the actual cost of a good or service is what you forego when you decide to make the trade to obtain that good or service.

Understanding opportunity cost is critical to making good decisions.  It also helps lead into the third major topic today.

Marginal Utility

One of the toughest problems of pre-Marginal Revolution economics was the diamond paradox.  The short version of the diamond paradox is as follows:  water is more valuable than diamonds, in the sense that a person can live with water but no diamonds, but not vice versa.  So why is water so inexpensive, yet diamonds are so expensive?

The answer is in marginal utility.  The idea behind marginal utility is that your valuation of a thing changes as you obtain more of that thing.  The first glass of water for a thirsty man in the desert is worth quite a bit; the 900th glass of water, not so much.  In hardware terms, going from 8 GB of RAM to 16 GB of RAM is worth more to you than going from 192 GB to 200 GB.  The 8-16 jump allows you to do quite a bit that you might not have been able to do before; the jump from 192 to 200, even though it is the same total RAM difference, opens many fewer doors.

Detour’s Over; Back To The Main Point

Armed with these principles of economics, let’s dive into the problem.  To keep things simple, I’m going to think about three competing areas for our IT budgets:

  1. Purchasing newer, better, or more hardware
  2. Paying people to write better code or tune application performance
  3. Purchasing Enterprise Edition licenses rather than Standard Edition (or APS appliances rather than Enterprise Edition)

Applying the scarcity principle first, we know we can’t afford all of these things.  If your company can afford all of the items on the list, then this whole topic’s moot.  But honestly, I’ve never seen a company whose budget was so unlimited that they could keep hiring more and more people to fix code while buying more and more hardware and getting more and more expensive software.  At some point, you hit a limit and need to make the hard decision.

Here’s a simplistic interpretation of those limits:

LinearConstraints1.png

In this example, I have a set of hardware, queries, and SQL Server versions that I allocate to OLTP processing and OLAP processing.  I have three constraints (which I’ve made linear for the sake of simplicity):  hardware, application code, and SQL Server versions.  The idea behind this two-dimensional linear constraint picture is that you are able to pick any point on the X-Y plane which is less than or equal to ALL constraints.  In other words, you can pick any spot in the purple-lined section of the image.

Taking this scenario, the application is tuned reasonably well and the version of SQL Server isn’t going to help us much; we simply don’t have powerful enough hardware.  This might be something like trying to run a production SQL Server instance on 8 GB of RAM, or running on 5400 RPM hard drives.  Even if we hire people to tune queries like mad and push the red boundary out further, it doesn’t matter:  we will still have the same bottlenecks.

By contrast, in this case, if we purchase new hardware, we can shift the purple curve out.  Let’s say we get some fancy new hardware which solves the bottleneck.  Now our constraint problem might look something like the following:

LinearConstraints2.png

We still need to pick a spot somewhere in the purple-lined section, but notice that our constraint is no longer hardware.  In fact, we have two potential constraints:  version limitations and application limitations.  The answer to “what do we need to do?” just got a bit more difficult.  If we are in a situation in which we lean heavily toward OLTP activity, our next bottleneck is the application:  now it’s time to rewrite queries to perform a bit better.  By contrast, if we’re in a big warehouse environment, we will want to upgrade our SQL Server edition to take advantage of OLAP features we can’t get in Standard Edition (e.g., clustered columnstore indexes or partitioning).

Can We Make This Practical?

In the real world, you aren’t going to create a linear programming problem with defined constraints…probably.  Nevertheless, the important part of the above section is the principle:  solve your tightest constraint.

The Easy Stuff

Here’s a bit of low-hanging fruit:

  1. If you’re running production SQL Server instances with less than 32 GB of RAM, you almost certainly have a hardware constraint.  Buy that RAM.  Going back to Brent’s post, you can buy a brand new server for $7K, but if you have a server that can hold at least 128 GB of RAM, it looks like you can buy that RAM for about a grand (and I’m guessing you can probably get it for less; that’s retail price).
  2. If you haven’t done any query tuning, you can probably find two or three costly queries and tune them easily.  They may not be the biggest, nastiest queries, but they help you reduce server load.  Similarly, find two or three missing indexes and add them (there’s a quick sketch of that just after this list).
  3. If you query relatively small sets of static data, put up a Redis server and cache those results.  A client of mine did that and went from about 85 million SQL Server queries per day to 60,000.  Even if your queries are perfectly optimized, that’s an incredible difference.
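
As promised in the second item, here's a rough sketch against the missing index DMVs.  Treat the output as a list of candidates to evaluate rather than indexes to create blindly:

--Top candidate missing indexes by a rough estimate of impact.
SELECT TOP(10)
	mid.statement AS TableName,
	mid.equality_columns,
	mid.inequality_columns,
	mid.included_columns,
	migs.user_seeks,
	migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS EstimatedImpact
FROM sys.dm_db_missing_index_details mid
	INNER JOIN sys.dm_db_missing_index_groups mig
		ON mid.index_handle = mig.index_handle
	INNER JOIN sys.dm_db_missing_index_group_stats migs
		ON mig.index_group_handle = migs.group_handle
ORDER BY
	EstimatedImpact DESC;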

The nice thing about low-hanging fruit is that the opportunity cost is extremely low.  Spend a grand for RAM, 3-5 hours tuning queries, and a few hundred dollars a month on Premium Redis caching.  For a company of any reasonable size, those are rounding errors.

The Hard Stuff

Once you get beyond the low-hanging fruit, you start to see scarcity creeping back in.  You’ve got reasonable hardware specs, but maybe that Pure Storage array will help out…or maybe you need to bring in top-notch consultants to optimize queries in a way you could never do…or maybe it’s time to buy Enterprise Edition and take advantage of those sweet, sweet features like In-Memory OLTP.

Finding the right answer here is a lot harder of a problem.  One place to start looking is wait stats.  SQL Server keeps track of reasons why queries are waiting, be they hardware-related (disk, IO, CPU), scheduling related, network bandwidth related, or even because your application can’t handle how fast SQL Server’s pushing data across to it.  There’s a bit of interpretation involved with understanding wait stats, but Paul Randal is putting together the definitive resource for understanding wait stats.
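
If you've never looked at them, a bare-bones wait stats query looks something like the following.  In practice you'd filter out a long list of benign wait types, which I'm skipping here for brevity:

--Top waits since the last restart (or since wait stats were last cleared).
SELECT TOP(10)
	ws.wait_type,
	ws.waiting_tasks_count,
	ws.wait_time_ms / 1000.0 AS wait_time_seconds,
	ws.signal_wait_time_ms / 1000.0 AS signal_wait_seconds
FROM sys.dm_os_wait_stats ws
ORDER BY
	ws.wait_time_ms DESC;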

The rest of my advice comes down to experience and testing.  If you’re in an environment in which you can run tests on hardware before purchase, that’s an easy way to tell if hardware is your bottleneck.  One scenario I had a few years back involved a case in which we got to play with a solid state disk array back when those were brand new.  We ran our production SQL Server instances on them for a week or so and saw almost no performance gain.  The reason was that our servers had 8 GB of RAM and 1-2 VCPUs, so faster disk simply exacerbated CPU and memory issues.  By contrast, at my current company, we had a chance to play with some newer servers with more powerful CPUs than what we have in production and saw a nice performance boost because we’re CPU-limited.

The nice thing is that my graphics above aren’t quite accurate in one big sense:  application tweaks and hardware work in parallel, meaning that buying new hardware can also push out the application tweak curve a bit and vice versa.  This means that, in many cases, it’s better to find the cheaper answer, be that hardware or sinking hours into application development and tuning.

Emphasizing Sunk Costs

There’s one last thing to discuss here:  the concept of sunk costs.  If you want a long-winded, rambling discussion of sunk costs, here’s one I wrote while in grad school.  For the short version, “sunk costs are sunk.”

Why is it that a company is willing to pay you $X a year to tune queries but isn’t willing to pay 0.1 * $X to get hardware that would end up solving the constraint issue above so much more easily?  The answer is that, from an accounting standpoint, the new hardware is a new cost, whereas your salary is already factored into their budget.  From the standpoint of the in-house accountant, your salary is a fixed cost (which they have to pay regardless of whether you’re spending 80 hours a week tuning queries like a boss or 5 hours a week of work & 35 hours of Facebook + Pokemon Go…at least until they fire you…) and from the accountant’s perspective, your effort doesn’t get shifted around; you’re still a DBA or database developer or whatever it is you do.  So your salary is a sunk cost.

I should mention a huge, huge point here:  salaries are sunk costs in the short term.  In the long term, salaries are neither fixed nor sunk costs.  In the long term, your manager can re-assign you to a different department (meaning that your salary is not sunk), and your salary can change or even go away if it makes sense from the company’s perspective (meaning that salary is a variable cost rather than a fixed cost).  But for our purposes, we’re interested in the short-term ramifications here, as “long-term” could mean a year or even several years.

Also, there is one big exception to the rule that your salary is a sunk cost:  if you get paid hourly, your wages become a variable cost, and it’s a lot easier to sell the company on trading variable costs for fixed costs:  “If you buy this hardware for $X, you don’t have to pay me 3 * $X to tune queries.”  That’s an actual cost reduction, whereas in the salary case, you’re still arguing for a net increase in company cost incurred.

So what can you do in that case?  Selling the company on “reduced cost” doesn’t cut it so much for salaried employees because you generally aren’t really reducing costs from the company’s standpoint.  Instead, you have to sell on opportunity cost:  if you weren’t spending 15 hours a week trying to tune queries to get this ETL process to work adequately, you could focus on Project X, which could net the company $Y in revenue (where $Y is an expected revenue return conditional upon the probability of success of Project X).  If $Y is substantially higher than the cost of hardware, you now have a solid business case you can take to higher-ups to get newer, better hardware.

Similarly, if you’re spending all of your time on application development and the company’s throwing barrels of cash at vendors for new hardware, you could make the case that getting some time to tune queries might allow the company to save money on net by deferring the cost of hardware purchases.  This is an easier case to make because, again, your salary is a sunk cost (unless you’re a wage-earning employee), so the opportunity cost comes down to what the next-best alternative is with your time.

Conclusion

For the most part, I can’t tell you whether you’re better off buying more hardware, tuning those queries, upgrading to Enterprise Edition, or doing something entirely different (at least unless you’re willing to shovel cash in my direction and have me take a look…).  What I hoped to do in this blog post was to give you some basic economics tools, letting you apply them to your own situation.  With these concepts in place, they’ll give you ammunition when going to management to ease your biggest constraint.  Managers and accountants are going to be a bit more amenable to arguments around opportunity cost, and your pleas are less likely to fall on deaf ears when you realize that sunk costs shouldn’t affect future behavior.

SQL Saturday Distances, A Closer Look

Last week, I put together an analysis of SQL Saturdays.  There were a lot of great comments, including one by Jack Corbett, who was interested in seeing a closer analysis of how far attendees tend to travel:

Most people from Maine aren’t going to anything outside of New England, and most aren’t going to go south of Boston. Seems to limit opportunities for attendees. Instead of analyzing the distance between locations, let’s look at how far non-speakers are traveling to go to events and limit based on that.

Fortunately(?), I happen to have just the data set.  PASS has made certain Summit and SQL Saturday data available as part of the SQL Server Launch 2016 Discovery Days.  Now, let me warn you:  this is going to be a long blog post.  But hey, it’s Friday, so cozy on up with your beverage of choice and let’s get data scienceing.

Dirty Data Done Dirt Cheap

Airing of Grievances

Before I begin, allow me to perform the data science Airing of Grievances.  This is an important part of data analysis which most people gloss over, instead jumping right into the “clean up the dirty data” phase.  But no, let’s revel in its filth for just a few moments.

  • There are some strange data type choices here.  For example, UserGUID (which is the deidentifier for individuals) is a uniqueidentifier on PASSMembersWithGeolocation, but is a different type in SQLSatRegistrations.  There are some other data type choices which aren’t ideal (for example, the SQL Saturday number is a float?!), but that’s not a huge deal.
  • What is a huge deal is that some of the postal codes are wrong.  If you look at New Hampshire addresses, the postal codes have leading 0s lopped off.  That came back to burn me in my first analysis, requiring me to re-run my geolocation function.
  • There is no breakout by event; we only see a denormalized junction of SQL Saturday events and attendees.  That can make calculations a bit harder, so I’d like to normalize this a bit.
  • The second-biggest issue is that the data set is, in my opinion, incomplete.  There are a few reasons why:
    • We only have data for people who registered, not for people who attended.  There might be some pattern in people who registered but did not attend, but we won’t be able to see that.  There are also going to be people who attended but did not register; these day-of attendees aren’t going to be in the system, but we let them in anyhow if we have room.  I think those attendees would skew toward “local” rather than “distant” drive times, meaning that you’re not going to travel 500 miles to sneak into an event for which you didn’t register…usually.
    • There is no separation of speakers, sponsors, and “normal” attendees.  Speakers and sponsors travel across the country to visit these things (we’re weird that way), but the big question is, how far do non-speaking, non-sponsoring attendees normally travel?  I have no way of knowing for sure who is which, so I’ll try to answer that in the data cleansing section.
    • This analysis is only through about April of 2016.  There have been several events since, which could affect our analysis, but I don’t have that data.
    • There are lots of missing fields.  For example, there are 9 American SQL Saturdays which had no postal code set.  There are also a lot of attendees with missing data—they didn’t fill in the city or the postal code or the region (state).  There are even attendees with zero information filled out, and those are just going to be useless for our examination.
  • This leads to my biggest issue:  nasty data.
    • There are no controls on data entry.  If you want to set city to “Nope, not gonna tell” then it’ll save that.
    • Furthermore, all of the entries are free-form.  There are no checks to ensure that a postal code is valid, that a city exists, etc.
    • In short, there are a lot of misunderstandings, miscodings, mistrust, and misanthropy in this data set.

Limitations of Analysis

Given the issues in this data set, I have to make a couple tough calls with the data.  First, I’m only going to look at US attendees of US SQL Saturdays.  I wanted to look at US and Canada SQL Saturdays like I did in my last analysis, but I had a lot of geocoding problems with Canadian data, so I had to throw that away.  Also, I needed to throw away a lot of attendee data because I simply didn’t have enough to get a valid latitude and longitude.  So we’ll have to go ahead knowing that our analysis is incomplete and any conclusions we draw will be uncertain—but that’s the nature of statistical analysis, so we’ll trudge on nonetheless.

Doing Data Science

Initial Preparation

Grab the PassDataIndividualTables data set.  We’re going to create a database called PASSData and load the tables which are not named “_xlnm_FilterDatabase.”  Once I had that loaded, I did some basic data cleansing work.  I made UserGUID a uniqueidentifier across the board and changed SatNum to an INT instead of a FLOAT.  Then, I built a couple of tables:  dbo.SQLSaturday and dbo.SQLSaturdayAttendee. First, dbo.SQLSaturday:

CREATE TABLE dbo.SQLSaturday
(
	EventNumber INT NOT NULL,
	EventDate DATE NULL,
	EventName NVARCHAR(50) NOT NULL,
	StreetAddress NVARCHAR(100) NOT NULL,
	City NVARCHAR(100) NOT NULL,
	Country NVARCHAR(100) NOT NULL,
	PostalCode VARCHAR(14) NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturday] PRIMARY KEY CLUSTERED(EventNumber)
);

INSERT INTO dbo.SQLSaturday
(
	EventNumber,
	EventDate,
	EventName,
	StreetAddress,
	City,
	Country,
	PostalCode
)
SELECT DISTINCT
	s.SatNum,
	ss.EventDate,
	s.EventName,
	CASE
		WHEN s.StreetAddress IS NULL AND s.VenueName = 'N/A' THEN ''
		WHEN s.StreetAddress IS NULL AND s.VenueName <> 'N/A' THEN s.VenueName
		ELSE s.StreetAddress
	END,
	s.City,
	s.Country,
	s.PostalCode
FROM dbo.SQLSatRegistrations s
	OUTER APPLY
	(
		SELECT TOP(1)
			EventDate
		FROM dbo.SQLSatSessions sess
		WHERE
			sess.SatNum = s.SatNum
	) ss
WHERE
	s.Country IN ('United States');

--Tacoma 2009 and Jacksonville 2008 (#20) did not have dates set.  Set them manually.
UPDATE dbo.SQLSaturday SET EventDate = '2008-09-12' WHERE EventNumber = 20 AND EventDate IS NULL;
UPDATE dbo.SQLSaturday SET EventDate = '2009-04-01' WHERE EventNumber = 11 AND EventDate IS NULL;

I added Latitude and Longitude so that I could geocode data, and then a Geography data type so that I can calculate distances. Notice that there were two SQL Saturdays which did not have dates. I’m pretty sure that Tacoma (event #11) never took place, but I updated it and Jacksonville’s second 2008 event regardless, just to make sure everything was accurate.
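
As a quick aside on why the GeoPoint column is handy:  STDistance on the geography type returns meters, so dividing by 1609.344 converts to miles.  Here's a tiny example using rough coordinates for Raleigh and Charlotte:

--Straight-line distance between two points, in miles.
DECLARE @Raleigh GEOGRAPHY = GEOGRAPHY::Point(35.7796, -78.6382, 4326);
DECLARE @Charlotte GEOGRAPHY = GEOGRAPHY::Point(35.2271, -80.8431, 4326);
SELECT @Raleigh.STDistance(@Charlotte) / 1609.344 AS DistanceInMiles;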

Next up, I want to create a table for attendees.

CREATE TABLE dbo.SQLSaturdayAttendee
(
	UserGUID UNIQUEIDENTIFIER NOT NULL,
	City NVARCHAR(50) NULL,
	Region NVARCHAR(50) NULL,
	Country NVARCHAR(50) NULL,
	PostalCode NVARCHAR(50) NULL,
	NumberAttended INT NOT NULL,
	Latitude DECIMAL(8,4) NULL,
	Longitude DECIMAL(8,4) NULL,
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326),
	CONSTRAINT [PK_SQLSaturdayAttendee] PRIMARY KEY CLUSTERED(UserGUID)
);

INSERT INTO dbo.SQLSaturdayAttendee
(
	UserGUID,
	City,
	Region,
	Country,
	PostalCode,
	NumberAttended
)
SELECT
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode,
	COUNT(1)
FROM dbo.SQLSatRegistrations r 
	INNER JOIN dbo.PASSMembersWithGeolocation p 
		ON r.UserGUID = p.UserGUID 
WHERE
	r.Country in ('United States')
	AND
	(
		(p.city IS NOT NULL AND p.region IS NOT NULL)
		OR (p.postalcode IS NOT NULL)
	)
	AND ISNULL(p.Country, 'United States') = 'United States'
GROUP BY
	p.UserGUID,
	p.City,
	p.Region,
	p.Country,
	p.PostalCode;

--Fix cases when postal codes had leading 0s lopped off.
UPDATE dbo.SQLSaturdayAttendee
SET
	PostalCode = RIGHT('00000' + PostalCode, 5)
WHERE
	LEN(PostalCode) < 5;

I have the same Latitude, Longitude, and GeoPoint attributes, and we’ll make use of those later. I also put on a NumberAttended attribute, which sums up the number of times a person has attended a SQL Saturday. I’m limiting the attendee set to those who attended events in the United States (r.Country) and who have either City+Region or PostalCode set. Then, I’m going to fix postal codes, adding leading 0s to ZIP Codes missing them.

Quick Analysis

We’re far enough along that I can do some quick data set analysis and see what we’ve got.

--Basic data
SELECT COUNT(1) FROM dbo.SQLSatRegistrations;
SELECT COUNT(1) FROM dbo.SQLSatRegistrations WHERE Country IN ('United States');
SELECT COUNT(1) FROM dbo.SQLSatRegistrations r INNER JOIN dbo.SQLSaturdayAttendee a ON r.UserGUID = a.UserGUID;
SELECT COUNT(1) FROM dbo.SQLSaturdayAttendee;

BasicCounts

There are 152,000 registrants in the data set, of which 101,164 registered for United States events.  Our data set contains 33,419 distinct attendees, who make up 65,682 of the 100K United States event registrants.

Geocoding Time!

I’m going to do three separate rounds of geocoding.  First, let’s geocode the SQL Saturday events.  There are only about 300 of them, so I can use the same patterns that I used in my previous post.  So let’s move over to some R code for a bit.

install.packages('RDSTK', repos = "http://cran.us.r-project.org")
install.packages("ggmap")
install.packages("RODBC")
install.packages("dplyr")
install.packages("httr")
install.packages("rjson")
install.packages("lubridate")
library(RDSTK)
library(ggmap)
library(RODBC)
library(dplyr)
library(lubridate)

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=PASSData;Provider=SQLNCLI11.1;Integrated Security=SSPI")
sqlsats <- sqlQuery(conn, "SELECT EventNumber, StreetAddress, City, Country, ISNULL(PostalCode, '') AS PostalCode from PASSData.dbo.SQLSaturday")
sqlsats$GeoLookup <- paste(sqlsats$StreetAddress, sqlsats$City, sqlsats$Country, sqlsats$PostalCode, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is ~300 observations, so we can do that without issue.
sqlsats <- cbind(sqlsats, t(sapply(sqlsats$GeoLookup, geocode, USE.NAMES=FALSE)))

#A few records failed to get lat/long
#I will handle them as one-off updates in SQL.
sqlsats
sqlsats.good <- filter(sqlsats, !is.na(lon))
sqlsats.good
# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 300 rows, it'll do.
cmd <- paste("UPDATE PASSData.dbo.SQLSaturday SET Latitude = ",sqlsats.good$lat,", Longitude = ",sqlsats.good$lon," WHERE EventNumber = ", sqlsats.good$EventNumber, sep="", collapse=";")
result <- sqlQuery(conn, cmd, as.is=TRUE)

The first dozen lines or so are loading lots of packages. I’m not going to use all of them for this first section, but we’ll need them later. Then, we make a connection to SQL Server and grab data from the SQLSaturday table. I create a GeoLookup column to make lookups a bit easier.

The next line, where we do the geocoding, is a really cool thing you can do in R. Basically, I want to take my sqlsats data frame and append latitude and longitude to it. That’s what cbind (column bind) gives us. To understand the t(sapply(…, geocode, …)) bit, you have to work from the inside out. The geocode function is what we use to get latitude and longitude from Google Maps given an address. We want to apply that function to every observation in the sqlsats data frame (in T-SQL parlance, every row in the table), so we use the sapply variant of the apply function, and use the t() function to transpose the results so we can append them to our sqlsats data frame.

From there, I filter out any SQL Saturdays missing longitude and update my SQLSaturday table with the longitudes and latitudes we grabbed.

As mentioned, there are some SQL Saturdays missing latitude and longitude, so I had to do lookups manually. There were also a couple of SQL Saturdays which were miscoded because they didn’t have states listed.

--Manual cleanup
UPDATE dbo.SQLSaturday SET Latitude = 30.2782, Longitude = -81.511850 WHERE StreetAddress = '1 UNF Drive';
UPDATE dbo.SQLSaturday SET Latitude = 41.6618, Longitude = -91.5365 WHERE StreetAddress = '101 Jessup Hall';
UPDATE dbo.SQLSaturday SET Latitude = 45.5026, Longitude = -73.5735 WHERE StreetAddress = '2000 McGill College, 5e étage';
--Columbia was marked as W. Columbia in New York, rather than Columbia South Carolina
UPDATE dbo.SQLSaturday SET Latitude = 33.9518, Longitude = -81.1157 WHERE City = 'West Columbia';
--Richmond was miscoded as well for 2010.
UPDATE dbo.SQLSaturday SET Latitude = 37.6488, Longitude = -78.7015 WHERE EventNumber = 30;

Next up, I want to geocode the attendee data. I’m going to have to do this a different way for a couple of reasons. First, we’re going to have way more than 1000 attendees, so I can’t use Google Maps. Instead, I’m going to use the Data Science Toolkit to perform free geocoding. It’s slower than the Google API, but at least I don’t have to pay…

I’m going to get two subsets of attendees: those who put in a valid ZIP Code (and I can throw out everything else because typically, ZIP Codes are more distinctive than cities), and those who put in valid-ish city + region combos.

--Postal codes
SELECT DISTINCT 
	a.Country,
	a.PostalCode
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NOT NULL;

--Approximately valid cities
SELECT DISTINCT
	a.City,
	a.Region,
	a.Country
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode IS NULL
	AND LEN(City) > 2
	AND LEN(Region) > 1
	AND City NOT LIKE '%[0-9]%'				--Some people put in street addresses
	AND Region NOT LIKE '%[0-9]%'			--Some people put in strange values
	--Some people are very regional
	AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A')
	AND City NOT LIKE '%@%'					--Some people put in e-mail addresses
	AND City <> 'N/A'						--Some people don't like listing city
	AND City <> 'NA'
ORDER BY
	a.City,
	a.Region;

There are 6544 distinct postal codes in the data set, and 2406 distinct city-region pairings, adding up to 8950 lookups we’ll need to do. These 8950 lookups will apply to upwards of 33,273 attendees (which you can get by removing the DISTINCT clauses). At this point, I want to mention that we’re down to about 30K attendees from our original 93K data set (that is, there were 93,838 unique UserGUIDs in dbo.SQLSatRegistrations). If I were trying to do predictive analysis, I’d start to get really concerned here because we’re throwing away 2/3 of the attendee set. But let’s soldier on, because 30K attendees should still at least give us some idea.

Geocoding this time, as I mentioned, will be a bit trickier. Let’s get to the R:

# From http://stackoverflow.com/questions/22887833/r-how-to-geocode-a-simple-address-using-data-science-toolbox
# My own modifications include the if-else block for lookup failures
geo.dsk <- function(addr){ # single address geocode with data sciences toolkit
  require(httr)
  require(rjson)
  url      <- "http://www.datasciencetoolkit.org/maps/api/geocode/json"
  response <- GET(url,query=list(sensor="FALSE",address=addr))
  json <- fromJSON(content(response,type="text"))
  if(length(json['results'][[1]]) > 0) {
    loc  <- json['results'][[1]][[1]]$geometry$location
    return(c(address=addr,long=loc$lng, lat= loc$lat))
  } else {
    return(c(address='',long=0,lat=0))
  }
}

zips <- sqlQuery(conn, "SELECT DISTINCT a.Country, a.PostalCode FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NOT NULL;")
zips$CheckString = paste(zips$PostalCode, zips$Country, sep = " ")
zips$lat <- 0
zips$long <- 0

result <- do.call(rbind,lapply(as.character(zips$CheckString),geo.dsk))
zips <- data.frame(result)
zips
head(zips)

sqlSave(conn, zips, tablename="ZipStage")

cities <- sqlQuery(conn, "SELECT DISTINCT a.City, a.Region, ISNULL(a.Country, '') AS Country FROM PASSData.dbo.SQLSaturdayAttendee a WHERE a.PostalCode IS NULL AND LEN(City) > 2 AND LEN(Region) > 1 AND City NOT LIKE '%[0-9]%' AND Region NOT LIKE '%[0-9]%' AND Region NOT IN ('Midwest', 'North', 'South', 'North America', 'mid-atlantic', 'East', 'IN', 'this one', 'Mountain West', 'NA', 'N/A') AND City NOT LIKE '%@%' AND City <> 'N/A' AND City <> 'NA' ORDER BY a.City, a.Region;")
cities$CheckString = paste(cities$City, cities$Region, cities$Country, sep = " ")
cities$lat <- 0
cities$long <- 0

result <- do.call(rbind,lapply(as.character(cities$CheckString),geo.dsk))
cities <- data.frame(result)
cities

sqlSave(conn, cities, tablename="CitiesStage")

I copy-pasted the geo.dsk function from Stack Overflow and added in an if-else statement to check to see if json[‘results’] has values. Basically, we’re hitting the Data Science Toolkit’s geocoding API endpoint with the partial address (postal code or city + state) we have, and then we’re shredding the returned JSON to get latitude and longitude. If the lookup failed for whatever reason, I return a latitude and longitude of 0.

The first data set I pull out is ZIP Codes. I first create a data frame from my SQL query, add a CheckString variable to paste together the address components, and then call geo.dsk to perform lookups for each CheckString in zips. The results get converted into a data frame, which becomes my new zips, and I’ll use RODBC’s sqlSave function to save the data in a staging table that I’ll call ZipStage. We do the same thing for cities.

Now that I have latitudes and longitudes, I’m going to import them from my staging table into the SQLSaturdayAttendee table. Note that I didn’t paste together a massive update statement for the 9000-ish results; instead, I put them in a staging table, as that’s going to be a lot more efficient than doing 9000 update statements.

UPDATE a
SET
	Latitude = z.lat,
	Longitude = z.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.ZIPStage z
		ON CONCAT(a.PostalCode, ' ', a.Country) = z.address

UPDATE a
SET
	Latitude = c.lat,
	Longitude = c.long
FROM dbo.SQLSaturdayAttendee a
	INNER JOIN master.dbo.CitiesStage c
		ON CONCAT(a.City, ' ', a.Region, ' ', a.Country) = c.address
		AND a.PostalCode IS NULL

--Manual cleanup of Columbus, Ohio, which was geocoded to somewhere around Berea
UPDATE dbo.SQLSaturdayAttendee SET Latitude = 39.9612, Longitude = -82.9988 WHERE City = 'Columbus' AND Region IN ('Ohio', 'OH') AND PostalCode IS NULL;

One interesting thing I found was that the geocoding service put Columbus, Ohio up near Cleveland, which skewed Columbus results significantly. I’m not sure why it did that, but this leads me to believe that there are probably a few other geocoding errors that I didn’t catch. I don’t think they’re going to radically alter the results, however, so I didn’t spend any more time looking for them.
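
If you want a cheap sanity check for the most egregious failures, one rough sketch is to flag any attendee whose coordinates land outside a loose bounding box around the US.  It won't catch a Columbus-near-Cleveland kind of miss, but it does catch lookups that went completely sideways:

--Loose bounding box around the US (including Alaska and Hawaii); anything outside it is suspect.
SELECT
	a.City,
	a.Region,
	a.PostalCode,
	a.Latitude,
	a.Longitude
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.Latitude IS NOT NULL
	AND
	(
		a.Latitude NOT BETWEEN 18.0 AND 72.0
		OR a.Longitude NOT BETWEEN -180.0 AND -65.0
	);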

Who Speaks?

One consideration I had with this data set is that I didn’t want to include speakers and sponsors in the attendee set.  Ideally, I’d keep them separated because their goals and motivations are different.  For this data set, I’m really interested in seeing behavior patterns of non-speaker, non-sponsor attendees.  But how do I figure out who’s an attendee and who’s a speaker when the data set is completely anonymized and there aren’t any flags for speakers or sponsors of events?

Twin Peaks

My expectation is that speaker/sponsor behavior will differ from “normal” attendee behavior.  Here’s a quick data sample to show my conjecture in action.  I’m going to query everybody in ZIP Code 27703 (which is where I live).  See if you can figure out which one of these is me…

SELECT
	*
FROM dbo.SQLSaturdayAttendee a
WHERE
	a.PostalCode = '27703'
ORDER BY
	a.NumberAttended DESC;

27703Distribution

If I try to generalize this phenomenon, I’ll look for how many people attended (or at least registered for) a particular number of SQL Saturday events.  Ideally, what I’d find is a lot of people with 1, 2, or 3 events; then a dip; then an increasing number of people at some point; and finally another dip.

I’m going to use a quick T-SQL query to get the raw data, and then analyze it in R because R is perfect for this type of quick analysis.

#Figure out the cutoff for likely speakers.
cmd <- "SELECT NumberAttended, COUNT(1) AS People FROM PASSData.dbo.SQLSaturdayAttendee GROUP BY NumberAttended ORDER BY NumberAttended;"
attendee.breakdown <- sqlQuery(conn, cmd)
#A vast majority of entrants have been to one SQL Saturday, so the raw counts are hard to read.
barplot(xlab="Number Of SQL Saturdays Attended", attendee.breakdown$People, names=attendee.breakdown$NumberAttended)
#Take the log of the distribution to see the pattern a bit better.  What we see is an exponential
#decay pattern.  This is unfortunate for us because we were hoping to find a two-peaked distribution.
barplot(xlab="Number Of SQL Saturdays Attended", log(attendee.breakdown$People), names=attendee.breakdown$NumberAttended)

We’re not looking for pretty; we’re looking for fast. If I wanted pretty, I’d pull down ggplot2 and spend a while sprucing up the graphics. Anyhow, here’s what you get when you run the first barplot:

AttendedCount

That’s really hard to read because such a high percentage of attendees have been to just one SQL Saturday. So next up, I’m going to take the log of number of attendees. This will give us a better idea of the underlying distribution of data.

AttendedCountLog

I believe the appropriate response at this point is, “Well, shoot.”  We see a fairly smooth exponential decrease as number of SQL Saturdays attended goes up.  There’s a little blip around the 13-14 mark, but you’d have to do some major squinting to make out a second distribution in the data.  As a result, I don’t have a great bifurcation mark, and so I’m picking a quick and dirty number:  9.  If you’ve been to at least 9 SQL Saturdays, you’ve probably gone from “casual attendee” to speaker (or at least active volunteer).  It’s easy to think of counter-examples in places like Jacksonville, which have had quite a few more SQL Saturdays than places like Omaha, but I think 9’s about as good a cutoff point as any.
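
To see how that cutoff splits the attendee pool, a quick count works (9 being my guess at the dividing line and nothing more scientific than that):

--How many people land on each side of the 9-event cutoff?
SELECT
	CASE WHEN a.NumberAttended >= 9 THEN 'Likely speaker or volunteer' ELSE 'Casual attendee' END AS AttendeeType,
	COUNT(1) AS People
FROM dbo.SQLSaturdayAttendee a
GROUP BY
	CASE WHEN a.NumberAttended >= 9 THEN 'Likely speaker or volunteer' ELSE 'Casual attendee' END;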

Analyzing Attendees

So now, after about 3000 words, we get to the main event.  Who’s really going to these things?  As usual, I’m going to start off with a T-SQL statement which frames my data nicely, and then perform analysis using R.

SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;

This is my data set, which returns 55019 rows. This means that we lost approximately half of our registrants as a result of data cleansing. Again, this should make us quite circumspect about predicting results, but I think we have enough data to do some descriptive analytics without making fools of ourselves.

One stipulation I want to make is that I’m cutting off any entries over 3000 miles. My belief is that nobody is flying cross-country to visit a SQL Saturday (again, excluding speakers and sponsors). The only exception I could see is Hawaii, so I’m probably skewing that result a bit…but even then, most of the long-distance travelers are probably speakers and sponsors going on vacation. If the distance I came up with is more than 3000 miles, I’m just treating it as bad data.

#Get non-speaker attendee distance from each event.
cmd <- "SELECT
	s.EventNumber,
	s.EventName,
	s.EventDate,
	s.City,
	a.UserGUID,
	a.NumberAttended,
	s.GeoPoint.STDistance(a.GeoPoint) / 1609.344 AS DistanceInMiles
FROM PASSData.dbo.SQLSatRegistrations r
	INNER JOIN PASSData.dbo.SQLSaturday s
		ON s.EventNumber = r.SatNum
	INNER JOIN PASSData.dbo.SQLSaturdayAttendee a
		ON r.UserGUID = a.UserGUID
WHERE
	a.Longitude IS NOT NULL
  AND (s.GeoPoint.STDistance(a.GeoPoint) / 1609.344) < 3000
	AND a.NumberAttended < 9;"
attendee <- sqlQuery(conn, cmd)

From here until I get through the results, I’m going to throw in a bunch of images and single lines of code. I think the images describe what the code is doing, so I’ll mostly add color commentary instead of function call descriptions.

boxplot(attendee$DistanceInMiles, data=attendee, main="Attendee Distance Traveled", horizontal=TRUE)

AttendeeDistanceTraveled

This box and whisker plot shows a couple of things: the bottom 75 percent of attendees don’t travel that far, and there are a lot of outliers in this data set. As useful as this visualization is, I also want to get a summary of results.

summary(attendee$DistanceInMiles)

TotalSummary

The median distance is 19 miles, and the mean distance is 94 miles. 75% of attendees travel less than 70 miles in this data set. So where do we get 400-600 miles? Maybe it’s the 90th or 95th percentile.

quantile(attendee$DistanceInMiles, c(0.25, 0.5, 0.75, 0.9, 0.95))

TotalQuantile

95% of attendees travel under 400 miles. But how has this changed over time? I’m going to use the year function in lubridate and apply it to my distance in miles plot to group by year.

boxplot(attendee$DistanceInMiles~year(attendee$EventDate), main="Attendee Distance Traveled By Year", horizontal=TRUE)

AttendeeDistanceTraveledByYear

This graphic is packed, but we can see a couple of things. First, the median has dropped a bit since 2007-2008. Second, the 75th percentile has also dropped a bit. I think both of these make a lot of sense: as we get more events, people don’t have to travel as far to attend a SQL Saturday. If the only one in your region is 300 miles away, you can justify going that far once a year for a unique event. But if there are 6 in your region over the course of a year, why not just pick the closest one or two?

Still, I want to see some hard numbers, so I’m going to call the summary function for each year.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), summary)
$`2007`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   2.896   13.410   93.150  229.600  182.700 2147.000 

$`2008`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2401   10.5600   27.3500  159.9000  118.8000 2515.0000 

$`2009`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.2845    9.6410   19.2300  116.2000   73.5600 2473.0000 

$`2010`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1588   10.1300   20.5500  120.6000   95.0500 2717.0000 

$`2011`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.6631   10.5800   23.2100  123.3000  106.2000 2686.0000 

$`2012`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.1913    9.7210   21.3200  113.0000   92.9900 2843.0000 

$`2013`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3650   19.7400  103.6000   72.5000 2710.0000 

$`2014`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.5420   17.3000   84.3200   60.1100 2694.0000 

$`2015`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    8.1410   16.5700   74.9300   49.7500 2700.0000 

$`2016`
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.0337    9.3220   19.8800   81.8100   57.3300 2685.0000 

The mean has dropped considerably from over 110-120 miles down to 75-85 since 2014. 75% of attendees now won’t go more than about an hour for an event—or, maybe phrased differently, events are finding plenty of local attendees. Let’s see what it looks like at the 90th and 95th percentiles.

tapply(attendee$DistanceInMiles, year(attendee$EventDate), quantile, type=8, probs=c(0.25, 0.5, 0.75, 0.9, 0.95))
$`2007`
       25%        50%        75%        90%        95% 
  13.29615   93.14978  185.26700  865.47610 1025.75327 

$`2008`
      25%       50%       75%       90%       95% 
 10.56431  27.35487 118.84406 457.74800 823.98637 

$`2009`
       25%        50%        75%        90%        95% 
  9.616829  19.226355  74.048157 204.863135 466.467323 

$`2010`
      25%       50%       75%       90%       95% 
 10.12820  20.55441  95.39654 224.65303 546.36305 

$`2011`
      25%       50%       75%       90%       95% 
 10.57865  23.21103 106.20535 235.10075 571.08523 

$`2012`
       25%        50%        75%        90%        95% 
  9.717188  21.323516  92.995248 250.098624 450.006102 

$`2013`
       25%        50%        75%        90%        95% 
  9.365244  19.742029  72.515673 215.197417 439.364723 

$`2014`
       25%        50%        75%        90%        95% 
  8.541756  17.301598  60.168109 177.018031 316.380895 

$`2015`
       25%        50%        75%        90%        95% 
  8.141225  16.571739  49.746705 155.265559 259.042488 

$`2016`
       25%        50%        75%        90%        95% 
  9.320905  19.878521  57.334977 155.104415 279.424947 

Okay, I mentioned I’d focus mostly on color commentary. Here’s an exception: I specified type=8 for my quantile function. Which type you choose can make a huge difference in calculation. The link recommends type 8 when you don’t have a reason to go with a different calculation, but you can use any of the continuous calculation types.  They will differ a bit (and you can see differences between what summary provided us and what quantiles gives us), but for this data set, I don’t think you get any crazy differences by selecting type 7 or 9.

Finally, this leads me to an interesting question: are we getting more registrants per event now than before? If we’re seeing event growth, attendee growth, and distance shrinkage, that indicates that SQL Saturdays are becoming more popular, drawing in people who previously wouldn’t travel the 20 miles across town to spend a day learning. Back to SQL Server once more.

SELECT
	YEAR(s.EventDate) AS EventYear,
	COUNT(DISTINCT s.EventNumber) AS NumberOfEvents,
	COUNT(1) AS NumberOfRegistrants,
	1.0 * COUNT(1) / COUNT(DISTINCT s.EventNumber) AS AverageAttendance
FROM dbo.SQLSaturday s
	INNER JOIN dbo.SQLSatRegistrations r
		ON s.EventNumber = r.SatNum
GROUP BY
	YEAR(s.EventDate)
ORDER BY
	EventYear DESC;

RegistrantAveragePerYear

This time I included anybody who is in the list, regardless of whether we could geocode that person’s location. Given that 2016 was only about a third of the way over, I’m not sure we can put much credence into those registrant counts. But we have seen an increase over the years. It’s not a major increase, but it is an increase.

Conclusions

I’ve hedged my bets a few times during this post because I know the data’s incomplete, and there’s a risk that the points of the data set I’m missing would conclusively prove me wrong.  Nevertheless, I think the likelihood of those extra data points being radically different than the ~50% that we were able to keep is fairly low, so I’ll make a couple of cautious conclusions.

First, non-speakers and non-sponsors do not appear to travel very far to SQL Saturdays.  Looking at the yearly quantile breakout above, by the time we get to 2014, people have become a bit insular in their SQL Saturday attendance.  The median traveler (after that initial event in Orlando in 2007) has never traveled more than about 20 miles.

Also, I think the 600 mile radius is too large, based on the data set.  Looking at the 95th percentile of travelers, those people haven’t gone more than 400 miles on average since 2013, and the number has dropped down to the 260-280 range.  Also, the 95th percentile, even in a heavily skewed distribution like this one, is looking far out there.  Even if you scale it back to just the 90th percentile (which is still considered an edge case area), attendees since 2014 won’t even go 200 miles.  That means that there could be some overlap between Cleveland and Columbus (which, having visited both the past couple of years, I do see), but there’d be almost no overlap between Cleveland and Cincinnati, and you’d be stretching to see much overlap between Columbus and Louisville.

But my real conclusion is this:  grab the data set and draw your own conclusions.  I’m working through this stuff the same as anybody else could, and we’re in a community full of people who know how to analyze data.  Maybe you’ll find something I missed which either bolsters or undercuts one of my points; either way, I want to know and I want to push you to dig just a little deeper than I did.  (But hopefully what you find won’t completely invalidate my points…)

Calculating SQL Saturday Distances

EDIT 2016-07-21 17:30:00 — It turns out that my distance calculation query was a little bit off. I was accidentally filtering out SQL Saturdays which did not have any 600-mile collisions in the next week but did have a 600-mile collision in the current week. With the fix in place, there are now 38 pair collisions out of 209 combos. I’ve updated the commentary portion to match these corrected numbers, but they don’t fundamentally alter the point. I also found that I mis-coded Kansas City’s state in 2015; it has been corrected in the data file. Moral of this story: don’t blog too early in the morning?

A couple of days ago, Argenis Fernandez put up a notice that PASS is changing the rules for SQL Saturdays:

The first change we are going to make is to the amount of funding support we provide organizers. Currently any organizer, whatever the size of their SQLSaturday event, receives $500. Starting January 1, 2017, we will be reducing this to amount to $250 and targeting those cities/locations that really need the financial support to grow, rather than well-established events that are better capable of attracting sponsors and financial support. When assessing those events that require financial support, PASS will be meeting with event organizers and complete a review of past performance.

Which brings me to the second change: implementing a new 600-mile radius and concurrent event timing restriction to reduce competition between individual SQLSaturday events in Canada/USA.  The event timing restriction means that events within the 600-mile radius cannot occur on the same weekend or one weekend prior or after.  This will help to deliver a more consistent and satisfying experience for sponsors, which will reduce sponsor exhaustion and assist with increasing overall ROI.  The existing 400-mile radius restriction for all other regions will remain.

I don’t mind the first rule change at all, and from what I’ve seen, I don’t think anybody does.  That $250 might be the difference between breaking even and going slightly in the hole, but if you know about it beforehand, it’s not that hard to squeeze $250 out of a budget or find $250 in sponsorships somewhere.  As a result, it saves PASS somewhere between $25-50K a year and most events won’t notice a difference.

The second change, however, has been…controversial.  Steve Jones has a well-written argument against the rule change, particularly in the way this was announced.

How Much Does This Change?

I wanted to check out current SQL Saturdays for 2015 and 2016  and see how many would have been affected by the new rule.  So let’s take a look!

Getting The Data

I decided to go with a simplistic approach.  Rather than looking up the specific addresses of each SQL Saturday, I went by city name.  The way I figure it, if I’m 10-20 miles off, it won’t make a huge difference.  I also decided to limit my focus to the US and Canada, as those are the areas affected by the rule change.

I coded each event with city, state (or province), and event date.  You can get my results here in CSV format.

Geocoding The Data

My CSV doesn’t include latitude and longitude, so I’m going to load the data into R, get that latitude and longitude information, and then put it into SQL Server. First create the SQL Server table:

CREATE TABLE dbo.SQLSaturdays
(
	City VARCHAR(40),
	State CHAR(2),
	EventDate DATE,
	Latitude DECIMAL(8, 4),
	Longitude DECIMAL(8, 4),
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326)
);
GO

And here’s the R script I threw together:

install.packages("ggmap")
install.packages("RODBC")
library(ggmap)
library(RODBC)

sqlsat <- read.csv("C:\\Temp\\2015_2016_SQLSaturdays.csv")
sqlsat$City <- as.character(sqlsat$City)
sqlsat$State <- as.character(sqlsat$State)
sqlsat$Date <- as.Date(sqlsat$Date, format="%m/%d/%Y")
sqlsat$GeoLookup <- paste(sqlsat$City, sqlsat$State, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is only 106 observations, so we can do that without issue.
sqlsat <- cbind(sqlsat, t(sapply(sqlsat$GeoLookup, geocode, USE.NAMES=FALSE)))

sqlsat

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=Scratch;Provider=SQLNCLI11.1;Integrated Security=SSPI")

# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 106 rows, it'll do.
values <- paste("('",sqlsat$City,"','",sqlsat$State,"','",sqlsat$Date,"',",sqlsat$lat,",",sqlsat$lon,")", sep="", collapse=",")
cmd <- paste("INSERT INTO Scratch.dbo.SQLSaturdays(City, State, EventDate, Latitude, Longitude) VALUES ", values)
result <- sqlQuery(conn, cmd, as.is=TRUE)

close(conn)

There are a couple not-so-great practices (particularly around the way I inserted data into SQL Server) but it does the job, especially when you only have 106 rows.  Also, if you want to play along at home, you’ll probably want to change the connection and database names.

Determining Distance

From there, I ran a T-SQL query to do the following:

  1. Turn latitude and longitude into GEOGRAPHY points
  2. Find SQL Saturday pairs which occur within one week of one another
  3. Calculate the distance in miles between these two city pairs
  4. Return only city pairs which are less than 600 miles apart

Here’s the script:

WITH sqlsats AS
(
	SELECT
		City,
		State,
		EventDate,
		Latitude,
		Longitude,
		GeoPoint,
		DENSE_RANK() OVER (ORDER BY EventDate) dr
	FROM dbo.SQLSaturdays s
),
chronoprox AS
(
	SELECT
		s.City,
		s.State,
		s.EventDate,
		s.Latitude,
		s.Longitude,
		s.GeoPoint,
		sNext.City AS NextCity,
		sNext.State AS NextState,
		sNext.EventDate AS NextEventDate,
		sNext.GeoPoint AS NextGeoPoint
	FROM sqlsats s
		LEFT OUTER JOIN sqlsats sNext
			ON (s.dr = sNext.dr - 1 OR s.dr = sNext.dr)
			AND s.EventDate >= DATEADD(DAY, -7, sNext.EventDate)
			AND NOT (s.City = sNext.City AND s.State = sNext.State)
),
geoprox AS
(
	SELECT
		cp.City,
		cp.State,
		cp.EventDate,
		cp.Latitude,
		cp.Longitude,
		cp.GeoPoint,
		cp.NextCity,
		cp.NextState,
		cp.NextEventDate,
		cp.NextGeoPoint,
		cp.GeoPoint.STDistance(cp.NextGeoPoint) / 1609.344 AS DistanceInMiles
	FROM chronoprox cp
	WHERE
		cp.NextGeoPoint IS NOT NULL
)
SELECT
	gp.City,
	gp.State,
	gp.EventDate,
	gp.NextCity,
	gp.NextState,
	gp.NextEventDate,
	gp.DistanceInMiles
FROM geoprox gp
WHERE
	gp.DistanceInMiles < 600
ORDER BY
	gp.NextEventDate;

The end result is that there were 38 city pairs with a distance of less than 600 miles in 2015 and 2016.  23 of these pairings took place in 2015, and 15 in 2016 (including scheduled events which have not yet happened).

Deriving Conclusions

10 of the 38 city-pairs were in the run-up to PASS Summit 2015, and Raleigh & Charlotte were right in the middle of that.

As Grand Poobah of the Raleigh SQL Saturday, I am affected by this change:  we have some larger events (Atlanta, Orlando, Washington, Charlotte) in our region, as well as smaller ones (Richmond, Spartanburg).  Finding a good date has become a bit harder (in my case, because I don’t want to jump claim on any of those events’ dates), but I also don’t want to oversell that difficulty:  out of 209 potential combinations, we only saw 38 collisions.  And that’s factoring in current concerns people have about the total number of available dates, such as how you would never have a SQL Saturday around Thanksgiving or Christmas or Labor Day or…

I also don’t want to undersell the marginal increase in difficulty for smaller SQL Saturdays to keep going—it’s hard to find a venue even in the best of circumstances (trust me, I know!), and feeling like you’re going to be locked out of a good portion of the year reduces your options even further.  My original concern was mostly around midwestern events, as they’re going to have the most geographical overlap, but if you check out the pairs, the southeast is actually the hardest-hit region:

--I took the results from the previous query and put them into a temp table called #tmp (there’s a sketch of that step just after this query).
WITH records AS
(
	SELECT
		City,
		State
	FROM #tmp
	UNION ALL
	SELECT
		NextCity,
		NextState
	FROM #tmp
)
SELECT
	City,
	State,
	COUNT(1) AS Collisions
FROM records
GROUP BY
	City,
	State
ORDER BY
	Collisions DESC;
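
Here’s that sketch.  One way to get the pair results into #tmp (not necessarily verbatim what I ran) is to re-run the distance query from earlier with an INTO clause added to its final SELECT.  The one-row CTE below is just a stand-in for the real geoprox CTE chain so that the snippet runs on its own; only the columns the counting query needs are included.

IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp;
GO
WITH geoprox AS
(
	--Stand-in for the geoprox CTE in the distance query above.
	SELECT
		CAST('SomeCity' AS VARCHAR(40)) AS City,
		CAST('NC' AS CHAR(2)) AS State,
		CAST('AnotherCity' AS VARCHAR(40)) AS NextCity,
		CAST('GA' AS CHAR(2)) AS NextState,
		CAST(150.0 AS DECIMAL(9, 3)) AS DistanceInMiles
)
SELECT
	gp.City,
	gp.State,
	gp.NextCity,
	gp.NextState
INTO #tmp
FROM geoprox gp
WHERE
	gp.DistanceInMiles < 600;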

[Image: SQL Saturday collision counts by city]

Kansas City tops the list with 6 collisions, followed by Raleigh with 5 (go us!). Spartanburg, Dallas, Columbus (GA), and Orlando are also near the top of the list with 4 apiece.  Note that the total number of collisions adds up to 76 because I include both “sides” of the 38 collisions.  This does affect 34 separate events, though.

What If?

While I have the data, how about we play around with a couple of alternatives?  (There’s a quick sketch after this list of how to re-run the numbers at a different radius.)

  1. If we shrink the collision radius to 500 miles, we’re down to 30 collisions, including 14 in 2016.
  2. At 400 miles (the international limit), we have 18 collision pairs, including 7 in 2016.
  3. At 700 miles, we’re up to 54 collisions, including 20 in 2016.  As a side note, at 700 miles, we would have had collisions in all 9 of the cities which had SQL Saturdays in the first three weeks of October 2015 (Columbus GA, Pittsburgh, KC, Orlando, Raleigh, Minneapolis, Boston, Charlotte, Dallas). At 600 miles, Boston just barely slips off the list.
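
As promised, here’s the sketch.  Re-running these numbers is mostly a matter of swapping out the 600 in the final WHERE clause, which is easier if you pull the radius into a variable.  @RadiusInMiles is my own illustrative name, and this version leaves out the one-week date window from the full query above, so treat it as the pattern rather than the exact script I used:

--Compare every pair of events against a configurable radius (date proximity omitted here).
DECLARE @RadiusInMiles DECIMAL(9, 3) = 500;

SELECT
	s.City,
	s.State,
	sNext.City AS NextCity,
	sNext.State AS NextState,
	s.GeoPoint.STDistance(sNext.GeoPoint) / 1609.344 AS DistanceInMiles
FROM dbo.SQLSaturdays s
	CROSS JOIN dbo.SQLSaturdays sNext
WHERE
	NOT (s.City = sNext.City AND s.State = sNext.State)
	AND s.GeoPoint.STDistance(sNext.GeoPoint) / 1609.344 < @RadiusInMiles;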

Final Thoughts (For Now)

I’m still on the fence about this decision.  As a speaker, I like the fact that there are so many events on the east coast and midwest, and I’d hate to see the number of opportunities to visit other cities and help people learn shrink as a result of this change.  Looking at the data, I think we’ll lose some events on the margin.  There is some opportunity for growth in less desirable times of the year (think July), but the problem is that if you’re already running a smaller event, picking a bad time of year will guarantee that not many people will be able to show up.

But at the same time, organizers of several events I’ve attended have told me that sponsorships are drying up this year.  If that’s the case across the board, then we might have reached the event limit, particularly in areas like the southeast which have a large number of events.

Finding Invalid String Combinations

Today, I had to write a query to parse a column in SQL Server to figure out if the process which loads this column is using delimiters correctly.

Problem Description

One of the developers on my team came to me asking about how to find a hidden recordset character in a column.  They were using this recordset character (CHAR(30)) to separate four-digit numbers in a VARCHAR column, and somehow bad data was getting in, where the recordset character was separating subsets of those four-digit numbers.  My mission was to figure out how widespread this was.

Let’s Prep Some Data

For this demo, I’m going to introduce some slightly simplified data that is still pretty close to reality.

CREATE TABLE #Test
(
	CustomerID INT,
	CampaignID INT,
	Value VARCHAR(30)
);

INSERT INTO #Test
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1, 1, CONCAT('1234', CHAR(30), '5678')),							--Good
(1, 2, '5678'),														--Good
(1, 3, CONCAT('1', CHAR(30), '2', CHAR(30), '6', CHAR(30), '8')),	--Bad
(1, 4, '12345678'),													--Bad
(2, 1, CONCAT('123', CHAR(30), '4', CHAR(30), '5678')),				--Bad
(2, 2, ''),															--Ignore
(2, 6, CONCAT('2222', CHAR(30), '4442'));							--Good

I’ve marked records as “good” if they follow the practice of 4-digit blocks with recordset separators.

Rant Time

Before I get to the code, let me get to the rant.  Normalization is a vital part of database design.  The scenario I am about to describe fails even first normal form, and this has caused a small amount of suffering on the part of developers and on my part as well.  Ignore first normal form at your peril; eventually, you will suffer as a result.

For a primer on normal forms, I have a presentation (which I’ll probably update by the next time I give it).  In this particular scenario, it’s obvious that we’re storing sets of 0 or more numeric values, so instead of a single table with a character-based value field, we really ought to have a table which stores integers.

CREATE TABLE #campaign
(
	CustomerID INT,
	CampaignID INT
);

INSERT INTO #campaign
(
	CustomerID,
	CampaignID
)
VALUES
(1,1),
(1,2),
(1,3),
(1,4),
(2,1),
(2,2),
(2,6);

CREATE TABLE #CampaignValue
(
	CustomerID INT,
	CampaignID INT,
	Value INT
);

INSERT INTO #CampaignValue
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1,1,1234),
(1,1,5678),
(1,2,5678),
(1,3,1268),
(1,4,1234),
(1,4,5678),
(2,1,1234),
(2,1,5678),
(2,6,2222),
(2,6,4442);

There are several advantages to this.  First, we guarantee that there will never be an invalid row separator because SQL Server handles row separation for us automatically.  Second, we can put a check constraint in place if the number must be four digits (e.g., Value between 1000 and 9999 because 0 is not a valid lead character).  Third, it avoids the following code to detect issues.
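
That second advantage is a one-liner.  Here’s a sketch against the demo temp table above; on the real, permanent table you would also want to give the constraint a proper name, but the idea is the same:

--Enforce the four-digit rule described above: values from 1000 through 9999 (no leading zero).
ALTER TABLE #CampaignValue
	ADD CHECK (Value BETWEEN 1000 AND 9999);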

Issue Detection

The following script finds any issues.  I start by building a tally table.  I am a huge fan of tally tables and recommend you have one on every instance.  If you don’t have one, the script below creates one in a temp table.

From here, I’ll show the code and then walk through each section.

SET NOCOUNT ON;
CREATE TABLE #Tally
(
	N INT IDENTITY(0,1) NOT NULL PRIMARY KEY
);
GO
INSERT INTO #Tally DEFAULT VALUES
GO 10000
GO
WITH records AS
(
	SELECT
		t.CustomerID,
		t.CampaignID,
		t.Value,
		val.ReplacedValue,
		val.RemovedValue
	FROM #Test t
	CROSS APPLY
	(
		SELECT
			REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
			REPLACE(t.Value, CHAR(30), '') AS RemovedValue
	) val
	WHERE
		t.Value <> ''
),
tallyresults AS
(
	SELECT
		r.CustomerID,
		r.CampaignID,
		r.Value,
		r.ReplacedValue,
		r.RemovedValue,
		t.N,
		SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
	FROM records r
	CROSS JOIN #Tally t
	WHERE
		t.N < LEN(r.RemovedValue)
		AND t.N % 4 = 0
),
tallyjoin AS
(
	SELECT DISTINCT
		tr.CustomerID,
		tr.CampaignID,
		tr.Value,
		tr.ReplacedValue,
		tr.RemovedValue,
		REPLACE
		(
			CAST
			(
				STUFF
				((
					SELECT
						',' + tri.SplitVersion
					FROM tallyresults tri
					WHERE
						tri.CustomerID = tr.CustomerID
						AND tri.CampaignID = tr.CampaignID
					ORDER BY
						tri.N
					FOR XML PATH ('')
				), 1, 1, '') AS VARCHAR(8000)
			), N',', CHAR(30)
		) AS NewValue
	FROM tallyresults tr
)
SELECT
	tj.CustomerID,
	tj.CampaignID,
	tj.Value,
	tj.ReplacedValue,
	tj.RemovedValue,
	tj.NewValue,
	val.NewReplacedValue
FROM tallyjoin tj
	CROSS APPLY
	(
		SELECT
			REPLACE(tj.NewValue, CHAR(30), 'x') AS NewReplacedValue
	) val
WHERE
	tj.Value <> tj.NewValue;

In this solution, I have three common table expressions.  Let’s start with the first one, records.  Records does two things:  first, it gives us a visual indicator of where my recordset characters are by replacing them with x’s; second, it strips the recordset characters out entirely (RemovedValue), which I’ll use later to rebuild the string correctly.

To do this, I use one of my favorite T-SQL features:  CROSS APPLY.

SELECT
	t.CustomerID,
	t.CampaignID,
	t.Value,
	val.ReplacedValue,
	val.RemovedValue
FROM #Test t
CROSS APPLY
(
	SELECT
		REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
		REPLACE(t.Value, CHAR(30), '') AS RemovedValue
) val
WHERE
	t.Value <> ''

What I’m doing is taking each record in #Test and performing two REPLACE operations. I’m doing this in a derived table I’m calling val. The derived table adds essentially no extra cost to the query, but simplifies my SELECT clause by moving calculations elsewhere. In this particular scenario, I’m not really gaining anything, but in my real-life version, I performed some other work against t.Value before running the REPLACE functions, and so I would otherwise need to repeat that work in several places. If you’re interested in learning more about how to simplify code using APPLY, check out an example from my APPLY Yourself talk.
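
To make that concrete, here’s a contrived sketch of the pattern.  The LTRIM/RTRIM step stands in for that “other work” (it is made up for illustration and is not part of the real query); the point is that each APPLY can build on the one before it, so the cleanup expression only gets written once:

SELECT
	t.CustomerID,
	t.CampaignID,
	val.ReplacedValue,
	val.RemovedValue
FROM #Test t
	CROSS APPLY
	(
		--Hypothetical pre-processing step, written exactly once.
		SELECT LTRIM(RTRIM(t.Value)) AS CleanValue
	) prep
	CROSS APPLY
	(
		--Both REPLACE calls reuse prep.CleanValue instead of repeating the cleanup.
		SELECT
			REPLACE(prep.CleanValue, CHAR(30), 'x') AS ReplacedValue,
			REPLACE(prep.CleanValue, CHAR(30), '') AS RemovedValue
	) val
WHERE
	t.Value <> '';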

After using APPLY to simplify the calculations, I feed the results from records into a second common table expression, tallyresults.

SELECT
	r.CustomerID,
	r.CampaignID,
	r.Value,
	r.ReplacedValue,
	r.RemovedValue,
	t.N,
	SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
FROM records r
CROSS JOIN #Tally t
WHERE
	t.N < LEN(r.RemovedValue)
	AND t.N % 4 = 0

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

SELECT DISTINCT
	tr.CustomerID,
	tr.CampaignID,
	tr.Value,
	tr.ReplacedValue,
	tr.RemovedValue,
	REPLACE
	(
		CAST
		(
			STUFF
			((
				SELECT
					',' + tri.SplitVersion
				FROM tallyresults tri
				WHERE
					tri.CustomerID = tr.CustomerID
					AND tri.CampaignID = tr.CampaignID
				ORDER BY
					tri.N
				FOR XML PATH ('')
			), 1, 1, '') AS VARCHAR(8000)
		), N',', CHAR(30)
	) AS NewValue
FROM tallyresults tr

This query is pretty straightforward except for the XML trick. I need to get a distinct set of results because I don’t need the same customer ID, campaign ID, and value showing up multiple times; I only need it once, and I know there will be duplicates because I’m combining multiple rows together in a subquery. Note that I removed SplitVersion and N from the result set; if I don’t do that, I’ll continue to show every row, which means I’d have duplicate results at the end.

Finally, I wrap it up by selecting values from tallyjoin and doing one last replace of recordset values with x’s so we can see the separators in our results:

[Image: the recordset result set, with separators shown as x’s]

Conclusion

This is an excellent case study in the importance of proper database normalization.  Because of one little bug in the loading process, I needed to write a fairly complicated query to recover from it.  With a properly normalized design, we would not have had this bug in the first place.