Calculating SQL Saturday Distances

EDIT 2016-07-21 17:30:00 — It turns out that my distance calculation query was a little bit off. I was accidentally filtering out SQL Saturdays which did not have any 600-mile collisions in the next week but did have a 600-mile collision in the current week. With this in place, there are now 38 pair collisions out of 209 combos. I’ve updated the commentary portion to match these corrected numbers, but they don’t fundamentally alter the point. I also found that I mistakenly mis-coded Kansas City’s state in 2015; it has been corrected in the data file. Moral of this story: don’t blog too early in the morning?

A couple of days ago, Argenis Fernandez put up a notice that PASS is changing the rules for SQL Saturdays:

The first change we are going to make is to the amount of funding support we provide organizers. Currently any organizer, whatever the size of their SQLSaturday event, receives $500. Starting January 1, 2017, we will be reducing this to amount to $250 and targeting those cities/locations that really need the financial support to grow, rather than well-established events that are better capable of attracting sponsors and financial support. When assessing those events that require financial support, PASS will be meeting with event organizers and complete a review of past performance.

Which brings me to the second change: implementing a new 600-mile radius and concurrent event timing restriction to reduce competition between individual SQLSaturday events in Canada/USA.  The event timing restriction means that events within the 600-mile radius cannot occur on the same weekend or one weekend prior or after.  This will help to deliver a more consistent and satisfying experience for sponsors, which will reduce sponsor exhaustion and assist with increasing overall ROI.  The existing 400-mile radius restriction for all other regions will remain.

I don’t mind the first rule change at all, and from what I’ve seen, I don’t think anybody does.  That $250 might be the difference between breaking even and going slightly in the hole, but if you know about it beforehand, it’s not that hard to squeeze $250 out of a budget or find $250 in sponsorships somewhere.  As a result, it saves PASS somewhere between $25-50K a year and most events won’t notice a difference.

The second change, however, has been…controversial.  Steve Jones has a well-written argument against the rule change, particularly in the way this was announced.

How Much Does This Change?

I wanted to check out current SQL Saturdays for 2015 and 2016  and see how many would have been affected by the new rule.  So let’s take a look!

Getting The Data

I decided to go with a simplistic approach.  Rather than looking up the specific addresses of each SQL Saturday, I went by city name.  The way I figure it, if I’m 10-20 miles off, it won’t make a huge difference.  I also decided to limit my focus to the US and Canada, as those are the areas affected by the rule change.

I coded each event with city, state (or province), and event date.  You can get my results here in CSV format.

Geocoding The Data

My CSV doesn’t include latitude and longitude, so I’m going to load the data into R, get that latitude and longitude information, and then put it into SQL Server. First create the SQL Server table:

CREATE TABLE dbo.SQLSaturdays
(
	City VARCHAR(40),
	State CHAR(2),
	EventDate DATE,
	Latitude DECIMAL(8, 4),
	Longitude DECIMAL(8, 4),
	GeoPoint AS GEOGRAPHY::Point(Latitude, Longitude, 4326)
);
GO

And here’s the R script I threw together:

install.packages("ggmap")
install.packages("RODBC")
library(ggmap)
library(RODBC)

sqlsat <- read.csv("C:\\Temp\\2015_2016_SQLSaturdays.csv")
sqlsat$City <- as.character(sqlsat$City)
sqlsat$State <- as.character(sqlsat$State)
sqlsat$Date <- as.Date(sqlsat$Date, format="%m/%d/%Y")
sqlsat$GeoLookup <- paste(sqlsat$City, sqlsat$State, sep = " ")

# Perform Google Maps geocoding.  Google Maps provides 1000 geocoding hits per day for free.
# Our data set is only 106 observations, so we can do that without issue.
sqlsat <- cbind(sqlsat, t(sapply(sqlsat$GeoLookup, geocode, USE.NAMES=FALSE)))

sqlsat

conn <- odbcDriverConnect("Driver=SQL Server;Server=.;Initial Catalog=Scratch;Provider=SQLNCLI11.1;Integrated Security=SSPI")

# Via http://stackoverflow.com/questions/14334840/how-to-insert-a-dataframe-into-a-sql-server-table
# Not a great practice, but for 106 rows, it'll do.
values <- paste("('",sqlsat$City,"','",sqlsat$State,"','",sqlsat$Date,"',",sqlsat$lat,",",sqlsat$lon,")", sep="", collapse=",")
cmd <- paste("INSERT INTO Scratch.dbo.SQLSaturdays(City, State, EventDate, Latitude, Longitude) VALUES ", values)
result <- sqlQuery(conn, cmd, as.is=TRUE)

close(conn)

There are a couple not-so-great practices (particularly around the way I inserted data into SQL Server) but it does the job, especially when you only have 106 rows.  Also, if you want to play along at home, you’ll probably want to change the connection and database names.

Determining Distance

From there, I ran a T-SQL query to do the following:

  1. Turn latitude and longitude into GEOGRAPHY points
  2. Find SQL Saturday pairs which occur within one week of one another
  3. Calculate the distance in miles between these two city pairs
  4. Return only city pairs which are less than 600 miles apart

Here’s the script:

WITH sqlsats AS
(
	SELECT
		City,
		State,
		EventDate,
		Latitude,
		Longitude,
		GeoPoint,
		DENSE_RANK() OVER (ORDER BY EventDate) dr
	FROM dbo.SQLSaturdays s
),
chronoprox AS
(
	SELECT
		s.City,
		s.State,
		s.EventDate,
		s.Latitude,
		s.Longitude,
		s.GeoPoint,
		sNext.City AS NextCity,
		sNext.State AS NextState,
		sNext.EventDate AS NextEventDate,
		sNext.GeoPoint AS NextGeoPoint
	FROM sqlsats s
		LEFT OUTER JOIN sqlsats sNext
			ON (s.dr = sNext.dr - 1 OR s.dr = sNext.dr)
			AND s.EventDate >= DATEADD(DAY, -7, sNext.EventDate)
			AND NOT (s.City = sNext.City AND s.State = sNext.State)
),
geoprox AS
(
	SELECT
		cp.City,
		cp.State,
		cp.EventDate,
		cp.Latitude,
		cp.Longitude,
		cp.GeoPoint,
		cp.NextCity,
		cp.NextState,
		cp.NextEventDate,
		cp.NextGeoPoint,
		cp.GeoPoint.STDistance(cp.NextGeoPoint) / 1609.344 AS DistanceInMiles
	FROM chronoprox cp
	WHERE
		cp.NextGeoPoint IS NOT NULL
)
SELECT
	gp.City,
	gp.State,
	gp.EventDate,
	gp.NextCity,
	gp.NextState,
	gp.NextEventDate,
	gp.DistanceInMiles
FROM geoprox gp
WHERE
	gp.DistanceInMiles < 600
ORDER BY
	gp.NextEventDate;

The end result is that there were 38 city pairs with a distance of less than 600 miles in 2015 and 2016.  23 of these pairings took place in 2015, and 15 in 2016 (including scheduled events which have not yet happened).

Deriving Conclusions

10 of the 38 city-pairs were in the run-up to PASS Summit 2015, and Raleigh & Charlotte were right in the middle of that.

As Grand Poobah of the Raleigh SQL Saturday, this does affect me, as we have some larger events (Atlanta, Orlando, Washington, Charlotte) in our region, as well as smaller institutions (Richmond, Spartanburg).  Finding a good date has become a bit harder (in my case, because I don’t want to jump claim on any of those events’ dates), but I also don’t want to oversell that difficulty:  out of 209 potential collisions, we only saw 38.  And that’s factoring in current concerns people have about the total number of available dates, such as how you would never have a SQL Saturday around Thanksgiving or Christmas or Labor Day or…

I also don’t want to undersell the marginal increase in difficulty for smaller SQL Saturdays to keep going—it’s hard to find a venue even in the best of circumstances (trust me, I know!), and feeling like you’re going to be locked out of a good portion of the year reduces your options even further.  My original concern was mostly around midwestern events, as they’re going to have the most geographical overlap, but if you check out the pairs, the southeast is actually the hardest-hit region:

--I took the results from the previous query and put them into a temp table called #tmp.
WITH records AS
(
	SELECT
		City,
		State
	FROM #tmp
	UNION ALL
	SELECT
		nextcity,
		nextstate
	FROM #tmp
)
SELECT
	City,
	State,
	COUNT(1) AS Collisions
FROM records
GROUP BY
	City,
	State
ORDER BY
	Collisions DESC;

SQLSaturdayCollisions

Kansas City tops the list with 6 collisions, followed by Raleigh with 5 (go us!). Also near the top of the list are Spartanburg, Dallas, Columbus (GA), and Orlando have 4 apiece.  Note that the total number of collisions adds up to 76 because I include both “sides” of the 38 collisions.  This does affect 34 separate events, though.

What If?

While I have the data, how about we play around with a couple of alternatives?

  1. If we shrink the collision radius to 500 miles, we’re down to 30 collisions, including 14 in 2016.
  2. At 400 miles (the international limit), we have 18 collision pairs, including 7 in 2016.
  3. At 700 miles, we’re up to 54 collisions, including 20 in 2016.  As a side note, at 700 miles, we would have had collisions in all 9 of the cities which had SQL Saturdays in the first three weeks of October 2015 (Columbus GA, Pittsburgh, KC, Orlando, Raleigh, Minneapolis, Boston, Charlotte, Dallas). At 600 miles, Boston eeks out of the list.

Final Thoughts (For Now)

I’m still on the fence about this decision.  As a speaker, I like the fact that there are so many events on the east coast and midwest, and I’d hate to see the number of opportunities to visit other cities and try to help people learn drop as a result of this.  Looking at the data, I think we’ll lose some events on the margin.  There is some opportunity for growth in less desirable times of the year (think July), but the problem is that if you’re already running a smaller event, picking a bad time of year will guarantee that not many people will be able to show up.

But at the same time, I’ve heard from several events that I’ve attended that sponsorships are drying up this year.  If that’s the case across the board, then we might have reached the event limit, particularly in areas like the southeast which have a large number of events.

Advertisements

16 thoughts on “Calculating SQL Saturday Distances

    • Hi, Steve. The answer is that, for this historical analysis, I am implicitly including both sides of the “one week apart” rule, as we can take advantage of city distances being reflexive. In other words, if the distance between Raleigh & Charlotte is ~150 miles, the distance between Charlotte & Raleigh is also 150 miles, and so the order in which we look at the two cities does not matter as long as we look at them as a pair.

      Here’s a quick example: suppose we have cities A, B, and C giving SQL Saturdays in that order, one on each week. If dist(A, B) < 600 miles and dist(B, C) < 600 miles, there will be two results which show up in the data set: A–B and B–C. If you want to get the number of collisions a particular city is involved in, you'd need to UNION ALL the City and NewCity columns, like in the aggregation query near the end of the post. That gives you a unique number of SQL Saturday collisions for a particular city.

      The analysis gets a little more complex when dealing with multiple colliding events on the same day, so let me use a concrete example here. I'll use Raleigh on October 10th, 2015 as an example here, as it was right in the middle of a SQL Saturday storm (9 US events in 3 weeks, including two others on the same day). Looking at results for Raleigh as City or NextCity, we get 5 rows back. They are:
      1 – Columbus GA (10/3) — Raleigh (10/3)
      2 – Pittsburgh (10/3) — Raleigh (10/3)
      3 – Orlando (10/10) — Raleigh (10/10)
      4 – Raleigh (10/10) — Orlando (10/10)
      5 – Raleigh (10/10) — Charlotte (10/17)

      3 & 4 are technically duplicates (so even my comment of 38 total collisions above does include duplicates when the collision is on the same day), but we can see four distinct city collisions here. Collisions from the week before leave Raleigh in the NextCity column, whereas collisions in the week after leave Raleigh in the City column. When we have a collision on the same day, Raleigh shows up in both columns (call those extra-bad collisions and weigh them double…).

      Now, if you were planning on hosting an event on October 17th and wanted to see if there were any collisions, this query would not work; you would need to look in both directions because your action would modify the data structure in place. But for this kind of historical analysis, we can do it in a single pass.

      • Thanks, Kevin. That makes perfect sense. I was thinking you might only be looking city -> city, not citycity.

        Appreciate the analysis. I think this helps to determine the impact.

  1. I think what is missing in your analysis is how a smaller region, like the northeast is impacted. We basically don’t want to hold winter events because of the potential that bad weather will kill the event which removes about 12 weeks from our choice. Then a 600 mile radius would basically eliminate 33 other weeks for a venue like Portland, ME, because there are regularly 11 other SQLSaturday’s in a 600 mile radius (NY state (4), Boston (2), Providence, Hartford, Washington DC, Philadelphia, Baltimore). Most people from Maine aren’t going to anything outside of New England, and most aren’t going to go south of Boston. Seems to limit opportunities for attendees. Instead of analyzing the distance between locations, let’s look at how far non-speakers are traveling to go to events and limit based on that.

  2. SQLSoldier says:

    Maybe some of the smaller markets in the same geographical region should consider combining their SQL Saturdays. The user groups in Redmond, WA and Olympia, WA did this many years ago. They would alternate the location, 1 year in Redmond, 1 year in Olympia, and the 2 user groups would work together to run it.

    A model like this might work for events like SQLSaturday Rochester and SQLSaturday Albany. I have attended SQL Saturdays all around the country and parts of Canada, and Rochester was by far the smallest I’ve ever attended.

    • Small doesn’t mean unnecessary. It is over a 3 hour drive from Rochester to Albany. Rochester draws from Buffalo as well, while Albany can draw from PA, CT, and VT. I still think someone needs to look at where the attendees are coming from and see how far the majority are currently traveling to go to events. My experience is that most people (non-speakers) are within 2 hours driving of the event and that’s a lot closer than 600 miles.

      I just think that SQLSaturday should be about making good SQL training and spreading the word about PASS easily accessible to as many people as possible and enforcing the 600 mile radius and a week between events makes that unnecessarily difficult to do.

  3. I would say that I think it makes sense for regional events to work together. Albany, Rochester, Toronto, etc. should talk to each other Lincoln/Omaha do as well.

    However, I’d be sadder if we went every other year for events because there’s another close by in the area. I do think that Albany and Rochester can both have an event every year, perhaps even working together to fund raise and share some efforts, perhaps spreading their events out to different parts of the year. But they have to fit their budget, not just adopt an event and then try to raise some budget.

    I would be stunned if more than a small fraction, 10% perhaps, drove more than an hour to an event. We can’t get plenty of people to drive across town in some places. I think 400mi was crazy, and 600 is ridiculous.

    The problem, to me, is not too many events, it’s much to grandiose an idea of what a SQL Saturday is meant to be. It’s not a showcase for organizers or speakers. It’s a way to touch, inspire, and train people.

  4. Wow, 600 miles.

    So if Atlanta keeps its traditional second week of May date, all events as far away as ( Miami, New Orleans, St Louis, Chicago, Detroit, Cleveland, Pittsburgh, Baltimore, Washington DC ) are blocked from a May (4th weekend is Memorial Day) SQL Saturday. That is a huge section of the country. Washington DC effectively blocks from Montreal, Boston, NYC, Jacksonville FL, Atlanta, Chicago, Detroit and, Toronto) Traditionally Portland holds its SQL Saturday the weekend before PASS, which encourages PASS Speakers to also speak in Portland while traveling to the west coast. As Portland is within the 600 mile radius of PASS Summit, would that one be blocked as well?

    Now as an attendee, I would not travel more than maybe two hours to attend. I am lucky, we have great events in Atlanta and Chattanooga that are months apart. As a speaker(my company does not pay me to speak), so 4 hours (200 miles) is my normal limit. 400 miles was good, the 600 mile limit ( twelve hour driving radius ) is a bit excessive.

    Was this limit put in by the vendors who want to speak at every SQL Saturday?

  5. The other thing is that those 38 conflicts are going to move to different weeks, which will increase the number of conflicts. Look at this scenario.

    You theoretically are the organizer of SQL Saturday in Washington DC. Unfortunately a very central location to the rest of the east coast. Based on the assumptions that most organizers do not like to schedule late June – Labor Day because of vacations. Late October/ early November tends to be out as well because of PASS Summit. Holidays like Memorial day, Labor Day, July 4, Christmas plus two weeks before, and Thanksgiving and the weekend before TG are also out.

    This year, the other organizers have theoretically scheduled there events (I made up these dates, to make an evil point) and the following locations are all within 600 miles of Washington DC

    1/14 – Boston MA
    1/28 – Chicago IL
    2/04 – Charlotte NC
    2/18 – Toronto ONT
    3/04 – Nashville, TN
    3/18 – New York, NY
    4/01 – Pittsburgh, PA
    4/15 – Montreal, QEB
    4/29 – Louisville, KY
    5/13 – Atlanta GA
    6/03 – Detroit
    Summer Holiday
    9/10 – Chattanooga, TN
    9/24 – Upstate NYC
    10/1 – Raleigh, NC
    PASS
    12/3 – Philadelphia, PA

    Washington DC (which has has great attendance and organization) is out scheduled for the year. Yes, they can try a summer or do an ANTI Pass east coast October event, but these 14 effectively events block the DC group out.

    Yes, this is a worse scenario, but conflicts will happen. And when the impacted SQL Saturdays migrate, the impact will be greater than expected.

    • Great scenario. I do think you’ll see fewer SQLSaturday events, but not fewer training events. I already know people who are considering just not using the SQLSaturday brand and doing events on their own. With tools like EventBrite and other event management software out there you don’t really need to have the SQLSaturday tools.

    • Evil, slightly. Good to look at best case and worst case scenarios.

      What I’d say here is June, July, August aren’t out of bounds. Events have moved to these spaces, and while you might think DC can’t have one, I’d disagree. We’re talking avg of 300 people, which is nothing.

      However, I do agree that as events have to migrate, there will be other impacts.

    • Venue analysis would be interesting, but it would require a relevant data set. I don’t have any relevant data, but if it became available, it’d be worth taking a look.

      • I suspect that some organizers might be reluctant to disclose this data. I would love to know, however, which venues are free, which require some cleaning/security/insurance cost, and which require rental.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s