Finding Max Concurrent Operations With T-SQL (Part 2)

Last Time On 36 Chambers

Yesterday, I wrote a large number of words to give us some sample data that we could use today.

To give you a quick reminder of the problem we’re trying to solve:  we can have multiple concurrent work items processing for a customer, but we want to limit that number to 2.  The development team put in a check to throttle customers and wants us to write a query to make sure they implemented that check correctly.

*Narrator’s voice*:  They didn’t.

Meanwhile, Back At The Wrench

Because this is a difficult problem, I immediately looked for an Itzik Ben-Gan solution and found one.  Here’s my take on his solution, but definitely read his article.

Based on yesterday’s work, we can generate an arbitrary number of values which approximate a normal distribution, so we’re going to build 5 million rows worth of start and stop times for customers.  Here’s the code we will use:

DROP TABLE IF EXISTS #WorkItems;
CREATE TABLE #WorkItems
(
	WorkItemID INT NOT NULL PRIMARY KEY CLUSTERED,
	CustomerID INT NOT NULL,
	WorkItemStart DATETIME2(0) NOT NULL,
	WorkItemEnd DATETIME2(0) NULL
);

DECLARE
	@NumberOfRecords INT = 5000000,
	@NumberOfCustomers INT = 150,
	@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
	SELECT TOP (@NumberOfRecords)
		n,
		r.CustomerID,
		s.NumberOfSecondsToRun,
		s.NumberOfSecondsBeforeNextRunBegins
	FROM Nums
		CROSS APPLY (
			SELECT
				RAND(CHECKSUM(NEWID())) AS rand1,
				RAND(CHECKSUM(NEWID())) AS rand2,
				CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
		) r
		CROSS APPLY
		(
			SELECT
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
		) s
),
records AS
(
	SELECT
		v.n AS WorkItemID,
		v.CustomerID,
		v.NumberOfSecondsToRun,
		v.NumberOfSecondsBeforeNextRunBegins,
		SUM(v.NumberOfSecondsBeforeNextRunBegins) OVER (PARTITION BY v.CustomerID ORDER BY v.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
	FROM vals v
)
INSERT INTO #WorkItems
(
	WorkItemID,
	CustomerID,
	WorkItemStart,
	WorkItemEnd
)
SELECT
	r.WorkItemID,
	r.CustomerID,
	s.WorkItemStart,
	DATEADD(SECOND, r.NumberOfSecondsToRun, s.WorkItemStart) AS WorkItemEnd
FROM records r
	CROSS APPLY
	(
		SELECT
			DATEADD(SECOND, r.TotalNumberOfSeconds, @StartDate) AS WorkItemStart
	) s;

On the ridiculously overpowered server that I’m abusing to do this blog post, it took about 32 seconds to generate 5 million rows.  We are generating data for 150 customers using a uniform distribution, so we’d expect somewhere around 33,333 records per customer.  In my sample, customer work item counts range from 32,770 to 33,747, so there’s a little bit of variance, but not much.
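
If you want to check that spread on your own run, a quick aggregation over #WorkItems does the trick.  This is just a sanity check and not part of the solution:

-- Sanity check:  how evenly did the work items spread across customers?
SELECT
	MIN(c.WorkItemCount) AS MinWorkItemsPerCustomer,
	MAX(c.WorkItemCount) AS MaxWorkItemsPerCustomer,
	AVG(c.WorkItemCount) AS AvgWorkItemsPerCustomer
FROM
(
	SELECT
		wi.CustomerID,
		COUNT(*) AS WorkItemCount
	FROM #WorkItems wi
	GROUP BY
		wi.CustomerID
) c;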

Building A Solution:  A Single Customer

Just like before, we’re going to squeeze a few hundred more words out of this post and build up a query step by step.  Our first step will be to get the starting times and ending times for each of our selected customer’s work items.  Each starting point and each ending point will take a row, so we’ll use UNION ALL to unpivot this data.

DECLARE
	@CustomerID INT = 1;

SELECT
	wi.CustomerID,
	wi.WorkItemStart AS TimeUTC,
	1 AS IsStartingPoint,
	ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
	wi.CustomerID = @CustomerID

UNION ALL

SELECT
	wi.CustomerID,
	wi.WorkItemEnd AS TimeUTC,
	0 AS IsStartingPoint,
	NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
	wi.CustomerID = @CustomerID

For my data set, I get something like the following:

Getting our start times and our end times.

As a quick note, the start times are all in order but the end times come back in arbitrary order.  That’s fine:  we’ll impose a single time-based ordering in the next step.

Then, we want to build out step two, where we add an ordering based on time for each of the start and stop points:

DECLARE
	@CustomerID INT = 1;

WITH StartStopPoints AS
(
	SELECT
		wi.CustomerID,
		wi.WorkItemStart AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
	UNION ALL
	SELECT
		wi.CustomerID,
		wi.WorkItemEnd AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
)
SELECT
	s.CustomerID,
	s.TimeUTC,
	s.IsStartingPoint,
	s.StartOrdinal,
	ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s;

Running this gives us the following results for my data:

Customer 1 Time Ordered
A plan, starting to come together. Which I am beginning to love.

You can probably see by this point how the pieces are coming together:  each time frame has a starting point and an ending point.  If there were no overlap at all, we’d see the fourth column alternate:  a number followed by a NULL, followed by a number followed by a NULL, and so on.  But we clearly don’t see that:  work item ordinals 3 and 4 share some overlap.  Item 3 started at 3:06:15 PM and ended after item 4’s start of 3:07:20 PM, which means that those two overlapped to some extent.  Then we see two NULL values, which means they both ended before work item 5 began.  So far so good for our developers!

The final calculation looks like this:

DECLARE
	@CustomerID INT = 1;

WITH StartStopPoints AS
(
	SELECT
		wi.CustomerID,
		wi.WorkItemStart AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID

	UNION ALL

	SELECT
		wi.CustomerID,
		wi.WorkItemEnd AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
	SELECT
		s.CustomerID,
		s.TimeUTC,
		s.IsStartingPoint,
		s.StartOrdinal,
		ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
	FROM StartStopPoints s
)
SELECT
	MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
	s.IsStartingPoint = 1;

Here’s why the formula works.  At any starting point, StartOrEndOrdinal counts every start and end which has happened so far, while StartOrdinal counts just the starts.  That means the number of ends so far is StartOrEndOrdinal - StartOrdinal, and the number of still-running work items is StartOrdinal - (StartOrEndOrdinal - StartOrdinal), which simplifies to 2 * StartOrdinal - StartOrEndOrdinal.  If you go back up to the previous image, you can see how this gives us 1 concurrent item for the first three work items, but as soon as we add in work item 4, 2*4-6 = 2, so we now have two concurrent work items.  By the way, this ran in about 26ms for me and took 182 reads if you use an index that I create below.  But don’t skip that far ahead yet; we have more to do!

Voila.  Or viola, whichever.

But wait, there’s more!

Where Am I Overlapping The Most?

Naturally, if I know that my max overlap window is 3, I’d be curious about when that happened—maybe I can correlate it with logs and figure out which intern to blame (protip:  have interns in different time zones so you always have a likely culprit).

Getting the results is easy, for some definition of “easy” which doesn’t really fit with the normal definition of “easy.”

First, we will create a temp table called #MaxConcurrentItems.  This will hold all of the work items at which we hit the max concurrent item count.

DROP TABLE IF EXISTS #MaxConcurrentItems;
CREATE TABLE #MaxConcurrentItems
(
	WorkItemID INT PRIMARY KEY CLUSTERED,
	CustomerID INT,
	MaxConcurrentWorkItems INT
);

DECLARE
	@CustomerID INT = 1,
	@MaxConcurrentWorkItems INT = 0;

WITH StartStopPoints AS
(
	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemStart AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID

	UNION ALL

	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemEnd AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
	SELECT
		s.WorkItemID,
		s.CustomerID,
		s.TimeUTC,
		s.IsStartingPoint,
		s.StartOrdinal,
		ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
	FROM StartStopPoints s
),
MaxConcurrency AS
(
	SELECT
		MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
	FROM StartStopOrder s
	WHERE
		s.IsStartingPoint = 1
)
INSERT INTO #MaxConcurrentItems
(
	WorkItemID,
	CustomerID,
	MaxConcurrentWorkItems
)
SELECT
	s.WorkItemID,
	s.CustomerID,
	M.MaxConcurrentWorkItems
FROM StartStopOrder s
	CROSS JOIN MaxConcurrency m
WHERE
	2 * s.StartOrdinal - s.StartOrEndOrdinal = m.MaxConcurrentWorkItems;

So let’s explain this step by step:

  1. StartStopPoints and StartStopOrder are the same as before.
  2. Instead of returning MaxConcurrentWorkItems, we’re dropping that into a CTE called MaxConcurrency.
  3. Once we have the max level of concurrency, we want to go back to StartStopOrder and get all of the cases where the number of concurrent work items is equal to the maximum number of concurrent work items.  We insert these rows into #MaxConcurrentItems.

From there, I populate @MaxConcurrentWorkItems with the max value we retrieve above (so I don’t need to calculate it again) and I want to get the record which pushed us to the max level of concurrency as well as all of the prior records in that grouping.  We already know a technique for turning one row into multiple rows:  a tally table.

I do need to go back and get the start ordinals for each row in #WorkItems for that customer ID; that way, I can join against the tally table and get prior records, defined where MaxRow.StartOrdinal = WorkItem.StartOrdinal + n - 1.  The -1 at the end is because our tally table starts from 1 instead of 0; if yours starts from 0, you can drop the subtraction from the join and change the inequality in the WHERE clause below from <= to <.

Here is the solution in all its glory:

DROP TABLE IF EXISTS #MaxConcurrentItems;
CREATE TABLE #MaxConcurrentItems
(
	WorkItemID INT PRIMARY KEY CLUSTERED,
	CustomerID INT,
	MaxConcurrentWorkItems INT
);

DECLARE
	@CustomerID INT = 1,
	@MaxConcurrentWorkItems INT = 0;

WITH StartStopPoints AS
(
	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemStart AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID

	UNION ALL

	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemEnd AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
	SELECT
		s.WorkItemID,
		s.CustomerID,
		s.TimeUTC,
		s.IsStartingPoint,
		s.StartOrdinal,
		ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
	FROM StartStopPoints s
),
MaxConcurrency AS
(
	SELECT
		MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
	FROM StartStopOrder s
	WHERE
		s.IsStartingPoint = 1
)
INSERT INTO #MaxConcurrentItems
(
	WorkItemID,
	CustomerID,
	MaxConcurrentWorkItems
)
SELECT
	s.WorkItemID,
	s.CustomerID,
	M.MaxConcurrentWorkItems
FROM StartStopOrder s
	CROSS JOIN MaxConcurrency m
WHERE
	2 * s.StartOrdinal - s.StartOrEndOrdinal = m.MaxConcurrentWorkItems;

SELECT
	@MaxConcurrentWorkItems = MAX(mci.MaxConcurrentWorkItems)
FROM #MaxConcurrentItems mci;

WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L2),
WorkItems AS
(
	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemStart,
		wi.WorkItemEnd,
		ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi
	WHERE
		wi.CustomerID = @CustomerID
),
MaxWorkItems AS
(
	SELECT
		wi.WorkItemID,
		wi.CustomerID,
		wi.WorkItemStart,
		wi.WorkItemEnd,
		wi.StartOrdinal,
		ROW_NUMBER() OVER (ORDER BY mci.WorkItemID) AS WorkItemGrouping
	FROM #MaxConcurrentItems mci
		INNER JOIN WorkItems wi
			ON mci.WorkItemID = wi.WorkItemID
)
SELECT
	wi.WorkItemID,
	wi.CustomerID,
	wi.WorkItemStart,
	wi.WorkItemEnd,
	M.WorkItemGrouping
FROM MaxWorkItems M
	CROSS JOIN Nums n
	INNER JOIN WorkItems wi
		ON M.StartOrdinal = (wi.StartOrdinal + n.n - 1)
WHERE
	n.n <= @MaxConcurrentWorkItems
ORDER BY
	wi.WorkItemID ASC;

And here’s the results for customer 1:

Groupings
In-terrrns!!!

For bonus wackiness, the same work item can be in multiple groupings.  That makes sense:  we can see from this mess that 194483 finished at 13:51:22 and 197740 began at 13:52:35, but 194736 and 194513 were both active during that entire stretch, so our pattern of active work items was 1-2-3-2-3.

So we get our results, but at what cost?  Here’s the execution plan for the whole thing:

Groupings Execution Plan

Also blame the interns for a couple of awful queries.

Despite this awful query plan, this ran in about 1.5 seconds, but had a worktable with 129K reads.  I could definitely tune this query by saving my slice of work items with their ordinals into its own temp table and querying that…but my editor says I have to wrap up this section now, so let’s jump straight to the moral of the story.

The moral of the story?  I already told you:  have lots of interns so you always have someone to blame.
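
Before moving on, here is a rough sketch of the tuning idea I mentioned a moment ago.  Treat it as a starting point rather than a tested solution:  materialize the selected customer’s work items along with their start ordinals once, then point the max-concurrency and grouping queries at that temp table instead of recalculating the window functions each time.

DECLARE
	@CustomerID INT = 1;

DROP TABLE IF EXISTS #CustomerWorkItems;

-- Materialize the customer's slice once, with its start ordinal, and reuse it.
SELECT
	wi.WorkItemID,
	wi.CustomerID,
	wi.WorkItemStart,
	wi.WorkItemEnd,
	ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
INTO #CustomerWorkItems
FROM #WorkItems wi
WHERE
	wi.CustomerID = @CustomerID;

CREATE CLUSTERED INDEX [CIX_CustomerWorkItems] ON #CustomerWorkItems (StartOrdinal);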

What Is My Overlap Per Customer?

Now I’d like to see what the overlap is across all of my 150 customers.  Here goes:

WITH StartStopPoints AS
(
	SELECT
		wi.CustomerID,
		wi.WorkItemStart AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (PARTITION BY wi.CustomerID ORDER BY wi.WorkItemStart) AS StartOrdinal
	FROM #WorkItems wi

	UNION ALL

	SELECT
		wi.CustomerID,
		wi.WorkItemEnd AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM #WorkItems wi
),
StartStopOrder AS
(
	SELECT
		s.CustomerID,
		s.TimeUTC,
		s.IsStartingPoint,
		s.StartOrdinal,
		ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
	FROM StartStopPoints s
)
SELECT
	s.CustomerID,
	MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
	s.IsStartingPoint = 1
GROUP BY
	s.CustomerID
ORDER BY
	MaxConcurrentWorkItems DESC,
	CustomerID ASC;

At our worst, we have 5 concurrent work items for customer 139:

Concurrent Items
The neediest customer of all.

This query took several seconds to run, so let’s check out the execution plan:

Group Plan

If thick arrows are your idea of a good time, this is going great.

We have two separate scans of the WorkItems table, and we can see the Segment-Sequence combo which represents a window function show up twice.  The biggest pain point is a major Sort operation which in my case spilled at level 1.

So let’s create an index:

CREATE NONCLUSTERED INDEX [IX_WorkItems] ON #WorkItems
(
	CustomerID,
	WorkItemStart,
	WorkItemEnd
) WITH(DATA_COMPRESSION = PAGE);

As a quick note, there’s nothing here which prevents users from running two files at the same time, so I can’t make this a unique index.

Doing this didn’t change the execution plan much, but did change IO from 53,368 reads to 26,233 reads and reduced overall execution time on this server from 6925ms (17,406ms CPU time) down to 3650ms (8203ms CPU time) so that’s not awful.
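
If you want to capture similar numbers on your own hardware, the simplest approach is to flip on the statistics switches before running the query above; the measurements show up in the Messages tab:

SET STATISTICS IO, TIME ON;

-- Run the per-customer max concurrency query from above here.

SET STATISTICS IO, TIME OFF;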

Wrapping Up

Over the course of this two-post series, we have generated a lot of artificial data in order to find overlapping time ranges. Along the way, we learned many important things, such as having plenty of cannon fodder interns around.

Finding Max Concurrent Operations With T-SQL (Part 1)

Not too long ago, a co-worker had an issue that he asked me about.  The gist of it is, we can have multiple concurrent work items processing for a customer, but we want to limit that number to 2.  The development team wanted to make sure that their code was working as expected, but they weren’t sure how they could test it, as work items don’t all start and stop at the same time.

Build Some Sample Data

In order to show you the solution, I want to build up a reasonable sized sample.  Any solution looks great when reading five records, but let’s kick that up a notch.  Or, more specifically, a million notches:  I’m going to use a CTE tally table and load 5 million rows.

I want some realistic looking data, so I’ve adapted Dallas Snider’s strategy to build a data set which approximates a normal distribution.

Because this is a little complicated, I wanted to take the time and explain the data load process in detail in its own post, and then apply it in the follow-up post.  We’ll start with a relatively small number of records for this demonstration:  50,000.  The reason is that you can generate 50K records almost instantaneously but once you start getting a couple orders of magnitude larger, things slow down some.

Give Me Rows

Getting an arbitrary number of records is easy with a CTE-based tally table.  Here we get 50,000 uniquely numbered rows in a trivial amount of time:

DECLARE
	@NumberOfRecords INT = 50000;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT TOP(@NumberOfRecords)
	n
FROM Nums;

The output of this is 50,000 rows with one column called n.  It’s not the most creative name but does the trick.

Give Me Random Values

The next thing I need is two random numbers for each of the rows.  We’ll get into why I need them in just a moment.  For now, I want to point out that calling RAND() twice isn’t a safe bet:

RAND Duplicates

Calling RAND() for each record in a result set doesn’t quite work the way you’d expect.

Although that’s not workable, you can still generate random values following a uniform distribution between 0 and 1 using the combination RAND(CHECKSUM(NEWID())):

RAND Unique

GUIDs are occasionally useful for something.

Because NEWID() gives us a new value for each row, we have a different seed for each row and a result which looks much better for us.
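
To see the difference for yourself, here is a tiny demonstration against an ad hoc five-row set.  The first column repeats the same value on every row because RAND() without a seed is evaluated once per query, while the seeded version changes from row to row:

SELECT
	v.n,
	RAND() AS SameValueOnEveryRow,
	RAND(CHECKSUM(NEWID())) AS NewValueOnEveryRow
FROM (VALUES(1),(2),(3),(4),(5)) v(n);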

Converting Uniform To Normal

As mentioned above, Dallas Snider has a great technique for generating random values pulled from something which approximates a normal distribution using T-SQL (it is an implementation of the Box-Muller transform).  But that technique uses a loop and generates two values per loop iteration.  In the end, I’m going to want to do this 5 million times, so I’d much rather build a set-based solution.

First, let’s look at Dallas’s solution and see what we need.  I’ve simplified the code a bit and formatted it the way I like:

DECLARE
	@randNum1 FLOAT = RAND(),
	@randNum2 FLOAT = RAND(),
	@mean FLOAT = 75.0,
	@stdDev FLOAT = 5.0, --standard deviation
	@precision INT = 1; --number of places to the right of the decimal point

SELECT
	ROUND((SQRT(-2.0 * LOG(@randNum1)) * COS(2 * PI() * @randNum2)) * @stdDev, @precision) + @mean AS Value1,
	ROUND((SQRT(-2.0 * LOG(@randNum1)) * SIN(2 * PI() * @randNum2)) * @stdDev, @precision) + @mean AS Value2;

We end up with something like the following:

Nearly Normal

Generating values off of a (nearly) normal distribution.

Your numbers will, of course, differ, but you knew that because you get the idea of random numbers.

So how can we adapt this to our code?  Let’s see how.

Getting To The Crux Of Our Problem

Here’s the gist:  we have customers with customer IDs.  These customers have some data processed.  Each time they ask for processing, it takes some number of seconds, but that number can change (maybe the files are different sizes, maybe there’s resource contention, whatever).  There will also be delays between executions for each customer, but here’s the important part:  a customer can have more than one process running at a time.

So how do we model this idea?  With several variables.

DECLARE
	@NumberOfRecords INT = 50000,
	@NumberOfCustomers INT = 150,
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

In addition to our 50K records, we’ve created 150 customers out of thin air.  We have, through studious analysis of our data and totally not just picking a number out of thin air, determined that the mean number of seconds it takes to do this data processing is exactly 90.0, and our standard deviation is 18.8.  This data approximates a normal distribution.

In addition, the mean length between runs is 128.0 seconds (again, people in lab coats with Very Serious Glasses and beakers and test tubes and stuff helped us determine this) and the standard deviation for time between runs is 42.3 seconds, and once more, this approximates a normal distribution.  Finally, we will ask for precision down to 1 spot after the decimal.

Once I’ve defined those variables and collected those values, I can write a query which gives us realistic-looking values:

DECLARE
	@NumberOfRecords INT = 50000,
	@NumberOfCustomers INT = 150,
	@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT TOP (@NumberOfRecords)
	n,
	r.CustomerID,
	s.NumberOfSecondsToRun,
	s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
	CROSS APPLY
	(
		SELECT
			RAND(CHECKSUM(NEWID())) AS rand1,
			RAND(CHECKSUM(NEWID())) AS rand2,
			CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
	) r
	CROSS APPLY
	(
		SELECT
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
			ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
	) s

And here are some sample results:

Numbers Of Seconds

This rabbit hole is certainly starting to get deep.

The complexity jumped up just a little bit, so let’s walk our way through this.  First, I used the CROSS APPLY operator to give me values for rand1 and rand2, as well as a third random value for our Customer ID.  I don’t mind Customer ID following a uniform distribution—that might not be perfectly realistic, but it’s close enough for our work.  Given 50K records and 150 customers, we’d expect about 333 rows per customer.

Next, I chain the first CROSS APPLY with a second because one good function deserves another.  This second function takes Dallas’s calculations and converts them to my own nefarious purposes.  Note that instead of having one mean and one standard deviation like in his example, I have two means and two standard deviations.

At this point, I have a few variables for each customer:  an implicit ordering (based on n), the number of seconds this particular job will run (NumberOfSecondsToRun), and the number of seconds from now before the next job begins (NumberOfSecondsBeforeNextRunBegins).  So now we need to convert this into a stream of time rather than thinking of things as just individual points in time.

Turning Numbers To Time Streams

Let’s pretend that each customer started at time 0.  Again, that’s not totally realistic but for our purposes it’ll work just fine.  Here’s a quick way to generate a time stream for customer 1:

DECLARE
	@NumberOfRecords INT = 50000,
	@NumberOfCustomers INT = 150,
	@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
	SELECT TOP (@NumberOfRecords)
		n,
		r.CustomerID,
		s.NumberOfSecondsToRun,
		s.NumberOfSecondsBeforeNextRunBegins
	FROM Nums
		CROSS APPLY (
			SELECT
				RAND(CHECKSUM(NEWID())) AS rand1,
				RAND(CHECKSUM(NEWID())) AS rand2,
				CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
		) r
		CROSS APPLY
		(
			SELECT
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
		) s
)
SELECT
	v.n,
	v.CustomerID,
	v.NumberOfSecondsToRun,
	v.NumberOfSecondsBeforeNextRunBegins
FROM Vals v
WHERE
	v.CustomerID = 1
ORDER BY
	n ASC

And here are some sample results so we can follow along:

Customer 1 Times

The sordid history of Customer 1.

Let’s start at time t=0.  If you want to imagine t=0 as a date like 2018-12-18 08:00:00, that’s also fine, but I’m going to stick with numbers representing seconds for the moment.

So we’re at t=0.  Then event 20 happens.  Customer 1 wants us to process a data file.  It’s going to end up taking us 118.7 seconds to process that first data file.

In the meantime, at t=49 seconds, event 88 happens and we process a second file.  This one takes 58.4 seconds to run, so it completes at time 107.4.

Then, at t=(49+187.2) or t=236.2, the third file starts.  If it helps, I’ve built a number line image below to visualize this:

Customer 1 Line

Visualizing the first three executions for Customer 1.

The main thing we want to see is if there is overlap, and if so, how many concurrent executions we see.  In this case, we can see the red and blue lines overlap, but no overlap with green.  Therefore, our max number of concurrent executions is 2.  We’d want to look at this over the entire stream of time.

The way we can get this is to add one more level of common table expression and introduce a window function with a cutoff:

DECLARE
	@NumberOfRecords INT = 50000,
	@NumberOfCustomers INT = 150,
	@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
	SELECT TOP (@NumberOfRecords)
		n,
		r.CustomerID,
		s.NumberOfSecondsToRun,
		s.NumberOfSecondsBeforeNextRunBegins
	FROM Nums
		CROSS APPLY (
			SELECT
				RAND(CHECKSUM(NEWID())) AS rand1,
				RAND(CHECKSUM(NEWID())) AS rand2,
				CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
		) r
		CROSS APPLY
		(
			SELECT
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
		) s
)
SELECT
	v.n AS WorkItemID,
	v.CustomerID,
	v.NumberOfSecondsToRun,
	v.NumberOfSecondsBeforeNextRunBegins,
	SUM(v.NumberOfSecondsBeforeNextRunBegins)
		OVER
		(
			PARTITION BY v.CustomerID
			ORDER BY v.n
			ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
		) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
FROM vals v

We’ve now included a SUM of NumberOfSecondsBeforeNextRunBegins over a window running from the beginning of time up to the current row.  That running total tells us when the next work item starts, so subtracting out the current row’s value shifts everything back by one gap and gives us TotalNumberOfSeconds:  the offset at which the current work item begins.  Using the Customer 1 example above, where the first two gaps were 49 and 187.2 seconds, the first three work items begin at offsets 0, 49, and 236.2 seconds.

Here’s a picture, though note that all of the numbers changed because it’s randomly generated data:

Total Number Of Seconds

Customer 1 has always been at war with Eastasia.

Tying It All Together

So now, armed with these details, we can take the total number of seconds and add it to our start date to get when a work item begins.  We can take the work item begin time and add the number of seconds to run, which gives us the work item’s end time.

DROP TABLE IF EXISTS #WorkItems;
CREATE TABLE #WorkItems
(
	WorkItemID INT NOT NULL PRIMARY KEY CLUSTERED,
	CustomerID INT NOT NULL,
	WorkItemStart DATETIME2(0) NOT NULL,
	WorkItemEnd DATETIME2(0) NULL
);

DECLARE
	@NumberOfRecords INT = 50000,
	@NumberOfCustomers INT = 150,
	@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
	@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
	@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
	@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
	@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
	@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
	SELECT TOP (@NumberOfRecords)
		n,
		r.CustomerID,
		s.NumberOfSecondsToRun,
		s.NumberOfSecondsBeforeNextRunBegins
	FROM Nums
		CROSS APPLY (
			SELECT
				RAND(CHECKSUM(NEWID())) AS rand1,
				RAND(CHECKSUM(NEWID())) AS rand2,
				CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
		) r
		CROSS APPLY
		(
			SELECT
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
				ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
		) s
),
records AS
(
	SELECT
		v.n AS WorkItemID,
		v.CustomerID,
		v.NumberOfSecondsToRun,
		v.NumberOfSecondsBeforeNextRunBegins,
		SUM(v.NumberOfSecondsBeforeNextRunBegins) OVER (PARTITION BY v.CustomerID ORDER BY v.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
	FROM vals v
)
INSERT INTO #WorkItems
(
	WorkItemID,
	CustomerID,
	WorkItemStart,
	WorkItemEnd
)
SELECT
	r.WorkItemID,
	r.CustomerID,
	s.WorkItemStart,
	DATEADD(SECOND, r.NumberOfSecondsToRun, s.WorkItemStart) AS WorkItemEnd
FROM records r
	CROSS APPLY
	(
		SELECT
			DATEADD(SECOND, r.TotalNumberOfSeconds, @StartDate) AS WorkItemStart
	) s;

SELECT
	wi.WorkItemID,
	wi.CustomerID,
	wi.WorkItemStart,
	wi.WorkItemEnd
FROM #WorkItems wi
ORDER BY
	wi.CustomerID,
	wi.WorkItemID;

There are two things of note here.  First, I added a temp table called #WorkItems.  Note that DROP TABLE IF EXISTS requires SQL Server 2016 or later, so if you are on an older version, you’ll want to comment out that first line.  Then, I added an insert statement at the bottom.  Here, I start with my records CTE (which we saw above) and use CROSS APPLY to get the work item’s start date.  I did that in a CROSS APPLY operation so that I could make the WorkItemEnd calculation easier to follow, saying that we run a certain number of seconds from each work item’s starting point.

This gives us a data set that looks a bit like the following:

Customer 1 Dates

Trying to schedule a date with customer 1.

In this run of the job, we can see that work item 296 ends at 3:06 PM, but work item 403 started at 3:05 PM, meaning that there is some overlap.  If you briefly glance through these, you’ll see some occasions of overlap but nothing absurd.  However, glancing through isn’t enough.  That’s where Part 2 comes into play.
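
If you want to spot-check overlaps like that one without waiting for tomorrow’s approach, a naive self-join is good enough for a single customer’s slice, though it won’t scale gracefully to the full table.  Each output row pairs a work item with another work item from the same customer whose time range overlaps it:

DECLARE
	@CustomerID INT = 1;

SELECT
	wi1.WorkItemID,
	wi1.WorkItemStart,
	wi1.WorkItemEnd,
	wi2.WorkItemID AS OverlappingWorkItemID,
	wi2.WorkItemStart AS OverlappingWorkItemStart,
	wi2.WorkItemEnd AS OverlappingWorkItemEnd
FROM #WorkItems wi1
	INNER JOIN #WorkItems wi2
		ON wi1.CustomerID = wi2.CustomerID
		AND wi1.WorkItemID < wi2.WorkItemID
		AND wi1.WorkItemStart < wi2.WorkItemEnd
		AND wi1.WorkItemEnd > wi2.WorkItemStart
WHERE
	wi1.CustomerID = @CustomerID
ORDER BY
	wi1.WorkItemID;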

Next Time On Matlock

Up to this point, we’ve spent all of our time building up a realistic-enough data set.  In tomorrow’s post, we will take this structure, expand it to 5 million rows, and look at a way of finding the max level of concurrency by customer.  Stay tuned!

PASS Summit 2018 Evaluation Ratings & Comments

Following up on Brent Ozar’s post on the topic, I figured I’d post my own ratings (mostly because they’re not awful!).  This was my first PASS Summit at which I was a speaker, so I don’t have a good comp for scores except what other speakers publish.  I had the privilege of giving three presentations at PASS Summit this year, and I’m grateful for everyone who decided to sit in on these rather than some other talk.  All numeric responses are on a 1-5 scale with 5 being the best.

Applying Forensic Accounting Techniques Using SQL and R

This was a talk that I’ve given a few times and even have an extra-long director’s cut version available.  I had 71 attendees and 14 responses.

Eval Question (Avg Rating):

  • Rate the value of the session content:  4.21
  • How useful and relevant is the session content to your job/career:  3.43
  • How well did the session’s track, audience, title, abstract, and level align with what was presented:  4.21
  • Rate the speaker’s knowledge of the subject matter:  4.86
  • Rate the overall presentation and delivery of the session content:  4.29
  • Rate the balance of educational content versus that of sales, marketing, and promotional subject matter:  4.64

The gist of the talk is, here are techniques that forensic accountants use to find fraud; you can use them to learn more about your data.  I fell flat in making that connection, as the low “useful” score shows.  That’s particularly bad because I think this is probably the most “immediately useful” talk that I did.

Event Logistics Comments

  • Room was very cold
  • very cold rooms, very aggressive air-con
  • The stage was squeaky and made banging noises when the speaker was trying to present. Not their fault! The stage just didn’t seem very stable. Also the room had a really unpleasant smell.
  • Everything was great!

The squeak was something I noticed before the talk.  I thought about staying in place to avoid the squeak, but this is a talk where I want to gesticulate to emphasize points—like moving from one side of the stage to the other to represent steps in a process.  My hope was that the squeak wouldn’t be too noticeable but the microphone may have picked it up.

Speaker Comments

  • I was just curious about the topic, but the speaker inspired me with many smaller, but very feasible tips and tricks of how to look at a data! Thank You!
  • The Jupyter notebooks were awesome. I felt the speaker really knew their stuff.  But the downsides were that the analysis methods discussed weren’t really shown to us, or were so far out of context I didn’t quite see how to use them or how they related to the demos. Multiple data sets were used and maybe just focusing all the methods on one of them may have worked better? I just felt overall it was a really interesting topic with a lot of work done but it just didn’t come together for me. Sorry.
  • I like understanding how fraud got uncovered by looking at data. Thanks
  • Very interesting session. Speak made the subject very interesting. I’ve picked up a few ideas I can use in my job.
  • Some examples of discovery of fraud would have been more effective.
  • One of my favorite sessions at PASS. Thank you for making the jupyter notebook available for download.
  • Great speaker/content, would attend again.

The second comment is exactly the kind of comment I want.  My ego loves the rest of the comments, but #2 makes me want to tear this talk apart and rebuild it better.  The biggest problem that I have with the talk is that my case study involved actual fraud, but none of the data sets I have really show fraud.  I’m thinking of rebuilding this talk using just one data set where I seed in fraudulent activities and expose it with various techniques.  Ideally, I’d get a copy of the case study’s data, but I never found it anywhere.  Maybe I could do a FOIA request or figure out some local government contact.

Getting Started with Apache Spark

My second session was Friday morning.  I had 100 somewhat-awake attendees but only 5 responses, so take the ratings with a grain of salt.

Eval Question (Avg Rating):

  • Rate the value of the session content:  4.80
  • How useful and relevant is the session content to your job/career:  4.60
  • How well did the session’s track, audience, title, abstract, and level align with what was presented:  4.80
  • Rate the speaker’s knowledge of the subject matter:  5.00
  • Rate the overall presentation and delivery of the session content:  4.80
  • Rate the balance of educational content versus that of sales, marketing, and promotional subject matter:  5.00

This is a talk that I created specifically for PASS Summit.  I’m happy that it turned out well, considering that there was a good chance of complete demo failure:  my Portable Hadoop Cluster was finicky that morning and wanted to connect to the Internet to grab updates before it would let me run anything.  Then I had to restart the Apache Zeppelin service mid-talk to run any notebooks, but once that restarted successfully, the PHC ran like a champ.

Event Logistics Comments

  • Good

Speaker Comments

  • Session was 100, but I would say 200-300
  • Great presentation!

Getting rating levels right is always tricky.  In this case, I chose 100 rather than 200 because I spent the first 30+ minutes going through the history of Hadoop & Spark and a fair amount of the remaining time looking at Spark SQL.  But I did have a stretch where I got into RDD functions, and most T-SQL developers will be unfamiliar with map, reduce, aggregate, and other functions.  So that’s a fair point—calling it a 200 level talk doesn’t bother me.

Cleaning is Half the Battle:  Launching a Data Science Project

This was my last PASS Summit talk, which I presented in the last session slot on Friday.  I had 31 attendees and 7 responses.

Eval Question (Avg Rating):

  • Rate the value of the session content:  4.71
  • How useful and relevant is the session content to your job/career:  4.71
  • How well did the session’s track, audience, title, abstract, and level align with what was presented:  4.86
  • Rate the speaker’s knowledge of the subject matter:  4.86
  • Rate the overall presentation and delivery of the session content:  4.57
  • Rate the balance of educational content versus that of sales, marketing, and promotional subject matter:  4.71

Again, small sample size bias applies.

Event Logistics Comments

  • Good
  • Great

Speaker Comments

  • You have a lot content, leading to a rushed talk. Also, your jokes have potential, if u slowed down and sold them better
  • Good presentation. I expected more on getting the project off the ground, but enjoyed the info.
  • Funny and informative–I truly enjoyed your presentation!
  • Great
  • Kevin knows how to present, especially for getting stuck w/ the last session of the last day. He brought a lot of energy to the room. Content was on key too, he helped me understand more about handling data that we don’t want to model.

The slowing down comment is on point.  This is a 90-minute talk by its nature.  I did drop some content (like skipping slides on data cleansing and analysis and just showing the demos) so that I could spend a little more time on the neural network portion of the show, but I had to push to keep on time and technically went over by a minute or two.  I was okay with the overage because it was the final session, so I wasn’t going to block anybody.

Synthesis and Commentary

The ratings numbers are something to take with several grains of salt:  26 ratings over 3 sessions isn’t nearly a large enough sample to know for sure how these turned out.  But here are my thoughts from the speaker’s podium.

  • I speak fast.  I know it and embrace it—I know of the (generally good) advice that you want to go so slow that it feels painful, but I’ll never be that person.  In the Ben Stein — Billy Mays continuum, I’d rather err on the Oxyclean side.
  • I need to cut down on material.  In the first and last talks, they could both be better with less.  The problem with cutting material in the data science process talk is that I’d like to cover the whole process with a realistic-enough example and that takes time.  So this is something I’ll have to work on.
  • I might need to think of a different title for my data science process talk.  I explicitly call out that it’s about launching a data science project, but as I was sitting in a different session, I overheard a couple of people mention the talk and one person said something along the lines of not being interested because he’s already seen data cleansing talks.  The title is a bit jokey and has a punchline in the middle of the session, so I like it, but maybe something as simple as swapping the order of the segments to “Launching a Data Science Project:  Cleaning is Half the Battle” would be enough.
  • Using a timer was a really good idea.  I normally don’t use timers and instead go by feel at SQL Saturdays and user group meetings, and that leads to me sometimes running short on time.  I tend to practice talks with a timer to get an idea of how long they should last, but rarely re-time myself later, so they tend to shift in length as I do them.  Having the timer right in front of me helped keep me on track.
  • For the Spark talk, I think when I create my normal slide deck, I’m going to include the RDD (or “Spark 1.0”) examples as inline code segments and walk through them more carefully.  For an example of what I mean, I have a section in my Classification with Naive Bayes talk where I walk through classification of text based on word usage.  Normally, I’d make mention of the topic and go to a notebook where I walk through the code.  But that might have been a little jarring for people brand new to Spark.
  • I tend to have a paucity of images in talks, making up for it by drawing a lot on the screen.  I personally like the effect because action and animation keep people interested and it’s a lot easier for me to do that by drawing than by creating the animations myself…  It does come with the downside of making the slides a bit more turgid and making it harder for people to review the slides later as they lose some of that useful information.  As I’ve moved presentations to GitPitch I’ve focused on adding interesting but not too obtrusive backgrounds in the hopes that this helps.  Still, some of the stuff that I regularly draw should probably show up as images.

So it’s not perfect, but I didn’t have people hounding me with pitchforks and torches after any of the sessions.  I have some specific areas of focus and intend to take a closer look at most of my talks to improve them.

What Comes After Go-Live?

This is part eight of a series on launching a data science project.

At this point in the data science process, we’ve launched a product into production.  Now it’s time to kick back and hibernate for two months, right?  Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done.  First of all, it’s important to keep checking the efficacy of your models.  Shift happens, where a model might have been good at one point in time but becomes progressively worse as circumstances change.  Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously.  You might also find out that your training and testing data was not truly indicative of real-world data; in particular, the real world tends to be a lot messier than what you trained against.

The best way to guard against this kind of silent model shift is to take new production data and retrain the model.  This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.
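
As a concrete (and admittedly hypothetical) sketch of what that tracking might look like in SQL Server, you could log each prediction, backfill the actual outcome once you know it, and watch the error trend over time:

CREATE TABLE dbo.ModelPredictions
(
	PredictionID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
	ModelVersion NVARCHAR(50) NOT NULL,
	PredictionDate DATETIME2(0) NOT NULL,
	PredictedValue DECIMAL(18,2) NOT NULL,
	ActualValue DECIMAL(18,2) NULL  -- filled in once the real outcome is known
);

-- Mean absolute error by month, which makes drift easy to spot.
SELECT
	mp.ModelVersion,
	DATEFROMPARTS(YEAR(mp.PredictionDate), MONTH(mp.PredictionDate), 1) AS PredictionMonth,
	AVG(ABS(mp.PredictedValue - mp.ActualValue)) AS MeanAbsoluteError
FROM dbo.ModelPredictions mp
WHERE
	mp.ActualValue IS NOT NULL
GROUP BY
	mp.ModelVersion,
	DATEFROMPARTS(YEAR(mp.PredictionDate), MONTH(mp.PredictionDate), 1)
ORDER BY
	mp.ModelVersion,
	PredictionMonth;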

Depending upon your choice of algorithm, you might be able to update the existing model with this new information in real time.  Models like neural networks and online passive-aggressive algorithms allow for continuous training, and when you’ve created a process which automatically feeds learned data back into your continuously-training model, you now have true machine learning. Other algorithms, however, require you to retrain from scratch.  That’s not a show-stopper by any means, particularly if your underlying trends are fairly stable.

Regardless of model selection, efficacy, and whether you get to call what you’ve done machine learning, you will want to confer with your stakeholders and ensure that your model actually fits their needs; as I mentioned before, you can have the world’s best regression, but if the people with the sacks of cash want a recommendation engine, you’re not getting the goods.  But that doesn’t mean you should try to solve all the problems all at once; instead, you want to start with a Minimum Viable Product (MVP) and gauge interest.  You’ve developed a model which solves the single most pressing need, and from there, you can make incremental improvements.  This could include relaxing some of the assumptions you made during initial model development, making more accurate predictions, improving the speed of your service, adding new functionality, or even using this as an intermediate engine to derive some other result.

Using our data platform survey results, assuming the key business personnel were fine with the core idea, some of the specific things we could do to improve our product would be:

  • Make the model more accurate.  Our MAE was about $19-20K, and reducing that error makes our model more useful for others.  One way to do this would be to survey more people.  What we have is a nice starting point, but there are too many gaps to go much deeper than a national level.
  • Introduce intra-regional cost of living.  We all know that $100K in Manhattan, NY and $100K in Manhattan, KS are quite different.  We would want to take into account cost of living, assuming we have enough data points to do this.
  • Use this as part of a product helping employers find the market rate for a new data professional, where we’d ask questions about the job location, relative skill levels, etc. and gin up a reasonable market estimate.

There are plenty of other things we could do over time to add value to our model, but I think that’s a nice stopping point.

What’s Old Is New Again

Once we get to this phase, the iterative nature of this process becomes clear.

The Team Data Science Project Lifecycle (Source)

On the micro level, we bounce around within and between steps in the process.  On the macro level, we iterate through this process over and over again as we develop and refine our models.  There’s a definite end game (mostly when the sacks of cash empty), but how long that takes and how many times you cycle through the process will depend upon how accurate and how useful your models are.

In wrapping up this series, if you want to learn more, check out my Links and Further Information on the topic.

Deploying A Model: The Microservice Approach

This is part seven of a series on launching a data science project.

Up to this point, we’ve worked out a model which answers important business questions.  Now our job is to get that model someplace where people can make good use of it.  That’s what today’s post is all about:  deploying a functional model.

Back in the day (by which I mean, say, a decade ago), one team would build a solution using an analytics language like R, SAS, Matlab, or whatever, but you’d almost never take that solution directly to production.  These were analytical Domain-Specific Languages with a set of assumptions that could work well for a single practitioner but wouldn’t scale to a broad solution.  For example, R had historically made use of a single CPU core and was full of memory leaks.  Those didn’t bother analysts too much because desktops tended to be single-core and you could always reboot the machine or restart R.  But that doesn’t work so well for a server—you need something more robust.

So instead of using the analytics DSL directly in production, you’d use it indirectly.  You’d use R (or SAS or whatever) to figure out the right algorithm and determine weights and construction and toss those values over the wall to an implementation team, which would rewrite your model in some other language like C.  The implementation team didn’t need to understand all of the intricacies of the problem, but did need to have enough practical statistics knowledge to understand what the researchers meant and translate their code to fast, efficient C (or C++ or Java or whatever).  In this post, we’ll look at a few changes that have led to a shift in deployment strategy, and then cover what this shift means for practitioners.

Production-Quality Languages

The first shift is the improvement in languages.  There are good libraries for Java, C#, and other “production” languages, so that’s a positive.  But that’s not one of the two positives I want to focus on today.  The first positive is the general improvement in analytical DSLs like R.  We’ve gone from R being not so great when running a business to being production-quality (although not without its foibles) over the past several years.  Revolution Analytics (now owned by Microsoft) played a nice-sized role in that, focusing on building a stable, production-ready environment with multi-core support.  The same goes for RStudio, another organization which has focused on making R more useful in the enterprise.

The other big positive is the introduction of Python as a key language for data science.  With libraries like NumPy, scikit-learn, and Pandas, you can build quality models.  And with Cython, a data scientist can compile those models down to C to make them much faster.  I think the general acceptance of Python in this space has helped spur on developers around other languages (whether open-source like R or closed-source commercial languages like SAS) to get better.

The Era Of The Microservice

The other big shift is a shift away from single, large services which try to solve all of the problems.  Instead, we’ve entered the era of the microservice:  a small service dedicated to providing a single answer to a single problem.  A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than trying to solve the integration problem.  Although you can definitely configure other forms of interoperation, most microservices typically are exposed via web calls and that’s the scenario I’ll discuss today.  The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.

One concern here is that you don’t want to waste your analysts’ time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployR come into play.  These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns.  Let me give two examples of patterns that my team has used to solve specific problems.

Machine Learning Services

The first pattern involves using SQL Server Machine Learning Services as the core engine.  We built a C# Web API which calls ML Services, passing in details on what we want to do (e.g., generate predictions for a specific set of inputs given an already-existing model).  A SQL Server stored procedure accepts the inputs and calls ML Services, which farms out the request to a service which understands how to execute R code.  The service returns results, which we interpret as a SQL Server result set, and we can pass that result set back up to C#, creating a return object for our users.

In this case, SQL Server is doing a lot of the heavy lifting, and that works well for a team with significant SQL Server experience.  This also works well if the input data lives on the same SQL Server instance, reducing data transit time.
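
To make that pattern a bit more concrete, the stored procedure side might look something like the sketch below.  The table, model, and column names are hypothetical; the real pieces are sp_execute_external_script and the InputDataSet/OutputDataSet conventions that ML Services uses:

CREATE OR ALTER PROCEDURE dbo.GetPredictions
	@InputQuery NVARCHAR(MAX)
AS
BEGIN
	-- Pull a previously-trained, serialized model from a (hypothetical) models table.
	DECLARE @Model VARBINARY(MAX) =
	(
		SELECT m.ModelObject
		FROM dbo.Models m
		WHERE m.ModelName = N'MyModel'
	);

	-- ML Services feeds the input query's results to R as InputDataSet and
	-- returns whatever ends up in OutputDataSet as a result set.
	EXEC sp_execute_external_script
		@language = N'R',
		@script = N'
model <- unserialize(model_raw);
OutputDataSet <- data.frame(Prediction = predict(model, InputDataSet));
',
		@input_data_1 = @InputQuery,
		@params = N'@model_raw VARBINARY(MAX)',
		@model_raw = @Model;
END;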

APIs Everywhere

The second pattern that I’ll cover is a bit more complex.  We start once again with a C# Web API service.  On the opposite end, we’re using Keras in Python to make predictions against trained neural network models.  To link the two together, we have a couple more layers:  first, a Flask API (and Gunicorn as the production implementation).  Then, we stand nginx in front of it to handle load balancing.  The C# API makes requests to nginx, which feeds the request to Gunicorn, which runs the Keras code, returning results back up the chain.

So why have the C# service if we’ve already got nginx running?  That way I can cache prediction results (under the assumption that those results aren’t likely to change much given the same inputs) and integrate easily with the C#-heavy codebase in our environment.

Notebooks

If you don’t need to run something as part of an automated system, another deployment option is to use notebooks like Jupyter, Zeppelin, or knitr.  These notebooks tend to work with a variety of languages and offer you the ability to integrate formatted text (often through Markdown), code, and images in the same document.  This makes them great for pedagogical purposes and for reviewing your work six months later, when you’ve forgotten all about it.

Using a Jupyter notebook to review Benford’s Law.

Interactive Visualization Products

Another good way of getting your data into users’ hands is Shiny, an R package which lets you use JavaScript libraries like D3 to visualize your data.  Again, this is not the type of technology you’d use to integrate with other services, but if you have information that you want to share directly with end users, it’s a great choice.

Conclusion

Over the course of this post, I’ve looked at a few different ways of getting model results and data into the hands of end users, whether via other services (like using the microservice deployment model) or directly (using notebooks or interactive applications).  For most scenarios, I think that we’re beyond the days of needing to have an implementation team rewrite models for production, and whether you’re using R or Python, there are good direct-to-production routes available.

The Basics Of Data Modeling

This is part five of a series on launching a data science project.

At this point, we have done some analysis and cleanup on a data set.  It might not be perfect, but it’s time for us to move on to the next step in the data science process:  modeling.

Modeling has five major steps, and we’ll look at each step in turn.  Remember that, like the rest of the process, I may talk about “steps” but these are iterative and you’ll bounce back and forth between them.

Feature Engineering

Feature engineering involves creating relevant features from raw data.  A few examples of feature engineering include the following (with a short code sketch after the list):

  • Creating indicator flags, such as IsMinimumAge: Age >= 21, or IsManager: NumberOfEmployeesManaged > 0.  These are designed to help you slice observations and simplify model logic, particularly if you’re building something like a decision tree.
  • Calculations, such as ClickThroughRate = Clicks / Impressions.  Note that this definition doesn’t imply multicollinearity, though, as ClickThroughRate isn’t linearly related to either Clicks or Impressions.
  • Geocoding latitude and longitude from a street address.
  • Aggregating data.  That could be aggregation by day, by week, by hour, by 36-hour period, whatever.
  • Text processing:  turning words into arbitrary numbers for numeric analysis.  Common techniques for this include TF-IDF and word2vec.
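To make the first couple of bullets concrete, here’s a minimal dplyr sketch against a made-up data frame.  The column names here are hypothetical and exist only for illustration; they aren’t part of the salary survey.

library(tidyverse)

# Hypothetical data, just to illustrate indicator flags and calculated features.
employees <- tibble(
  Age = c(19, 25, 34, 47),
  NumberOfEmployeesManaged = c(0, 0, 3, 12),
  Clicks = c(10, 25, 8, 40),
  Impressions = c(1000, 900, 1200, 800)
)

employees <- employees %>%
  mutate(
    IsMinimumAge = Age >= 21,                 # indicator flag
    IsManager = NumberOfEmployeesManaged > 0, # indicator flag
    ClickThroughRate = Clicks / Impressions   # calculated feature
  )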

Feature Selection

Once we’ve engineered interesting features, we want to use feature selection to winnow down the available set, removing redundant, unnecessary, or highly correlated features.  There are a few reasons that we want to perform feature selection:

  1. If one explanatory variable can predict another, we have multicollinearity, which can make it harder to give credit to the appropriate variable.
  2. Feature selection makes it easier for a human to understand the model by removing irrelevant or redundant features.
  3. We can perform more efficient training with fewer variables.
  4. We reduce the risk of an irrelevant or redundant feature causing spurious correlation.

For my favorite example of spurious correlation:

The only question here is, which causes which?
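If you want to catch the multicollinearity problem from the first item programmatically, one option is caret’s findCorrelation function.  This is just a sketch, and numeric_features is a stand-in name for a data frame containing only your numeric features.

library(caret)

# Flag numeric features which correlate strongly with one another so we can
# consider dropping them before training.
correlation_matrix <- cor(numeric_features)
highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.9)
if (length(highly_correlated) > 0) {
  numeric_features <- numeric_features[, -highly_correlated]
}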

Model Training

Now that we have some data and a clue of what we’re going to feed into an algorithm, it’s time to step up our training regimen.  First up, we’re going to take some percentage of our total data and designate it for training and validation, leaving the remainder for evaluation (aka, test).  There are no hard rules on percentages, but a typical reserve rate is about 70-80% for training/validation and 20-30% for test.  We ideally want to select the data randomly but also include the relevant spreads and distributions of observations by pertinent variables in our training set; fortunately, there are tools available which can help us do just this, and we’ll look at them in a bit.
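As a preview of those tools, here is a sketch of a stratified 70/30 split using caret’s createDataPartition, assuming survey_2018 is the cleaned-up salary data frame we build in the data processing post.

library(caret)

# createDataPartition samples within groups of the outcome variable, so the
# training set keeps roughly the same salary distribution as the full data set.
set.seed(1984)
train_index <- createDataPartition(survey_2018$SalaryUSD, p = 0.7, list = FALSE)
training_data <- survey_2018[train_index, ]
test_data <- survey_2018[-train_index, ]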

First up, though, I want to cover the four major branches of algorithms.

Supervised Learning

The vast majority of problems are supervised learning problems.  The idea behind a supervised learning problem is that we have some set of known answers (labels).  We then train a model to map input data to those labels in order to have the model predict the correct answer for unlabeled records.

Going back to the first post in this series, I pointed out that you have to listen to the questions people ask.  Here’s where that pays off:  the type of algorithm we want to choose depends in part on the nature of those questions.  Major supervised learning classes and their pertinent driving questions include:

  • Regression — How many / how much?
  • Classification — Which?
  • Recommendation — What next?

For example, in our salary survey, we have about 3000 labeled records:  3000(ish) cases where we know the salary in USD based on what people have reported.  My goal is to train a model which can then take some new person’s inputs and spit out a reasonable salary prediction.  Because my question is “How much money should we expect a data professional will make?” we will solve this using regression techniques.

Unsupervised Learning

With unsupervised learning, we do not know the answers beforehand, so we’re trying to derive answers within the data.  Typically, we’ll use unsupervised learning to gain more insight about the data set, which can hopefully give us some labels we can use to convert this into a relevant supervised learning problem.  The top forms of unsupervised learning include:

  • Clustering — How can we segment?
  • Dimensionality reduction — What of this data is useful?

Typically your business users won’t know or care about dimensionality reduction (that is, techniques like Principal Component Analysis), but we as analysts can use it to narrow down to a useful set of features.
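As a toy illustration of both questions, here is a sketch against the built-in iris measurements, with no labels fed to either algorithm.

# Clustering:  how can we segment?
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(clusters$cluster)

# Dimensionality reduction:  what of this data is useful?
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component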

Self-Supervised Learning

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.
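Here is a toy sketch of that idea with a first-order Markov chain:  the “label” for each word is simply the word which follows it, so unlabeled text supervises itself.  This is nowhere near Shakespeare-worthy, but it shows the mechanism.

text <- "to be or not to be that is the question"
words <- strsplit(text, " ")[[1]]

# Pair each word with the word which follows it; that next word is our label.
transitions <- data.frame(
  current = head(words, -1),
  next_word = tail(words, -1),
  stringsAsFactors = FALSE
)

predict_next <- function(word) {
  candidates <- transitions$next_word[transitions$current == word]
  sample(candidates, 1)
}

predict_next("to")  # returns "be", the only word which follows "to" in this text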

Reinforcement Learning

Reinforcement learning is where we train an agent to observe its environment and use those environmental clues to make a decision.  For example, there’s a really cool video from SethBling about MariFlow:

The idea, if you don’t want to watch the video, is that he trained a recurrent neural network based on hours and hours of his Mario Kart gameplay.  The neural network has no clue what a Mario Kart is, but the screen elements below show how it represents the field of play and state of the game, and uses those inputs to determine which action to take next.

“No, mom, I’m playing this game strictly for research purposes!”

Choose An Algorithm

Once you understand the nature of the problem, you can choose the form of your destructor (err, algorithm).  There are often several potential algorithms which can solve your problem, so you will want to try different algorithms and compare.  There are a few major trade-offs between algorithms, so each one will have some better or worse combination of the following features:

  • Accuracy and susceptibility to overfitting
  • Training time
  • Ability for a human to understand the result
  • Number of hyperparameters
  • Number of features allowed.  For example, a model like ARIMA doesn’t give you many features—it’s just the label behavior over time.

Microsoft has a nice algorithm cheat sheet that I recommend checking out:

It is, of course, not comprehensive, but it does set you in the right direction.  For example, we already know that we want to predict values, and so we’re going into the Regression box in the bottom-left.  From there, we can see some of the trade-offs between different algorithms.  If we use linear regression, we get fast training, but the downside is that if our dependent variable is not a linear function of the independent variables, then we won’t end up with a good result.

By contrast, a neural network regression tends to be fairly accurate, but can take a long time to finish or require expensive hardware to finish in any reasonable time.

Once you have an algorithm, features, and labels (if this is a supervised learning problem), you can train the model.  Training a model involves solving a system of equations to minimize a loss function.  For example, here is a plot with a linear regression thrown in:

This plot might look familiar if you’ve read my ggplot2 series.

In this chart, I have a straight line which represents the best fitting line for our data points, where best fit is defined as the line which minimizes the sum of the squares of errors (i.e., the sum of the square of the distance between the dot and our line).  Computers are great at this kind of math, so as long as we set up the problem the right way and tell the computer what we want it to do, it can give us back an answer.
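In R, lm does that squared-error minimization for us, and geom_smooth can overlay the same best-fit line on a scatter plot.  Here is a quick sketch using the built-in mtcars data set rather than the survey.

library(ggplot2)

# Fit a least-squares line:  weight of the car explaining miles per gallon.
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Plot the points along with the best-fit line that lm would give us.
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()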

But we’ve got to make sure it’s a good answer.  That’s where the next section helps.

Validate The Model

Instead of using up all of our data for training, we typically want to perform some level of validation within our training data set to ensure that we are on the right track and are not overfitting our model.  Overfitting happens when a model latches onto the particulars of a data set, leaving it at risk of not being able to generalize to new data.  The easiest way to tell if you are overfitting is to test your model against unseen data.  If there is a big dropoff in model accuracy between the training and testing phases, you are likely overfitting.

Here’s one technique for validation:  let’s say that we reserved 70% of our data for training.  Of the 70%, we might want to slice off 10% for validation, leaving 60% for actual training.  We feed the 60% of the data to our algorithm, generating a model.  Then we predict the outcomes for our validation data set and see how close we were to reality, and how far off the accuracy rates are for our validation set versus our training set.

Another technique is called cross-validation.  Cross-validation is a technique where we slice and dice the training data, training our model with different subsets of the total data.  The purpose here is to find a model which is fairly robust to the particulars of a subset of training data, thereby reducing the risk of overfitting.  Let’s say that we cross-validate with 4 slices.  In the first step, we train with the first 3/4 of the data, and then validate with the final 1/4.  In the second step, we train with slices 1, 2, and 4 and validate against slice 3.  In the third step, we train with 1, 3, and 4 and validate against slice 2.  Finally, we train with 2, 3, and 4 and validate against slice 1.  We’re looking to build up a model which is good at dealing with each of these scenarios, not just a model which is great at one of the four but terrible at the other three.
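In caret, cross-validation is a matter of setting up a trainControl.  Here is a sketch of 4-fold cross-validation on a linear model, assuming the training_data split from earlier.

library(caret)

# Train a linear model, validating against each of the four folds in turn.
fit_control <- trainControl(method = "cv", number = 4)
cv_model <- train(
  SalaryUSD ~ .,
  data = training_data,
  method = "lm",
  trControl = fit_control
)
cv_model  # shows RMSE and R-squared averaged across the folds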

Oftentimes, we won’t get everything perfect on the first try.  That’s when we move on to the next step.

Tune The Model

Most models have hyperparameters.  For example, a neural network has a few hyperparameters, including the number of training epochs, the number of layers, the density of each layer, and dropout rates.  For another example, random forests have hyperparameters like the maximum size of each decision tree and the total number of decision trees in the forest.

We tune our model’s hyperparameters using the validation data set.  With cross-validation, we’re hoping that our tuning will not accidentally lead us down the road to spurious correlation, but we have something a bit better than hope:  we have secret data.
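Sticking with caret, hyperparameter tuning can be as simple as a grid search.  Here is a sketch for a random forest (which requires the randomForest package), where mtry, the number of features considered at each split, is the hyperparameter caret tunes.

library(caret)

# Try three values of mtry, scoring each candidate with 4-fold cross-validation.
tune_grid <- expand.grid(mtry = c(2, 4, 8))
tuned_model <- train(
  SalaryUSD ~ .,
  data = training_data,
  method = "rf",
  trControl = trainControl(method = "cv", number = 4),
  tuneGrid = tune_grid
)
tuned_model$bestTune  # the winning hyperparameter combination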

Evaluate The Model

Model evaluation happens when we send new, never before seen data to the model.  Remember that 20-30% that we reserved early on?  This is where we use it.

Now, we want to be careful and make sure not to let any information leak into the training data.  That means that we want to split this data out before normalizing or aggregating the training data set, and then apply those same rules to the test data set.  Otherwise, if we normalize the full data set and then split into training and test, a smart model can surreptitiously learn things about the test data set’s distribution and train toward it, overfitting to the test data and leaving the model less suited for the real world.
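Evaluation itself is just prediction against the held-out data plus an error metric.  Here is a sketch using root mean squared error, assuming the cv_model and test_data objects from the earlier sketches.

# Predict salaries for records the model has never seen and measure how far off we are.
predictions <- predict(cv_model, newdata = test_data)
rmse <- sqrt(mean((test_data$SalaryUSD - predictions)^2))
rmse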

Another option, particularly useful for unlabeled or self-learning examples, is to build a fitness function to evaluate the model.  Genetic algorithms (for a refresher, check out my series) are a common tool for this.  For example, MarI/O uses a genetic algorithm to train a neural network how to play Super Mario World.

He’s no Evander Holyfield, but Mario’s worth a genetic algorithm too.

Conclusion

Just like with data processing, I’m going to split this into two parts.  Today, we’ve looked at some of the theory behind modeling.  Next time around, we’re going to implement a regression model to try to predict salaries.

Data Processing: An Example

This is part four of a series on launching a data science project.

An Example Of Data Processing

Last time around, I spent a lot of time talking about data acquisition, data cleansing, and basic data analysis.  Today, we’re going to walk through a little bit of it with the data professional salary survey.

First, let’s install some packages:

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

if(!require(XLConnect)) {
  install.packages("XLConnect", repos = "http://cran.us.r-project.org")
  library(XLConnect)
}

if(!require(caret)) {
  install.packages("caret", repos = "http://cran.us.r-project.org")
  library(caret)
}

if(!require(recipes)) {
  install.packages("recipes", repos = "http://cran.us.r-project.org")
  library(recipes)
}

if(!require(data.table)) {
  install.packages("data.table", repos = "http://cran.us.r-project.org")
  library(data.table)
}

if(!require(devtools)) {
  install.packages("devtools", repos = "http://cran.us.r-project.org")
  library(devtools)
}

if(!require(keras)) {
  devtools::install_github("rstudio/keras")
  library(keras)
  install_keras(method = "auto", conda = "auto", tensorflow = "default", extra_packages = NULL)
}

The tidyverse package is a series of incredibly useful libraries in R, and I can’t imagine doing a data science project in R without it. The XLConnect package lets me read an Excel workbook easily and grab the salary data without much hassle. The caret library provides some helpful tooling for working with data, including splitting out test versus training data, like we’ll do below. The recipes package will be useful for normalizing data later, and we will use data.table to get a glimpse at some of our uneven data. We need the devtools package to install keras from GitHub. Keras is a deep learning library which provides a common interface to several neural network back ends, including TensorFlow, which we will use later in this series. We need to install TensorFlow on our machine as well. Because this is a small data set, and because I want this to run on machines without powerful GPUs, I am using the CPU-based version of TensorFlow. Performance should still be adequate for our purposes.

Once we have the required packages loaded, we will then load the Excel workbook. I have verified the Excel worksheet and data region are correct, so we can grab the survey from the current directory and load it into salary_data.

wb <- XLConnect::loadWorkbook("2018_Data_Professional_Salary_Survey_Responses.xlsx")
salary_data <- XLConnect::readWorksheet(wb, sheet = "Salary Survey", region = "A4:Z6015")

We can use the glimpse function inside the tidyverse to get a quick idea of what our salary_data dataframe looks like. In total, we have 6011 observations of 26 variables, but this covers two survey years: 2017 and 2018. Looking at the variable names, we can see that there are some which don’t matter very much (like Timestamp, which is when the user filled out the form, and Counter, which is just a 1 for each record).

glimpse(salary_data)
Observations: 6,011
Variables: 26
$ Survey.Year                <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ Timestamp                  <dttm> 2017-01-05 05:10:20, 2017-01-05 05:26:2...
$ SalaryUSD                  <chr> "200000", "61515", "95000", "56000", "35...
$ Country                    <chr> "United States", "United Kingdom", "Germ...
$ PostalCode                 <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ PrimaryDatabase            <chr> "Microsoft SQL Server", "Microsoft SQL S...
$ YearsWithThisDatabase      <dbl> 10, 15, 5, 6, 10, 15, 16, 4, 3, 8, 4, 22...
$ OtherDatabases             <chr> "MySQL/MariaDB", "Oracle, PostgreSQL", "...
$ EmploymentStatus           <chr> "Full time employee", "Full time employe...
$ JobTitle                   <chr> "DBA", "DBA", "Other", "DBA", "DBA", "DB...
$ ManageStaff                <chr> "No", "No", "Yes", "No", "No", "No", "No...
$ YearsWithThisTypeOfJob     <dbl> 5, 3, 25, 2, 10, 15, 11, 1, 2, 10, 4, 8,...
$ OtherPeopleOnYourTeam      <chr> "2", "1", "2", "None", "None", "None", "...
$ DatabaseServers            <dbl> 350, 40, 100, 500, 30, 101, 20, 25, 3, 5...
$ Education                  <chr> "Masters", "None (no degree completed)",...
$ EducationIsComputerRelated <chr> "No", "N/A", "Yes", "No", "Yes", "No", "...
$ Certifications             <chr> "Yes, and they're currently valid", "No,...
$ HoursWorkedPerWeek         <dbl> 45, 35, 45, 40, 40, 35, 40, 36, 40, 45, ...
$ TelecommuteDaysPerWeek     <chr> "1", "2", "None, or less than 1 day per ...
$ EmploymentSector           <chr> "Private business", "Private business", ...
$ LookingForAnotherJob       <chr> "Yes, but only passively (just curious)"...
$ CareerPlansThisYear        <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ Gender                     <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ OtherJobDuties             <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ KindsOfTasksPerformed      <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ Counter                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Our first data cleansing activity will be to filter our data to include just 2018 results, which gives us a sample size of 3,113 participants. There are also results for 2017, but they asked a different set of questions and we don’t want to complicate the analysis or strip out the new 2018 questions.

survey_2018 <- filter(salary_data, Survey.Year == 2018)
nrow(survey_2018) # << 3113 records returned

Looking at the survey, there are some interesting data points that we want:

  • SalaryUSD (our label, that is, what we are going to try to predict)
  • Country
  • YearsWithThisDatabase
  • EmploymentStatus
  • JobTitle
  • ManageStaff
  • YearsWithThisTypeOfJob
  • OtherPeopleOnYourTeam
  • DatabaseServers
  • Education
  • EducationIsComputerRelated
  • Certifications
  • HoursWorkedPerWeek
  • TelecommuteDaysPerWeek
  • EmploymentSector
  • LookingForAnotherJob
  • CareerPlansThisYear
  • Gender

For each of these variables, we want to see the range of options and perform any necessary cleanup. The first thing I’d look at is the cardinality of each variable, followed by a detailed analysis of the smaller ones.

PrimaryDatabase is another variable which looks interesting, but it skews so heavily toward SQL Server that there’s more noise than signal to it. Because there are so many platforms with 10 or fewer entries and about 92% of entrants selected SQL Server, we’ll throw it out.

rapply(survey_2018, function(x) { length(unique(x)) })

Survey.Year – 1
Timestamp – 3112
SalaryUSD – 865
Country – 73
PostalCode – 1947
[… continue for a while]

unique(survey_2018$Country)
  1. ‘United States’
  2. ‘Australia’
  3. ‘Spain’
  4. ‘United Kingdom’
    [… continue for a while]


unique(survey_2018$EmploymentStatus)
  1. ‘Full time employee’
  2. ‘Full time employee of a consulting/contracting company’
  3. ‘Independent consultant, contractor, freelancer, or company owner’
  4. ‘Part time’

We can use the setDT function on data.table to see just how many records we have for each level of a particular factor. For example, we can see the different entries for PrimaryDatabase and EmploymentSector below. Both of these are troublesome for our modeling because they both have a number of levels with 1-2 entries. This makes it likely that we will fail to collect a relevant record in our training data set, and that will mess up our model later. To rectify this, I am going to remove PrimaryDatabase as a feature and remove the two students from our sample.

data.table::setDT(survey_2018)[, .N, keyby=PrimaryDatabase]

To the three MongoDB users: you have my sympathy.

data.table::setDT(survey_2018)[, .N, keyby=EmploymentSector]

In a way, aren’t we all students? No. Only two of us are.

Most of these columns came from dropdown lists, so they’re already fairly clean. But there are some exceptions to the rule. They are:

  • SalaryUSD
  • YearsWithThisDatabase
  • YearsWithThisTypeOfJob
  • DatabaseServers
  • HoursWorkedPerWeek
  • Gender

All of these were text fields, and whenever a user gets to enter text, you can assume that something will go wrong. For example:

survey_2018 %>%
  distinct(YearsWithThisDatabase) %>%
  arrange(desc(YearsWithThisDatabase)) %>%
  slice(1:10)

Some are older than they seem.

Someone with 53,716 years working with their primary database of choice? That’s commitment! You can also see a couple of people who clearly put in the year they started rather than the number of years working with it, and someone who maybe meant 10 years? But who knows, people type in weird stuff.

Anyhow, let’s see how much that person with at least 10 thousand years of experience makes:

survey_2018 %>%
  filter(YearsWithThisDatabase > 10000)

Experience doesn’t pay after the first century or two.

That’s pretty sad, considering their millennia of work experience. $95-98K isn’t even that great a number.

Looking at years of experience with their current job roles, people tend to be more reasonable:

survey_2018 %>%
  distinct(YearsWithThisTypeOfJob) %>%
  arrange(desc(YearsWithThisTypeOfJob)) %>%
  slice(1:10)

Next up, we want to look at the number of database servers owned. 500,000+ database servers is a bit excessive. Frankly, I’m suspicious about any numbers greater than 5000, but because I can’t prove it otherwise, I’ll leave them be.

survey_2018 %>%
  distinct(DatabaseServers) %>%
  arrange(desc(DatabaseServers)) %>%
  slice(1:5)

survey_2018 %>%
  filter(DatabaseServers >= 5000) %>%
  arrange(desc(DatabaseServers))

500K servers is a lot of servers.

The first entry looks like bogus data: a $650K salary, a matching postal code, and 500K database servers, primarily in RDS? Nope, I don’t buy it.

The rest don’t really look out of place, except that I think they put in the number of databases and not servers. For these entrants, I’ll change the number of servers to the median to avoid distorting things.

Now let’s look at hours per week:

survey_2018 %>%
  distinct(HoursWorkedPerWeek) %>%
  arrange(desc(HoursWorkedPerWeek)) %>%
  slice(1:10)

One of these numbers is not like the others.  The rest of them are just bad.

To the person who works 200 hours per week: find a new job. Your ability to pack more than 7*24 hours of work into 7 days is too good to waste on a job making just $120K per year.

survey_2018 %>%
  filter(HoursWorkedPerWeek >= 168) %>%
  arrange(desc(HoursWorkedPerWeek))

What would I do with an extra day and a half per week? Sleep approximately an extra day and a half per week.

As far as Gender goes, there are only three with enough records to be significant: Male, Female, and Prefer not to say. We’ll take Male and Female and bundle the rest under “Other” to get a small but not entirely insignificant set there.

survey_2018 %>%
  group_by(Gender) %>%
  summarize(n = n())

To the one Reptilian in the survey, I see you and I will join forces with Rowdy Roddy Piper to prevent you from taking over our government.

survey_2018 %>%
  group_by(Country) %>%
  summarize(n = n()) %>%
  filter(n >= 20)

Probably the most surprising country on this list is The Netherlands.  India is a close second, but for the opposite reason.

There are only fifteen countries with at least 20 data points and just eight with at least 30. This means that we won’t get a great amount of information from cross-country comparisons outside of the sample. Frankly, I might want to limit this to just the US, UK, Canada, and Australia, as the rest are marginal, but for this survey analysis, I’ll keep the other eleven.

Building Our Cleaned-Up Data Set

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in delimiters like commas or periods (a period serves as the thousands separator in countries like Germany, the way a comma does in the United States).  I want to strip them out, particularly because the decimal point might interfere with my analysis, turning 100.000 into $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

valid_countries <- survey_2018 %>%
                    group_by(Country) %>%
                    summarize(n = n()) %>%
                    filter(n >= 20)

# Data cleanup
survey_2018 <- salary_data %>%
  filter(Survey.Year == 2018) %>%
  filter(HoursWorkedPerWeek < 200) %>%
  # There were only two students in the survey, so we will exclude them here.
  filter(EmploymentSector != "Student") %>%
  inner_join(valid_countries, by="Country") %>%
  mutate(
    SalaryUSD = stringr::str_replace_all(SalaryUSD, "\\$", "") %>%
      stringr::str_replace_all(., ",", "") %>%
      stringr::str_replace_all(., " ", "") %>%
      # Some people put in pennies.  Let's remove anything with a decimal point and then two numbers.
      stringr::str_replace_all(., stringr::regex("\\.[0-9]{2}$"), "") %>%
      # Now any decimal points remaining are formatting characters.
      stringr::str_replace_all(., "\\.", "") %>%
      as.numeric(.),
    # Some people have entered bad values here, so set them to the median.
    YearsWithThisDatabase = case_when(
      (YearsWithThisDatabase > 32) ~ median(YearsWithThisDatabase),
      TRUE ~ YearsWithThisDatabase
    ),
    # Some people apparently entered number of databases rather than number of servers.
    DatabaseServers = case_when(
      (DatabaseServers >= 5000) ~ median(DatabaseServers),
      TRUE ~ DatabaseServers
    ),
    EmploymentStatus = as.factor(EmploymentStatus),
    JobTitle = as.factor(JobTitle),
    ManageStaff = as.factor(ManageStaff),
    OtherPeopleOnYourTeam = as.factor(OtherPeopleOnYourTeam),
    Education = as.factor(Education),
    EducationIsComputerRelated = as.factor(EducationIsComputerRelated),
    Certifications = as.factor(Certifications),
    TelecommuteDaysPerWeek = as.factor(TelecommuteDaysPerWeek),
    EmploymentSector = as.factor(EmploymentSector),
    LookingForAnotherJob = as.factor(LookingForAnotherJob),
    CareerPlansThisYear = as.factor(CareerPlansThisYear),
    Gender = as.factor(case_when(
      (Gender == "Male") ~ "Male",
      (Gender == "Female") ~ "Female",
      TRUE ~ "Other"
    ))
  ) 

Now we can pare out variables we don’t need. Some of these, like postal code, are interesting but we just don’t have enough data for it to make sense. Others, like Kinds of Tasks Performed or Other Job Duties, have too many varieties for us to make much sense with a first pass. They might be interesting in a subsequent analysis, though.

survey_2018 <- survey_2018 %>%
  # One person had a salary of zero.  That's just not right.
  filter(SalaryUSD > 0) %>%
  select(-Counter, -KindsOfTasksPerformed, -OtherJobDuties, -OtherDatabases, -Timestamp, -Survey.Year, 
         -PostalCode, -n, -PrimaryDatabase)

Now that we have our salary data fixed, we can finally look at outliers. I’d consider a salary of $500K a year to be a bit weird for this field. It’s not impossible, but I am a little suspicious. I am very suspicious of the part-timer making $1.375 million, the federal employee making $1 million, or the New Zealander making $630K at a non-profit.

I’m kind of taking a risk by removing these, but they’re big enough outliers that they can have a real impact on our analysis if they’re bad data.

survey_2018 %>%
  filter(SalaryUSD > 500000) %>%
  arrange(desc(SalaryUSD))

I think I’d be willing to accept $1.4 million a year to be a manager of none.

On the other side, there are 12 people who say they earned less than $5K a year. Those also seem wrong. Some of them look like dollars per hour, and maybe some are monthly salary. I’m going to strip those out.

survey_2018 %>%
  filter(SalaryUSD < 5000) %>%
  arrange(desc(SalaryUSD))

For just over a dollar a week, you can hire a data architect.

survey_2018 <- filter(survey_2018, SalaryUSD >= 5000 & SalaryUSD <= 500000)

Data Analysis

We did some of the data analysis up above. We can do additional visualization and correlation studies. For example, let’s look at a quick distribution of salaries after our cleanup work:

summary(survey_2018$SalaryUSD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5000   70000   92000   95186  115000  486000

We can also build a histogram pretty easily using the ggplot2 library. This shows the big clump of database professionals earning between $70K and $115K per year. This salary distribution does skew right a bit, as you can see.

ggplot(data = survey_2018, mapping = aes(x = SalaryUSD)) +
  geom_histogram() +
  theme_minimal() +
  scale_x_log10(label = scales::dollar)

Not including that guy making $58 a year.

We can also break this down to look by primary job title, though I’ll limit to a couple of summaries instead of showing a full picture.

survey_2018 %>% filter(JobTitle == "Data Scientist") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 45000  
 1st Qu.: 76250  
 Median :111000  
 Mean   :102000  
 3rd Qu.:122000  
 Max.   :160000
survey_2018 %>% filter(JobTitle == "Developer: App code (C#, JS, etc)") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 22000  
 1st Qu.: 60000  
 Median : 84000  
 Mean   : 84341  
 3rd Qu.:105000  
 Max.   :194000
survey_2018 %>% filter(JobTitle == "Developer: T-SQL") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 12000  
 1st Qu.: 66000  
 Median : 87000  
 Mean   : 88026  
 3rd Qu.:110000  
 Max.   :300000

This fits pretty well with my biases, although the max Data Scientist salary seems rather low.

Conclusions

This is only a tiny sample of what I’d want to do with a real data set, but it gives you an idea of the kinds of things we look at and the kinds of things we need to fix before a data set becomes useful.

In the next post, we will get started with the wide world of modeling.