Solving Naive Bayes By Hand

This is part two in a series on classification with Naive Bayes.

Learning Naive Bayes Through Repeated Interceptions

On the whole, the Naive Bayes class of algorithms tends to be pretty easy to understand, which is a part of that class’s popularity. So let’s see how easy it really is by solving a problem with Naive Bayes.

The 2018 NFL season is officially over (playoffs? What are those?) so let’s take a look at the team most glad that the Cleveland Browns exist so that they don’t get all of the smack talk: the Buffalo Bills.

This year, the Bills went 6-10. I’m going to use Naive Bayes techniques to build a predictor for whether the Bills will lose yet again. This will be useful if ever I spend an eternity having to re-watch the 2018 NFL season; I have already resolved to clean up my life to avoid this fate and I haven’t even built the predictor yet…

To make this easy, I’m going to pick four variables:

• Who was the starting quarterback? My set is { Josh Allen, Somebody Else } as that gives a fairly nice split of 11 games versus 5 games.
• Was this a home game or an away game? My set is { Home, Away } and naturally breaks down 8-8.
• Did the team score at least 14 points? My set is { Yes, No } and sadly, 13-14 points was the median points scored.
• Who was the top Bills receiver in terms of yardage? My potential set is { Zay Jones, Chris Ivory, LeSean McCoy, Charles Clay, Kelvin Benjamin, Robert Foster }. Yep, on three separate occasions, running backs led the team in receiving yardage, and we aren’t talking about elite receiving backs like Alvin Kamara. But I’m not going with that full set because 3 of those 6 top receivers were 100% winners or (generally) 100% losers—I’ll explain why this matters later. So let’s bundle Ivory + McCoy and Clay + Benjamin, leaving Jones and Foster alone.

So with these four variables in mind, we’re going to try to predict whether the Bills would be favored to win or lose a game. Here’s how things stack up:

 Game  QB     H/A  14 Points?  Top Receiver     W/L
    1  Other   A       N       Zay Jones         L
    2  Allen   H       Y       Zay Jones         L
    3  Allen   A       Y       Chris Ivory       W
    4  Allen   A       N       Charles Clay      L
    5  Allen   H       N       LeSean McCoy      W
    6  Allen   A       N       Kelvin Benjamin   L
    7  Other   A       N       Kelvin Benjamin   L
    8  Other   H       N       LeSean McCoy      L
    9  Other   H       N       Kelvin Benjamin   L
   10  Other   A       Y       Robert Foster     W
   11  Allen   H       Y       Robert Foster     W
   12  Allen   A       Y       Zay Jones         L
   13  Allen   H       Y       Robert Foster     L
   14  Allen   H       Y       Robert Foster     W
   15  Allen   A       N       Zay Jones         L
   16  Allen   H       Y       Zay Jones         W

Now that we have our data, it’s time to solve the problem. At least the problem of predicting victory, not the problem of scoring two touchdowns per game.

Trust the Process

There are three steps to the process of solving the simplest of Naive Bayes algorithms. They are:

1. Find the probability of winning a game (that is, our prior probability).
2. Find, for each input variable, the probability of seeing each of its values given a win (or a loss): whether Josh Allen started the game, whether the team was home or away, whether the team scored at least 14 points, and who the top receiver was.
3. Plug values from our new data into the formula to obtain the posterior probability.

So let’s get to it!

Prior Probability

Our prior probability is the likelihood of a win or a loss independent of any other conditions. The Bills went 6-10, so their probability of a win was 6/16 or 0.375.

Per-Variable Probabilities

Now things get a little busier, but we’ll look at it step by step. We want to get, for each value of each variable, the probability of seeing that value in a win and in a loss, independent of all other variables. It sounds complicated but it really isn’t. Let’s see how easy it is with an example.

By Quarterback

 QB      W    L   P(x|W)        P(x|L)
 Allen   5    6   5/6 = .833    6/10 = .600
 Other   1    4   1/6 = .167    4/10 = .400
 Total   6   10   100%          100%

Home or Away?

 Location   W    L   P(x|W)        P(x|L)
 Home       4    4   4/6 = .667    4/10 = .400
 Away       2    6   2/6 = .333    6/10 = .600
 Total      6   10   100%          100%

Scored 14+ Points?

 “Big” Offense   W    L   P(x|W)        P(x|L)
 14+ Points      5    3   5/6 = .833    3/10 = .300
 < 14 Points     1    7   1/6 = .167    7/10 = .700
 Total           6   10   100%          100%

Top Receiver

 Top Receiver      W    L   P(x|W)        P(x|L)
 Clay + Benjamin   0    4   0/6 = .000    4/10 = .400
 Ivory + McCoy     2    1   2/6 = .333    1/10 = .100
 Robert Foster     3    1   3/6 = .500    1/10 = .100
 Zay Jones         1    4   1/6 = .167    4/10 = .400
 Total             6   10   100%          100%

As I mentioned above, I combined some receivers together so that I didn’t end up with 100% probabilities. Well, Clay + Benjamin work well together as “guys the team gave up on and/or vice versa” and I think it’s fitting they belong together. Meanwhile, Ivory & McCoy were the running backs, so there’s something that feels right about combining them. Foster and Jones had both wins and losses so I could leave them be.
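If you’d rather let a computer do the counting, here’s a quick sketch of my own (not part of the original process) that rebuilds the prior and the conditional tables from the 16 games above. I encode the bundled receivers (Ivory + McCoy, Clay + Benjamin) directly:

```python
# Rebuild the Naive Bayes lookup tables from the 2018 Bills data above.
from collections import Counter

# (QB, Home/Away, 14+ points?, top receiver, Win/Loss) for each game
games = [
    ("Other", "A", "N", "Zay Jones", "L"),
    ("Allen", "H", "Y", "Zay Jones", "L"),
    ("Allen", "A", "Y", "Ivory+McCoy", "W"),
    ("Allen", "A", "N", "Clay+Benjamin", "L"),
    ("Allen", "H", "N", "Ivory+McCoy", "W"),
    ("Allen", "A", "N", "Clay+Benjamin", "L"),
    ("Other", "A", "N", "Clay+Benjamin", "L"),
    ("Other", "H", "N", "Ivory+McCoy", "L"),
    ("Other", "H", "N", "Clay+Benjamin", "L"),
    ("Other", "A", "Y", "Robert Foster", "W"),
    ("Allen", "H", "Y", "Robert Foster", "W"),
    ("Allen", "A", "Y", "Zay Jones", "L"),
    ("Allen", "H", "Y", "Robert Foster", "L"),
    ("Allen", "H", "Y", "Robert Foster", "W"),
    ("Allen", "A", "N", "Zay Jones", "L"),
    ("Allen", "H", "Y", "Zay Jones", "W"),
]

def conditional_probs(feature_index):
    """P(value | W) and P(value | L) for one variable, as a dict."""
    wins = [g for g in games if g[-1] == "W"]
    losses = [g for g in games if g[-1] == "L"]
    w_counts = Counter(g[feature_index] for g in wins)
    l_counts = Counter(g[feature_index] for g in losses)
    values = set(w_counts) | set(l_counts)
    return {v: (w_counts[v] / len(wins), l_counts[v] / len(losses))
            for v in values}

prior_win = sum(1 for g in games if g[-1] == "W") / len(games)
print(prior_win)                       # 0.375
print(conditional_probs(0)["Allen"])   # (0.8333..., 0.6), matching the QB table
```

Running `conditional_probs` for each of the four indexes reproduces the four tables above, which is a nice check that the hand counting was right.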

Plug In Some Values

We now have our independent probability tables, so we can estimate whether the team will win or lose given these relevant inputs. Our formula for victory is:

$P(W|x) = \dfrac{P(QB_x|W) \cdot P(LOC_x|W) \cdot P(OFF_x|W) \cdot P(TR_x|W) \cdot P(W)}{P(x)}$

We have a formula for victory, but we also need a formula for a loss. Technically a team can tie, but the Bills didn’t have any ties this year, so I’m treating it as a two-class problem to make things easier to follow.

$P(L|x) = \dfrac{P(QB_x|L) \cdot P(LOC_x|L) \cdot P(OFF_x|L) \cdot P(TR_x|L) \cdot P(L)}{P(x)}$

There’s one more thing I need to bring up here: We don’t really know the true probability of our particular set of variables coming true, the P(x) in our examples. But when we’re classifying, we technically don’t need to know P(x) because that part of the formula cancels out when doing an inequality comparison between P(W|x) and P(L|x). This is great for us because it makes our problem tractable, but it comes at the cost that our “probabilities” that we output aren’t truly probabilities, so there’s another step if we really want to get those values.

With those formulas in mind, let’s test some situations.

Scenario 1: The Big Win

I’m going to throw out my ringer lineup here.

• Josh Allen is the starter.
• The team is playing at home.
• The team scores at least 14 points.
• Robert Foster is the leading receiver.

So we plug in the values of our two formulas based on the tables above. Let’s start with winning. With appropriate subscripts we have:

$P(W|x) = \dfrac{P(QB_a|W) \cdot P(LOC_h|W) \cdot P(OFF_y|W) \cdot P(TR_r|W) \cdot P(W)}{P(x)}$

Plugging in values from the table we have our partial probability for victory:

$P(W|x_1) = \dfrac{5}{6} \cdot \dfrac{4}{6} \cdot \dfrac{5}{6} \cdot \dfrac{3}{6} \cdot \dfrac{6}{16} = 0.0868$

And for a loss:

$P(L|x_1) = \dfrac{4}{10} \cdot \dfrac{4}{10} \cdot \dfrac{3}{10} \cdot \dfrac{1}{10} \cdot \dfrac{10}{16} = 0.003$

As you can see, a win is much more likely than a loss in this scenario. As I mentioned above, the two outcomes are not really probabilities (even though I still call it “P”), but we can calculate that the probability of a win is approximately 97% by taking the partial probability of victory (0.0868) and dividing by the total pool of partial probabilities (0.0868 + 0.003). Most of the time, though, we don’t need to know the percentages—we just need to know that the Bills are likely to win this game, and it’s not close.
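As a sanity check, here is the Scenario 1 arithmetic as a few lines of Python (my own sketch, just replaying the fractions from the tables):

```python
# Scenario 1: Allen starts, home game, 14+ points, Foster leads receiving.
p_w = (5/6) * (4/6) * (5/6) * (3/6) * (6/16)        # partial probability of a win
p_l = (4/10) * (4/10) * (3/10) * (1/10) * (10/16)   # partial probability of a loss

# Normalizing the two partials recovers an actual probability of victory.
p_win_normalized = p_w / (p_w + p_l)
print(round(p_w, 4), round(p_l, 4), round(p_win_normalized, 2))  # 0.0868 0.003 0.97
```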

Scenario 2: The Big Push

In this scenario, we’ll change the inputs a little bit:

• Nathan Barkerson is the quarterback.
• The team is still playing at home.
• The team does not score 14 points.
• A running back is the leading receiver.

I won’t do the LaTeX formulas for each step in the process, just the probabilities. Hopefully you get it at this point. Here’s the partial probability of victory:

$P(W|x_2) = \dfrac{1}{6} \cdot \dfrac{4}{6} \cdot \dfrac{1}{6} \cdot \dfrac{2}{6} \cdot \dfrac{6}{16} = 0.0023$

And again, for a loss:

$P(L|x_2) = \dfrac{4}{10} \cdot \dfrac{4}{10} \cdot \dfrac{7}{10} \cdot \dfrac{1}{10} \cdot \dfrac{10}{16} = 0.007$

This is a Buffalo Push: a 75% chance of losing. Speaking of losing…

Scenario 3: The Big Loser

This final scenario will hit on an issue with Naive Bayes that we’ll solve in a future post.

• Josh Allen is the quarterback.
• The team is playing at home.
• The team scores 14 or more points.
• Charles Clay and Kelvin Benjamin fight over the ball like two junkyard dogs, combining for an awe-inspiring 35 yards of receiving between the two of them, with Benjamin’s 18 yards of receiving good enough for first place.

Let’s plug the values into our formula once more, starting with a victory:

$P(W|x_3) = \dfrac{5}{6} \cdot \dfrac{4}{6} \cdot \dfrac{5}{6} \cdot \dfrac{0}{6} \cdot \dfrac{6}{16} = 0.000$

And for a loss:

$P(L|x_3) = \dfrac{4}{10} \cdot \dfrac{4}{10} \cdot \dfrac{3}{10} \cdot \dfrac{4}{10} \cdot \dfrac{10}{16} = 0.012$

So this is pretty interesting. The likelihood of victory was looking great, but Benjamin and Clay never led the team in receiving for a victory, so our expected probability of success is 0. Because Naive Bayes has us take the product of these independent probabilities, if one of the component probabilities is 0, the whole thing is 0.
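To see the trap in miniature, here is a small sketch (the helper function is mine, not part of the post’s process) replaying Scenario 3:

```python
# One zero-valued conditional probability wipes out the entire product.
def partial_probability(conditionals, prior):
    """Multiply the per-variable conditional probabilities by the prior."""
    result = prior
    for p in conditionals:
        result *= p
    return result

# Scenario 3: Allen, home, 14+ points, Clay/Benjamin leading receiver.
p_w = partial_probability([5/6, 4/6, 5/6, 0/6], 6/16)
p_l = partial_probability([4/10, 4/10, 3/10, 4/10], 10/16)
print(p_w, round(p_l, 3))  # 0.0 0.012 -- the win side is gone entirely
```

No matter how favorable the other three factors are, that single 0/6 guarantees a predicted loss.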

That’s an interesting problem, and one we’ll look at in our next post as I move from predicting Bills victories to classifying words as belonging to particular categories.

Conclusion

In today’s post, we created a Naive Bayes model by hand and populated it with several scenarios. We also discovered that when a component has an outcome probability of 0—particularly common in sparse data sets like the one we have—we can end up with an unexpected result.

In next week’s installment, we will take this algorithm one step further and classify text and figure out this zero-probability outcome problem in the process.

Upcoming Events: SQL Saturday Nashville

I’ve decided to do some quick blogging about upcoming events a few days out. I’ll keep doing this until I forget or the townspeople with pitchforks and torches storm my manor.

Key Details

What: SQL Saturday Nashville
Where: Middle Tennessee State University, Murfreesboro, Tennessee
When: Saturday, January 12th, all day
Admission is free. Sign up at the SQL Saturday website.

What I’m Presenting

9:45 AM — 10:45 AM — Launching a Data Science Project: Cleaning is Half the Battle

Classification With Naive Bayes: An Introduction

This is part one in a series on classification with Naive Bayes.

What Is Naive Bayes?

Let me fix that right away: Naive Bayes isn’t a thing; it’s a class of things.

You do realize that collective nouns are typically referred to in the singular, right?

You probably shouldn’t do editorializing in the headings, me. Nonetheless, let’s talk about the Naive Bayes class of algorithms and whistle past the pluralization of collective nouns. If you want, every time I write “Naive Bayes,” assume I wrote “the class of algorithms which we have defined as Naive Bayes.”

So what is Naive Bayes? [Isn’t that what I asked? – ed]

Naive Bayes is a class of algorithms designed to solve classification problems. Naive Bayes algorithms share three primary characteristics:

1. They are probabilistic. This means that we calculate the probability of each output category, and to determine the best, we choose the one with the highest likelihood.
2. Probabilities are derived from Bayes’ Theorem. That’s the “Bayes” in Naive Bayes, and we’ll get into it in a bit.
3. Features are independent. That’s the “naive” part of “Naive Bayes.”

The plan in this post is to cover the basics of this class of algorithms, and in follow-up posts, we will look at implementations by hand and in code. But first, the \$64,000 question:

Why Should We Use Naive Bayes? Is It the Best Classifier Out There?

Probably not, no. In fact, it’s typically a mediocre classifier—it’s the one you strive to beat with your fancy algorithm. So why even care about this one?

Because it’s fast, easy to understand, and it works reasonably well. In other words, this is the classifier you start with to figure out if it’s worth investing your time on a problem. If you need to hit 90% category accuracy and Naive Bayes is giving you 70%, you’re probably in good shape; if it’s giving you 20% accuracy, you might need to take another look at whether you have a viable solution given your data.

The Foundation of Naive Bayes: Bayes’ Theorem

Bayes’ theorem is pretty simple:

$P(B|A) = \dfrac{P(A|B) \cdot P(B)}{P(A)}$

In case you’re not familiar with the notation, here’s a quick breakdown:

• P(B|A) is called the posterior probability. It represents the probability of hypothesis B being correct given input data A. Let’s cover the other elements and loop back to this at the end.
• P(A|B) is the probability of us seeing input data A if we know that hypothesis B is true. In other words, if Bob always wears red hats on Thursdays, P(Bob is wearing a red hat | Today is Thursday) is 1. If Bob wears red hats on alternating Thursdays, P(Bob is wearing a red hat | Today is Thursday) is 0.5. And so on.
• P(B) is called the prior probability. It represents the likelihood that hypothesis B is true regardless of any specific input data. Continuing with the example above, it is the probability that today is Thursday, which is 1/7 unless you are re-living Groundhog Day.
• P(A) is the probability of us seeing data A in our sample. In our simple example, it is the likelihood that Bob will wear a red hat.

Now that I’ve gone into the elements, let me wrap back around to the term posterior probability. Continuing with the example, our goal is to figure out what day it is based solely on Bob’s headgear. In other words, finding P(Today is Thursday | Bob is wearing a red hat), read as “the probability that today is Thursday given that Bob is wearing a red hat.”

We started out with a prior: the information we expect (or know) beforehand. Our prior is that, knowing nothing else, it is equally likely to be any day of the week, so P(Today is Thursday) is 1/7.

From there, we update our prior by adding new information into the mix. We know that the probability of Bob wearing a red hat is 50% when it is Thursday—for lore-building, let’s say that Bob is a high school football nut and always wears his kids’ high school cap for some Thursday night action. Oh, and that there are 26 games a year because shaddup. So P(Bob is wearing a red hat | Today is Thursday) is 0.5, or 50%.

But knowing that Bob wears red hats on half of all Thursdays isn’t enough for us—we need a bit more. We need to know how often Bob wears his red hat in general. So let’s say that we have 52*7=364 data points and Bob wore his red hat 36 times. That means P(Bob is wearing a red hat) is 0.099 in our data set.

Armed with this information, we can calculate the posterior probability:

$\dfrac{(0.5)(0.143)}{(0.099)} = .722$

In other words, given what we know, there is a 72% chance that, if we see Bob wearing a red hat, it is Thursday.

If you decided to ignore all of the rest of the details and just focus on the fact that Bob wore his hat on 26 Thursdays and 36 times in total, you could also take 26/36 and get to .722 as well. So this seems like a lot of calculation for nothing, at least until we introduce the real kicker.
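If you want to check both routes with code, here is a short sketch using the numbers from the example (all of them assumed from the story above, not real data):

```python
# Bayes' theorem vs. the direct count, using Bob's red-hat data.
days = 52 * 7                 # 364 observations
thursdays = 52
red_hat_days = 36             # Bob wore the red hat 36 times overall
red_hat_thursdays = 26        # ...and 26 of those times were Thursdays

p_red_given_thu = red_hat_thursdays / thursdays   # 0.5
p_thu = thursdays / days                          # 1/7
p_red = red_hat_days / days                       # ~0.099

posterior = (p_red_given_thu * p_thu) / p_red     # Bayes' theorem
shortcut = red_hat_thursdays / red_hat_days       # 26/36, the direct count

print(round(posterior, 3), round(shortcut, 3))    # 0.722 0.722
```

Both routes land on the same .722, exactly as the text says.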

One Naive Assumption

What makes Naive Bayes so useful is something which seems absurd at first blush: we assume that all inputs are independent. This sounds like a bad idea: there are complex interactions between different inputs, so simply ignoring them seems like we’re throwing away valuable information. But it turns out that this simplification mechanism still retains most of our valuable information while making it easier for us to calculate.

Here’s the new formula:

$P(B|A) = \dfrac{P(x_1|B) \cdot P(x_2|B) \cdot \ldots \cdot P(x_n|B) \cdot P(B)}{P(A)}$

Now A is a specific collection of {x1, x2, …, xn} inputs. Let’s say this could be things like Bob’s head covering, how surly your co-workers are today (on a scale from 1 to 5, so now it’s totally scientific), and the amount of junk mail you received the day before. Some of these things may have some overlap, but for the purposes of Naive Bayes, we assume all of that away and calculate these probabilities (red hat, blue hat, no hat for Bob; 1-5 on the Surl Scale; pounds of junk mail) in our data set separately.

We’ve also changed the nature of B a little bit too. Now B is a collection of {B1, B2, … BN} different choices. So now, instead of asking if today is Thursday, we try to figure out which day of the week it is based on the specific values of our inputs. We might end up with a 45% chance of it being Thursday, 29% chance of Tuesday, 17% chance of Wednesday, 6% chance of it being Friday, and 3% chance of it being Monday. In that case, our model predicts that a day with those particular inputs is Thursday. That’s what we mean by the model being probabilistic: even though we end up with a result like “The model predicts that the day is Thursday given those inputs,” we really end up with a series of likelihoods and it’s up to us to figure out any cutoffs between “This day is Thursday” versus “This day could be Tuesday or Thursday” to “I have no idea what day it is and basing that off of some random garbage you pulled together is absurd.” Naturally, these algorithms tend to be a bit too polite to give you the third answer directly, but you can infer it if you look hard enough.

Conclusion

In today’s post, we looked at the basics of a Naive Bayes model. We walked through Bayes’ Theorem and solved a simple probability question. Finally, we introduced the concept of independence. In the next post (which actually will drop on a Thursday), we will solve a simple but realistic two-class classification problem by hand with Naive Bayes.

R Training In Cleveland CANCELLED

UPDATE

Unfortunately, the R training has been cancelled due to venue limitations. I’ll still be at SQL Saturday Cleveland but will not be able to give my full-day training. To the tens of thousands of people (that estimate might be slightly high) who signed up, SQL Saturday Cleveland crew will be in touch. I’ve taken the link down from this page (and deleted the rest of the text just because) to make sure that nobody accidentally signs up.

A Sneak Preview

One of my presentation goals for 2019 is to get into video recording. I did a couple of recordings in 2017 but wasn’t that happy with them. Since then, I’ve upped the amount of equipment (mostly lights—you can never have enough lights, apparently) and am getting prepared. Here’s a sneak preview:

I still have a long way to go with this, but soon enough I will have this down. Then, my ultimate goal will come true: become a television meteorologist push free content to a couple dozen people.

Finding Max Concurrent Operations With T-SQL (Part 2)

Last Time On 36 Chambers

To give you a quick reminder of the problem we’re trying to solve, we can have multiple concurrent work items processing for a customer, but we want to limit that number to 2.  The development team put in a check to throttle customers and want us to write a query to ensure that they wrote their check correctly.

*Narrator’s voice*:  They didn’t.

Meanwhile, Back At The Wrench

Because this is a difficult problem, I immediately looked for an Itzik Ben-Gan solution and found one.  Here’s my take on his solution, but definitely read his article.

Based on yesterday’s work, we can generate an arbitrary number of values which approximate a normal distribution, so we’re going to build 5 million rows worth of start and stop times for customers.  Here’s the code we will use:

DROP TABLE IF EXISTS #WorkItems;
CREATE TABLE #WorkItems
(
WorkItemID INT NOT NULL PRIMARY KEY CLUSTERED,
CustomerID INT NOT NULL,
WorkItemStart DATETIME2(0) NOT NULL,
WorkItemEnd DATETIME2(0) NULL
);

DECLARE
@NumberOfRecords INT = 5000000,
@NumberOfCustomers INT = 150,
@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
SELECT TOP (@NumberOfRecords)
n,
r.CustomerID,
s.NumberOfSecondsToRun,
s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
) r
CROSS APPLY
(
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
) s
),
records AS
(
SELECT
v.n AS WorkItemID,
v.CustomerID,
v.NumberOfSecondsToRun,
v.NumberOfSecondsBeforeNextRunBegins,
SUM(v.NumberOfSecondsBeforeNextRunBegins) OVER (PARTITION BY v.CustomerID ORDER BY v.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
FROM vals v
)
INSERT INTO #WorkItems
(
WorkItemID,
CustomerID,
WorkItemStart,
WorkItemEnd
)
SELECT
r.WorkItemID,
r.CustomerID,
s.WorkItemStart,
DATEADD(SECOND, r.NumberOfSecondsToRun, s.WorkItemStart) AS WorkItemEnd
FROM records r
CROSS APPLY
(
SELECT
DATEADD(SECOND, r.TotalNumberOfSeconds, @StartDate) AS WorkItemStart
) s;

On the ridiculously overpowered server that I’m abusing to do this blog post, it took about 32 seconds to generate 5 million rows.  We are generating data for 150 customers using a uniform distribution, so we’d expect somewhere around 33,333 records per customer.  In my sample, customer work item counts range from 32,770 to 33,747 so there’s a little bit of variance, but not much.
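The T-SQL above leans on the Box-Muller transform to turn pairs of uniform random values into normally distributed ones. For readers who want the transform in isolation, here is a minimal Python sketch of the same math (function and parameter names are mine):

```python
# Box-Muller: two uniform draws in (0, 1] become one normal draw.
import math
import random

def box_muller(mean, std_dev):
    """Generate one draw from N(mean, std_dev) using the Box-Muller transform."""
    u1 = 1.0 - random.random()   # shift to (0, 1] so log(u1) is always defined
    u2 = random.random()
    return mean + std_dev * math.sqrt(-2.0 * math.log(u1)) * math.cos(2 * math.pi * u2)

# Simulate the run-length distribution from the T-SQL: mean 90s, std dev 18.8s.
run_lengths = [box_muller(90.0, 18.8) for _ in range(100_000)]
print(sum(run_lengths) / len(run_lengths))  # should land near 90
```

The `SIN` branch in the T-SQL is the second half of the same transform: each pair of uniforms yields two independent normal draws, one via cosine and one via sine.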

Building A Solution:  A Single Customer

Just like before, we’re going to squeeze a few hundred more words out of this post and build up a query step by step.  Our first step will be to get the starting times and ending times for each of our selected customer’s work items.  Each starting point and each ending point will take a row, so we’ll use UNION ALL to unpivot this data.

DECLARE
@CustomerID INT = 1;

SELECT
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID

UNION ALL

SELECT
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID

For my data set, I get something like the following:

As a quick note, the start times are all in order but the end times are arbitrarily ordered, as ordering won’t matter where we’re going.

Then, we want to build out step two, where we add an ordering based on time for each of the start and stop points:

DECLARE
@CustomerID INT = 1;

WITH StartStopPoints AS
(
SELECT
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
UNION ALL
SELECT
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
)
SELECT
s.CustomerID,
s.TimeUTC,
s.IsStartingPoint,
s.StartOrdinal,
ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s;

Running this gives us the following results for my data:

You can probably see by this point how the pieces are coming together:  each time frame has a starting point and an ending point.  If there were no overlap at all, we’d see in the fourth column a number followed by a NULL, followed by a number followed by a NULL, etc.  But we clearly don’t see that:  we see work item ordinals 3 and 4 share some overlap:  item 3 started at 3:06:15 PM and ended after item 4’s start of 3:07:20 PM.  This means that those two overlapped to some extent.  Then we see two NULL values, which means they both ended before 5 began.  So far so good for our developers!

The final calculation looks like this:

DECLARE
@CustomerID INT = 1;

WITH StartStopPoints AS
(
SELECT
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID

UNION ALL

SELECT
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
SELECT
s.CustomerID,
s.TimeUTC,
s.IsStartingPoint,
s.StartOrdinal,
ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s
)
SELECT
MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
s.IsStartingPoint = 1;

Because we know that each start (probably) has an end, we need to multiply StartOrdinal by 2.  Then, we want to compare the result of that calculation to StartOrEndOrdinal.  If you go back up to the previous image, you can see how this gives us 1 concurrent item for the first three work items, but as soon as we add in work item 4, 2*4-6 = 2, so we now have two concurrent work items.  By the way, this ran in about 26ms for me and took 182 reads if you use an index that I create below.  But don’t skip that far ahead yet; we have more to do!
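The 2 * StartOrdinal - StartOrEndOrdinal trick is the set-based cousin of a classic sweep: sort every boundary point, add one at each start, subtract one at each end, and track the running peak. A small Python sketch of that sweep (the interval data is made up for illustration):

```python
# Sweep-line version of the max-concurrency calculation.
def max_concurrent(intervals):
    """intervals: list of (start, end) pairs; returns the peak overlap count."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a work item begins
        events.append((end, -1))     # a work item ends
    # At the same instant, ends sort before starts (-1 < 1), matching the
    # T-SQL's ORDER BY TimeUTC, IsStartingPoint tie-break.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

work_items = [(0, 10), (5, 12), (11, 20), (15, 25), (18, 22)]
print(max_concurrent(work_items))  # 3: the items starting at 15 and 18 overlap (11, 20)
```

Same answer, different engine: the T-SQL computes the running count algebraically from the two ordinals instead of iterating.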

Voila.  Or viola, whichever.

But wait, there’s more!

Where Am I Overlapping The Most?

Naturally, if I know that my max overlap window is 3, I’d be curious about when that happened—maybe I can correlate it with logs and figure out which intern to blame (protip:  have interns in different time zones so you always have a likely culprit).

Getting the results is easy, for some definition of “easy” which doesn’t really fit with the normal definition of “easy.”

First, we will create a temp table called #MaxConcurrentItems.  This will hold all of the work items which are equal to the max concurrent item count.

DROP TABLE IF EXISTS #MaxConcurrentItems;
CREATE TABLE #MaxConcurrentItems
(
WorkItemID INT PRIMARY KEY CLUSTERED,
CustomerID INT,
MaxConcurrentWorkItems INT
);

DECLARE
@CustomerID INT = 1,
@MaxConcurrentWorkItems INT = 0;

WITH StartStopPoints AS
(
SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID

UNION ALL

SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
SELECT
s.WorkItemID,
s.CustomerID,
s.TimeUTC,
s.IsStartingPoint,
s.StartOrdinal,
ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s
),
MaxConcurrency AS
(
SELECT
MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
s.IsStartingPoint = 1
)
INSERT INTO #MaxConcurrentItems
(
WorkItemID,
CustomerID,
MaxConcurrentWorkItems
)
SELECT
s.WorkItemID,
s.CustomerID,
M.MaxConcurrentWorkItems
FROM StartStopOrder s
CROSS JOIN MaxConcurrency m
WHERE
2 * s.StartOrdinal - s.StartOrEndOrdinal = m.MaxConcurrentWorkItems;

So let’s explain this step by step:

1. StartStopPoints and StartStopOrder are the same as before.
2. Instead of returning MaxConcurrentWorkItems, we’re dropping that into a CTE called MaxConcurrency.
3. Once we have the max level of concurrency, we want to go back to StartStopOrder and get all of the cases where the number of concurrent work items is equal to the maximum number of concurrent work items.  We insert this into #MaxConcurrentItems.

From there, I populate @MaxConcurrentWorkItems with the max value we retrieve above (so I don’t need to calculate it again) and I want to get the record which pushed us to the max level of concurrency as well as all of the prior records in that grouping.  We already know a technique for turning one row into multiple rows:  a tally table.

I do need to go back and get the start ordinals for each row in #WorkItems for that customer ID; that way, I can join against the tally table and get prior records, defined where MaxRow.StartOrdinal = WorkItem.StartOrdinal + n - 1.  The -1 at the end is because our tally table starts from 1 instead of 0; if yours has a 0 value, then you can change the inequality operation in the WHERE clause below and include just the n rows without any subtraction involved.

Here is the solution in all its glory:

DROP TABLE IF EXISTS #MaxConcurrentItems;
CREATE TABLE #MaxConcurrentItems
(
WorkItemID INT PRIMARY KEY CLUSTERED,
CustomerID INT,
MaxConcurrentWorkItems INT
);

DECLARE
@CustomerID INT = 1,
@MaxConcurrentWorkItems INT = 0;

WITH StartStopPoints AS
(
SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID

UNION ALL

SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
),
StartStopOrder AS
(
SELECT
s.WorkItemID,
s.CustomerID,
s.TimeUTC,
s.IsStartingPoint,
s.StartOrdinal,
ROW_NUMBER() OVER (ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s
),
MaxConcurrency AS
(
SELECT
MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
s.IsStartingPoint = 1
)
INSERT INTO #MaxConcurrentItems
(
WorkItemID,
CustomerID,
MaxConcurrentWorkItems
)
SELECT
s.WorkItemID,
s.CustomerID,
M.MaxConcurrentWorkItems
FROM StartStopOrder s
CROSS JOIN MaxConcurrency m
WHERE
2 * s.StartOrdinal - s.StartOrEndOrdinal = m.MaxConcurrentWorkItems;

SELECT
@MaxConcurrentWorkItems = MAX(mci.MaxConcurrentWorkItems)
FROM #MaxConcurrentItems mci;

WITH L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L2),
WorkItems AS
(
SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart,
wi.WorkItemEnd,
ROW_NUMBER() OVER (ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi
WHERE
wi.CustomerID = @CustomerID
),
MaxWorkItems AS
(
SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart,
wi.WorkItemEnd,
wi.StartOrdinal,
ROW_NUMBER() OVER (ORDER BY mci.WorkItemID) AS WorkItemGrouping
FROM #MaxConcurrentItems mci
INNER JOIN WorkItems wi
ON mci.WorkItemID = wi.WorkItemID
)
SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart,
wi.WorkItemEnd,
M.WorkItemGrouping
FROM MaxWorkItems M
CROSS JOIN Nums n
INNER JOIN WorkItems wi
ON M.StartOrdinal = (wi.StartOrdinal + n.n - 1)
WHERE
n.n <= @MaxConcurrentWorkItems
ORDER BY
wi.WorkItemID ASC;

And here’s the results for customer 1:

For bonus wackiness, the same work item can be in multiple groupings.  That makes sense:  we can see from this mess that 194483 finished at 13:51:22 and 197740 began at 13:52:35, but 194736 and 194513 were both active during that entire stretch, so our pattern of active work items was 1-2-3-2-3.

So we get our results, but at what cost?  Here’s the execution plan for the whole thing:

Despite this awful query plan, this ran in about 1.5 seconds, but had a worktable with 129K reads.  I could definitely tune this query by saving my slice of work items with ordinals off in its own temp table and querying that…but my editor says I have to wrap up this section now, so let’s jump straight to the moral of the story.

The moral of the story?  I already told you:  have lots of interns so you always have someone to blame.

What Is My Overlap Per Customer?

Now I’d like to see what the overlap is across all of my 150 customers.  Here goes:

WITH StartStopPoints AS
(
SELECT
wi.CustomerID,
wi.WorkItemStart AS TimeUTC,
1 AS IsStartingPoint,
ROW_NUMBER() OVER (PARTITION BY wi.CustomerID ORDER BY wi.WorkItemStart) AS StartOrdinal
FROM #WorkItems wi

UNION ALL

SELECT
wi.CustomerID,
wi.WorkItemEnd AS TimeUTC,
0 AS IsStartingPoint,
NULL AS StartOrdinal
FROM #WorkItems wi
),
StartStopOrder AS
(
SELECT
s.CustomerID,
s.TimeUTC,
s.IsStartingPoint,
s.StartOrdinal,
ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
FROM StartStopPoints s
)
SELECT
s.CustomerID,
MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentWorkItems
FROM StartStopOrder s
WHERE
s.IsStartingPoint = 1
GROUP BY
s.CustomerID
ORDER BY
MaxConcurrentWorkItems DESC,
CustomerID ASC;
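Before looking at the results, it’s worth unpacking the 2 * StartOrdinal - StartOrEndOrdinal expression.  At the moment a work item starts, StartOrdinal counts how many items have started so far and StartOrEndOrdinal counts all events (starts plus ends) so far, so StartOrEndOrdinal - StartOrdinal is the number of items which have already ended, and the active count is StartOrdinal - (StartOrEndOrdinal - StartOrdinal).  Here is a Python sketch of the same arithmetic, using illustrative intervals rather than the generated data:

```python
# Why MAX(2 * StartOrdinal - StartOrEndOrdinal) works:
# at the moment a work item starts, the number of active items is
#   (starts so far) - (ends so far)
#   = StartOrdinal - (StartOrEndOrdinal - StartOrdinal)
#   = 2 * StartOrdinal - StartOrEndOrdinal
# Illustrative intervals in seconds, not real work item data.
intervals = [(0, 118), (49, 107), (236, 300)]

starts = [(s, 1) for s, _ in intervals]
ends = [(e, 0) for _, e in intervals]
# Sorting tuples puts an end (0) before a start (1) at the same instant,
# matching ORDER BY TimeUTC, IsStartingPoint in the query.
events = sorted(starts + ends)

max_concurrent = 0
start_ordinal = 0
for overall_ordinal, (time, is_start) in enumerate(events, start=1):
    if is_start:
        start_ordinal += 1
        max_concurrent = max(max_concurrent, 2 * start_ordinal - overall_ordinal)

print(max_concurrent)  # 2
```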

At our worst, we have 5 concurrent work items for customer 139:

This query took several seconds to run, so let’s check out the execution plan:

We have two separate scans of the WorkItems table, and we can see the Segment-Sequence combo which represents a window function show up twice.  The biggest pain point is a major Sort operation, which in my case spilled to tempdb at level 1.

So let’s create an index:

CREATE NONCLUSTERED INDEX [IX_WorkItems] ON #WorkItems
(
CustomerID,
WorkItemStart,
WorkItemEnd
) WITH(DATA_COMPRESSION = PAGE);

As a quick note, there’s nothing here which prevents a customer from kicking off two work items at exactly the same time, so I can’t make this a unique index.

Doing this didn’t change the execution plan much, but did change IO from 53,368 reads to 26,233 reads and reduced overall execution time on this server from 6925ms (17,406ms CPU time) down to 3650ms (8203ms CPU time) so that’s not awful.

Wrapping Up

Over the course of this two-post series, we have generated a lot of artificial data in order to find overlapping time ranges.  Along the way, we learned many important things, such as the value of keeping plenty of cannon-fodder interns around.

Finding Max Concurrent Operations With T-SQL (Part 1)

Not too long ago, a co-worker had an issue that he asked me about.  The gist of it is, we can have multiple concurrent work items processing for a customer, but we want to limit that number to 2.  The development team wanted to make sure that their code was working as expected, but they weren’t sure how they could test it, as work items don’t all start and stop at the same time.

Build Some Sample Data

In order to show you the solution, I want to build up a reasonably sized sample.  Any solution looks great when reading five records, but let’s kick that up a notch.  Or, more specifically, a million notches:  I’m going to use a CTE tally table and load 5 million rows.

I want some realistic looking data, so I’ve adapted Dallas Snider’s strategy to build a data set which approximates a normal distribution.

Because this is a little complicated, I wanted to take the time and explain the data load process in detail in its own post, and then apply it in the follow-up post.  We’ll start with a relatively small number of records for this demonstration:  50,000.  The reason is that you can generate 50K records almost instantaneously but once you start getting a couple orders of magnitude larger, things slow down some.

Give Me Rows

Getting an arbitrary number of records is easy with a CTE-based tally table.  Here we get 50,000 uniquely numbered rows in a trivial amount of time:

DECLARE
@NumberOfRecords INT = 50000;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT TOP(@NumberOfRecords)
n
FROM Nums;


The output of this is 50,000 rows with one column called n.  It’s not the most creative name, but it does the trick.
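As an aside, the cascading cross joins square the row count at every level, so this pattern scales up very quickly.  A quick Python check of the arithmetic:

```python
# Each tally level cross-joins the prior level with itself, squaring the
# row count: L0 = 2, then 4, 16, 256, 65,536, and L5 tops 4 billion rows,
# so TOP(@NumberOfRecords) always has plenty to draw from.
rows = 2                    # L0: SELECT 1 UNION ALL SELECT 1
for level in range(1, 6):   # L1 through L5
    rows *= rows
print(rows)  # 4294967296
```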

Give Me Random Values

The next thing I need is two random numbers for each of the rows.  We’ll get into why I need them in just a moment.  For now, I want to point out that calling RAND() once per row isn’t a safe bet:  without a seed, SQL Server evaluates RAND() once per query, so every row ends up with the same value.

Calling RAND() for each record in a result set doesn’t quite work the way you’d expect.

Although that’s not workable, you can still generate random values following a uniform distribution between 0 and 1 using the combination RAND(CHECKSUM(NEWID())):

GUIDs are occasionally useful for something.

Because NEWID() gives us a new value for each row, we have a different seed for each row and a result which looks much better for us.

Converting Uniform To Normal

As mentioned above, Dallas Snider has a great technique for generating random values pulled from something which approximates a normal distribution using T-SQL.  But that technique uses a loop and generates two values per loop iteration.  In the end, I’m going to want to do this 5 million times, so I’d much rather build a set-based solution.

First, let’s look at Dallas’s solution and see what we need.  I’ve simplified the code a bit and formatted it the way I like:

DECLARE
@randNum1 FLOAT = RAND(),
@randNum2 FLOAT = RAND(),
@mean FLOAT = 75.0,
@stdDev FLOAT = 5.0, --standard deviation
@precision INT = 1; --number of places to the right of the decimal point

SELECT
ROUND((SQRT(-2.0 * LOG(@randNum1)) * COS(2 * PI() * @randNum2)) * @stdDev, @precision) + @mean AS Value1,
ROUND((SQRT(-2.0 * LOG(@randNum1)) * SIN(2 * PI() * @randNum2)) * @stdDev, @precision) + @mean AS Value2;


We end up with something like the following:

Generating values off of a (nearly) normal distribution.

Your numbers will, of course, differ, but that’s the nature of random numbers.
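The two formulas are the Box-Muller transform, which turns two independent uniform(0,1) draws into two independent (approximately) normal draws.  Here is a Python sketch of the same math:

```python
import math
import random

def box_muller(mean, std_dev, precision=1):
    # Two independent uniform(0,1) draws in, two independent
    # (approximately) normal draws out -- the same math as the T-SQL.
    u1 = 1.0 - random.random()   # keep u1 in (0, 1] so log() never sees 0
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    v1 = round(r * math.cos(2.0 * math.pi * u2) * std_dev, precision) + mean
    v2 = round(r * math.sin(2.0 * math.pi * u2) * std_dev, precision) + mean
    return v1, v2

# Sanity check: the sample mean of many draws should land near the target.
samples = [v for _ in range(20000) for v in box_muller(75.0, 5.0)]
sample_mean = sum(samples) / len(samples)
print(round(sample_mean))  # 75
```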

So how can we adapt this technique to our code?

Getting To The Crux Of Our Problem

Here’s the gist:  we have customers with customer IDs.  These customers have some data processed.  Each time they ask for processing, it takes some number of seconds, but that number can change (maybe the files are different sizes, maybe there’s resource contention, whatever).  There will also be delays between executions for each customer, but here’s the important part:  a customer can have more than one process running at a time.

So how do we model this idea?  With several variables.

DECLARE
@NumberOfRecords INT = 50000,
@NumberOfCustomers INT = 150,
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;


In addition to our 50K records, we’ve created 150 customers out of thin air.  We have, through studious analysis of our data and totally not just picking a number out of thin air, determined that the mean number of seconds it takes to do this data processing is exactly 90.0, and our standard deviation is 18.8.  This data approximates a normal distribution.

In addition, the mean length between runs is 128.0 seconds (again, people in lab coats with Very Serious Glasses and beakers and test tubes and stuff helped us determine this) and the standard deviation for time between runs is 42.3 seconds, and once more, this approximates a normal distribution.  Finally, we will ask for precision down to 1 spot after the decimal.

Once I’ve defined those variables and collected those values, I can write a query which gives us realistic-looking values:

DECLARE
@NumberOfRecords INT = 50000,
@NumberOfCustomers INT = 150,
@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT TOP (@NumberOfRecords)
n,
r.CustomerID,
s.NumberOfSecondsToRun,
s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
CROSS APPLY
(
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
) r
CROSS APPLY
(
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
) s


And here are some sample results:

This rabbit hole is certainly starting to get deep.

The complexity jumped up just a little bit, so let’s walk our way through this.  First, I used the CROSS APPLY operator to give me values for rand1 and rand2, as well as a third random value for our Customer ID.  I don’t mind Customer ID following a uniform distribution—that might not be perfectly realistic, but it’s close enough for our work.  Given 50K records and 150 customers, we’d expect about 333 rows per customer.

Next, I chain the first CROSS APPLY with a second because one good function deserves another.  This second function takes Dallas’s calculations and turns them to my own nefarious purposes.  Note that instead of having one mean and one standard deviation like in his example, I have two means and two standard deviations.

At this point, I have a few variables for each customer:  an implicit ordering (based on n), the number of seconds this particular job will run (NumberOfSecondsToRun), and the number of seconds from now before the next job begins (NumberOfSecondsBeforeNextRunBegins).  So now we need to convert this into a stream of time rather than thinking of things as just individual points in time.

Turning Numbers To Time Streams

Let’s pretend that each customer started at time 0.  Again, that’s not totally realistic but for our purposes it’ll work just fine.  Here’s a quick way to generate a time stream for customer 1:

DECLARE
@NumberOfRecords INT = 50000,
@NumberOfCustomers INT = 150,
@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
SELECT TOP (@NumberOfRecords)
n,
r.CustomerID,
s.NumberOfSecondsToRun,
s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
) r
CROSS APPLY
(
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
) s
)
SELECT
v.n,
v.CustomerID,
v.NumberOfSecondsToRun,
v.NumberOfSecondsBeforeNextRunBegins
FROM Vals v
WHERE
v.CustomerID = 1
ORDER BY
n ASC


And here are some sample results so we can follow along:

The sordid history of Customer 1.

Let’s start at time t=0.  If you want to imagine t=0 as a date like 2018-12-18 08:00:00, that’s also fine, but I’m going to stick with numbers representing seconds for the moment.

So we’re at t=0.  Then event 20 happens.  Customer 1 wants us to process a data file.  It’s going to end up taking us 118.7 seconds to process that first data file.

In the meantime, at t=49 seconds, event 88 happens and we process a second file.  This one takes 58.4 seconds to run, so it completes at time 107.4.

Then, at t=(49+187.2) or t=236.2, the third file starts.  If it helps, I’ve built a number line image below to visualize this:

Visualizing the first three executions for Customer 1.

The main thing we want to see is whether there is overlap and, if so, how many concurrent executions we get.  In this case, the red and blue lines overlap, but neither overlaps with green.  Therefore, our max number of concurrent executions is 2.  We’d want to look at this over the entire stream of time.
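To double-check the picture, here is a small interval-overlap test in Python.  The red and blue endpoints come from the walkthrough above; the third file’s run length isn’t shown, so its 90-second duration is an assumption for illustration:

```python
def overlaps(a, b):
    # Two half-open intervals [start, end) overlap iff
    # each one starts before the other one ends.
    return a[0] < b[1] and b[0] < a[1]

red   = (0.0, 118.7)            # first file: starts at t=0, runs 118.7s
blue  = (49.0, 107.4)           # second file: starts at t=49, ends at 107.4
green = (236.2, 236.2 + 90.0)   # third file; 90s run length is assumed

print(overlaps(red, blue), overlaps(red, green), overlaps(blue, green))
# True False False
```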

The way we can get this is to add one more level of common table expression and introduce a window function with a cutoff:

DECLARE
@NumberOfRecords INT = 50000,
@NumberOfCustomers INT = 150,
@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
SELECT TOP (@NumberOfRecords)
n,
r.CustomerID,
s.NumberOfSecondsToRun,
s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
) r
CROSS APPLY
(
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
) s
)
SELECT
v.n AS WorkItemID,
v.CustomerID,
v.NumberOfSecondsToRun,
v.NumberOfSecondsBeforeNextRunBegins,
SUM(v.NumberOfSecondsBeforeNextRunBegins)
OVER
(
PARTITION BY v.CustomerID
ORDER BY v.n
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
FROM vals v


We’ve now included a SUM of NumberOfSecondsBeforeNextRunBegins over a window stretching from the beginning of time up to the current row.  But because each row’s value describes when the next run begins, we need to subtract out the current row’s value, and that gives us the TotalNumberOfSeconds result:  the offset at which each work item begins.
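In other words, this is an exclusive prefix sum:  each work item starts after the sum of all prior gaps.  A Python sketch with illustrative gap values:

```python
# Exclusive prefix sum: each work item starts after the sum of all PRIOR
# gaps, which equals the inclusive running total minus the current row.
# Gap values are illustrative seconds, not the generated data.
gaps = [49.0, 187.2, 60.5, 131.8]

running = 0.0
start_offsets = []
for gap in gaps:
    inclusive = running + gap
    start_offsets.append(inclusive - gap)  # SUM(...) OVER (...) - current row
    running = inclusive

print([round(x, 1) for x in start_offsets])  # [0.0, 49.0, 236.2, 296.7]
```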

Here’s a picture, though note that all of the numbers changed because it’s randomly generated data:

Customer 1 has always been at war with Eastasia.

Tying It All Together

So now, armed with these details, we can take the total number of seconds and add it to our start date to get when a work item begins.  We can take the work item begin time and add the number of seconds to run, which gives us the work item’s end time.

DROP TABLE IF EXISTS #WorkItems;
CREATE TABLE #WorkItems
(
WorkItemID INT NOT NULL PRIMARY KEY CLUSTERED,
CustomerID INT NOT NULL,
WorkItemStart DATETIME2(0) NOT NULL,
WorkItemEnd DATETIME2(0) NULL
);

DECLARE
@NumberOfRecords INT = 50000,
@NumberOfCustomers INT = 150,
@StartDate DATETIME2(0) = '2018-12-18 15:00:00',
@MeanRunLengthInSeconds DECIMAL(5,2) = 90.0,
@StdDevRunLengthInSeconds DECIMAL(5,2) = 18.8,
@MeanLengthBetweenRunsInSeconds DECIMAL(5,2) = 128.0,
@StdDevLengthBetweenRunsInSeconds DECIMAL(5,2) = 42.3,
@Precision INT = 1;

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5),
Vals AS
(
SELECT TOP (@NumberOfRecords)
n,
r.CustomerID,
s.NumberOfSecondsToRun,
s.NumberOfSecondsBeforeNextRunBegins
FROM Nums
CROSS APPLY (
SELECT
RAND(CHECKSUM(NEWID())) AS rand1,
RAND(CHECKSUM(NEWID())) AS rand2,
CAST(@NumberOfCustomers * RAND(CHECKSUM(NEWID())) AS INT) + 1 AS CustomerID
) r
CROSS APPLY
(
SELECT
ROUND((SQRT(-2.0 * LOG(r.rand1)) * COS(2 * PI() * r.rand2)) * @StdDevRunLengthInSeconds, @Precision) + @MeanRunLengthInSeconds AS NumberOfSecondsToRun,
ROUND((SQRT(-2.0 * LOG(r.rand1)) * SIN(2 * PI() * r.rand2)) * @StdDevLengthBetweenRunsInSeconds, @Precision) + @MeanLengthBetweenRunsInSeconds AS NumberOfSecondsBeforeNextRunBegins
) s
),
records AS
(
SELECT
v.n AS WorkItemID,
v.CustomerID,
v.NumberOfSecondsToRun,
v.NumberOfSecondsBeforeNextRunBegins,
SUM(v.NumberOfSecondsBeforeNextRunBegins) OVER (PARTITION BY v.CustomerID ORDER BY v.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - v.NumberOfSecondsBeforeNextRunBegins AS TotalNumberOfSeconds
FROM vals v
)
INSERT INTO #WorkItems
(
WorkItemID,
CustomerID,
WorkItemStart,
WorkItemEnd
)
SELECT
r.WorkItemID,
r.CustomerID,
s.WorkItemStart,
DATEADD(SECOND, r.NumberOfSecondsToRun, s.WorkItemStart) AS WorkItemEnd
FROM records r
CROSS APPLY
(
SELECT
DATEADD(SECOND, r.TotalNumberOfSeconds, @StartDate) AS WorkItemStart
) s;

SELECT
wi.WorkItemID,
wi.CustomerID,
wi.WorkItemStart,
wi.WorkItemEnd
FROM #WorkItems wi
ORDER BY
wi.CustomerID,
wi.WorkItemID;


There are two things of note here.  First, I added a temp table called #WorkItems.  Note that if you are not running SQL Server 2016 or later, you’ll want to comment out the first line, as DROP TABLE IF EXISTS is not available.  Then, I added an insert statement at the bottom.  Here, I start with my records CTE (which we saw above) and use CROSS APPLY to get each work item’s start date.  I did that in a CROSS APPLY operation so that I could make the WorkItemEnd calculation easier to follow:  we run a certain number of seconds from each work item’s starting point.

This gives us a data set that looks a bit like the following:

Trying to schedule a date with customer 1.

In this run of the job, we can see that work item 296 ends at 3:06 PM, but work item 403 started at 3:05 PM, meaning that there is some overlap.  If you briefly glance through these, you’ll see some instances of overlap but nothing absurd.  However, glancing through isn’t enough.  That’s where Part 2 comes into play.

Next Time On Matlock

Up to this point, we’ve spent all of our time building up a realistic-enough data set.  In tomorrow’s post, we will take this structure, expand it to 5 million rows, and look at a way of finding the max level of concurrency by customer.  Stay tuned!