In the Papers: Data Sets Conforming to Benford’s Law

This is a review of R.C. Hall’s Properties of Data Sets that Conform to Benford’s Law, a math-heavy article which lays out proofs of the properties you would expect from a dataset that follows Benford’s Law. Benford’s Law has interested me for a while (I wrote blog posts on the topic back in 2015 and 2019), and I picked up this article primarily because I wanted a better idea of which distributions follow Benford’s Law and which don’t.

Hall lays this out on page 3: “If the probability density function of the mantissas of the logarithm of a data set is Uniform distribution then the data set conforms exactly to Benford’s law.” Page 4 follows with a proof. But, uh, if you don’t have a solid stats background, that’s a tough statement to follow. Fortunately, there are great examples of distributions which do follow Benford’s Law and which don’t.
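
For anyone who wants the plain-math version of that statement, here is a rough sketch of the connection (my notation, not Hall’s), writing {·} for the fractional part: if the fractional part of log10(x) is uniform on [0, 1), then the chance of a given leading digit is just the width of the corresponding interval on the log scale.

\[
P(\text{first digit} = d) = P\bigl(\log_{10} d \le \{\log_{10} x\} < \log_{10}(d+1)\bigr) = \log_{10}(d+1) - \log_{10} d = \log_{10}\!\left(1 + \frac{1}{d}\right), \quad d = 1, \dots, 9
\]

That is exactly Benford’s Law: roughly 30.1% of leading digits are 1, 17.6% are 2, on down to about 4.6% for 9.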

Among the distributions which do: the exponential distribution follows Benford exactly, while gamma, beta, and Weibull distributions follow pretty closely. Log-normal approaches Benford as its standard deviation increases, and chi-squared approaches it as the number of degrees of freedom increases. Distributions which emphatically do not follow Benford are the Gaussian (normal) and uniform distributions.

Coming into this reading, I had an intuition that datasets are more likely to follow Benford’s Law if they span several orders of magnitude, and looking at the sets of distributions, that makes a lot of sense—especially if we consider the cases of log-normal and chi-squared, which only approach Benford under circumstances which increase the spread of values.

Starting on page 33, Hall also debunks “the summarization test,” the idea that if you add up all numbers that begin with a particular first digit (or first two digits), the sums should form a uniform distribution. Hall proves that this is true for exponential distributions but not for the others, meaning that a series can be very close to a Benford series yet fail the summarization test, which makes summarization a bad test. The reason is that most of the relevant real-world sets following Benford’s Law are actually log-normal, not exponential.

Overall, I recommend this paper even if you aren’t deeply familiar with statistics and have trouble following the proofs. Hall includes quite a few graphs and writes clearly. The one unfortunate thing is that I read this paper on a grayscale tablet, meaning that all of the charts were really difficult to follow due to the color choices Excel makes by default. For that reason, I’d recommend reading the paper on a color-enabled screen.

A Slice of Time: Statistical Functions

This is part five in a series on window functions in SQL Server.

Reviewing the Statistical Functions

The next set of window functions we will look at are the statistical functions. There are four of them: PERCENT_RANK(), CUME_DIST(), PERCENTILE_CONT(), and PERCENTILE_DISC().

PERCENT_RANK and CUME_DIST

PERCENT_RANK() is a fairly straightforward function: it ranks each row within its window based on some ORDER BY clause, with values ranging from 0 to 1, inclusive. Here’s an example of a query which generates rankings across 66,463 invoice line items:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(ORDER BY r.DailyProfit) AS PercentRank
FROM records r
ORDER BY
    PercentRank DESC;
Ranking daily profits.

Note that this is a percentage of rank, not a percentage of the largest value. That’s why the second row has the value it does: it is in the 99.998th percentile by rank, even though its value is a mere 77.2% of the largest.

Breaking it down by customer, we can see that everything works exactly as we’d expect:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS PercentRank
FROM records r
ORDER BY
    r.CustomerID,
    PercentRank;
PERCENT_RANK also handles negative values, no problem.

Cumulative distribution, or CUME_DIST(), calculates the percentage of rows with a value less than or equal to the current row. There is one big difference between this function and PERCENT_RANK(). Let’s take a look at an example to see what it is:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS PercentRank,
    CUME_DIST() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS CumeDist
FROM records r
ORDER BY
    r.CustomerID,
    PercentRank;
Compare and contrast time!

CUME_DIST() doesn’t show 0 for the smallest record. The reason for this is in the definition: CUME_DIST() tells us how far along we are in describing the entire set—that is, what percentage of values have we covered so far. This percentage is always greater than 0. By contrast, PERCENT_RANK() forces the lowest value to be 0 and the highest value to be 1.

Another thing to note is ties. There are 117 values for customer 1 in my dataset. Rows 5 and 6 both have a percent rank of 0.0344, which is approximately rank 4 (remembering that we start from 0, not 1). Both rows 5 and 6 have the same rank of 4, and then we move up to a rank of 6. Meanwhile, for cumulative distribution, we see that rows 5 and 6 have a cumulative distribution of 6/117 = 0.0513. In other words, PERCENT_RANK() ties get the lowest possible value, whereas CUME_DIST() ties get the highest possible value.
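
If it helps, here is the arithmetic behind those two numbers, using the documented formulas for the two functions (rows 5 and 6 tie at a one-based RANK() of 5, out of N = 117 rows):

\[
\text{PERCENT\_RANK} = \frac{\text{rank} - 1}{N - 1} = \frac{5 - 1}{117 - 1} = \frac{4}{116},
\qquad
\text{CUME\_DIST} = \frac{\#\{\text{rows with value} \le \text{current}\}}{N} = \frac{6}{117} \approx 0.0513
\]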

As a quick side note, that arithmetic also explains why the PERCENT_RANK() values are always a bit smaller than the CUME_DIST() values: PERCENT_RANK() subtracts one from the rank and divides by one fewer row, so the two will rarely line up exactly.

Percentiles

The other two functions are PERCENTILE_CONT() and PERCENTILE_DISC(). These stand for continuous and discrete percentiles, respectively. Both of them have the same functional inputs, but have a slight difference, as we’ll see. Let’s start with continuous percentiles and build out a five-number summary. I want to see the minimum value, maximum value, median (50th percentile) value, and the first and third quartiles (25th and 75th percentiles, respectively). The T-SQL to do this is really easy:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.CustomerID,
    r.InvoiceDate,
    r.DailyProfit,
    PERCENTILE_CONT(0.0) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Min,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Q1,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Q3,
    PERCENTILE_CONT(1.0) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Max
FROM records r
ORDER BY
    r.CustomerID DESC,
    r.InvoiceDate;

PERCENTILE_CONT() takes in one parameter, which represents the percentile you want to see, ranging from 0 to 1. We also have a WITHIN GROUP clause here that I want to talk about for a moment, as we haven’t seen one of these yet. When working with certain window functions, we need to perform an operation inside the individual window. Here, we’ve partitioned the data by customer ID, meaning that we’ll calculate percentiles for each customer independently. Within the partition, I want to order the data by daily profit so that we can calculate the percentiles.

Here’s how the data looks, sorted by customer descending because it’s time for a change.

Calculating percentiles.

This looks really cool: I can do the same type of work that I’d do in R within SQL Server and not have to worry about setting up ML Services. But wait! There’s a tiny catch: the execution plan.

Not a great beginning.

So yeah, we have two window aggregates…wait, my producer is telling me that the image isn’t complete. Let’s stand by for a live feed:

This looks fine.

Um…that’s a lot of operators for a pretty simple query. For those keeping track at home, there are 7 window aggregates and 8 sort operations for this query! Everything is using the same WITHIN GROUP and PARTITION BY clauses, and yet we get a window aggregate for each one, plus a couple of bonus window aggregates thrown in for free query bucks. Even asking for a single percentile like just the median necessitates 3 window aggregates and 4 sorts for this query. Performance for this query is fine when you’re talking about thousands of rows. Once you talk about millions of rows, forget about it. On my Cray supercomputer desktop at home, getting a 5-number summary against 66,463 invoice records takes 107ms to complete, with 383ms of CPU and 5,515 reads. Extend this to 509,219 records in a larger version of this table and we have a duration of 824ms with 2,446ms of CPU and 11,732 reads. In real-life production situations, PERCENTILE_X() functions can bog down badly and leave you hurting.

Discrete Percentiles

There is another type of percentile query: PERCENTILE_DISC(). The difference between the continuous version and the discrete version is that the continuous version will interpolate values, whereas discrete will guarantee a value that’s in your dataset, even if it’s not statistically accurate. Here’s what I mean:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.CustomerID,
    r.InvoiceDate,
    r.DailyProfit,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS MedianDisc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Median
FROM records r
ORDER BY
    r.CustomerID DESC,
    r.InvoiceDate;
Calculating a median, the Goofus and Gallant way.

In case you’re not too familiar with the idea, calculating a median is pretty easy. It’s a lot like prison, really: you first find the biggest, meanest number in the dataset and shank it. Then, you take out the smallest one just to make sure it doesn’t do the same to you. Rinse and repeat until you have either 1 value or 2 values remaining. If you have one value left, congratulations: that’s the median! If you have two values left, you find the midpoint between those values and declare it the true median.

In the first example above, we can see three values for customer 1061. We remove $4520.10 because it is the largest; in a corresponding move, we remove $2006.65. That leaves us one value, $3628.10, and both the discrete and continuous versions agree on this.

The second example, however, brooks some disagreement. Just like before, we knock out the top dog, $1587.40. Then, we take out the little guy, $464. This leaves us two values. The discrete version takes the smaller of the two values and declares $486 the winner. The continuous version does things correctly and declares $604.50 the true median.
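
If you want to play with this without the WideWorldImporters tables, here’s a minimal, self-contained sketch using the four values from that second example (the $723.00 value is my back-calculation from the $604.50 midpoint, so treat the exact figures as illustrative):

WITH profits AS
(
    SELECT v.DailyProfit
    FROM (VALUES (464.00), (486.00), (723.00), (1587.40)) v(DailyProfit)
)
SELECT DISTINCT
    -- Returns 486.00, a value that actually exists in the set.
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY p.DailyProfit) OVER () AS MedianDisc,
    -- Returns 604.50, the interpolated midpoint of 486.00 and 723.00.
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY p.DailyProfit) OVER () AS MedianCont
FROM profits p;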

Note that I have a decided bias in my depiction of continuous and discrete percentiles. My recommendation is that you almost always use continuous percentiles, as those will give you the correct answer. The only time I’d use discrete percentiles is if you then need to join back to a dataset based on that calculated percentile and get more information. Otherwise, stick to the continuous versions.

Also note that PERCENTILE_DISC() is a dog for performance as well. You won’t really gain anything in bringing it in, though on the plus(?) side, it’s at least not slower than the continuous version.

Conclusion

Today we covered four more window functions. Two of them have to do with figuring out how far down a road you are, and the other two have to do with calculating percentages. I don’t tend to need percentage ranks or cumulative distributions that often, and although I’d like to use the percentiles more, they’re just too slow in SQL Server 2019. Admittedly, they were extremely fast in SQL Server 2008 R2 and before: you’d get a syntax error mighty fast, with hardly any server utilization.

Interesting Resources for Chapter 1

For each chapter in Finding Ghosts in Your Data, I’ll include a few resources that I found interesting. This isn’t a bibliography, strictly speaking, as I might not use all of these in the course of writing, but they were at least worth noting.

Books

  • Anomaly Detection Principles and Algorithms by Mehrotra, et al. I’ll reference this book frequently, as I think it’s a really good summary of the current state of anomaly detection in academic literature. The first 5-6 chapters are fairly “light” in the sense that an intelligent non-statistician can get a lot of information from them, though as you get deeper into the book, the math starts to pile up.
  • Outlier Analysis by Aggarwal. Unlike Mehrotra, et al, this is intended to be a textbook, and Aggarwal writes it as such. I don’t think it makes sense for most developers to read this book unless they’re really interested in the math behind anomaly detection and have enough of a background to make sense of it.

A Slice of Time: Offset Functions

This is part four in a series on window functions in SQL Server.

Offset Functions in SQL

Offset functions are another class of window function in SQL Server thanks to being part of the ANSI SQL standard. Think of a window of data, stretching over some number of rows. What we want to do is take the current row and then look back (or “lag” the data) or forward (“lead”).

There are four interesting offset window functions: LAG(), LEAD(), FIRST_VALUE(), and LAST_VALUE(). All four of them require an ORDER BY clause in the OVER() clause, because “prior” and “next” only mean something once we define an ordering. LAG() and LEAD() take two parameters, one of which is required: the value we want to lag or lead. The optional second parameter indicates how many steps backward or forward we want to look. Let’s see this in action with a query:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate
)
SELECT
    r.InvoiceDate,
    r.DailyProfit,
    LAG(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS LagDefaultProfit,
    LAG(r.DailyProfit, 1) OVER (ORDER BY r.InvoiceDate) AS PriorDayProfit,
    LAG(r.DailyProfit, 2) OVER (ORDER BY r.InvoiceDate) AS PriorDay2Profit,
    LAG(r.DailyProfit, 7) OVER (ORDER BY r.InvoiceDate) AS PriorDay7Profit,
    LAG(r.DailyProfit, 14) OVER (ORDER BY r.InvoiceDate) AS PriorDay14Profit,
    LEAD(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS LeadDefaultProfit,
    LEAD(r.DailyProfit, 1) OVER (ORDER BY r.InvoiceDate) AS NextDayProfit,
    LEAD(r.DailyProfit, 2) OVER (ORDER BY r.InvoiceDate) AS NextDay2Profit,
    LEAD(r.DailyProfit, 7) OVER (ORDER BY r.InvoiceDate) AS NextDay7Profit,
    LEAD(r.DailyProfit, 14) OVER (ORDER BY r.InvoiceDate) AS NextDay14Profit
FROM records r
ORDER BY
    r.InvoiceDate;

We start out with a common table expression, calculating the sum of line profit by invoice date. Then, we’ll try out a variety of LAG() and LEAD() options. The default, when you do not pass in a second parameter, is 1, so LagDefaultProfit will return the same results as PriorDayProfit.

Lots of columns to lag and to lead.

The LagDefaultProfit and PriorDayProfit columns are NULL for the first row because there is no prior record in the result set. One important thing to note is that window functions only see the rows the query returns: even if data exists for 2012-12-31 (or any earlier date) outside this result set, LAG() will not reach back to it. For LAG(2), the first two records will be NULL, and then the third row will include the value of daily profit from the first row.

I don’t show it in this image, but LEAD() works just like the inverse of LAG(), where the final row for LEAD(1) will be NULL.

LAG() and LEAD() are very useful analytical functions when working with business reports. You might want to show the current day, prior day, and difference between the two. This is extremely easy to do: SELECT r.DailyProfit - LAG(r.DailyProfit) OVER (ORDER BY r.InvoiceDate). Without the LAG() function, we’d need to create some terrible self-join to accomplish the same, especially if we wanted to include multiple prior and future periods.
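
Here’s a quick sketch of what that might look like in a full query, reusing the same CTE as above (the ProfitChange alias is mine):

WITH records AS
(
    SELECT
        i.InvoiceDate,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate
)
SELECT
    r.InvoiceDate,
    r.DailyProfit,
    LAG(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS PriorDayProfit,
    -- Day-over-day change; NULL for the first row because there is no prior day.
    r.DailyProfit - LAG(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS ProfitChange
FROM records r
ORDER BY
    r.InvoiceDate;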

First and Last Values

The other two offset functions of note are FIRST_VALUE() and LAST_VALUE(). Not surprisingly, these show the first and last values in a window, respectively. These functions accept a single parameter representing the value whose first/last value you want. We also need to include an ORDER BY statement in the OVER() clause.

WITH records AS
(
    SELECT
        i.InvoiceDate,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate
)
SELECT
    r.InvoiceDate,
    r.DailyProfit,
    FIRST_VALUE(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS DefaultFirstValue,
    LAST_VALUE(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS DefaultLastValue
FROM records r
ORDER BY
    r.InvoiceDate;

Once again, we’re aggregating data by date first, and then we’re going to look at the first and last values. Let’s see the results:

The last value doesn’t seem to be very last-valuey.

The DefaultFirstValue column is exactly what we’d expect: the first DailyProfit row in our window, or $14,346.50. The last row in my dataset is $119,456.90, which is definitely not any of the DefaultLastValue records we see. What gives? This is a bug in SQL Server, right?

A Reminder of Frame Defaults

Remember that the default frame for a window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. If we think about that, the DefaultLastValue makes perfect sense. For the first row, our entire window is 2013-01-01 itself, meaning that its first daily profit is $14,346.50 and its last daily profit is $14,346.50. Then, move to the second row and our first is $14,346.50 and last is $80,550.35. This probably isn’t what we expect with LAST_VALUE() and therefore we need to change the frame. Let’s change that to range from the beginning of our window to the end of the window, not just up to the current row.

WITH records AS
(
    SELECT
        i.InvoiceDate,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate
)
SELECT
    r.InvoiceDate,
    r.DailyProfit,
    FIRST_VALUE(r.DailyProfit) OVER (ORDER BY r.InvoiceDate) AS DefaultFirstValue,
    LAST_VALUE(r.DailyProfit) OVER (
        ORDER BY r.InvoiceDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND UNBOUNDED FOLLOWING) AS RealLastValue
FROM records r
ORDER BY
    r.InvoiceDate;
Now our last value looks right.

And of course, this frame behavior is not limited to LAST_VALUE(). FIRST_VALUE() got away with the default frame here because, with our ascending ORDER BY on r.InvoiceDate, the first row of the partition is always inside the frame. If you flip the ordering around, or use any function which needs to see rows after the current one, remember the frame for it too.

A Quick Reminder of Partitions

Over the last couple of posts, I’ve pretty much ignored partitions, not because they’re uninteresting, but because I wanted to get across the basic functionality of different window functions. Here’s an example of partitioning by customer ID and performing the same checks for first and last value. This is a more realistic use case for the two functions:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    FIRST_VALUE(r.DailyProfit) OVER (
        PARTITION BY r.CustomerID
        ORDER BY r.InvoiceDate) AS DefaultFirstValue,
    LAST_VALUE(r.DailyProfit) OVER (
        PARTITION BY r.CustomerID
        ORDER BY r.InvoiceDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND UNBOUNDED FOLLOWING) AS RealLastValue
FROM records r
ORDER BY
    r.CustomerID;
Splitting between customers 888 and 889.

Conclusion

Offset functions are another great addition to SQL Server 2012, as they make common business analytics queries significantly easier.

A Slice of Time: Ranking Functions

This is part three in a series on window functions in SQL Server.

The Original Windows

Last time around, we looked at aggregate window functions, which I think are the easiest for someone brand new to the concept to understand. Today, we’ll shift to the first window functions introduced to SQL Server in 2005: ranking window functions.

They’ve Given You a Number and Taken Away Your Name

The whole concept of ranking window functions is to assign some numeric ordering to a dataset. There are four ranking functions in SQL Server. Three of them are very similar to one another: ROW_NUMBER(), RANK(), DENSE_RANK(). The fourth one, NTILE(), is the odd cousin of the family.

Unlike aggregate window functions, all ranking window functions must have at least an ORDER BY clause in the OVER() operator. The reason is that you are attempting to bring order to the chaos of your data by assigning a number based on the order in which you join.

Unlike aggregate window functions, all ranking window functions must have at least an ORDER BY clause in the OVER() operator. The reason is that you are attempting to bring order to the chaos of your data by assigning a number based on the ordering you specify.
Row Numbers, Ranks, and Dense Ranks
Let’s show off what the three normal ranking functions can do:

SELECT TOP(250)
    il.InvoiceLineID,
    il.InvoiceID,
    i.CustomerID,
    ROW_NUMBER() OVER (ORDER BY il.InvoiceID) AS rownum,
    DENSE_RANK() OVER (ORDER BY il.InvoiceID) AS dr,
    RANK() OVER (ORDER BY il.InvoiceID) AS rk
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    il.InvoiceLineID,
    rownum;
Order!

In this case, we’re ordering everything by invoice ID because it shows off the differences between the three functions quite nicely.

Monotonicity

If you ever hear me present and I talk about the ROW_NUMBER() function, ingrained habit has me speak of “monotonically increasing” yada yada because grad school does weird things to your brain. Let’s talk about what I mean by this with a quick discussion of monotonicity.

When I say that a variable is monotonic with respect to another variable, what that means is that when the value of one goes up, the other does as well; when the value of one goes down, the other does as well. Take this hastily-drawn example:

Monotonicity in action.

As one curve goes up, the other one does as well. As a curve goes down, so too does the other. They are in lock-step when it comes to direction. Important here is that monotonic says nothing about whether the two variables remain equidistant. You can see in this image that at some points, the difference between the two curves at some given X is greater than at other points. We can’t use monotonicity as a shortcut for “equivalent.”

To add a bit to the confusion, you can also talk about the function which generated these curves as monotonic or not. In that case, a function is monotonic if its results move in one direction. Because our curves move both up and down, the curves themselves are not monotonic, but they are monotonic with respect to each other.

Strict or Not?

Not pictured: the skinny line going TO THE MOON.

There are two variants of monotonicity we care about: strict and non-strict monotonicity. Non-strict monotonicity allows for “flat spots” in the curve. For example, the two lines in the following image exhibit non-strict monotonicity.

These two line segments are monotonic with respect to each other, even though one has a flat spot while the other consistently increases. When I gave a broad explanation of monotonicity (one curve goes up, so does the other; one curve goes down, so does the other), that was technically a depiction of non-strict monotonicity, as I didn’t make any reference to these sorts of flat spots.

When we look at strict monotonicity, we also account for flat spots: as one curve goes up, so does the other; as one curve goes down, so does the other; and as one curve stays flat, so does the other. The above image does not show two variables in a strictly monotonic relationship, but the following image does:

Pictured: one curve going TO THE MOON! The other curve going TO THE MOON BUT SLIGHTLY SLOWER AND MAYBE IT WILL MISS ITS ANGLE OF ATTACK AND END UP FLYING OFF INTO SPACE FOREVER BUT I CAN’T TELL BASED SOLELY ON THE INFORMATION GIVEN!

These two curves are strictly monotonic with respect to one another: as one increases, so does the other.

Back to the Story

That cul-de-sac on monotonicity aside, let’s review our results. I’m including the results image again because who wants to scroll all the way back up there to see it? Not me, that’s for sure.

Finally we get to talk about ranking functions. That is, unless I go off on a rant about how “weakly monotonic” and “strongly monotonic” as economists use the terms feel like they’re backwards, but ultimately make sense if you spend a while thinking about them.

As a reminder, we are ordering everything based on InvoiceID. ROW_NUMBER() generates a unique, ascending number for each value, starting from 1. ROW_NUMBER(), along with the other ranking functions, is guaranteed to be non-strictly monotonic with respect to the ordering column(s). I can use ROW_NUMBER() to generate a unique identifier per row if I need one.

The RANK() and DENSE_RANK() functions are both very similar to one another and deal with ranking results. With ROW_NUMBER(), we generate a unique, ascending number. But when it comes to ranks, if we have a tie, we can’t differentiate between the two. Think of a sprint, in which we calculate how long it takes runners to make it a certain distance. If two competitors take exactly the same measured time, we can’t arbitrarily say that A wins over B, as the only thing we include (race time) was a tie. In that case, we count both as having the same place.

In the example above, we have one line for invoice 1, but then we have two lines for invoice 2. Because we are only ordering by invoice ID, we can’t differentiate invoice lines 2 and 3, and therefore our ranking functions assign both lines the value of 2 because of a tie. ROW_NUMBER() doesn’t care and arbitrarily assigned one the number 2 and the other the number 3 because that’s what ROW_NUMBER() does and if you wanted guarantees in life, you should have made the relationship strictly monotonic.

The brashness of ROW_NUMBER() aside, we can see that RANK() and DENSE_RANK() both agree on the course of action in the event of a tie, but where they differ is in the next value. Going back to the example of a sprint, if we have two people tie for second place, the next contestant doesn’t end up in third place; instead, he ends up in fourth. This is what the RANK() function does: it skips numbers based on the incidence of ties. We have two people in second place, so we no longer have a third place. Or, if we have three people tied for eighth place, we no longer have ninth or tenth places and jump straight to eleventh. Competitive events typically use RANK()-style rules for determining place.

But competitive events aren’t the only place in which we might rank. Sometimes, you want to collapse the numbers down and see how many ranks there are. That’s where DENSE_RANK() comes into play. We might use this for tiering: A, B, and C are in the top tier because they all score a 100. Then, D and E are in the second tier because they scored 95. F is in the third tier, with a score of 94.99999. Therefore, we have reasons to use both RANK() and DENSE_RANK().
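
To make that tiering example concrete, here’s a quick sketch using those scores (the contestant names and exact values are purely illustrative):

WITH scores AS
(
    SELECT v.Contestant, v.Score
    FROM (VALUES
        ('A', 100.0), ('B', 100.0), ('C', 100.0),
        ('D', 95.0), ('E', 95.0),
        ('F', 94.99999)) v(Contestant, Score)
)
SELECT
    s.Contestant,
    s.Score,
    -- RANK() skips numbers after ties: 1, 1, 1, 4, 4, 6.
    RANK() OVER (ORDER BY s.Score DESC) AS rk,
    -- DENSE_RANK() collapses them into tiers: 1, 1, 1, 2, 2, 3.
    DENSE_RANK() OVER (ORDER BY s.Score DESC) AS dr
FROM scores s
ORDER BY
    s.Score DESC;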

A Simple Question

What if we ordered everything by InvoiceLineID? Well, then our ranking function results would be strictly monotonic with respect to the element we are using for ordering, and so all three of them would increase 1 at a time alongside InvoiceLineID. This would lead to a rather boring and non-illustrative example, as I couldn’t show you the differences between the three ranking functions as they’d all give exactly the same results. So let’s not do that.

I Am Not a Number; I am a Free Man!

It’s now becoming apparent that I actually created this talk on window functions so that I could riff on the weird way economists think about monotonicity and to include at least one reference to The Prisoner, one of the best television shows ever.

We’ve looked at three of the four ranking functions so far, and now it’s time to move on to the final ranking function, NTILE(). This allows you to create percentiles…ish.

RANK(), DENSE_RANK(), and ROW_NUMBER() are all functions which accept zero parameters. NTILE() is a function which requires a single parameter: the number of -tiles. NTILE(100) would give you percentiles, NTILE(4) gives you quartiles, and so on. Like the other ranking functions, you need to include at least an ORDER BY clause and can also include a PARTITION BY clause (ranking functions do not accept a ROWS/RANGE frame). Here’s an example of one, which breaks the data out into five separate listings:

WITH records AS
(
    SELECT TOP(250)
        il.InvoiceLineID,
        il.InvoiceID,
        i.CustomerID,
        il.LineProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    ORDER BY il.InvoiceLineID
)
SELECT
    il.InvoiceLineID,
    il.InvoiceID,
    il.CustomerID,
    il.LineProfit,
    NTILE(3) OVER (ORDER BY il.LineProfit) AS nt3,
    NTILE(5) OVER (ORDER BY il.LineProfit) AS nt5,
    NTILE(10) OVER (ORDER BY il.LineProfit) AS nt10,
    NTILE(20) OVER (ORDER BY il.LineProfit) AS nt20,
    NTILE(100) OVER (ORDER BY il.LineProfit) AS nt100
FROM records il
ORDER BY
    il.LineProfit;

Instead of looking at all of the rows, I’m only getting the first 250. That way, you can see changes pretty quickly. Knowing that we have 250 rows makes it easy to break things down for something like NTILE(5): there would be 250 / 5, or 50, rows per “-tile.” But what happens when the number of groupings doesn’t divide evenly into the total number of rows? The remainder is spread from the top down. For 250 rows and NTILE(3), we have 3 groupings and 250 / 3 = 83 1/3, so each group gets 83 rows and group 1 gets one extra. For NTILE(20), each of the 20 groups gets 12 rows, and the 10 leftover rows (250 / 20 = 12 remainder 10) go to the first 10 groups, each of which gets an extra row. Let’s see it in action!

All of the tiles!

We can see that with NTILE(3), the first 84 rows are in group 1, and then we move to group 2. For NTILE(100), we see 3 rows per grouping, and can expect that to continue for the first 50 groups, continuing from there with 2 rows per group.
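
If you want to see the remainder rule in isolation, here’s a tiny sketch: ten rows split into three groups, so group 1 picks up the one extra row (4 rows, then 3, then 3):

WITH numbers AS
(
    SELECT v.n
    FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) v(n)
)
SELECT
    n.n,
    -- Rows 1-4 land in group 1, rows 5-7 in group 2, rows 8-10 in group 3.
    NTILE(3) OVER (ORDER BY n.n) AS nt3
FROM numbers n
ORDER BY
    n.n;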

Conclusion

Ranking functions are quite useful in SQL Server. We’ll see specific business uses of ranking functions later on in the series, but we can already see right now how useful it can be for creating order out of data and inciting me to long segues about monotonicity.

In the next post, we’re going to look at offset functions, another SQL Server 2012 addition.

Product Reviews: the Desklab Monitor

Back in 2019, I joined in on a kickstarter for the Desklab monitor, a portable touchscreen 4K monitor. This seemed like a great idea for me, as hey, I do a lot of travel and it’s really nice having a second monitor around. It wouldn’t be convenient while traveling, but at the hotel, having that second screen would make me a lot more productive, and so it’s definitely worth the money. With an arrival time of, oh, March of 2020, it’d arrive just in time for the spring conference season.

Then March of 2020 hit, and my travel went from dozens of cities per year to dozens of steps walked per year. The monitor’s production was delayed to boot, so I didn’t actually get it until about April of 2021. That was not a great experience, but they’ve worked all of those logistical problems out, so I won’t hold it against them.

What It Is

The Desklab monitor is a 15″ 4K touchscreen monitor which supports USB-C and HDMI. If you want the touchscreen to work, you’ll need to use USB-C. Also, if you’re using HDMI, you’ll need the A/C adapter to draw power; when using USB-C with a laptop, no A/C power is necessary.

I bought the magnetic stand/cover as well and would recommend it. It feels at first like it’s not sturdy enough to hold the monitor up, but as long as you aren’t knocking it around too much, it’s quite stable. I’ve had a Desklab monitor sitting on my desk at home for the past 6 months and can count on one hand the number of times I’ve had to re-set the magnetic stand—and all of those times came because I bumped the monitor by accident or had to move it, not because the monitor slipped off of its own accord.

How I’ve Used It

Because I haven’t done much travel in the past 18 months (he says, while typing this out in a hotel room), I decided to use it as a Voicemeeter board on my desktop PC, a job it performed admirably. I didn’t use the touch functionality very much because it “steals” the mouse pointer, but having a 15″ high-resolution screen nearby has been nice.

Ever since moving to a different audio input system, I haven’t needed Voicemeeter anymore. Now I keep my home and work e-mail on that screen, freeing up space on the main monitor for me to focus more on work and less on alt-tabbing to e-mail. It also forces the e-mail screen to be a little to the side and not right in front of my face, acting as a reminder that e-mail is asynchronous and if it waits another 30 seconds (or 30 minutes…), nobody will care.

I’ve also used it to play video streams, especially distractionary streams while I’m working. That way, I can keep the rest of my monitors available for work.

And Then Another

A couple months after getting my first Desklab, I ended up ordering a second one because I liked it so much. The original intent was to have a second monitor available for travel, but if I was going to have one strictly for my desktop setup (and frankly, that monitor’s not going anywhere), I’d want another for its intended purpose.

I can confirm that it “just works” with elementary OS, and I’d expect it to work on other Linux distributions as well. As a 15″ monitor, it’s just a little too big for my electronics backpack, but it fits nicely in a carry-on bag.

Last Thoughts

The Desklab is a little expensive, but it’s a high-quality portable monitor. You can get a 1080p version for less, though honestly, I’d just save up for the 4K version. This is a niche product, but if you’re on the go or want a second monitor about the same size as your laptop monitor, it’s worth it. It does also support connecting to phones and the Nintendo Switch, but I haven’t had much of a need for either of those scenarios and haven’t tried it out.

I’ll end with two quick bits of commentary. First, I have noticed that sometimes it powers on with 20 brightness, making it hard to see. I tend to bump it up to 50-60 brightness to match my other monitors. This leads me to my second comment, which is that making these sorts of adjustments has too much of a “fumble around until you get it” feel to me. There is obviously a specific way to get to the menu, select menu items, and set values, but you do that through three buttons on the side and it’s a bit awkward, especially because the mean time between config screen interactions is high enough that I typically forget which button does what and end up fumbling around for a while until I get it.

A Slice of Time: Aggregate Functions

This is part two in a series on window functions in SQL Server.

Last Time on 36 Chambers

In the prior post, we looked at the anatomy of a window function and said that it looks kind of like this:

SELECT
	Col1,
	Col2,
	WINDOW_FUNCTION() OVER(
			PARTITION BY Col1
			ORDER BY Col2
			ROWS BETWEEN UNBOUNDED PRECEDING
				AND CURRENT ROW
		) AS wf
FROM dbo.Table;

Now we’re going to look at a specific class of window functions: aggregate functions.

I Know These!

Even if you aren’t at all familiar with window functions, you’ve seen aggregate functions before. Here’s an example using the WideWorldImporters database.

SELECT
    i.CustomerID,
    SUM(il.LineProfit) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
GROUP BY
    i.CustomerID;

Here, we get the sum of LineProfit by CustomerID. Because SUM() is an aggregate function, we need a GROUP BY clause covering all non-aggregated columns. The full set of aggregate functions in T-SQL is available here, but you’ll probably be most familiar with MIN(), MAX(), SUM(), AVG(), and COUNT().

To turn this into a window function, we slap on an OVER() and boom! Note: “boom!” only works on SQL Server 2012 and later, so if you’re still on 2008 R2, it’s more of a fizzle than a boom.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    SUM(il.LineProfit) OVER () AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID;

So, uh, what’s going on here? Well, the answer is that the phrase OVER() translates to “over all rows, considering everything as one window, starting from the beginning of the window and progressing through the end of the window.” In other words, $85,729,180.90 is the sum of line profit over all of the invoice lines in the entire database.

Also note that there is no GROUP BY. This is because window aggregates behave a bit differently than normal aggregates. Here’s the query plan from the first operation:

We can see a pair of Hash Match aggregates. The first one gets us from 228,265 rows down to 70,510 (number of invoices), and the second gets us down to 663 (number of customers). Meanwhile, here’s the window function:

We’re still performing an aggregation, but that aggregate becomes an additional column rather than grouping together rows. When we turn an aggregate function into a window function, the number of rows remains the same as if we hadn’t included the window function at all.
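
One nice consequence of keeping every row is that you can compare each row against the aggregate in a single query. For example, here’s a sketch of a percent-of-total calculation (the PctOfTotalProfit alias and the NULLIF guard are mine):

SELECT
    i.InvoiceID,
    i.CustomerID,
    il.LineProfit,
    -- Each invoice line's share of total profit across the whole table.
    il.LineProfit / NULLIF(SUM(il.LineProfit) OVER (), 0) AS PctOfTotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID;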

Partitioning

Next up, let’s partition by customer ID.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY CustomerID;

We’ll still get one row per invoice line, but now the TotalProfit is a total profit per customer. Note that it resets as we move from one customer to the next. That’s because our window is now “over all rows for a given customer ID, starting from the beginning of the window and progressing through the end of the window.” This is already pretty useful because we could slap on line profit as a separate column and show the profit of an individual invoice line versus the total profit across the customer. We could even do it by invoice, too:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.LineProfit,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID, i.InvoiceID) As InvoiceProfit,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY CustomerID;

Running Totals

This gets us to our first interesting use case with window functions: the running total. Given a partition and order combination, we can calculate the running total of profitability. Let’s take a look:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate) AS RunningTotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

This is technically correct, but check out the running totals: we start with $260 but our running total is $316. That’s because our ORDER BY clause is based on invoice date, and in the event of ties, we treat all tied rows equally. The reason for this is that the default window for an aggregate function with an ORDER BY clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. What this means is “start from the beginning of the window and continue until you get to this row or any row whose ordering causes a tie with this row.” RANGE is the reason for that last clause, and if we don’t want tied rows lumped into the running total, we need to change the frame to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RunningTotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

Now our running total looks more like what we’d expect to see.
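
To see the RANGE-versus-ROWS difference in isolation, here’s a small sketch with two invoice lines on the same date (the dollar values are illustrative, loosely echoing the $260/$316 example above):

WITH lines AS
(
    SELECT v.InvoiceDate, v.InvoiceLineID, v.LineProfit
    FROM (VALUES
        (CAST('2013-01-01' AS date), 1, 260.00),
        (CAST('2013-01-01' AS date), 2, 56.00),
        (CAST('2013-01-02' AS date), 3, 100.00)) v(InvoiceDate, InvoiceLineID, LineProfit)
)
SELECT
    l.InvoiceDate,
    l.InvoiceLineID,
    l.LineProfit,
    -- RANGE treats tied dates as one unit: both 2013-01-01 rows show 316.00.
    SUM(l.LineProfit) OVER (
        ORDER BY l.InvoiceDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RangeRunningTotal,
    -- ROWS accumulates one row at a time; which tied row comes first is not
    -- guaranteed unless you add a tiebreaker to the ORDER BY.
    SUM(l.LineProfit) OVER (
        ORDER BY l.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RowsRunningTotal
FROM lines l
ORDER BY
    l.InvoiceDate,
    l.InvoiceLineID;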

A Medley of Aggregates

Finally, I want to point out that you can have multiple aggregate functions over multiple windows in the same query. I showed one example with several SUM() operations, but you can certainly have more than that:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RunningTotalProfit,
    SUM(il.LineProfit) OVER () AS TotalProfit,
    MIN(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS MinLineProfit,
    MAX(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS MaxLineProfit,
    AVG(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS AvgLineProfit,
    STDEV(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS StdevLineProfit,
    COUNT(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS CountLineProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

Given how complex this is, you might think that the execution plan will be horrible. Well, it’s…kind of not, maybe?

In spite of us having seven aggregate functions, we only have two window aggregate operations. The reason for this is that most of our window aggregates use the same frame: PARTITION BY i.CustomerID ORDER BY i.InvoiceDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When you have multiple aggregate functions which use the same frame, SQL Server’s query optimizer is smart enough to realize that it only needs to do the work once and can apply the frame results to each function. There is one function which uses a different frame: SUM(il.LineProfit) OVER(). Because the OVER() clause is different, this function requires a different frame than its peers, and so we need a different window aggregate operator. But on the whole, this is quite optimal compared to alternatives such as creating a lazy spool or performing multiple scans of the tables. We have one scan of the InvoiceLinesSmall table, one scan of an index on InvoicesSmall, and some compute work tracking aggregate results as we stream the rows. Not bad for a quarter-second’s work.

Conclusion

Aggregate window functions were not the first type of window function introduced in SQL Server—that honor goes to ranking functions, which we’ll discuss next time. But they were one of the best things to come out of a great release in SQL Server 2012. We have already seen one excellent use of them in calculating running totals, and we’ll see a few more as this series progresses.

Updates for the Innovate and Modernize Apps with Data and AI MCW

A little while ago, I talked about updating the Innovate and modernize apps with Data and AI Microsoft Cloud Workshop. Here’s a quick update with an overview of the changes.

A Simpler Architecture

Here was the architecture diagram prior to the update:

This all makes perfect sense, trust me.

Here’s the new diagram:

This therefore makes perfecter sense?

You can see that the diagram itself is considerably simpler, and there are a few things which helped here:

  • Azure Database for PostgreSQL is gone from the lab. There’s a place for a lab with Postgres, but this wasn’t a great example of it.
  • “Informational” links have been removed, including Cosmos OLTP to OLAP.
  • Fewer moving parts. The old diagram had data movement going in all directions. In this case, the data flow is almost always going from left to right, with the exception being the microservice for metadata at the end.

…Leading to Fewer Exercises

By removing Postgres from the equation, I was able to eliminate one exercise entirely and strip down another one, saving a good amount of time.

The other big change was one which saddened me a bit: removing IoT Hub from the equation. I was really proud of the process for setting up IoT Hub and IoT Edge, assigning a device to Edge, and emulating an IoT device. In the lab itself, the factory and IoT Hub components are now “integrated” into the Azure Function, which generates artificial telemetry data and writes it to Cosmos DB. I thought the IoT part was an exciting component of the lab, but at the same time, it takes about 1 hour to get it going and the Powers That Be preferred that we get to the heart of the exercise (Cosmos, Synapse Link, and microservice deployment) sooner.

More Troubleshooting

One area people have had problems in the past is in creating and deploying the Azure Machine Learning model. For this reason, I’ve added some troubleshooting steps to give learners a chance to resolve issues. You can get most of the relevant information from the Azure Machine Learning logs, assuming the container gets built and logging catches the issue. But those are not always reasonable assumptions.

Conclusion

Even stripping out the sections on IoT Hub, this is still one of the most comprehensive end-to-end MCWs out there. This is free content that trainers can develop full-day trainings around, with a couple hours for whiteboard design and 3-4 hours in the lab (previously, it was about 6 hours if nothing went wrong). So give this MCW a try and if you have any suggestions or run into problems, please do add a GitHub issue.

A Slice of Time: Overview

This is part one in a series on window functions in SQL Server.

What is a Window Function?

A window function is, quite simply, a function applied over a window of data. If that doesn’t sound impressively reductionist, I don’t know what does. So let’s unpack this a little bit further.

A window function operates over a window of data. Here’s an example of a tumbling window.

An example of a tumbling window.

What we see here is a set of events running over time. We have a window of a fixed size, perhaps one minute in length. We capture all of the events during this stretch and operate on the elements within each window.

What is a Window?

Another way to think about a window is that it is the context of operation for a function call. Suppose we have a dataset and we want to perform some analysis for each invoice ID.

Breaking out windows by invoice ID.

The way that we define the context of our window is using the OVER() clause in SQL.

What Constitutes a Window Function?

Speaking of OVER(), there are four things which are critical to a window function. They are:

  1. Use of the OVER() clause, which defines the window we use.
  2. A PARTITION BY clause, which allows us to split the rows into separate windows.
  3. An ORDER BY clause, which provides sorting when we need an ordered set and also gives meaning to frames when aggregating data, as we’ll see in future posts.
  4. A frame, using ROWS or RANGE, to filter within a partition.

Here’s a sample of what a window function looks like:

SELECT
	Col1,
	Col2,
	WINDOW_FUNCTION() OVER(
			PARTITION BY Col1
			ORDER BY Col2
			ROWS BETWEEN UNBOUNDED PRECEDING
				AND CURRENT ROW
		) AS wf
FROM dbo.Table;

Our window function uses the OVER() clause to define that we want to PARTITION BY column 1, meaning that as column 1’s value changes, we create new windows. We sort within each window by column 2. Finally, we have a frame which includes all rows from the beginning of our window up to and including the current row.

Why Should I Care?

Window functions can solve a variety of use cases in an efficient manner. Over the course of this series, I’ll show you how we can use window functions to solve the following problems:

  • Deleting duplicate rows in a table
  • Calculating running totals
  • Calculating moving averages
  • Calculating percents of a total
  • Getting the latest N items from a table
  • Calculating the distinct number of customers who have purchased a product over time

In addition to these, we’ll cover a whole lot more. As a quick taste of the first item on the list, here’s roughly what de-duplication looks like once you have window functions in your toolkit.
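
This is only a sketch using the placeholder table and columns from the template above (we’ll do it properly later in the series):

WITH dupes AS
(
    SELECT
        t.Col1,
        t.Col2,
        -- Number the rows within each (Col1, Col2) group; which duplicate
        -- gets row number 1 is arbitrary with this ORDER BY.
        ROW_NUMBER() OVER (
            PARTITION BY t.Col1, t.Col2
            ORDER BY t.Col1) AS rn
    FROM dbo.[Table] t
)
DELETE FROM dupes
WHERE rn > 1;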

Book Reviews: Fighting Churn with Data

Fighting Churn with Data by Carl Gold is an interesting concept with a pretty good implementation. I had a chance to read it while it was still in the Manning Early Access Program, an experience which always has its pros and cons. On the pro side, it was coming out right when I started to get into an initiative on customer churn at my company, so it was great timing. On the minus side, you could definitely see the chapters which had an editor’s touch versus the early drafts—there were typos, grammatical issues, and the things that authors simply won’t catch the first time around. To be clear, though, this isn’t a knock on the book itself; I haven’t checked the final product, but I’m sure that was cleaned up as the book went to print.

Gold does a good job of keeping the topic light. You can get into the weeds with churn, but his writing style keeps the book entertaining. Also, it helps that the book follows a few real companies as they try to fight churn, meaning that the metrics look realistic and the results are reasonable even if the actual numbers shown are randomly generated. Furthermore, they aren’t all in the same industry or using the same business model, so you do get some variety there.

The one thing that I think could have been better about the book is the sections on forecasting and machine learning. The author takes us through logistic regression and classification techniques, which are useful, but misses out on the chance to cover Kaplan-Meier survival curves or the Cox proportional hazard function. Those are definitely more complex to explain to business users, but they’re quite effective in forecasting, and I think the book would be better for having them or at least telling readers that these are additional techniques used in the field.

Other than that critique, I was happy with the book and recommend it to people looking for a way to understand churn risks and rates. The approach is pragmatic and I think data scientists and business analysts can both get something out of it.

Gold has a website dedicated to the book and additional resources, too.