A Slice of Time: Statistical Functions

This is part five in a series on window functions in SQL Server.

Reviewing the Statistical Functions

The next set of window functions we will look at are the statistical functions. There are four of them: PERCENT_RANK(), CUME_DIST(), PERCENTILE_CONT(), and PERCENTILE_DISC().

PERCENT_RANK and CUME_DIST

PERCENT_RANK() is a fairly straightforward function: it ranks each row within its window based on the ORDER BY clause and returns values ranging from 0 to 1, inclusive. Here’s an example of a query which generates rankings across 66,463 invoice line items:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(ORDER BY r.DailyProfit) AS PercentRank
FROM records r
ORDER BY
    PercentRank DESC;
Ranking daily profits.

Note that we are not expressing a percentage of the values themselves, but rather a percentage of rank. That’s why the second row has the value it does: it is in the 99.998th percentile, even though its value is a mere 77.2% of the largest.
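Under the hood, PERCENT_RANK() works out to (rank − 1) / (rows in window − 1), where rank is the standard RANK() value. Here’s a minimal Python sketch of that calculation, using made-up profit figures rather than the Wide World Importers data:

```python
def percent_rank(values):
    """Compute PERCENT_RANK() for each value: (RANK() - 1) / (N - 1).

    Ties share the lowest applicable rank, just like RANK() in SQL.
    """
    n = len(values)
    ranks = {}
    for v in set(values):
        # RANK() = 1 + number of rows strictly smaller than this value
        ranks[v] = 1 + sum(1 for x in values if x < v)
    return [(ranks[v] - 1) / (n - 1) for v in values]

profits = [100.0, 250.0, 250.0, 400.0, 975.0]
print(percent_rank(profits))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```

Note how the smallest value is forced to 0 and the largest to 1, regardless of how far apart the actual profits are, which is exactly the rank-not-value behavior described above.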

Breaking it down by customer, we can see that everything works exactly as we’d expect:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS PercentRank
FROM records r
ORDER BY
    r.CustomerID,
    PercentRank;
PERCENT_RANK also handles negative values, no problem.

Cumulative distribution, or CUME_DIST(), calculates the percentage of rows with a value less than or equal to the current row. There is one big difference between this function and PERCENT_RANK(). Let’s take a look at an example to see what it is:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.InvoiceDate,
    r.CustomerID,
    r.DailyProfit,
    PERCENT_RANK() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS PercentRank,
    CUME_DIST() OVER(
        PARTITION BY r.CustomerID
        ORDER BY r.DailyProfit) AS CumeDist
FROM records r
ORDER BY
    r.CustomerID,
    PercentRank;
Compare and contrast time!

CUME_DIST() doesn’t show 0 for the smallest record. The reason for this is in the definition: CUME_DIST() tells us how far along we are in describing the entire set—that is, what percentage of values have we covered so far. This percentage is always greater than 0. By contrast, PERCENT_RANK() forces the lowest value to be 0 and the highest value to be 1.

Another thing to note is ties. There are 117 values for customer 1 in my dataset. Rows 5 and 6 both have a percent rank of 0.0344, which is approximately rank 4 (remembering that we start from 0, not 1). Both rows 5 and 6 have the same rank of 4, and then we move up to a rank of 6. Meanwhile, for cumulative distribution, we see that rows 5 and 6 have a cumulative distribution of 6/117 ≈ 0.05128. In other words, PERCENT_RANK() ties get the lowest possible value, whereas CUME_DIST() ties get the highest possible value.
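We can confirm the tie behavior with a small Python sketch (toy numbers, not the real invoice data): percent rank gives tied rows the lowest applicable rank, while cumulative distribution counts every row at or below the tied value.

```python
def percent_rank(values):
    # (RANK() - 1) / (N - 1); ties take the lowest rank
    n = len(values)
    return [sum(1 for x in values if x < v) / (n - 1) for v in values]

def cume_dist(values):
    # rows with value <= current, divided by N; ties take the highest count
    n = len(values)
    return [sum(1 for x in values if x <= v) / n for v in values]

vals = [10, 20, 20, 30]
print(percent_rank(vals))  # [0.0, 0.333..., 0.333..., 1.0] -- ties rank low
print(cume_dist(vals))     # [0.25, 0.75, 0.75, 1.0]        -- ties count high
```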

As a quick side note, the reason the PERCENT_RANK() values are always slightly smaller than the CUME_DIST() values comes down to the formulas: PERCENT_RANK() is (rank − 1) / (N − 1), while CUME_DIST() is (number of rows at or below the current value) / N. Those denominators and numerators only line up at the very top of the range, which is why the two columns track closely but never quite match.

Percentiles

The other two functions are PERCENTILE_CONT() and PERCENTILE_DISC(), which stand for continuous and discrete percentiles, respectively. Both take the same inputs but differ in one important way, as we’ll see. Let’s start with continuous percentiles and build out a five-number summary: the minimum value, maximum value, median (50th percentile), and the first and third quartiles (25th and 75th percentiles, respectively). The T-SQL to do this is really easy:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.CustomerID,
    r.InvoiceDate,
    r.DailyProfit,
    PERCENTILE_CONT(0.0) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Min,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Q1,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Q3,
    PERCENTILE_CONT(1.0) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Max
FROM records r
ORDER BY
    r.CustomerID DESC,
    r.InvoiceDate;

PERCENTILE_CONT() takes in one parameter, which represents the percentile you want to see, ranging from 0 to 1. We also have a WITHIN GROUP clause here that I want to talk about for a moment, as we haven’t seen one of these yet. When working with certain window functions, we need to perform an operation inside the individual window. Here, we’ve partitioned the data by customer ID, meaning that we’ll calculate percentiles for each customer independently. Within the partition, I want to order the data by daily profit so that we can calculate the percentiles.
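For reference, a continuous percentile finds the (possibly fractional) row position p × (N − 1) in the sorted window and linearly interpolates between the two surrounding rows. Here’s a rough Python sketch of that logic on invented numbers:

```python
import math

def percentile_cont(values, p):
    """Continuous percentile via linear interpolation between sorted rows."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)            # fractional row index, 0-based
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] + (xs[hi] - xs[lo]) * frac

data = [10.0, 20.0, 30.0, 40.0]
print(percentile_cont(data, 0.5))   # 25.0 -- interpolated; not in the dataset
print(percentile_cont(data, 0.25))  # 17.5
print(percentile_cont(data, 1.0))   # 40.0 -- the max, as in the query above
```

The interpolation is why PERCENTILE_CONT() can return a number that never appears in your data, and why 0.0 and 1.0 conveniently give you the minimum and maximum.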

Here’s how the data looks, sorted by customer descending because it’s time for a change.

Calculating percentiles.

This looks really cool: I can do the same type of work that I’d do in R within SQL Server and not have to worry about setting up ML Services. But wait! There’s a tiny catch: the execution plan.

Not a great beginning.

So yeah, we have two window aggregates…wait, my producer is telling me that the image isn’t complete. Let’s stand by for a live feed:

This looks fine.

Um…that’s a lot of operators for a pretty simple query. For those keeping track at home, there are 7 window aggregates and 8 sort operations for this query! Everything uses the same WITHIN GROUP and PARTITION BY clauses, and yet we get a window aggregate for each one, plus a couple of bonus window aggregates thrown in for free query bucks. Even asking for a single percentile, such as the median alone, necessitates 3 window aggregates and 4 sorts. Performance is fine when you’re talking about thousands of rows; once you talk about millions of rows, forget about it. On my Cray supercomputer desktop at home, getting a 5-number summary against 66,463 invoice records takes 107ms to complete, with 383ms of CPU and 5,515 reads. Extend this to 509,219 records in a larger version of this table and we have a duration of 824ms with 2,446ms of CPU and 11,732 reads. In real-life production situations, the PERCENTILE_X() functions can bog down badly and leave you hurting.

Discrete Percentiles

There is another type of percentile query: PERCENTILE_DISC(). The difference between the continuous version and the discrete version is that the continuous version will interpolate values, whereas discrete will guarantee a value that’s in your dataset, even if it’s not statistically accurate. Here’s what I mean:

WITH records AS
(
    SELECT
        i.InvoiceDate,
        i.CustomerID,
        SUM(il.LineProfit) AS DailyProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    GROUP BY
        i.InvoiceDate,
        i.CustomerID
)
SELECT
    r.CustomerID,
    r.InvoiceDate,
    r.DailyProfit,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS MedianDisc,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY r.DailyProfit) OVER (PARTITION BY r.CustomerID) AS Median
FROM records r
ORDER BY
    r.CustomerID DESC,
    r.InvoiceDate;
Calculating a median, the Goofus and Gallant way.

In case you’re not too familiar with the idea, calculating a median is pretty easy. It’s a lot like prison, really: you first find the biggest, meanest number in the dataset and shank it. Then, you take out the smallest one just to make sure it doesn’t do the same to you. Rinse and repeat until you have either 1 value or 2 values remaining. If you have one value left, congratulations: that’s the median! If you have two values left, you find the midpoint between those values and declare it the true median.
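The prison-yard algorithm translates to Python pretty directly. A sketch, with made-up numbers:

```python
def median(values):
    """Repeatedly strip the largest and smallest values until 1 or 2 remain."""
    xs = sorted(values)
    while len(xs) > 2:
        xs = xs[1:-1]  # shank the biggest, take out the smallest
    # One survivor: that's the median. Two survivors: split the difference.
    return xs[0] if len(xs) == 1 else (xs[0] + xs[1]) / 2

print(median([3, 1, 4, 1, 5]))  # 3 -- odd count, one survivor
print(median([3, 1, 4, 1]))     # 2.0 -- midpoint of the last two standing
```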

In the first example above, we can see three values for customer 1061. We remove $4520.10 because it is the largest; in a corresponding move, we remove $2006.65. That leaves us one value, $3628.10, and both the discrete and continuous versions agree on this.

The second example, however, brooks some disagreement. Just like before, we knock out the top dog, $1587.40. Then, we take out the little guy, $464. This leaves us two values. The discrete version takes the smaller of the two values and declares $486 the winner. The continuous version does things correctly and declares $604.50 the true median.
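Recreating that second example in Python makes the split concrete. One caveat: the $723.00 value below is inferred, not shown in the result set, but it’s the only value that makes the continuous midpoint come out to $604.50.

```python
def pct_disc(values, p):
    # PERCENTILE_DISC: first value whose cumulative distribution >= p,
    # so the answer is always an actual value from the dataset
    xs = sorted(values)
    n = len(xs)
    for i, v in enumerate(xs, start=1):
        if i / n >= p:
            return v

def pct_cont(values, p):
    # PERCENTILE_CONT: linear interpolation at position p * (N - 1)
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

# 723.00 is inferred from the $604.50 midpoint in the screenshot
profits = [464.00, 486.00, 723.00, 1587.40]
print(pct_disc(profits, 0.5))  # 486.0 -- the smaller of the two survivors
print(pct_cont(profits, 0.5))  # 604.5 -- the interpolated midpoint
```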

Note that I have a decided bias in my depiction of continuous and discrete percentiles. My recommendation is that you almost always use continuous percentiles, as those will give you the correct answer. The only time I’d use discrete percentiles is if you then need to join back to a dataset based on that calculated percentile and get more information. Otherwise, stick to the continuous versions.

Also note that PERCENTILE_DISC() is a dog for performance as well. You won’t really gain anything by bringinging it in, though on the plus(?) side, it’s at least not slower than the continuous version.

Conclusion

Today we covered four more window functions. Two of them have to do with figuring out how far down a road you are, and the other two have to do with calculating percentages. I don’t tend to need percentage ranks or cumulative distributions that often, and although I’d like to use the percentiles more, they’re just too slow in SQL Server 2019. Admittedly, they were extremely fast in SQL Server 2008 R2 and before: you’d get a syntax error mighty fast, with hardly any server utilization.
