A Slice of Time: Ranking Functions

This is part three in a series on window functions in SQL Server.

The Original Windows

Last time around, we looked at aggregate window functions, which I think are the easiest for someone brand new to the concept to understand. Today, we’ll shift to the first window functions introduced to SQL Server in 2005: ranking window functions.

They’ve Given You a Number and Take Away Your Name

The whole concept of ranking window functions is to assign some numeric ordering to a dataset. There are four ranking functions in SQL Server. Three of them are very similar to one another: ROW_NUMBER(), RANK(), DENSE_RANK(). The fourth one, NTILE(), is the odd cousin of the family.

Unlike aggregate window functions, all ranking window functions must have at least an ORDER BY clause in the OVER() operator. The reason is that you are attempting to bring order to the chaos of your data by assigning a number based on the order in which you join.

Row Numbers, Ranks, and Dense Ranks

Let’s show off what the three normal ranking functions can do:

SELECT TOP(250)
    il.InvoiceLineID,
    il.InvoiceID,
    i.CustomerID,
    ROW_NUMBER() OVER (ORDER BY il.InvoiceID) AS rownum,
    DENSE_RANK() OVER (ORDER BY il.InvoiceID) AS dr,
    RANK() OVER (ORDER BY il.InvoiceID) AS rk
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    il.InvoiceLineID,
	rownum;
Order!

In this case, we’re ordering everything by invoice ID because it shows off the differences between the three functions quite nicely.

Monotonicity

If you ever hear me present and I talk about the ROW_NUMBER() function, ingrained habit has me speak of “monotonically increasing” yada yada because grad school does weird things to your brain. Let’s talk about what I mean by this with a quick discussion of monotonicity.

When I say that a variable is monotonic with respect to another variable, what that means is that when the value of one goes up, the other does as well; when the value of one goes down, the other does as well. Take this hastily-drawn example:

Monotonicity in action.

As one curve goes up, the other one does as well. As a curve goes down, so too does the other. They are in lock-step when it comes to direction. Important here is that monotonic says nothing about whether the two variables remain equidistant. You can see in this image that at some points, the difference between the two curves at some given X is greater than at other points. We can’t use monotonicity as a shortcut for “equivalent.”

To add a bit to the confusion, you can also talk about the function which generated these curves as monotonic or not. In that case, a function is monotonic if its results move in one direction. Because our curves move both up and down, the curves themselves are not monotonic, but they are monotonic with respect to each other.

Strict or Not?

Not pictured: the skinny line going TO THE MOON.

There are two variants of monotonicity we care about: strict and non-strict monotonicity. Non-strict monotonicity allows for “flat spots” in the curve. For example, the two lines in the following image exhibit non-strict monotonicity.

These two line segments are monotonic with respect to each other, even though there’s a flat spot in one but the other consistently increases. When I gave a broad explanation of monotonicity (one curve goes up, so does the other; one curve goes down, so does the other), that’s technically the depiction of non-strict monotonicity, as I don’t make any reference to these sorts of flat spots.

When we look at strict monotonicity, we include flat spots: as one curve goes up, so does the other; as one curve goes down, so does the other; as one curve stays flat, so does the other. The above image does not two variables in a strictly monotonic relationship, but the following image does:

Pictured: one curve going TO THE MOON! The other curve going TO THE MOON BUT SLIGHTLY SLOWER AND MAYBE IT WILL MISS ITS ANGLE OF ATTACK AND END UP FLYING OFF INTO SPACE FOREVER BUT I CAN’T TELL BASED SOLELY ON THE INFORMATION GIVEN!

These two curves are strictly monotonic with respect to one another: as one increases, so does the other.

Back to the Story

That cul-de-sac on monotonicity aside, let’s review our results. I’m including the results image again because who wants to scroll all the way back up there to see it? Not me, that’s for sure.

Finally we get to talk about ranking functions. That is, unless I go off on a rant about how “weakly monotonic” and “strongly monotonic” as economists use the terms feel like they’re backwards, but ultimately make sense if you spend a while thinking about them.

As a reminder, we are ordering everything based on InvoiceID. ROW_NUMBER() generates a unique, ascending number for each value, starting from 1. ROW_NUMBER(), along with the other ranking functions, is guaranteed to be non-strictly monotonic with respect to the ordering column(s). I can use ROW_NUMBER() to generate a unique identifier per row if I need one.

The RANK() and DENSE_RANK() functions are both very similar to one another and deal with ranking results. With ROW_NUMBER(), we generate a unique, ascending number. But when it comes to ranks, if we have a tie, we can’t differentiate between the two. Think of a sprint, in which we calculate how long it takes runners to make it a certain distance. If two competitors take exactly the same measured time, we can’t arbitrarily say that A wins over B, as the only thing we include (race time) was a tie. In that case, we count both as having the same place.

In the example above, we have one line for invoice 1, but then we have two lines for invoice 2. Because we are only ordering by invoice ID, we can’t differentiate invoice lines 2 and 3, and therefore our ranking functions assign both lines the value of 2 because of a tie. ROW_NUMBER() doesn’t care and arbitrarily assigned one the number 2 and the other the number 3 because that’s what ROW_NUMBER() does and if you wanted guarantees in life, you should have made the relationship strictly monotonic.

The brashness of ROW_NUMBER() aside, we can see that RANK() and DENSE_RANK() both agree on the course of action in the event of a tie, but where they differ is in the next value. Going back to the example of a sprint, if we have two people tie for second place, the next contestant doesn’t end up in third place; instead, he ends up in fourth. This is what the RANK() function does: it skips numbers based on the incidence of ties. We have two people in second place, so therefore we no longer have a third place. OR we have three people in eighth place, so we no longer have ninth or tenth places and move to 11. Competitive events typically use RANK() style rules for determining place.

But competitive events aren’t the only place in which we might rank. Sometimes, you want to collapse the numbers down and see how many ranks there are. That’s where DENSE_RANK() comes into play. We might use this for tiering: A, B, and C are in the top tier because they all score a 100. Then, D and E are in the second tier because they scored 95. F is in the third tier, with a score of 94.99999. Therefore, we have reasons to use both RANK() and DENSE_RANK().

A Simple Question

What if we ordered everything by InvoiceLineID? Well, then our ranking function results would be strictly monotonic with respect to the element we are using for ordering, and so all three of them would increase 1 at a time alongside InvoiceLineID. This would lead to a rather boring and non-illustrative example, as I couldn’t show you the differences between the three ranking functions as they’d all give exactly the same results. So let’s not do that.

I Am Not a Number; I am a Free Man!

It’s now becoming apparent that I actually created this talk on window functions so that I could riff on the weird way economists think about monotonicity and to include at least one reference to The Prisoner, one of the best television shows ever.

We’ve looked at three of the four ranking functions so far, and now it’s time to move on to the final ranking function, NTILE(). This allows you to create percentiles…ish.

RANK(), DENSE_RANK(), and ROW_NUMBER() are all functions which accept zero parameters. NTILE() is a function which requires a single parameter: the number of -tiles. NTILE(100) would give you percentiles, NTILE(4) gives you quartiles, and so on. Like other ranking functions, you need to include at least an ORDER BY clause and can also include PARTITION BY and RANGE/ROWS. Here’s an example of one, which breaks the data out into five separate listings:

WITH records AS
(
    SELECT TOP(250)
        il.InvoiceLineID,
        il.InvoiceID,
        i.CustomerID,
        il.LineProfit
    FROM Sales.InvoiceLinesSmall il
        INNER JOIN Sales.InvoicesSmall i
            ON il.InvoiceID = i.InvoiceID
    ORDER BY il.InvoiceLineID
)
SELECT
    il.InvoiceLineID,
    il.InvoiceID,
    il.CustomerID,
    il.LineProfit,
    NTILE(3) OVER (ORDER BY il.LineProfit) AS nt3,
    NTILE(5) OVER (ORDER BY il.LineProfit) AS nt5,
    NTILE(10) OVER (ORDER BY il.LineProfit) AS nt10,
    NTILE(20) OVER (ORDER BY il.LineProfit) AS nt20,
    NTILE(100) OVER (ORDER BY il.LineProfit) AS nt100
FROM records il
ORDER BY
    il.LineProfit;

Instead of looking at all of the rows, I’m only getting the first 250. That way, you can see changes pretty quickly. Before I show the results, knowing that we have 250 rows, that makes it easy to break down for things like NTILE(5), meaning that there’d be 250 / 5 or 50 rows per “-tile.” But what happens when you can’t cleanly divide your number of groupings into the total number of rows? What happens is that the remainder is spread from top to bottom. What this means is that, for 250 rows and NTILE(3), we have 3 groupings. 250 / 3 = 83 1/3. That means that each group (1, 2, and 3) will get 83 rows, and group 1 will get an extra row. For NTILE(20), each of the 20 groups will get 12 rows, and there are 10 rows left over (250 / 20 = 12 10/20), so the first 10 groups will get an extra row. Let’s see it in action!

All of the tiles!

We can see that with NTILE(3), the first 84 rows are in group 1, and then we move to group 2. For NTILE(100), we see 3 rows per grouping, and can expect that to continue for the first 50 groups, continuing from there with 2 rows per group.

Conclusion

Ranking functions are quite useful in SQL Server. We’ll see specific business uses of ranking functions later on in the series, but we can already see right now how useful it can be for creating order out of data and inciting me to long segues about monotonicity.

In the next post, we’re going to look at offset functions, another SQL Server 2012 addition.