This is part three in a series on window functions in SQL Server.
The Original Windows
Last time around, we looked at aggregate window functions, which I think are the easiest for someone brand new to the concept to understand. Today, we’ll shift to the first window functions introduced to SQL Server in 2005: ranking window functions.
They’ve Given You a Number and Take Away Your Name
The whole concept of ranking window functions is to assign some numeric ordering to a dataset. There are four ranking functions in SQL Server. Three of them are very similar to one another:
DENSE_RANK(). The fourth one,
NTILE(), is the odd cousin of the family.
Unlike aggregate window functions, all ranking window functions must have at least an
ORDER BY clause in the
OVER() operator. The reason is that you are attempting to bring order to the chaos of your data by assigning a number based on the order in which you join.
Row Numbers, Ranks, and Dense Ranks
Let’s show off what the three normal ranking functions can do:
SELECT TOP(250) il.InvoiceLineID, il.InvoiceID, i.CustomerID, ROW_NUMBER() OVER (ORDER BY il.InvoiceID) AS rownum, DENSE_RANK() OVER (ORDER BY il.InvoiceID) AS dr, RANK() OVER (ORDER BY il.InvoiceID) AS rk FROM Sales.InvoiceLinesSmall il INNER JOIN Sales.InvoicesSmall i ON il.InvoiceID = i.InvoiceID ORDER BY il.InvoiceLineID, rownum;
In this case, we’re ordering everything by invoice ID because it shows off the differences between the three functions quite nicely.
If you ever hear me present and I talk about the
ROW_NUMBER() function, ingrained habit has me speak of “monotonically increasing” yada yada because grad school does weird things to your brain. Let’s talk about what I mean by this with a quick discussion of monotonicity.
When I say that a variable is monotonic with respect to another variable, what that means is that when the value of one goes up, the other does as well; when the value of one goes down, the other does as well. Take this hastily-drawn example:
As one curve goes up, the other one does as well. As a curve goes down, so too does the other. They are in lock-step when it comes to direction. Important here is that monotonic says nothing about whether the two variables remain equidistant. You can see in this image that at some points, the difference between the two curves at some given X is greater than at other points. We can’t use monotonicity as a shortcut for “equivalent.”
To add a bit to the confusion, you can also talk about the function which generated these curves as monotonic or not. In that case, a function is monotonic if its results move in one direction. Because our curves move both up and down, the curves themselves are not monotonic, but they are monotonic with respect to each other.
Strict or Not?
There are two variants of monotonicity we care about: strict and non-strict monotonicity. Non-strict monotonicity allows for “flat spots” in the curve. For example, the two lines in the following image exhibit non-strict monotonicity.
These two line segments are monotonic with respect to each other, even though there’s a flat spot in one but the other consistently increases. When I gave a broad explanation of monotonicity (one curve goes up, so does the other; one curve goes down, so does the other), that’s technically the depiction of non-strict monotonicity, as I don’t make any reference to these sorts of flat spots.
When we look at strict monotonicity, we include flat spots: as one curve goes up, so does the other; as one curve goes down, so does the other; as one curve stays flat, so does the other. The above image does not two variables in a strictly monotonic relationship, but the following image does:
These two curves are strictly monotonic with respect to one another: as one increases, so does the other.
Back to the Story
That cul-de-sac on monotonicity aside, let’s review our results. I’m including the results image again because who wants to scroll all the way back up there to see it? Not me, that’s for sure.
As a reminder, we are ordering everything based on InvoiceID.
ROW_NUMBER() generates a unique, ascending number for each value, starting from 1.
ROW_NUMBER(), along with the other ranking functions, is guaranteed to be non-strictly monotonic with respect to the ordering column(s). I can use
ROW_NUMBER() to generate a unique identifier per row if I need one.
DENSE_RANK() functions are both very similar to one another and deal with ranking results. With
ROW_NUMBER(), we generate a unique, ascending number. But when it comes to ranks, if we have a tie, we can’t differentiate between the two. Think of a sprint, in which we calculate how long it takes runners to make it a certain distance. If two competitors take exactly the same measured time, we can’t arbitrarily say that A wins over B, as the only thing we include (race time) was a tie. In that case, we count both as having the same place.
In the example above, we have one line for invoice 1, but then we have two lines for invoice 2. Because we are only ordering by invoice ID, we can’t differentiate invoice lines 2 and 3, and therefore our ranking functions assign both lines the value of 2 because of a tie.
ROW_NUMBER() doesn’t care and arbitrarily assigned one the number 2 and the other the number 3 because that’s what
ROW_NUMBER() does and if you wanted guarantees in life, you should have made the relationship strictly monotonic.
The brashness of
ROW_NUMBER() aside, we can see that
DENSE_RANK() both agree on the course of action in the event of a tie, but where they differ is in the next value. Going back to the example of a sprint, if we have two people tie for second place, the next contestant doesn’t end up in third place; instead, he ends up in fourth. This is what the
RANK() function does: it skips numbers based on the incidence of ties. We have two people in second place, so therefore we no longer have a third place. OR we have three people in eighth place, so we no longer have ninth or tenth places and move to 11. Competitive events typically use
RANK() style rules for determining place.
But competitive events aren’t the only place in which we might rank. Sometimes, you want to collapse the numbers down and see how many ranks there are. That’s where
DENSE_RANK() comes into play. We might use this for tiering: A, B, and C are in the top tier because they all score a 100. Then, D and E are in the second tier because they scored 95. F is in the third tier, with a score of 94.99999. Therefore, we have reasons to use both
A Simple Question
What if we ordered everything by InvoiceLineID? Well, then our ranking function results would be strictly monotonic with respect to the element we are using for ordering, and so all three of them would increase 1 at a time alongside InvoiceLineID. This would lead to a rather boring and non-illustrative example, as I couldn’t show you the differences between the three ranking functions as they’d all give exactly the same results. So let’s not do that.
I Am Not a Number; I am a Free Man!
We’ve looked at three of the four ranking functions so far, and now it’s time to move on to the final ranking function,
NTILE(). This allows you to create percentiles…ish.
ROW_NUMBER() are all functions which accept zero parameters.
NTILE() is a function which requires a single parameter: the number of -tiles.
NTILE(100) would give you percentiles,
NTILE(4) gives you quartiles, and so on. Like other ranking functions, you need to include at least an
ORDER BY clause and can also include
PARTITION BY and
RANGE/ROWS. Here’s an example of one, which breaks the data out into five separate listings:
WITH records AS ( SELECT TOP(250) il.InvoiceLineID, il.InvoiceID, i.CustomerID, il.LineProfit FROM Sales.InvoiceLinesSmall il INNER JOIN Sales.InvoicesSmall i ON il.InvoiceID = i.InvoiceID ORDER BY il.InvoiceLineID ) SELECT il.InvoiceLineID, il.InvoiceID, il.CustomerID, il.LineProfit, NTILE(3) OVER (ORDER BY il.LineProfit) AS nt3, NTILE(5) OVER (ORDER BY il.LineProfit) AS nt5, NTILE(10) OVER (ORDER BY il.LineProfit) AS nt10, NTILE(20) OVER (ORDER BY il.LineProfit) AS nt20, NTILE(100) OVER (ORDER BY il.LineProfit) AS nt100 FROM records il ORDER BY il.LineProfit;
Instead of looking at all of the rows, I’m only getting the first 250. That way, you can see changes pretty quickly. Before I show the results, knowing that we have 250 rows, that makes it easy to break down for things like
NTILE(5), meaning that there’d be 250 / 5 or 50 rows per “-tile.” But what happens when you can’t cleanly divide your number of groupings into the total number of rows? What happens is that the remainder is spread from top to bottom. What this means is that, for 250 rows and
NTILE(3), we have 3 groupings. 250 / 3 = 83 1/3. That means that each group (1, 2, and 3) will get 83 rows, and group 1 will get an extra row. For
NTILE(20), each of the 20 groups will get 12 rows, and there are 10 rows left over (250 / 20 = 12 10/20), so the first 10 groups will get an extra row. Let’s see it in action!
We can see that with
NTILE(3), the first 84 rows are in group 1, and then we move to group 2. For
NTILE(100), we see 3 rows per grouping, and can expect that to continue for the first 50 groups, continuing from there with 2 rows per group.
Ranking functions are quite useful in SQL Server. We’ll see specific business uses of ranking functions later on in the series, but we can already see right now how useful it can be for creating order out of data and inciting me to long segues about monotonicity.
In the next post, we’re going to look at offset functions, another SQL Server 2012 addition.
2 thoughts on “A Slice of Time: Ranking Functions”