This is part seven in a series on window functions in SQL Server.
The Road So Far
To this point, we’ve looked at five classes of window function in SQL Server. I’ve given you a couple of solid use cases, but for the most part, we’ve focused on what the classes of window functions are. Now we’re going to talk about why you want to use them.
What’s the Point?
Window functions help us solve a variety of problem types in a rather efficient way. Let’s cover several in turn.
Deleting Duplicate Rows
Suppose you have a really great event system. You track events and numbers of attendees, and your app works really well: it even inserts data when you ask it to!
CREATE TABLE #t1 ( Id INT IDENTITY(1,1) NOT NULL, EventDate DATE NOT NULL, NumberOfAttendees INT NOT NULL ); -- Our first round of insertions INSERT INTO #t1(EventDate, NumberOfAttendees) VALUES ('2021-04-01', 150), ('2021-04-08', 165), ('2021-04-15', 144), ('2021-04-22', 170), ('2021-04-29', 168), ('2021-05-06', 164), ('2021-05-13', 152), ('2021-05-20', 158), ('2021-05-27', 168), ('2021-06-03', 170), ('2021-06-10', 161), ('2021-06-17', 155);
In fact, it’s so great that sometimes it inserts the same data multiple times, which I’m pretty sure is how you back up the data in a database.
INSERT INTO #t1(EventDate, NumberOfAttendees) VALUES ('2021-04-01', 150), ('2021-04-01', 150), ('2021-04-01', 150), ('2021-04-01', 150), ('2021-04-08', 165), ('2021-04-08', 165), ('2021-04-29', 168);
Some people might appreciate having 5 copies of their data for maximum redundancy, but it turns out that this makes it difficult for event organizers to understand what’s going on, and they keep getting weird results in the app because we can only have one event per day, just like in real life.
So how do we delete the bad rows? Well, we could give an intern a print-out copy of a spreadsheet containing all of the data in the table and then hand-write
DELETE statements based on the ID column. Actually, forget this whole window function thing—let’s just hire a whole bunch of interns. Problem solved!
…My producer is telling me that this is not a good enough solution to the problem and if I want to finish up this blog post, I have to solve it using window functions. So let’s do that so I can
go home continue being at home.
The following query gets me a unique, ascending row number for each event date, sorted by the ID of the event. We’re going to call that
rownum because I am not clever with naming.
SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, ROW_NUMBER() OVER (PARTITION BY t1.EventDate ORDER BY t1.Id) AS rownum FROM #t1 t1
Once we have our row number in place, we can perform the kill shot:
WITH records AS ( SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, ROW_NUMBER() OVER (PARTITION BY t1.EventDate ORDER BY t1.Id) AS rownum FROM #t1 t1 ) DELETE FROM records WHERE rownum > 1;
Doing this gets us back to having one record per event.
Yes, we talked about this in a prior blog post. We’re talking about it again here because it’s just that important. Let’s say that we want to know how many attendees have come to our events over time. We can use a running total to get those results:
SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, SUM(t1.NumberOfAttendees) OVER ( ORDER BY t1.EventDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS RunningTotal FROM #t1 t1;
Let’s say that we’re trying to talk advertisers into giving us all kinds of money to sponsor our events. We could show them attendee counts, but what if there’s a down week or a boom week? We don’t want them to think every week is like that. One good way to deal with ups and downs is to smooth things using moving averages. Here, we’re going to calculate the average number of attendees over a three-event period.
SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, AVG(t1.NumberOfAttendees) OVER ( ORDER BY t1.EventDate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS MovingAverage3 FROM #t1 t1;
The way this works is, if we have fewer than three weeks in a window, we take the average of however many weeks we do have. Thus, the first week has 150 attendees and a moving average of 150. The second week has 165 attendees and a moving average of (150+165)/2 = 157 (clip off the decimal because we’re doing integer math here). Once we get to the third week, we have (150+165+144)/3 = 153. For week four, we drop off the first entry because we only want the current row and the two preceding rows: (165+144+170)/3 = 159.
One quick point here is that for moving averages, we’re generally going to want to use
ROWS as the frame rather than
RANGE. Why? For sad reasons that I’ll talk about in a later post.
Percentage of a Total and Cumulative Percent
Okay, we’ve seen some of the basics, but let’s get a little crazy.
SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, SUM(t1.NumberOfAttendees) OVER ( ORDER BY t1.EventDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS RunningTotal, SUM(t1.NumberOfAttendees) OVER () AS GrandTotal, 1.0 * t1.NumberOfAttendees / SUM(t1.NumberOfAttendees) OVER () AS PercentOfTotal, 1.0 * SUM(t1.NumberOfAttendees) OVER ( ORDER BY t1.EventDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) / SUM(t1.NumberOfAttendees) OVER () AS CumulativePercent FROM #t1 t1;
Let’s talk about what we get in one query:
- A running total. We’ve seen these before so they’re old hat.
- A grand total. Saw this one as well in our aggregate functions post.
- The number of attendees in a given week divided by the grand total is the percent of the total. This is new and gives us the share of the whole which a single row represents.
- The running total divided by the grand total is the cumulative percentage of attendees. This is also the same as the running total of the
Not bad for a query’s work. Now, if I were putting this together in a real solution, I’d probably wrap the running total and grand total calculations in a CTE so I can reuse them more easily:
WITH records AS ( SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, SUM(t1.NumberOfAttendees) OVER ( ORDER BY t1.EventDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS RunningTotal, SUM(t1.NumberOfAttendees) OVER () AS GrandTotal FROM #t1 t1 ) SELECT t1.Id, t1.EventDate, t1.NumberOfAttendees, t1.RunningTotal, t1.GrandTotal, 1.0 * t1.NumberOfAttendees / t1.GrandTotal AS PercentOfTotal, 1.0 * t1.RunningTotal / t1.GrandTotal AS CumulativePercent FROM records t1;
Get the Latest N Values
Suppose I want to find the latest 5 orders for each customer in the Wide World Importers database. Specifically, I want to limit this to customers who do not have ” Toys” in the name. Here’s how I can do this using a window function:
WITH customers AS ( SELECT c.CustomerID, c.CustomerName FROM Sales.Customers c WHERE CustomerName NOT LIKE N'% Toys%' ), invoices AS ( SELECT i.CustomerID, i.InvoiceID, i.InvoiceDate, i.TotalChillerItems, i.TotalDryItems, ROW_NUMBER() OVER (PARTITION BY i.CustomerID ORDER BY i.InvoiceDate DESC) AS rownum FROM Sales.InvoicesSmall i INNER JOIN customers c ON i.CustomerID = c.CustomerID ) SELECT * FROM customers c LEFT OUTER JOIN invoices i ON c.CustomerID = i.CustomerID AND rownum <= 5;
Before I go any further, though, let me point out that this is not necessarily the most efficient method for getting the N latest records. It can be the most efficient method in some cases, but the APPLY operator can be more efficient in cases where you get a small percentage of a large table. I go into it in a lot more detail in my Teachable course on the topic, and if you use the discount code WINDOW, you can get the course for $20 USD through the end of the year.
Stealth advertisement over.
Distinct Customer Counts over Time
Have you ever needed to calculate the distinct number of customers on your site over time? No? Um…have you wanted to? I’ll take that awkward silence as a yes!
WITH records AS ( SELECT OrderDate, CASE WHEN ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) = 1 THEN CustomerID END AS DistinctCustomerID FROM Sales.Orders o ) SELECT DISTINCT OrderDate, COUNT(DistinctCustomerID) OVER (ORDER BY OrderDate) AS NumCustomers FROM records r ORDER BY OrderDate;
We’re solving this problem with two separate window functions. The first window function determines the first order for a given customer. When that order is the first—that is, when the row number is 1—then we return the customer ID. Otherwise, there is an implicit NULL. This means that we’ll get back one row for each order in
Sales.Orders, but we’ll only have a single value per customer which is not NULL.
Then, we take that stream of mostly-NULLs and get the count of distinct customer IDs. Remember that
COUNT() ignores NULL values if you specify a column in the aggregate function. Then, we make
COUNT() into a window function ordered by
OrderDate with an implicit frame of
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, giving us the count of non-NULL values in the dataset up to the current date. Slap a
DISTINCT on here to collapse down the values and we have our answer.
Build an Events System and Track Maximum Concurrent Operations
This one’s a doozy and I’ve written about it previously. By the way, that problem and solution looks a bit complex, but I’ve actually used it in real life to solve actual work-related problems. Specifically, it was around seeing if any customer had multiple concurrent upload operations going, as there was a bug in one product team’s code which could fire in that scenario, and so they wanted to see how many customers might have been affected. Fortunately, the answer was zero.
If you want a version for Wide World Importers, let’s suppose that we want to see the maximum number of concurrent orders that a customer has had in transit. Suppose that the
OrderDate represents when the customer places an order and
ExpectedDeliveryDate is the date the customer receives the order. We can unpivot orders data and figure out the maximum number of concurrent orders per customer using the following query:
WITH StartStopPoints AS ( SELECT o.CustomerID, o.OrderDate AS TimeUTC, 1 AS IsStartingPoint, ROW_NUMBER() OVER (PARTITION BY o.CustomerID ORDER BY o.OrderDate) AS StartOrdinal FROM Sales.Orders o UNION ALL SELECT o.CustomerID, o.ExpectedDeliveryDate AS TimeUTC, 0 AS IsStartingPoint, NULL AS StartOrdinal FROM Sales.Orders o ), StartStopOrder AS ( SELECT s.CustomerID, s.TimeUTC, s.IsStartingPoint, s.StartOrdinal, ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal FROM StartStopPoints s ) SELECT s.CustomerID, MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentOrders FROM StartStopOrder s WHERE s.IsStartingPoint = 1 GROUP BY s.CustomerID ORDER BY MaxConcurrentOrders DESC, CustomerID ASC;
Over the course of this post, we’ve looked at seven business use cases for window functions and we could easily come up with more. Window functions provide a huge amount of versatility when it comes to solving business problems. Furthermore, as we saw in the last two examples, clever combinations of window functions and reshaping of data can solve challenging business problems quite effectively.