This is part seven in a series on window functions in SQL Server.

The Road So Far

To this point, we’ve looked at five classes of window function in SQL Server. I’ve given you a couple of solid use cases, but for the most part, we’ve focused on what the classes of window functions are. Now we’re going to talk about why you want to use them.

What’s the Point?

Window functions help us solve a variety of problem types in a rather efficient way. Let’s cover several in turn.

Deleting Duplicate Rows

Suppose you have a really great event system. You track events and numbers of attendees, and your app works really well: it even inserts data when you ask it to!

CREATE TABLE #t1
(
    Id INT IDENTITY(1,1) NOT NULL,
    EventDate DATE NOT NULL,
    NumberOfAttendees INT NOT NULL
);

-- Our first round of insertions
INSERT INTO #t1(EventDate, NumberOfAttendees) VALUES
('2021-04-01', 150),
('2021-04-08', 165),
('2021-04-15', 144),
('2021-04-22', 170),
('2021-04-29', 168),
('2021-05-06', 164),
('2021-05-13', 152),
('2021-05-20', 158),
('2021-05-27', 168),
('2021-06-03', 170),
('2021-06-10', 161),
('2021-06-17', 155);

In fact, it’s so great that sometimes it inserts the same data multiple times, which I’m pretty sure is how you back up the data in a database.

INSERT INTO #t1(EventDate, NumberOfAttendees) VALUES
('2021-04-01', 150),
('2021-04-01', 150),
('2021-04-01', 150),
('2021-04-01', 150),
('2021-04-08', 165),
('2021-04-08', 165),
('2021-04-29', 168);

Some people might appreciate having 5 copies of their data for maximum redundancy, but it turns out that this makes it difficult for event organizers to understand what’s going on, and they keep getting weird results in the app because we can only have one event per day, just like in real life.

So how do we delete the bad rows? Well, we could give an intern a print-out copy of a spreadsheet containing all of the data in the table and then hand-write DELETE statements based on the ID column. Actually, forget this whole window function thing—let’s just hire a whole bunch of interns. Problem solved!

…My producer is telling me that this is not a good enough solution to the problem and if I want to finish up this blog post, I have to solve it using window functions. So let’s do that so I can go home continue being at home.

The following query gets me a unique, ascending row number for each event date, sorted by the ID of the event. We’re going to call that rownum because I am not clever with naming.

SELECT
    t1.Id,
    t1.EventDate,
    t1.NumberOfAttendees,
    ROW_NUMBER() OVER (PARTITION BY t1.EventDate ORDER BY t1.Id) AS rownum
FROM #t1 t1
Look, I’m sure this is intended behavior in the application. Have you checked the non-functional requirements?

Once we have our row number in place, we can perform the kill shot:

WITH records AS
(
    SELECT
        t1.Id,
        t1.EventDate,
        t1.NumberOfAttendees,
        ROW_NUMBER() OVER (PARTITION BY t1.EventDate ORDER BY t1.Id) AS rownum
    FROM #t1 t1
)
DELETE FROM records
WHERE
    rownum > 1;

Doing this gets us back to having one record per event.

It is scientifically proven that deleting data is 85% more fun than inserting data.

Running Totals

Yes, we talked about this in a prior blog post. We’re talking about it again here because it’s just that important. Let’s say that we want to know how many attendees have come to our events over time. We can use a running total to get those results:

SELECT
    t1.Id,
    t1.EventDate,
    t1.NumberOfAttendees,
    SUM(t1.NumberOfAttendees) OVER (
        ORDER BY t1.EventDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS RunningTotal
FROM #t1 t1;
Organizer protip: if your running total is equal to your unique attendee count, you probably should stop sacrificing all of your attendees by throwing them into volcanoes.

Moving Average

Let’s say that we’re trying to talk advertisers into giving us all kinds of money to sponsor our events. We could show them attendee counts, but what if there’s a down week or a boom week? We don’t want them to think every week is like that. One good way to deal with ups and downs is to smooth things using moving averages. Here, we’re going to calculate the average number of attendees over a three-event period.

SELECT
    t1.Id,
    t1.EventDate,
    t1.NumberOfAttendees,
    AVG(t1.NumberOfAttendees) OVER (
        ORDER BY t1.EventDate
        ROWS BETWEEN 2 PRECEDING
            AND CURRENT ROW
    ) AS MovingAverage3
FROM #t1 t1;
So smooth and silky.

The way this works is, if we have fewer than three weeks in a window, we take the average of however many weeks we do have. Thus, the first week has 150 attendees and a moving average of 150. The second week has 165 attendees and a moving average of (150+165)/2 = 157 (clip off the decimal because we’re doing integer math here). Once we get to the third week, we have (150+165+144)/3 = 153. For week four, we drop off the first entry because we only want the current row and the two preceding rows: (165+144+170)/3 = 159.

One quick point here is that for moving averages, we’re generally going to want to use ROWS as the frame rather than RANGE. Why? For sad reasons that I’ll talk about in a later post.

Percentage of a Total and Cumulative Percent

Okay, we’ve seen some of the basics, but let’s get a little crazy.

SELECT
    t1.Id,
    t1.EventDate,
    t1.NumberOfAttendees,
    SUM(t1.NumberOfAttendees) OVER (
        ORDER BY t1.EventDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS RunningTotal,
    SUM(t1.NumberOfAttendees) OVER () AS GrandTotal,
    1.0 * t1.NumberOfAttendees / SUM(t1.NumberOfAttendees) OVER () AS PercentOfTotal,
    1.0 * SUM(t1.NumberOfAttendees) OVER (
        ORDER BY t1.EventDate
        RANGE BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) / SUM(t1.NumberOfAttendees) OVER () AS CumulativePercent
FROM #t1 t1;
So many beautiful calculations.

Let’s talk about what we get in one query:

  • A running total. We’ve seen these before so they’re old hat.
  • A grand total. Saw this one as well in our aggregate functions post.
  • The number of attendees in a given week divided by the grand total is the percent of the total. This is new and gives us the share of the whole which a single row represents.
  • The running total divided by the grand total is the cumulative percentage of attendees. This is also the same as the running total of the PercentOfTotal column.

Not bad for a query’s work. Now, if I were putting this together in a real solution, I’d probably wrap the running total and grand total calculations in a CTE so I can reuse them more easily:

WITH records AS
(
    SELECT
        t1.Id,
        t1.EventDate,
        t1.NumberOfAttendees,
        SUM(t1.NumberOfAttendees) OVER (
            ORDER BY t1.EventDate
            RANGE BETWEEN UNBOUNDED PRECEDING
                AND CURRENT ROW
        ) AS RunningTotal,
        SUM(t1.NumberOfAttendees) OVER () AS GrandTotal
    FROM #t1 t1
)
SELECT
    t1.Id,
    t1.EventDate,
    t1.NumberOfAttendees,
    t1.RunningTotal,
    t1.GrandTotal,
    1.0 * t1.NumberOfAttendees / t1.GrandTotal AS PercentOfTotal,
    1.0 * t1.RunningTotal / t1.GrandTotal AS CumulativePercent
FROM records t1;

Get the Latest N Values

Suppose I want to find the latest 5 orders for each customer in the Wide World Importers database. Specifically, I want to limit this to customers who do not have ” Toys” in the name. Here’s how I can do this using a window function:

WITH customers AS
(
    SELECT
        c.CustomerID,
        c.CustomerName
    FROM Sales.Customers c
    WHERE
        CustomerName NOT LIKE N'% Toys%'
),
invoices AS
(
    SELECT
        i.CustomerID,
        i.InvoiceID,
        i.InvoiceDate,
        i.TotalChillerItems,
        i.TotalDryItems,
        ROW_NUMBER() OVER (PARTITION BY i.CustomerID ORDER BY i.InvoiceDate DESC) AS rownum
    FROM Sales.InvoicesSmall i
        INNER JOIN customers c
            ON i.CustomerID = c.CustomerID
)
SELECT *
FROM customers c
    LEFT OUTER JOIN invoices i
        ON c.CustomerID = i.CustomerID
        AND rownum <= 5;
Could I get a GDPR notice for fake names? I’m pretty sure Liechtenstein is a fake country.

Before I go any further, though, let me point out that this is not necessarily the most efficient method for getting the N latest records. It can be the most efficient method in some cases, but the APPLY operator can be more efficient in cases where you get a small percentage of a large table. I go into it in a lot more detail in my Teachable course on the topic, and if you use the discount code WINDOW, you can get the course for $20 USD through the end of the year.

Stealth advertisement over.

Distinct Customer Counts over Time

Have you ever needed to calculate the distinct number of customers on your site over time? No? Um…have you wanted to? I’ll take that awkward silence as a yes!

WITH records AS
(
	SELECT
		OrderDate,
		CASE
			WHEN ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) = 1 THEN CustomerID
		END AS DistinctCustomerID
	FROM Sales.Orders o
)
SELECT DISTINCT
	OrderDate,
	COUNT(DistinctCustomerID) OVER (ORDER BY OrderDate) AS NumCustomers
FROM records r
ORDER BY
	OrderDate;
Wide World Importers only allowed themselves to add 43 customers per day at the beginning. You don’t want to go all-out early on and run out of customers, after all.

We’re solving this problem with two separate window functions. The first window function determines the first order for a given customer. When that order is the first—that is, when the row number is 1—then we return the customer ID. Otherwise, there is an implicit NULL. This means that we’ll get back one row for each order in Sales.Orders, but we’ll only have a single value per customer which is not NULL.

Then, we take that stream of mostly-NULLs and get the count of distinct customer IDs. Remember that COUNT() ignores NULL values if you specify a column in the aggregate function. Then, we make COUNT() into a window function ordered by OrderDate with an implicit frame of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, giving us the count of non-NULL values in the dataset up to the current date. Slap a DISTINCT on here to collapse down the values and we have our answer.

Build an Events System and Track Maximum Concurrent Operations

This one’s a doozy and I’ve written about it previously. By the way, that problem and solution looks a bit complex, but I’ve actually used it in real life to solve actual work-related problems. Specifically, it was around seeing if any customer had multiple concurrent upload operations going, as there was a bug in one product team’s code which could fire in that scenario, and so they wanted to see how many customers might have been affected. Fortunately, the answer was zero.

If you want a version for Wide World Importers, let’s suppose that we want to see the maximum number of concurrent orders that a customer has had in transit. Suppose that the OrderDate represents when the customer places an order and ExpectedDeliveryDate is the date the customer receives the order. We can unpivot orders data and figure out the maximum number of concurrent orders per customer using the following query:

WITH StartStopPoints AS
(
	SELECT
		o.CustomerID,
		o.OrderDate AS TimeUTC,
		1 AS IsStartingPoint,
		ROW_NUMBER() OVER (PARTITION BY o.CustomerID ORDER BY o.OrderDate) AS StartOrdinal
	FROM Sales.Orders o

	UNION ALL

	SELECT
		o.CustomerID,
		o.ExpectedDeliveryDate AS TimeUTC,
		0 AS IsStartingPoint,
		NULL AS StartOrdinal
	FROM Sales.Orders o
),
StartStopOrder AS
(
	SELECT
		s.CustomerID,
		s.TimeUTC,
		s.IsStartingPoint,
		s.StartOrdinal,
		ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TimeUTC, s.IsStartingPoint) AS StartOrEndOrdinal
	FROM StartStopPoints s
)
SELECT
	s.CustomerID,
	MAX(2 * s.StartOrdinal - s.StartOrEndOrdinal) AS MaxConcurrentOrders
FROM StartStopOrder s
WHERE
	s.IsStartingPoint = 1
GROUP BY
	s.CustomerID
ORDER BY
	MaxConcurrentOrders DESC,
	CustomerID ASC;

Packages in transit.

Conclusion

Over the course of this post, we’ve looked at seven business use cases for window functions and we could easily come up with more. Window functions provide a huge amount of versatility when it comes to solving business problems. Furthermore, as we saw in the last two examples, clever combinations of window functions and reshaping of data can solve challenging business problems quite effectively.

2 thoughts on “A Slice of Time: Good Use Cases

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s