A Slice of Time: Aggregate Functions

This is part two in a series on window functions in SQL Server.

Last Time on 36 Chambers

In the prior post, we looked at the anatomy of a window function and said that it looks kind of like this:

SELECT
	Col1,
	Col2,
	WINDOW_FUNCTION() OVER(
			PARTITION BY Col1
			ORDER BY Col2
			ROWS BETWEEN UNBOUNDED PRECEDING
				AND CURRENT ROW
		) AS wf
FROM dbo.Table;

Now we’re going to look at a specific class of window functions: aggregate functions.

I Know These!

Even if you aren’t at all familiar with window functions, you’ve seen aggregate functions before. Here’s an example using the WideWorldImporters database.

SELECT
    i.CustomerID,
    SUM(il.LineProfit) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
GROUP BY
    i.CustomerID;

Here, we get the sum of LineProfit by CustomerID. Because SUM() is an aggregate function, we need a GROUP BY clause for all non-aggregated columns. This is an aggregate function. The full set of them in T-SQL is available here, but you’ll probably be most familiar with MIN(), MAX(), SUM(), AVG(), and COUNT().

To turn this into a window function, we slap on an OVER() and boom! Note: “boom!” only works on SQL Server 2012 and later, so if you’re still on 2008 R2, it’s more of a fizzle than a boom.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    SUM(il.LineProfit) OVER () AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID;

So, uh, what’s going on here? Well, the answer is that the phrase OVER() translates to “over all rows, considering everything as one window, starting from the beginning of the window and progressing through the end of the window.” In other words, $85,729,180.90 is the sum of line profit over all of the invoice lines in the entire database.

Also note that there is no GROUP BY. This is because window aggregates behave a bit differently than normal aggreagtes. Here’s the query plan from the first operation:

We can see a pair of Hash Match aggregates. The first one gets us from 228,265 rows down to 70,510 (number of invoices), and the second gets us down to 663 (number of customers). Meanwhile, here’s the window function:

We’re still performing an aggregation, but that aggregate becomes an additional column rather than grouping together rows. When we turn an aggregate function into a window function, the number of rows remains the same as if we hadn’t included the window function at all.

Partitioning

Next up, let’s partition by customer ID.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY CustomerID;

We’ll still get one row per invoice line, but now the TotalProfit is a total profit per customer. Note that it resets as we move from one customer to the next. That’s because our window is now “over all rows for a given customer ID, starting from the beginning of the window and progressing through the end of the window.” This is already pretty useful because we could slap on line profit as a separate column and show the profit of an individual invoice line versus the total profit across the customer. We could even do it by invoice, too:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.LineProfit,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID, i.InvoiceID) As InvoiceProfit,
    SUM(il.LineProfit) OVER (PARTITION BY i.CustomerID) AS TotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY CustomerID;

Running Totals

This gets us to our first interesting use case with window functions: the running total. Given a partition and order combination, we can calculate the running total of profitability. Let’s take a look:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate) AS RunningTotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

This is technically correct, but check out the running totals: we start with $260 but our running total is $316. That’s because our ORDER BY clause is based on invoice date, and in the event of ties, we treat all ties equally. The reason for this is that the default window for an aggregate function with an ORDER BY clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. What this means is “start from the beginning of the window and continue until you get to this row or any row whose ordering causes a tie with this row.” RANGE is the reason for the bolded clause, and if we don’t want to have that bolded clause, we need to change the clause to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RunningTotalProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

Now our running total looks more like what we’d expect to see.

A Medley of Aggregates

Finally, I want to point out that you can have multiple aggregate functions over multiple windows in the same query. I showed one example with several SUM() operations, but you can certainly have more than that:

SELECT
    i.InvoiceID,
    i.CustomerID,
    i.InvoiceDate,
    il.InvoiceLineID,
    il.LineProfit,
    SUM(il.LineProfit) OVER (
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW) AS RunningTotalProfit,
    SUM(il.LineProfit) OVER () AS TotalProfit,
    MIN(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS MinLineProfit,
    MAX(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS MaxLineProfit,
    AVG(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS AvgLineProfit,
    STDEV(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS StdevLineProfit,
    COUNT(il.Lineprofit) OVER(
        PARTITION BY i.CustomerID
        ORDER BY i.InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING
            AND CURRENT ROW
    ) AS CountLineProfit
FROM Sales.InvoiceLinesSmall il
    INNER JOIN Sales.InvoicesSmall i
        ON il.InvoiceID = i.InvoiceID
ORDER BY
    i.CustomerID,
    i.InvoiceDate;

Given how complex this is, you might think that the execution plan will be horrible. Well, it’s…kind of not, maybe?

In spite of us having seven aggregate functions, we only have two window aggregate operations. The reason for this is that most of our window aggregates use the same frame: PARTITION BY i.CustomerID ORDER BY i.InvoiceDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. When you have multiple aggregate functions which use the same frame, SQL Server’s query optimizer is smart enough to realize that it only needs to do the work once and can apply the frame results to each function. There is one function which uses a different frame: SUM(il.LineProfit) OVER(). Because the OVER() clause is different, this function requires a different frame than its peers, and so we need a different window aggregate operator. But on the whole, this is quite optimal compared to alternatives such as creating a lazy spool or performing multiple scans of the tables. We have one scan of the InvoiceLinesSmall table, one scan of an index on InvoicesSmall, and some compute work tracking aggregate results as we stream the rows. Not bad for a quarter-second’s work.

Conclusion

Aggregate window functions were not the first type of window function introduced in SQL Server—that honor goes to ranking functions, which we’ll discuss next time. But they were one of the best things to come out of a great release in SQL Server 2012. We have already seen one excellent use of them in calculating running totals, and we’ll see a few more as this series progresses.