A long time ago, I talked about alerts and monitors.  A bit later, I added thoughts on auditing as well.  Today, I’m going to cover a technique that I’ve used to alert me to real problems without too much noise.  That technique is called the leaky bucket algorithm.

Step One:  Get A Bucket

Let’s say that we have a process we want to monitor, like an API.  Further suppose we have a table in SQL Server which collects every 500 class response that our API sends out.  As a quick refresher, 200 class responses (e.g., HTTP 200) indicate success.  300 class responses (e.g., 301, 302) indicate redirection.  400 class responses (e.g., 403, 404) typically indicate that the user has failed us somehow.  500 class responses (e.g., 500) typically indicate that the server has failed us somehow.  We might find it interesting to track 404 errors because that might help us find out that, for example, we don’t have a favicon.ico file, but there’s a lot of noise in those types of errors (especially if your API is publicly accessible), so we’ll ignore those and focus on the 500 errors.

So let’s create that table:

CREATE TABLE [dbo].[ApiCall]
(
	ApiCallID BIGINT IDENTITY(1,1) NOT NULL,
	EndpointName NVARCHAR(128) NULL,
	RequestVerb NVARCHAR(10) NOT NULL,
	ResponseCode SMALLINT NOT NULL,
	ResponseTimeMS INT NOT NULL,
	LogDateGMT DATETIME2(3) NOT NULL CONSTRAINT [DF_ApiCall_LogDateGMT] DEFAULT (SYSUTCDATETIME()),
	CONSTRAINT [PK_ApiCall] PRIMARY KEY CLUSTERED(ApiCallID)
);
GO

This is a very simple table, and a real alerting table will have additional information, but let’s not complicate things unnecessarily here.

Step Two:  Put A Hole In It

Now that we have a table, we want to do something with it.  The most naive solution would be to fire off an alert every time a row gets added to this table.  The problem with this solution is that it can flood our inbox with useless information.  Suppose the developers push out an API change that breaks everything.  The first 500 response will be important information.  The second one might be important because it’s confirmatory.  The six hundredth alert will not help.  And heaven help you if you’ve got this alert tied to your PagerDuty account…
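
To make that concrete, here is a minimal sketch of the naive approach against the table above.  The @LastAlertedApiCallID bookmark is a made-up detail; the point is simply that every row turns into its own alert.

--Naive approach (not recommended):  every new 500 class row becomes an alert.
--@LastAlertedApiCallID is a hypothetical bookmark of the last row we already alerted on.
DECLARE @LastAlertedApiCallID BIGINT = 0;

SELECT
	ac.ApiCallID,
	ac.EndpointName,
	ac.RequestVerb,
	ac.ResponseCode,
	ac.LogDateGMT
FROM dbo.ApiCall ac
WHERE
	ac.ResponseCode >= 500
	AND ac.ApiCallID > @LastAlertedApiCallID
ORDER BY
	ac.ApiCallID;
--Send one e-mail per row returned.  Six hundred errors means six hundred e-mails.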

So let’s only alert if we get to a certain threshold—we’ll call it 5 messages.  Fewer than 5 and we can assume it’s a blip or just somebody doing something weird.  The next part of the problem is, 5 messages per what?  We don’t want to alert for every fifth error message.  Let’s say we get one error message a year, so there was one in 2014, one in 2015, one in 2016, one in 2017, and now one in 2018.  If we simply set our threshold to report on every fifth error, we’ll get an alert in 2018, but most likely, the prior years’ errors are long gone by now, so that’s not helpful.

Instead, we want to set a timeframe for those alerts.  In my case, I’ll say that I want to look for 5 errors at any point during the past 30 minutes.  The particular threshold and timeframe are configurable (I pulled those numbers out of a hat), but they’ll be useful for us.  What makes this approach nice is that errors “drain” from the bucket automatically:  errors lose their importance over time in our alerting system, and 30 minutes of calm completely drains the bucket.
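
As a first cut, that check is nothing more than a count over a sliding window.  Here is a minimal sketch; the parameter values mirror the ones we just picked, and the drain behavior gets more interesting shortly.

--Simple threshold check:  have we seen at least 5 server errors in the past 30 minutes?
DECLARE
	@MaxNumErrors INT = 5,
	@ObservationFrameSeconds INT = 1800;

SELECT
	COUNT(*) AS ErrorsInWindow
FROM dbo.ApiCall ac
WHERE
	ac.ResponseCode >= 500
	AND ac.LogDateGMT >= DATEADD(SECOND, -1 * @ObservationFrameSeconds, SYSUTCDATETIME());
--Alert if ErrorsInWindow >= @MaxNumErrors.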

Step Three:  Thinking About Overflow

But now we have a new alert frequency problem.  Let’s suppose that I alert anytime I see 5 errors in the past 30 minutes.  Let’s further suppose that my process checks every 5 minutes and sends out an e-mail.  The problem is that I might get upwards of six alert e-mails for the same five errors—that’s definitely not what I want!

As an example of this, suppose that we had 5 errors happen in a burst at 5:30 PM, right before my alert check ran.  I will subsequently get alerts at:

  1. 5:30 PM
  2. 5:35 PM
  3. 5:40 PM
  4. 5:45 PM
  5. 5:50 PM
  6. 5:55 PM

They will finally wash out at my 6 PM alert cycle, but I really didn’t need 5 of those alert e-mails.  One viable solution is to mark those particular errors as alerted and therefore no longer interesting to us.  That solution can work for us with this table design (though I’d want to add a new bit flag indicating whether we’ve sent an alert for this error; see the sketch below), but in real life, I’m storing a form of this data in a table with a clustered columnstore index and I don’t want to update rows in that table.  So I’m going to use a different approach:  only alerting when we overflow.
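
Purely for reference, that mark-as-alerted bookkeeping might look something like the following sketch.  The HasAlerted column name and the update logic are my own assumptions, not something we will use going forward.

--Hypothetical mark-as-alerted approach (not the approach used in the rest of this post).
ALTER TABLE dbo.ApiCall ADD HasAlerted BIT NOT NULL
	CONSTRAINT [DF_ApiCall_HasAlerted] DEFAULT(0);
GO
--After sending an alert, flag the offending rows so they no longer count toward the threshold.
UPDATE ac
SET
	HasAlerted = 1
FROM dbo.ApiCall ac
WHERE
	ac.ResponseCode >= 500
	AND ac.HasAlerted = 0
	AND ac.LogDateGMT >= DATEADD(MINUTE, -30, SYSUTCDATETIME());
GO

And now we’re getting to the metaphor.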

Think about a bucket that can hold five gallons of water.  If I try to pour in six gallons, I can get the first five, but the remaining gallon overflows and causes a mess everywhere.  This also gives us a cap on how much water we care about:  if the bucket is empty and I immediately dump 10 gallons into it, 5 of those gallons will overflow immediately, leaving me with a full five-gallon bucket.  We tell people about the spill (those five gallons not in the bucket) but don’t care anymore about that spilled water.  Now suppose that I take that five-gallon bucket and drill a hole in it.  Let’s say that it has a drain rate of one gallon every six minutes.  If I average no more than one gallon every six minutes, I can pour into perpetuity without overflow…assuming that the bucket wasn’t full when I began the process.

Step Four:  Handling History

That’s where we need to think about the next problem.  In order to implement a leaky bucket, I need to know the entire history of the bucket, or at least a long enough period of time that I can feel confident that I’m not going to miss important alerts.  If I’m only looking at the last 30 minutes, I’m missing the fact that the bucket might not have been empty when my 30-minute window started, so I have to look back at the prior 30-minute window.  But to do that, I need to look back at the prior window, and so on, until I get back to the first usage of that bucket.

You might be wondering why that’s important, and here goes:  let’s say that we have one error every 8 minutes.  Based on the error rate alone and alerting at 5 errors in a 30-minute window, we will never fire an alert even though we have a consistent error.  That means we’re potentially ignoring useful information and this potential false negative might come back to bite us later.

By contrast, with our leaky bucket, we will eventually overflow, though it will take a while.  We have 7.5 errors per hour coming in and 6 errors per hour draining out, so we have a little bit of an algebra problem:  5 = 7.5n - 6n, which simplifies to 5 = 1.5n, so n = 3.33 hours.  After 3.33 hours, our bucket will fill and we’ll get an alert.
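
If you want to double-check that arithmetic, the net fill rate is 1.5 errors per hour against a 5-error bucket:

--Time to overflow at a net fill rate of 7.5 - 6 = 1.5 errors per hour.
SELECT 5.0 / (7.5 - 6.0) AS HoursUntilOverflow;  --3.333...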

But looking at the entire history of a table is not really practical for us when that table gets large, so instead I’ll compromise:  I’ll look at 5 cycles.  The idea here is that no matter what happened, the entire bucket will have been flushed five times during this process and older errors are very unlikely to have an effect.  We still get the benefit of looking at longer timeframes and capturing slowly-building errors, but don’t need to accept the full cost of reading everything.

Step Five:  The Final Product

After all of this talk, I finally get to the code.  We’ll take the table defined above, add a few rows, and test a couple of scenarios.

So let’s add some rows:

INSERT INTO dbo.ApiCall
(
	EndpointName,
	RequestVerb,
	ResponseCode,
	ResponseTimeMS,
	LogDateGMT
)
VALUES
--Initial burst to trigger an alert; note that there are more than enough errors here.
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
--Occasional record, but should stay under the threshold
('Endpoint', 'POST', 500, 15, '2018-01-08 09:14:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:24:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:34:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:44:10.274'),
--Another burst to trigger an alert
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
--Occasional record to stay under the threshold, but at a fast enough pace that we eventually trigger again
('Endpoint', 'POST', 500, 15, '2018-01-08 09:59:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:09:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:19:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:24:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:29:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:34:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:39:10.274');

I have some tests I’d like to cover, but first let’s walk through the leaky bucket procedure itself.

CREATE OR ALTER PROCEDURE dbo.LeakyBucketAlert
(
	@HistoricalFrameCycles INT = 5,
	@FrameSizeSeconds INT = 60,
	@DrainEventsPerFrame DECIMAL(9,2) = 0.1,
	@MaxNumErrors INT = 5,
	@ObservationFrameSeconds INT = 1800,
	@StartDateGMT DATETIME2(3) = NULL
)
AS
SET NOCOUNT ON;

CREATE TABLE #Events
(
	FrameBeginTime DATETIME2(0),
	FrameEndTime DATETIME2(0),
	AlertIfFull TINYINT,
	NetBucketChange DECIMAL(9,2),
	FillLevelAtEndOfFrame DECIMAL(9,2)
);

DECLARE
	@FrameBeginTime DATETIME2(0),
	@StartingFillLevel DECIMAL(9,2),
	@NetBucketChange DECIMAL(9,2),
	@PriorFrameBeginTime DATETIME2(0);

SET @StartDateGMT = ISNULL(@StartDateGMT, GETUTCDATE());

DECLARE
	@CurrentWindowStartTime DATETIME2(0) = DATEADD(SECOND, -1 * @ObservationFrameSeconds, @StartDateGMT),
	--Start at ObservationFrameSeconds and then go back the amount of time needed for the bucket to cycle through completely.
	@PriorCycleBeginTime DATETIME2(0) = DATEADD(SECOND, -@HistoricalFrameCycles * (1.0 * @MaxNumErrors / @DrainEventsPerFrame * @FrameSizeSeconds) - @ObservationFrameSeconds, @StartDateGMT);

--The Drain events happen every period
INSERT INTO #Events
(
	FrameBeginTime,
	FrameEndTime,
	AlertIfFull,
	NetBucketChange
)
SELECT
	frame.FrameBeginTime,
	frame.FrameEndTime,
	CASE
		WHEN frame.FrameBeginTime >= @CurrentWindowStartTime THEN 1
		ELSE 0
	END AS AlertIfFull,
	-1 * @DrainEventsPerFrame AS NetBucketChange
FROM dbo.Tally t
	CROSS APPLY
	(
		SELECT
			DATEADD(SECOND, t.N * @FrameSizeSeconds, @PriorCycleBeginTime) AS FrameBeginTime,
			DATEADD(SECOND, (t.N + 1) * @FrameSizeSeconds, @PriorCycleBeginTime) AS FrameEndTime
	) frame
WHERE
	t.N < 1.0 * DATEDIFF(SECOND, @PriorCycleBeginTime, @StartDateGMT) / @FrameSizeSeconds;

--The Fill events might happen during a period; if they do, collect them.
WITH FillEvents AS (
	SELECT
		e.FrameBeginTime,
		COUNT(1) AS FillEvents
	FROM dbo.ApiCall ac
		INNER JOIN #Events e
			ON ac.LogDateGMT >= e.FrameBeginTime
			AND ac.LogDateGMT < e.FrameEndTime
	WHERE
		ac.ResponseCode = 500
		AND ac.LogDateGMT >= @PriorCycleBeginTime
	GROUP BY
		e.FrameBeginTime
)
UPDATE e
SET
	NetBucketChange = e.NetBucketChange + fe.FillEvents
FROM #Events e
	INNER JOIN FillEvents fe
		ON e.FrameBeginTime = fe.FrameBeginTime;

--Marking the bucket fill line
--Basic rules:
--	1)  Cannot go below 0
--	2)  Cannot go above @MaxNumErrors
--	3)  Addition and subtraction happen "simultaneously" so we apply the net change as a unit
--Because of rules 1 and 2, we cannot simply use a rolling window summation to determine fill level.
DECLARE
	BucketCursor CURSOR LOCAL FAST_FORWARD FOR
	SELECT
		e.FrameBeginTime,
		e.NetBucketChange,
		LAG(e.FrameBeginTime) OVER (ORDER BY e.FrameBeginTime) AS PriorFrameBeginTime
	FROM #Events e
	ORDER BY
		e.FrameBeginTime ASC;

OPEN BucketCursor;
FETCH NEXT FROM BucketCursor INTO @FrameBeginTime, @NetBucketChange, @PriorFrameBeginTime;

WHILE (@@FETCH_STATUS = 0)
BEGIN
	SELECT
		@StartingFillLevel = FillLevelAtEndOfFrame
	FROM #Events e
	WHERE
		e.FrameBeginTime = @PriorFrameBeginTime;

	SET @StartingFillLevel = ISNULL(@StartingFillLevel, 0);

	UPDATE #Events
	SET
		FillLevelAtEndOfFrame = CASE
			WHEN @StartingFillLevel + @NetBucketChange < 0 THEN 0
			WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN @MaxNumErrors
			ELSE @StartingFillLevel + @NetBucketChange
		END
	WHERE
		FrameBeginTime = @FrameBeginTime;

	FETCH NEXT FROM BucketCursor INTO @FrameBeginTime, @NetBucketChange, @PriorFrameBeginTime;
END
CLOSE BucketCursor;
DEALLOCATE BucketCursor;

WITH records AS
(
	SELECT
		e.FrameBeginTime,
		e.FrameEndTime,
		e.AlertIfFull,
		LAG(e.FillLevelAtEndOfFrame) OVER (ORDER BY e.FrameBeginTime) AS FillLevelAtStartOfFrame,
		e.NetBucketChange,
		e.FillLevelAtEndOfFrame
	FROM #Events e
	WHERE
		EXISTS
		(
			SELECT 1
			FROM #Events ee
			WHERE
				ee.AlertIfFull = 1
				AND ee.FillLevelAtEndOfFrame = @MaxNumErrors
		)
)
SELECT
	r.FrameBeginTime,
	r.FrameEndTime,
	r.FillLevelAtStartOfFrame,
	-1 * @DrainEventsPerFrame AS DrainEventsPerFrame,
	r.NetBucketChange + @DrainEventsPerFrame AS AddEventsPerFrame,
	r.NetBucketChange,
	r.FillLevelAtEndOfFrame
FROM records r
WHERE
	r.AlertIfFull = 1;
GO

Fun fact: I spent an entire day trying to remove that cursor. Don’t tell my boss.

Let’s explain what’s happening here.  At the top, we declare two key variables:  @CurrentWindowStartTime and @PriorCycleBeginTime.  @CurrentWindowStartTime is @ObservationFrameSeconds before our starting point, and @PriorCycleBeginTime goes back a further @HistoricalFrameCycles complete drain cycles, where one drain cycle is the time a full bucket takes to empty (@MaxNumErrors / @DrainEventsPerFrame frames of @FrameSizeSeconds each).  To make this concrete with the defaults, the current window’s start time is 30 minutes prior to the start date, one drain cycle is 5 / 0.1 = 50 minutes, and so the prior cycle begin time is 5 * 50 + 30 = 280 minutes prior to the start date.
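
If you want to see those defaults worked out, here is a quick sketch using an arbitrary start date:

--With the defaults, a 5-error bucket draining 0.1 errors per 60-second frame takes
--5 / 0.1 = 50 frames (50 minutes) to empty, so 5 cycles plus the 30-minute window is 280 minutes.
DECLARE @StartDateGMT DATETIME2(3) = '2018-01-08 10:00:00';

SELECT
	DATEADD(MINUTE, -30, @StartDateGMT) AS CurrentWindowStartTime,	--2018-01-08 09:30
	DATEADD(MINUTE, -280, @StartDateGMT) AS PriorCycleBeginTime;	--2018-01-08 05:20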

From there, I use a tally table that I’ve built separately to generate one row per frame, each carrying the constant drain rate.  If you don’t have a tally table, here’s one for you:

SELECT TOP(10001)
	IDENTITY(INT, 0, 1) AS N
INTO dbo.Tally
FROM sys.objects s1
	CROSS JOIN sys.objects s2;
ALTER TABLE dbo.Tally ADD CONSTRAINT [PK_Tally] PRIMARY KEY CLUSTERED (N);

The Events temp table lets me see how many events per minute I have, as well as the drain rate.  I’m using a constant drain rate here, though you could modify this to have a proportionate drain rate (where it drains faster with more errors) if you wanted.  I also have a flag, AlertIfFull.  That flag makes sure that I don’t accidentally alert on historical overflows—my assumption is that I already caught those, so I don’t need to keep sending messages out.

Next up, there’s the cursor.  I feel obligated to explain why I need a cursor here, especially when I clearly know enough about window functions to use them.  The problem is that we have not only a window, but also a floor and a ceiling (no extended metaphor or pun intended).  I cannot go below 0 errors and I cannot go above @MaxNumErrors errors.  In other words, I cannot build up error immunity by having an extended period of time with zero errors and thereby go negative.  I also do not want to see the error count pushed up so high that it never drops back down—that would be like trying to pour 10 gallons into a 5-gallon bucket and counting those 10 gallons as there.  They’re not really there—we had an overflow and now we’re moving on with a full 5-gallon bucket.  Anyhow, the nature of that rule means that I couldn’t simply use a window function to solve this problem.
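
Here is a tiny, self-contained illustration of that point; the per-frame numbers are made up.

--A plain running SUM() OVER () ignores the floor at zero (and the ceiling at the bucket size).
--Net changes per frame:  +3, -2, -2, -2, +4, with a 5-error bucket.
WITH NetChanges AS
(
	SELECT *
	FROM (VALUES (1, 3.0), (2, -2.0), (3, -2.0), (4, -2.0), (5, 4.0)) x(FrameID, NetBucketChange)
)
SELECT
	nc.FrameID,
	nc.NetBucketChange,
	SUM(nc.NetBucketChange) OVER (ORDER BY nc.FrameID) AS UnboundedRunningTotal
FROM NetChanges nc;
--The unbounded total runs 3, 1, -1, -3, 1:  the quiet stretch banks immunity, so the final
--burst only brings it back to 1.  Clamping at each step gives 3, 1, 0, 0, 4 instead.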

Instead, for each minute, I figure out what the starting point is, add or drain based on whether NetBucketChange is positive or negative (or do nothing if it’s 0), and deal with floors or ceilings.  The end result of that is FillLevelAtEndOfFrame, which is the thing I really care about.

Finally, after the cursor ends, we look at the Events temp table and return results if we experienced an overflow event.

Step Six:  Proving That It Works

Here are my tests given the leaky bucket.  I only alert when I get rows back, so if the result set is empty, we don’t send an alert.

--Expected:  no results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:02:00';

--Expected:  results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:04:00';

--Expected:  results with some drain.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:07:00';

--Expected:  results with continued drain and occasional bumps.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:25:00';

--Expected:  no results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:45:00';

--Expected:  new results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:50:00';

--Expected:  no results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:20:00';

--Expected:  results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:34:00';

--Expected:  results.  We aren't letting the bucket drain and keep getting errors, so we keep alerting.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:39:00';

Step Seven:  Another Approach

One area of potential improvement could be tracking whether a set of errors has already been part of an alert and ignoring those errors from then on.  That would be equivalent to pouring out the entire bucket and starting over whenever it overflows.  There are arguments for and against this approach.  I ended up going with my approach because I want to be prodded if my API keeps sending me waves of errors, but I could see the argument in favor of emptying the bucket completely.  In that case, you’d just need to change the CASE expression in the cursor’s UPDATE against #Events so that the overflow branch reads WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN 0.  Then you’d have a leaky bucket with a full flush after overflow.
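
As a rough sketch, that change lives entirely inside the cursor loop; everything else in the procedure stays the same.

--Variant:  empty the bucket entirely after an overflow.
--This replaces the CASE expression in the cursor's UPDATE against #Events.
UPDATE #Events
SET
	FillLevelAtEndOfFrame = CASE
		WHEN @StartingFillLevel + @NetBucketChange < 0 THEN 0
		WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN 0	--pour out the whole bucket
		ELSE @StartingFillLevel + @NetBucketChange
	END
WHERE
	FrameBeginTime = @FrameBeginTime;

One caveat:  with a full flush, the final query can no longer detect an overflow by looking for a frame sitting at @MaxNumErrors, so you would also want to record the overflow itself (for example, with a hypothetical Overflowed flag on #Events) and alert off of that instead.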