Not too long ago, I helped a co-worker solve a problem.  The quick version of the problem is, we want to collate together common occurrences of an event.  In this case, it’s log records by IP address.  If the same IP address shows up 3 times in a row, we want to know that there were 3 events, followed by the next IP, and so on.  If that first IP shows up later on, it’s a new chain.

This is another way of formulating the island problem.  Itzik Ben-Gan covers it wonderfully in his chapter on gaps and islands in SQL Server MVP Deep Dives, which chapter Manning has made available for free.

In my case, here’s a sample and the solution:

CREATE TABLE #Temp
(
	LogID INT,
	ErrorDate DATETIME,
	IPAddress VARCHAR(50)
);

INSERT into dbo.#Temp
(
	LogID,
	ErrorDate,
	IPAddress
)
VALUES
	(173371, '2014-04-07 09:47:35', 'IP 1'),
	(173372, '2014-04-07 09:47:36', 'IP 1'),
	(173373, '2014-04-07 09:47:37', 'IP 1'),
	(173374, '2014-04-07 10:01:35', 'IP 2'),
	(173375, '2014-04-07 10:02:35', 'IP 3'),
	(173376, '2014-04-07 10:03:04', 'IP 3'),
	(173377, '2014-04-07 10:03:05', 'IP 3'),
	(173378, '2014-04-07 10:03:18', 'IP 3'),
	(173379, '2014-04-07 10:18:26', 'IP 1'),
	(173380, '2014-04-07 10:18:26', 'IP 1'),
	(173381, '2014-04-07 10:18:26', 'IP 1'),
	(173382, '2014-04-07 10:18:27', 'IP 3');

WITH records AS
(
	SELECT
		LogID,
		ErrorDate,
		IPAddress,
		ROW_NUMBER() OVER (ORDER BY LogID) - ROW_NUMBER() OVER (ORDER BY IPAddress, LogID) AS grp
	FROM
		#Temp
)
SELECT
	MIN(LogID) AS MinimumID,
	MAX(LogID) AS MaximumID,
	COUNT(*) AS NumberOfRows,
	IPAddress
FROM
	records
GROUP BY
	IPAddress,
	grp
ORDER BY
	MinimumID;

The trick here is that we use two separate ROW_NUMBER calls to generate a difference in values.  The first part of the line (ROW_NUMBER() OVER (ORDER BY LogID)) generates a unique row number for each record, ordered by LogID.  The second ROW_NUMBER() operation generates a unique row number for each line, but it’s ordered by IP Address first and LogID second.  This means that the six records with an IP address of 1 will be clustered together and numbered 1-6.  We can take the difference between these two numbers to get a unique cluster, and that (which we call grp) lets us separate each IP address cluster out, solving the island problem with a single read of the data, much more efficiently than any row-by-row or self-joining solution could ever come up with.

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s