Not too long ago, I helped a co-worker solve a problem. The quick version of the problem is, we want to collate together common occurrences of an event. In this case, it’s log records by IP address. If the same IP address shows up 3 times in a row, we want to know that there were 3 events, followed by the next IP, and so on. If that first IP shows up later on, it’s a new chain.
This is another way of formulating the island problem. Itzik Ben-Gan covers it wonderfully in his chapter on gaps and islands in SQL Server MVP Deep Dives, which chapter Manning has made available for free.
In my case, here’s a sample and the solution:
CREATE TABLE #Temp ( LogID INT, ErrorDate DATETIME, IPAddress VARCHAR(50) ); INSERT into dbo.#Temp ( LogID, ErrorDate, IPAddress ) VALUES (173371, '2014-04-07 09:47:35', 'IP 1'), (173372, '2014-04-07 09:47:36', 'IP 1'), (173373, '2014-04-07 09:47:37', 'IP 1'), (173374, '2014-04-07 10:01:35', 'IP 2'), (173375, '2014-04-07 10:02:35', 'IP 3'), (173376, '2014-04-07 10:03:04', 'IP 3'), (173377, '2014-04-07 10:03:05', 'IP 3'), (173378, '2014-04-07 10:03:18', 'IP 3'), (173379, '2014-04-07 10:18:26', 'IP 1'), (173380, '2014-04-07 10:18:26', 'IP 1'), (173381, '2014-04-07 10:18:26', 'IP 1'), (173382, '2014-04-07 10:18:27', 'IP 3'); WITH records AS ( SELECT LogID, ErrorDate, IPAddress, ROW_NUMBER() OVER (ORDER BY LogID) - ROW_NUMBER() OVER (ORDER BY IPAddress, LogID) AS grp FROM #Temp ) SELECT MIN(LogID) AS MinimumID, MAX(LogID) AS MaximumID, COUNT(*) AS NumberOfRows, IPAddress FROM records GROUP BY IPAddress, grp ORDER BY MinimumID;
The trick here is that we use two separate ROW_NUMBER calls to generate a difference in values. The first part of the line (ROW_NUMBER() OVER (ORDER BY LogID)) generates a unique row number for each record, ordered by LogID. The second ROW_NUMBER() operation generates a unique row number for each line, but it’s ordered by IP Address first and LogID second. This means that the six records with an IP address of 1 will be clustered together and numbered 1-6. We can take the difference between these two numbers to get a unique cluster, and that (which we call grp) lets us separate each IP address cluster out, solving the island problem with a single read of the data, much more efficiently than any row-by-row or self-joining solution could ever come up with.