Deleting Data In Batches

In my last post, I talked about deleting partially-migrated data from an instance. My ideal set-based solution to this problem would be a simple query like:

BEGIN TRANSACTION;

DELETE
FROM dbo.MyTable
WHERE
	MyCriteria = 1;
	
COMMIT TRANSACTION;

The upside to this is that it’s a trivial query to write, totally understandable, and a set-based solution. The downside, however, is that if you’re deleting a lot of rows, your SQL Server transaction log can blow up in size to the point where you might run out of disk space. This process can also block other readers or writers if you are not in Read Committed Snapshot Isolation mode (or if your readers are not using, ugh, Read Uncommitted). On a busy production system, you might not have the ability to block for as long as it takes to delete a lot of rows.
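If you want to see how much log space a big delete is chewing up, you can check before and after the operation. Here's a quick sketch using the sys.dm_db_log_space_usage DMV (available in SQL Server 2012 and later):

```sql
-- Check transaction log usage for the current database.
-- Run before and after a large delete to see how much log space it consumed.
SELECT
	total_log_size_in_bytes / 1048576.0 AS TotalLogSizeMB,
	used_log_space_in_bytes / 1048576.0 AS UsedLogSpaceMB,
	used_log_space_in_percent AS UsedLogSpacePercent
FROM sys.dm_db_log_space_usage;
```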

One pattern that I’ve taken to is batching deletions:

DECLARE
	@RowCount INT = 0,	-- rows deleted by the most recent batch
	@n INT = 1000;		-- batch size

WHILE (1=1)
BEGIN
	BEGIN TRANSACTION;
	
	DELETE TOP(@n) x
	FROM dbo.MyTable x
	WHERE
		x.MyCriteria = 1;
		
	SET @RowCount = @@ROWCOUNT;
	
	COMMIT TRANSACTION;
	
	-- The last batch came up short, so there's nothing left to delete.
	IF (@RowCount < @n)
	BEGIN
		BREAK;
	END
END

This way, we delete 1000 rows at a time (or whatever you set @n to be). That reduces blocking, giving readers and writers time to sneak in between batches. If you take frequent transaction log backups, you can also keep the log from growing too much, as each batch is its own transaction rather than everything sitting in one big transaction. I've found that performance differs based on a few factors (row size, the number of indexes on the table, and concurrent load, among others), but generally, a batch size between 100 and 10,000 is best. I usually test by order of magnitude, so I'll try 10K, then 1K, and then 100, and see which performs best. Sometimes I'll try 5K or 500 to see if there's a significant difference, but I normally don't go much further than that.
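When I'm comparing batch sizes, it helps to see how long each batch actually takes. Here's a sketch of the same loop with per-batch timing added (same hypothetical table and column as above; the WAITFOR DELAY at the end is optional breathing room for other sessions):

```sql
DECLARE
	@RowCount INT = 0,
	@n INT = 1000,
	@BatchStart DATETIME2(3),
	@Msg NVARCHAR(200);

WHILE (1=1)
BEGIN
	SET @BatchStart = SYSDATETIME();

	BEGIN TRANSACTION;

	DELETE TOP(@n) x
	FROM dbo.MyTable x
	WHERE
		x.MyCriteria = 1;

	SET @RowCount = @@ROWCOUNT;

	COMMIT TRANSACTION;

	-- RAISERROR ... WITH NOWAIT flushes the message to the client
	-- immediately, unlike PRINT, which buffers its output.
	SET @Msg = CONCAT(N'Deleted ', @RowCount, N' rows in ',
		DATEDIFF(MILLISECOND, @BatchStart, SYSDATETIME()), N' ms.');
	RAISERROR(@Msg, 0, 1) WITH NOWAIT;

	IF (@RowCount < @n)
	BEGIN
		BREAK;
	END

	-- Optional: pause briefly between batches to let other work through.
	WAITFOR DELAY '00:00:00.100';
END
```

Run it once per candidate batch size and compare the per-batch messages; the sweet spot is usually where total elapsed time and blocking are both tolerable.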

The biggest downside to this is that if you are not deleting based on a nice index, you may need to scan your table once for each batch. If I need to delete 1 million rows from a billion-row table, SQL Server needs to scan that billion-row table 1000 times (when @n = 1000 rows per batch), and that’s costly. To compensate, you should try to have an appropriate index; at least then you’re seeking against the index.
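For example, if I'm deleting from the hypothetical dbo.MyTable above based on MyCriteria, a filtered index on that predicate (the index name here is made up) lets each batch seek to qualifying rows instead of scanning the whole table:

```sql
-- Hypothetical index: lets each DELETE batch seek to qualifying rows
-- instead of scanning the table. The filter keeps the index small,
-- covering only the rows we intend to delete.
CREATE NONCLUSTERED INDEX IX_MyTable_MyCriteria
ON dbo.MyTable (MyCriteria)
WHERE MyCriteria = 1;
```

Once the cleanup is done, you can drop the index if nothing else benefits from it.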
