In my last post, I talked about deleting partially-migrated data from an instance. My ideal set-based solution to this problem would be a simple query like:
BEGIN TRANSACTION;

DELETE FROM dbo.MyTable
WHERE MyCriteria = 1;

COMMIT TRANSACTION;
The upside to this is that the query is trivial to write, easy to understand, and set-based. The downside is that if you’re deleting a lot of rows, your SQL Server transaction log can grow until you run out of disk space. The delete can also block other readers and writers unless you are in Read Committed Snapshot Isolation mode (or your readers are using, ugh, Read Uncommitted). On a busy production system, you might not be able to afford blocking for as long as it takes to delete that many rows.
One pattern that I’ve taken to is batching deletions:
DECLARE @RowCount INT = 0,
        @n INT = 1000;

WHILE (1 = 1)
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP(@n) x
    FROM dbo.MyTable x
    WHERE x.MyCriteria = 1;

    SET @RowCount = @@ROWCOUNT;

    COMMIT TRANSACTION;

    IF (@RowCount < @n)
    BEGIN
        BREAK;
    END
END
This way, we delete 1000 records at a time (or whatever you set @n to be). That reduces blocking, giving readers and writers time to sneak in between batches. Because each batch commits as its own transaction rather than everything sitting in one big transaction, frequent transaction log backups can reclaim log space as you go and keep the log from growing too much. I’ve found that performance differs based on a few factors, but generally, a batch size between 100 and 10,000 works best. I usually test by order of magnitude, trying 10K, then 1K, then 100, and seeing which performs best. Sometimes I’ll try 5K or 500 to see if there’s a significant difference, but I normally don’t go much further than that.
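One variation on the loop above, sketched here with my own additions: a short WAITFOR DELAY between batches (the two-second interval is illustrative, not something from the original pattern) gives readers, writers, and log backups explicit room to make progress.

```sql
DECLARE @RowCount INT = 0,
        @n INT = 1000;

WHILE (1 = 1)
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP(@n) x
    FROM dbo.MyTable x
    WHERE x.MyCriteria = 1;

    SET @RowCount = @@ROWCOUNT;

    COMMIT TRANSACTION;

    IF (@RowCount < @n)
    BEGIN
        BREAK;
    END

    -- Hypothetical addition: pause between batches so other sessions
    -- (and log backups) can get in. Tune or remove as your workload allows.
    WAITFOR DELAY '00:00:02';
END
```

The trade-off is total runtime: the delay stretches out the overall deletion, so it suits long-running cleanup on a busy system more than a one-off maintenance window.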
The biggest downside is that if you are not deleting based on a nice index, you may need to scan the table once per batch. If I need to delete 1 million rows from a billion-row table, SQL Server scans that billion-row table 1000 times (when @n = 1000 rows per batch), and that’s costly. To compensate, try to have an appropriate index in place; at least then each batch is seeking against the index rather than scanning.
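As a sketch of that fix (the index name is my own, and the columns match the sample table from earlier), a filtered index covering exactly the deletion predicate lets each batch seek straight to the qualifying rows, and the index shrinks as the rows disappear:

```sql
-- Hypothetical supporting index: the filter matches the DELETE predicate,
-- so each DELETE TOP(@n) batch seeks into only the rows it will remove.
CREATE NONCLUSTERED INDEX IX_MyTable_MyCriteria
    ON dbo.MyTable (MyCriteria)
    WHERE MyCriteria = 1;
```

A plain (unfiltered) index on MyCriteria works too if the column is selective; the filtered version just keeps the index small and avoids maintenance cost on rows you’ll never delete.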