Creating proper indexes seems like a craft activity to me: there are rules to follow, and those rules often make sense, but there are so many variations and intricacies to it that it takes years to master, and there simply aren’t many formulas. It’s great to read books on performance by people who really know indexes, but you need to do, analyze, and test so much yourself in order to understand appropriate indexing. It’s not something where you can swoop in, drop off a few nuggets of wisdom, and fly away knowing that your work here is done. In fact, I’m beginning to realize that you really need to be a domain expert, with knowledge of who runs what where, how, and why, in order to optimize performance. Oy.
Here’s a quick example of just how much work there is when it comes to indexing. Let’s say we have one single table. It has a few different fields on it, including three big char columns (which I added on there to take up space):
create table #test
(selectfield int, somebit bit,
 somechar char(500), someotherchar char(500), yetanotherchar char(500));  -- char widths are arbitrary; they exist just to take up space
GO
insert into #test values(1, 1, 'aaa', 'ccc', 'ddd');
insert into #test values(1, 0, 'bbb', 'ccc', 'eee');
insert into #test values(1, 1, 'bbb', 'ccc', 'fff');
GO 333 -- Give us roughly a thousand records.
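If you haven’t seen it before, GO followed by a number is an SSMS/sqlcmd trick: it re-runs the preceding batch that many times, so our three inserts execute 333 times, for 999 rows in total. A quick sanity check:

select count(*) from #test;  -- 999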
At this point, let’s pick a query to run. What we’re going to do is a bit weird, but it’s similar to a question I answered a couple of weeks back.
select
    selectfield,
    case when COUNT(distinct(somebit)) = COUNT(distinct someotherchar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct somechar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = SUM(case when somechar = 'aaa' then 1 else 0 end) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct yetanotherchar) then 1 else 0 end
from #test
group by selectfield;
This is kind of a weird query with a lot of aggregation, and it generates a somewhat complicated but still pretty straightforward query plan:
The query takes 105 ms on my machine, with 5 scans and 1035 logical reads against the test table. Those five table scans, along with 5 sorts, make up almost the entire query cost.
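Those numbers come from SQL Server’s session statistics. If you want to follow along, turning them on before running the query will print elapsed time and per-table logical reads in the Messages tab:

set statistics io on;
set statistics time on;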
So, how do we optimize this? Here is where trial and error comes into play. This table is a heap—there is no clustered index. I’m kind of doing this on purpose, but if there were a good clustered index candidate, that wouldn’t change this too much.
Anyhow, my first thought was to put an index on (somebit). We’re doing a lot of work with somebit, so it seemed that having that as an index might help.
create index ix_somebit on #test(somebit);
Here is the query plan following that change:
Yeah, that looks pretty similar… So clearly, having an index just on somebit isn’t going to cut it. My next thought was to create one on somebit and include selectfield, which would look something like this:
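drop index ix_somebit on #test;
create index ix_somebit on #test(somebit) include(selectfield);

That did a little better, but I followed it up by promoting selectfield into the key itself: an index on selectfield and somebit.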
drop index ix_somebit on #test;
create index ix_somebit on #test(selectfield, somebit);
105 ms, 5 scans and 833 reads of the test table. The time to return isn’t improving, but at least we’re reducing I/O costs. Here’s the new plan:
We’ve replaced one of the five table scans with a much cheaper index scan, thus shaving roughly 20% of the I/O off (1035 reads down to 833, about one table scan’s worth). So that’s an improvement. Next up, I thought about an index on somechar, as it shows up twice in the query. I would include selectfield to help the join.
create index ix_somechar on #test(somechar) include(selectfield);
This drops us down to 68 ms to run, and 5 scans with 567 logical reads. Another improvement. Here’s the plan:
I’m happy that there’s an improvement: instead of taking 24% of the query’s required resources, it’s down to about 15%. But it just seems like there should be a better plan. The thing is that we’re spending a pretty good amount of time with that sort, which is sorting by selectfield. So that got me thinking: maybe if, instead of having the index be on somechar and just including selectfield, we put the index on (selectfield, somechar). That would have our index sorted first by selectfield and should remove the requirement for a sort. To see which index works out better, I’ll leave both on for the moment.
create index ix_somechar2 on #test(selectfield, somechar);
With the new index, we’re back up to 98 ms, but still at 5 scans and 567 logical reads. SQL Server prefers the second index, though, because that means it doesn’t need to do a sort:
It’s easy to understand why, too: instead of 15%, we’re down to 10.5%. So even though the CPU time is a little higher, it seems like we’re on the right track. That’s one of the things I’ve noticed with indexing strategies: sometimes, it seems like you’re taking a step back even when you really are moving forward—you might have a plateau and then a drop in the middle before finding further improvements later. Again, it’s a craft, not a science.
Anyhow, we’ll drop the unused index and then index the last two columns.
drop index ix_somechar on #test;
create index ix_someotherchar on #test(selectfield, someotherchar);
create index ix_yetanotherchar on #test(selectfield, yetanotherchar);
What we see at this point is a series of index scans. Because we’re aggregating all of the rows together, it makes sense to use a scan rather than a seek, so that’s OK. And the percentages have gone up, but the total cost has gone down, which is what we want to see:
With the final stats, I have roughly 70 ms of time (15 ms of CPU, which is a pretty good-sized drop from earlier) and 5 scans for 301 logical reads.
But here’s the part where indexing really becomes a craft: we now have four indexes on this table. Inserting another record results in 67 logical reads and 44 ms of time, as each of those indexes needs to be updated to accommodate the new record. Without the indexes, I need just 1 logical read (to get the current page) and 1 ms of time to add the new record. Here’s the trade-off that we normally think of when indexing: each index improves select performance at the cost of update/insert/delete performance.
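To see that cost directly, insert one more row with statistics still on; the specific values here are arbitrary, just another record:

insert into #test values(2, 0, 'bbb', 'ccc', 'ggg');  -- touches the heap plus all four nonclustered indexes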
But the problem with this type of analysis is that it’s so partial: there are a number of other queries which involve this table, and those should be analyzed as well to see if the indexes have any benefit, or even if they positively harm those other queries. In the world of economics, we talk about how partial equilibrium analysis (the standard supply and demand curves which most people think of when they think of economics) can be helpful, but if you really want to know the full effects, you have to use a general equilibrium model (or, as a shout-out to the Austrians, an analysis of the ERE). Effects which may seem large in a partial equilibrium may be offset by effects on unanalyzed portions of the world, or may be so small in the grand scheme of things as to be irrelevant. Similarly, indexing around a single query, unless that query dominates the database, runs the risk of ignoring pertinent factors and other queries.
Going back to the example at hand, let’s say that the above query is, in fact, the most common (or at least most important) query. But suppose our next-most important query is one that groups on yetanotherchar and counts the distinct somebit values per group; the exact query matters less than its shape, so here’s a hypothetical stand-in:
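select yetanotherchar, COUNT(distinct somebit)
from #test
group by yetanotherchar;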
This query requires 1 scan, 207 logical reads, and 88 ms. The query plan is a table scan followed by a hash match aggregation. But if we had been thinking of this query when building our indexes, we could have improved it as well by adding somebit onto ix_yetanotherchar:
drop index ix_yetanotherchar on #test;
create index ix_yetanotherchar on #test(selectfield, yetanotherchar, somebit);
Now this query takes 6 ms, with 1 scan and 74 reads. The original query doesn’t change from adding this bit, as one extra bit per row isn’t enough to push the index onto additional pages.
So it’s important to understand the whole picture before focusing on a decision, as there may be unintended consequences or points of easy improvement overlooked. Just as a partial equilibrium analysis misses the details of the rest of the world, so does a partial database analysis miss the rest of the story. Unlike academic economics, though, we don’t have the luxury of stripping our databases down to extremely simplified and aggregated models and applying basic calculus to derive first-order conditions which tell us how things, in this overly simplified scenario, should work. There is no general database equilibrium analysis here; all we can do is iterate and hone our skills as craftsmen.