Scores Are Meant To Be Gamed

Eric Moretti has a post criticizing the Klout score.  One of his problems with the score is that it leaves relatively easy avenues for gaming:  you can game it in surreptitious and somewhat unethical ways (having Twitter sock puppets re-tweet whatever you say far and wide), or you can simply post a lot of vacuous stuff that other people re-tweet or forward on.

Any single-number score which people can affect directly leaves open the potential for gaming, simply because the rules are broad enough to exploit.  My favorite example is voting methods:  it’s easy to construct scenarios in which a relatively small, concerted subset of the population can game an election…or can accidentally end up with their lowest-preference result.

In fairness, though, I do like the idea of coming up with these types of scores.  There is a great deal of subjective knowledge and I would not want to discount that in any way, but being able to get a nice first-approximation with a single number is helpful.  When a score is very easy to game (as this one may be), the knowledge derived is lower than what you might find desirable, but in many circumstances, it’s still better than zero knowledge.  If somebody does have a high score, you still need to spend some time filtering out the chaff, but if someone is claiming to be a boffo social media expert but is in Nowheresville on a reasonably objective score, that should at least raise doubts.  With that said, I would not use it strictly as a filtering mechanism (certainly not as a sole filtering mechanism), but think of the alternative:  the stair technique.

Call It What It Is

I hate the current euphemism for taxes:  government “revenue.”  “Revenue” is a nice, happy term that people use to obscure the fact that they’re really talking about taxes.  Why would they do that?  Because people hate taxes.  That’s why it was so funny when President Lisa Simpson talked about a “temporary refund adjustment” (and Bart’s Kinsley gaffe revealed what it really is).  We’re not raising taxes—we’re opening “markets” and increasing “revenues,” just like a business would do!  Well, like a business would do if it got to throw people in jail for not paying whatever it demands…

The fundamental problem going on is not a lack of taxes; it is a surfeit of spending.  If the federal government spent at 2007 levels, there would still be a deficit, but a much smaller one.  If federal spending were taken down to 2000 levels (back when children worked in coal mines and we burned the elderly for fuel), we would see the government at a break-even point.  But to do that requires real cuts in entitlement programs, not a pitiful exercise in punting.

In The Papers: Not Biting The Hand That Feeds You

Suppose that you run a newspaper and one of your primary advertisers is the government.  You then get wind of a corruption scandal involving members of said government.  Do you alienate your sponsor or quash the story?  This is the real question Rafael Di Tella and Ignacio Franceschelli ask in Government Advertising and Media Coverage of Corruption Scandals.


We construct measures of the extent to which the 4 main newspapers in Argentina report government corruption in their front page during the period 1998-2007 and correlate them with the extent to which each newspaper is a recipient of government advertising. The correlation is negative. The size is considerable: a one standard deviation increase in monthly government advertising (0.26 million pesos of 2000) is associated with a reduction in the coverage of the government’s corruption scandals by 0.31 of a front page per month, or 25% of a standard deviation in our measure of coverage. The results are robust to the inclusion of newspaper, month, newspaper*president and individual-corruption scandal fixed effects as well as newspaper*president specific time trends.

The authors survey the four main newspapers in Argentina over the period 1998-2007, focusing on corruption scandals involving government officials (2).  Their theory is that adverse coverage correlates negatively with government funding (2).  Because the government finances the media to a great extent in Argentina, the results are different from those in a place like the US, where the (often left-leaning) media tend to have partisan papers, and where party affiliation affects coverage (3-4) regardless of who is currently in power.

This difference is not simply academic.  In the Argentinian case, 200 tax inspectors were sent to investigate one newspaper the day after a report of corruption within the tax agency was published (5, footnote 6).  Aside from direct threats, there are more indirect methods:  much of the “private” advertising is actually advertising by government-affiliated firms (6-7).  As the authors note, “One of the characteristics of small developing countries is the relatively large influence of the government on business” (7).  I would re-phrase that to say that “one of the characteristics keeping countries underdeveloped is the relatively large influence of government on business.”

At any rate, the authors have a data set of 254 scandals, of which more than 150 were reported on only one paper’s front page (8).  They look at front-page offerings and consider a scandal “buried” if a paper does not print a front-page article regarding that story.  They also found that 256,000 pesos (at 2000 values) led to a drop of half a cover (that is, a 37% drop) in corruption reporting for one month (13).  Non-government corruption coverage, meanwhile, was not affected by government payments (15-16).  On the other hand, there is a positive correlation between corruption coverage and circulation (17).  So newspapers face a financial decision:  dropping one front-page corruption story brings in 560,000 pesos from the government, but each front-page story brings in 1.7 million pesos from subscribers (1.48 million more subscribers, each paying roughly 1.15 pesos apiece).  It would seem as though this should be a no-brainer, but if the marginal cost of a newspaper is 0.77 pesos, the authors note that this would be a break-even point:  0.77 = (1.7 – 0.56) / 1.48.  Actually, the authors have 0.75 pesos, but they use 0.58 in their calculation rather than 0.56, which I believe to be in error.
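To spell out that break-even arithmetic (all figures in millions of 2000 pesos, taken from the numbers above; the decomposition into these three terms is my reading of the calculation):

```latex
\underbrace{1.48 \times 1.15}_{\approx 1.70\ \text{(subscriber revenue)}}
\;-\; \underbrace{0.56}_{\text{lost gov't advertising}}
\;-\; \underbrace{1.48 \times MC}_{\text{marginal printing cost}} \;=\; 0
\quad\Longrightarrow\quad
MC \;=\; \frac{1.70 - 0.56}{1.48} \;\approx\; 0.77\ \text{pesos.}
```

At any marginal cost below 0.77 pesos per paper, printing the front-page story is the profitable choice.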

My quick takeaway:  the best way to reduce corruption is to have a free press and a small, limited government which can neither afford nor be allowed to influence the media.

Tally Tables: Date Checking And Filtering

This next tally table use is something I swiped from Seth Phelabaum, who wrote about it on his blog (where I first saw the idea).  The idea is that, in many circumstances, we receive dates as varchar values in the format YYYYMMDD instead of as dates (YYYY-MM-DD).  Maybe the source systems run validity checks to make sure that only “good” dates (in that YYYYMMDD format) get saved, but honestly, probably not…  So when those values come over to you and you need to filter out or flag the bad dates, a tally table is just the trick.

In this case, I’m going to do this two different ways.  The first way involves using a physical tally table, and the second way involves using a CTE tally table.  We can compare IO and time stats to see which method performs better.

First, some basic setup:

set statistics io on;
set statistics time on;

--Date validation
declare @somedates table
(
	adate varchar(8)
);

insert into @somedates(adate) values
('20111499'),	/* Obviously bad. */
('20111305'),	/* Bad...or is it? */
('ababyyyy'),	/* Obviously bad. */
('9901'),		/* Should be bad. */
('Jun 1 09'),	/* Should be bad. */
('19990501'),	/* Good. */
('19990513');	/* Good. */

DECLARE @StartDate date = '1991-01-01';
DECLARE @EndDate date = '2015-01-01';

So at this point, we have a fixed range of where we’re going to investigate (to keep the number of records below 10,000 and thus let us use a CTE solution) and a set of dates. Some of these are good, and some are bad. Note that we consider a date to be bad if it does not follow the YYYYMMDD format, so if you’re handling a lot of French dates, that “20111305” would actually be good and “20110513” would be bad. Anyhow, now we’ll put together the physical version of this tally table, which I’ve come pretty close to cribbing from the source, although I did use a single CONVERT rather than three casts, so I guess I changed it a little bit…
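The physical version assumes a pre-built numbers table; the I/O stats later on refer to it as sp_tally.  A sketch of how one might be built (the column name n, the clustered key, and the load query are my assumptions about its shape):

```sql
-- Sketch of a physical tally table (name sp_tally comes from the I/O stats below;
-- the column name n and the overall shape are assumptions).
create table dbo.sp_tally
(
	n int not null,
	constraint PK_sp_tally primary key clustered(n)
);

-- Load 0 through 9,999 using the standard cross-join trick against a big catalog view.
insert into dbo.sp_tally(n)
select top (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
from sys.all_columns a
	cross join sys.all_columns b;
```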

with tallyDates as
(
	select
		CONVERT(varchar(8), DATEADD(d, n, @StartDate), 112) as ValidDate
	from dbo.sp_tally
	where
		DATEADD(d, n, @StartDate) <= @EndDate
)
select
	sd.adate
from
	@somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate
where
	td.ValidDate is null;

This is really cool, I think. We’re basically converting our integer-based tally table into a date tally table in the CTE. Then, all we need to do is find the records in @somedates which do not join to a valid date.

Now we’ll try this using a tally table built from a common table expression:

WITH E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
               UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),										--10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b),	--10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b),	--10E+4 or 10,000 rows max
cteTally(N) AS
(
     SELECT TOP (DATEDIFF(dd, @StartDate, @EndDate)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
     FROM E4
),
tallyDates as
(
	select CONVERT(varchar(8), DATEADD(d, N, @StartDate), 112) as ValidDate
	from cteTally
	where DATEADD(d, N, @StartDate) <= @EndDate
)
select sd.adate
from @somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate
where td.ValidDate is null;

The guts of the code are the same, with only the method for creating the table changing.

Performance-wise, the physical method takes 83 ms to run on my machine, with a scan of sp_tally worth 160 logical reads. The CTE version takes just 63 ms to run and eliminates the 160 logical reads of the sp_tally table without introducing any additional reads of its own.

Cost-wise, SQL Server loves the CTE version.  It’s more CPU-heavy (because it needs to build that tally table) but doesn’t hit the slow hard drives, so that’s fine.

The physical version has a nice, pretty query plan.  We’re scanning the tally table, computing a scalar (which is the DATEADD function), and doing a nested loops join to @somedates.  Filter out the null values and we have our results.

The CTE version has a slightly uglier query plan.  Instead of a clustered index scan + compute scalar, we have a more complicated build-up.  After that build-up, though, it still computes the scalar and does a nested loops join to @somedates.  Note that the table scan of @somedates is 80% of the query cost, whereas it is only 6.5% of the physical table’s cost.  Finding a way to remove that tally table IO is great in this case.

Oh, and one final note:  you might think to yourself, “Self, why isn’t this schmuck just using ISDATE()?  I mean, it’s built into SQL Server, so it’s got to work!”  Unfortunately, ISDATE() is very, well, flexible.  Try running the following statements:

select ISDATE('19990501');
select ISDATE('20111499');
select ISDATE('9901');
select ISDATE('Jun 1 09');
select ISDATE('19991305');
select ISDATE('19990513');

In my particular circumstances, the only two statements which return false are the second and fifth.  So this doesn’t filter out dates like ‘9901’ (which probably was supposed to be January 1, 1999 but gets cast to ‘9901-01-01’ instead) or semi-text rows like ‘Jun 1 09’, and it is dependent on the session’s language settings.  Running the same statements on a server set to French would return true for #5 and false for #6.  So if you’re getting the data from your Quebec center, they’ll have the records work just fine when they run ISDATE on ‘19991305’ but it’ll bomb for you when you try to cast that as a date.  This method, though, lets you convert the obviously good dates and figure out or modify the bad ones.
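As that last sentence suggests, the same tally join can do the conversion and the flagging in one pass.  A sketch, reusing @somedates, @StartDate, and @EndDate from the setup above, and assuming a dbo.sp_tally numbers table like the one the physical version scans:

```sql
-- Convert the good dates and leave NULL for the bad ones, in a single pass.
with tallyDates as
(
	select CONVERT(varchar(8), DATEADD(d, n, @StartDate), 112) as ValidDate
	from dbo.sp_tally
	where DATEADD(d, n, @StartDate) <= @EndDate
)
select
	sd.adate,
	CONVERT(date, td.ValidDate, 112) as ConvertedDate	-- NULL marks a bad date.
from @somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate;
```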

The Power Of Good Indexes

Creating proper indexes seems like a craft activity to me:  there are rules to follow, and those rules often make sense, but there are so many variations and intricacies that it takes years to master and there simply aren’t many formulas.  It’s great that people who really know indexes write great books on performance, but you need to do, analyze, and test so much yourself in order to understand appropriate indexing.  It’s not something that you can swoop in on, drop off a few nuggets of wisdom, and fly away knowing that your work here is done.  In fact, I’m beginning to realize that you really need to be a domain expert with knowledge of who runs what where, how, and why, in order to optimize performance.  Oy.

Here’s a quick example of just how much work there is when it comes to indexing.  Let’s say we have one single table.  It has a few different fields on it, including three big char columns (which I added on there to take up space):

create table #test
(
    selectfield int,
    somebit bit,
    somechar char(500),
    someotherchar char(500),
    yetanotherchar char(500)
);
GO

insert into #test values(1, 1, 'aaa', 'ccc', 'ddd');
insert into #test values(1, 0, 'bbb', 'ccc', 'eee');
insert into #test values(1, 1, 'bbb', 'ccc', 'fff');
GO 333 -- Give us roughly a thousand records.

At this point, let’s pick a query to run. What we’re going to do is a bit weird, but it was similar to a question I answered a couple of weeks back.

select
    selectfield,
    case when COUNT(distinct(somebit)) = COUNT(distinct someotherchar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct somechar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = SUM(case when somechar = 'aaa' then 1 else 0 end) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct yetanotherchar) then 1 else 0 end
from #test
group by selectfield;

This is kind of a weird query with a lot of aggregation, and it generates a somewhat complicated but still pretty straightforward query plan:

The query takes 105 ms on my machine and the test table has 5 scans and 1035 logical reads.  There are five table scans and 5 sorts which make up almost the entire query cost.

So, how do we optimize this? Here is where trial and error comes into play. This table is a heap—there is no clustered index. I’m kind of doing this on purpose, but if there were a good clustered index candidate, that wouldn’t change this too much.

Anyhow, my first thought was to put an index on (somebit). We’re doing a lot of work with somebit, so it seemed that having that as an index might help.

create index ix_somebit on #test(somebit);

Here is the query plan following that change:

Yeah, that looks pretty similar…  So clearly, having an index just on somebit isn’t going to cut it.  So my next thought was to create one on somebit and include selectfield.  That does a little better, but I followed that up by creating an index on selectfield and somebit.
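That intermediate attempt isn’t shown above; it would have looked something like this (reusing the ix_somebit name is my assumption):

```sql
-- Hypothetical intermediate step: rebuild ix_somebit to include selectfield.
drop index ix_somebit on #test;
create index ix_somebit on #test(somebit) include(selectfield);
```

This index gets dropped and replaced again in the next step.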

drop index ix_somebit on #test;
create index ix_somebit on #test(selectfield, somebit);

105 ms, 5 scans and 833 reads of the test table.  The time to return isn’t improving, but at least we’re reducing IO costs.  Here’s the new plan:

We’ve replaced one of the table scans with an index scan, thus shaving roughly 16% of the I/O off (because there were 5 table scans).  So that’s an improvement.  Next up, I thought about an index on somechar, as it shows up twice in the query.  I would include selectfield to help the join.

create index ix_somechar on #test(somechar) include(selectfield);

This drops us down to 68 ms to run, and 5 scans with 567 logical reads.  Another improvement.  Here’s the plan:

I’m happy that there’s an improvement:  instead of taking 24% of the query’s required resources, it’s down to about 15%.  But it just seems like there should be a better plan.  The thing is that we’re spending a pretty good amount of time with that sort, which is sorting by selectfield.  So that got me thinking:  maybe if, instead of having the index be on somechar and just including selectfield, we put the index on (selectfield, somechar).  That would have our index sorted first by selectfield and should remove the requirement for a sort.  To see which index works out better, I’ll leave both on for the moment.

create index ix_somechar2 on #test(selectfield, somechar);

With the new index, we’re back up to 98 ms, but still at 5 scans and 567 logical reads.  SQL Server prefers the second index, though, because that means it doesn’t need to do a sort:

It’s easy to understand why, too:  instead of 15%, we’re down to 10.5%.  So even though the CPU time is a little higher, it seems like we’re on the right track.  That’s one of the things I’ve noticed with indexing strategies:  sometimes, it seems like you’re taking a step back even when you really are moving forward—you might have a plateau and then a drop in the middle before finding further improvements later.  Again, it’s a craft, not a science.

Anyhow, we’ll drop the unused index and then index the last two columns.

drop index ix_somechar on #test;
create index ix_someotherchar on #test(selectfield, someotherchar);
create index ix_yetanotherchar on #test(selectfield, yetanotherchar);

What we see at this point is a series of index scans.  Because we’re aggregating all of the rows together, it makes sense to use a scan rather than a seek, so that’s OK.  And the percentages have gone up, but the total cost has gone down, which is what we want to see:

With the final stats, I have roughly 70 ms of time (15 ms of CPU, which is a pretty good-sized drop from earlier) and 5 scans for 301 logical reads.

But here’s the part where indexing really becomes a craft:  we now have four indexes on this table.  Inserting another record results in 67 logical reads and 44 ms of time, as each of those indexes needs to be updated to accommodate the new record.  Without the indexes, I need just 1 logical read (to get the current page) and 1 ms of time to add the new record.  Here’s the trade-off that we normally think of when indexing:  each index improves select performance at the cost of update/insert/delete performance.
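To see those insert costs yourself, run a single insert with the statistics flags from the top of the section still on (the row values here are purely illustrative):

```sql
-- Measure the write cost of one insert while all four nonclustered indexes exist.
-- Compare the logical reads against the same insert on an unindexed heap.
insert into #test values(2, 0, 'zzz', 'yyy', 'xxx');
```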

But the problem with this type of analysis is that it’s so partial:  there are a number of other queries which involve this table, and those should be analyzed as well to see if the indexes have any benefit, or even if they positively harm those other queries.  In the world of economics, we talk about how partial equilibrium analysis (the standard supply and demand curves which most people think of when they think of economics) can be helpful, but if you really want to know the full effects, you have to use a general equilibrium model (or, as a shout-out to the Austrians, an analysis of the ERE).  Effects which may seem large in a partial equilibrium may be offset by effects on unanalyzed portions of the world, or may be so small in the grand scheme of things as to be irrelevant.  Similarly, indexing around a single query—unless that query dominates the database—runs the risk of ignoring pertinent factors and other queries.

Going back to the example at hand, let’s say that the above query is, in fact, the most common (or at least most important) query.  But our next-most important query is:

select distinct
    selectfield,
    yetanotherchar,
    somebit
from #test;
This query requires 1 scan, 207 logical reads, and 88 ms. The query plan is a table scan and then a hash match aggregation. But if we were thinking of this query when building our indexes, we could have improved this query as well by adding somebit onto ix_yetanotherchar:

drop index ix_yetanotherchar on #test;
create index ix_yetanotherchar on #test(selectfield, yetanotherchar, somebit);

Now this query takes 6 ms, with 1 scan and 74 logical reads. The other query doesn’t change after adding this bit column, as there aren’t enough rows to force the creation of new index pages as a result of its inclusion.

So it’s important to understand the whole picture before focusing on a decision, as there may be unintended consequences or points of easy improvement overlooked.  Just as a partial equilibrium analysis misses the details of the rest of the world, so does a partial database analysis miss the rest of the story.  Unlike academic economists, though, we don’t have the luxury of stripping our databases down to extremely simplified and aggregated models and applying basic calculus to derive first-order conditions which tell us how things, in this overly simplified scenario, should work.  There is no general database equilibrium analysis here; all we can do is iterate and hone our skills as craftsmen.