36 Chambers – The Legendary Journeys: Execution to the max!

July 31, 2011

Scores Are Meant To Be Gamed

Filed under: Curmudgeonliness — Kevin Feasel @ 8:55 pm

Eric Moretti has a post criticizing the Klout score.  One of his problems with the score is that there are relatively easy avenues for gaming it:  you can do it in surreptitious and somewhat unethical ways (having Twitter sock puppets re-tweet whatever you say far and wide), or you can simply post a lot of vacuous stuff that other people re-tweet or forward on.

Any single-number score which people can affect directly leaves open the potential for gaming, simply because the rules are broad enough.  My favorite example is voting methods:  it’s easy to come up with scenarios in which a relatively small, concerted subset of the population can game an election…or can accidentally end up with their lowest-preference result.

In fairness, though, I do like the idea of coming up with these types of scores.  There is a great deal of subjective knowledge and I would not want to discount that in any way, but being able to get a decent first approximation from a single number is helpful.  When a score is very easy to game (as this one may be), the knowledge derived is lower than what you might find desirable, but in many circumstances, it’s still better than zero knowledge.  If somebody does have a high score, you still need to spend some time filtering out the chaff; but if someone claims to be a boffo social media expert yet sits in Nowheresville on a reasonably objective score, that should at least raise doubts.  With that said, I would not use it strictly as a filtering mechanism (certainly not as a sole filtering mechanism), but think of the alternative:  the stair technique.

July 30, 2011

Call It What It Is

Filed under: Curmudgeonliness — Kevin Feasel @ 10:05 am

I hate the euphemism currently in vogue for taxes:  government “revenue.”  “Revenue” is a nice, happy term that people use to obscure the fact that they’re really talking about taxes.  Why would they do that?  Because people hate taxes.  That’s why it was so funny when President Lisa Simpson talked about a “temporary refund adjustment” (and Bart’s Kinsley gaffe revealed what it really is).  We’re not raising taxes; we’re opening “markets” and increasing “revenues,” just like a business would do!  Well, like a business would do if it got to throw people in jail for not paying whatever it demands…

The fundamental problem is not a lack of taxes; it is a surfeit of spending.  If the federal government spent at 2007 levels, there would still be a deficit, but a much smaller one.  If federal spending were taken down to 2000 levels (back when children worked in coal mines and we burned the elderly for fuel), the government would be at roughly a break-even point.  But getting there requires real cuts in entitlement programs, not a pitiful exercise in punting.

PyOhio

Filed under: Programming & Work — Kevin Feasel @ 8:00 am

Today is the beginning of PyOhio 2011.  I’m planning to make it for day 1 this year (which means today—I’m writing this early), so I’ll have an update after the conference ends.

July 29, 2011

A Trillion Here, A Trillion There

Filed under: Curmudgeonliness — Kevin Feasel @ 4:46 pm

People don’t understand the concept of a “trillion.”  It’s not a number we can deal with.  Even a million is too big for popular understanding.  This is what a trillion dollars looks like.

In The Papers: Not Biting The Hand That Feeds You

Filed under: Curmudgeonliness, Economics, In The Papers — Kevin Feasel @ 3:53 pm

Suppose that you run a newspaper and one of your primary advertisers is the government.  You then get wind of a corruption scandal involving members of said government.  Do you alienate your sponsor or quash the story?  This is the real question Rafael Di Tella and Ignacio Franceschelli ask in Government Advertising and Media Coverage of Corruption Scandals.

Abstract:

We construct measures of the extent to which the 4 main newspapers in Argentina report government corruption in their front page during the period 1998-2007 and correlate them with the extent to which each newspaper is a recipient of government advertising. The correlation is negative. The size is considerable: a one standard deviation increase in monthly government advertising (0.26 million pesos of 2000) is associated with a reduction in the coverage of the government’s corruption scandals by 0.31 of a front page per month, or 25% of a standard deviation in our measure of coverage. The results are robust to the inclusion of newspaper, month, newspaper*president and individual-corruption scandal fixed effects as well as newspaper*president specific time trends.

The authors survey the four main newspapers in Argentina over the period 1998-2007 and focus on corruption scandals involving government officials (2).  Their theory is that adverse coverage correlates negatively with government funding (2).  Because the government finances so much of the media in Argentina, the results differ from what you would find someplace like the US, where the (often left-leaning) media tend toward partisan papers and party affiliation affects coverage (3-4), regardless of who is currently in power.

This difference is not simply academic.  In the Argentinian case, 200 tax inspectors were sent to investigate one newspaper the day after a report of corruption within the tax agency was published (5, footnote 6).  Aside from direct threats, there are more indirect methods:  much of the “private” advertising is actually advertising by government-affiliated firms (6-7).  As the authors note, “One of the characteristics of small developing countries is the relatively large influence of the government on business” (7).  I would re-phrase that to say that “one of the characteristics keeping countries underdeveloped is the relatively large influence of government on business.”

At any rate, the authors have a data set of 254 scandals, of which more than 150 were reported on only one paper’s front page (8).  They looked at front-page coverage and consider a scandal “buried” if a paper never prints a front-page article about it.  They also found that 256,000 pesos of government advertising (at 2000 values) led to a drop of half a front page (that is, a 37% drop) in corruption reporting for one month (13).  Non-government corruption coverage, meanwhile, was not affected by government payments (15-16).  On the other hand, there is a positive correlation between corruption coverage and circulation (17).  This leaves newspaper companies with a financial decision:  dropping one front-page corruption story brings in 560,000 pesos from the government, but each front-page story brings in 1.7 million pesos from subscribers (1.48 million more subscribers, each paying roughly 1.15 pesos).  It would seem as though this should be a no-brainer, but the authors note that a marginal cost of 0.77 pesos per copy would be the break-even point:  0.77 = (1.7 – 0.56) / 1.48.  Actually, the authors have 0.75 pesos, but they use 0.58 in their calculation rather than 0.56, which I believe to be in error.
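
Spelling out that break-even arithmetic (my own restatement of the figures above, not the authors’ notation):

\[ \text{break-even marginal cost per copy} = \frac{\text{subscriber revenue} - \text{lost government advertising}}{\text{additional copies sold}} = \frac{1.7 - 0.56}{1.48} \approx 0.77 \text{ pesos} \]

Plugging in the authors’ 0.58 instead of 0.56 gives roughly 0.76, which lines up with their reported 0.75.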

My quick takeaway:  the best way to reduce corruption is to have a free press and a small, limited government which can neither afford nor be allowed to influence the media.

July 28, 2011

Tally Tables: Date Checking And Filtering

Filed under: Database Administration — Kevin Feasel @ 6:06 pm

This next tally table use is something I swiped from Seth Phelabaum, who wrote about it on his blog.  The idea is that, in many circumstances, we receive dates as varchar values in the format YYYYMMDD rather than as actual date values.  Maybe the source systems validate the data to make sure that only “good” dates (in that YYYYMMDD format) get saved, but honestly, probably not…  So when those values come over to you and you need to filter out or flag the bad dates, a tally table is just the trick.

In this case, I’m going to do this two different ways.  The first way involves using a physical tally table, and the second way involves using a CTE tally table.  We can compare IO and time stats to see which method performs better.

First, some basic setup:

set statistics io on;
set statistics time on;

--Date validation
declare @somedates table
(
	adate varchar(8)
);

insert into @somedates(adate) values
('19990501'),
('20111499'),	/* Obviously bad. */
('20090101'),
('20111305'),	/* Bad...or is it? */
('ababyyyy'),	/* Obviously bad. */
('9901'),		/* Should be bad. */
('Jun 1 09'),	/* Should be bad. */
('20030419');

DECLARE @StartDate date = '1991-01-01';
DECLARE @EndDate date = '2015-01-01';

So at this point, we have a fixed date range to investigate (chosen to keep the number of days below 10,000 and thus let us use a CTE solution) and a set of dates. Some of these are good, and some are bad. Note that we consider a date to be bad if it does not follow the YYYYMMDD format, so if you’re handling a lot of French dates, that “20111305” would actually be good and “20110513” would be bad. Anyhow, now we’ll put together the physical version of this tally table solution, which I’ve come pretty close to cribbing from the source, although I did use a single CONVERT rather than three casts, so I guess I changed it a little bit…

with tallyDates as
(
	SELECT
		CONVERT(varchar(8), DATEADD(d, n, @StartDate), 112) as ValidDate
	FROM
		sp_Tally
	WHERE
		DATEADD(d, n, @StartDate) <= @EndDate
)
SELECT
	sd.*
FROM
	@somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate
WHERE
	td.ValidDate is null;

This is really cool, I think. We’re basically converting our integer-based tally table into a date tally table in the CTE. Then, all we need to do is find the records in @somedates which do not join to a valid date.
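
One caveat:  the physical version assumes you already have a permanent tally table named sp_Tally with an integer column n; the post doesn’t create it.  If you need one, a minimal sketch along these lines would do (the constraint name and the row generator are my own choices; any number generator you like works just as well):

create table dbo.sp_Tally
(
	n int not null,
	constraint PK_sp_Tally primary key clustered(n)
);

-- Fill it with 0 through 9,999.  Including 0 matters here, because the date
-- validation query adds n days to @StartDate and needs @StartDate itself.
insert into dbo.sp_Tally(n)
select top (10000)
	ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
from sys.all_columns ac1
	cross join sys.all_columns ac2;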

Now we’ll try this using a tally table built from a common table expression:

WITH E1(N) AS
(
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
	SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),										--10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b),	--10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b),	--10E+4 or 10,000 rows max
cteTally(N) AS
(
     SELECT 0 UNION ALL
     SELECT TOP (DATEDIFF(dd, @StartDate, @EndDate))
					ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
		FROM E4
),
tallyDates as
(
	SELECT
		CONVERT(varchar(8), DATEADD(d, n, @StartDate), 112) as ValidDate
	FROM
		cteTally
	WHERE
		DATEADD(d,n,@StartDate) <= @EndDate
)
SELECT
	sd.*
FROM
	@somedates sd
	left outer join tallyDates td on sd.adate = td.ValidDate
WHERE
	td.ValidDate is null;

The guts of the code are the same, with only the method for creating the table changing.

Performance-wise, the physical method takes 83 ms to run on my machine, with a scan of sp_tally worth 160 logical reads. The CTE version takes just 63 ms to run and eliminates the 160 logical reads of the sp_tally table without introducing any additional reads of its own.

Cost-wise, SQL Server loves the CTE version.  It’s more CPU-heavy (because it needs to build that tally table) but doesn’t hit the slow hard drives, so that’s fine.

The physical version has a nice, pretty query plan.  We’re scanning the tally table, computing a scalar (which is the DATEADD function), and doing a nested loops join to @somedates.  Filter out the null values and we have our results.

The CTE version has a slightly uglier query plan.  Instead of a clustered index scan + compute scalar, we have a more complicated build-up.  After that build-up, though, it still computes the scalar and does a nested loops join to @somedates.  Note that the table scan of @somedates is 80% of this query’s cost, whereas it was only 6.5% of the cost in the physical-table version.  Finding a way to remove that tally table IO is a big win in this case.

Oh, and one final note:  you might think to yourself, “Self, why isn’t this schmuck just using ISDATE()?  I mean, it’s built into SQL Server, so it’s got to work!”  Unfortunately, ISDATE() is very, well, flexible.  Try running the following statements:

select ISDATE('19990501');
select ISDATE('20111499');
select ISDATE('9901');
select ISDATE('Jun 1 09');
select ISDATE('19991305');
select ISDATE('19990513');

In my particular circumstances, the only two statements which return false are the second and the fifth.  So ISDATE() doesn’t filter out values like ‘9901’ (which probably was supposed to be January 1, 1999 but gets cast to ‘9901-01-01’ instead) or semi-text rows like ‘Jun 1 09’, and it is sensitive to language and date-format settings.  Running the same statements in a session with French language settings would return true for #5 and false for #6.  So if you’re getting the data from your Quebec center, the records will pass ISDATE() on ‘19991305’ just fine on their end, but it’ll bomb for you when you try to cast that as a date.  This method, though, lets you convert the obviously good dates and figure out or modify the bad ones.
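
As a final sketch of my own (run in the same batch as the setup above, so the table variable and the date variables are still in scope), the same tally join can also hand back the clean conversions rather than just flagging the bad rows:

with tallyDates as
(
	SELECT
		CONVERT(varchar(8), DATEADD(d, n, @StartDate), 112) as ValidDate
	FROM
		sp_Tally
	WHERE
		DATEADD(d, n, @StartDate) <= @EndDate
)
SELECT
	sd.adate,
	CONVERT(date, sd.adate, 112) as ConvertedDate	-- Safe here: only values which matched a real YYYYMMDD string survive the join.
FROM
	@somedates sd
	inner join tallyDates td on sd.adate = td.ValidDate;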

July 27, 2011

The Power Of Good Indexes

Filed under: Database Administration — Kevin Feasel @ 9:26 pm

Creating proper indexes seems like a craft activity to me:  there are rules to follow, and those rules often make sense, but there are so many variations and intricacies that it takes years to master and there simply aren’t many formulas.  It’s great to read books on performance by people who really know indexes, but you need to do, analyze, and test so much yourself in order to understand appropriate indexing.  It’s not something that you can swoop in on, drop off a few nuggets of wisdom, and fly away knowing that your work here is done.  In fact, I’m beginning to realize that you really need to be a domain expert with knowledge of who runs what where, how, and why, in order to optimize performance.  Oy.

Here’s a quick example of just how much work there is when it comes to indexing.  Let’s say we have one single table.  It has a few different fields, including three big char columns (which I added just to take up space):

create table #test
(
    selectfield int,
    somebit bit,
    somechar char(500),
    someotherchar char(500),
    yetanotherchar char(500)
);

GO

insert into #test values(1, 1, 'aaa', 'ccc', 'ddd');
insert into #test values(1, 0, 'bbb', 'ccc', 'eee');
insert into #test values(1, 1, 'bbb', 'ccc', 'fff');
GO 333 -- Give us roughly a thousand records.

At this point, let’s pick a query to run. What we’re going to do is a bit weird, but it was similar to a question I answered a couple of weeks back.

select
    selectfield,
    case when COUNT(distinct(somebit)) = COUNT(distinct someotherchar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct somechar) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = SUM(case when somechar = 'aaa' then 1 else 0 end) then 1 else 0 end,
    case when COUNT(distinct(somebit)) = COUNT(distinct yetanotherchar) then 1 else 0 end
from #test
group by selectfield;

This is kind of a weird query with a lot of aggregation, and it generates a somewhat complicated but still pretty straightforward query plan:

The query takes 105 ms on my machine and the test table has 5 scans and 1035 logical reads.  There are five table scans and 5 sorts which make up almost the entire query cost.

So, how do we optimize this? Here is where trial and error comes into play. This table is a heap—there is no clustered index. I’m kind of doing this on purpose, but if there were a good clustered index candidate, that wouldn’t change this too much.

Anyhow, my first thought was to put an index on (somebit). We’re doing a lot of work with somebit, so it seemed that having that as an index might help.

create index ix_somebit on #test(somebit);

Here is the query plan following that change:

Yeah, that looks pretty similar…  So clearly, having an index just on somebit isn’t going to cut it.  So my next thought was to create one on somebit and include selectfield.  That does a little better, but I followed that up by creating an index on selectfield and somebit.
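
For reference, that intermediate attempt isn’t shown in the post; a sketch of it would look something like this (the name ix_somebit_incl is mine, and it gets dropped again before moving on):

-- Hypothetical intermediate index: key on somebit, with selectfield included.
create index ix_somebit_incl on #test(somebit) include(selectfield);
-- ...check the plan and stats, then clean up before the next attempt:
drop index ix_somebit_incl on #test;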

drop index ix_somebit on #test;
create index ix_somebit on #test(selectfield, somebit);

105 ms, 5 scans and 833 reads of the test table.  The time to return isn’t improving, but at least we’re reducing IO costs.  Here’s the new plan:

We’ve replaced one of the table scans with an index scan, thus shaving roughly 16% of the I/O off (because there were 5 table scans).  So that’s an improvement.  Next up, I thought about an index on somechar, as it shows up twice in the query.  I would include selectfield to help the join.

create index ix_somechar on #test(somechar) include(selectfield);

This drops us down to 68 ms to run, and 5 scans with 567 logical reads.  Another improvement.  Here’s the plan:

I’m happy that there’s an improvement:  instead of taking 24% of the query’s required resources, it’s down to about 15%.  But it just seems like there should be a better plan.  The thing is that we’re spending a pretty good amount of time with that sort, which is sorting by selectfield.  So that got me thinking:  maybe if, instead of having the index be on somechar and just including selectfield, we put the index on (selectfield, somechar).  That would have our index sorted first by selectfield and should remove the requirement for a sort.  To see which index works out better, I’ll leave both on for the moment.

create index ix_somechar2 on #test(selectfield, somechar);

With the new index, we’re back up to 98 ms, but still at 5 scans and 567 logical reads.  SQL Server prefers the second index, though, because that means it doesn’t need to do a sort:

It’s easy to understand why, too:  instead of 15%, we’re down to 10.5%.  So even though the CPU time is a little higher, it seems like we’re on the right track.  That’s one of the things I’ve noticed with indexing strategies:  sometimes, it seems like you’re taking a step back even when you really are moving forward—you might have a plateau and then a drop in the middle before finding further improvements later.  Again, it’s a craft, not a science.

Anyhow, we’ll drop the unused index and then index the last two columns.

drop index ix_somechar on #test;
create index ix_someotherchar on #test(selectfield, someotherchar);
create index ix_yetanotherchar on #test(selectfield, yetanotherchar);

What we see at this point is a series of index scans.  Because we’re aggregating all of the rows together, it makes sense to use a scan rather than a seek, so that’s OK.  And the percentages have gone up, but the total cost has gone down, which is what we want to see:

With the final stats, I have roughly 70 ms of time (15 ms of CPU, which is a pretty good-sized drop from earlier) and 5 scans for 301 logical reads.

But here’s the part where indexing really becomes a craft:  we now have four indexes on this table.  Inserting another record results in 67 logical reads and 44 ms of time, as each of those indexes needs to be updated to accommodate the new record.  Without the indexes, I need just 1 logical read (to get the current page) and 1 ms of time to add the new record.  Here’s the trade-off that we normally think of when indexing:  each index improves select performance at the cost of update/insert/delete performance.
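
If you want to reproduce that write penalty, a quick sketch like the following shows the extra work per insert (the inserted values are arbitrary; statistics output appears in the Messages tab):

set statistics io on;
set statistics time on;

-- With all four indexes in place, this single-row insert has to maintain each of them,
-- which is where the extra logical reads and elapsed time come from.
insert into #test values(2, 0, 'ggg', 'hhh', 'iii');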

But the problem with this type of analysis is that it’s so partial:  there are a number of other queries which involve this table, and those should be analyzed as well to see whether the indexes help them or positively harm them.  In the world of economics, we talk about how partial equilibrium analysis (the standard supply and demand curves which most people think of when they think of economics) can be helpful, but if you really want to know the full effects, you have to use a general equilibrium model (or, as a shout-out to the Austrians, an analysis of the ERE).  Effects which may seem large in a partial equilibrium may be offset by effects on unanalyzed portions of the world, or may be so small in the grand scheme of things as to be irrelevant.  Similarly, indexing around a single query, unless that query dominates the database, runs the risk of ignoring pertinent factors and other queries.

Going back to the example at hand, let’s say that the above query is, in fact, the most common (or at least important) query.  But our next-most important query is

select distinct
	selectfield,
	somebit,
	yetanotherchar
from
	#test;

This query requires 1 scan, 207 logical reads, and 88 ms. The query plan is a table scan and then a hash match aggregation. But if we were thinking of this query when building our indexes, we could have improved this query as well by adding somebit onto ix_yetanotherchar:

drop index ix_yetanotherchar on #test;
create index ix_yetanotherchar on #test(selectfield, yetanotherchar, somebit);

Now this query takes 6 ms, with 1 scan and 74 logical reads. The other query doesn’t change by adding this bit column, as there aren’t enough rows to require creating a new page as a result of its inclusion.

So it’s important to understand the whole picture before focusing on a decision, as there may be unintended consequences or easy improvements overlooked.  Just as a partial equilibrium analysis misses the details of the rest of the world, so does a partial database analysis miss the rest of the story.  Unlike academic economics, though, we don’t have the luxury of stripping our databases down to extremely simplified and aggregated models and applying basic calculus to derive first-order conditions which tell us how things, in this overly simplified scenario, should work.  There is no general database equilibrium analysis here; all we can do is iterate and hone our skills as craftsmen.

Reducing Tax Burdens One Way Or Another

Filed under: Curmudgeonliness — Kevin Feasel @ 7:11 pm

I don’t think that this is quite the endemic problem that Steve Sailer makes it out to be, but he does point out that Microsoft has found a way to reduce its tax burden considerably by claiming, essentially, that it is a Puerto Rican, Irish, and Singaporean manufacturing company that just happens to do a little bit of work in the US.

Why do they do this?  Because the top marginal corporate tax rate in the US is the second-highest in the civilized world, with only Japan ranking higher.  Reducing these rates would likely increase compliance among firms which have the option of setting up tax shelters abroad.

July 26, 2011

Powershell: Export-Csv

Filed under: Database Administration, Powershell — Kevin Feasel @ 5:22 pm

I have started to use Powershell at work to solve database administration problems:  scripting out jobs to run automatically, using SMO to do all kinds of awesome stuff, that kind of thing.  Another good use I found for it was exporting very large data sets.  I was asked by one of the research folks at work to create a delimited file with a pretty large amount of data:  roughly 240 MB.  Although I did protest and try to get that person to use the data warehouse, there were some time constraints and a lack of T-SQL knowledge which prevented that from happening.

Anyhow, my first thought was to use Matt Whitfield’s SQL Everywhere, which is a nice tool for SQL developers and something I’m trying to use more often than I currently do.  It has a great right-click -> export as CSV feature which works just fine.  Unfortunately, it doesn’t appear to support exporting with any other delimiter, and running the query in text mode (where I can set a delimiter) caused a System.OutOfMemoryException; it’s a pretty hefty dataset, taking up about 950 MB of RAM.

So what I did was to use Powershell to export the data.  First of all, this requires SQLPSX, which you really should get if you have anything to do with databases.  In the code sample below, I’ll connect to my local SQL server instance and run a query from AdventureWorks2008.  Nothing fancy, but the query will then be exported as a tab-delimited file called C:\Temp\output.txt.  I’m assuming that you have all of this set up on your end—it would be better for me to do various data and directory checks and the like, but this is a really simple demo.

$server = Get-SqlServer "."
$db = "AdventureWorks2008"
$query =
"
	select
		s.Name as StoreName,
		soh.SalesOrderNumber,
		sod.LineTotal,
		sod.OrderQty,
		p.Name as ProductName
	from
		Sales.SalesOrderHeader soh
		inner join Sales.SalesOrderDetail sod on soh.SalesOrderID = sod.SalesOrderID
		inner join Production.Product p on sod.ProductID = p.ProductID
		inner join Sales.Customer sc on soh.CustomerID = sc.CustomerID
		inner join Sales.Store s on sc.StoreID = s.BusinessEntityID
"

Get-SqlData $server $db $query | Export-Csv -Delimiter "`t" -Path "C:\Temp\output.txt" -NoTypeInformation

Here are a couple of quick notes.  First, I had a little bit of trouble finding the delimiter for tabs:  it’s `t (a backtick followed by t).  Second, -NoTypeInformation removes a type declaration at the top of your file.  For this query, if you don’t include that parameter, the file will include a line at the top which reads “#TYPE System.Data.DataRow”, since it’s exporting a series of data rows returned from SQL Server.

If you find yourself needing to export data regularly to delimited files, but the file structures change too often to build an SSIS package around it, then this is an easy way to do the job—even easier than copy-pasting into Excel, folks.

July 25, 2011

EPD: Popular Follies Of Great Cities

Filed under: Wacky Theories — Kevin Feasel @ 4:57 pm

We are pretty close to the end here, having reached chapter 13 of Extraordinary Popular Delusions and the Madness of Crowds.  It starts off kind of interesting and blasts bleeding-hearts:  “The miseries of us poor earth-dwellers gain no alleviation from the sympathy of those who merely hunt them out to be pathetic over them.  The weeping philosopher too often impairs his eyesight by his woe, and becomes unable from his tears to see the remedies for the evils which he deplores.  Thus it will often be found that the man of no tears is the truest philanthropist, as he is the best physician who wears a cheerful face, even in the worst of cases” (619).  But this chapter is really about the silly things which arise when large groups of people live together, specifically faddish slang.  The phrases can be nonsensical (“Quoz”), insulting (“Does your mother know you’re out?”), or outright baffling (“There he goes with his eye out!”), but in all cases, they burn brightly only to die out within a few months.

So, in other words, internet memes were around long before the internet…
