A professor claims that Stalin was not a mass murderer. Is there a charitable explanation here, or is this further proof of Orwell’s admonition that there exist ideas so foolish that only an intellectual could believe them?
Why are Republicans in secret negotiations with President Obama? Actually, I should rephrase that: why is John Boehner (seemingly alone) in secret negotiations with President Obama? Secret negotiations are a fool’s game here. Republicans should be forthright and open in discussion, given the importance of this issue and the fact that they tend to get rolled when making these backroom deals.
An interesting question showed up on Ask SQL Server Central last week, and I wanted to blog my answer, mostly so I can show pretty pictures and get credit for actually having a DBA-related post once in a while…
The poster had a relatively simple problem: a view was returning an error which read “Invalid length parameter passed to the LEFT or SUBSTRING function” even though the operation was actually correct. Fortunately, the poster put together a nice setup, so I can steal it here and display the queries formatted as I tend to like them:
CREATE TABLE #contact (personID int, email varchar(150), type varchar(20)) CREATE TABLE #t (personID int, type varchar(20)) INSERT INTO #contact (personID, email, type) SELECT 1, 'firstname.lastname@example.org', 'student' UNION SELECT 2, 'email@example.com', 'teacher' UNION SELECT 3, '', 'teacher' UNION SELECT 4, null, 'teacher' INSERT INTO #t(personID, type) SELECT 1, 'student' UNION SELECT 2, 'teacher' UNION SELECT 3, 'teacher' UNION SELECT 4, 'teacher' --This works fine.... SELECT * FROM ( SELECT COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName FROM #contact c INNER JOIN #t t ON c.personID=t.personID WHERE c.type='Student' ) subQuery WHERE subQuery.UserName <> ''; --This generates an error.... SELECT * FROM ( SELECT COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName FROM #contact c INNER JOIN #t t ON c.personID=t.personID WHERE t.type='Student' ) subQuery WHERE subQuery.UserName <> '';
What’s interesting here is that the two queries are basically the same. The only difference is that the first, working query filters where c.type is equal to Student, and the second, non-working query filters where t.type is equal to Student. The reason the query fails is because some of the #contact records don’t actually have @ signs in them, and so charindex returns -1 and you’re passing a -2 into the substring function, which causes you to see the error message.
But wait, our query clearly says only to get the rows that are marked as Student, so this should just be getting valid rows (i.e., rows in which there is a valid e-mail address). Thus, everything should be unicorns and pixie dust…right? Well, let’s dig into the inner workings a bit and come up with the entire answer.
Investigating the Execution Plans
Let’s generate some estimated execution plans. There are two reasons that I’m looking at an estimated plan here. The first one is that the estimated and actual plans will both contain all of the necessary information for solving this problem. More importantly, we can’t get the broken query’s actual execution plan: it fails before completion, and so no plan gets generated for us to see. Anyhow, to generate an execution plan, press Ctrl-L. This will give you an Execution plan tab with the different activities we’re performing. The important part here is the last two sets of plans, because they relate to the two queries that we’re interested in. The estimated execution plans for these two queries look the same:
This indicates that there’s nothing really wacky. What we have is a scan of the #contact table, followed by a computed column (which gives us the UserName column), a scan of #t, a hash match, and finally a SELECT to give us the result set. It’s a fairly standard query plan. But execution plans are like snitches: they aren’t forthright with information; you need to rough them up a bit to get the good stuff. So let’s do just that.
Interrogating the Execution Plans
What we’re looking for is the reason we’re getting bad rows even though our query clearly says that we want to ignore them. The obvious place to look is the #contact table scan.
Let’s start with the working plan first, to see what’s right. Then we can compare that to the other plan and see where the difference lies. In SQL Server Management Studio, click on the #contact Table Scan icon and press F4 to go to the Properties window. This gives us a whole host of information. A lot of it is useful, but for this problem, we’re looking for the Predicate.
Click on the predicate text to see the […] symbol. Click that button and you’ll get a window with the full predicate. It should look something like this:
#contact.[type] as [c].[type]='Student' AND CASE WHEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) IS NOT NULL THEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) ELSE '' END<>''
This looks pretty familiar. The first part of the statement is a filter on the type, showing us only students. The second part is a messier version of subQuery.UserName. We want to get the usernames which are not blank, and username is defined as
COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '')
So the predicate makes some sense here. So now let’s look at the estimated plan of our other query. Remember that the pictures and costs looked the same in SQL Server Management Studio, but that doesn’t mean that everything is the same. Once again, we’ll go to the Predicate for our #contact table.
In this case, the predicate is different. Instead of checking where Type = “Student” and UserName is not blank, we’re only checking cases in which UserName is blank.
CASE WHEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) IS NOT NULL THEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) ELSE '' END<>''
So what happened to that check for type? Well, if you look at the second query, you can see that the check for type is now on #t. You can verify this by checking out the Predicate for the #t Table Scan:
From this, you can clearly see the filter on #t, winnowing down the list to students alone.
Putting It All Together
Okay, so how does that help us? Well, let’s look back at the execution plans again. In both cases, we have a hash match join between #contact and #type, as well as a filter on #contact that eliminates blank usernames. The primary difference between these two query plans is where the type filter goes: on #t or on #contact. This is vital to us because the optimizer wants us to filter out blank usernames during the scan, rather than after the join is complete. In normal circumstances, this is a great thing: any row you throw out during the scanning portion is a row that you don’t need to join or transform or perform some costly activity against. Unfortunately, in this case, the filter actually depends upon another filter being run first: the filter to get rid of contacts with the wrong type. That works out in this case with both on the same #contact table, but once you move the filter over to the other table, you no longer guarantee that the first filter runs and we get only valid rows back.
So given that, here’s a thought experiment: what if we kept the filter on #t but forced a particular join order, making sure that #t goes first? Then we could throw out all of the non-joined rows and not have to worry about our problem, right?
Well, one way to force join order is to specify a join hint. Let’s make this a HASH JOIN (to match our other queries) but hash joining #t to #contact, and see if that gets us anywhere:
SELECT * FROM ( SELECT COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName FROM #t t INNER HASH JOIN #contact c ON c.personID=t.personID WHERE t.type='Student' ) subQuery WHERE subQuery.UserName <> '';
The execution plan doesn’t look very different (we just move the Compute Scalar from before the join to after the join), and the end result is the same:
Msg 537, Level 16, State 2, Line 2
Invalid length parameter passed to the LEFT or SUBSTRING function.
So that doesn’t help at all. How about if we force a non-clustered index on #contact and join that way?
create index ix_test on #contact(PersonId, Type, Email);
Running the breaking query again, we find out that this does fix the problem. So what changed? Well, let’s look at the execution plan:
In this case, when we go to the Properties of the Index Seek item, we still have our Predicate column, but it has been joined by a Seek Predicates column. The seek predicate joins personIDs together from #contact and #t. What is important here is that the seek predicate gets executed first, before the normal predicate. We know this is the case because we don’t get the error. SQL Server is able to optimize the join by throwing out any rows which don’t join, and the expected gain from doing this is greater than our previous method of scanning all of the rows and then throwing them away based on the predicate.
In this short article, we were able to investigate an “intermittent” problem and figure out the root cause. The ‘best’ fix in this case is to move the entire predicate to the #contact table. That way, we prevent this problem from happening in the future if some data set makes it more valuable to read from the #contact table first and then join to #t rather than the other way around.
There’s a series at the Hacker News on the election.
My favorite line from the first article: “The government worker has become the new aristocracy in the United States.”
Mohit Kumar rants about bug bounty programs. Done correctly, a bug bounty can be a great incentive for finding problems with software. But there is a repudiation problem inherent in the system. Here’s a stylized example to show the problem.
Suppose we have a company X, which releases a software product. Company X’s management decides to offer a bug bounty for security vulnerabilities, realizing that security is important and very difficult.
Security Researcher A finds a novel security flaw in the product. He submits corroborating information to Company X. Company X determines that yes, it is a bug. In the meantime, Security Researcher B also finds this same flaw. The problem is that B doesn’t necessarily have any idea what A has already done, and so doesn’t realize that what he’s doing is a waste of time.* Eventually, an honest X would award A with the bug bounty and alert B that somebody else received the bounty for this.
But X could have an incentive to repudiate the bounty a certain percentage of the time. The company still gets increased outside security help from people motivated by the monetary reward, but doesn’t actually pay it out. Instead, they tell Security Researcher A that somebody else already reported it (or internal testing found it first) and A has no proof otherwise. As long as you don’t do it in cases which are high-profile (e.g., competitions at large conferences) and the outside security researchers don’t have a way of getting proof of this recalcitrance, it’s easy to perpetuate.
A standard mechanism for non-repudiation is to give a hash of a message. That way, you can prove later that the message you see is the same as what the sender originally hashed, meaning that nobody interfered with the message. An analog here would be to have a public database of vulnerability hashes, with researcher IDs and dates of entry. When you submit a claim to a company, a hash of your attack goes into the public database, and you could see whether or not you are the first person to submit this particular vulnerability.
By itself, this is probably not a feasible mechanism in this particular case, due to the fact that you would need to hash concepts rather than the submission text itself. In addition to the external bounty site, you would also need an independent arbiter with access to the code and a mandate to arbitrate fairly. Making the hashes public would go a long step in that direction by keeping the arbitration agency honest.
* – There are cases in which B might find something that A did not find, so we don’t know a priori that this is a total waste of time. But it probably is.
Chad Miller shows us how to script SSIS deployments via Powershell. One of the two depressing things about integrating SSIS packages into an automated build process was just how difficult it was to automate deployment (the other was that source control is effectively meaningless when it comes to determining changes). Chad’s solution is a pretty good one, and something I hadn’t really seen elsewhere.
From here, my eight-month plan is pretty hectic but fairly well filled-out:
November through January: study for LPIC-1 101. It doesn’t directly relate to my job, but I’d like to get some Linux certifications in there to balance out my Microsoft-centric resume. I would probably take this exam at the end of January.
January and February: study for 70-461. I might be able to do all of this in less than a month, given that it’s right in my wheelhouse. I’d expect to take this before the end of February.
February, March, and early April: study for 70-463. Of the three MCSA exams, this is my relatively weakest topic.
April and May: study for LPIC-1 102. That way, I can get my LPI-1 certification.
May through July: study for my CEH. Like I mentioned before, I think this is an introductory cert, and so given where I’m at now, there is some value in me pursuing it.