Invalid length parameter passed to the LEFT or SUBSTRING function

An interesting question showed up on Ask SQL Server Central last week, and I wanted to blog my answer, mostly so I can show pretty pictures and get credit for actually having a DBA-related post once in a while…

The poster had a relatively simple problem: a view was returning the error “Invalid length parameter passed to the LEFT or SUBSTRING function” even though the query logic looked correct. Fortunately, the poster put together a nice setup, so I can steal it here and display the queries formatted as I tend to like them:

-- Test setup. Note that in #contact, person 3 has a blank e-mail and
-- person 4 has a NULL e-mail; neither contains an @ sign.
CREATE TABLE #contact (personID int, email varchar(150), type varchar(20))
CREATE TABLE #t (personID int, type varchar(20))

INSERT INTO #contact (personID, email, type)
SELECT 1, 'student1@school.com', 'student'
UNION SELECT 2, 'teacher2@school.com', 'teacher'
UNION SELECT 3, '', 'teacher'
UNION SELECT 4, NULL, 'teacher'

INSERT INTO #t(personID, type)
SELECT 1, 'student'
UNION SELECT 2, 'teacher'
UNION SELECT 3, 'teacher'
UNION SELECT 4, 'teacher'

--This works fine....

SELECT
	*
FROM
(
	SELECT
		COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName
	FROM
		#contact c
		INNER JOIN #t t ON c.personID=t.personID
	WHERE
		c.type='Student'
) subQuery
WHERE
	subQuery.UserName <> '';

--This generates an error....

SELECT
	*
FROM
(
	SELECT
		COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName
	FROM
		#contact c
		INNER JOIN #t t ON c.personID=t.personID
	WHERE
		t.type='Student'
) subQuery
WHERE
	subQuery.UserName <> '';

What’s interesting here is that the two queries are basically the same. The only difference is that the first, working query filters where c.type is equal to Student, and the second, non-working query filters where t.type is equal to Student. The reason the second query fails is that some of the #contact records don’t actually have @ signs in them. When CHARINDEX doesn’t find its search string, it returns 0, so CHARINDEX('@', c.email) - 1 evaluates to -1, and passing a negative length to SUBSTRING is what produces the error message.
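You can reproduce the failure in isolation. Note that it’s the blank e-mail (person 3), not the NULL one, that triggers the error:

-- CHARINDEX returns 0 when the search string is not found, so the
-- computed length goes negative for the blank e-mail:
SELECT CHARINDEX('@', '') - 1;   -- returns -1

-- A negative length raises Msg 537:
SELECT SUBSTRING('', 1, CHARINDEX('@', '') - 1);

-- The NULL e-mail is safe: CHARINDEX propagates the NULL, SUBSTRING
-- returns NULL, and COALESCE turns it into an empty string.
DECLARE @email varchar(150) = NULL;
SELECT COALESCE(SUBSTRING(@email, 1, CHARINDEX('@', @email) - 1), '');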

But wait, our query clearly says only to get the rows that are marked as Student, so this should just be getting valid rows (i.e., rows in which there is a valid e-mail address). Thus, everything should be unicorns and pixie dust…right? Well, let’s dig into the inner workings a bit and come up with the entire answer.

Investigating the Execution Plans

Let’s generate some estimated execution plans. There are two reasons that I’m looking at estimated plans here. The first is that the estimated and actual plans will both contain all of the information necessary for solving this problem. More importantly, we can’t get the broken query’s actual execution plan: the query fails before completion, so no actual plan gets generated for us to see. Anyhow, to generate an estimated execution plan, press Ctrl-L (a scripted alternative is sketched at the end of this section). This will give you an Execution plan tab with the different statements we’re running. The important part here is the last two sets of plans, because they relate to the two queries that we’re interested in. The estimated execution plans for these two queries look the same:

Estimated query plans

This indicates that there’s nothing really wacky. What we have is a scan of the #contact table, followed by a Compute Scalar (which gives us the UserName column), a scan of #t, a hash match join, and finally a SELECT to return the result set. It’s a fairly standard query plan. But execution plans are like snitches: they aren’t forthright with information; you need to rough them up a bit to get the good stuff. So let’s do just that.
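First, though, the promised aside: if you’d rather request estimated plans from T-SQL than with the keyboard shortcut, SET SHOWPLAN_XML does the same job. It must be the only statement in its batch:

SET SHOWPLAN_XML ON;
GO
-- Statements in this batch are compiled but not executed; SQL Server
-- returns their estimated plans as XML documents instead.
SELECT * FROM #contact;
GO
SET SHOWPLAN_XML OFF;
GO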

Interrogating the Execution Plans

What we’re looking for is the reason we’re getting bad rows even though our query clearly says that we want to ignore them. The obvious place to look is the #contact table scan.

A table scan on the #contact table

Let’s start with the working plan first, to see what’s right. Then we can compare that to the other plan and see where the difference lies. In SQL Server Management Studio, click on the #contact Table Scan icon and press F4 to go to the Properties window. This gives us a whole host of information. A lot of it is useful, but for this problem, we’re looking for the Predicate.

The table scan's Properties window, with special emphasis on the Predicate

Click on the Predicate row to reveal the [...] button. Click that button and you’ll get a window with the full predicate. It should look something like this:

#contact.[type] as [c].[type]='Student' AND CASE WHEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) IS NOT NULL THEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) ELSE '' END<>''

This looks pretty familiar. The first part of the statement is a filter on the type, showing us only students. The second part is a messier version of subQuery.UserName. We want to get the usernames which are not blank, and username is defined as

COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '')

The predicate makes sense here. Now let’s look at the estimated plan of our other query. Remember that the pictures and costs looked the same in SQL Server Management Studio, but that doesn’t mean that everything is the same. Once again, we’ll go to the Predicate for our #contact table.

A query with a 'bad' predicate

In this case, the predicate is different. Instead of checking that type equals Student and that UserName is not blank, we’re only checking that UserName is not blank:

CASE WHEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) IS NOT NULL THEN substring(#contact.[email] as [c].[email],(1),charindex('@',#contact.[email] as [c].[email])-(1)) ELSE '' END<>''

So what happened to that check for type? Well, if you look at the second query, you can see that the check for type is now on #t. You can verify this by checking out the Predicate for the #t Table Scan:

#t's predicate has a filter on Student

From this, you can clearly see the filter on #t, winnowing down the list to students alone.

Putting It All Together

Okay, so how does that help us? Well, let’s look back at the execution plans again. In both cases, we have a hash match join between #contact and #t, as well as a filter on #contact that eliminates blank usernames. The primary difference between these two query plans is where the type filter goes: on #t or on #contact. This is vital to us because the optimizer wants to filter out blank usernames during the scan, rather than after the join is complete. In normal circumstances, this is a great thing: any row you throw out during the scanning portion is a row that you don’t need to join or transform or perform some costly activity against. Unfortunately, in this case, the username filter depends upon another filter running first: the filter that gets rid of contacts with the wrong type. That works out when both filters sit on the same #contact scan, but once you move the type filter over to the other table, you no longer guarantee that it runs first, and rows with invalid e-mail addresses can reach the SUBSTRING call.

So given that, here’s a thought experiment: what if we kept the filter on #t but forced a particular join order, making sure that #t goes first? Then we could throw out all of the non-joined rows and not have to worry about our problem, right?

Well, one way to force join order is to specify a join hint. Let’s make this a HASH JOIN (to match our other queries) but hash joining #t to #contact, and see if that gets us anywhere:

SELECT
	*
FROM
(
	SELECT
		COALESCE(SUBSTRING(c.email,1, CHARINDEX('@', c.email)-1), '') UserName
	FROM
		#t t
		INNER HASH JOIN #contact c ON c.personID=t.personID
	WHERE
		t.type='Student'
) subQuery
WHERE
	subQuery.UserName <> '';

Using a join hint to force the table join order

The execution plan doesn’t look very different (we just move the Compute Scalar from before the join to after the join), and the end result is the same:

Msg 537, Level 16, State 2, Line 2
Invalid length parameter passed to the LEFT or SUBSTRING function.

So that doesn’t help at all. How about if we create a non-clustered index on #contact and join through that?

-- personID leads the key so that the join to #t can seek on it.
create index ix_test on #contact (personID, type, email);

Running the breaking query again, we find out that this does fix the problem. So what changed? Well, let’s look at the execution plan:

Adding an index helped here

In this case, when we go to the Properties of the Index Seek operator, we still have our Predicate property, but it has been joined by a Seek Predicates property. The seek predicate joins personIDs together from #contact and #t. What is important here is that the seek predicate gets executed first, before the regular predicate. We know this is the case because we don’t get the error. SQL Server is able to optimize the join by throwing out any rows which don’t join, and the expected gain from doing this is greater than our previous method of scanning all of the rows and then throwing them away based on the predicate.

Conclusions

In this short article, we were able to investigate an “intermittent” problem and figure out the root cause. The ‘best’ fix in this case is to move the entire predicate to the #contact table, filtering on c.type rather than t.type. That way, we prevent this problem from recurring if some future data set makes it cheaper to read from the #contact table first and then join to #t rather than the other way around.
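As a side note, and separate from the fix above: if you can’t control which table gets filtered first, a common hardening pattern is to make the expression itself safe against missing @ signs. Here is a sketch using NULLIF:

-- NULLIF turns CHARINDEX's not-found 0 into NULL; NULL - 1 stays NULL,
-- SUBSTRING with a NULL length returns NULL, and COALESCE maps that to ''.
-- This holds no matter which predicate the optimizer evaluates first.
SELECT
	COALESCE(SUBSTRING(c.email, 1, NULLIF(CHARINDEX('@', c.email), 0) - 1), '') UserName
FROM
	#contact c
	INNER JOIN #t t ON c.personID = t.personID
WHERE
	t.type = 'Student';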

The Repudiation Problem In Bug Bounties

Mohit Kumar rants about bug bounty programs.  Done correctly, a bug bounty can be a great incentive for finding problems with software.  But there is a repudiation problem inherent in the system.  Here’s a stylized example to show the problem. 

Suppose we have a company X, which releases a software product.  Company X’s management decides to offer a bug bounty for security vulnerabilities, realizing that security is important and very difficult.

Security Researcher A finds a novel security flaw in the product.  He submits corroborating information to Company X.  Company X determines that yes, it is a bug.  In the meantime, Security Researcher B also finds this same flaw.  The problem is that B doesn’t necessarily have any idea what A has already done, and so doesn’t realize that what he’s doing is a waste of time.*  Eventually, an honest X would award A the bug bounty and alert B that somebody else received the bounty for this.

But X could have an incentive to repudiate the bounty a certain percentage of the time.  The company still gets increased outside security help from people motivated by the monetary reward, but doesn’t actually pay it out.  Instead, they tell Security Researcher A that somebody else already reported it (or that internal testing found it first), and A has no proof otherwise.  As long as the company doesn’t do this in high-profile cases (e.g., competitions at large conferences) and outside security researchers have no way of proving the deception, it’s easy to keep the scheme going.

A standard mechanism for non-repudiation is to publish a hash of a message.  That way, you can prove later that the message you reveal is the same one that was originally hashed, since nobody can feasibly produce a different message with the same hash.  An analog here would be a public database of vulnerability hashes, with researcher IDs and dates of entry.  When you submit a claim to a company, a hash of your attack goes into the public database, and anyone can see whether or not you were the first person to submit this particular vulnerability.
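To make the idea concrete, here is a minimal T-SQL sketch of such a registry.  The table, column names, and researcher ID are all hypothetical, and the scheme inherits the weakness discussed in the next paragraph:

-- Hypothetical public registry: stores only the hash, the researcher,
-- and the submission time, never the vulnerability details themselves.
CREATE TABLE dbo.VulnerabilityClaim
(
	ClaimID int IDENTITY(1, 1) PRIMARY KEY,
	ResearcherID int NOT NULL,
	SubmissionHash binary(32) NOT NULL,   -- SHA2_256 output is 32 bytes
	SubmittedOn datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Registering a claim: hash the write-up and record who filed it when.
-- (Before SQL Server 2016, HASHBYTES input is capped at 8000 bytes.)
DECLARE @writeup nvarchar(4000) = N'...full details of the attack...';
INSERT INTO dbo.VulnerabilityClaim (ResearcherID, SubmissionHash)
VALUES (42, HASHBYTES('SHA2_256', @writeup));

-- Anyone can later check who registered a given submission first.
SELECT TOP(1) ResearcherID, SubmittedOn
FROM dbo.VulnerabilityClaim
WHERE SubmissionHash = HASHBYTES('SHA2_256', @writeup)
ORDER BY SubmittedOn;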

By itself, this is probably not a feasible mechanism in this particular case, because you would need to hash concepts rather than the submission text itself: two researchers describing the same flaw will produce different write-ups, and therefore different hashes.  In addition to the external bounty site, you would also need an independent arbiter with access to the code and a mandate to arbitrate fairly.  Making the hashes public would go a long way in that direction by keeping the arbitration agency honest.

* – There are cases in which B might find something that A did not find, so we don’t know a priori that this is a total waste of time.  But it probably is.

Certification Exam #1 Completed

I have taken my first certification exam in my current plan.  The result is that I now have my first Microsoft-related certification.

From here, my eight-month plan is pretty hectic but fairly well filled-out:

November through January:  study for LPIC-1 101.  It doesn’t directly relate to my job, but I’d like to get some Linux certifications in there to balance out my Microsoft-centric resume.  I would probably take this exam at the end of January.

January and February:  study for 70-461.  I might be able to do all of this in less than a month, given that it’s right in my wheelhouse.  I’d expect to take this before the end of February.

February, March, and early April:  study for 70-463.  Of the three MCSA exams, this covers my weakest topic.

April and May:  study for LPIC-1 102.  That way, I can get my LPI-1 certification.

May through July:  study for my CEH.  Like I mentioned before, I think this is an introductory cert, and so given where I’m at now, there is some value in me pursuing it.