Finding Invalid String Combinations

Today, I had to write a query to parse a column in SQL Server to figure out if the process which loads this column is using delimiters correctly.

Problem Description

One of the developers on my team came to me asking about how to find a hidden recordset character in a column.  They were using this recordset character (CHAR(30)) to separate four-digit numbers in a VARCHAR column, and somehow bad data was getting in, where the recordset character was separating subsets of those four-digit numbers.  My mission was to figure out how widespread this was.

Let’s Prep Some Data

For this demo, I’m going to introduce some slightly simplified data, but pretty close to reality.

CREATE TABLE #Test
(
	CustomerID INT,
	CampaignID INT,
	Value VARCHAR(30)
);

INSERT INTO #Test
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1, 1, CONCAT('1234', CHAR(30), '5678')),							--Good
(1, 2, '5678'),														--Good
(1, 3, CONCAT('1', CHAR(30), '2', CHAR(30), '6', CHAR(30), '8')),	--Bad
(1, 4, '12345678'),													--Bad
(2, 1, CONCAT('123', CHAR(30), '4', CHAR(30), '5678')),				--Bad
(2, 2, ''),															--Ignore
(2, 6, CONCAT('2222', CHAR(30), '4442'));							--Good

I’ve marked records as “good” if they follow the practice of 4-digit blocks with recordset separators.

Rant Time

Before I get to the code, let me get to the rant.  Normalization is a vital part of database design.  The scenario I am about to describe fails even first normal form, and this has caused a small amount of suffering on the part of developers and on my part as well.  Ignore first normal form at your peril; eventually, you will suffer as a result.

For a primer on normal forms, I have a presentation (which I’ll probably update by the next time I give it).  In this particular scenario, it’s obvious that we’re storing sets of 0 or more numeric values, so instead of a single table with a character-based value field, we really ought to have a table which stores integers.

CREATE TABLE #campaign
(
	CustomerID INT,
	CampaignID INT
);

INSERT INTO #campaign
(
	CustomerID,
	CampaignID
)
VALUES
(1,1),
(1,2),
(1,3),
(1,4),
(2,1),
(2,2),
(2,6);

CREATE TABLE #CampaignValue
(
	CustomerID INT,
	CampaignID INT,
	Value INT
);

INSERT INTO #CampaignValue
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1,1,1234),
(1,1,5678),
(1,2,5678),
(1,3,1268),
(1,4,1234),
(1,4,5678),
(2,1,1234),
(2,1,5678),
(2,6,2222),
(2,6,4442);

There are several advantages to this.  First, we guarantee that there will never be an invalid row separator because SQL Server handles row separation for us automatically.  Second, we can put a check constraint in place if the number must be four digits (e.g., Value between 1000 and 9999 because 0 is not a valid lead character).  Third, it avoids the following code to detect issues.

Issue Detection

The following script finds any issues.  I start by building a tally table.  I am a huge fan of tally tables and recommend you have one on every instance.  If you don’t have one, I create one in a temp table.

From here, I’ll show the code and then walk through each section.

SET NOCOUNT ON;
CREATE TABLE #Tally
(
	N INT IDENTITY(0,1) NOT NULL PRIMARY KEY
);
GO
INSERT INTO #Tally DEFAULT VALUES
GO 10000
GO
WITH records AS
(
	SELECT
		t.CustomerID,
		t.CampaignID,
		t.Value,
		val.ReplacedValue,
		val.RemovedValue
	FROM #Test t
	CROSS APPLY
	(
		SELECT
			REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
			REPLACE(t.Value, CHAR(30), '') AS RemovedValue
	) val
	WHERE
		t.Value <> ''
),
tallyresults AS
(
	SELECT
		r.CustomerID,
		r.CampaignID,
		r.Value,
		r.ReplacedValue,
		r.RemovedValue,
		t.N,
		SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
	FROM records r
	CROSS JOIN #Tally t
	WHERE
		t.N < LEN(r.RemovedValue)
		AND t.N % 4 = 0
),
tallyjoin AS
(
	SELECT DISTINCT
		tr.CustomerID,
		tr.CampaignID,
		tr.Value,
		tr.ReplacedValue,
		tr.RemovedValue,
		REPLACE
		(
			CAST
			(
				STUFF
				((
					SELECT
						',' + tri.SplitVersion
					FROM tallyresults tri
					WHERE
						tri.CustomerID = tr.CustomerID
						AND tri.CampaignID = tr.CampaignID
					ORDER BY
						tri.N
					FOR XML PATH ('')
				), 1, 1, '') AS VARCHAR(8000)
			), N',', CHAR(30)
		) AS NewValue
	FROM tallyresults tr
)
SELECT
	tj.CustomerID,
	tj.CampaignID,
	tj.Value,
	tj.ReplacedValue,
	tj.RemovedValue,
	tj.NewValue,
	val.NewReplacedValue
FROM tallyjoin tj
	CROSS APPLY
	(
		SELECT
			REPLACE(tj.NewValue, CHAR(30), 'x') AS NewReplacedValue
	) val
WHERE
	tj.Value <> tj.NewValue;

In this solution, I have three common table expressions.  Let’s start with the first one, records.  Records does two things:  first, it gives us a visual indicator of what where my recordset characters are by replacing them with x’s.

To do this, I use one of my favorite T-SQL features:  CROSS APPLY.

SELECT
	t.CustomerID,
	t.CampaignID,
	t.Value,
	val.ReplacedValue,
	val.RemovedValue
FROM #Test t
CROSS APPLY
(
	SELECT
		REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
		REPLACE(t.Value, CHAR(30), '') AS RemovedValue
) val
WHERE
	t.Value <> ''

What I’m doing is taking each record in #Test and performing two REPLACE operations. I’m doing this in a derived function I’m calling val. The derived function operates at no extra cost to the query, but simplifies my SELECT clause by moving calculations elsewhere. In this particular scenario, I’m not really gaining anything, but in my real-life version, I performed some other work against t.Value before running the REPLACE functions, and so I would otherwise need to repeat that work in several places. If you’re interested in learning more about how to simplify code using APPLY, check out an example from my APPLY Yourself talk.

After using APPLY to simplify the calculations, I feed the results from records into a second common table expression, tallyresults.

SELECT
	r.CustomerID,
	r.CampaignID,
	r.Value,
	r.ReplacedValue,
	r.RemovedValue,
	t.N,
	SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
FROM records r
CROSS JOIN #Tally t
WHERE
	t.N < LEN(r.RemovedValue)
	AND t.N % 4 = 0

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

SELECT DISTINCT
	tr.CustomerID,
	tr.CampaignID,
	tr.Value,
	tr.ReplacedValue,
	tr.RemovedValue,
	REPLACE
	(
		CAST
		(
			STUFF
			((
				SELECT
					',' + tri.SplitVersion
				FROM tallyresults tri
				WHERE
					tri.CustomerID = tr.CustomerID
					AND tri.CampaignID = tr.CampaignID
				ORDER BY
					tri.N
				FOR XML PATH ('')
			), 1, 1, '') AS VARCHAR(8000)
		), N',', CHAR(30)
	) AS NewValue
FROM tallyresults tr

This query is pretty straightforward except for the XML trick. I need to get a distinct set of results because I don’t need the same customer ID, campaign ID, and value showing up multiple times; I only need it once, and I know there will be duplicates because I’m combining multiple rows together in a subquery. Note that I removed SplitVersion and N from the result set; if I don’t do that, I’ll continue to show every row, which means I’d have duplicate results at the end.

Finally, I wrap it up by selecting values from tallyjoin and doing one last replace of recordset values with x’s so we can see the separators in our results:

Recordset Result Set

Conclusion

This is an excellent case study for the importance of proper database normalization.  Due to one little bug in the code, I needed to write a fairly complicated query to recover from it.  With a properly normalized database, we would not have had this bug.

Advertisements

2 thoughts on “Finding Invalid String Combinations

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s