Today, I had to write a query to parse a column in SQL Server to figure out if the process which loads this column is using delimiters correctly.
One of the developers on my team came to me asking about how to find a hidden recordset character in a column. They were using this recordset character (CHAR(30)) to separate four-digit numbers in a VARCHAR column, and somehow bad data was getting in, where the recordset character was separating subsets of those four-digit numbers. My mission was to figure out how widespread this was.
Let’s Prep Some Data
For this demo, I’m going to introduce some slightly simplified data, but pretty close to reality.
CREATE TABLE #Test ( CustomerID INT, CampaignID INT, Value VARCHAR(30) ); INSERT INTO #Test ( CustomerID, CampaignID, Value ) VALUES (1, 1, CONCAT('1234', CHAR(30), '5678')), --Good (1, 2, '5678'), --Good (1, 3, CONCAT('1', CHAR(30), '2', CHAR(30), '6', CHAR(30), '8')), --Bad (1, 4, '12345678'), --Bad (2, 1, CONCAT('123', CHAR(30), '4', CHAR(30), '5678')), --Bad (2, 2, ''), --Ignore (2, 6, CONCAT('2222', CHAR(30), '4442')); --Good
I’ve marked records as “good” if they follow the practice of 4-digit blocks with recordset separators.
Before I get to the code, let me get to the rant. Normalization is a vital part of database design. The scenario I am about to describe fails even first normal form, and this has caused a small amount of suffering on the part of developers and on my part as well. Ignore first normal form at your peril; eventually, you will suffer as a result.
For a primer on normal forms, I have a presentation (which I’ll probably update by the next time I give it). In this particular scenario, it’s obvious that we’re storing sets of 0 or more numeric values, so instead of a single table with a character-based value field, we really ought to have a table which stores integers.
CREATE TABLE #campaign ( CustomerID INT, CampaignID INT ); INSERT INTO #campaign ( CustomerID, CampaignID ) VALUES (1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,6); CREATE TABLE #CampaignValue ( CustomerID INT, CampaignID INT, Value INT ); INSERT INTO #CampaignValue ( CustomerID, CampaignID, Value ) VALUES (1,1,1234), (1,1,5678), (1,2,5678), (1,3,1268), (1,4,1234), (1,4,5678), (2,1,1234), (2,1,5678), (2,6,2222), (2,6,4442);
There are several advantages to this. First, we guarantee that there will never be an invalid row separator because SQL Server handles row separation for us automatically. Second, we can put a check constraint in place if the number must be four digits (e.g., Value between 1000 and 9999 because 0 is not a valid lead character). Third, it avoids the following code to detect issues.
The following script finds any issues. I start by building a tally table. I am a huge fan of tally tables and recommend you have one on every instance. If you don’t have one, I create one in a temp table.
From here, I’ll show the code and then walk through each section.
SET NOCOUNT ON; CREATE TABLE #Tally ( N INT IDENTITY(0,1) NOT NULL PRIMARY KEY ); GO INSERT INTO #Tally DEFAULT VALUES GO 10000 GO WITH records AS ( SELECT t.CustomerID, t.CampaignID, t.Value, val.ReplacedValue, val.RemovedValue FROM #Test t CROSS APPLY ( SELECT REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue, REPLACE(t.Value, CHAR(30), '') AS RemovedValue ) val WHERE t.Value <> '' ), tallyresults AS ( SELECT r.CustomerID, r.CampaignID, r.Value, r.ReplacedValue, r.RemovedValue, t.N, SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion FROM records r CROSS JOIN #Tally t WHERE t.N < LEN(r.RemovedValue) AND t.N % 4 = 0 ), tallyjoin AS ( SELECT DISTINCT tr.CustomerID, tr.CampaignID, tr.Value, tr.ReplacedValue, tr.RemovedValue, REPLACE ( CAST ( STUFF (( SELECT ',' + tri.SplitVersion FROM tallyresults tri WHERE tri.CustomerID = tr.CustomerID AND tri.CampaignID = tr.CampaignID ORDER BY tri.N FOR XML PATH ('') ), 1, 1, '') AS VARCHAR(8000) ), N',', CHAR(30) ) AS NewValue FROM tallyresults tr ) SELECT tj.CustomerID, tj.CampaignID, tj.Value, tj.ReplacedValue, tj.RemovedValue, tj.NewValue, val.NewReplacedValue FROM tallyjoin tj CROSS APPLY ( SELECT REPLACE(tj.NewValue, CHAR(30), 'x') AS NewReplacedValue ) val WHERE tj.Value <> tj.NewValue;
In this solution, I have three common table expressions. Let’s start with the first one, records. Records does two things: first, it gives us a visual indicator of what where my recordset characters are by replacing them with x’s.
To do this, I use one of my favorite T-SQL features: CROSS APPLY.
SELECT t.CustomerID, t.CampaignID, t.Value, val.ReplacedValue, val.RemovedValue FROM #Test t CROSS APPLY ( SELECT REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue, REPLACE(t.Value, CHAR(30), '') AS RemovedValue ) val WHERE t.Value <> ''
What I’m doing is taking each record in #Test and performing two REPLACE operations. I’m doing this in a derived function I’m calling val. The derived function operates at no extra cost to the query, but simplifies my SELECT clause by moving calculations elsewhere. In this particular scenario, I’m not really gaining anything, but in my real-life version, I performed some other work against t.Value before running the REPLACE functions, and so I would otherwise need to repeat that work in several places. If you’re interested in learning more about how to simplify code using APPLY, check out an example from my APPLY Yourself talk.
After using APPLY to simplify the calculations, I feed the results from records into a second common table expression, tallyresults.
SELECT r.CustomerID, r.CampaignID, r.Value, r.ReplacedValue, r.RemovedValue, t.N, SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion FROM records r CROSS JOIN #Tally t WHERE t.N < LEN(r.RemovedValue) AND t.N % 4 = 0
This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators). I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).
If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.
SELECT DISTINCT tr.CustomerID, tr.CampaignID, tr.Value, tr.ReplacedValue, tr.RemovedValue, REPLACE ( CAST ( STUFF (( SELECT ',' + tri.SplitVersion FROM tallyresults tri WHERE tri.CustomerID = tr.CustomerID AND tri.CampaignID = tr.CampaignID ORDER BY tri.N FOR XML PATH ('') ), 1, 1, '') AS VARCHAR(8000) ), N',', CHAR(30) ) AS NewValue FROM tallyresults tr
This query is pretty straightforward except for the XML trick. I need to get a distinct set of results because I don’t need the same customer ID, campaign ID, and value showing up multiple times; I only need it once, and I know there will be duplicates because I’m combining multiple rows together in a subquery. Note that I removed SplitVersion and N from the result set; if I don’t do that, I’ll continue to show every row, which means I’d have duplicate results at the end.
Finally, I wrap it up by selecting values from tallyjoin and doing one last replace of recordset values with x’s so we can see the separators in our results:
This is an excellent case study for the importance of proper database normalization. Due to one little bug in the code, I needed to write a fairly complicated query to recover from it. With a properly normalized database, we would not have had this bug.