TIL: I Stole A Presentation Title

I’m preparing for the Raleigh-Durham SQL Server Launch 2016 Discovery Day coming up this Saturday and working with the data set.  I’m not going to link the data set here (because I don’t want anybody getting a head start), but in perusing it, I learned that my Much Ado About Hadoop talk is not the only talk by that name.  In fact, Jeremiah Peschka had a talk with exactly the same title in 2013, and one of the two places he presented it was SQL Saturday #217 in Columbus, an event I attended.

At the time, I didn’t know anything about Hadoop and my name selection was completely unrelated, but I’m glad to see that great minds think alike.

A few more tidbits from this data set:

  1. There were 594 submissions which included “Hadoop,” “HDInsight,” or “Big Data” in the title.  I am responsible for 21 of these.  If you strip out “Big Data” and just use the other two terms, I am responsible for 21 out of 278, or 7.5%.
  2. 323 Hadoop/HDInsight/Big Data talks were selected, or 54% of submissions.  Stripping out Big Data, we have 164 out of 278, or 59%.  Maybe it pays to be more specific?
  3. 14 of my 21 submissions were selected, meaning that I am responsible for 8.5% of Hadoop/HDInsight talks at SQL Saturdays through approximately April of 2016.
  4. Looking just at Hadoop/HDInsight talks, here’s the year breakdown of selected session counts since 2012:  13, 37, 45, 54, 15 (again, through about April).  I’ve done 4 of the 15 Hadoop talks in 2016, and did 7 of the 54 in 2015.
  5. I am the only person in the data set with my combination of postal code, industry, and job function.  This means I can trace exactly where I’ve been…although it has me at 3 SQL Saturdays I never attended and is missing two that I did attend.  The three I didn’t attend were cases in which I submitted as a speaker but did not get selected, back before the website handled that scenario gracefully; the two I did attend were long ago, and I probably didn’t register for them.

This is a pretty fun data set.  I’m looking forward to seeing what people can do with it.

New Series Forthcoming: Polybase In Detail

With the general release of SQL Server 2016, it’s time for me to rebuild my SQL Server 2016 VM.  I’m going to take this opportunity to dig into Polybase in some detail, starting a potentially long-running series covering how to install and configure Polybase, how it works, and some details I had previously ignored.

Polybase is my favorite feature in SQL Server 2016, and my goal is to expose the feature to people who are interested but simply haven’t had the opportunity to explore it.

Finding Invalid String Combinations

Today, I had to write a query to parse a column in SQL Server to figure out whether the process that loads the column is using delimiters correctly.

Problem Description

One of the developers on my team came to me asking how to find a hidden record separator character in a column.  They were using this character (CHAR(30), the ASCII record separator) to separate four-digit numbers in a VARCHAR column, and somehow bad data was getting in:  the separator was splitting subsets of those four-digit numbers.  My mission was to figure out how widespread this was.

Let’s Prep Some Data

For this demo, I’m going to introduce some slightly simplified data, but pretty close to reality.

CREATE TABLE #Test
(
	CustomerID INT,
	CampaignID INT,
	Value VARCHAR(30)
);

INSERT INTO #Test
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1, 1, CONCAT('1234', CHAR(30), '5678')),							--Good
(1, 2, '5678'),														--Good
(1, 3, CONCAT('1', CHAR(30), '2', CHAR(30), '6', CHAR(30), '8')),	--Bad
(1, 4, '12345678'),													--Bad
(2, 1, CONCAT('123', CHAR(30), '4', CHAR(30), '5678')),				--Bad
(2, 2, ''),															--Ignore
(2, 6, CONCAT('2222', CHAR(30), '4442'));							--Good

I’ve marked records as “good” if they follow the practice of four-digit blocks separated by record separators.

Rant Time

Before I get to the code, let me get to the rant.  Normalization is a vital part of database design.  The scenario I am about to describe fails even first normal form, and this has caused a small amount of suffering on the part of developers and on my part as well.  Ignore first normal form at your peril; eventually, you will suffer as a result.

For a primer on normal forms, I have a presentation (which I’ll probably update by the next time I give it).  In this particular scenario, it’s obvious that we’re storing sets of zero or more numeric values, so instead of a single table with a character-based value field, we really ought to have a table which stores integers.

CREATE TABLE #campaign
(
	CustomerID INT,
	CampaignID INT
);

INSERT INTO #campaign
(
	CustomerID,
	CampaignID
)
VALUES
(1,1),
(1,2),
(1,3),
(1,4),
(2,1),
(2,2),
(2,6);

CREATE TABLE #CampaignValue
(
	CustomerID INT,
	CampaignID INT,
	Value INT
);

INSERT INTO #CampaignValue
(
	CustomerID,
	CampaignID,
	Value
)
VALUES
(1,1,1234),
(1,1,5678),
(1,2,5678),
(1,3,1268),
(1,4,1234),
(1,4,5678),
(2,1,1234),
(2,1,5678),
(2,6,2222),
(2,6,4442);

There are several advantages to this.  First, we guarantee that there will never be an invalid row separator because SQL Server handles row separation for us automatically.  Second, we can put a check constraint in place if the number must be four digits (e.g., Value between 1000 and 9999, because 0 is not a valid lead character).  Third, it makes the issue-detection code below unnecessary.
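
To illustrate that second advantage, the constraint might look like the following (a minimal sketch against the temp table above; the bounds come straight from the four-digit rule):

--Valid values are four digits and cannot lead with a 0, so they fall between 1000 and 9999.
ALTER TABLE #CampaignValue ADD CHECK (Value BETWEEN 1000 AND 9999);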

Issue Detection

The following script finds any issues.  I start by building a tally table.  I am a huge fan of tally tables and recommend you have one on every instance.  In case you don’t have one, the script below creates one in a temp table.

From here, I’ll show the code and then walk through each section.

SET NOCOUNT ON;
CREATE TABLE #Tally
(
	N INT IDENTITY(0,1) NOT NULL PRIMARY KEY
);
GO
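--GO 10000 is an SSMS/sqlcmd batch trick:  it runs the following single-row insert 10,000 times, giving N values of 0 through 9999.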
INSERT INTO #Tally DEFAULT VALUES
GO 10000
GO
WITH records AS
(
	SELECT
		t.CustomerID,
		t.CampaignID,
		t.Value,
		val.ReplacedValue,
		val.RemovedValue
	FROM #Test t
	CROSS APPLY
	(
		SELECT
			REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
			REPLACE(t.Value, CHAR(30), '') AS RemovedValue
	) val
	WHERE
		t.Value <> ''
),
tallyresults AS
(
	SELECT
		r.CustomerID,
		r.CampaignID,
		r.Value,
		r.ReplacedValue,
		r.RemovedValue,
		t.N,
		SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
	FROM records r
	CROSS JOIN #Tally t
	WHERE
		t.N < LEN(r.RemovedValue)
		AND t.N % 4 = 0
),
tallyjoin AS
(
	SELECT DISTINCT
		tr.CustomerID,
		tr.CampaignID,
		tr.Value,
		tr.ReplacedValue,
		tr.RemovedValue,
		REPLACE
		(
			CAST
			(
				STUFF
				((
					SELECT
						',' + tri.SplitVersion
					FROM tallyresults tri
					WHERE
						tri.CustomerID = tr.CustomerID
						AND tri.CampaignID = tr.CampaignID
					ORDER BY
						tri.N
					FOR XML PATH ('')
				), 1, 1, '') AS VARCHAR(8000)
			), N',', CHAR(30)
		) AS NewValue
	FROM tallyresults tr
)
SELECT
	tj.CustomerID,
	tj.CampaignID,
	tj.Value,
	tj.ReplacedValue,
	tj.RemovedValue,
	tj.NewValue,
	val.NewReplacedValue
FROM tallyjoin tj
	CROSS APPLY
	(
		SELECT
			REPLACE(tj.NewValue, CHAR(30), 'x') AS NewReplacedValue
	) val
WHERE
	tj.Value <> tj.NewValue;

In this solution, I have three common table expressions.  Let’s start with the first one, records.  Records does two things:  first, it gives us a visual indicator of where my record separators are by replacing them with x’s; second, it strips out the separators entirely (RemovedValue), which I will need later to rebuild the string.

To do this, I use one of my favorite T-SQL features:  CROSS APPLY.

SELECT
	t.CustomerID,
	t.CampaignID,
	t.Value,
	val.ReplacedValue,
	val.RemovedValue
FROM #Test t
CROSS APPLY
(
	SELECT
		REPLACE(t.Value, CHAR(30), 'x') AS ReplacedValue,
		REPLACE(t.Value, CHAR(30), '') AS RemovedValue
) val
WHERE
	t.Value <> ''

What I’m doing is taking each record in #Test and performing two REPLACE operations. I’m doing this in a derived table I’m calling val. The derived table adds no extra cost to the query, but simplifies my SELECT clause by moving calculations elsewhere. In this particular scenario, I’m not really gaining anything, but in my real-life version, I performed some other work against t.Value before running the REPLACE functions, and so I would otherwise need to repeat that work in several places. If you’re interested in learning more about how to simplify code using APPLY, check out an example from my APPLY Yourself talk.
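
As a contrived sketch of that repeated-work point, imagine some hypothetical prep work (an LTRIM/RTRIM cleanup here, standing in for the real thing) happening before the REPLACE calls; chaining APPLY operators lets me write it exactly once:

SELECT
	t.CustomerID,
	t.CampaignID,
	val.ReplacedValue,
	val.RemovedValue
FROM #Test t
	CROSS APPLY
	(
		--Hypothetical prep work, written one time instead of once per REPLACE.
		SELECT LTRIM(RTRIM(t.Value)) AS CleanValue
	) prep
	CROSS APPLY
	(
		SELECT
			REPLACE(prep.CleanValue, CHAR(30), 'x') AS ReplacedValue,
			REPLACE(prep.CleanValue, CHAR(30), '') AS RemovedValue
	) val;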

After using APPLY to simplify the calculations, I feed the results from records into a second common table expression, tallyresults.

SELECT
	r.CustomerID,
	r.CampaignID,
	r.Value,
	r.ReplacedValue,
	r.RemovedValue,
	t.N,
	SUBSTRING(r.RemovedValue, t.N + 1, 4) AS SplitVersion
FROM records r
CROSS JOIN #Tally t
WHERE
	t.N < LEN(r.RemovedValue)
	AND t.N % 4 = 0

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without record separators).  I then retain only the values which start a four-digit block and use SUBSTRING to snatch up four digits at a time. What I’m left with is a column named SplitVersion, which has one row for each combination of customer, campaign, and four-digit value (equivalent to my normalized table’s structure).
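
To make that concrete, take customer 1, campaign 4 from the sample data:  RemovedValue is '12345678' (LEN = 8), so the filter keeps N = 0 and N = 4, and SUBSTRING carves out the two four-digit blocks:

SELECT
	SUBSTRING('12345678', 0 + 1, 4) AS FirstBlock,	--1234
	SUBSTRING('12345678', 4 + 1, 4) AS SecondBlock;	--5678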

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.
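
If you want to see that nasty result for yourself, here’s a quick repro; my recollection is that the serializer emits the separator as an XML character reference rather than the raw character:

--The record separator comes back entitized, e.g. 1234&#x1E;5678, rather than as CHAR(30).
SELECT '1234' + CHAR(30) + '5678' AS [text()] FOR XML PATH('');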

SELECT DISTINCT
	tr.CustomerID,
	tr.CampaignID,
	tr.Value,
	tr.ReplacedValue,
	tr.RemovedValue,
	REPLACE
	(
		CAST
		(
			STUFF
			((
				SELECT
					',' + tri.SplitVersion
				FROM tallyresults tri
				WHERE
					tri.CustomerID = tr.CustomerID
					AND tri.CampaignID = tr.CampaignID
				ORDER BY
					tri.N
				FOR XML PATH ('')
			), 1, 1, '') AS VARCHAR(8000)
		), N',', CHAR(30)
	) AS NewValue
FROM tallyresults tr

This query is pretty straightforward except for the XML trick. I need to get a distinct set of results because I don’t need the same customer ID, campaign ID, and value showing up multiple times; I only need it once, and I know there will be duplicates because I’m combining multiple rows together in a subquery. Note that I removed SplitVersion and N from the result set; if I don’t do that, I’ll continue to show every row, which means I’d have duplicate results at the end.

Finally, I wrap it up by selecting values from tallyjoin and doing one last replace of record separators with x’s so we can see the separators in our results:

[Image:  record separator result set]

Conclusion

This is an excellent case study in the importance of proper database normalization.  Because of one little bug in the loading code, I needed to write a fairly complicated query to recover.  With a properly normalized database, we would not have had this bug.

Presentation Goals, Half-Year Update

We’ve reached the midpoint of 2016, and I figure this is as good a time as any to see how my presentation goals are looking.

Here are my four goals:

  1. Present four new talks.
  2. Present to two brand new audiences.
  3. Speak at 12 SQL Saturdays and eight user groups.
  4. Get paid to speak at a conference.

I’m doing pretty well for the most part.

Present Four New Talks

So far this year, I have given seven new talks:

  1. Introducing Azure ML
  2. Peanut Butter & Chocolate:  Integrating Hadoop and SQL Server
  3. Machine Learning with R
  4. Data Migration with Biml
  5. Big Data, Small Data, and Everything In Between
  6. Introducing Azure SQL Database
  7. Jupyter on the Microsoft Stack

Four of these talks are ones for which I have finalized slides and could easily present again.  The other three are more demo-driven, although I’m going to turn Jupyter on the Microsoft Stack into a talk entitled R for the SQL Server Developer, which I’m giving in July.  I also have a new talk on SQL Server security features that I haven’t yet unleashed but will do so relatively soon.

In other words, I’ve already blown this goal away.

Present to Two Brand New Audiences

I really have not done this yet.  If I were getting graded, I’d try to sell the .NET User Group’s Polyglot SIG as a “brand new” audience, but it’s not.  I have to get cracking on this; I think I will have my Hadoop audience, but need to find a second audience.

Speak At 12 SQL Saturdays and 8 User Groups

So far this year, I’ve spoken at 9 SQL Saturdays and am on the hook for 3 more:

  1. Cleveland (2016-02-06)
  2. Tampa (2016-02-27)
  3. Chicago (2016-03-05)
  4. Richmond (2016-03-19)
  5. Madison (2016-04-09)
  6. Baltimore (2016-04-30)
  7. Jacksonville (2016-05-07)
  8. Rochester (2016-05-14)
  9. Atlanta (2016-05-21)
  10. Maine (2016-06-04)
  11. Chattanooga (2016-06-25)
  12. Columbus (2016-07-16)

I have a few more irons in the fire, so that certainly won’t be the end of the list.

As far as user group talks go:

  1. TriNUG main meeting (2016-01-13)
  2. TriPASS (2016-02-16)
  3. Polyglot .NET (2016-02-23)
  4. TriNUG Data SIG (2016-03-16)
  5. Roanoke Valley .NET User Group (2016-05-05)
  6. TriPASS (2016-05-17)
  7. Polyglot .NET (2016-05-24)
  8. Triad PASS BI (2016-06-28)
  9. Triad PASS (2016-09-27)

So it looks like I’ve hit that goal already.  Again, I expect to keep up a good user group pace through the rest of the year.

Get Paid to Speak at a Conference

I am speaking at DevTeach in Montreal this July 5th through 8th, where I’m going to give two sessions.

Conclusion

I guess I should set tougher goals next year…  Right now, I’m very happy with my pace of speaking.  In total, I’ve given 20 talks and am on the books for another 8.  I’d like to reach 40 sessions by the end of the year and continue that pace into next year.  One area where I’m way behind is online sessions:  I’ve only done one virtual chapter meeting, so there’s a wide world available there.

TIL: Installing Spark On Windows

I’m going through Justin Pihony’s Pluralsight course on Spark, and he covers installation.  I’m glad that he did, because there are a couple of nuances involved.

First, you can download Spark without having Hadoop installed, but if you do, you’ll want to follow Sangeet Chourey’s instructions.  If you’re running 64-bit Windows, you’ll want his version of winutils (which is 106 KB); if you’re running 32-bit Windows, you’ll need a different executable, which is something like 43 or 46 KB.

If you do not follow Sangeet’s instructions, you’ll end up with the error described in SPARK-10528, but because it’s technically not a Spark error, the Spark team can’t do much to help out here.

I also recommend going into your Spark\conf\ directory and opening log4j.properties.template.  Change INFO and WARN to ERROR and save the file as log4j.properties.  That will turn logging down to errors only, which gets rid of a lot of the mess of verbiage Spark inundates you with when you hit the console.
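
Concretely, the change looks something like this (the INFO line below is what I remember the template shipping with; your copy may differ slightly):

# Before, in log4j.properties.template:
log4j.rootCategory=INFO, console

# After, saved as log4j.properties:
log4j.rootCategory=ERROR, console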

TIL: Hive And C#

I work in a .NET shop, meaning that our developers focus primarily on writing C# code.  I’m introducing Hadoop more and more into this environment, but I know it’s an easier road if I can reduce the number of unfamiliar languages.

To that end, I was happy to learn that Randy Gelhausen has a project which integrates C# code (either Microsoft C# or Mono) with Hive.

My function-writing skills in Hive are weak, but the knowledge that this exists is pretty cool.

Why I’m giving up on Morrowind (again)

I learned a new internet acronym the other day: FOMO, or Fear of Missing Out. It’s the phenomenon that occurs when everybody talks about a new form of entertainment, be it movie, TV show, music, comic book, video game, etc., and you don’t want to be left out of the conversation. As a result, you consume media you would not normally consume just so that you can keep up with the virtual Joneses.

I’ve documented my love/hate relationship with Morrowind in the past. I played it a long time ago, found it arcane and confusing, and never played it again. I got further this time before tapping out. The main reason I’ve uninstalled this time is to free up hard drive space, but a few things still bother me about the game.

One of the biggest problems is one that also existed in the original Witcher (which I loved): trying too hard to abstract D&D combat rules. There is little skill in combat: you approach a bad guy and click buttons until they are dead. There is little strategy, either. Because magic can (and does) fail, it’s an unreliable weapon at best. Archery seems suboptimal because aiming is strange; I can’t always predict what will hit and what won’t, which some might call a feature, but it means that every time you loose an arrow, you could be incurring monster wrath for no benefit.

Stamina is my other major complaint. It’s too limiting. Every action in the game uses stamina to a greater or lesser degree, so if you don’t carry lots of potions, you’re more or less screwed. You don’t heal over time, which also makes potions vitally necessary, especially since a failed healing spell still costs stamina, leaving you even more vulnerable, ad nauseam. Later Elder Scrolls games did away with this (entirely appropriately, IMO). Mods can correct the stamina problem, and if/when I do reinstall the game, I will certainly consider more mods.

The reason I keep coming back to the game is mostly FOMO at this stage. Some people I know love Morrowind and speak very highly of it. I can see glimpses, here and there, of why they enjoy it. The world is small but dense and surprisingly detailed. Perhaps if Skywind is ever finished I’ll come back to Morrowind, but for now, I’ve got other games to play.