Implementing A Leaky Bucket Algorithm In T-SQL

A long time ago, I talked about alerts and monitors.  A bit later, I added thoughts on auditing as well.  Today, I’m going to cover a technique that I’ve used to alert me to real problems without too much noise.  That technique is called the leaky bucket algorithm.

Step One:  Get A Bucket

Let’s say that we have a process we want to monitor, like an API.  Further suppose we have a table in SQL Server which collects every 500 class response that our API sends out.  As a quick refresher, 200 class responses (e.g., HTTP 200) indicate success.  300 class responses (e.g., 301, 302) indicate redirection.  400 class responses (e.g., 403, 404) typically indicate that the user has failed us somehow.  500 class responses (e.g., 500) typically indicate that the server has failed us somehow.  We might find it interesting to track 404 errors because that might help us find out that, for example, we don’t have a favicon.ico file, but there’s a lot of noise in those types of errors (especially if your API is publicly accessible), so we’ll ignore those and focus on the 500 errors.

So let’s create that table:

CREATE TABLE [dbo].[ApiCall]
(
	ApiCallID BIGINT IDENTITY(1,1) NOT NULL,
	EndpointName NVARCHAR(128) NULL,
	RequestVerb NVARCHAR(10) NOT NULL,
	ResponseCode SMALLINT NOT NULL,
	ResponseTimeMS INT NOT NULL,
	LogDateGMT DATETIME2(3) NOT NULL CONSTRAINT [DF_ApiCall_LogDateGMT] DEFAULT (SYSUTCDATETIME()),
	CONSTRAINT [PK_ApiCall] PRIMARY KEY CLUSTERED(ApiCallID)
);
GO

This is a very simple table, and a real alerting table will have additional information, but let’s not complicate things unnecessarily here.

Step Two:  Put A Hole In It

Now that we have a table, we want to do something with it.  The most naive solution would be to fire off an alert every time a row gets added to this table.  The problem with this solution is that it can flood our inbox with useless information.  Suppose the developers push out an API change that breaks everything.  The first 500 response will be important information.  The second one might be important because it’s confirmatory.  The six hundredth alert will not help.  And heaven help you if you’ve got this alert tied to your PagerDuty account…

So let’s only alert if we get to a certain threshold—we’ll call it 5 messages.  Fewer than 5 and we can assume it’s a blip or just somebody doing something weird.  The next part of the problem is, 5 messages per what?  We don’t want to alert for every fifth error message.  Let’s say we get one error message a year, so there was one in 2014, one in 2015, one in 2016, one in 2017, and now one in 2018.  If we simply set our threshold to report on every fifth error, we’ll get an alert in 2018, but most likely, the prior years’ errors are long gone by now, so that’s not helpful.

Instead, we want to set a timeframe for those alerts.  In my case, I’ll say that I want to look for 5 errors anytime during the past 30 minutes.  The particular number and timeframe are variable and I pulled those numbers out of a hat, but they’ll be useful for us.  But what makes it nice is that errors “drain” from the bucket automatically.  Errors lose their importance over time in our alerting system, and 30 minutes of calm completely drains the bucket.
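
As a point of comparison, the simple version of that check is just a count over a sliding window.  Here’s a minimal sketch against the table above; it is not the final approach, just the baseline we’re about to improve on:

--Count 500 class responses over the past 30 minutes and alert if we hit the threshold.
DECLARE @ErrorThreshold INT = 5;

SELECT COUNT(*) AS ErrorsInWindow
FROM dbo.ApiCall ac
WHERE
	ac.ResponseCode >= 500
	AND ac.LogDateGMT >= DATEADD(MINUTE, -30, SYSUTCDATETIME());
--Alert if ErrorsInWindow >= @ErrorThreshold.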

Step Three:  Thinking About Overflow

But now we have a new issue:  alert frequency.  Let’s suppose that I alert anytime I see 5 errors in the past 30 minutes.  Let’s further suppose that my process checks every 5 minutes and sends out an e-mail.  The problem is that I might get upwards of 6 alert e-mails for the same five errors—that’s definitely not what I want!

As an example of this, suppose that we had 5 errors happen in a burst at 5:30 PM, right before my alert check ran.  I will subsequently get alerts at:

  1. 5:30 PM
  2. 5:35 PM
  3. 5:40 PM
  4. 5:45 PM
  5. 5:50 PM
  6. 5:55 PM

They will finally wash out at my 6 PM alert cycle, but I really didn’t need 5 of those alert e-mails.  One viable solution is to mark those particular errors as alerted and therefore no longer interesting to us.  That solution can work for us with this table design (though I’d want to add a new bitflag indicating whether we’ve sent an alert for this error), but in real life, I’m storing a form of this data in a table with a clustered columnstore index and I don’t want to update rows in that table.  So I’m going to use a different approach:  only alerting when we overflow.  A sketch of the mark-as-alerted variant follows, and after that, we’ll get to the metaphor.
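
Here’s roughly what that flag-based variant would entail.  This is just a sketch, assuming a hypothetical AlertSent column rather than anything in my real system:

--A sketch of the mark-as-alerted alternative, using a hypothetical AlertSent flag column.
ALTER TABLE dbo.ApiCall ADD AlertSent BIT NOT NULL CONSTRAINT [DF_ApiCall_AlertSent] DEFAULT(0);
GO
--When the alert process runs, alert if there are at least 5 unalerted errors in the window,
--and then mark them so we never alert on them again.
IF (
	SELECT COUNT(*)
	FROM dbo.ApiCall ac
	WHERE
		ac.ResponseCode >= 500
		AND ac.AlertSent = 0
		AND ac.LogDateGMT >= DATEADD(MINUTE, -30, SYSUTCDATETIME())
) >= 5
BEGIN
	--Send the alert here (e-mail, PagerDuty, etc.), then flip the flag.
	UPDATE ac
	SET AlertSent = 1
	FROM dbo.ApiCall ac
	WHERE
		ac.ResponseCode >= 500
		AND ac.AlertSent = 0
		AND ac.LogDateGMT >= DATEADD(MINUTE, -30, SYSUTCDATETIME());
END
GO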

Think about a bucket that can hold five gallons of water.  If I try to pour in six gallons, I can get the first five, but the remaining gallon overflows and causes a mess everywhere.  This also gives us a cap on how much water we care about:  if the bucket is empty and I immediately dump 10 gallons into it, 5 of those gallons will overflow immediately, leaving me with a full five-gallon bucket.  We tell people about the spill (those five gallons not in the bucket) but don’t care any more about that spilled water.  Now suppose that I take that five-gallon bucket and drill a hole in it.  Let’s say that it has a drain rate of one gallon every six minutes.  If I average no more than one gallon every six minutes, I can pour into perpetuity without overflow…assuming that the bucket wasn’t full when I began the process.

Step Four:  Handling History

That’s where we need to think about the next problem.  In order to implement a leaky bucket, I need to know the entire history of the bucket, or at least a long enough period of time that I can feel confident that I’m not going to miss important alerts.  If I’m only looking at the last 30 minutes, I’m missing the fact that the bucket might not have been empty when my 30-minute window started, so I have to look back at the prior 30-minute window.  But to do that, I need to look back at the prior window, and so on, until I get back to the first usage of that bucket.

You might be wondering why that’s important, and here goes:  let’s say that we have one error every 8 minutes.  Based on the error rate alone and alerting at 5 errors in a 30-minute window, we will never fire an alert even though we have a consistent error.  That means we’re potentially ignoring useful information and this potential false negative might come back to bite us later.

By contrast, with our leaky bucket, we will eventually overflow, though it will take a while.  We have 7.5 errors per hour coming in and 6 errors per hour draining out, so we have a little bit of an algebra problem:  5 = 7.5n - 6n, which simplifies to 5 = 1.5n, so n = 3.33 hours.  After 3.33 hours, our bucket will fill and we’ll get an alert.
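
If you want to play with those numbers, the arithmetic is easy enough to check in T-SQL.  This is just the formula from above with the rates expressed per hour:

--Hours until a bucket of capacity 5 overflows when errors arrive faster than they drain.
DECLARE
	@BucketCapacity DECIMAL(9,2) = 5.0,
	@FillPerHour DECIMAL(9,2) = 7.5,  --one error every 8 minutes
	@DrainPerHour DECIMAL(9,2) = 6.0; --0.1 errors drained per minute

SELECT @BucketCapacity / (@FillPerHour - @DrainPerHour) AS HoursUntilOverflow; --3.33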

But looking at the entire history of a table is not really practical for us when that table gets large, so instead I’ll compromise:  I’ll look at 5 cycles.  The idea here is that no matter what happened, the entire bucket will have been flushed five times during this process and older errors are very unlikely to have an effect.  We still get the benefit of looking at longer timeframes and capturing slowly-building errors, but don’t need to accept the full cost of reading everything.

Step Five:  The Final Product

After all of this talk, I finally get to the code.  We’ll take the table defined above, add a few rows, and test a couple of scenarios.

So let’s add some rows:

INSERT INTO dbo.ApiCall
(
	EndpointName,
	RequestVerb,
	ResponseCode,
	ResponseTimeMS,
	LogDateGMT
)
VALUES
--Initial burst to trigger an alert; note that there are more than enough errors here.
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:03:10.274'),
--Occasional record, but should stay under the threshold
('Endpoint', 'POST', 500, 15, '2018-01-08 09:14:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:24:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:34:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:44:10.274'),
--Another burst to trigger an alert
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 09:48:10.274'),
--Occasional record to stay under the threshold, but at a fast enough pace that we eventually trigger again
('Endpoint', 'POST', 500, 15, '2018-01-08 09:59:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:09:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:19:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:24:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:29:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:34:10.274'),
('Endpoint', 'POST', 500, 15, '2018-01-08 10:39:10.274');

I have some tests I’d like to cover, but first let’s walk through the leaky bucket procedure itself.

CREATE OR ALTER PROCEDURE dbo.LeakyBucketAlert
(
@HistoricalFrameCycles INT = 5,
@FrameSizeSeconds INT = 60,
@DrainEventsPerFrame DECIMAL(9,2) = 0.1,
@MaxNumErrors INT = 5,
@ObservationFrameSeconds INT = 1800,
@StartDateGMT DATETIME2(3) = NULL
)
AS
SET NOCOUNT ON;

CREATE TABLE #Events
(
	FrameBeginTime DATETIME2(0),
	FrameEndTime DATETIME2(0),
	AlertIfFull TINYINT,
	NetBucketChange DECIMAL(9,2),
	FillLevelAtEndOfFrame DECIMAL(9,2)
);

DECLARE
	@FrameBeginTime DATETIME2(0),
	@StartingFillLevel DECIMAL(9,2),
	@NetBucketChange DECIMAL(9,2),
	@PriorFrameBeginTime DATETIME2(0);

SET @StartDateGMT = ISNULL(@StartDateGMT, GETUTCDATE());

DECLARE
	@CurrentWindowStartTime DATETIME2(0) = DATEADD(SECOND, -1 * @ObservationFrameSeconds, @StartDateGMT),
	--Start at ObservationFrameSeconds and then go back the amount of time needed for the bucket to cycle through completely.
	@PriorCycleBeginTime DATETIME2(0) = DATEADD(SECOND, -@HistoricalFrameCycles * (1.0 * @MaxNumErrors / @DrainEventsPerFrame * @FrameSizeSeconds) - @ObservationFrameSeconds, @StartDateGMT);

--The Drain events happen every period
INSERT INTO #Events
(
	FrameBeginTime,
	FrameEndTime,
	AlertIfFull,
	NetBucketChange
)
SELECT
	frame.FrameBeginTime,
	frame.FrameEndTime,
	CASE
		WHEN frame.FrameBeginTime >= @CurrentWindowStartTime THEN 1
		ELSE 0
	END AS AlertIfFull,
	-1 * @DrainEventsPerFrame AS NetBucketChange
FROM dbo.Tally t
	CROSS APPLY
	(
		SELECT
			DATEADD(SECOND, t.N * @FrameSizeSeconds, @PriorCycleBeginTime) AS FrameBeginTime,
			DATEADD(SECOND, (t.N + 1) * @FrameSizeSeconds, @PriorCycleBeginTime) AS FrameEndTime
	) frame
WHERE
	t.N < 1.0 * DATEDIFF(SECOND, @PriorCycleBeginTime, @StartDateGMT) / @FrameSizeSeconds;

--The Fill events might happen during a period; if they do, collect them.
WITH FillEvents AS (
	SELECT
		e.FrameBeginTime,
		COUNT(1) AS FillEvents
	FROM dbo.ApiCall ac
		INNER JOIN #Events e
			ON ac.LogDateGMT >= e.FrameBeginTime
			AND ac.LogDateGMT < e.FrameEndTime
	WHERE
		ac.ResponseCode = 500
		AND ac.LogDateGMT >= @PriorCycleBeginTime
	GROUP BY
		e.FrameBeginTime
)
UPDATE e
SET
	NetBucketChange = e.NetBucketChange + fe.FillEvents
FROM #Events e
	INNER JOIN FillEvents fe
		ON e.FrameBeginTime = fe.FrameBeginTime;

--Marking the bucket fill line
--Basic rules:
--	1)  Cannot go below 0
--	2)  Cannot go above @MaxNumErrors
--	3)  Addition and subtraction happen "simultaneously" so we apply the net change as a unit
--Because of rules 1 and 2, we cannot simply use a rolling window summation to determine fill level.
DECLARE
	BucketCursor CURSOR LOCAL FAST_FORWARD FOR
	SELECT
		e.FrameBeginTime,
		e.NetBucketChange,
		LAG(e.FrameBeginTime) OVER (ORDER BY e.FrameBeginTime) AS PriorFrameBeginTime
	FROM #Events e
	ORDER BY
		e.FrameBeginTime ASC;

OPEN BucketCursor;
FETCH NEXT FROM BucketCursor INTO @FrameBeginTime, @NetBucketChange, @PriorFrameBeginTime;

WHILE (@@FETCH_STATUS = 0)
BEGIN
	SELECT
		@StartingFillLevel = FillLevelAtEndOfFrame
	FROM #Events e
	WHERE
		e.FrameBeginTime = @PriorFrameBeginTime;

	SET @StartingFillLevel = ISNULL(@StartingFillLevel, 0);

	UPDATE #Events
	SET
		FillLevelAtEndOfFrame = CASE
			WHEN @StartingFillLevel + @NetBucketChange < 0 THEN 0
			WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN @MaxNumErrors
			ELSE @StartingFillLevel + @NetBucketChange
		END
	WHERE
		FrameBeginTime = @FrameBeginTime;

	FETCH NEXT FROM BucketCursor INTO @FrameBeginTime, @NetBucketChange, @PriorFrameBeginTime;
END
CLOSE BucketCursor;
DEALLOCATE BucketCursor;

WITH records AS
(
	SELECT
		e.FrameBeginTime,
		e.FrameEndTime,
		e.AlertIfFull,
		LAG(e.FillLevelAtEndOfFrame) OVER (ORDER BY e.FrameBeginTime) AS FillLevelAtStartOfFrame,
		e.NetBucketChange,
		e.FillLevelAtEndOfFrame
	FROM #Events e
	WHERE
		EXISTS
		(
			SELECT 1
			FROM #Events ee
			WHERE
				ee.AlertIfFull = 1
				AND ee.FillLevelAtEndOfFrame = @MaxNumErrors
		)
)
SELECT
	r.FrameBeginTime,
	r.FrameEndTime,
	r.FillLevelAtStartOfFrame,
	-1 * @DrainEventsPerFrame AS DrainEventsPerFrame,
	r.NetBucketChange + @DrainEventsPerFrame AS AddEventsPerFrame,
	r.NetBucketChange,
	r.FillLevelAtEndOfFrame
FROM records r
WHERE
	r.AlertIfFull = 1;
GO

Fun fact: I spent an entire day trying to remove that cursor. Don’t tell my boss.

Let’s explain what’s happening here.  At the top, we declare two variables:  @CurrentWindowStartTime and @PriorCycleBeginTime.  @CurrentWindowStartTime is @ObservationFrameSeconds ago from our starting point, and @PriorCycleBeginTime is @HistoricalFrameCycles full drain cycles back behind that.  To make this concrete with the default parameters, the current window’s start time is 30 minutes prior to the start date.  A full bucket of 5 errors draining at 0.1 errors per minute takes 50 minutes to empty, so the prior cycle begin time is 5 * 50 + 30 = 280 minutes prior to the start date; in other words, five complete drain cycles sit between the prior cycle’s begin time and the current window’s start time.
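
To make those boundaries concrete, you can compute them directly.  Here’s a quick sketch with a made-up start date, using the default parameters:

--With the defaults, a full bucket of 5 errors drains at 0.1 per minute, so one complete drain
--cycle is 50 minutes and the total lookback is 5 * 50 + 30 = 280 minutes.
DECLARE @StartDateGMT DATETIME2(3) = '2018-01-08 10:00:00';

SELECT
	DATEADD(SECOND, -1800, @StartDateGMT) AS CurrentWindowStartTime,                          --30 minutes back
	DATEADD(SECOND, -5 * (1.0 * 5 / 0.1 * 60) - 1800, @StartDateGMT) AS PriorCycleBeginTime;  --280 minutes back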

From there, I use a tally table that I’ve built separately to build a constant drain rate.  If you don’t have a tally table, here’s one for you:

SELECT TOP(10001)
	IDENTITY(INT, 0, 1) AS N
INTO dbo.Tally
FROM sys.all_objects s1 --sys.all_objects instead of sys.objects, so there are enough rows even in a nearly-empty database
	CROSS JOIN sys.all_objects s2;
ALTER TABLE dbo.Tally ADD CONSTRAINT [PK_Tally] PRIMARY KEY CLUSTERED (N);

The Events temp table lets me see how many events per minute I have, as well as the drain rate.  I’m using a constant drain rate here, though you could modify this to have a proportionate drain rate (where it drains faster with more errors) if you wanted.  I also have a flag, AlertIfFull.  That flag makes sure that I don’t accidentally alert on historical overflows—my assumption is that I already caught those, so I don’t need to keep sending messages out.
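
Regarding that proportionate drain rate:  one way to sketch it would be to skip the constant drain in the first INSERT (so NetBucketChange holds only the fill events) and then drain a percentage of the bucket’s contents inside the cursor loop.  @DrainFraction here is a hypothetical parameter, say 0.02 to drain 2% of the bucket per frame:

	UPDATE #Events
	SET
		FillLevelAtEndOfFrame = CASE
			WHEN @StartingFillLevel * (1.0 - @DrainFraction) + @NetBucketChange < 0 THEN 0
			WHEN @StartingFillLevel * (1.0 - @DrainFraction) + @NetBucketChange > @MaxNumErrors THEN @MaxNumErrors
			ELSE @StartingFillLevel * (1.0 - @DrainFraction) + @NetBucketChange
		END
	WHERE
		FrameBeginTime = @FrameBeginTime;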

Next up, there’s the cursor.  I feel obligated to explain why I need a cursor here, especially when I clearly know enough about window functions to use them.  The problem is that we have not only a window, but also a floor and ceiling (no extended metaphor or pun intended).  I cannot go below 0 errors and I cannot go above @MaxNumErrors errors.  In other words, I cannot build up error immunity by having an extended period of time with zero errors and thereby go negative.  I also do not want to see the error count pushed up so high that it never drops back down—that would be like trying to pour 10 gallons into a 5-gallon bucket and counting those 10 gallons as there.  They’re not really there—we had an overflow and now we’re moving on with a full 5-gallon bucket.  Anyhow, the nature of that rule means that I couldn’t simply use a window function to solve this problem.
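
To see the issue concretely, here’s what a naive rolling sum against #Events looks like.  This is just an illustration, not part of the procedure:

--A plain running total ignores the floor and the ceiling.
SELECT
	e.FrameBeginTime,
	SUM(e.NetBucketChange) OVER (ORDER BY e.FrameBeginTime ROWS UNBOUNDED PRECEDING) AS NaiveRunningTotal
FROM #Events e
ORDER BY
	e.FrameBeginTime;
--During a long quiet stretch, NaiveRunningTotal drifts far below zero, so a later burst has to
--fill all of that phantom capacity before the bucket ever registers as full.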

Instead, for each minute, I figure out what the starting point is, add or drain based on whether NetBucketChange is positive or negative (or do nothing if it’s 0), and deal with floors or ceilings.  The end result of that is FillLevelAtEndOfFrame, which is the thing I really care about.

Finally, after the cursor ends, we look at the Events temp table and return results if we experienced an overflow event.

Step Six:  Proving That It Works

Here are my tests given the leaky bucket.  I only alert when I get rows back, so if the result set is empty, we don’t send an alert.

--Expected:  no results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:02:00';

--Expected:  results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:04:00';

--Expected:  results with some drain.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:07:00';

--Expected:  results with continued drain and occasional bumps.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:25:00';

--Expected:  no results.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:45:00';

--Expected:  new results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 09:50:00';

--Expected:  no results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:20:00';

--Expected:  results
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:34:00';

--Expected:  results.  We aren't letting the bucket drain and keep getting errors, so we keep alerting.
EXEC dbo.LeakyBucketAlert
	@StartDateGMT = '2018-01-08 10:39:00';
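
If you wanted to wire this into a scheduled job, the wrapper could look something like the following.  This is a sketch which assumes Database Mail is configured; the profile name and recipient are placeholders:

--Capture the procedure's output and send mail only if the bucket overflowed.
CREATE TABLE #AlertFrames
(
	FrameBeginTime DATETIME2(0),
	FrameEndTime DATETIME2(0),
	FillLevelAtStartOfFrame DECIMAL(9,2),
	DrainEventsPerFrame DECIMAL(9,2),
	AddEventsPerFrame DECIMAL(9,2),
	NetBucketChange DECIMAL(9,2),
	FillLevelAtEndOfFrame DECIMAL(9,2)
);

INSERT INTO #AlertFrames
EXEC dbo.LeakyBucketAlert;

IF EXISTS (SELECT 1 FROM #AlertFrames)
BEGIN
	EXEC msdb.dbo.sp_send_dbmail
		@profile_name = 'AlertProfile',     --placeholder profile name
		@recipients = 'oncall@example.com', --placeholder recipient
		@subject = 'API leaky bucket overflow',
		@body = 'The API error bucket overflowed during the current observation window.';
END

DROP TABLE #AlertFrames;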

Step Seven: Another Approach

One area of potential improvement could be tracking whether a set of errors has already been part of an alert and ignoring those errors from then on.  That would be equivalent to pouring out the entire bucket and starting over whenever it overflows.  There are arguments for and against this approach.  I ended up going with my approach because I want to be prodded if my API keeps sending me waves of errors, but I could see the argument in favor of emptying the bucket completely.  In that case, you’d just need to change the #Events update so that WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN 0, as sketched below.  Then you’d have a leaky bucket with a full flush after overflow.
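
Concretely, that change lands in the cursor’s UPDATE statement.  Here’s a sketch of the modified CASE expression; note that if you go this route, you’d also want the final alert check to key off the overflow itself (say, a separate flag column in #Events set whenever @StartingFillLevel + @NetBucketChange exceeds @MaxNumErrors) rather than off FillLevelAtEndOfFrame = @MaxNumErrors, since a freshly flushed bucket never shows as full:

	UPDATE #Events
	SET
		FillLevelAtEndOfFrame = CASE
			WHEN @StartingFillLevel + @NetBucketChange < 0 THEN 0
			--Dump the whole bucket after an overflow instead of leaving it full.
			WHEN @StartingFillLevel + @NetBucketChange > @MaxNumErrors THEN 0
			ELSE @StartingFillLevel + @NetBucketChange
		END
	WHERE
		FrameBeginTime = @FrameBeginTime;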


SQL Server ML Services Fills The Plan Cache

We call SQL Server ML Services a lot.  As in, hundreds of thousands of executions a day to build models.  It turns out that doing this has a negative effect:  ML Services plans end up staying in the plan cache and don’t get removed.  Here’s how our plan cache looks:

[Screenshot: a plan cache.  Precipitous drops are precipitated by service restarts.]

What happens is that things work fine for a while, until our plan cache hits about 70 GB, after which point we start getting RESOURCE_SEMAPHORE waits on some of our queries and the available space for the buffer pool drops to single-digit gigabytes.
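
If you want to keep an eye on this yourself, the relevant cache stores are easy to check.  Here’s a quick sketch against the memory clerk DMV:

--Plan cache memory by cache store (ad hoc/prepared plans, object plans, and bound trees).
SELECT
	mc.[type],
	SUM(mc.pages_kb) / 1024.0 AS CacheSizeMB
FROM sys.dm_os_memory_clerks mc
WHERE
	mc.[type] IN ('CACHESTORE_SQLCP', 'CACHESTORE_OBJCP', 'CACHESTORE_PHDR')
GROUP BY
	mc.[type]
ORDER BY
	CacheSizeMB DESC;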

This is a problem on SQL Server 2016 and SQL Server 2017.  It’s very unlikely to affect most people, as most people don’t do crazy stuff at this scale.  But hey, what’s the fun in having a server of my own if I can’t bring it to its knees every once in a while?

Non-Solutions

The first thing that you might do here is try to run something like DBCC FREEPROCCACHE or maybe DBCC FREESYSTEMCACHE('SQL Plans') WITH MARK_IN_USE_FOR_REMOVAL, but neither of those did anything for me. It appears that R/ML Services plans are not marked for removal and will not clear, no matter how many times you try to flush the cache.

Workaround

For now, the workaround I have is to restart the SQL Server service occasionally. You can see that I have done it twice in the above screenshot. Our application is resilient to short database downtimes, so this isn’t a bad workaround for us; it’s just a little bit of an annoyance.

One thing to keep in mind if you are in this scenario is that if you are running ML Services hundreds of thousands of times a day, your ExtensibilityData folders might have a lot of cruft which may prevent the Launchpad service from starting as expected. I’ve had to delete all folders in \MSSQL14.MSSQLSERVER\MSSQL\ExtensibilityData\MSSQLSERVER01 after stopping the SQL Server service and before restarting it. The Launchpad service automatically does this, but if you have a huge number of folders in there, the service can time out trying to delete all of them.  In my experience at least, the other folders didn’t have enough sub-folders inside to make it worth deleting, but that may just be an artifact of how we use ML Services.

Solution

I have worked with Microsoft on the issue and they’re going to release a patch in a future SQL Server 2017 CU to fix this issue.  I’m not sure about SQL Server 2016 and also don’t know exactly when this patch will ship, but it’s working through the pipeline and I’m happy for that.

Video: Applying Forensic Accounting Techniques With SQL And R

I have a new video up:

For more information: http://csmore.info/on/outliers

Forensic accountants and fraud examiners use a range of techniques to uncover fraudulent journal entries and illegal activities. As data professionals, most of us will never unravel a Bernie Madoff scheme, but we can apply these same techniques in our own environments to uncover dirty data. This session will use a combination of SQL Server and R to apply these fraud detection techniques, which include Benford’s Law, outlier analysis, time series analysis, and cohort analysis.

Click through for the video, which turns out to be exactly the same number of seconds as my first video.  That was not planned but I thought it was pretty cool.

T-SQL Tuesday 95: Polybase And HDInsight

Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series.  If you haven’t seen this already, I have a twenty-some part series on Polybase in SQL Server 2016 and now SQL Server 2017.  Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately.  That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have in rough reading order.  I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic.  Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start.  Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well.  Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to look at trying to connect a SQL Server instance to Azure HDInsight and see if I can take advantage of predicate pushdown.  The short answer is, not really.  The long answer is, read on.

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster.  We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster.  Here are my cluster configuration options:

[Screenshot: cluster configuration options]

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else.  Note that Windows is an option for operating system, but nobody in their right mind runs Hadoop on Windows.  It’s dead; leave it be.

Next, fill out the basics.  Here’s my setup:

[Screenshot: cluster basics]

You can configure your storage options after that.  I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage.  If you want Hive or Oozie metadata to persist after you kill the cluster—something that you might want to do in production as you shut down the HDInsight cluster when not in use—you can choose an Azure SQL Database in the dropdowns at the bottom.

[Screenshot: storage settings]

Once you’ve got storage set up, you want to edit the cluster size.  By default, you get two D12 head nodes and 4 D4 worker nodes.  This will run you $3.63 an hour in East US.  That’s about $87 a day, or roughly $2600 per month.

[Screenshot: editing the cluster size]

You probably want to trim that down a bit if you’re just playing around.  Here’s what I ended up with:

[Screenshot: the smaller node sizes I ended up with]

Just using 2 D3 nodes cuts the price in half.  It’d be a terrible demo for Hadoop and completely unfair in a performance test, but our purpose here is just to get a simple Polybase demo working, not provide a comprehensive study on Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

[Screenshot: HDInsight deployment in progress]

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation.  To get the right version of Hadoop, I need to SSH to the NameNode:

[Screenshot: SSH connection details for the cluster]

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

[Screenshot: HDP version output]

That means that in my yarn-site.xml file, all references will be to /usr/hdp/2.6.2.2-5.

The Problem

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  ports are not available.  If you’ve put your HDInsight cluster into an Azure VNet and monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

One Workaround

There’s one consistent workaround that I see people use:  have Azure Blob Storage work as an intermediary.  In this scenario, you’d use Polybase to write to Azure Blob Storage and then pull that data into your Azure Storage local to the HDInsight cluster.  This works for data transfer, but it doesn’t allow for predicate pushdown.
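
For what it’s worth, that intermediary pattern is just the standard Polybase-to-blob-storage setup.  Here’s a rough sketch; the credential, storage account, container, and table definition are all placeholders, and it assumes a database master key already exists:

--Credential holding the storage account key (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';
GO
--Point Polybase at the blob storage account backing the HDInsight cluster.
CREATE EXTERNAL DATA SOURCE WasbStorage WITH
(
	TYPE = HADOOP,
	LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
GO
--An external table you can write to with INSERT INTO...SELECT (after enabling the
--'allow polybase export' sp_configure option); the schema here is just an example.
CREATE EXTERNAL TABLE dbo.SomeExport
(
	SomeID BIGINT,
	SomeValue NVARCHAR(128)
)
WITH
(
	LOCATION = '/export/',
	DATA_SOURCE = WasbStorage,
	FILE_FORMAT = CsvFileFormat
);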

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy.  Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour.  Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money.  You will be charged for an HDInsight cluster even if you’ve shut off all of the machines.  There is no free pause option; the only way not to pay for the service is to delete the cluster.

Unfortunately, I was not able to get this template working.  I got through step one of the deployment but got stuck with a status of Action Required:

[Screenshot: deployment stuck with a status of Action Required]

When clicking on the deployment link, it’d sit there spinning, waiting for something to load.  In my browser console, I saw the problem:

[Screenshot: API error in the browser console]

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above.  I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point.  There’s a lot of effort to get started, and frankly, if I’m going to go through all of this effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control.  I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node.  There are cases where I can see HDInsight being useful, but I don’t think there’s enough time and administrative savings over building your own Hadoop cluster to make it a worthwhile choice.  Maybe in a few years they’ll have it nailed down, but for now, I’d stick to custom-built Hadoop either on-prem or as VMs in a cloud provider like Azure or AWS.

New Video: Much Ado About Hadoop

I’ve decided to start recording my presentations to ensure that I have them in video form.  They’re much closer to one-take videos than polished, professional videos, but you’ve got to start somewhere.

Here is my first video, Much Ado About Hadoop.

The main audience for this video is people who are interested in learning a bit about the Hadoop ecosystem from a high level but not sure where to begin.  I cover this from a historical perspective, explaining what the major components of Hadoop are, why they made sense in the context of contemporary hardware and software paradigms, and how the Hadoop ecosystem has evolved in response to changes in hardware and software.  This talk is approximately 90 minutes in length and is an expanded version of the talk I’ve given at SQL Saturdays.

From here, as I get a bit more familiar with my software and equipment—and as I get more time—I will post additional videos.  I already know what my second video will be and want to get it out in the next two or three weeks, depending upon circumstances.  From there, who knows?

NDC Sydney Talks Up

NDC has released videos for NDC Sydney.  The NDC Conferences channel has the full set, and there were some fantastic talks this year.

I gave two talks at the conference:

They were surprisingly not awful!  Probably due to all of the post-production…

My favorite talk from NDC Sydney was Laura Bell talking about security for startups.  Definitely watch that one too; she gives four easy tips for us to use to improve our companies’ security postures.

Feasel’s Law

I’ve had this idea on my mind for a while and figured I might as well write it down…

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

This is a bit tongue-in-cheek, but it plays out pretty well; typically, a major sign of database maturity is its reimplementation (at least in part) of SQL.  Examples:

  • Hadoop started out as writing map-reduce jobs in Java.  Yahoo then created Pig to work with ETL processes.  After that, Facebook created Hive, an implementation of SQL for Hadoop.
  • Spark started out using Java and Scala, and then Python.  Not too long after, we had Shark and then Spark SQL.
  • HBase is a NoSQL database on top of Hadoop.  Now we have Phoenix, which is a SQL interface to HBase.
  • The primary methods of operation with Cassandra are SQL-like statements.
  • Riak TS has a SQL interface.  Incidentally, I love the author’s comment that SQL isn’t dead yet…
  • The easiest way to access CosmosDB data?  SQL.  That interface isn’t fully baked yet—it doesn’t allow grouping, for example—but it’s getting there.

One of the few strong counter-examples is MongoDB, which doesn’t have a SQL interface but does have a SQL translation guide.  DynamoDB also does not offer a SQL interface, though there are third-party interfaces and management tools which give you the same effect.

Otherwise, if you’re using a database that was created after 1990, has any significant user base, and is a mature platform, chances are it’s got a native SQL interface.