Pluralsight Reviews: Scala

I recently had the chance to go through Justin Pihony’s Pluralsight course on Scala.  If you want to write Spark code using Scala, this is a good starter course.

The premise of the course is to write a relatively simple file searcher, but Justin takes you through the entire process, including getting build tools working, building (and automatically running) unit tests, and developing code to meet evolving requirements.

I particularly appreciated his introduction of FlatSpec and writing tests like the following:

"Matcher that is passed a file matching the filter" should
  "return a list with that file name" in{
    val matcher = new Matcher("fake", "fakePath")
    val results = matcher.execute()
    assert(results == List(("fakePath", None)))
  }

I might never have learned about FlatSpec if it weren’t for this course, so I appreciate that.

DevTeach Review

I had the pleasure last week of attending and speaking at DevTeach.  This year’s DevTeach was in downtown Montreal, and here are a few scattered, unstructured thoughts on the event:

  • On day 1, Itzik Ben-Gan gave a talk on advanced SQL Server techniques, and I found one technique so interesting that I had to blog about it.  About 70% of his presentation was typing, and that takes chutzpah.
  • One of the overarching themes of this year’s DevTeach was DevOps.  Dylan Smith gave a great talk on the concept, as did Julien Stroheker.  A few other sessions throughout the event hit on the topic as well, and on Tuesday night, the DotNetRocks guys did a panel discussion on DevOps, what it means, and what the future holds.
  • I was impressed at how interested many of the attendees were in security.  Security and developers don’t typically go together very well, but there were several security-focused talks (including one of mine).  I think part of this goes back to the DevOps theme:  as developers get more intertwined into administrative affairs, they (hopefully) become more aware of the security ramifications of their decisions.  But maybe that’s the optimism talking…
  • Joel Hebert gave a couple of great enterprisey talks, one on logging and the other on transient fault handling.  I loved the logging talk, as it hit many really important points.  For example, Joel recommended logging anything and everything; specifically, log incoming payloads, outgoing payloads, and any mappings (transformations) which occur during the process.  This results in a lot of writes, but it gives you enough information to debug without attaching a debugger or stepping through code.  If you log appropriately, you might even be able to pick up where you left off after fixing the underlying issue, which could save a huge amount of time when dealing with large processes.  I’ve been hitting similar topics at work lately, so this talk resonated with me; I ended up taking about a page of notes.  Right after that talk, he gave another one on transient fault handling, focused on retry logic.  His main theme was that you have to be smart about retries:  it doesn’t make sense to retry a persistent error, but it also doesn’t make sense to give up over a momentary blip, so catch specific exceptions (like timeouts), try again with a custom backoff strategy (e.g., 1 second, then 3 seconds, then 30 seconds, then 90 seconds), and implement the circuit breaker pattern.  There’s a small sketch of the retry idea just after this list.
  • I’ll end on a lighter note.  Because DevTeach is in Montreal, some of the sessions were in French and a lot of the speakers are native French speakers.  I can pick out a word here and there, but I make no pretense of understanding French.  On days 2 and 3, both morning speakers started out in French even though their abstracts were in English and DevTeach had a specific code indicating which sessions would be in French.  Fortunately, they both switched to English very early on, so I was able to follow along and learn a bit about a couple of interesting topics.
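
To make that retry idea concrete, here is a minimal sketch in T-SQL.  This is my own illustration rather than code from Joel’s talk:  it assumes a hypothetical dbo.SomeRemoteCall procedure, uses deadlocks (error 1205) as a stand-in for a transient error, and leaves out the circuit breaker piece.

DECLARE @attempt INT = 1;
DECLARE @maxAttempts INT = 4;
DECLARE @delay CHAR(8);

WHILE @attempt <= @maxAttempts
BEGIN
    BEGIN TRY
        EXEC dbo.SomeRemoteCall;  -- Hypothetical call that might fail transiently.
        BREAK;                    -- Success, so stop retrying.
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() <> 1205 OR @attempt = @maxAttempts
            THROW;  -- Persistent error or out of retries:  let it bubble up.

        -- Custom backoff schedule rather than hammering the resource.
        SET @delay = CASE @attempt WHEN 1 THEN '00:00:01' WHEN 2 THEN '00:00:03' ELSE '00:00:30' END;
        WAITFOR DELAY @delay;
        SET @attempt += 1;
    END CATCH;
END;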

I’d love to hit DevTeach again next year, so I’ll have to keep an eye out for the CFP.

Penguatroll on the move

We’ll be moving at the end of this week. It’ll be a little bittersweet to leave Kansas, but it’s just time to move on and find something a little better for us. Probably no posts from me for a while, so hopefully you find Kevin’s posts on computinating brilliantly entertaining!

Why Use Polybase? Looking At Bigger Data Sets

So far, I’ve used two data sets in my Polybase series:  a set of 777 second basemen and a set of about 1.3 million flights.  Neither of those really helps us get to the value behind Polybase.

All(ish) Flights

As a result, I want to look at all flights from 1987 through 2008.  To do this, I’m creating a new external table, this time pointing to a folder rather than a single file:

USE [AirportData]
GO
CREATE EXTERNAL TABLE [dbo].[AllFlights]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'/tmp/airport/',
    DATA_SOURCE = HDP,
    FILE_FORMAT = CSVTextFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

CREATE STATISTICS AllFlights_Dest ON dbo.AllFlights (Dest, [Year]) WITH FULLSCAN;

In the process of loading this data, I learned that the files I have for 2001 and 2002 were funky and caused Polybase errors; I wasn’t able to load a single row from them without an error.  Because I don’t need them for this, I took the easy way out and deleted the files.  As a result, I have data from 1987 through 2000 and from 2003 through 2008.  That comes to approximately 123 million records in total, which is not bad for a “big data” example.
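
As a quick sanity check, the following query (my own follow-up, not part of the load itself) counts flights by year against the external table; it makes the 2001-2002 gap obvious, though be aware that it kicks off a full scan of the folder:

SELECT
    f.[year],
    COUNT(*) AS NumberOfFlights
FROM dbo.AllFlights f
GROUP BY f.[year]
ORDER BY f.[year];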

The Query

My query is a rather simple one:  give me all flights which landed in Columbus in 2005.  I’m going to run this three separate ways:  once using Polybase, once using a direct Hive query, and once using Hive over a linked server.  Let’s start with Polybase.

Polybase

My expectation here is that Polybase should push both parts of the predicate down to Hadoop and return only the rows we need.

SELECT
    f.origin,
    f.distance
FROM dbo.AllFlights f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

When I ran this query, it took 11 minutes and 51 seconds to return 37,946 rows.  Here’s the execution plan:

[Execution plan image: PolybaseBigQuery.png]

Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop.  That’s hardly a fair setup for a distributed processing system.  Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible.  This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important.  I’m using the same hardware in all three cases, so in that sense this is a fair comparison.
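
If you want to confirm that the work really was handed off to the Hadoop side rather than inferring it from the plan shape, the Polybase DMVs are one place to look.  This is just a sketch, assuming the SQL Server 2016 Polybase DMVs; the column list is from memory:

-- Recent distributed (Polybase) requests and how long they took.
SELECT TOP(10)
    execution_id,
    status,
    start_time,
    end_time,
    total_elapsed_time
FROM sys.dm_exec_distributed_requests
ORDER BY end_time DESC;

-- The external work items (the file splits read on the Hadoop side) per execution.
SELECT
    execution_id,
    step_index,
    input_name,
    bytes_processed,
    status
FROM sys.dm_exec_external_work;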

Polybase execution time:  11 minutes, 51 seconds

Hive

My next thought was to run this directly against Hive.  That will give me an idea of what it would take for a neutral third-party application built on Hive to get the data back.  Previously, I had loaded the flight data into a table called flights in my default database, so the query is quite simple:

SELECT
    f.origin,
    f.distance
FROM flights f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

For this one, I have three conflicting answers.  First, the Hive log tells me that the query took 5 minutes and 22 seconds:

[Hive log screenshot: HiveLog]

Second, when I ran the query, I kept track of how long it ran, and it did not finish returning results until about 12 minutes in.

Finally, if I look at the job tracker history, I get back 15 minutes and 10 seconds:

[Job tracker history screenshot: HiveElapsedTime]

I’m not familiar enough with Hive’s internals yet to understand exactly why there’s a roughly 10-minute spread between these numbers, but I’m going to go with 12 minutes, because that’s when all of the data had finished loading on my screen.

Hive:  12 minutes

Linked Server to Hive

My final example is a linked server to Hive.  I already have a linked server set up and can write a simple four-part query, similar to the direct Hive query above:

SELECT
    f.origin,
    f.distance
FROM [Hadoop].[HIVE].[default].[flights] f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

This query ran and ran and ran…taking 58 minutes and 54 seconds to return 37,946 records.  The reason is obvious from the query plan:

[Execution plan image: HiveLinkedServerPlan.png]

This is using Hadoop exactly wrong.  We had to stream all 123 million records over to SQL Server, convert them into SQL types (which is what the Compute Scalar operator does), and only then filter out all but the 37,946 records we need.
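
As an aside (and not something this comparison tested), the classic way to get a linked server to filter remotely is OPENQUERY, which ships the entire statement to the other side.  A sketch, reusing the [Hadoop] linked server from above, so that Hive evaluates the WHERE clause itself:

SELECT *
FROM OPENQUERY([Hadoop],
    'SELECT f.origin, f.distance
     FROM flights f
     WHERE f.year = 2005
       AND f.dest = ''CMH''');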

Linked Server to Hive:  58 minutes, 54 seconds

Conclusions

This is a quick example of how powerful Polybase can be.  In a simple, non-optimized case (which is the worst case scenario), Hive with Tez and Polybase performed about the same from a user’s perspective:  that is, getting data from Hadoop to my screen.  Both of these technologies were 5 times faster than the alternative, and with appropriate optimizations and running on stronger hardware, the differences would be even bigger.

TIL: Another Great Use Of APPLY

Yesterday, at DevTeach, I got to see Itzik Ben-Gan give a session on advanced T-SQL operations, mostly around gaps & islands but also including a couple of other gems.  Something that caught my eye was an interesting way of using APPLY to unpivot data.  I know a few good uses of the APPLY operator, and this was a new one on me.

First, the setup.  Suppose we have a table which has pivoted quantity and value metrics for a number of years, and we’re bringing it into SQL Server as-is.

DROP TABLE IF EXISTS #Sales;
CREATE TABLE #Sales
(
	Product VARCHAR(50),
	Qty2013 INT,
	Qty2014 INT,
	Qty2015 INT,
	Val2013 INT,
	Val2014 INT,
	Val2015 INT
);

INSERT INTO #Sales(Product, Qty2013, Qty2014, Qty2015, Val2013, Val2014, Val2015) VALUES
('P1', 200, 230, 255, 1995, 2448, 3006),
('P2', 126, 129, 127, 448, 463, 451),
('P3', 600, 16000, 38880, 750, 24000, 60000),
('P4', 390, 380, 370, 3000, 2900, 2800),
('P5', 125, 125, 125, 17008, 17008, 17008);

Performing any kind of aggregation on this is painful, so we’d like to unpivot the data. The method that I’m most familiar with is SUM(CASE) or MAX(CASE), but in this case, because we’re dealing with unique rows, we can skip the aggregation:

SELECT
	s.Product,
	y.[Year],
	CASE
		WHEN y.Year = 2013 THEN Qty2013
		WHEN y.Year = 2014 THEN Qty2014
		WHEN y.Year = 2015 THEN Qty2015
	END AS Quantity,
	CASE
		WHEN y.Year = 2013 THEN Val2013
		WHEN y.Year = 2014 THEN Val2014
		WHEN y.Year = 2015 THEN Val2015
	END AS [Value]
FROM #Sales s
	CROSS JOIN (VALUES(2013),(2014),(2015)) y([Year]);

This code works, but if you have dozens of years, it gets messy writing those CASE expressions, and you’re a bit more likely to make a mistake when refactoring the code.  Here’s a simpler version using CROSS APPLY:

SELECT
	s.Product,
	y.[Year],
	y.Quantity,
	y.[Value]
FROM #Sales s
	CROSS APPLY (VALUES
		(2013, [Qty2013], [Val2013]),
		(2014, [Qty2014], [Val2014]),
		(2015, [Qty2015], [Val2015])
	) y([Year], Quantity, [Value]);

It’s a little easier to read than the other version, and adding additional years is pretty straightforward.  That makes this a great tip when you’re trying to refactor poorly-thought-out tables or bring potentially well-thought-out flat files into your system.
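
As a quick follow-on (my own example, building on the CROSS APPLY query above), this is the kind of aggregation that was painful against the pivoted table but becomes trivial once the data is unpivoted:

SELECT
    y.[Year],
    SUM(y.Quantity) AS TotalQuantity,
    SUM(y.[Value]) AS TotalValue
FROM #Sales s
    CROSS APPLY (VALUES
        (2013, [Qty2013], [Val2013]),
        (2014, [Qty2014], [Val2014]),
        (2015, [Qty2015], [Val2015])
    ) y([Year], Quantity, [Value])
GROUP BY y.[Year]
ORDER BY y.[Year];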

Polybase: Forcing Predicate Pushdown

It’s been a few days since my last Polybase post, so it’s time to get back in the saddle.

Reminder:  Where We Were

Going back a few posts, I wanted to join a SQL Server table to an HDFS-driven external table.  Here’s the query that stumped me:

SELECT
    *
FROM dbo.Flights f
    INNER JOIN dbo.airports ar
        ON f.dest = ar.IATA
WHERE
    ar.state = 'OH'
    AND ar.city = 'Columbus';

The query ran fine, but it did something that I did not wish to see out of Polybase:  it pulled the entire Flights table over to SQL Server and then joined that data set to the Airports table.  I know that, in this particular case, it makes a lot more sense to grab the four Columbus-based airports and make those destination codes part of the Polybase predicate.
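
To show what I mean, here is a sketch of the shape of filter I would like to see go to Hadoop.  I am hard-coding the destination codes here; CMH is the only one I will vouch for, and the commented placeholder stands in for the other Columbus-area codes rather than being a definitive list:

SELECT *
FROM dbo.Flights f
WHERE f.dest IN ('CMH' /* , ...the other Columbus-area airport codes... */);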

Force External Pushdown

When looking at the Getting Started guide, I noticed a Polybase query hint:  FORCE EXTERNALPUSHDOWN.  This might be just the ticket!

Additional Research

Before jumping into this hint, I figured I’d do a Google search on the topic, and it pulled back just a few results.  Blog posts from Sean Werick, Jonathan Allen, and Ayman El-Ghazali cover the hint (although Ayman uses Azure Blob Storage (WASB) as his use case, so he technically doesn’t use it) and explain that it forces the Hadoop cluster to do the computation rather than SQL Server.

Forcing the Push

Armed with this information, let’s modify the query just a little bit and force us some predicate pushdown:

SELECT
    *
FROM dbo.Flights f
    INNER JOIN dbo.airports ar
        ON f.dest = ar.IATA
WHERE
    ar.state = 'OH'
    AND ar.city = 'Columbus'
OPTION(FORCE EXTERNALPUSHDOWN);

As soon as I kick this off, I get an error:

Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.

Well, that’s not good…  Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which is very useful to me (though it might be to a Microsoft support tech).
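
For reference, here is roughly what that check looks like (a sketch; I am only pulling the columns I actually looked at, and the column names are from memory):

SELECT TOP(4)
    create_time,
    compute_node_id,
    details
FROM sys.dm_exec_compute_node_errors
ORDER BY create_time DESC;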

Doing A Simpler Push

Well, that seems weird, but maybe it’s because the predicate really needs to be evaluated on the SQL Server side; after all, we’re filtering on Airports, which is a SQL Server table.

So let’s query just the Flights table and see what happens here:

SELECT
    *
FROM dbo.Flights f
WHERE
    f.dest = 'CMH'
OPTION(FORCE EXTERNALPUSHDOWN);

The end result: the same error message!  Even when I limit my query just to a single external table, I’m still getting this error.

Disabling External Pushdown

Let’s go the opposite direction here and see if we can disable external pushdown.

SELECT
    *
FROM dbo.Flights f
WHERE
    f.dest = 'CMH'
OPTION(DISABLE EXTERNALPUSHDOWN);

In this case, the query does run and does return the 6,638 rows we expect, but it does the opposite of what I’d like:  it pulls all of the rows into SQL Server and only then filters them, taking over a minute in the process.

This is definitely a step back from where I want to be, but at least disabling external pushdown works…right?

[Execution plan image: DisabledPushdown]

The execution plan shows 6,638 rows returned, and the query takes the same amount of time whether I disable pushdown or not.

Conclusions

Unfortunately, I don’t have any conclusions here.  My hope was that forcing predicate pushdown would tell the optimizer that I really want to push as much work as possible down to the resulting MapReduce jobs, but this seems not to have worked out as expected.

This looks like a case in which more research will be required.

News & Events

Here are a couple of quick updates and happenings.

DevTeach

As this post goes live, I’m going to be on an airplane to Montreal to attend DevTeach.  I’ll be presenting on Wednesday and Thursday, both times in the 11:30-12:30 pre-lunch slot.

MVP

I got the official word on Friday that I am now a Data Platform MVP.

This is exciting news.  It’s an honor to get that appellation.

What’s Next?

This week, I have a couple of Polybase articles coming up.  I’m going to keep working on that as I try to unravel some of the mysteries of Polybase (though never quite unraveling the mystery of that capital B in the name PolyBase…I prefer a lower-cased b in there; so sue me…).

Aside from that, I’m going to try to put up a presentation on R that I’m going to deliver at SQL Saturday Columbus in just under a couple of weeks.  I’m going to cover that in a little bit of detail as a prep guide for my talk.