Update 2016-11-30: I have completely rewritten this post on using external pushdown. I’ll leave this post up for posterity’s sake, but you should read the new predicate pushdown post and not listen to any of the ravings of whichever madman wrote this work.
Update 2016-11-08: I ended up solving my predicate pushdown issue; it turns out that proper configuration is key. I’ll leave this post for posterity’s sake, but I do plan to rewrite it at some point now that I have a bit more knowledge.
It’s been a few days since my last Polybase post, so time to get back on the saddle.
Reminder: Where We Were
Going back a few posts, I wanted to join a SQL Server table to an HDFS-driven external table. Here’s the query that stumped me:
SELECT * FROM dbo.Flights f INNER JOIN dbo.airports ar ON f.dest = ar.IATA WHERE ar.state = 'OH' AND ar.city = 'Columbus';
The query ran alright, but it did something that I did not wish to see out of Polybase: it pulled all of the Flights table over to SQL Server and then joined that data set to the Airports table. I know that, in this particular case, it makes a lot more sense to grab the four Columbus-based airports and make those destination codes part of the Polybase predicate.
Force External Pushdown
When looking at the Getting Started guide, I noticed a Polybase query hint: FORCE EXTERNALPUSHDOWN. This might be just the ticket!
Before jumping into this hint, I figured I’d do a Google search on the topic and it pulled back just a few results. Blog posts from Sean Werick, Jonathan Allen, and Ayman El-Ghazali have the hint (although Ayman uses Azure WASB as his use case so technically doesn’t use the hint) and explain that it forces the Hadoop cluster to do compute rather than the SQL Server side.
Forcing the Push
Armed with this information, let’s modify the query just a little bit and force us some predicate pushdown:
SELECT * FROM dbo.Flights f INNER JOIN dbo.airports ar ON f.dest = ar.IATA WHERE ar.state = 'OH' AND ar.city = 'Columbus' OPTION(FORCE EXTERNALPUSHDOWN);
As soon as I kick this off, I get an error:
Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
Well, that’s not good… Checking sys.dm_exec_compute_node_errors gives me four error rows with stack trace results, none of which seems very useful to me (as opposed to a Microsoft support tech).
Doing A Simpler Push
Well, that seems weird, but maybe it’s due to the fact that the predicate really needs to happen on the SQL Server side—after all, we’re filtering on Airports, which is a SQL Server table.
So let’s query just the Flights table and see what happens here:
SELECT * FROM dbo.Flights f WHERE f.dest = 'CMH' OPTION(FORCE EXTERNALPUSHDOWN);
The end result: the same error message! Even when I limit my query just to a single external table, I’m still getting this error.
Disabling External Pushdown
Let’s go the opposite direction here and see if we can disable external pushdown.
SELECT * FROM dbo.Flights f WHERE f.dest = 'CMH' OPTION(DISABLE EXTERNALPUSHDOWN);
In this case, the query does run and does return the 6638 rows we expect, but it does the opposite of what I’d like: it pulls all of the rows into SQL Server and only then parses them out, taking over a minute in the process.
This is definitely a step back from where I want to be, but at least disabling external pushdown works…right?
The execution plan shows 6638 rows returned and the query takes the same amount of time whether I disable pushdown or not.
Unfortunately, I don’t have any conclusions here. My hope was that forcing predicate pushdown would tell the optimizer that I really want to force as much as I can down to the resultant MapReduce jobs, but this seems not to have worked out as expected.
This looks like a case in which more research will be required.