In the last post, we dug into how the MapReduce process works. Today, I want to spend a little bit of time talking about forcing MapReduce operations and disabling external pushdown.
What Can We Force?
Before we get into the mechanics of forcing pushdown, it's important to note that predicate pushdown is somewhat limited in terms of what information we can pass to the MapReduce engine.
For example, let’s say that I have a set of flight details in Hadoop and a set of airports in SQL Server. Suppose I want to run the following query as quickly as possible:
SELECT *
FROM dbo.Flights f
    INNER JOIN dbo.Airports ar
        ON f.dest = ar.IATA
WHERE
    ar.state = 'OH'
    AND ar.city = 'Columbus';
Ideally, the engine would start with the Airports table, filtering by state and city and getting back the three records in that table associated with Columbus airports. Then, knowing the IATA values for those three airports, the Polybase engine would feed those into the remote query predicate and we'd get back as few rows as possible. There's just one catch:
The Polybase engine does not work that way!
Unfortunately, my ideal scenario isn't one that Polybase can give me (at least today; we'll see if it can in the future). The only predicates the Polybase engine can push down are explicit ones, so we can't feed a set of results from one part of a query plan into Hadoop. If you did want to do something like that, I think the easiest solution would be to generate dynamic SQL and build an IN clause with the IATA/dest codes.
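As a rough sketch of that dynamic SQL workaround, something like the following would work, using the table and column names from the example query above. Note that STRING_AGG requires SQL Server 2017 or later; on earlier versions you'd need another concatenation technique, such as FOR XML PATH.

-- Build the list of Columbus airport codes first, then generate a query
-- with an explicit IN clause that Polybase can push down.
DECLARE @codes NVARCHAR(MAX);

SELECT @codes = STRING_AGG(QUOTENAME(IATA, ''''), ', ')
FROM dbo.Airports
WHERE
    state = 'OH'
    AND city = 'Columbus';

DECLARE @sql NVARCHAR(MAX) = N'
SELECT *
FROM dbo.Flights f
WHERE f.dest IN (' + @codes + N');';

EXEC sp_executesql @sql;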
So with that in mind, let’s talk more about forcing and preventing pushdown.
Forcing Predicate Pushdown
As a reminder, in order to allow predicate pushdown to occur, we need to hit a Hadoop cluster; we can’t use predicate pushdown on other systems like Azure Blob Storage. Second, we need to have a resource manager link set up in our external data source. Third, we need to make sure that everything is configured correctly on the Polybase side. But once you have those items in place, it’s possible to use the FORCE EXTERNALPUSHDOWN command like so:
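For reference, the resource manager link is the RESOURCE_MANAGER_LOCATION option on the external data source. A hypothetical definition might look like this (the data source name, host, and ports are placeholders; your cluster's YARN resource manager port will depend on your distribution and configuration):

-- Without RESOURCE_MANAGER_LOCATION, Polybase can still read the data,
-- but it cannot push predicates down as MapReduce jobs.
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050'
);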
SELECT TOP(50)
    sb.FirstName,
    sb.LastName,
    sb.Age,
    a.AgeDesc,
    sb.Throws,
    sb.Bats,
    tsa.TopSalary
FROM dbo.SecondBasemen sb
    CROSS APPLY (SELECT CASE WHEN Age > 30 THEN 'Old' ELSE 'Not Old' END AS AgeDesc) a
    INNER JOIN dbo.TopSalaryByAge tsa
        ON sb.Age = tsa.Age
ORDER BY sb.LastName DESC
OPTION(FORCE EXTERNALPUSHDOWN);
You might want to force external pushdown when you don’t have good external table stats but do have some extra knowledge the optimizer doesn’t have regarding the data set. For example, you might know that you’re asking for a tiny fraction of the total data set to get returned but the optimizer may not be aware of this, and so it pulls the entire data set over to SQL Server for the operation.
Be wary of using this operation, however; it will spin up a MapReduce job, and that can take 15-30 seconds to start up. If you can get your entire data set over the wire in 10 seconds, there’s little value in shunting the work off to your Hadoop cluster.
Disabling External Pushdown
The opposite of FORCE EXTERNALPUSHDOWN is DISABLE EXTERNALPUSHDOWN. On Azure Blob Storage and other external sources, this is a no-op. This is also a no-op if you have a Hadoop cluster but no resource manager set up.
With everything set up, it will ensure that you do not push down to your Hadoop cluster. This might be a good thing if you consistently see faster times by just pulling the entire data set over the wire and performing any filters or operations in SQL Server. Note that even with this option set, performance is still going to be a good bit better than using Hive over a linked server.
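Using the earlier query as a template, disabling pushdown is just a matter of swapping the hint:

-- Same query as before, but this time we keep all of the work in SQL Server,
-- streaming the full external data set over the wire.
SELECT TOP(50)
    sb.FirstName,
    sb.LastName,
    sb.Age,
    a.AgeDesc,
    sb.Throws,
    sb.Bats,
    tsa.TopSalary
FROM dbo.SecondBasemen sb
    CROSS APPLY (SELECT CASE WHEN Age > 30 THEN 'Old' ELSE 'Not Old' END AS AgeDesc) a
    INNER JOIN dbo.TopSalaryByAge tsa
        ON sb.Age = tsa.Age
ORDER BY sb.LastName DESC
OPTION(DISABLE EXTERNALPUSHDOWN);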
Let’s say that you want to run OPTION(FORCE EXTERNALPUSHDOWN) and you get an error as soon as the command starts:
Msg 7320, Level 16, State 110, Line 1
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “(null)”. Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints.
If you’re trying to perform this operation against a Hadoop cluster, you’ll want to re-read the configuration settings blog post. There are several files which need to change and if you don’t get all of them, you’ll end up with strange-looking error messages.