In this week’s edition of PolyBase Revealed, here’s something I found surprising about predicate pushdown.
Why Pushdown Matters
Before I start, let’s talk about predicate pushdown for a moment. The gist of it is that when you have data in two sources, you have two options for combining the data:
- Bring the data in its entirety from your remote source to your local target and work as though everything were in the local target to begin with. I’ll call this the streaming approach.
- Send as many of your query’s filters, projections, and pre-conditions as you can to the remote source, have the remote source perform some of the work, and then have it send its post-operative data to the local target. The local target once more treats this as though it were simply local data. This is the pushdown approach, because you push those predicates (that is, filters, projections, and pre-conditions) down to the remote source.
Option 1 is easy to do but struggles as your data sizes get larger. You spend time streaming that data from the remote source to your target, and that can add up. Suppose you have 2 terabytes of data in HDFS. Over a 10 gigabit network, you can get at most 1.25 GB/sec, assuming you swamp your network and nothing else is running. That’s pretty unrealistic, so let’s cut it roughly in half and call it 650 MB/sec. It would take a bit over 3,200 seconds, about 54 minutes, just to move all of that data over (2 TB * 1024 GB/TB * 1024 MB/GB / 650 MB/sec ≈ 3,226 seconds). That’s before you do any filtering of the data.
Option 2 can be more efficient for giant data sets but tends to be much less efficient with small data sets, especially for a technology like Hadoop’s MapReduce, where there is a fixed cost just to start up a job. My rule of thumb is if you can stream the data and get an answer within 30 seconds, a MapReduce job won’t get an answer to you faster.
Now that we’ve talked about the concepts, we’re going to look at a tiny omission in the documentation that means a lot.
What Can You Push Down?
Check out Microsoft’s documentation on pushdown computations in PolyBase. That documentation is, as far as I can tell, correct. I just want to call out one piece in particular:
SQL Server allows the following basic expressions and operators for predicate pushdown.
* Binary comparison operators ( <, >, =, !=, <>, >=, <= ) for numeric, date, and time values.
* Arithmetic operators ( +, -, *, /, % ).
* Logical operators (AND, OR).
* Unary operators (NOT, IS NULL, IS NOT NULL).
That’s right: there’s no mention of strings. Because you can’t use strings in a pushdown operation.
Here’s an example which I use in the book (sneak preview alert!). I have a copy of New York City parking ticket data for a multi-year period (e.g., FY 2017). I decided to look for violators with Ohio tags and vehicles with model years between 2005 and 2015. As you’d expect, there aren’t that many instances of this: a couple hundred out of 33 million. This is a perfect case for predicate pushdown: I’d much rather have my Hadoop cluster pick up the slack, turn 33 million rows into a dozen (because I’m grouping by year), and take some of the pressure off of SQL Server. But because my most selective predicate is on vehicle state, a string column, I can’t push it down. I can push down the model year predicate if and only if I define the model year column as a numeric data type like INT. That means instead of 33 million rows, I push 7.8 million rows over. That’s much better but still less than ideal.
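To make that concrete, here’s a sketch of what the external table and query might look like. The table, column, data source, and file format names here are illustrative stand-ins, not the book’s actual schema, and I’m assuming the Hadoop data source and file format already exist:

```sql
-- Illustrative external table; names are hypothetical.
CREATE EXTERNAL TABLE dbo.ParkingTickets
(
    SummonsNumber BIGINT,
    RegistrationState NVARCHAR(5),   -- string: predicates here cannot push down
    VehicleYear INT                  -- numeric: predicates here can push down
)
WITH
(
    LOCATION = '/NYCParkingTickets/',
    DATA_SOURCE = HadoopDataSource,
    FILE_FORMAT = CsvFileFormat
);

-- Only the VehicleYear filter is eligible for pushdown; the state filter
-- runs on the SQL Server side after the 7.8 million surviving rows arrive.
SELECT
    t.VehicleYear,
    COUNT(*) AS NumberOfTickets
FROM dbo.ParkingTickets t
WHERE
    t.RegistrationState = N'OH'
    AND t.VehicleYear BETWEEN 2005 AND 2015
GROUP BY t.VehicleYear
OPTION (FORCE EXTERNALPUSHDOWN);
```

The `OPTION (FORCE EXTERNALPUSHDOWN)` hint tells PolyBase to spin up the MapReduce job rather than letting the optimizer decide; its counterpart, `OPTION (DISABLE EXTERNALPUSHDOWN)`, forces the streaming approach instead.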
But there may be cases where I don’t want to label the column as a numeric data type. For example, if I know there’s a lot of bad data that I nonetheless wish to keep, I would want to use an NVARCHAR type. As soon as I do that, however, I lose all ability to perform any kind of predicate pushdown, and my MapReduce jobs are essentially project operations. That can still be marginally useful because I’m probably projecting only a small number of the total columns in the data set, but the real gains come from letting SQL Server ignore large numbers of records.
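Here’s the same sketch with everything typed as strings so that malformed values survive the load. Again, the names are hypothetical:

```sql
-- Every column is a string now, so bad data (say, 'N/A' in the year
-- column) no longer fails rejection thresholds. Names are illustrative.
CREATE EXTERNAL TABLE dbo.ParkingTicketsRaw
(
    SummonsNumber NVARCHAR(20),
    RegistrationState NVARCHAR(5),
    VehicleYear NVARCHAR(10)   -- string: no predicate pushdown at all
)
WITH
(
    LOCATION = '/NYCParkingTickets/',
    DATA_SOURCE = HadoopDataSource,
    FILE_FORMAT = CsvFileFormat
);

-- Neither predicate can push down. The best the remote side can do is
-- project the two columns; SQL Server filters all 33 million rows locally.
SELECT
    TRY_CAST(t.VehicleYear AS INT) AS VehicleYear,
    COUNT(*) AS NumberOfTickets
FROM dbo.ParkingTicketsRaw t
WHERE
    t.RegistrationState = N'OH'
    AND TRY_CAST(t.VehicleYear AS INT) BETWEEN 2005 AND 2015
GROUP BY TRY_CAST(t.VehicleYear AS INT);
```

The `TRY_CAST` calls keep the dirty rows from blowing up the query, but they also guarantee the filtering happens on the SQL Server side.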
Call to Action
If this sounds interesting to you, vote up my UserVoice entry on supporting string columns for predicate pushdown. If this doesn’t sound interesting to you, vote it up anyhow; what’s the worst that could happen?