Polybase Statistics, Round One

Yesterday, we built the SecondBasemen table.  That’s a really small table, so we’d expect pretty much any query against it to run quickly.  So let’s scale things up just a little bit.

Looking At (Some) Flights

If you grabbed the flight data already, you’re in good shape.  Otherwise, at least get the 1987 data file, which is 12 MB.  I put all of the years into /tmp/airport/ and left each file compressed as .csv.bz2.

To test out our Polybase setup, I’m going to create an external table pointing just at the 1987 data, as that’s a relatively small data set:  just 1.3 million records.  So let’s get started and create a table:

CREATE EXTERNAL TABLE [dbo].[Flights]
(
	[year] int NULL,
	[month] int NULL,
	[dayofmonth] int NULL,
	[dayofweek] int NULL,
	deptime VARCHAR(100) NULL,
	crsdeptime VARCHAR(100) NULL,
	arrtime VARCHAR(100) NULL,
	crsarrtime VARCHAR(100) NULL,
	uniquecarrier VARCHAR(100) NULL,
	flightnum VARCHAR(100) NULL,
	tailnum VARCHAR(100) NULL,
	actualelapsedtime VARCHAR(100) NULL,
	crselapsedtime VARCHAR(100) NULL,
	airtime VARCHAR(100) NULL,
	arrdelay VARCHAR(100) NULL,
	depdelay VARCHAR(100) NULL,
	origin VARCHAR(100) NULL,
	dest VARCHAR(100) NULL,
	distance VARCHAR(100) NULL,
	taxiin VARCHAR(100) NULL,
	taxiout VARCHAR(100) NULL,
	cancelled VARCHAR(100) NULL,
	cancellationcode VARCHAR(100) NULL,
	diverted VARCHAR(100) NULL,
	carrierdelay VARCHAR(100) NULL,
	weatherdelay VARCHAR(100) NULL,
	nasdelay VARCHAR(100) NULL,
	securitydelay VARCHAR(100) NULL,
	lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
	LOCATION = N'/tmp/airport/1987.csv.bz2',
	DATA_SOURCE = HDP,
	FILE_FORMAT = CSVTextFileFormat,
	-- Up to 5000 rows can have bad values before Polybase returns an error.
	REJECT_TYPE = Value,
	REJECT_VALUE = 5000
);
GO

This data set isn’t perfect.  It was designed for R, so NULL values show up as NA instead, and that means that values which are actually numeric (such as deptime or arrtime) can cause conversion errors when the value is unknown.  As a result, for the purposes of this series, I’m going to treat most of those columns as string data types.  We’ll see what happens when you don’t…
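As a sketch of what that cleanup could eventually look like (the view name and column selection here are mine, not part of the series’ setup), you could layer a view over the external table that uses NULLIF and TRY_CONVERT to turn NA into a proper NULL:

```sql
-- Hypothetical cleanup view over the external table:  NULLIF turns the
-- R-style 'NA' marker into NULL, and TRY_CONVERT yields NULL rather than
-- an error for anything else which fails to parse as an integer.
CREATE VIEW dbo.FlightsClean AS
SELECT
	f.[year],
	f.[month],
	f.[dayofmonth],
	TRY_CONVERT(INT, NULLIF(f.deptime, 'NA')) AS deptime,
	TRY_CONVERT(INT, NULLIF(f.arrtime, 'NA')) AS arrtime,
	TRY_CONVERT(INT, NULLIF(f.arrdelay, 'NA')) AS arrdelay,
	TRY_CONVERT(INT, NULLIF(f.depdelay, 'NA')) AS depdelay,
	f.uniquecarrier,
	f.origin,
	f.dest
FROM dbo.Flights f;
```

Bear in mind that the conversion happens on the SQL Server side, after the rows come back from Hadoop, so this does nothing for pushdown; it just gives you sane data types to work with.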

On A Jetplane To Columbus

Let’s look at all of the flights that went to Columbus, Ohio’s Port Columbus International Airport, whose airport code is CMH.  To spice things up a bit, I also downloaded supplemental data for airports and put it into SQL Server.  Here’s the first query, which grabs a bunch of data:

SELECT
	*
FROM dbo.Flights f
	INNER JOIN dbo.airports ar
		ON f.dest = ar.IATA
WHERE
	f.dest = 'CMH';

This gives us back 6638 results and the execution plan looks like this:

[Execution plan image: BasicPolybaseQuery]

This is another case of us getting predicate pushdown for free:  we specified the destination as CMH and the subsequent MapReduce job included a filter on destination.  But what happens if we try to back our way into Columbus by specifying city and state on the Airports table?

SELECT
	*
FROM dbo.Flights f
	INNER JOIN dbo.airports ar
		ON f.dest = ar.IATA
WHERE
	ar.state = 'OH'
	AND ar.city = 'Columbus';

This query also returns 6638 results (there are three other airports in the Columbus area, but it appears that none of them had any flights tracked in 1987).

Here’s how the execution plan looks:

[Execution plan image: BigPolybaseQuery]

That’s exactly the opposite of what I want to see:  we’re bringing in 1.3 million rows (that is, the whole data set) and then hash matching it with our airports data set.  What I want to do is improve this query without specifying any of our Flights fields in the WHERE clause.

Let’s Index!

The first thing I want to do is put an index on dbo.Airports.  The IATA column is a valid primary key, so I just have to clean it up a little bit (by making the column non-nullable) and add a clustered primary key constraint:

ALTER TABLE dbo.Airports ALTER COLUMN IATA VARCHAR(4) NOT NULL;
ALTER TABLE dbo.Airports ADD CONSTRAINT [PK_Airports] PRIMARY KEY CLUSTERED(IATA);

This changed the table scan into a clustered index scan, but otherwise did nothing to help us.  I don’t mind making the change, but it isn’t going to save our queries.

So how about we throw a non-clustered index into the mix?

CREATE NONCLUSTERED INDEX [IX_Airports_State_City] ON dbo.Airports
(
	State,
	City
);

How’d that work for us?

[Execution plan image: PolybaseIndexSeek]

So…nope.

Let’s Rewrite!

Maybe if we coerce the optimizer into realizing that it should figure out the values for destination airport before hitting Polybase, we’ll get a nicer query plan.  One way to try to force this is to use a common table expression:

WITH ohioAirports AS
(
	SELECT
		ar.*
	FROM dbo.airports ar
	WHERE
		ar.state = 'OH'
		AND ar.city = 'Columbus'
)
SELECT
	*
FROM dbo.Flights f
	INNER JOIN ohioAirports ar
		ON f.dest = ar.IATA;

The end result?  No change in the execution plan.

So what about if I use the APPLY operator?

SELECT
	*
FROM dbo.airports ar
	CROSS APPLY
	(
		SELECT *
		FROM dbo.Flights f
		WHERE
			f.dest = ar.IATA
	) f
WHERE
	ar.state = 'OH'
	AND ar.city = 'Columbus';

Thus far, each of these queries has taken about 90 seconds to run on my laptop. This one was no different, and its execution plan was the same. So it seems that there aren’t any SQL-side changes we can make to get this going faster.
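There is one more SQL-side trick worth sketching before moving on, though it’s really a workaround rather than a fix:  look up the airport code first, then build the Flights query dynamically so the code appears as a literal, mimicking the first query which did get pushdown.  This sketch assumes a single matching airport; with several, you’d build an IN list instead.

```sql
DECLARE
	@dest VARCHAR(100),
	@sql NVARCHAR(MAX);

-- Grab the airport code first.  Only one Columbus airport has 1987
-- flights, so a single value suffices for this sketch.
SELECT TOP(1)
	@dest = ar.IATA
FROM dbo.airports ar
WHERE
	ar.state = 'OH'
	AND ar.city = 'Columbus';

-- Embed the code as a literal so the filter looks just like the
-- hard-coded version of the query.
SET @sql = N'SELECT *
FROM dbo.Flights f
	INNER JOIN dbo.airports ar
		ON f.dest = ar.IATA
WHERE
	f.dest = ' + QUOTENAME(@dest, '''') + N';';

EXEC sp_executesql @sql;
```

I haven’t timed this variant, but if the literal-based plan holds, it should behave like the first query rather than the 1.3-million-row pull.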

Statistics

Polybase offers the ability to create statistics on tables, the same way that you would on normal tables.  There are a few rules about statistics:

  1. Stats are not auto-created.  You need to create all statistics manually.
  2. Stats are not auto-updated.  You will need to update all statistics manually, and currently, the only way you can do that is to drop and re-create the stats.
  3. When you create statistics, SQL Server pulls the data into a temp table, so if you have a billion-row table, you’d better have the tempdb space to pull that off.  To mitigate this, you can run stats on a sample of the data.
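Rules 2 and 3 in sketch form (the statistic names here are illustrative, and the DROP assumes a statistic named Flights_Dest already exists):

```sql
-- Rule 2:  there is no UPDATE STATISTICS for external tables, so an
-- "update" is a drop followed by a re-create.
DROP STATISTICS dbo.Flights.Flights_Dest;
CREATE STATISTICS Flights_Dest ON dbo.Flights (Dest) WITH FULLSCAN;

-- Rule 3:  on a huge table, sample rather than scanning everything,
-- sparing tempdb the full copy.
CREATE STATISTICS Flights_Origin ON dbo.Flights (Origin) WITH SAMPLE 25 PERCENT;
```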

With all this in mind, we’ll create a statistic on the Dest column using a full scan of the data (because it’s just 1.3 million rows):

CREATE STATISTICS Flights_Dest ON dbo.Flights (Dest) WITH FULLSCAN;

What does running this do?  Well…nothing.  The query plan stays the same.

But what happens if I re-run the first query, the one which does filter on f.dest?

[Execution plan image: PolybaseBasicQueryWithStats]

We see that the estimated number of rows has changed, but because Columbus sees far less traffic than an airport like LAX or ORD, the estimate isn’t that great.  It doesn’t hurt our query, but it didn’t help much either.
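If you’re curious where that estimate comes from, the statistic should be inspectable the same way as one on a regular table (I’m assuming here that DBCC SHOW_STATISTICS behaves normally against external table stats):

```sql
-- Look for the CMH step in the histogram; its EQ_ROWS value drives
-- the row estimate for the f.dest = 'CMH' predicate.
DBCC SHOW_STATISTICS ('dbo.Flights', 'Flights_Dest') WITH HISTOGRAM;
```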

So Seriously, How Do We Improve That Query?

We’ve tried several tactics to improve this query’s performance, but it’s not going well so far.

What About A Temp Table?

Would it help to put the airports into a temp table and join that way?  Maybe then we’d tell the optimizer that yeah, we really want to get a list of airports and then see if Polybase pushes the predicate down successfully.

SELECT *
INTO #Airports
FROM dbo.Airports ar
WHERE
	ar.State = 'OH'
	AND ar.City = 'Columbus';
ALTER TABLE #Airports ADD PRIMARY KEY CLUSTERED(IATA);

SELECT
	*
FROM dbo.Flights f
	INNER JOIN #Airports ar
		ON f.dest = ar.IATA;

The execution plan was…a bit of a regression.

[Execution plan image: PolybaseTempTable]

That’s definitely a step back.  And it leads me to wonder, what if I replace the hard-coded CMH with a variable?

DECLARE
	@dest VARCHAR(100) = 'CMH';

SELECT
	*
FROM dbo.Flights f
	INNER JOIN dbo.airports ar
		ON f.dest = ar.IATA
WHERE
	f.dest = @dest;

This execution plan makes sense, but is a little concerning:

[Execution plan image: PolybaseVariable]

We pull in all 1.3 million records and then filter after the fact in SQL Server, rather than passing the variable’s value to Hadoop.  I would have preferred to see the optimizer send the value along.
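One option worth testing (I haven’t verified whether it changes the plan in the Polybase case) is OPTION (RECOMPILE), which lets the optimizer compile with the variable’s actual value, much like the hard-coded literal:

```sql
DECLARE
	@dest VARCHAR(100) = 'CMH';

SELECT
	*
FROM dbo.Flights f
	INNER JOIN dbo.airports ar
		ON f.dest = ar.IATA
WHERE
	f.dest = @dest
OPTION (RECOMPILE);  -- sniff @dest's value at compile time
```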

Conclusion (For Now)

Let’s sum up what we’ve learned so far:

  1. Polybase can perform predicate pushdown on hard-coded values.
  2. When you cannot put the pushdown values directly into the WHERE clause—like in the scenario in which we filter on the Airports table—the optimizer struggles to realize that its best bet is to figure out that there are just a few records, pass those records in as part of the filter, and let Hadoop handle the filter.
  3. It looks like using a variable leads to bringing back all of the data and filtering later.  This might be due to the estimated number of rows being equal to 21 and so the optimizer figuring that it’d be cheaper just to bring 21 rows in and filter locally.

In the next post, I’m going to dig into this a little further and see if we can’t find some answers.  If you have any ideas you’d like me to try out, let me know and I’ll give it a try.

