So far, I’ve used two data sets in my Polybase series: a set of 777 second basemen and a set of about 1.3 million flights. Neither of those really helps us get to the value behind Polybase.
All(ish) Flights
As a result, I want to look at all flights from 1987 through 2008. To do this, I’m creating a new external table, this time pointing to a folder rather than a single file:
USE [AirportData] GO CREATE EXTERNAL TABLE [dbo].[AllFlights] ( [year] int NULL, [month] int NULL, [dayofmonth] int NULL, [dayofweek] int NULL, deptime VARCHAR(100) NULL, crsdeptime VARCHAR(100) NULL, arrtime VARCHAR(100) NULL, crsarrtime VARCHAR(100) NULL, uniquecarrier VARCHAR(100) NULL, flightnum VARCHAR(100) NULL, tailnum VARCHAR(100) NULL, actualelapsedtime VARCHAR(100) NULL, crselapsedtime VARCHAR(100) NULL, airtime VARCHAR(100) NULL, arrdelay VARCHAR(100) NULL, depdelay VARCHAR(100) NULL, origin VARCHAR(100) NULL, dest VARCHAR(100) NULL, distance VARCHAR(100) NULL, taxiin VARCHAR(100) NULL, taxiout VARCHAR(100) NULL, cancelled VARCHAR(100) NULL, cancellationcode VARCHAR(100) NULL, diverted VARCHAR(100) NULL, carrierdelay VARCHAR(100) NULL, weatherdelay VARCHAR(100) NULL, nasdelay VARCHAR(100) NULL, securitydelay VARCHAR(100) NULL, lateaircraftdelay VARCHAR(100) NULL ) WITH ( LOCATION = N'/tmp/airport/', DATA_SOURCE = HDP, FILE_FORMAT = CSVTextFileFormat, -- Up to 5000 rows can have bad values before Polybase returns an error. REJECT_TYPE = Value, REJECT_VALUE = 5000 ); GO CREATE STATISTICS AllFlights_Dest ON dbo.AllFlights (Dest, [Year]) WITH FULLSCAN;
In the process of loading this data, I learned that the files I have for 2001 and 2002 are funky and were causing Polybase errors; I wasn’t able to load a single row from them without error. Because I don’t need them for this, I took the easy way out and deleted the files. As a result, I have data from 1987 through 2000 and 2003 through 2008. This is approximately 123 million records in total, which is not bad for a “big data” example.
The Query
My query is a rather simple one: give me all flights which landed in Columbus in 2005. I’m going to run this three separate ways: once using Polybase, once using a direct Hive query, and once using Hive over a linked server. Let’s start with Polybase.
Polybase
My expectation here is that Polybase should push both parts of the predicate down to Hadoop and return only the number of flights.
SELECT f.origin, f.distance FROM dbo.AllFlights f WHERE f.year = 2005 AND f.dest = 'CMH'
When I ran this query, it took 11 minutes and 51 seconds to return 37,946 rows. Here’s the execution plan:
Nearly 12 minutes doesn’t sound fantastic, but let’s remember that this is running on a single-node sandbox hosted on my laptop. That’s hardly a fair setup for a distributed processing system. Also, I have done nothing to optimize the files; I’m using compressed, comma-separated text files, have not partitioned the data in any meaningful way, and have taken the easy way out whenever possible. This means that an optimized file structure running on a real cluster with powerful servers behind it could return the data set a lot faster…but for our purposes, that’s not very important. I’m using the same hardware in all three cases, so in that sense this is a fair comp.
Polybase execution time: 11 minutes, 51 seconds
Hive
My next thought was to run this directly against Hive. That will give me an idea of what it would take for a neutral third party application built on Hive to get the data back. Previously, I’d migrated in flight data and put it into a table called flights in my default database. As a result, the query is quite simple:
SELECT f.origin, f.distance FROM flights f WHERE f.year = 2005 AND f.dest = 'CMH';
For this one, I have three conflicting answers. First, the Hive log tells me that the query took 5 minutes and 22 seconds:
Second, when I ran the query, I kept track of how long it ran, and it did not finish returning results until about 12 minutes in.
Finally, if I look at the job tracker history, I get back 15 minutes and 10 seconds:
I’m not familiar enough yet to understand exactly why there’s a 10-minute window here, but I’m going to go with 12 minutes because that’s when I was able to look at data and everything was loaded on screen.
Hive: 12 minutes
Linked Server to Hive
My final example is a linked server with Hive. I already have a linked server set up and can write a simple four-party query, similar to the one I wrote for Hive:
SELECT f.origin, f.distance FROM [Hadoop].[HIVE].[default].[flights] f WHERE f.year = 2005 AND f.dest = 'CMH';
This query ran and ran and ran…taking 58 minutes and 54 seconds to return 37,946 records. The reason is obvious from the query plan:
This is using Hadoop exactly wrong. We needed to load all 123 million records into the table, convert all of the results into SQL types (which is what the Compute Scalar function does), and then filter out all but the 37,946 records we need.
Linked Server to Hive: 58 minutes, 54 seconds
Conclusions
This is a quick example of how powerful Polybase can be. In a simple, non-optimized case (which is the worst case scenario), Hive with Tez and Polybase performed about the same from a user’s perspective: that is, getting data from Hadoop to my screen. Both of these technologies were 5 times faster than the alternative, and with appropriate optimizations and running on stronger hardware, the differences would be even bigger.
3 thoughts on “Why Use Polybase? Looking At Bigger Data Sets”