For today’s post, I want to look at how well Polybase on Hadoop handles different compression formats. I’m going to use the same compression codec (org.apache.hadoop.io.compress.DefaultCodec) for each. Each test will use a small, 777-record file set of second basemen, and I will run two queries: a simple SELECT *, and a SELECT * with forced external pushdown to ensure that I perform a MapReduce operation.
For all of my tests, I’m using the Hortonworks HDP 2.4 sandbox.
What follows is a simple script which I used for each of the file formats. It includes an external file format (which I reuse in the other examples), a table create statement, and a pair of statements to query that data. For round one, I specify a single file; for rounds two and three, I specify a folder containing a particular set of files.
USE [OOTP]
GO

CREATE EXTERNAL FILE FORMAT [CompressionTestFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
GO

CREATE EXTERNAL TABLE [dbo].[SecondBasemenCSV]
(
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [Age] [int] NULL,
    [Throws] [varchar](5) NULL,
    [Bats] [varchar](5) NULL
)
WITH
(
    DATA_SOURCE = [HDP2],
    LOCATION = N'/tmp/compressionTests/SecondBasemen.csv',
    FILE_FORMAT = [CompressionTestFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 2
);
GO

SELECT * FROM dbo.SecondBasemenCSV sb;
GO

SELECT * FROM dbo.SecondBasemenCSV sb OPTION (FORCE EXTERNALPUSHDOWN);
Round One: Single File
In this first round of testing, I’m going to test a single file in a directory. The following table includes the compression format I chose, whether I was able to perform a SELECT *, and whether I was able to perform a SELECT * with a MapReduce job.
|Format|Simple Query|MapReduce Job|
|---|---|---|
|7Zip (.7z)|Does Not Work|Does Not Work|
|Tar (.tar)|Partially Works|Partially Works|
|Tar-GZip (.tar.gz)|Partially Works|Partially Works|
|Zip (.zip)|Does Not Work|Does Not Work|
This is a very interesting set of results. First, 7Zip-archived files do not work with the default codec. I’m not particularly surprised by this result, as 7Zip support is relatively scarce across the board and it’s a niche file format (though a very efficient one).
The next failure case is tar. Tar is a weird one: Polybase missed the first row in the file but was able to collect the remaining 776 records, and the same goes for .tar.gz. I unpackaged the .tar file and the constituent SecondBasemen.csv file did in fact have all 777 records, so the problem lies in how the codec reads the archive.
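My guess (and it is a guess) about why exactly one row goes missing: tar is not a compression format at all but an archive format which prepends a 512-byte header block before each member’s data. A reader which treats the archive as a raw byte stream would see those header bytes fused onto the first data row, mangling it. The following sketch uses Python’s standard library to show the layout; the sample rows are made up, not the actual data set.

```python
import io
import tarfile

# Build a small in-memory .tar containing a CSV member (sample rows, not the real data).
csv_bytes = b"Keith,Allison,29,R,R\nOzzie,Smith,28,R,R\nRyne,Sandberg,30,R,R\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    info = tarfile.TarInfo(name="SecondBasemen.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))

raw = buf.getvalue()
# A 512-byte tar header block sits in front of the member's data, so a
# reader treating the archive as a plain byte stream sees header bytes
# (starting with the member file name) glued onto the first row.
print(raw[:20])              # header begins with b"SecondBasemen.csv..."
print(raw.index(csv_bytes))  # the actual CSV data starts at offset 512
```

That leading header block is consistent with the first row getting eaten while the remaining 776 survive intact.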
Finally, I’m surprised that Zip did not work either; I would have expected it to succeed, given the popularity of .zip files. To give you an idea of what’s going on with Zip, I did the same Zip file ingestion into Hive and got these results:
That’s because there is no native Hive InputFormat which handles Zip files. You can write your own, but Polybase couldn’t use it, so we’ll not try to go down that road today.
So at this point, my recommendation would be to use GZip and BZip2 instead of other file compression formats. This makes rounds two and three simpler.
Round Two: Two Files
For round two, I’m going to combine uncompressed files, GZip-compressed files, and BZip2-compressed files together. I already know it’s possible to combine multiple files of a single format in a folder and have them work (having done so with flight data in BZip2 format), so there’s no need to test those cases here as well.
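If you want to stage a similar mixed-compression folder yourself, here is one way to build it locally with Python’s standard library before pushing it up to HDFS. The file names and staging path are my own invention for illustration, and the rows are sample data rather than the real second basemen file.

```python
import bz2
import gzip
import pathlib

# Hypothetical local staging directory mirroring /tmp/compressionTests/ on HDFS.
staging = pathlib.Path("compressionTests")
staging.mkdir(exist_ok=True)

rows = b"Keith,Allison,29,R,R\nOzzie,Smith,28,R,R\n"

# One uncompressed file plus GZip and BZip2 copies in the same folder.
(staging / "SecondBasemen1.csv").write_bytes(rows)
(staging / "SecondBasemen2.csv.gz").write_bytes(gzip.compress(rows))
(staging / "SecondBasemen3.csv.bz2").write_bytes(bz2.compress(rows))

# Then copy the folder to the cluster, e.g.:
#   hdfs dfs -put compressionTests /tmp/compressionTests
for f in sorted(staging.iterdir()):
    print(f.name)
```

Pointing the external table’s LOCATION at the folder rather than a single file lets Polybase pick up all three files at once.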
|Format 1|Format 2|Simple Query|MapReduce Job|
|---|---|---|---|
I’m happier about these results: it appears that you can mix and match pairs of supported formats without error. But now let’s try running this test on all three types.
Round Three: Three Files
In the final round, I have only one test: a file with no compression, a file with GZip compression, and a file with BZip2 compression, all in the same directory. Not surprisingly, this worked for both the simple query and the MapReduce job.
If you do plan on using delimited files in a Polybase solution, my recommendation is to stick with GZip or BZip2 compression. Compression formats like Zip may be more popular and 7z may give you better compression ratios, but you’ll end up needing to decompress the files mid-stream. Cloudera provides some guidance on file formats: GZip is a bit less CPU-intensive (and thus faster) than BZip2, but BZip2 gets better compression ratios, and BZip2 files can be split and processed concurrently, making it easier to distribute data across data nodes and take advantage of Hadoop’s capabilities.
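You can get a feel for the ratio side of that tradeoff with the standard library codecs. This is a toy comparison on highly repetitive sample text, not a benchmark, and actual ratios and CPU costs will vary with your data.

```python
import bz2
import gzip

# Repetitive sample text stands in for a delimited data file.
data = b"Keith,Allison,29,R,R\n" * 10000

gz = gzip.compress(data)
bz = bz2.compress(data)

# Both codecs shrink repetitive text dramatically; on most data sets
# BZip2 edges out GZip on ratio at the cost of more CPU time.
print(f"raw: {len(data)}  gzip: {len(gz)}  bzip2: {len(bz)}")
```

The splittability point is separate from the ratio: a BZip2 file is a sequence of independent blocks, so Hadoop can hand different chunks of a single large file to different mappers, which it cannot do with a plain GZip stream.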
Hey, What About That Codec?
Remember that I used the DefaultCodec for everything today. If you name your files correctly, then you can safely use DefaultCodec without an issue. But if you take a .gz file and rename it to .csv, you’ll get an error like this:
Msg 8680, Level 17, State 26, Line 19
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.
This particular error happens because the engine uses the file extension to figure out how to decompress the data. I strongly advise against renaming file extensions, as it causes nothing but pain.