Last week, I looked at flat file compression formats and noted that when you I tarred a file, I lost the top row. Today, I’m going to have a short test in which I try to tar several files and see what happens.
I have four copies of my Second Basemen data set, each saved as a CSV. Each copy has 777 records in it. I have put them all into one tar archive and uploaded into HDFS as /tmp/tartest/SecondBasemen.tar. I then created an external table like so:
CREATE EXTERNAL FILE FORMAT [CompressionTestFormat] WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( FIELD_TERMINATOR = N',', USE_TYPE_DEFAULT = TRUE ), DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec' ); GO CREATE EXTERNAL TABLE [dbo].[SecondBasemenTar] ( [FirstName] [varchar](50) NULL, [LastName] [varchar](50) NULL, [Age] [int] NULL, [Throws] [varchar](5) NULL, [Bats] [varchar](5) NULL ) WITH ( DATA_SOURCE = [HDP2], LOCATION = N'/tmp/tartest/SecondBasemen.csv', FILE_FORMAT = [CompressionTestFormat], REJECT_TYPE = VALUE, REJECT_VALUE = 5 ); GO SELECT * FROM dbo.SecondBasemenTar sb; GO SELECT * FROM dbo.SecondBasemenTar sb OPTION (FORCE EXTERNALPUSHDOWN);
The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108). In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows. When I search for another second basemen in the set, I get back four rows, exactly as I would have expected.
What’s really interesting is the result I get back from Wireshark when I run a query without pushdown: it does actually return the row for Casey Turgeon.
When I run a predicate pushdown operation (forcing external pushdown), I get the same zero rows, but what comes back over the wire is a PPAX file, whose format I haven’t yet figured out how to decode (and the Polybase team probably wants that to remain the case!).
If you want to tar together delimited flat files, I recommend adding a header row. There is a bit of irony here in that normally, you can’t skip header rows in Polybase. If you’re feeling really cheeky, you might have gotten your workaround…