More Polybase With Tar

Last week, I looked at flat file compression formats and noted that when you I tarred a file, I lost the top row.  Today, I’m going to have a short test in which I try to tar several files and see what happens.

The Setup

I have four copies of my Second Basemen data set, each saved as a CSV.  Each copy has 777 records in it.  I have put them all into one tar archive and uploaded into HDFS as /tmp/tartest/SecondBasemen.tar.  I then created an external table like so:

CREATE EXTERNAL FILE FORMAT [CompressionTestFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemenTar]
(
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [Age] [int] NULL,
    [Throws] [varchar](5) NULL,
    [Bats] [varchar](5) NULL
)
WITH
(
    DATA_SOURCE = [HDP2],
    LOCATION = N'/tmp/tartest/SecondBasemen.csv',
    FILE_FORMAT = [CompressionTestFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);
GO
SELECT *
FROM dbo.SecondBasemenTar sb;
GO
SELECT *
FROM dbo.SecondBasemenTar sb
OPTION (FORCE EXTERNALPUSHDOWN);

Results

The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108).  In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows.  When I search for another second basemen in the set, I get back four rows, exactly as I would have expected.

What’s really interesting is the result I get back from Wireshark when I run a query without pushdown:  it does actually return the row for Casey Turgeon.

caseyturgeon

When I run a predicate pushdown operation (forcing external pushdown), I get the same zero rows, but what comes back over the wire is a PPAX file, whose format I haven’t yet figured out how to decode (and the Polybase team probably wants that to remain the case!).

ppax

Conclusions

If you want to tar together delimited flat files, I recommend adding a header row.  There is a bit of irony here in that normally, you can’t skip header rows in Polybase.  If you’re feeling really cheeky, you might have gotten your workaround…