Last week, I looked at flat file compression formats and noted that when you I tarred a file, I lost the top row.  Today, I’m going to have a short test in which I try to tar several files and see what happens.

The Setup

I have four copies of my Second Basemen data set, each saved as a CSV.  Each copy has 777 records in it.  I have put them all into one tar archive and uploaded into HDFS as /tmp/tartest/SecondBasemen.tar.  I then created an external table like so:

CREATE EXTERNAL FILE FORMAT [CompressionTestFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemenTar]
(
    [FirstName] [varchar](50) NULL,
    [LastName] [varchar](50) NULL,
    [Age] [int] NULL,
    [Throws] [varchar](5) NULL,
    [Bats] [varchar](5) NULL
)
WITH
(
    DATA_SOURCE = [HDP2],
    LOCATION = N'/tmp/tartest/SecondBasemen.csv',
    FILE_FORMAT = [CompressionTestFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);
GO
SELECT *
FROM dbo.SecondBasemenTar sb;
GO
SELECT *
FROM dbo.SecondBasemenTar sb
OPTION (FORCE EXTERNALPUSHDOWN);

Results

The select statement returned 3104 records, exactly 4 shy of the 3108 I would have expected (777 * 4 = 3108).  In each case, the missing row was the first, meaning when I search for LastName = ‘Turgeon’ (the first player in my data set), I get zero rows.  When I search for another second basemen in the set, I get back four rows, exactly as I would have expected.

What’s really interesting is the result I get back from Wireshark when I run a query without pushdown:  it does actually return the row for Casey Turgeon.

caseyturgeon

When I run a predicate pushdown operation (forcing external pushdown), I get the same zero rows, but what comes back over the wire is a PPAX file, whose format I haven’t yet figured out how to decode (and the Polybase team probably wants that to remain the case!).

ppax

Conclusions

If you want to tar together delimited flat files, I recommend adding a header row.  There is a bit of irony here in that normally, you can’t skip header rows in Polybase.  If you’re feeling really cheeky, you might have gotten your workaround…

One thought on “More Polybase With Tar

Leave a comment