External File Formats

In today’s post, we’ll spend some time looking at the different external file formats available to Polybase.  My goal in this post is to cover each of the file formats and give some idea of when and why we might want to use each format.  Let’s start with the MSDN page.  On this page, we see that there are four file types which Polybase supports:  delimited text, RCFile, ORC, and Parquet.  Let’s take each in order, but first a quick digression on file formats in general.

Order In A Chaotic World

One premise of Hadoop is that your data producers don’t necessarily need to adhere to a specific structure, that it is up to the consumer to apply structure at the time of ingest. As a simple example, you might have a log file with different levels of messages, including informational messages, warnings, errors, and critical errors.  Each of these might contain different fields:  an error message will probably have an error code, whereas an informational message should not.  If you’re writing your own MapReduce job, you can write rather complex mappers which apply rule to incoming lines of code to help determine what its structure should look like.

But in our SQL Server world, we’re a little more limited.  Instead of writing mappers, we let the Polybase code write the mapper for us and just tell it whether it did a good job.  But for it to do its job, we need to do our job and explain how the base file looks.  We’re also going to need to define structure in that file, but we’ll save that for later.

Delimited Text

The simplest file format for us to understand is delimited text.

CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = DELIMITEDTEXT  
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n  ] ) ]  
    [ , DATA_COMPRESSION = {  
           'org.apache.hadoop.io.compress.GzipCodec'  
         | 'org.apache.hadoop.io.compress.DefaultCodec'  
        }  
     ]);  
  
<format_options> ::=  
{  
    FIELD_TERMINATOR = field_terminator  
    | STRING_DELIMITER = string_delimiter  
    | DATE_FORMAT = datetime_format  
    | USE_TYPE_DEFAULT = { TRUE | FALSE }  
}  

Delimited text is exactly as it sounds:  you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters).  So let’s go through the options here.  First, FORMAT_TYPE must be DELIMITEDTEXT.  From there, we have a few FORMAT_OPTIONS.  I mentioned FIELD_TERMINATOR, which is how we separate the values in a record.  We can also use STRING_DELIMITER if there are quotes or other markers around our string values.

DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file.  The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.

The last option you can specify is USE_TYPE_DEFAULT.  If you set this to TRUE, then whenever you’re scanning a file and the value is missing, Polybase will insert a pre-defined default, like 0 for a numeric column.

After defining format options, you can also specify DATA_COMPRESSION.  For delimited text, you can specify the DefaultCodec or GzipCodec.  In my next Polybase post, I’ll look at some delimited file format compression options, but this post will be long enough today without that.

Why Use This Format?

Flat files are probably the most common file format for interchange, even still today.  They’re easy to write, easy to read, and have relatively low packing overhead (as compared to something like XML or JSON).  This makes it easy to take files and dump them into HDFS for processing in Hadoop.

Why Not Use This Format?

Even with compression, flat files are less efficient than some of the other formats we’ll look at.  Other formats re-define file structure to make it easier to scan data and get the results you want faster, so one easy way of improving performance may be to use a different file format.

RCFile

The next file format is RCFile, which stands for Record Columnar File.  For more information on the RCFile format, check out Christian Prokopp’s blog post.

CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = RCFILE,  
    SERDE_METHOD = {  
        'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'  
      | 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'  
    }  
    [ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec' ]); 

With RCFile, we see two options:  SERDE_METHOD and DATA_COMPRESSION.  We can only use the default codec for data compression, so that’s not interesting.  As far as SERDE_METHOD goes, there are two.

Before I get into those, let’s talk about SerDes.  A SerDe is a Serializer/Deserializer.  It’s a way of telling the Hadoop how to read and write data.  You can write your own SerDe pretty easily, but there are a number of them freely available.  As far as RCFile goes, Polybase supports two:  LazyBinaryColumnar and Columnar.  LazyBinaryColumnar has been the default since 2013 and is probably going to be your preferred choice because it’s a little more CPU efficient.

Why Use This Format?

RCFile provides a performance improvement over a classic rowstore file format and is usually smaller in size (repeating blog post link here).

Why Not Use This Format?

ORC is way better.  If you’re going to use a Hive format, use ORC.  Speaking of which…

ORC

Optimized Row Columnar is a newer, more efficient row storage format and is strictly superior to RCFile, in that you get faster performance and smaller data sizes.  Christian Prokopp has a nice blog post on this format as well.  There is also an Apache project page for ORC if you want to dig into the details.

CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = ORC  
     [ , DATA_COMPRESSION = {  
        'org.apache.hadoop.io.compress.SnappyCodec'  
      | 'org.apache.hadoop.io.compress.DefaultCodec'      }  
    ]);  

For ORC, we only have the DATA_COMPRESSION option, where we can use a default codec or the Snappy compression codec.  Back in 2014, it looks like Snappy was faster, but Zlib has caught up and Zstandard is up and coming.  Orthogonal to that, ORC performance itself has improved in several ways.

Why Use This Format?

This is a great format for aggregating data and for using Hive.  If you plan to read data from the Hive warehouse for your Polybase queries, then you should save the data in ORC format.

Why Not Use This Format?

There are a couple of reasons not to use this format.  First, there are times when columnar data storage is slower than row storage—think of cases in which you need to access lots of columns and the columnar data is almost entirely unrelated (so little to no no intra-column compression).  Second, if you do a lot more writing than reading, ORC is one of the slower file formats you could use.

Parquet

Parquet is another Apache project which offers a columnar data storage format.  Cloudera is a major backer of the Parquet format, whereas Hortonworks tends to prefer the ORC format.

CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = PARQUET  
     [ , DATA_COMPRESSION = {  
        'org.apache.hadoop.io.compress.SnappyCodec'  
      | 'org.apache.hadoop.io.compress.GzipCodec'      }  
    ]);  

There are only two codecs available:  Snappy and GZip.  Snappy is the same as ORC’s Snappy, so it has the same pros and cons.

Why Use This Format?

Parquet works better for nested data, whereas ORC works better for flatter file formats.

Why Not Use This Format?

ORC can be more efficient in terms of size.  As far as performance goes, they seem to be pretty similar.

Avro?

 

There is one very popular format which won’t work with Polybase:  Avro.  Avro is a great format for serialization according to a schema and is available in binary or JSON encoding.  Unlike Parquet and ORC, Avro is row-based and so performs better for writes.

There is a Connect item to support Avro.  Please upvote.

Conclusion and Additional Reading

We’ve taken a look at the file formats available in Polybase.  For more details, Andrew Peterson has a good writeup regarding formats.

If you’re interested in more details on choosing file formats, Matthew Rathbone has a nice post.  I agree with his comment that if you’re new, just use delimited files.  Once you get a bit more experience under your belt, check out Stephen O’Sullivan’s StampedeCon 2015 presentation, which covers several formats and includes performance testing and Stephen’s and Silvia Oliveros’s great blog post on data formats.

Advertisements

3 thoughts on “External File Formats

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s