This week’s PolyBase Revealed post is all about a couple of academic papers that Microsoft Research published a few years back.
In their first paper, entitled Split Query Processing in Polybase (note the pre-marketing capitalization), David DeWitt et al. take us through some of the challenges and the competitive landscape around PolyBase circa 2013. One of my favorite pieces of this paper is the discussion of where, specifically, the C#- and C++-heavy SQL Server infrastructure differs from the Java underpinning Hadoop, covering things such as null checking, implicit conversions, and join mechanics.

One thing they didn't point out in the paper but which is important is rounding: different languages and products round differently, such that you might get a result of $499.99 when using SQL Server's rounding methods but $500 or $500.02 when bouncing out to a MapReduce job. In cases where an estimate is good enough and being off by a hundredth of a percent is no big deal, that's fine, but this kind of seemingly non-deterministic behavior can lead to erroneous bug reports. Keeping rounding rules consistent across platforms matters for that reason, but it is also complex given how many subtly different rounding modes and numeric representations the various languages use.
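To make the rounding hazard concrete, here is a minimal Java sketch (my own illustration, not code from either paper). It shows how the same nominal value of 499.995 rounds to 500.00 under exact decimal arithmetic with half-away-from-zero rounding (the behavior you get from SQL Server's ROUND on a DECIMAL), but to 499.99 once the value has passed through a binary double, as it easily can inside a MapReduce job; picking a different rounding mode such as banker's rounding shifts the answer yet again.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingDrift {
    public static void main(String[] args) {
        // Exact decimal arithmetic: 499.995 rounded half away from zero
        // at 2 places yields 500.00 (matching SQL Server's ROUND on DECIMAL).
        BigDecimal exact = new BigDecimal("499.995")
                .setScale(2, RoundingMode.HALF_UP);
        System.out.println("Exact decimal, HALF_UP:   " + exact);      // 500.00

        // A job that carries the value as a double never sees 499.995 at
        // all -- the nearest binary double is slightly smaller -- so the
        // same "round to 2 places" produces 499.99.
        double asDouble = 499.995;
        System.out.println("What the double really is: " + new BigDecimal(asDouble));
        System.out.println("Double, %.2f formatting:   " + String.format("%.2f", asDouble)); // 499.99

        // A different rounding mode shifts results again: HALF_EVEN
        // (banker's rounding) sends exact midpoints to the nearest even digit.
        BigDecimal bankers = new BigDecimal("499.985")
                .setScale(2, RoundingMode.HALF_EVEN);
        System.out.println("Exact decimal, HALF_EVEN:  " + bankers);   // 499.98
    }
}
```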
One other thing of note in the first paper is that they were writing with Parallel Data Warehouse in mind, so you can see some of the differences from when PolyBase arrived in SQL Server 2016 Enterprise Edition: naming semantics, available data formats (no ORC or Parquet in the 2013 paper), and so on. It was also the Hadoop 1.0 era, so although there are a couple of mentions of YARN at the end, none of those improvements had hit the mainstream yet.
The second paper, from Vinitha Reddy Gankidi et al., is called Indexing HDFS Data in PDW: Splitting the data from the index and came out in 2014. The impetus for the paper is something we're all familiar with today: companies want to store and query massive amounts of data, but waiting minutes or hours for batch queries to finish just isn't in the cards. So what can we do to improve the performance of queries against data stored in HDFS? Their answer in the paper: create indexes in the relational database over the data in HDFS and query those indexes whenever possible. This works great if you need to do point lookups or pull a couple of columns from a giant dataset, and the paper explains how they keep the index in sync with the external Hadoop file system.
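As a toy illustration of the core idea (mine, not the paper's implementation; the names below are hypothetical), picture a sorted structure on the relational side that maps key values to the HDFS file and offset holding each record. A point lookup or range scan then consults the index and fetches only the blocks it names instead of launching a full scan. The hard part the paper actually solves, keeping this structure consistent as HDFS files change, is elided here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy sketch: a sorted map stands in for the relational B-tree index,
// and Location is a hypothetical record locator pointing into HDFS.
public class HdfsSideIndex {
    record Location(String hdfsFile, long offset) {}

    private final TreeMap<String, List<Location>> index = new TreeMap<>();

    // Populated as HDFS data is loaded or appended; the paper's real
    // contribution is keeping this in sync with HDFS incrementally.
    void add(String key, String hdfsFile, long offset) {
        index.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new Location(hdfsFile, offset));
    }

    // Point lookup: answer from the index, then read only the blocks
    // it names rather than scanning every file in the directory.
    List<Location> lookup(String key) {
        return index.getOrDefault(key, List.of());
    }

    // The sorted structure answers range predicates the same way.
    SortedMap<String, List<Location>> range(String lo, String hi) {
        return index.subMap(lo, hi);
    }
}
```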
What's really interesting to me is that this is a feature that never made it to Enterprise or Standard Edition: we have the ability to create statistics on external tables, but no ability to create indexes on them. If you prefer a good slide deck to reading the paper, they've got you covered.