I recently worked through an issue in which predicate pushdown would work for flat files but fail with a weird error on ORC files.

tl;dr

If you’re hitting Hive 3, make sure you’re using SQL Server 2019 CTP 2.3 (or later).

The Equipment

  • HDP 3.0.1.0-187 running standalone. This includes HDFS 3.1.1, Hive 3.1.0, and YARN 3.1.1.
  • SQL Server 2019 CTP 2.2 on a Windows VM.

The Issue

I have two copies of the same data set, one in CSV format and one in ORC format. When I run PolyBase queries against each, here's what I get:

        No Pushdown       Force ExternalPushdown
CSV     Query succeeds    Query succeeds
ORC     Query succeeds    Query fails
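
For reference, here's roughly what the tests looked like. This is a sketch rather than the exact demo code — the external table name (dbo.ParkingTicketsORC) and the filter column are hypothetical — but DISABLE EXTERNALPUSHDOWN and FORCE EXTERNALPUSHDOWN are the real query hints.

-- No pushdown: SQL Server streams rows back and filters locally.
SELECT COUNT(*)
FROM dbo.ParkingTicketsORC
WHERE RegistrationState = 'NY'
OPTION (DISABLE EXTERNALPUSHDOWN);

-- Forced pushdown: SQL Server spins up a MapReduce job on the Hadoop
-- side to apply the predicate. This is the combination that failed
-- against the ORC table.
SELECT COUNT(*)
FROM dbo.ParkingTicketsORC
WHERE RegistrationState = 'NY'
OPTION (FORCE EXTERNALPUSHDOWN);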

Three of the four scenarios worked just fine, but that last one was a doozy: SQL Server would barf out an error telling me to check the Hadoop job server. I checked the error log and saw this:

2019-02-20 19:33:29,219 INFO [main] com.microsoft.pp.converter.ToPaxBlockConverter: Creating input PPax with inblock size 1 MB, offblock size 3 MB
2019-02-20 19:33:29,236 INFO [main] com.microsoft.pp.converter.ToPaxBlockConverter: Created PPAX with tuple count limit 3811 on a schema with 3 required columns of which 0 are LOBs and 0 are large varlen columns.
2019-02-20 19:33:29,246 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2019-02-20 19:33:29,376 INFO [main] org.apache.hadoop.hive.ql.io.orc.ReaderImpl: Reading ORC rows from hdfs://clusterino:8020/PolyBaseData/NYCParkingTicketsORC/000000_0 with {include: [false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false, false, false], offset: 0, length: 439403459}
2019-02-20 19:33:29,395 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.ExceptionInInitializerError
	at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:195)
	at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
	at com.microsoft.pp.converter.ORCFileToPaxBlockConverter.<init>(ORCFileToPaxBlockConverter.java:95)
	at com.microsoft.polybase.storage.input.ORCFileToPaxBlockInputFormat$OrcFileToPaxBlockRecordReader.initialize(ORCFileToPaxBlockInputFormat.java:64)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.1.1.3.0.1.0-187
	at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
	at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
	at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
	at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
	... 12 more

2019-02-20 19:33:29,497 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2019-02-20 19:33:29,498 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2019-02-20 19:33:29,498 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

The Strategy

My initial plan was to google the specific error: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number. That turns up HIVE-15326 and HIVE-15016, but neither brought me any immediate joy.

After reaching out to James Rowland-Jones (t), we (by which I mean he) eventually figured out the issue.

While he was figuring out the issue, I found a more apropos case on Stack Overflow, and the chat discussion really helped flesh out the problem. In that case, the root cause was a user-defined function built against Hive 3.1.1 while the user's installed version of Hive was 1.2.1. My speculation was that something similar was happening with PolyBase — a mismatch between the Hive libraries PolyBase ships and the Hadoop version on the cluster — but I wasn't going to be able to take it any further on my own.

The Solution

The solution here is to upgrade those test machines to SQL Server 2019 CTP 2.3. That release included some quiet updates to PolyBase, and they resolve the issue at hand. As for SQL Server 2017 and 2016, I can't speculate on whether a fix will land there, but my guess is that most people running PolyBase in production are still on HDP 2.x and haven't moved to 3 just yet.
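
If you want a quick way to confirm which CTP you're on after the upgrade, the version string spells it out:

-- Pre-release builds include the CTP designation in the version string,
-- e.g. "Microsoft SQL Server 2019 (CTP2.3) ..."; make sure it reads
-- CTP 2.3 or later.
SELECT @@VERSION;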
