A while back, I had a post on configuring Polybase in which I copied yarn-site.xml values from my HDP sandbox into Polybase. That led to pushdown errors which vexed me for a while. At PASS Summit, I had a chance to talk to Murshed Zaman, author of a blog post on common Polybase errors. He walked me through the issues.
What I Had
I followed the MSDN documentation on Polybase configuration and modified my yarn-site.xml file, making the following incorrect change:
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>
</property>
This was incorrect.
What’s Wrong Here?
The short answer is that I’d get errors like the following whenever I tried to run a MapReduce job:
Log Type: stderr
Log Upload Time: Thu Oct 27 00:16:23 +0000 2016
Log Length: 88
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Hey, Wait, Where Did That Come From?
We’re using Hortonworks, so the answer comes from Ambari. Let’s walk through it step by step, starting with my YARN portal (often hosted on port 8088):
Note that I have some failed jobs. If you click on the application link, you’ll see the attempts:
On the attempt page, we can click the Logs link to see the text output of the logs. Scroll down close to the bottom and you’ll see stderr:
This is where I was stuck until PASS Summit.
So How Do We Fix This?
It turns out that the answer is simple: fix the configuration! So let’s explain how the configuration settings above are wrong, even though they work for yarn-site.xml in Ambari. Notice that we’re pointing to a bunch of /usr/hdp/current/ folders. Let’s see what that folder structure looks like.
First, here’s my Putty instance:
Let’s look at the current folder:
There are a bunch of symbolic links here. So let’s look at the versioned folder that sits alongside current under /usr/hdp:
These are actual folders instead of symlinks. It seems that with my setup, the MapReduce operator could not find an appropriate jar file through the symlinked paths, and thus we got our error.
With that said, here are my config changes. As a note, these changes are to the SQL Server Polybase configuration files, not the Hadoop cluster itself. My configuration folder is the default location: C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf.
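If you want to see where the symlinks under /usr/hdp/current actually point without clicking through folders, a short script run on a Hadoop node can do it. This is only a minimal Python sketch: the function name resolve_classpath is made up for illustration, and it assumes the classpath is the usual comma-separated list with optional /* wildcards.

```python
import os


def resolve_classpath(classpath: str) -> str:
    """Replace symlinked folders in a comma-separated classpath with real paths."""
    resolved = []
    for entry in classpath.split(","):
        entry = entry.strip()
        # Leave variables such as $HADOOP_CONF_DIR untouched.
        if not entry.startswith("/"):
            resolved.append(entry)
            continue
        # Split off a trailing wildcard so only the folder part is resolved.
        folder, suffix = entry, ""
        if entry.endswith("/*"):
            folder, suffix = entry[:-2], "/*"
        # realpath follows symlinks anywhere in the path; a path that does
        # not exist on this machine comes back unchanged (just normalized).
        resolved.append(os.path.realpath(folder) + suffix)
    return ",".join(resolved)


if __name__ == "__main__":
    # Run this on the Hadoop node itself, where /usr/hdp/current exists.
    original = (
        "$HADOOP_CONF_DIR,"
        "/usr/hdp/current/hadoop-client/*,"
        "/usr/hdp/current/hadoop-client/lib/*"
    )
    print(resolve_classpath(original))
```

The output is a starting point for the Polybase-side value, not a finished one; compare it against the folders you actually see on disk before pasting it anywhere.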
The first change I need to make is in yarn-site.xml.
<property>
  <description>CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries</description>
  <name>yarn.application.classpath</name>
  <value>/usr/hdp/18.104.22.168-227/hadoop/*,/usr/hdp/22.214.171.124-227/hadoop/lib/*,/usr/hdp/126.96.36.199-227/hadoop-hdfs/*,/usr/hdp/188.8.131.52-227/hadoop-hdfs/lib/*,/usr/hdp/184.108.40.206-227/hadoop-yarn/*,/usr/hdp/220.127.116.11-227/hadoop-yarn/lib/*,/usr/hdp/18.104.22.168-227/hadoop-mapreduce/*,/usr/hdp/22.214.171.124-227/hadoop-mapreduce/lib/*</value>
</property>
Note that the version of HDP that you’re using matters. I have 126.96.36.199-227 for one Hadoop cluster and 188.8.131.52-169 for another.
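To double-check the Polybase-side yarn-site.xml after editing it, something like the following could flag any classpath entries that still go through /usr/hdp/current, and report whether the hadoop-mapreduce folders are present (the corrected value includes them, while my original one did not). This is a rough Python sketch, not an official tool: the function names are hypothetical, and HDP_VERSION in the sample is a placeholder for whatever version folder your cluster has.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text: str, name: str):
    """Return the <value> of the named Hadoop <property>, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def symlinked_entries(classpath: str):
    """List classpath entries that go through the /usr/hdp/current folder."""
    entries = [e.strip() for e in classpath.split(",")]
    return [e for e in entries if e.startswith("/usr/hdp/current/")]


def mentions_mapreduce(classpath: str) -> bool:
    """True if at least one entry points at a hadoop-mapreduce folder."""
    return any("hadoop-mapreduce" in e for e in classpath.split(","))


if __name__ == "__main__":
    # Sample content only; in practice, read the yarn-site.xml file from
    # the Polybase conf folder and pass its text in instead.
    sample = """
    <configuration>
      <property>
        <name>yarn.application.classpath</name>
        <value>/usr/hdp/HDP_VERSION/hadoop/*,/usr/hdp/current/hadoop-yarn-client/*</value>
      </property>
    </configuration>
    """
    classpath = read_property(sample, "yarn.application.classpath")
    print("Entries via symlinks:", symlinked_entries(classpath))
    print("Has hadoop-mapreduce:", mentions_mapreduce(classpath))
```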
This change is necessary, but not sufficient.
The other part I needed for predicate pushdown to work correctly comes from Murshed’s article (linked again for your convenience). These are MapReduce settings, which normally belong in mapred-site.xml. The properties I needed to add are as follows:
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>
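If you would rather verify these three settings than eyeball them, a small check along these lines might help. It is a hedged Python sketch: the function names are invented for illustration, the expected values are simply the ones listed above, and it assumes the properties sit in a standard Hadoop configuration file (mapred-site.xml is where MapReduce settings normally go).

```python
import xml.etree.ElementTree as ET

# Expected values, copied from the three properties listed above.
EXPECTED = {
    "yarn.app.mapreduce.am.staging-dir": "/user",
    "mapreduce.jobhistory.done-dir": "/mr-history/done",
    "mapreduce.jobhistory.intermediate-done-dir": "/mr-history/tmp",
}


def load_properties(xml_text: str) -> dict:
    """Read every <property> in a Hadoop config file into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def find_mismatches(xml_text: str, expected: dict = EXPECTED) -> dict:
    """Return {name: (expected, actual)} for missing or different values."""
    actual = load_properties(xml_text)
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


if __name__ == "__main__":
    # Sample content only; pass in the text of your own config file instead.
    sample = """
    <configuration>
      <property>
        <name>yarn.app.mapreduce.am.staging-dir</name>
        <value>/user</value>
      </property>
    </configuration>
    """
    for name, (want, got) in find_mismatches(sample).items():
        print(f"{name}: expected {want!r}, found {got!r}")
```

An actual value of None in the report means the property is missing from the file altogether.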
Getting Polybase configured can be a challenge, especially if you’re not very familiar with Hadoop administration and configuration. These settings will work for the Hortonworks Data Platform, but your mileage may vary.