Running Basic Polybase Queries On Hadoop

Following the Polybase series, we have created an external table and we can now use it like any other table.  For example:

SELECT TOP(50)
	sb.FirstName,
	sb.LastName,
	sb.Age,
	a.AgeDesc,
	sb.Throws,
	sb.Bats,
	tsa.TopSalary
FROM dbo.SecondBasemen sb
	CROSS APPLY (SELECT CASE WHEN Age > 30 THEN 'Old' ELSE 'Not Old' END AS AgeDesc) a
	INNER JOIN dbo.TopSalaryByAge tsa
		ON sb.Age = tsa.Age
ORDER BY
	sb.LastName DESC;

In this query, dbo.SecondBasemen is an external table and dbo.TopSalaryByAge is a regular table in SQL Server.  So far, I haven't found any T-SQL syntax which fails against external tables; although I haven't tried everything, I have tried a few wacky ideas.
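For reference, here is a hedged sketch of what the external table setup might look like.  The data source name, file format name, column types, and Hadoop connection string are my assumptions for illustration, not the actual DDL from earlier in the series; only the file path and column names come from this post.

```sql
-- Illustrative setup; names, types, and the host are assumptions.
CREATE EXTERNAL DATA SOURCE HDP WITH
(
	TYPE = HADOOP,
	LOCATION = 'hdfs://sandbox.hortonworks.com:8020'  -- hypothetical name node address
);

CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

CREATE EXTERNAL TABLE dbo.SecondBasemen
(
	FirstName NVARCHAR(50),
	LastName NVARCHAR(50),
	Age INT,
	Throws NVARCHAR(5),
	Bats NVARCHAR(5)
)
WITH
(
	LOCATION = '/tmp/secondbasemen.csv',  -- the file we see requested in the packet capture
	DATA_SOURCE = HDP,
	FILE_FORMAT = CsvFileFormat
);
```

Once the external table exists, it behaves like any other table in the FROM clause, as the query above shows.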

With that in mind, let’s dig a little deeper into what’s actually happening when we kick off basic queries.

The Basic Query

For the purposes of discussion, I'm going to label as "basic" any query which does not generate a MapReduce job.  In that case, all of the data is streamed over the wire to our SQL Server instance and stored in a temp table (or multiple temp tables if you have a scale-out Polybase cluster).  The database engine then continues along its merry way, performing any joins or other T-SQL-specific operations.  Let's see this in action by running the query from the intro section.
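If you want to steer whether a query stays basic, Polybase supports query hints to disable or force external pushdown.  A quick sketch (the column list is illustrative; forcing pushdown also assumes your external data source has a resource manager location configured):

```sql
-- Guarantee a streaming (non-MapReduce) query:
SELECT sb.FirstName, sb.LastName
FROM dbo.SecondBasemen sb
OPTION (DISABLE EXTERNALPUSHDOWN);

-- Conversely, ask the engine to push the predicate down to Hadoop as a MapReduce job:
SELECT sb.FirstName, sb.LastName
FROM dbo.SecondBasemen sb
WHERE sb.Age > 30
OPTION (FORCE EXTERNALPUSHDOWN);
```

For the rest of this post, we are in the first scenario: no MapReduce job, just streaming.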

We’re going to look at a couple of sources to see what’s happening with our request:  a Wireshark packet capture and DMV outputs.

Step 1:  Communicating With Hadoop

First, let’s check out the Wireshark packet capture:

[Screenshot: Wireshark packet capture for the basic query]

We have three TCP streams running here:  two from my port 49423 to port 8020 on the Hadoop node (the name node's RPC port), and one from my port 49439 to port 50010 (the data node's transfer port).  The first TCP stream connects to the name node and asks for details regarding a file located at /tmp/secondbasemen.csv.

[Screenshot: file info request for /tmp/secondbasemen.csv]

The name node responds with some basic information which confirms that the file actually exists (and, incidentally, that the name node is up and running).  Then, the Polybase engine requests information on who owns copies of that file, specifically by calling the getBlockLocations method:

[Screenshot: getBlockLocations request and response]

The name node returns information on which machines in the Hadoop cluster own which blocks (with block names starting with BP).  The Polybase engine also requests the server defaults so it knows about any particular client protocol oddities, and the name node returns those as well.

The final stream is where the Polybase engine connects directly to the data node on port 50010 and requests the block contents.

[Screenshot: TCP stream between the Polybase engine and the data node]

In this case, it runs a non-MapReduce job, meaning it just wants the data node to stream its block to the Polybase engine.  The data node complies, sending our data in plaintext to our SQL Server instance.

Step 2:  Putting It Together

Once the Polybase engine has retrieved the data, it feeds that data into a temp table and continues the operation.  We can see this clearly from a pair of DMVs:

[Screenshot: DMV results showing the external work and distributed request steps]

The sys.dm_exec_external_work DMV tells us which execution we care about; in this case, I ended up running the same query twice, so I looked at the first run.  From there, I can get step information from sys.dm_exec_distributed_request_steps.  This shows that the engine created a table in tempdb called TEMP_ID_14 and streamed results into it.  The engine also created some statistics (though I'm not quite sure where it got the 24 rows from) and then performed a round-robin query:  each Polybase compute node queries its temp table and streams the data back to the head node.  Even though our current setup has only one compute node, the operation would be the same with a dozen Polybase compute nodes.
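A hedged sketch of the sort of DMV queries behind that screenshot (the execution_id literal is a placeholder; substitute a value from the first result set):

```sql
-- Find recent external work items: which execution read which file, and how much.
SELECT TOP(10)
	w.execution_id,
	w.input_name,
	w.read_location,
	w.bytes_processed,
	w.start_time
FROM sys.dm_exec_external_work w
ORDER BY w.start_time DESC;

-- Then pull the distributed steps for the execution you care about.
SELECT
	s.step_index,
	s.operation_type,
	s.location_type,
	s.row_count,
	s.command
FROM sys.dm_exec_distributed_request_steps s
WHERE s.execution_id = 'QID1234'  -- placeholder: use an execution_id from the query above
ORDER BY s.step_index;
```

The command column in the steps DMV is where you can see the temp table creation and the round-robin retrieval statements spelled out.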

From there, we’ve satisfied the Remote Query operation in the execution plan, and the rest is just like any other SQL Server query:

[Screenshot: execution plan for the non-MapReduce query]

Note that there are some interesting Remote Query elements in the execution plan, and I plan to look at them in more detail later in the series.  For now, though, this should suffice.

Conclusion

Today’s post covers how to figure out what’s going on with non-MapReduce queries in Polybase.  I originally intended to cover MapReduce queries as well, but there’s a lot more going on there, so they deserve their own post, which will appear tomorrow.  Stay tuned!
