A few days ago, I made mention of how the Remote Query in a Polybase execution plan contained interesting details.

A preview of interesting details
A preview of interesting details

Today we’re going to look at what we can find in the execution plan XML.  We’ll start with a basic (i.e., non-MapReduce) query and follow up with a MapReduce query and see how they change.  As a note, this is for a single-node Polybase cluster pointing to a single-node Hadoop cluster.  As I move to multi-node queries later in this series, it’ll be interesting to see how this changes, if it does at all.

Basic Queries

My basic query will be incredibly simple:  a select statement against the SecondBasemen table joined to the TopSalaryByAge table.

SELECT TOP(50)
    sb.FirstName,
    sb.LastName,
    sb.Age,
    sb.Throws,
    sb.Bats,
    tsa.TopSalary
FROM dbo.SecondBasemen sb
    INNER JOIN dbo.TopSalaryByAge tsa
        ON sb.Age = tsa.Age
ORDER BY
    sb.LastName DESC;

The execution plan XML is as follows:

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.5" Build="13.0.1722.0" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="50" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" CardinalityEstimationModelVersion="130" StatementSubTreeCost="0.100037" StatementText="SELECT TOP(50)
 sb.FirstName,
 sb.LastName,
 sb.Age,
 sb.Throws,
 sb.Bats,
 tsa.TopSalary
FROM dbo.SecondBasemen sb
 INNER JOIN dbo.TopSalaryByAge tsa
 ON sb.Age = tsa.Age
ORDER BY
 sb.LastName DESC" StatementType="SELECT" QueryHash="0xED18D81F92FC4F53" QueryPlanHash="0x77D2563391A21679" RetrievedFromCache="true" SecurityPolicyApplied="false">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="NoParallelForPDWCompilation" MemoryGrant="1024" CachedPlanSize="120" CompileTime="86" CompileCPU="83" CompileMemory="216">
            <MemoryGrantInfo SerialRequiredMemory="512" SerialDesiredMemory="560" RequiredMemory="512" DesiredMemory="560" RequestedMemory="1024" GrantWaitTime="0" GrantedMemory="1024" MaxUsedMemory="72" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="417021" EstimatedPagesCached="104255" EstimatedAvailableDegreeOfParallelism="2" />
            <RelOp AvgRowSize="83" EstimateCPU="5E-06" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="50" LogicalOp="Top" NodeId="0" Parallel="false" PhysicalOp="Top" EstimatedTotalSubtreeCost="0.100037">
              <OutputList>
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Alias="[tsa]" Column="TopSalary" />
              </OutputList>
              <RunTimeInformation>
                <RunTimeCountersPerThread Thread="0" ActualRows="50" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="6334" ActualCPUms="3083" />
              </RunTimeInformation>
              <Top RowCount="false" IsPercent="false" WithTies="false">
                <TopExpression>
                  <ScalarOperator ScalarString="(50)">
                    <Const ConstValue="(50)" />
                  </ScalarOperator>
                </TopExpression>
                <RelOp AvgRowSize="83" EstimateCPU="0.0008151" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="50" LogicalOp="Inner Join" NodeId="1" Parallel="false" PhysicalOp="Nested Loops" EstimatedTotalSubtreeCost="0.100032">
                  <OutputList>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Alias="[tsa]" Column="TopSalary" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRows="50" Batches="0" ActualEndOfScans="0" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="6334" ActualCPUms="3083" />
                  </RunTimeInformation>
                  <NestedLoops Optimized="false">
                    <OuterReferences>
                      <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                    </OuterReferences>
                    <RelOp AvgRowSize="75" EstimateCPU="0.00241422" EstimateIO="0.0112613" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="50" LogicalOp="Sort" NodeId="2" Parallel="false" PhysicalOp="Sort" EstimatedTotalSubtreeCost="0.0886755">
                      <OutputList>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                      </OutputList>
                      <MemoryFractions Input="1" Output="1" />
                      <RunTimeInformation>
                        <RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="50" Batches="0" ActualEndOfScans="0" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="6333" ActualCPUms="3083" ActualScans="0" ActualLogicalReads="0" ActualPhysicalReads="0" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" InputMemoryGrant="1024" OutputMemoryGrant="640" UsedMemoryGrant="72" />
                      </RunTimeInformation>
                      <Sort Distinct="false">
                        <OrderBy>
                          <OrderByColumn Ascending="false">
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                          </OrderByColumn>
                        </OrderBy>
                        <RelOp AvgRowSize="75" EstimateCPU="0.075" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="195" LogicalOp="Compute Scalar" NodeId="3" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.075">
                          <OutputList>
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                          </OutputList>
                          <ComputeScalar>
                            <DefinedValues>
                              <DefinedValue>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                                <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[FirstName] as [sb].[FirstName]">
                                  <Identifier>
                                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                                  </Identifier>
                                </ScalarOperator>
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                                <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[LastName] as [sb].[LastName]">
                                  <Identifier>
                                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                                  </Identifier>
                                </ScalarOperator>
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                                <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[Age] as [sb].[Age]">
                                  <Identifier>
                                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                                  </Identifier>
                                </ScalarOperator>
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                                <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[Throws] as [sb].[Throws]">
                                  <Identifier>
                                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                                  </Identifier>
                                </ScalarOperator>
                              </DefinedValue>
                              <DefinedValue>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                                <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[Bats] as [sb].[Bats]">
                                  <Identifier>
                                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                                  </Identifier>
                                </ScalarOperator>
                              </DefinedValue>
                            </DefinedValues>
                            <RelOp AvgRowSize="75" EstimateCPU="0.075" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="195" LogicalOp="Remote Query" NodeId="4" Parallel="false" PhysicalOp="Remote Query" EstimatedTotalSubtreeCost="0.075">
                              <OutputList>
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="FirstName" />
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="LastName" />
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Throws" />
                                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Bats" />
                              </OutputList>
                              <RunTimeInformation>
                                <RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="777" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="6332" ActualCPUms="3082" ActualScans="0" ActualLogicalReads="0" ActualPhysicalReads="0" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" />
                              </RunTimeInformation>
                              <RemoteQuery RemoteSource="Polybase_ExternalComputation" RemoteQuery="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
&lt;dsql_query number_nodes=&quot;1&quot; number_distributions=&quot;8&quot; number_distributions_per_node=&quot;8&quot;&gt;
 &lt;sql&gt;ExecuteMemo explain query&lt;/sql&gt;
 &lt;dsql_operations total_cost=&quot;0&quot; total_number_operations=&quot;6&quot;&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_40&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_40] ([FirstName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [LastName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [Age] INT, [Throws] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS, [Bats] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_40'&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_40] WITH ROWCOUNT = 24, PAGECOUNT = 1&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;MULTI&quot;&gt;
 &lt;dsql_operation operation_type=&quot;STREAMING_RETURN&quot;&gt;
 &lt;operation_cost cost=&quot;1&quot; accumulative_cost=&quot;1&quot; average_rowsize=&quot;114&quot; output_rows=&quot;195&quot; /&gt;
 &lt;location distribution=&quot;AllDistributions&quot; /&gt;
 &lt;select&gt;SELECT [T1_1].[FirstName] AS [FirstName],
 [T1_1].[LastName] AS [LastName],
 [T1_1].[Age] AS [Age],
 [T1_1].[Throws] AS [Throws],
 [T1_1].[Bats] AS [Bats]
FROM [tempdb].[dbo].[TEMP_ID_40] AS T1_1&lt;/select&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ExternalRoundRobinMove&quot;&gt;
 &lt;operation_cost cost=&quot;0.80028&quot; accumulative_cost=&quot;1.80028&quot; average_rowsize=&quot;114&quot; output_rows=&quot;195&quot; /&gt;
 &lt;external_uri&gt;hdfs://sandbox.hortonworks.com:8020/tmp/ootp/secondbasemen.csv&lt;/external_uri&gt;
 &lt;destination_table&gt;[TEMP_ID_40]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_40]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operations&gt;
&lt;/dsql_query&gt;" />
                            </RelOp>
                          </ComputeScalar>
                        </RelOp>
                      </Sort>
                    </RelOp>
                    <RelOp AvgRowSize="15" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="46.6667" EstimateRewinds="3.33333" EstimatedExecutionMode="Row" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="16" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.0111881" TableCardinality="22">
                      <OutputList>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Alias="[tsa]" Column="TopSalary" />
                      </OutputList>
                      <RunTimeInformation>
                        <RunTimeCountersPerThread Thread="0" ActualRows="50" ActualRowsRead="50" Batches="0" ActualEndOfScans="0" ActualExecutions="50" ActualExecutionMode="Row" ActualElapsedms="0" ActualCPUms="0" ActualScans="0" ActualLogicalReads="100" ActualPhysicalReads="1" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" />
                      </RunTimeInformation>
                      <IndexScan Ordered="true" ScanDirection="FORWARD" ForcedIndex="false" ForceSeek="false" ForceScan="false" NoExpandHint="false" Storage="RowStore">
                        <DefinedValues>
                          <DefinedValue>
                            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Alias="[tsa]" Column="TopSalary" />
                          </DefinedValue>
                        </DefinedValues>
                        <Object Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Index="[PK_TopSalaryByAge]" Alias="[tsa]" IndexKind="Clustered" Storage="RowStore" />
                        <SeekPredicates>
                          <SeekPredicateNew>
                            <SeekKeys>
                              <Prefix ScanType="EQ">
                                <RangeColumns>
                                  <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[TopSalaryByAge]" Alias="[tsa]" Column="Age" />
                                </RangeColumns>
                                <RangeExpressions>
                                  <ScalarOperator ScalarString="[OOTP].[dbo].[SecondBasemen].[Age] as [sb].[Age]">
                                    <Identifier>
                                      <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Alias="[sb]" Column="Age" />
                                    </Identifier>
                                  </ScalarOperator>
                                </RangeExpressions>
                              </Prefix>
                            </SeekKeys>
                          </SeekPredicateNew>
                        </SeekPredicates>
                      </IndexScan>
                    </RelOp>
                  </NestedLoops>
                </RelOp>
              </Top>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

Even for a simple query, I’m not going to expect you to read 174 lines of XML; I’m not a sadist, after all…

What follows is a look at significant lines and my commentary.

Line 8:
<QueryPlan DegreeOfParallelism=”0″ NonParallelPlanReason=”NoParallelForPDWCompilation” MemoryGrant=”1024″ CachedPlanSize=”120″ CompileTime=”86″ CompileCPU=”83″ CompileMemory=”216″>

For Polybase plans, our database engine operations cannot go parallel.  There’s a joke in there somewhere.

Lines 71-126 describe the Compute Scalar and Remote Query together.

computescalar

Inside the Compute Scalar XML, we have the remote query.

The Remote Query

Let’s take that encoded XML in the remote query and decode it so that we can see the innards for ourselves:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="0" total_number_operations="6">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_40</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_40] ([FirstName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [LastName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [Age] INT, [Throws] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS, [Bats] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_40'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_40] WITH ROWCOUNT = 24, PAGECOUNT = 1</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="STREAMING_RETURN">
        <operation_cost cost="1" accumulative_cost="1" average_rowsize="114" output_rows="195" />
        <location distribution="AllDistributions" />
        <select>SELECT [T1_1].[FirstName] AS [FirstName],
       [T1_1].[LastName] AS [LastName],
       [T1_1].[Age] AS [Age],
       [T1_1].[Throws] AS [Throws],
       [T1_1].[Bats] AS [Bats]
FROM   [tempdb].[dbo].[TEMP_ID_40] AS T1_1</select>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="0.80028" accumulative_cost="1.80028" average_rowsize="114" output_rows="195" />
        <external_uri>hdfs://sandbox.hortonworks.com:8020/tmp/ootp/secondbasemen.csv</external_uri>
        <destination_table>[TEMP_ID_40]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_40]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

Yay, more XML!  Again, I’m no sadist, so let’s focus on the interesting bits inside this block.

Line 2:
<dsql_query number_nodes=”1″ number_distributions=”8″ number_distributions_per_node=”8″>

We have one Polybase node in our scaleout cluster.  Azure SQL Data Warehouse has a similar output; this Grant Fritchey post shows an example with number_distributions=”60″ instead of 8.  For more information on how this number_distributions matters, check out my post from June 27th on the topic.

Line 4:
<dsql_operations total_cost=”0″ total_number_operations=”6″>

There are going to be six operations in this query plan.  They start on lines 5, 8, 14, 20, 26, and 43, respectively.

Line 5:
<dsql_operation operation_type=”RND_ID”><identifier>TEMP_ID_40</identifier></dsql_operation>

This looks like the operation which generates the name for the temp table.

Line 8 looks to be the operation which generates the create statement for the temp table.  Line 14 then builds extended properties.  Line 20 creates external stats.  Line 26 creates the SELECT statement to retrieve data from the temp table.  Line 37 performs the Round Robin retrieval from Hadoop into TMP_ID_40.  Line 43 drops the table.  Hey, wait, this looks familiar!

nonmapreduceworkresults

Yep, these are the same details you get out of sys.dm_exec_distributed_request_steps, though I’m reusing the image here so the table IDs won’t be quite the same.  Look carefully at the image and you’ll see eight steps, but the explain plan XML above only shows six items.  It appears that step index 4 does not show up in the explain plan.  Step 6 in the request steps DMV selects data from TEMP_ID_## and calls the result T1_1; this is what shows up on line 27 of the explain plan, with an operation whose type is STREAMING_RETURN.  Step 5 in the request steps DMV selects from SecondBasemen as T1_1, so I think that matches the ExternalRoundRobinMove operation in line 37 of the explain plan.

MapReduce Queries

I’m just going to add an OPTION(FORCE EXTERNALPUSHDOWN) hint to my query.  That way, we have a plan which is as close to the original plan as we can get while still doing a MapReduce job.

This execution plan is also 174 lines long, so I won’t do a straight copy-paste like I did the first plan.  The graphical shape of the plan is the same and everything aside from the remote query looks to be the same.  The remote query’s explain plan, not surprisingly is different, so here it is:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="257703.37296" total_number_operations="8">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_56</identifier>
    </dsql_operation>
    <dsql_operation operation_type="HadoopOperation">
      <operation_cost cost="257703.37296" accumulative_cost="257703.37296" average_rowsize="114" output_rows="195" />
      <Config>
  <JobTrackerName>sandbox.hortonworks.com</JobTrackerName>
  <JobTrackerPort>8050</JobTrackerPort>
  <Plan>
    <RelOp NodeId="6" LogicalOp="Materialize">
      <OutputList>
        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="FirstName" />
        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="LastName" />
        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Age" />
        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Throws" />
        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Bats" />
      </OutputList>
      <Children>
        <RelOp NodeId="1" LogicalOp="Scan">
          <OutputList>
            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="FirstName" />
            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="LastName" />
            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Age" />
            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Throws" />
            <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]" Column="Bats" />
          </OutputList>
          <Object Database="[OOTP]" Schema="[dbo]" Table="[SecondBasemen]">
            <Uri>hdfs://sandbox.hortonworks.com:8020/tmp/ootp/secondbasemen.csv</Uri>
            <FileFormat Type="DelimitedText">
<options>
                <FieldTerminator>,</FieldTerminator>
                <UseTypeDefault>True</UseTypeDefault>
                <Encoding>UTF8</Encoding>
              </Options>
            </FileFormat>
          </Object>
          <Children />
        </RelOp>
      </Children>
    </RelOp>
  </Plan>
</Config>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_57</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_57] ([FirstName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [LastName] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS, [Age] INT, [Throws] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS, [Bats] VARCHAR(5) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_57'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_57] WITH ROWCOUNT = 24, PAGECOUNT = 1</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="STREAMING_RETURN">
        <operation_cost cost="1" accumulative_cost="257704.37296" average_rowsize="114" output_rows="195" />
        <location distribution="AllDistributions" />
        <select>SELECT [T1_1].[FirstName] AS [FirstName],
       [T1_1].[LastName] AS [LastName],
       [T1_1].[Age] AS [Age],
       [T1_1].[Throws] AS [Throws],
       [T1_1].[Bats] AS [Bats]
FROM   [tempdb].[dbo].[TEMP_ID_57] AS T1_1</select>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="0.80028" accumulative_cost="257705.17324" average_rowsize="114" output_rows="195" />
        <external_uri>hdfs://sandbox.hortonworks.com:8020/TEMP_ID_56_OUTPUT_DIR</external_uri>
        <destination_table>[TEMP_ID_57]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_57]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

The first difference between the MapReduce query and the basic query is that the MapReduce query has a cost and does 8 operations, whereas the basic query had a cost value of 0 and performed 6 operations.

We also get to see a HadoopOperation, which includes JobTracker details as well as details on the file format of the second basemen file in HDFS.  What interests me here are the two LogicalOp operators on lines 14 and 23, respectively:  Materialize (the set of columns on OOTP.dbo.SecondBaseman) and Scan (the secondbaseman.csv file).

There’s another difference down in lines 81-82.  Here, instead of querying secondbaseman.csv like the basic query did, we’re reading the results of the MapReduce directory, labeled as TEMP_ID_56_OUTPUT_DIR in my example above.

Conclusion

When you combine this post with the posts on digging into basic and MapReduce queries, I think the picture starts getting clearer and clearer.  Hopefully at this point the Polybase engine is looking less like a black box and more like a “normal” data mover service with several inspection points available to us.

One thought on “Reading Polybase Execution Plan Details

Leave a comment