Presentation Goals For 2017

In light of my 2016 goals, I’ve decided to make my 2017 speaking goals a little bit more ambitious and keep them public. To wit:

Speak at 20 SQL Saturdays and 10 User Groups

Last year, I broke both of these barriers, so you might wonder how this is particularly ambitious. The way I figure it, when combined with the other goals, getting out to 20 separate SQL Saturdays will be a bit harder, as there are only so many weeks (and weekends) in the conference season.

So far, I am locked into three SQL Saturdays: Nashville, Vienna, and Cleveland.

Speak at 2 Paid Conferences

Last year, my goal was to speak at one paid conference. This year, I’m pushing it up to two. This is a harder goal than the SQL Saturday-related goals, mostly because paid conferences usually mean travel during the week, and being in management—even at as low a level as I am—makes that a little harder.

Give 6 Webinars

Here’s the big one for me. I have presented two webinars to date. Webinars are an area I have neglected in the past, focusing instead on in-person presentations. Speaking in person is a bit easier than giving a webinar for a few reasons. First, you get to warm up the audience before the session, priming them for a good talk. You also get to see how well the material is going over: if they’re laughing at your jokes and looking interested, you know you’re in the groove; if they seem confused, you can slow down and repeat things.

Without those kinds of audience cues, I want to create more of a subdued webinar experience. My goal is still to be educational and entertaining (to some extent, at least!), but I have stripped out some of the humor. I haven’t figured out exactly how to fill those gaps smoothly, so that’s going to be part of my goal. After all, if I end up being terrible at webinars, I probably won’t hit my goal of 6!

Do A Full-Length, Pictures-Only Talk

I created a couple of lightning talks this year which incorporate hand-drawn pictures, stealing the idea from David Neal, who does a much better job at it than I do. Regardless, I like the format and would like to do an entire talk in that style. It’s a bit of a risk, as I really want to do everything—including demos—in hand-drawn picture format. I think it would be an interesting, one-off style.

This isn’t something that I would want to do for every talk, but as I go over the talks I have today, I know that I’m a bit short on graphics, and I know of several places in talks where a simple picture can help me explain the concept better. Maybe this talk would help me get to that point.


Inserting Into Azure Blob Storage

This is a continuation of my Polybase series.

My goal in today’s post is to insert data into Azure Blob Storage using Polybase and see what happens.  My expectation is that it will behave the same as inserting into Hadoop.

Creating A New External Table

Our first step is to create a new external table.  In this case, I want to store information on second basemen.  I already have the data locally in a table called Player.SecondBasemen, and I’d like to move it up to Azure Blob Storage into a folder called ootp.

[Image: the Player.SecondBasemen table contents]
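
For reference, here is a rough sketch of what the local Player.SecondBasemen table might look like; its exact definition isn’t shown in this post, so the types below are assumptions which simply mirror the external table we’re about to create.

-- Hypothetical local table; the types here simply mirror the external table below.
CREATE TABLE Player.SecondBasemen
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
);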

To do this, we’ll create a new external table:

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemenWASB]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'ootp/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

Inserting Data

Inserting data is pretty easy.

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

A few seconds later and we have some rows in the table:

[Image: rows inserted confirmation]

We can check to make sure that everything looks fine:

[Image: querying dbo.SecondBasemenWASB]

I get back the same 777 results.
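
If you want to run a quick sanity check of your own, a simple count works; something along these lines:

-- Quick sanity check; this should return 777.
SELECT COUNT(*) AS NumberOfRecords
FROM dbo.SecondBasemenWASB;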

Looking at Azure Management Studio, I can see that the Polybase engine created a logical folder named ootp and 8 separate files, of which 4 have data.

[Image: files created in the ootp folder]

One interesting finding is that there is also an empty block blob with the name ootp:

[Image: empty block blob named ootp]

The timestamp for it is the same as for the initial upload, so these were created at the same time.

Uploading A Larger Data Set

I decided to take advantage of my Azure Blob Storage connection and create a larger data set.  In this case, I’m going to cross join the second basemen table, creating 777 records for each of the 777 second basemen.  Those will all go into Azure Blob Storage and I’ll be able to see how the Polybase engine deals with larger data sets.  I’m running two tests, one which will generate 777 * 777 = 603,729 records, and one which will generate 777 * 777 * 777 = 469,097,433 records.  Here’s my query for the latter:

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb2
	CROSS JOIN Player.SecondBasemen sb3;
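
For reference, the 603,729-row version is the same statement with one fewer cross join, something along these lines:

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb2;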

Generating and uploading 600K records was pretty fast, but building 469 million records was a little slower.  During this time, the Polybase engine was streaming results out, meaning it didn’t appear to write everything to a local file.  The mpdwsvc service did take up about 2 GB of RAM during this time and was pushing out at about 25-30 Mbps, which is probably my Wifi connection’s limitation.  All in all, it took my laptop approximately 80 minutes to generate and push those results.

The end results were not particularly shocking:

[Image: file sizes across the three uploads]

I’ve sliced the first, second, and third uploads into separate batches to make it clearer.  In the first batch, as I mentioned above, only four of the files ended up with data in them.  In the second batch, each file had somewhere around 2MB of write activity.  For the third batch, each was about 1.5 GB.  I should note that for the second run, the spread between the smallest and largest files was approximately 17.6%, whereas it was 12.3% for the third run.

What About Fixed-Width Lines?

One additional question I have involves whether the process for loading data is round-robin on a row-by-row basis.  My conjecture is that it is not (particularly given that our first example had 4 files with zero records in them!), but I figured I’d create a new table and test.  In this case, I’m using three fixed-width data types and loading 10 million identical records.  I chose identical record values to make sure that the text length of each line was exactly the same; the reason is that we’re taking data out of SQL Server (where an int is stored in a 4-byte block) and converting that int to a string (where each digit is stored as a one-byte character).  I chose 10 million because I know that’s well above the cutoff point for data to go into each of the eight files, so if there’s special logic to handle tiny row counts, I’d get past it.

CREATE TABLE dbo.LocalFixed
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
);

INSERT INTO dbo.LocalFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT TOP (10000000)
	1,
	0,
	'abcdefghijkl'
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb1
	CROSS JOIN Player.SecondBasemen sb2;

From here, we can create another external table and insert our 10 million rows into it:

CREATE EXTERNAL TABLE [dbo].[RemoteFixed]
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'fixed/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);

INSERT INTO dbo.RemoteFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT
	lf.ID,
	lf.SomeNum,
	lf.SomeChar
FROM dbo.LocalFixed lf;


And here are the results in Blob Storage:

[Image: fixed-width files in Blob Storage]

Even when I guarantee that rows are exactly the same size, there are still minor differences:  here it’s 20.48 MB versus 20.02 MB.  This indicates to me that the Polybase file write system is not round robin on a row-by-row basis, as there is a difference of 28,193 records separating the largest from the smallest file (1,262,965 rows with data in the largest file versus 1,234,772 in the smallest file).
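
As a rough sanity check on those file sizes, here is a quick back-of-the-envelope calculation; it assumes each line is the 16-character string plus a single-byte newline and a perfectly even split across the eight files:

-- Rough estimate: "1,0,abcdefghijkl" is 16 characters plus a newline, so ~17 bytes per row,
-- with 10 million rows spread evenly across 8 files.
SELECT (10000000 * 17.0) / 8 / 1024 / 1024 AS ApproxMBPerFile;	-- roughly 20.3 MB

That lands right around the observed 20 MB per file, so the sizes themselves make sense; only the distribution of rows is uneven.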

Conclusion

Just like the Hadoop scenario, inserting into Azure Blob Storage splits the data into a set of files and uploads those files separately.  There weren’t any surprises here, so you can read the Hadoop version to get the rest of the story.  We did take the opportunity to learn a bit more about the way the Polybase file writer engine operates and can conclusively reject the idea that it evenly distributes records based either on data length or number of rows.

This wraps up my Polybase investigations of Azure Blob Storage for now.  Next up, I’m going to look at Azure SQL Data Warehouse.

Understanding Blob Storage Behavior

This is a continuation of my Polybase series.

In the last chapter of the Polybase series, I looked at creating external tables pointing to files in Azure Blob Storage.  Today, we’re going to take a deeper look at how the Polybase engine interacts with Azure Blob Storage.  Doing this is as simple as running a basic query.  The query that I will run is the following:

SELECT TOP(1000)
	f.month,
	f.arrdelay
FROM dbo.Flights2008 f
WHERE
	f.dest = N'CMH'
	AND f.dayofmonth = 14;
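
If you would like to capture the full plan XML yourself rather than reading it here, one approach is to wrap the query with SET STATISTICS XML; a quick sketch:

SET STATISTICS XML ON;

SELECT TOP(1000)
	f.month,
	f.arrdelay
FROM dbo.Flights2008 f
WHERE
	f.dest = N'CMH'
	AND f.dayofmonth = 14;

SET STATISTICS XML OFF;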

Execution Plan Details

The execution plan was pretty simple:

[Image: execution plan]

As far as the XML goes, here’s the full plan if you’re curious:

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.5" Build="13.0.1722.0" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="67.2391" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" CardinalityEstimationModelVersion="130" StatementSubTreeCost="0.032413" StatementText="SELECT TOP(1000)
 f.month,
 f.arrdelay
FROM dbo.Flights2008 f
WHERE
 f.dest = N'CMH'
 AND f.dayofmonth = 14" StatementType="SELECT" QueryHash="0x955C4C3E381C844B" QueryPlanHash="0x53368F42C59FAD79" RetrievedFromCache="true" SecurityPolicyApplied="false">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="NoParallelForPDWCompilation" CachedPlanSize="120" CompileTime="50" CompileCPU="49" CompileMemory="208">
            <Warnings>
              <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="CONVERT_IMPLICIT(nvarchar(100),[f].[dest],0)=N'CMH'" />
            </Warnings>
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="417021" EstimatedPagesCached="104255" EstimatedAvailableDegreeOfParallelism="2" />
            <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.032413">
              <OutputList>
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
              </OutputList>
              <ComputeScalar>
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[month] as [f].[month]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[arrdelay] as [f].[arrdelay]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                </DefinedValues>
                <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Remote Query" NodeId="1" Parallel="false" PhysicalOp="Remote Query" EstimatedTotalSubtreeCost="0.032413">
                  <OutputList>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="1000" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="265523" ActualCPUms="80822" ActualScans="0" ActualLogicalReads="0" ActualPhysicalReads="0" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" />
                  </RunTimeInformation>
                  <RemoteQuery RemoteSource="Polybase_ExternalComputation" RemoteQuery="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
&lt;dsql_query number_nodes=&quot;1&quot; number_distributions=&quot;8&quot; number_distributions_per_node=&quot;8&quot;&gt;
 &lt;sql&gt;ExecuteMemo explain query&lt;/sql&gt;
 &lt;dsql_operations total_cost=&quot;1&quot; total_number_operations=&quot;10&quot;&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_23&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_24&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;MULTI&quot;&gt;
 &lt;dsql_operation operation_type=&quot;PARTITION_MOVE&quot;&gt;
 &lt;operation_cost cost=&quot;1.678287936&quot; accumulative_cost=&quot;1.678287936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;AllDistributions&quot; /&gt;
 &lt;source_statement&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
 [T3_1].[month] AS [month],
 [T3_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_23] AS T3_1
 WHERE ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
 WHERE ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1&lt;/source_statement&gt;
 &lt;destination&gt;Control&lt;/destination&gt;
 &lt;destination_table&gt;[TEMP_ID_24]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ExternalRoundRobinMove&quot;&gt;
 &lt;operation_cost cost=&quot;964.621896&quot; accumulative_cost=&quot;966.300183936&quot; average_rowsize=&quot;208&quot; output_rows=&quot;45211&quot; /&gt;
 &lt;external_uri&gt;wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2&lt;/external_uri&gt;
 &lt;destination_table&gt;[TEMP_ID_23]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_23]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;STREAMING_RETURN&quot;&gt;
 &lt;operation_cost cost=&quot;1&quot; accumulative_cost=&quot;967.300183936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;Control&quot; /&gt;
 &lt;select&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1&lt;/select&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_24]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operations&gt;
&lt;/dsql_query&gt;" />
                </RelOp>
              </ComputeScalar>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

And here is the decoded remote query:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="1" total_number_operations="10">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_23</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_24</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="PARTITION_MOVE">
        <operation_cost cost="1.678287936" accumulative_cost="1.678287936" average_rowsize="104" output_rows="67.2391" />
        <location distribution="AllDistributions" />
        <source_statement>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
                       [T3_1].[month] AS [month],
                       [T3_1].[arrdelay] AS [arrdelay]
                FROM   [tempdb].[dbo].[TEMP_ID_23] AS T3_1
                WHERE  ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
        WHERE  ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1</source_statement>
        <destination>Control</destination>
        <destination_table>[TEMP_ID_24]</destination_table>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="964.621896" accumulative_cost="966.300183936" average_rowsize="208" output_rows="45211" />
        <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2</external_uri>
        <destination_table>[TEMP_ID_23]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_23]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="STREAMING_RETURN">
      <operation_cost cost="1" accumulative_cost="967.300183936" average_rowsize="104" output_rows="67.2391" />
      <location distribution="Control" />
      <select>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_24]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

A couple of interesting details.  First, this plan looks pretty similar to the Hadoop plan when I did not induce a MapReduce job.  We can see 8 distributions per node, just like with on-prem Hadoop.  Second, as expected, our external URI points to the WASBS (secure WASB) blob location.  Given that I’m using the WASBS protocol, I expect the connection which pulls this data down to use SSL/TLS.  And there’s one easy way to find out.

Wireshark Details

While running this query, I also pulled a packet capture.  After figuring out the IP address (which wasn’t hard, given that there were about 93,000 packets in my packet capture associated with it), I was able to get the following details.

Overall Picture

[Image: Wireshark packet capture showing TLSv1.2 traffic]

We can see in the image above that Polybase uses TLS 1.2 to communicate with Azure Blob Storage when you access Blob Storage using the WASBS protocol.  The connection is encrypted, so we don’t get much useful information out of it unless you set up Wireshark to deal with encrypted data.  For what we want to do, that’s not necessary.

The Download

What I do want to see is whether my statement that the Polybase engine needs to download the entire file is correct.  To do that, I can filter on the TCP stream associated with my file download and look for any packets with a length of more than 1,000 bytes.

[Image: packets filtered to lengths over 1,000 bytes]

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

Conclusion

Looking at the execution plan and packet capture, it’s pretty clear that using Polybase with Azure Blob Storage behaves similarly to Polybase with Hadoop as long as you don’t use MapReduce.

Connecting To Blob Storage With Polybase

This is a continuation of my Polybase series.

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.

Starting Point:  Files In Blob Storage

I have copied up the flight data from 2003 through 2008 to Azure Blob Storage using Azure Management Studio.  Note that I’ve put my uploads into a folder called historical; that way, when I create external tables later, I can segregate those files from other files I upload.

[Image: flight data files uploaded to the historical folder]

I also have SQL Server 2016 and the Polybase engine installed on my local machine, configured the same way I did for on-prem Hadoop.  Note that the Polybase Connectivity Configuration option 7 will work for Azure Blob Storage as well as modern versions of Hortonworks, so I don’t need to change that.
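
In case you are starting from scratch, that connectivity setting is just another sp_configure option.  Here is a sketch; note that changing this value does require restarting the SQL Server and Polybase services.

-- Sets Polybase connectivity to option 7 (modern Hortonworks distributions and Azure Blob Storage).
EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE
GO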

External Data Source

My first step is to set up an external data source for Azure Blob Storage.  If we go back to the MSDN documentation, we can see that the syntax for creating an external data source which uses Azure Blob Storage is pretty straightforward:

-- PolyBase only: Azure Storage Blob as data source   
-- (on SQL Server 2016 and Azure SQL Data Warehouse)  
CREATE EXTERNAL DATA SOURCE data_source_name  
    WITH (   
        TYPE = HADOOP,  
        LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
        [, CREDENTIAL = credential_name ]
    )  
[;]

We specify the type to be HADOOP and point to the storage account, along with a credential.  In my case, the container is called csflights and my account name is cspolybase, as you can see in the picture above.

But before I can create the data source, I need to create a master key for my database (if I don’t have one already) and create a database-scoped credential which contains my secret for Azure Blob Storage.  To get the secret, I need to go to the Azure Portal and select the “Access keys” option in settings and copy one of the keys they generate.

[Image: Access keys blade in the Azure Portal]

Once I have that, I can fill out the T-SQL statements to create my data source:

USE OOTP
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential  
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (   
    TYPE = HADOOP,  
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential
);

Something important to note here is that Azure Blob Storage does not have a resource manager location.  This makes sense, as the resource manager (YARN) exists to allow you to push code to a Hadoop cluster to perform MapReduce jobs remotely.  All we can do with Azure Blob Storage is pull the data over the wire and perform all filters and joins within SQL Server.
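
For contrast, a data source pointing at an on-prem Hadoop cluster can also specify a resource manager location, which is what allows Polybase to push MapReduce work to the cluster.  Here is a sketch of what a data source like the HDP2 one used later in this series might look like; the host name and ports are assumptions based on a Hortonworks sandbox.

-- Hypothetical on-prem Hadoop data source; ports are assumptions for a Hortonworks sandbox.
CREATE EXTERNAL DATA SOURCE HDP2
WITH
(
	TYPE = HADOOP,
	LOCATION = 'hdfs://sandbox.hortonworks.com:8020',
	RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050'
);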

External File Format

The external file format is the same as what we would use for on-premises Hadoop.  Just in case you don’t have one set up, here it is:

USE [OOTP]
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);
GO

External Tables

Now that we have the data source and file format set up, we can create some external tables.  The first table I want to create revolves around 2008 data.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

My first table just looks at flight information for 2008 so I can double-check my work.  A quick top 10 records looks good:

SELECT TOP(10) *
FROM Flights2008 f;

[Image: top 10 rows from Flights2008]

The second table I want to create covers all of the data, not just a particular year.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

We can run a check to ensure that records look okay here. In this scenario, instead of just pulling down 10 results, I want to make sure that we get data for each year.

SELECT
	fa.[year],
	COUNT(1) AS NumberOfRecords
FROM dbo.FlightsAll fa
GROUP BY
	fa.[year]
ORDER BY
	fa.[year];

Naturally, this query takes a bit longer to run, as I have to download the files, decompress them, process them, and aggregate their results locally.  During this process, the mpdwsvc service was my biggest resource user, followed by a stretch in which SQL Server ate up a healthy amount of CPU time.

[Image: mpdwsvc resource usage]

When the query finished, it gave me my expected results:

[Image: row counts by year]

This particular query took me just over 5 minutes to run, getting data for approximately 42 million rows.

Conclusion

At this point, we’ve set up a Polybase connection to Azure Blob Storage.  In the next Polybase series post, we’re going to look at execution plans, packet captures, and other tools to gain a better understanding of the process.  After that, I will put together another post on inserting data into Azure Blob Storage.

Brave New World

For the past three years, I’ve worked as a Database Engineer—in other words, as a database developer—at ChannelAdvisor.  2 1/2 years of that time was spent working in the digital marketing space.  Coming into this job, I had worked at small-scale organizations:  the smallest cabinet-level department in Ohio, followed by a relatively small subsidiary of a large insurance company.  Working at ChannelAdvisor helped me build up skills as a database developer, figuring out that things which work well with a hundred thousand rows in a table don’t necessarily work well when that table hits a billion rows.

Well, come January 1st, I will no longer be a Database Engineer.  That’s because I’m going to be the Engineering Manager of a new predictive analytics team.

Wait, Management?

Yeah, this feels a little crazy for me as well.  The me of five years ago would never have wanted to be a manager, and the reasoning would have been the same as for other technical people:  I enjoy being on the front line, doing things rather than filling out paperwork.

Since then, I would not say that my priorities have changed much:  I still want to be on the front line, using technology to solve business problems.  What I get, though, is a force multiplier:  I now have two more people who can help me accomplish great things.

Vision

Something I’ve observed during the last few years of work is that we have a tremendous amount of interesting data at the company, and we throw away even more due to space and budget constraints.  What we have not been so good at was taking full advantage of that data to help customers.  Most of our systems are designed around a single customer’s data.  Obviously, our transactional systems are keyed toward individual customers, rather than aggregating their results.  What’s interesting is that even our warehouses tend to be customer-focused rather than company-focused.

My vision on predictive analytics is to blow out our company’s disk budget.

It is also to take advantage of this data and solve problems for customers, between customers, and for the company as a whole.  We have the data, and it will be my job to put together the tools to collect (in a place which does not harm our transactional processes), process, aggregate, and analyze the data.  Without getting into specifics, I want to close the internal gap between what we could conceivably do versus what we can do in practice.

Plan of Action:  Data Engineering

In order to pull off my vision, I’ve got to build up skills on a number of fronts, all at the same time.  There are four major quadrants I need to hit; the good news is that I’m building a team to help me with two of them.  I’m going to start with the Data Engineering Crucible, in which I (hopefully) take people with complementary skills across the following three axes and build up people with strong skills across all three.

Statistics

Doing a few analytics projects has reminded me that I need to re-take some stats courses.  My last statistics course was in graduate school, and that was mostly statistics for economists, meaning lots of regressions.  I’m bringing in an employee who has a pretty strong background in this, and so I plan to lean on that person pretty heavily (and push that person to get better at the same time).

Development

My .NET skills have stagnated the last few years.  That makes sense, as I don’t write as much .NET code as before.  The good news is that by hanging around the .NET user group and working on projects both at work and for presentations, I haven’t forgotten much there.  I also want to have my other employee bring in a strong development background to help the team get better.

Aside from .NET development (F# for life!), we’ll use other languages too.  I have some experience with R and Python, and that experience is about to grow significantly.  I have a lot of experience with SQL that I will share with the team, as well as some Java/Scala experience and whatever other crazy languages we decide on.

Subject Matter Expertise

I’ve worked on digital marketing for the past 2 1/2 years, but that’s only a part of what the company does.  My job will be to work with my team to train them on DM but also learn more about the rest of the company’s data and processes.

Plan of Action:  Management

Aside from hitting the Data Engineering trifecta, it’s time to ramp up management skills.  I have some ideas, and I’ll probably share more as I do a bit more.  Right now, it involves reading some books and thinking through various plans, like how I want to run meetings or drive sprint work.  After a few months, I hope to have a post up which describes some of this and see how things work out.

Conclusion

Over the past year, I have been itching for a team lead position.  Now that I have it, I’ll be like a dog with a chew toy, though probably a little less destructive.

New Talks For 2017

2017 is just around the corner, and my intent is to revamp my presentation lineup for the new year.  Here is a quick look at how my presentation lineup looks for 2017, given the proviso that by August, the lineup will probably change.

Headline Talks

These are the talks I lead with.  They’re fun and fit a particular niche that not too many people are hitting.

  1. R for the SQL Developer.  Already finished.  I may tweak the talk a little bit, but I think at its core, it’s a good talk.
  2. Kafka for .NET Developers.  Already finished.  I’ve given this a couple of times and have an idea of how it flows.  I’m really looking forward to pushing it next year.
  3. APPLY Yourself.  Already finished.  I’ve been giving this talk for a few years now, and I’ll be giving it a few years from now too.
  4. Much Ado About Hadoop.  Need to revamp.  The old version of my talk was written in 2014 and is woefully out of date now.  I intend this to be a high-level overview of the Hadoop ecosystem, focusing on Spark and streaming more than “classic” MapReduce operations.
  5. Polybase.  Need to write.  I also need a good title for this.  I think Polybase will be a big talk for me in the second half of 2017.

Midline Talks

These are the talks that I’m still happy to give, or which I want to create but don’t think they’ll see huge play next year.

  1. Peanut Butter and Chocolate:  Integrating Hadoop and SQL Server.  Already finished.  It’s a stable talk and still quite relevant, but I gave it 14 times last year, so that limits the number of places I can submit it.
  2. Securing SQL Server.  Already finished.  It’s nice having a security talk.  I (semi-)jokingly describe my set of talks as the weird side of SQL Server, but having this talk and the APPLY operator talk keeps me somewhat grounded.
  3. Big Data, Small Data, and Everything In Between.  Already finished.  I will probably need to revamp the lists, but I’m still happy with this high-level overview of the data platform space.
  4. Rent-A-Cluster:  HDInsight In Action.  Need to write.  I wanted to have a talk specifically on HDInsight, as most of my Hadoop talks focus around using the local sandbox.
  5. Who Needs A Cluster?  AWS Aurora and Azure Data Lake In Action.  Need to write.  This feeds off of my Rent-A-Cluster talk and focuses on two Platform-as-a-Service offerings which give you parts of Hadoop without dealing with the administration effort.

Other Talks

These are still good talks (at least in my biased opinion!), but I see them as more of niche offerings that I’d submit under certain specific circumstances.

  1. Client Migration with Biml.  Already finished.  I’ve enjoyed this talk and it’s still a viable topic, as familiarity with Biml is still surprisingly low among Integration Services developers.
  2. Working Effectively with Legacy SQL.  Needs revisions.  I tried pushing this one in 2016 and other talks were more popular.  I think that’s fair, although this does fit into a mainline SQL topic:  cleaning up older code which might have been a good decision for SQL Server 2005 but doesn’t make as much sense in SQL Server 2016.  Speaking of which, I’m going to revamp the talk a bit and include some SQL Server 2016 functionality as well.
  3. Power BI Custom Visuals.  Already finished.  I gave this talk a couple of times and it was fun.  The problem is that it can’t crack the top ten talks, so it probably would be a niche topic for user groups.

Inserting Into Hadoop

See the entire Polybase series.

Today, we are going to explore inserting data into a Hadoop cluster using Polybase.  As always, I am using a Hortonworks sandbox, but as long as you are using a supported distribution, that’s not particularly relevant.

Step One:  Configuration

The first thing we need to do is enable data modification using Polybase.  I’m not quite sure why this is a configuration setting, but maybe we’ll find out as the series continues…  Fortunately, configuration is easy:  it’s just a setting in sp_configure.

USE OOTP
GO
EXEC sp_configure 'allow polybase export', 1;
GO
RECONFIGURE
GO

This does not require a server reboot, so as soon as I run the script, we’re ready to go.  Note that all of my work will be in the OOTP database, but you can use whichever database you’d like.
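
If you want to confirm that the setting took, you can check sys.configurations:

-- value_in_use should be 1 after the RECONFIGURE above.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'allow polybase export';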

Step Two:  Working On A New Table

Create A Table

The next thing I’d like to do is create a new external table to support my insertions.  I could use an existing external table, but just in case I mess things up, I want to create a new one…  In this scenario, I’m going to suppose that I want to archive information on second basemen that I already have in a Player.SecondBasemen table in SQL Server.  I’d like to use the same structure as my SQL Server table (which was itself a table whose data I retrieved from Hadoop, but never mind that):

CREATE EXTERNAL TABLE [dbo].[SecondBasemenTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenTest/',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

As a quick reminder, my data source and file format are Polybase settings I have already created.

After running the script above, I can check Ambari and see that I have a new folder:

[Image: SecondBasemenTest folder created in HDFS]

That folder is currently empty, and its owner is pdw_user, which is the account that Polybase tries to use by default.

Load Some Data

Once I have the table, I would like to run an insert statement to load data from Player.SecondBasemen into my SecondBasemenTest external table.

INSERT INTO dbo.SecondBasemenTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

This query successfully inserts 777 records into the dbo.SecondBasemenTest folder.  If I check Ambari, I can see that the Polybase engine created eight separate files, only one of which it used this time:

[Image: files created on insert]

If I click on the first file, I can see a preview of that file:

[Image: file preview, saved as CSV]

As we expected, the file is in comma-separated values format, as that’s what I defined TextFileFormat to be.
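
I don’t show the TextFileFormat definition in this post, but it is a plain delimited format; a minimal sketch of something equivalent would be:

-- Assumed definition: a comma-delimited text format, matching the file preview above.
CREATE EXTERNAL FILE FORMAT [TextFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);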

Loading More Data

Continuing on, I can insert more rows into this table.  To separate the old rows from the new rows, I added a few characters to each first name and last name:

INSERT INTO dbo.SecondBasemenTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName + 'elo',
	sb.LastName + 'elo',
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

The results are pretty interesting.  First, Polybase created eight new files rather than updating existing files:

[Image: new files created on the second insert]

As if that weren’t interesting enough, this time it decided to split the data between two files instead of pushing it all into one file.  This says to me that the insert pattern is not trivial, that there are several factors that go into how the Polybase engine decides to write data to disk.

Reading Data From Disk

The next question is, does this affect how I query this external table?

SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM dbo.SecondBasemenTest sb;

The query is the same as if I had loaded the data manually (or using some other process), and the results are what we would expect:

[Image: query results]

Something of mild interest is that it does not iterate through all of the files one-by-one; if it did that, I would have expected the first “elo” players to start at row 778.  It does read in chunks, though, instead of interleaving records from each file.

Testing Data Modification

Now that we’ve played around with inserting data, I’d like to see if I can update or delete data.  Let’s try updating my data:

UPDATE sb
SET
	FirstName = FirstName + 'reddy'
FROM dbo.SecondBasemenTest sb;

The answer is apparent in the subsequent message:

Msg 46519, Level 16, State 16, Line 70
DML Operations are not supported with external tables.

DELETE
FROM dbo.SecondBasemenTest;

End result: same error message.  In short, you can insert data but you cannot update or delete that data through Polybase.  If you need to modify data later, use other methods.  For example, if you have partitioned data out into separate files and want to delete older partitions of data, you can delete the files themselves.  Alternatively, you could migrate the rows that you want to keep into a new table and then drop and re-create your external table.
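
As a sketch of that last option, the idea is to create a second external table pointing at a new folder, insert the rows you want to keep, and then drop the old external table; the table name, folder, and filter below are hypothetical.

-- Hypothetical archive-and-swap: the table name, folder, and filter are made up.
CREATE EXTERNAL TABLE [dbo].[SecondBasemenKeep]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenKeep/',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);
GO

INSERT INTO dbo.SecondBasemenKeep
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM dbo.SecondBasemenTest sb
WHERE sb.LastName NOT LIKE '%elo';	-- keep only the rows you care about

-- Dropping the external table removes only the metadata; clean up the old folder in HDFS separately.
DROP EXTERNAL TABLE dbo.SecondBasemenTest;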

Execution Plan Differences

I was curious at this point what the execution plan would look like for an insert statement, and here it is:

[Image: execution plan for the insert]

The new operator here is a Put operator, and we can hover over it to get more details:

[Image: Put operator details]

Note the number of distributions, which is equal to number of nodes * distributions per node.  I believe the number of distributions is what controls the number of files which get created.

Inserting Into A Single File

The last thing I wanted to try was to insert into a single file rather than a folder.  Ideally, I’d like one big file to which I’d keep appending data.  It was obvious that creating an external table and pointing it to a folder wouldn’t do the trick, so how about the following?

CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemenfile.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

--Insert into new table
INSERT INTO dbo.SecondBasemenFileTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

The results were…not what I expected.

[Image: folder created with the name secondbasemenfile.csv]

Instead of creating a CSV file, it created a folder with the name secondbasemenfile.csv.  So let’s teach it a lesson.  Instead of letting it create a new folder, I would create a file called secondbasemen.csv and have it point to that file.

DROP EXTERNAL TABLE dbo.SecondBasemenFileTest;

--Moved an actual file into SecondBasemenFileTest/secondbasemen.csv
CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

--First, show that we have records in here.
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM dbo.SecondBasemenFileTest sb;

--Now try to insert:
INSERT INTO dbo.SecondBasemenFileTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

In this case, it successfully dropped the old table and created a new external table based off of my CSV.  I ran the select statement to prove that I could query the data, and did in fact get 777 results.  When I tried to run the insert statement, however:

Msg 7320, Level 16, State 102, Line 162
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “SQLNCLI11”. EXTERNAL TABLE access failed because the specified path name ‘hdfs://sandbox.hortonworks.com:8020/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv’ does not exist. Enter a valid path and try again.

What’s interesting is the error message itself is correct, but could be confusing.  Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name.  Therefore, it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Conclusion

In this post, we looked at inserting data into external tables which live on HDFS.  The only DML operation we can run is INSERT, and whenever we insert data, the Polybase engine creates new files with the inserted data.