Working With Azure SQL DW

This is a continuation of my Polybase series.

The first step to working with Polybase in Azure SQL Data Warehouse is to provision an Azure SQL Data Warehouse instance.  I’m going to follow along with the data I used in my Azure Blob Storage examples and load that data into Azure SQL Data Warehouse using Polybase.

The best way to connect to Azure SQL Data Warehouse is not SQL Server Management Studio, but rather SQL Server Data Tools.  So we’ll connect using SSDT.

In my scenario, I have an Azure SQL Data Warehouse instance named csdwdb.

connection

After putting in the correct details, I can see my instance in the SQL Server Object Explorer in Visual Studio.

connected

My plan is to do two things:  first, I want to pull all of my flight data into Azure SQL Data Warehouse, putting it into a table with a clustered columnstore index.  Second, I want to perform some processing of the data and put the results back into Azure Blob Storage to allow me to migrate data back to an on-prem SQL Server instance.

Step One:  Loading Data

External Types

The first thing I need to do is to create an external data source, external file format, and external table for my Azure Blob Storage flights folder.  Looking back at my post on Azure Blob Storage, I can repurpose most of that code for this scenario as well:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True
    )
);
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

Note that there is no USE [OOTP] statement at the top anymore. I’m running the script against my Azure SQL DW database, so I’m not going to specify a separate database here.

I want to make sure that everything’s working, so a quick select statement is in order:

SELECT TOP(100) * FROM dbo.FlightsAll;

top100

So “quick” might be an overstatement, but I did get the results I wanted and proof that we have data available.

Loading SQL Data Warehouse

My next step is to take this data and load it into Azure SQL Data Warehouse.  The smart way to create tables in Azure SQL Data Warehouse is to use the CREATE TABLE AS SELECT syntax.  I’m going to use that, as well as the advice they give on getting a fairly unique distribution key, and create my 2008 flights table as follows:

CREATE TABLE [dbo].[Flights2008DW]
WITH
(
	CLUSTERED COLUMNSTORE INDEX,
	DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.Flights2008;
GO

After a little over a minute, I get my results:

Query completed. Rows were rejected while reading from external source(s).
1 row rejected from external table [Flights2008] in plan step 3 of query execution:
Location: ‘/historical/2008.csv.bz2’ Column ordinal: 0, Expected data type: INT, Offending value: Year (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

That’s a header row, and I’m okay with it not making its way in.  As a quick aside, I should note that I picked tailnum as my distribution key.  The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values.  After loading the 2008 data, I loaded all years’ data the same way, creating a table called FlightsAllDW and selecting from dbo.FlightsAll instead of dbo.Flights2008.
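For reference, that second statement looked something like the following (a sketch rather than my exact script, but with the same distribution key):

CREATE TABLE [dbo].[FlightsAllDW]
WITH
(
	CLUSTERED COLUMNSTORE INDEX,
	DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.FlightsAll;
GO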

Querying The Data

So let’s do some quick queries and make sure that everything looks alright.  Going back to earlier this year, I had an example of a query against a larger data set.  My subset of flights is not exactly the same here as it was there—I only have flight data for the years 2003 through 2008 in Azure SQL Data Warehouse—but I figured it’d be a good start.

Here’s my slightly revised query:

SELECT
    f.origin,
    f.distance
FROM dbo.FlightsAllDW f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

And against an Azure SQL Data Warehouse instance with 200 DWUs, it took 6 seconds to load the results.

2005flights

Again, that’s not a fair comp:  I’m getting 200 DWUs for data spread across 60 distributions versus running a query on my laptop.  That said, it was pleasant seeing results come back that quickly.
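As an aside, if you want to experiment with different performance levels, scaling an Azure SQL Data Warehouse up or down is a single statement run against the logical server’s master database (DW200 here is just an example value):

-- Scale the csdwdb instance to 200 DWUs.
ALTER DATABASE csdwdb MODIFY (SERVICE_OBJECTIVE = 'DW200');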

Migrating Data Out Using Polybase

I’m going to show one last trick here:  migrating data out using CETAS, or CREATE EXTERNAL TABLE AS SELECT.

CREATE EXTERNAL TABLE [dbo].[CMHFlights]
WITH
(
	LOCATION = N'columbus/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
)
AS SELECT * FROM dbo.FlightsAllDW WHERE dest = 'CMH';
GO

This query takes data from my FlightsAllDW Azure SQL DW table and pumps it into a new external table backed by Azure Blob Storage.

cmhflights

In eight seconds, it found the 220K records that interested me and wrote them to a new external table called CMHFlights, stored in a folder named columbus.  If I look at the folder, I can see that the Polybase engine created 60 files:

cetas

Remember that when we did this ourselves using on-prem Polybase (e.g., writing to Hadoop), we ended up with the data split into 8 files.  Azure SQL Data Warehouse is special.

Conclusion

Polybase inside Azure SQL Data Warehouse isn’t that different from on-prem Polybase.  Its primary focus is to get data into and out of Azure SQL Data Warehouse for cases in which you don’t want to use Azure Data Factory.  One really cool thing you can do with Polybase in Azure SQL DW that you can’t do on-prem is CREATE EXTERNAL TABLE AS SELECT.  Otherwise, if you understand how the on-prem product works, you’ll get how it works in SQL Data Warehouse.

Partitioned Views With Polybase

This is a continuation of my Polybase series.

In today’s Polybase post, I want to see if I can replicate “SQL 2000 partitioning” using Polybase.  Back before SQL Server 2005, there was no concept of partitioned tables; the only way you could partition a table was to separate each logical subset of data into its own table and create a view which unioned those subsets together.  The reason you might want to do this is that if there were constraints on each individual table, you would be able to perform partition elimination, only querying the table(s) with potentially relevant data.
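As a quick refresher, the classic pattern looked something like the following sketch (hypothetical tables, trimmed down to a couple of columns):

-- One physical table per year, each with a CHECK constraint on its year.
CREATE TABLE dbo.Flights2006Classic
(
	[year] INT NOT NULL CONSTRAINT [CK_Flights2006Classic_Year] CHECK ([year] = 2006),
	dest VARCHAR(100) NOT NULL
);
CREATE TABLE dbo.Flights2007Classic
(
	[year] INT NOT NULL CONSTRAINT [CK_Flights2007Classic_Year] CHECK ([year] = 2007),
	dest VARCHAR(100) NOT NULL
);
GO
-- A view unions the yearly tables together.  Thanks to the CHECK constraints, the
-- optimizer can skip any table which cannot possibly satisfy the WHERE clause.
CREATE VIEW dbo.AllFlightsClassic AS
	SELECT [year], dest FROM dbo.Flights2006Classic
	UNION ALL
	SELECT [year], dest FROM dbo.Flights2007Classic;
GO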

With that in mind, let’s take a look at what we can do.

Statistics, Not Constraints

The first important thing to note is that we cannot create constraints after the fact.  Taking an external table with a NOT NULL column and trying to add a primary key:

ALTER TABLE dbo.RemoteFixed ADD CONSTRAINT [PK_RemoteFixed] PRIMARY KEY CLUSTERED (ID);

I get the following error:

Msg 46518, Level 16, State 2, Line 3
The feature ‘ALTER TABLE’ is not supported with external tables.

In fact, I can’t create a primary key constraint on an external table either:

CREATE EXTERNAL TABLE [dbo].[RemoteFixedPK]
(
    ID INT NOT NULL PRIMARY KEY CLUSTERED,
    SomeNum INT NOT NULL,
    SomeChar CHAR(12) NOT NULL
)
WITH
(
    DATA_SOURCE = [WASBFlights],
    LOCATION = N'fixed/',
    FILE_FORMAT = [CsvFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);

Msg 103010, Level 16, State 1, Line 1
Parse error at line: 3, column: 21: Incorrect syntax near ‘PRIMARY’.

I leave it as an exercise to the reader to look at check constraints; I did and got the same basic error messages when trying to modify or create a table with a check constraint.
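For the curious, the statement I tried looked more or less like this; it fails with the same “‘ALTER TABLE’ is not supported with external tables” error:

-- Attempting to add a check constraint to an existing external table fails just like the primary key attempt.
ALTER TABLE dbo.RemoteFixed ADD CONSTRAINT [CK_RemoteFixed_SomeNum] CHECK (SomeNum >= 0);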

Why This Makes Sense

At first, you might think that this is a terrible limitation and something that needs “fixed.”  I don’t think I agree with that assessment, as these are external tables.  The value in primary key and check constraints comes from the fact that the database engine controls the rows which go into a table and can validate these constraints whenever you try to insert rows.  With external tables, however, there are multiple ways for data to get into that table:  it can come directly from SQL Server insert statements, but (much) more often, it will come from some external process.  That external process likely will not have the concepts of primary key or check constraints, and so data could get in which fails these constraints.

That said, I think there’s a fair argument in rejecting rows upon data retrieval based on constraints, but that behavior would differ from “normal” tables, so I’m not sure it would be the best choice.

Are Statistics The Key?

So knowing that we cannot create check constraints but we can create statistics, will that help?  I’ve created a series of flight tables, one for each year, each following this pattern:

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2006]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2006.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);

CREATE STATISTICS [st_flights2006_year] ON [dbo].[Flights2006]([year]);
GO

For each year, I created the table, pointed it to the relevant year’s file, and created a single statistics object on the year column.  For 2006, that histogram is simple:

flights2006statistics
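As an aside, you can pull up the histogram yourself with a DBCC SHOW_STATISTICS call along these lines:

-- Display the histogram for the statistics object on the year column.
DBCC SHOW_STATISTICS ('dbo.Flights2006', 'st_flights2006_year') WITH HISTOGRAM;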

We can see that there’s data just for one year.  So when I run the following query, my hope would be that the contradiction checker would say that this is not going to return any results and turn the query into a constant scan:

SELECT TOP 10
	*
FROM dbo.Flights2006
WHERE
	year = 2008;

The answer, a couple of minutes later, was “close, but no cigar.”  As Benjamin Nevarez points out, contradiction detection works off of constraints, so just because our statistics say there are no records outside of 2006 doesn’t mean there really are no records outside of 2006, and so the contradiction detection logic won’t come into play here.

Creating A View

The next idea is to create a view, specifying our “constraints” in the WHERE clause of each segment:

CREATE VIEW dbo.AllFlights AS
SELECT * FROM dbo.Flights2006 WHERE year = 2006
UNION ALL
SELECT * FROM dbo.Flights2007 WHERE year = 2007
UNION ALL
SELECT * FROM dbo.Flights2008 WHERE year = 2008;

This does pass the obvious contradiction test:

contradiction

For a year outside of our range, we get no results back.  How about when we look at a contradiction in terms of two separate years?

contradiction2

I was a little concerned that we’d hit something like the scenario Conor Cunningham described back in 2008, but it looks like the engine was smart enough to figure this out.
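The next test is the one I really care about:  a single-year query against the view, something like the following (a sketch; my exact query may have differed slightly):

SELECT
	year,
	COUNT(1) AS NumberOfFlights
FROM dbo.AllFlights af
WHERE
	year = 2006
GROUP BY
	year;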

So now for the big question:  can we get the equivalent of partition elimination?  The answer is yes.  When I ran the query, the execution plan’s remote query made reference to only one file:  wasbs://csflights@cspolybase.blob.core.windows.net/historical/2006.csv.bz2.  And my Wireshark packet capture only found 79,116 packets of at least 1400 bytes.

flights2006packets

As we saw earlier, that is consistent with reading just a single file.  So how about we look at two years; will we see two files?

SELECT
	year,
	COUNT(1) AS NumberOfFlights
FROM dbo.AllFlights af
WHERE
	year IN (2006, 2007)
GROUP BY
	year
ORDER BY
	year;

The Wireshark capture looks a bit different now:

multiyearflightspackets

And my execution plan now includes two wasbs blocks.  Here’s the remote query XML block which shows that:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="696.59944" total_number_operations="15">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_40</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_40] ([year] INT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ExternalRoundRobinMove">
      <operation_cost cost="695.59944" accumulative_cost="695.59944" average_rowsize="4" output_rows="45715" />
      <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2006.csv.bz2</external_uri>
      <destination_table>[TEMP_ID_40]</destination_table>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_41</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_41] ([year] INT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_41'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_41] WITH ROWCOUNT = 6024, PAGECOUNT = 2</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE STATISTICS [year] ON [tempdb].[dbo].[TEMP_ID_41] ([year]) WITH STATS_STREAM = 0x010000000100000000000000000000...</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_42</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_42] ([year] INT NOT NULL, [col] BIGINT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="SHUFFLE_MOVE">
        <operation_cost cost="0.00576" accumulative_cost="695.6052" average_rowsize="12" output_rows="4" />
        <source_statement>SELECT [T1_1].[year] AS [year],
       [T1_1].[col] AS [col]
FROM   (SELECT   COUNT_BIG(CAST ((1) AS INT)) AS [col],
                 [T3_1].[year] AS [year]
        FROM     [tempdb].[dbo].[TEMP_ID_40] AS T3_1
        WHERE    ([T3_1].[year] = CAST ((2006) AS INT))
        GROUP BY [T3_1].[year]
        UNION ALL
        SELECT   COUNT_BIG(CAST ((1) AS INT)) AS [col],
                 [T3_1].[year] AS [year]
        FROM     [tempdb].[dbo].[TEMP_ID_41] AS T3_1
        WHERE    ([T3_1].[year] = CAST ((2007) AS INT))
        GROUP BY [T3_1].[year]) AS T1_1</source_statement>
        <destination_table>[TEMP_ID_42]</destination_table>
        <shuffle_columns>year;</shuffle_columns>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="733.274256" accumulative_cost="1428.879456" average_rowsize="4" output_rows="48191" />
        <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2007.csv.bz2</external_uri>
        <destination_table>[TEMP_ID_41]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_41]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_40]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="STREAMING_RETURN">
      <operation_cost cost="1" accumulative_cost="1429.879456" average_rowsize="8" output_rows="2" />
      <location distribution="AllDistributions" />
      <select>SELECT [T1_1].[year] AS [year],
       [T1_1].[col] AS [col]
FROM   (SELECT CONVERT (INT, [T2_1].[col], 0) AS [col],
               [T2_1].[year] AS [year]
        FROM   (SELECT ISNULL([T3_1].[col], CONVERT (BIGINT, 0, 0)) AS [col],
                       [T3_1].[year] AS [year]
                FROM   (SELECT   SUM([T4_1].[col]) AS [col],
                                 [T4_1].[year] AS [year]
                        FROM     [tempdb].[dbo].[TEMP_ID_42] AS T4_1
                        GROUP BY [T4_1].[year]) AS T3_1) AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_42]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

We see two wasbs external_uri references in the plan, one for the 2006 file and one for the 2007 file.  What’s interesting is that each individual file goes into its own temp table (TEMP_ID_40 and TEMP_ID_41 for 2006 and 2007, respectively), and then the Polybase engine creates a third temp table (TEMP_ID_42) and inserts the individual aggregations from 40 and 41 into this third temp table.

Conclusion:  “Stretch” Database

Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000.  This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it.  The thing to remember, though, is that if you store this in Azure Blob Storage, you will still need to pull down the full contents of every file your query touches in order to do any processing.

This leads to a concept I first heard from Ginger Grant:  pseudo-StretchDB.  Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price.  If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month.  For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine.  But if you need to query this data frequently, performance might be a killer.

Presentation Goals For 2017

In light of my 2016 goals, I’ve decided to make my 2017 speaking goals a little bit more ambitious and keep them public. To wit:

Speak at 20 SQL Saturdays and 10 User Groups

Last year, I broke both of these barriers, so you might wonder how this is particularly ambitious. The way I figure it, when combined with the other goals, getting out to 20 separate SQL Saturdays will be a bit harder, as there are only so many weeks (and weekends) in the conference season.

So far, I am locked into three SQL Saturdays: Nashville, Vienna, and Cleveland.

Speak at 2 Paid Conferences

Last year, my goal was to speak at one paid conference. This year, I’m pushing it up to two. This is a harder goal than SQL Saturday-related goals mostly because paid conferences usually mean travel during the week, and being in management—even as low a level of management as I’m in—makes that a little harder.

Give 6 Webinars

Here’s the big one for me. I have presented two webinars to date. Webinars are an area I have neglected in the past, focusing instead on in-person presentations. Speaking in person is a bit easier than giving a webinar for a few reasons. First, you get to warm up the audience before the session, priming them for a good talk. You also get to see how well the material is going over: if they’re laughing at your jokes and looking interested, you know you’re in the groove; if they seem confused, you can slow down and repeat things.

Without those kinds of audience cues, I want to create more of a subdued webinar experience. My goal is still to be educational and entertaining (to some extent, at least!), but I have stripped out some of the humor. I haven’t figured out how exactly to replace those gaps entirely smoothly, so that’s going to be part of my goal. After all, if I end up being terrible at webinars, I probably won’t hit my goal of 6!

Do A Full-Length, Pictures-Only Talk

I created a couple lightning talks this year which incorporate hand-drawn pictures, stealing the idea from David Neal, who does a much better job at it than I do. Regardless, I like the format and would like to do an entire talk in that style. It’s a bit of a risk, as I really want to do everything—including demos—in hand-drawn picture format. I think it would be an interesting, one-off style.

This isn’t something that I would want to do for every talk, but as I go over the talks I have today, I know that I’m a bit short on graphics, and I know of several places in talks where a simple picture can help me explain the concept better. Maybe this talk would help me get to that point.

Inserting Into Azure Blob Storage

This is a continuation of my Polybase series.

My goal in today’s post is to insert data into Azure Blob Storage using Polybase and see what happens.  My expectation is that it will behave the same as inserting into Hadoop.

Creating A New External Table

Our first step is to create a new external table.  In this case, I want to store information on second basemen.  I already have the data locally in a table called Player.SecondBasemen, and I’d like to move it up to Azure Blob Storage into a folder called ootp.

secondbasemen

To do this, we’ll create a new external table:

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemenWASB]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'ootp/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

Inserting Data

Inserting data is pretty easy.

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb;

A few seconds later and we have some rows in the table:

rowsinserted

We can check to make sure that everything looks fine:
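The check itself is nothing fancy; a simple select like this one suffices:

-- Read the second basemen back out of Azure Blob Storage.
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM dbo.SecondBasemenWASB sb;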

secondbasemenwasb

I get back the same 777 results.

Looking at Azure Management Studio, I can see that the Polybase engine created a logical folder named ootp and 8 separate files, of which 4 have data.

filesuploaded

One interesting finding is that there is also an empty block blob with the name ootp:

emptyblockblob

The timestamp for it is the same as for the initial upload, so these were created at the same time.

Uploading A Larger Data Set

I decided to take advantage of my Azure Blob Storage connection and create a larger data set.  In this case, I’m going to cross join the second basemen table against itself, creating 777 records for each of the 777 second basemen.  Those will all go into Azure Blob Storage and I’ll be able to see how the Polybase engine deals with larger data sets.  I’m running two tests, one which will generate 777 * 777 = 603,729 records, and one which will generate 777 * 777 * 777 = 469,097,433 records.  Here’s my query for the latter:

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb2
	CROSS JOIN Player.SecondBasemen sb3;

Generating and uploading 600K records was pretty fast, but building 469 million records was a little slower.  During this time, the Polybase engine was streaming results out, meaning it didn’t appear to write everything to a local file.  The mpdwsvc service did take up about 2 GB of RAM during this time and was pushing out at about 25-30 Mbps, which is probably my Wifi connection’s limitation.  All in all, it took my laptop approximately 80 minutes to generate and push those results.

The end results were not particularly shocking:

multipleuploads

I’ve sliced the first, second, and third uploads into separate batches to make it clearer.  In the first batch, as I mentioned above, only four of the files ended up with data in them.  In the second batch, each file had somewhere around 2MB of write activity.  For the third batch, each was about 1.5 GB.  I should note that for the second run, the spread between the smallest and largest files was approximately 17.6%, whereas it was 12.3% for the third run.

What About Fixed-Width Lines?

One additional question I have involves whether the process for loading data is round-robin on a row-by-row basis.  My conjecture is that it is not (particularly given that our first example had 4 files with zero records in them!), but I figured I’d create a new table and test.  In this case, I’m using three fixed-width data types and loading 10 million identical records.  I chose to use identical record values to make sure that the text length of each line was exactly the same; the reason is that we’re taking data out of SQL Server (where an int is stored in a 4-byte block) and converting that int to a string (where each digit of the int is stored as a one-byte character).  I chose 10 million because I know that’s well above the cutoff point for data to go into each of the eight files, so if there’s special logic to handle tiny row counts, I’d get past it.

CREATE TABLE dbo.LocalFixed
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
);

INSERT INTO dbo.LocalFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT TOP (10000000)
	1,
	0,
	'abcdefghijkl'
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb1
	CROSS JOIN Player.SecondBasemen sb2;

From here, we can create another external table and insert our 10 million rows into it:

CREATE EXTERNAL TABLE [dbo].[RemoteFixed]
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'fixed/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);

INSERT INTO dbo.RemoteFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT
	lf.ID,
	lf.SomeNum,
	lf.SomeChar
FROM dbo.LocalFixed lf;

And here are the results in Blob Storage:

fixedwidthfiles

Even when I guarantee that rows are exactly the same size, there are still minor differences:  here it’s 20.48 MB versus 20.02 MB.  This indicates to me that the Polybase file write system is not round robin on a row-by-row basis, as there is a difference of 28,193 records separating the largest from the smallest file (1,262,965 rows with data in the largest file versus 1,234,772 in the smallest file).

Conclusion

Just like the Hadoop scenario, inserting into Azure Blob Storage splits the data into a set of files and uploads those files separately.  There weren’t any surprises here, so you can read the Hadoop version to get the rest of the story.  We did take the opportunity to learn a bit more about the way the Polybase file writer engine operates and can conclusively reject the idea that it evenly distributes records based either on data length or number of rows.

This wraps up my Polybase investigations of Azure Blob Storage for now.  Next up, I’m going to look at Azure SQL Data Warehouse.

Understanding Blob Storage Behavior

This is a continuation of my Polybase series.

In the last chapter of the Polybase series, I looked at creating external tables pointing to files in Azure Blob Storage.  Today, we’re going to take a deeper look at how the Polybase engine interacts with Azure Blob Storage.  Doing so is as simple as running a basic query.  The query that I will run is the following:

SELECT TOP(1000)
	f.month,
	f.arrdelay
FROM dbo.Flights2008 f
WHERE
	f.dest = N'CMH'
	AND f.dayofmonth = 14;

Execution Plan Details

The execution plan was pretty simple:

executionplan

As far as the XML goes, here’s the full plan if you’re curious:

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.5" Build="13.0.1722.0" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="67.2391" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" CardinalityEstimationModelVersion="130" StatementSubTreeCost="0.032413" StatementText="SELECT TOP(1000)
 f.month,
 f.arrdelay
FROM dbo.Flights2008 f
WHERE
 f.dest = N'CMH'
 AND f.dayofmonth = 14" StatementType="SELECT" QueryHash="0x955C4C3E381C844B" QueryPlanHash="0x53368F42C59FAD79" RetrievedFromCache="true" SecurityPolicyApplied="false">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="NoParallelForPDWCompilation" CachedPlanSize="120" CompileTime="50" CompileCPU="49" CompileMemory="208">
            <Warnings>
              <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="CONVERT_IMPLICIT(nvarchar(100),[f].[dest],0)=N'CMH'" />
            </Warnings>
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="417021" EstimatedPagesCached="104255" EstimatedAvailableDegreeOfParallelism="2" />
            <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.032413">
              <OutputList>
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
              </OutputList>
              <ComputeScalar>
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[month] as [f].[month]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[arrdelay] as [f].[arrdelay]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                </DefinedValues>
                <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Remote Query" NodeId="1" Parallel="false" PhysicalOp="Remote Query" EstimatedTotalSubtreeCost="0.032413">
                  <OutputList>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="1000" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="265523" ActualCPUms="80822" ActualScans="0" ActualLogicalReads="0" ActualPhysicalReads="0" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" />
                  </RunTimeInformation>
                  <RemoteQuery RemoteSource="Polybase_ExternalComputation" RemoteQuery="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
&lt;dsql_query number_nodes=&quot;1&quot; number_distributions=&quot;8&quot; number_distributions_per_node=&quot;8&quot;&gt;
 &lt;sql&gt;ExecuteMemo explain query&lt;/sql&gt;
 &lt;dsql_operations total_cost=&quot;1&quot; total_number_operations=&quot;10&quot;&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_23&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_24&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;MULTI&quot;&gt;
 &lt;dsql_operation operation_type=&quot;PARTITION_MOVE&quot;&gt;
 &lt;operation_cost cost=&quot;1.678287936&quot; accumulative_cost=&quot;1.678287936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;AllDistributions&quot; /&gt;
 &lt;source_statement&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
 [T3_1].[month] AS [month],
 [T3_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_23] AS T3_1
 WHERE ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
 WHERE ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1&lt;/source_statement&gt;
 &lt;destination&gt;Control&lt;/destination&gt;
 &lt;destination_table&gt;[TEMP_ID_24]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ExternalRoundRobinMove&quot;&gt;
 &lt;operation_cost cost=&quot;964.621896&quot; accumulative_cost=&quot;966.300183936&quot; average_rowsize=&quot;208&quot; output_rows=&quot;45211&quot; /&gt;
 &lt;external_uri&gt;wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2&lt;/external_uri&gt;
 &lt;destination_table&gt;[TEMP_ID_23]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_23]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;STREAMING_RETURN&quot;&gt;
 &lt;operation_cost cost=&quot;1&quot; accumulative_cost=&quot;967.300183936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;Control&quot; /&gt;
 &lt;select&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1&lt;/select&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_24]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operations&gt;
&lt;/dsql_query&gt;" />
                </RelOp>
              </ComputeScalar>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

And here is the decoded remote query:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="1" total_number_operations="10">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_23</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_24</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="PARTITION_MOVE">
        <operation_cost cost="1.678287936" accumulative_cost="1.678287936" average_rowsize="104" output_rows="67.2391" />
        <location distribution="AllDistributions" />
        <source_statement>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
                       [T3_1].[month] AS [month],
                       [T3_1].[arrdelay] AS [arrdelay]
                FROM   [tempdb].[dbo].[TEMP_ID_23] AS T3_1
                WHERE  ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
        WHERE  ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1</source_statement>
        <destination>Control</destination>
        <destination_table>[TEMP_ID_24]</destination_table>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="964.621896" accumulative_cost="966.300183936" average_rowsize="208" output_rows="45211" />
        <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2</external_uri>
        <destination_table>[TEMP_ID_23]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_23]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="STREAMING_RETURN">
      <operation_cost cost="1" accumulative_cost="967.300183936" average_rowsize="104" output_rows="67.2391" />
      <location distribution="Control" />
      <select>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_24]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

A couple of interesting details stand out.  First, this plan looks pretty similar to the Hadoop plan when I did not induce a MapReduce job.  We can see 8 distributions per node, just like with on-prem Hadoop.  Second, as expected, our external URI points to the WASBS (secure WASB) blob location.  Given that I’m using the WASBS protocol, I expect the connection which pulls this data down to use SSL/TLS.  And there’s one easy way to find out.

Wireshark Details

While running this query, I also pulled a packet capture.  After figuring out the IP address (which wasn’t hard, given that there were about 93,000 packets in my packet capture associated with it), I was able to get the following details.

Overall Picture

packetcapture

We can see in the image above that Polybase uses TLS 1.2 to communicate with Azure Blob Storage when you access Blob Storage using the WASBS protocol.  The connection is encrypted, so we don’t get much useful information out of it unless we set up Wireshark to deal with encrypted data.  For what we want to do, that’s not necessary.

The Download

What I do want to see is whether my statement that the Polybase engine needs to download the entire file is correct.  To do that, I can filter on the TCP stream associated with my file download and look for any packets with a length of more than 1000 bytes.

downloadsize

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

Conclusion

Looking at the execution plan and packet capture, it’s pretty clear that using Polybase with Azure Blob Storage behaves similarly to Polybase with Hadoop as long as you don’t use MapReduce.

Connecting To Blob Storage With Polybase

This is a continuation of my Polybase series.

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.

Starting Point:  Files In Blob Storage

I have copied up the flight data from 2003 through 2008 to Azure Blob Storage using Azure Management Studio.  Note that I’ve put my uploads into a folder called historical; that way, when I create external tables later, I can segregate those files from other files I upload.

uploadedfiles

I also have SQL Server 2016 and the Polybase engine installed on my local machine, configured the same way I did for on-prem Hadoop.  Note that the Polybase Connectivity Configuration option 7 will work for Azure Blob Storage as well as modern versions of Hortonworks, so I don’t need to change that.
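If you do need to set that option, it’s a quick sp_configure call, along these lines; note that the change requires restarting the SQL Server and Polybase services to take effect:

-- Set Polybase Hadoop connectivity to option 7 and apply the change.
EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE;
GO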

External Data Source

My first step is to set up an external data source for Azure Blob Storage.  If we go back to the MSDN documentation, we can see that the syntax for creating an external data source which uses Azure Blob Storage is pretty straightforward:

-- PolyBase only: Azure Storage Blob as data source   
-- (on SQL Server 2016 and Azure SQL Data Warehouse)  
CREATE EXTERNAL DATA SOURCE data_source_name  
    WITH (   
        TYPE = HADOOP,  
        LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
        [, CREDENTIAL = credential_name ]
    )  
[;]

We specify the type to be HADOOP and point to the storage account, along with a credential.  In my case, the container is called csflights and my account name is cspolybase, as you can see in the picture above.

But before I can create the data source, I need to create a master key for my database (if I don’t have one already) and create a database-scoped credential which contains my secret for Azure Blob Storage.  To get the secret, I need to go to the Azure Portal and select the “Access keys” option in settings and copy one of the keys they generate.

accesskey

Once I have that, I can fill out the T-SQL statements to create my data source:

USE OOTP
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential  
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (   
    TYPE = HADOOP,  
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential
);

Something important to note here is that Azure Blob Storage does not have a resource manager location.  This makes sense, as the resource manager (YARN) exists to allow you to push code to a Hadoop cluster to perform MapReduce jobs remotely.  All we can do with Azure Blob Storage is pull the data over the wire and perform all filters and joins within SQL Server.

External File Format

The external file format is the same as what we would use for on-premises Hadoop.  Just in case you don’t have one set up, here it is:

USE [OOTP]
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);
GO

External Tables

Now that we have the data source and file format set up, we can create some external tables.  The first table I want to create revolves around 2008 data.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

My first table just looks at flight information for 2008 so I can double-check my work.  A quick top 10 records looks good:

SELECT TOP(10) *
FROM Flights2008 f;

top10

The second table I want to create is all data, not just for a particular year.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

We can run a check to ensure that records look okay here. In this scenario, instead of just pulling down 10 results, I want to make sure that we get data for each year.

SELECT
	fa.[year],
	COUNT(1) AS NumberOfRecords
FROM dbo.FlightsAll fa
GROUP BY
	fa.[year]
ORDER BY
	fa.[year];

Naturally, this query takes a bit longer to run, as I have to download the files, decompress them, process them, and aggregate their results locally.  During this process, the mpdwsvc service was my biggest resource user, followed by a stretch in which SQL Server ate up a healthy amount of CPU time.

pdwsvc

When the query finished, it gave me my expected results:

aggregateresults

This particular query took me just over 5 minutes to run, getting data for approximately 42 million rows.

Conclusion

At this point, we’ve set up a Polybase connection to Azure Blob Storage.  In the next Polybase series post, we’re going to look at execution plans, packet captures, and other tools to gain a better understanding of the process.  After that, I will put together another post on inserting data into Azure Blob Storage.

Brave New World

For the past three years, I’ve worked as a Database Engineer—in other words, as a database developer—at ChannelAdvisor.  2 1/2 years of that time was spent working in the digital marketing space.  Coming into this job, I had worked at small-scale organizations:  the smallest cabinet-level department in Ohio, followed by a relatively small subsidiary of a large insurance company.  Working at ChannelAdvisor helped me build up skills as a database developer, figuring out that things which work well with a hundred thousand rows in a table don’t necessarily work well when that table hits a billion rows.

Well, come January 1st, I will no longer be a Database Engineer.  That’s because I’m going to be the Engineering Manager of a new predictive analytics team.

Wait, Management?

Yeah, this feels a little crazy for me as well.  The me of five years ago would never have wanted to be a manager, and the reasoning would have been the same as for other technical people:  I enjoy being on the front line, doing things rather than filling out paperwork.

Since then, I would not say that my priorities have changed much:  I still want to be on the front line, using technology to solve business problems.  What I get, though, is a force multiplier:  I now have two more people who can help me accomplish great things.

Vision

Something I’ve observed during the last few years of work is that we have a tremendous amount of interesting data at the company, and we throw away even more due to space and budget constraints.  What we have not been so good at is taking full advantage of that data to help customers.  Most of our systems are designed around a single customer’s data.  Obviously, our transactional systems are keyed toward individual customers, rather than aggregating their results.  What’s interesting is that even our warehouses tend to be customer-focused rather than company-focused.

My vision on predictive analytics is to blow out our company’s disk budget.

It is also to take advantage of this data and solve problems for customers, between customers, and for the company as a whole.  We have the data, and it will be my job to put together the tools to collect (in a place which does not harm our transactional processes), process, aggregate, and analyze the data.  Without getting into specifics, I want to close the internal gap between what we could conceivably do versus what we can do in practice.

Plan of Action:  Data Engineering

In order to pull off my vision, I’ve got to build up skills on a number of fronts, all at the same time.  There are four major quadrants I need to hit; the good news is that I’m building a team to help me with two of them.  I’m going to start with the Data Engineering Crucible, in which I (hopefully) take people with complementary skills across the following three axes and build up people with strong skills across all three.

Statistics

Doing a few analytics projects has reminded me that I need to re-take some stats courses.  My last statistics course was in graduate school, and that was mostly statistics for economists, meaning lots of regressions.  I’m bringing in an employee who has a pretty strong background in this, and so I plan to lean on that person pretty heavily (and push that person to get better at the same time).

Development

My .NET skills have stagnated the last few years.  That makes sense, as I don’t write as much .NET code as before.  The good news is that by hanging around the .NET user group and working on projects both at work and for presentations, I haven’t forgotten much there.  I also want to have my other employee bring in a strong development background to help the team get better.

Aside from .NET development (F# for life!), we’ll use other languages too.  I have some experience with R and Python, and that experience is about to grow significantly.  I have a lot of experience with SQL that I will share with the team, as well as some Java/Scala experience and whatever other crazy languages we decide on.

Subject Matter Expertise

I’ve worked on digital marketing for the past 2 1/2 years, but that’s only a part of what the company does.  My job will be to work with my team to train them on DM but also learn more about the rest of the company’s data and processes.

Plan of Action:  Management

Aside from hitting the Data Engineering trifecta, it’s time to ramp up management skills.  I have some ideas, and I’ll probably share more as I do a bit more.  Right now, it involves reading some books and thinking through various plans, like how I want to run meetings or drive sprint work.  After a few months, I hope to have a post up which describes some of this and see how things work out.

Conclusion

Over the past year, I have been itching for a team lead position.  Now that I have it, I’ll be like a dog with a chew toy, though probably a little less destructive.