This is a continuation of my Polybase series.

The first step to working with Polybase in Azure SQL Data Warehouse is to provision an Azure SQL Data Warehouse instance.  I’m going to follow along with the data I used in my Azure Blob Storage examples and load that data into Azure SQL Data Warehouse using Polybase.

The best way to connect to Azure SQL Data Warehouse is not SQL Server Management Studio, but rather SQL Server Data Tools.  So we’ll connect using SSDT.

In my scenario, I have an Azure SQL Data Warehouse instance named csdwdb.

connection

After putting in the correct details, I can see my instance in the SQL Server Object Explorer in Visual Studio.

connected

My plan is to do two things:  first, I want to pull all of my flight data into Azure SQL Data Warehouse, putting it into a table with a clustered columnstore index.  Second, I want to perform some processing of the data and put the results back into Azure Blob Storage to allow me to migrate data back to an on-prem SQL Server instance.

Step One:  Loading Data

External Types

The first thing I need to do is to create an external data source, external file format, and external table for my Azure Blob Storage flights folder.  Looking back at my post on Azure Blob Storage, I can repurpose most of that code for this scenario as well:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True
    )
);
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

Note that there is no database defined anymore. I’m running the query against my Azure SQL DW database, so I’m not going to specify a separate database here.

I want to make sure that everything’s working, so a quick select statement is in order:

SELECT TOP(100) * FROM dbo.FlightsAll;

top100

So, “quick” might be an overstatement, but I did get results I wanted and proof that we have data available.

Loading SQL Data Warehouse

My next step is to take this data and load it into Azure SQL Data Warehouse.  The smart way to create tables in Azure SQL Data Warehouse is to use the CREATE TABLE AS SELECT syntax.  I’m going to use that, as well as the advice they give on getting a fairly unique distribution key, and create my 2008 flights table as follows:

CREATE TABLE[dbo].[Flights2008DW]
WITH
(
	CLUSTERED COLUMNSTORE INDEX,
	DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.Flights2008;
GO

After a little over a minute, I get my results:

Query completed. Rows were rejected while reading from external source(s).
1 row rejected from external table [Flights2008] in plan step 3 of query execution:
Location: ‘/historical/2008.csv.bz2’ Column ordinal: 0, Expected data type: INT, Offending value: Year (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

That’s a header row, and I’m okay with it not making its way in.  As a quick aside, I should note that I picked tailnum as my distribution key.  The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values.  After loading the 2008 data, I loaded all years’ data the same way, except selecting from dbo.Flights instead of Flights2008.

Querying The Data

So let’s do some quick queries and make sure that everything looks alright.  Going back to earlier this year, I had an example of a query against a larger data set.  My subset of flights is not exactly the same here as it was there—I only have flight data for the years 2003 through 2008 in Azure SQL Data Warehouse—but I figured it’d be a good start.

Here’s my slightly revised query:

SELECT
    f.origin,
    f.distance
FROM dbo.FlightsAllDW f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

And against an Azure SQL Data Warehouse instance with 200 DWUs, it took 6 seconds to load the results.

2005flights

Again, that’s not a fair comp:  I’m getting 200 DWU for data partitioned across 60 servers versus running a query on my laptop.  That said, it was pleasant seeing results come back that quickly.

Migrating Data Out Using Polybase

I’m going to show one last trick here:  outmigration using CETAS:  CREATE EXTERNAL TABLE AS SELECT.

CREATE EXTERNAL TABLE [dbo].[CMHFlights]
WITH
(
	LOCATION = N'columbus/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
)
AS SELECT * FROM dbo.FlightsAllDW WHERE dest = 'CMH';
GO

This query takes data from my FlightsAllDW Azure SQL DW table and pumps them into Azure Blob Storage into a new external table.

cmhflights

In eight seconds, it figured out the 220K records that interested me and wrote them to a new external table called CMHFlights, and in a folder marked columbus.  If I look at the folder, I can see that the Polybase engine created 60 files:

cetas

Remember that when we did this ourselves using on-prem Polybase (e.g., writing to Hadoop), we ended up the data split out into 8 files.  Azure SQL Data Warehouse is special.

Conclusion

Polybase inside Azure SQL Data Warehouse isn’t that different from on-prem Polybase.  Its primary focus is to get data into and out of Azure SQL Data Warehouse for cases in which you don’t want to use Azure Data Factory.  One really cool thing you can do with Polybase in Azure SQL DW that you can’t do on-prem is CREATE EXTERNAL TABLE AS SELECT.  Otherwise, if you understand how the on-prem product works, you’ll get how it works in SQL Data Warehouse.

One thought on “Working With Azure SQL DW

Leave a comment