Connecting To Blob Storage With Polybase

This is a continuation of my Polybase series.

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.

Starting Point:  Files In Blob Storage

I have copied up the flight data from 2003 through 2008 to Azure Blob Storage using Azure Management Studio.  Note that I’ve put my uploads into a folder called historical; that way, when I create external tables later, I can segregate those files from other files I upload.
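If you would rather script the upload than use a GUI tool, the classic AzCopy command-line utility can do the same thing. This is just a sketch; `C:\Flights` is a placeholder for wherever the files live locally, and you would substitute your own access key:

```
AzCopy /Source:C:\Flights /Dest:https://cspolybase.blob.core.windows.net/csflights/historical /DestKey:<access key> /Pattern:*.csv.bz2
```

Either way, the important part is that the files end up under the historical folder within the container.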

[Image: uploaded files in the historical folder]

I also have SQL Server 2016 and the Polybase engine installed on my local machine, configured the same way I did for on-prem Hadoop.  Note that the Polybase Connectivity Configuration option 7 will work for Azure Blob Storage as well as modern versions of Hortonworks, so I don’t need to change that.
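In case you still need to set that option, connectivity level 7 (which covers Azure Blob Storage and recent Hortonworks distributions) is configured with sp_configure; note that you will need to restart SQL Server and the Polybase services afterward for the change to take effect:

```sql
-- Option 7 covers Hortonworks HDP 2.x as well as Azure Blob Storage (wasb[s]).
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO
```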

External Data Source

My first step is to set up an external data source for Azure Blob Storage.  If we go back to the MSDN documentation, we can see that the syntax for creating an external data source which uses Azure Blob Storage is pretty straightforward:

-- PolyBase only: Azure Storage Blob as data source   
-- (on SQL Server 2016 and Azure SQL Data Warehouse)  
CREATE EXTERNAL DATA SOURCE data_source_name  
    WITH (   
        TYPE = HADOOP,  
        LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
        [, CREDENTIAL = credential_name ]
    )  
[;]

We specify the type to be HADOOP and point to the storage account, along with a credential.  In my case, the container is called csflights and my account name is cspolybase, as you can see in the picture above.

But before I can create the data source, I need to create a master key for my database (if I don’t have one already) and create a database-scoped credential which contains my secret for Azure Blob Storage.  To get the secret, I need to go to the Azure Portal and select the “Access keys” option in settings and copy one of the keys they generate.
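If you prefer to stay out of the portal, you can also pull the keys down with the AzureRM PowerShell module. A minimal sketch, assuming the AzureRM module is installed and “MyResourceGroup” stands in for whatever resource group holds the storage account:

```powershell
# "MyResourceGroup" is a placeholder; cspolybase is the storage account from above.
Get-AzureRmStorageAccountKey -ResourceGroupName "MyResourceGroup" -Name "cspolybase"
```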

[Image: the Access keys option in the Azure Portal settings blade]

Once I have that, I can fill out the T-SQL statements to create my data source:

USE OOTP
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential  
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (   
    TYPE = HADOOP,  
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

Something important to note here is that Azure Blob Storage does not have a resource manager location.  This makes sense, as the resource manager (YARN) exists to allow you to push code to a Hadoop cluster to perform MapReduce jobs remotely.  All we can do with Azure Blob Storage is pull the data over the wire and perform all filters and joins within SQL Server.
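As a quick sanity check after running the statement above, we can confirm that the data source registered correctly by querying the catalog view for external data sources:

```sql
-- Confirm the external data source exists and points at the right container.
SELECT
    name,
    location,
    type_desc,
    credential_id
FROM sys.external_data_sources;
```

Note that the resource_manager_location column in that view will be NULL for this data source, which lines up with the point above: there is no YARN endpoint to push work to.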

External File Format

The external file format is the same as what we would use for on-premises Hadoop.  Just in case you don’t have one set up, here it is:

USE [OOTP]
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);
GO

External Tables

Now that we have the data source and file format set up, we can create some external tables.  The first table I want to create revolves around 2008 data.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

My first table just looks at flight information for 2008 so I can double-check my work.  A quick top 10 records looks good:

SELECT TOP(10) *
FROM Flights2008 f;

[Image: top 10 rows from Flights2008]

The second table I want to create is all data, not just for a particular year.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

We can run a check to ensure that records look okay here. In this scenario, instead of just pulling down 10 results, I want to make sure that we get data for each year.

SELECT
	fa.[year],
	COUNT(1) AS NumberOfRecords
FROM dbo.FlightsAll fa
GROUP BY
	fa.[year]
ORDER BY
	fa.[year];

Naturally, this query takes a bit longer to run, as I have to download the files, decompress them, process them, and aggregate their results locally.  During this process, mpdwsvc.exe (the Polybase engine and data movement service) was my biggest resource user, followed by a stretch in which SQL Server ate up a healthy amount of CPU time.

[Image: mpdwsvc resource usage during the query]

When the query finished, it gave me my expected results:

[Image: record counts by year]

This particular query took me just over 5 minutes to run, aggregating approximately 42 million rows.

Conclusion

At this point, we’ve set up a Polybase connection to Azure Blob Storage.  In the next Polybase series post, we’re going to look at execution plans, packet captures, and other tools to gain a better understanding of the process.  After that, I will put together another post on inserting data into Azure Blob Storage.
