Upcoming Events: SQL Saturday Boston

Key Details

What: SQL Saturday Boston
Where: Microsoft, 5 Wayside Road, Burlington, Massachusetts, 01803, United States
When: Saturday, September 14th.
Admission is free. Register on the SQL Saturday website.

What I’m Presenting

02:45 PM — 03:45 PM — Approaching Zero: Minimizing Downtime During Deployments

That 2:45 PM time slot is a murderer’s row; there are seven strong talks in there.

PolyBase: When You Have No Local Tables

Today’s PolyBase post is all about what happens when you want to join data from multiple data sources together, but none of your tables are local SQL Server tables.

No Local Tables were Harmed in the Making of this Post

Let’s suppose we have two sets of data in two different sources. My first is North Carolina population living in Azure Blob Storage. Here is the script to create this table.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybaseblob',
SECRET = '<It's a secret>';
GO
CREATE EXTERNAL DATA SOURCE AzureNCPopBlob WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://ncpop@cspolybaseblob.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True,
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation
(
    SumLev INT NOT NULL,
    County INT NOT NULL,
    Place INT NOT NULL,
    IsPrimaryGeography BIT NOT NULL,
    [Name] VARCHAR(120) NOT NULL,
    PopulationType VARCHAR(20) NOT NULL,
    Year INT NOT NULL,
    Population INT NOT NULL
)
WITH
(
    LOCATION = N'Census/NorthCarolinaPopulation.csv',
    DATA_SOURCE = AzureNCPopBlob,
    FILE_FORMAT = CsvFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);
GO

With this combination of external data source, file format, and table, we can create our next table. This is a data set which lives in Spark and contains…well, the same data. What we join together isn’t that interesting, after all.

CREATE DATABASE SCOPED CREDENTIAL SparkCredential
WITH IDENTITY = '<Your Username>', Secret = '<Your Password>';
GO
CREATE EXTERNAL DATA SOURCE ClusterinoSpark WITH
(
    LOCATION = 'odbc://clusterino:10016',
    CONNECTION_OPTIONS = 'Driver={Hortonworks Spark ODBC Driver}; Host = clusterino; Port = 10016; Database = default; ServerNode = clusterino:10016',
    CREDENTIAL = SparkCredential,
    PUSHDOWN = ON
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationTypedSpark
(
    SummaryLevel INT,
    CountyID INT,
    PlaceID INT,
    IsPrimaryGeography BIT,
    Name NVARCHAR(255),
    PopulationType NVARCHAR(255),
    Year INT,
    Population INT
)
WITH
(
    LOCATION = 'NorthCarolinaPopulationTyped',
    DATA_SOURCE = ClusterinoSpark
);

Once we have these two tables, we can join them.

SELECT
	*
FROM dbo.NorthCarolinaPopulation ncp
	INNER JOIN dbo.NorthCarolinaPopulationTypedSpark ncps
		ON ncp.SumLev = ncps.SummaryLevel
		AND ncp.County = ncps.CountyID
		AND ncp.Place = ncps.PlaceID
		AND ncp.PopulationType = ncps.PopulationType
		AND ncp.Year = ncps.Year;

You can extend this out to as many tables as you’d like. For each table, the PolyBase engine will determine whether to pull the data into a temp table locally or stream the data back, as well as whether to push down the operation to the remote source (if possible) or pull all of the data back. This is true for all PolyBase data sources, not just the two I showed here.