PolyBase: When You Have No Local Tables

Today’s PolyBase post is all about what happens when you want to join data from multiple data sources together, but none of your tables are local SQL Server tables.

No Local Tables were Harmed in the Making of this Post

Let’s suppose we have two sets of data in two different sources. My first is North Carolina population living in Azure Blob Storage. Here is the script to create this table.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybaseblob',
SECRET = '<It's a secret&gt;';
GO
CREATE EXTERNAL DATA SOURCE AzureNCPopBlob WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://ncpop@cspolybaseblob.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True,
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation
(
    SumLev INT NOT NULL,
    County INT NOT NULL,
    Place INT NOT NULL,
    IsPrimaryGeography BIT NOT NULL,
    [Name] VARCHAR(120) NOT NULL,
    PopulationType VARCHAR(20) NOT NULL,
    Year INT NOT NULL,
    Population INT NOT NULL
)
WITH
(
    LOCATION = N'Census/NorthCarolinaPopulation.csv',
    DATA_SOURCE = AzureNCPopBlob,
    FILE_FORMAT = CsvFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);
GO

With this combination of external data source, file format, and table, we can create our next table. This is a data set which lives in Spark and contains…well, the same data. What we join together isn’t that interesting, after all.

CREATE DATABASE SCOPED CREDENTIAL SparkCredential
WITH IDENTITY = '<Your Username&gt;', Secret = '<Your Password&gt;';
GO
CREATE EXTERNAL DATA SOURCE ClusterinoSpark WITH
(
    LOCATION = 'odbc://clusterino:10016',
    CONNECTION_OPTIONS = 'Driver={Hortonworks Spark ODBC Driver}; Host = clusterino; Port = 10016; Database = default; ServerNode = clusterino:10016',
    CREDENTIAL = SparkCredential,
    PUSHDOWN = ON
);
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationTypedSpark
(
    SummaryLevel INT,
    CountyID INT,
    PlaceID INT,
    IsPrimaryGeography BIT,
    Name NVARCHAR(255),
    PopulationType NVARCHAR(255),
    Year INT,
    Population INT
)
WITH
(
    LOCATION = 'NorthCarolinaPopulationTyped',
    DATA_SOURCE = ClusterinoSpark
);

Once we have these two tables, we can join them.

SELECT
	*
FROM dbo.NorthCarolinaPopulation ncp
	INNER JOIN dbo.NorthCarolinaPopulationTypedSpark ncps
		ON ncp.SumLev = ncps.SummaryLevel
		AND ncp.County = ncps.CountyID
		AND ncp.Place = ncps.PlaceID
		AND ncp.PopulationType = ncps.PopulationType
		AND ncp.Year = ncps.Year;

You can extend this out to as many tables as you’d like. For each table, the PolyBase engine will determine whether to pull the data into a temp table locally or stream the data back, as well as whether to push down the operation to the remote source (if possible) or pull all of the data back. This is true for all PolyBase data sources, not just the two I showed here.

Advertisements

One thought on “PolyBase: When You Have No Local Tables

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s