A long, long time ago (I can still remember)

PolyBase used to connect to Azure Blob Storage. Specifically, you could use the wasbs:// protocol and connect to Azure Blob Storage through the Java-based Hadoop connector. Here’s an example of an external data source which would work for SQL Server 2016 through 2019:

USE MyDB
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential  
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights WITH
(
    TYPE = HADOOP,  
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

There are two steps here to the creation process. First, we create a database scoped credential using the Azure Blob Storage account name as the identity and either the primary or secondary key in Azure Blob Storage’s Access keys menu as the secret.

Then, we create an external data source whose type is “Hadoop” and whose location is a wasbs:// path.

Bad News on the Doorstep (I Couldn’t Take One More Step)

If you try the above command in SQL Server 2022, you’ll get an error. Specifically:

External data sources are not supported with type HADOOP.

So we can see here that TYPE = HADOOP is right out. That’s because SQL Server 2022 eliminated the Java connector which drove Azure Blob Storage functionality. If you remove that and try again, you’ll get another error:

wasbs://[…] contains an unsupported prefix. Retry with supported connector prefix.

That’s because data virtualization in SQL Server 2022 doesn’t use the wasbs:// prefix anymore—it uses abs:// for Azure Blob Storage and adls:// for Azure Data Lake Storage Gen2. Here’s an example using the new format.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential2
WITH IDENTITY = 'cspolybaseblob',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE AzureNCPopBlob2 WITH
(
	LOCATION = 'abs://ncpop@cspolybaseblob.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential2
);
GO

From there, creating an external table is the same as it always was.

CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation2
(
	SumLev INT NOT NULL,
	County INT NOT NULL,
	Place INT NOT NULL,
	IsPrimaryGeography BIT NOT NULL,
	[Name] VARCHAR(120) NOT NULL,
	PopulationType VARCHAR(20) NOT NULL,
	Year INT NOT NULL,
	Population INT NOT NULL
)
WITH
(
	DATA_SOURCE = AzureNCPopBlob2,
	LOCATION = N'Census/NorthCarolinaPopulation.csv',
	FILE_FORMAT = CsvFileFormat,
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);
GO
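
Once the external table exists, querying it works like any regular table. Here’s a quick sanity check using the columns from the definition above:

-- Pull back a handful of rows to make sure the external table resolves against the file.
SELECT TOP(10)
	ncp.[Name],
	ncp.PopulationType,
	ncp.Year,
	ncp.Population
FROM dbo.NorthCarolinaPopulation2 ncp
ORDER BY ncp.Population DESC;
GO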

I Was a Lonely Teenage Broncin’ Buck (With a Pink Carnation and a Pickup Truck)

One nice thing about file-based data virtualization in SQL Server 2022 is that it will tell you which rows were rejected and why. For example, this file has four rows with bad data in them. In SQL Server 2016-2019, we’d either get back the 13,607 rows or an error indicating that processing failed, depending on our rejection threshold. Now, we can see the error messages and the offending lines in the file:

Bulk load data conversion error (type mismatch or invalid character for the specified codepage).

With this information, you can look at the file itself and see what’s wrong. Here, we have a header in the first row and three rows with letters in the population column instead of numbers. Let’s fix the header problem with a minor change to the external file format.

CREATE EXTERNAL FILE FORMAT CsvFileFormatWithHeader WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = True,
        STRING_DELIMITER = '"',
        ENCODING = 'UTF8'
    )
);

The FIRST_ROW = 2 parameter is what’ll do it for us, as it means we skip the first row of the file. Note that this option is available in Azure Synapse Analytics but was not available in SQL Server 2016-2019. Now we still get back 13,607 rows but have three errors with which to contend.
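
Of course, to pick up the new file format, we do need to drop and re-create the external table against it. Here’s a minimal sketch, re-using the column list and data source from above:

-- Re-create the external table, this time pointing at the file format which skips the header row.
DROP EXTERNAL TABLE dbo.NorthCarolinaPopulation2;
GO
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation2
(
	SumLev INT NOT NULL,
	County INT NOT NULL,
	Place INT NOT NULL,
	IsPrimaryGeography BIT NOT NULL,
	[Name] VARCHAR(120) NOT NULL,
	PopulationType VARCHAR(20) NOT NULL,
	Year INT NOT NULL,
	Population INT NOT NULL
)
WITH
(
	DATA_SOURCE = AzureNCPopBlob2,
	LOCATION = N'Census/NorthCarolinaPopulation.csv',
	FILE_FORMAT = CsvFileFormatWithHeader,
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);
GO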

The header row is no longer a concern.

If we wanted to ensure that we pick up the last three rows, we could switch the Population column to a string and perform TRY_CAST() to change it to an integer, though I’d rather just have the three rows fail.
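
If you did want to go that route, here’s a minimal sketch: load Population as a string so those rows survive rejection, then convert on the way out. The table and view names here are just for illustration.

-- Hypothetical variant: Population comes in as a string, so rows with letters are no longer rejected.
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulationRaw
(
	SumLev INT NOT NULL,
	County INT NOT NULL,
	Place INT NOT NULL,
	IsPrimaryGeography BIT NOT NULL,
	[Name] VARCHAR(120) NOT NULL,
	PopulationType VARCHAR(20) NOT NULL,
	Year INT NOT NULL,
	Population VARCHAR(20) NOT NULL
)
WITH
(
	DATA_SOURCE = AzureNCPopBlob2,
	LOCATION = N'Census/NorthCarolinaPopulation.csv',
	FILE_FORMAT = CsvFileFormatWithHeader,
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);
GO
-- TRY_CAST() returns NULL instead of erroring when the value is not a valid integer.
CREATE VIEW dbo.NorthCarolinaPopulationClean AS
SELECT
	ncp.SumLev,
	ncp.County,
	ncp.Place,
	ncp.IsPrimaryGeography,
	ncp.[Name],
	ncp.PopulationType,
	ncp.Year,
	TRY_CAST(ncp.Population AS INT) AS Population
FROM dbo.NorthCarolinaPopulationRaw ncp;
GO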

My guess is that this new behavior will cause some existing applications to fail, because it surfaces actual data errors that prior versions quietly rejected. That might make upgrades a bit tricky.

Oh, and while the King was Looking Down (The Jester Stole His Thorny Crown)

One more thing I want to cover here is that Azure Blob Storage (and Data Lake Storage) connections support Shared Access Signature (SAS) tokens. This is very nice, as it allows you to control what kind of access to grant the SQL Server user, narrowing things down to specific containers, directories, or files. You’ve been able to use SAS tokens for Azure Synapse Analytics data virtualization, but now you can do it on-premises with SQL Server 2022.

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<YOUR SECRET>';
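
From there, the external data source looks just like the access key version. Here’s a quick sketch, assuming the same storage account and container as the earlier blob example (the data source name is just for illustration):

-- The data source is unchanged aside from pointing at the SAS-based credential.
CREATE EXTERNAL DATA SOURCE AzureNCPopBlobSAS WITH
(
	LOCATION = 'abs://ncpop@cspolybaseblob.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential
);
GO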

Two thumbs up for that.
