A long, long time ago (I can still remember)
PolyBase used to connect to Azure Blob Storage. Specifically, you could use the
wasbs:// protocol and connect to Azure Blob Storage over WebHDFS. Here’s an example of an external data source which would work for SQL Server 2016 through 2019:
USE MyDB GO CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential WITH IDENTITY = 'cspolybase', SECRET = '<access key>'; GO CREATE EXTERNAL DATA SOURCE WASBFlights WITH ( TYPE = HADOOP, LOCATION = 'wasbs://firstname.lastname@example.org', CREDENTIAL = AzureStorageCredential );
There are two steps here to the creation process. First, we create a database scoped credential using the Azure Blob Storage account name as the identity and either the primary or secondary key in Azure Blob Storage’s Access keys menu as the secret.
Then, we create an external data source whose type is “Hadoop” and whose location is a
Bad News on the Doorstep (I Couldn’t Take One More Step)
If you try the above command in SQL Server 2022, you’ll get an error. Specifically:
So we can see here that
TYPE = HADOOP is right out. That’s because SQL Server 2022 eliminated the Java connector which drove Azure Blob Storage functionality. If you remove that and try again, you’ll get another error:
That’s because data virtualization in SQL Server 2022 doesn’t use the
wasbs:// prefix anymore—it uses
abs:// for Azure Blob Storage and
adls:// for Azure Data Lake Storage Gen2. Here’s an example using the new format.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential2 WITH IDENTITY = 'cspolybaseblob', SECRET = '<access key>'; GO CREATE EXTERNAL DATA SOURCE AzureNCPopBlob2 WITH ( LOCATION = 'abs://email@example.com', CREDENTIAL = AzureStorageCredential ); GO
From there, creating an external table is the same as it always was.
CREATE EXTERNAL TABLE dbo.NorthCarolinaPopulation2 ( SumLev INT NOT NULL, County INT NOT NULL, Place INT NOT NULL, IsPrimaryGeography BIT NOT NULL, [Name] VARCHAR(120) NOT NULL, PopulationType VARCHAR(20) NOT NULL, Year INT NOT NULL, Population INT NOT NULL ) WITH ( DATA_SOURCE = AzureNCPopBlob2, LOCATION = N'Census/NorthCarolinaPopulation.csv', FILE_FORMAT = CsvFileFormat, REJECT_TYPE = VALUE, REJECT_VALUE = 5 ); GO
I Was a Lonely Teenage Broncin’ Buck (With a Pink Carnation and a Pickup Truck)
One nice thing about file-based data virtualization in SQL Server 2022 is that it will tell you if you get rejected. For example, this file has four rows with bad data in them. In SQL Server 2016-2019, we’d either get back the 13,607 rows or an error indicating that processing failed, depending on what our rejection threshold was. Now, we can see the error messages and lines in the file:
With this information, you can look at the file itself and see what’s wrong. Here, we have a header in the first row and three row with letters for population instead of numbers. Let’s fix the header problem with a minor change to the external file format.
CREATE EXTERNAL FILE FORMAT CsvFileFormatWithHeader WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( FIELD_TERMINATOR = N',', FIRST_ROW = 2, USE_TYPE_DEFAULT = True, STRING_DELIMITER = '"', ENCODING = 'UTF8' ) );
FIRST_ROW = 2 parameter is what’ll do it for us, as it means we ignore the first row. Note that this is available in Azure Synapse Analytics but was not available in SQL Server 2016-2019. Now we still get back 13,607 rows but have three errors with which to content.
If we wanted to ensure that we pick up the last three rows, we could switch the
Population column to a string and perform
TRY_CAST() to change it to an integer, though I’d rather just have the three rows fail.
My guess is that this new behavior will cause applications to fail because it’s catching actual errors. That might be tricky.
Oh, and while the King was Looking Down (The Jester Stole His Thorny Crown)
One more thing I want to cover here is that Azure Blob Storage (and Data Lake Storage) connections support Shared Access Signature (SAS) tokens. This is very nice, as it allows you to control what kind of access to grant the SQL Server user and you can narrow things down to specific directories and files. You’ve been able to use SAS tokens for Azure Synapse Analytics data virtualization but now you can do it on-prem with SQL Server 2022.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<YOUR SECRET>';
Two thumbs up for that.
One thought on “PolyBase and Azure Blob Storage”