Several years ago, I wrote a blog post on how to insert data into Azure Blob Storage from SQL Server using PolyBase. That technique used PolyBase V1: the Java connector for Hadoop. With SQL Server 2022 eliminating that connector, we’re going to learn the new method.
Starting out the Same
Regardless of the version of SQL Server you’re using (that is, 2016 or later), you’ll need to enable PolyBase and then enable insertion using PolyBase:
USE [master]
GO
EXEC sp_configure
@configname = 'polybase enabled',
@configvalue = 1;
GO
RECONFIGURE;
GO
-- To insert data, first we need to make sure that
-- PolyBase insertion is turned on.
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
EXEC sp_configure
@configname = 'allow polybase export',
@configvalue = 1;
GO
RECONFIGURE;
GO
-- Now restart the SQL Server database engine service, as well
-- as the two PolyBase services.
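As a quick sanity check of my own (not part of the original workflow), you can confirm both settings are live after the restart by querying sys.configurations:
-- Both rows should show value_in_use = 1.
SELECT [name], value_in_use
FROM sys.configurations
WHERE [name] IN (N'polybase enabled', N'allow polybase export');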
We’ll also need the same external objects: external data source (and maybe a database scoped credential), external file format, and external table. But there are several minor differences between the two versions, so let’s cover each in turn.
Pre-2022 Method
If we want to insert into Azure Blob Storage using SQL Server 2016-2019, we'll need a proper external data source. Supposing you have the right database scoped credential in place, we'd create the data source like so:
CREATE EXTERNAL DATA SOURCE AzureFireIncidentsBlob WITH
(
TYPE = HADOOP,
LOCATION = 'wasbs://visits@cspolybaseblob.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
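If you don't already have that AzureStorageCredential, creating it looks something like the following sketch; note that the credential must exist before you create the data source that references it. The master key password and storage account key are placeholders, and for wasbs:// the identity string is arbitrary (using the storage account name is a common convention):
-- A database master key must exist before creating a database scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<some strong password>';
GO
-- For wasbs://, IDENTITY can be any string; SECRET is the storage account access key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybaseblob',
SECRET = '<storage account access key>';
GO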
We’d then need to create an external file format. We’ll use Parquet for this example.
CREATE EXTERNAL FILE FORMAT ParquetFileFormat WITH
(
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
Finally, we need an external table. Here’s a simple external table with one column.
CREATE EXTERNAL TABLE [dbo].[SomeTable]
(
[SomeColumn] [nvarchar](50) NULL
)
WITH
(
DATA_SOURCE = AzureFireIncidentsBlob,
LOCATION = N'SomeTable/',
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1
);
Life is good. We have an external table, and we're writing its data to our Blob Storage account, into a folder called SomeTable. Every time we write some data, we'll create one or more Parquet files in that folder, and all of that data will be available to us.
Specifically, we insert data like so:
INSERT INTO dbo.SomeTable
(
SomeColumn
)
SELECT SomeLocalColumn
FROM dbo.MyLocalTable
WHERE
NeedsToBeExported = 1;
That's right: a basic INSERT statement. Note that, although we can insert data, we cannot run UPDATE, DELETE, MERGE, or similar DML operations against the external table. This is append-only.
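For illustration (these statements are mine, not from the original post), here is the sort of thing the append-only restriction rules out; each of these raises an error against the external table:
-- External tables reject UPDATE and DELETE; only INSERT (append) is allowed.
UPDATE dbo.SomeTable SET SomeColumn = N'NewValue';
DELETE FROM dbo.SomeTable WHERE SomeColumn = N'OldValue';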
Try It Again, 2022 Style
Now that we’ve gotten a refresher on how this all works prior to SQL Server 2022, let’s see how things differ. First up, we’re going to create an external data source.
CREATE EXTERNAL DATA SOURCE AzureFireIncidentsBlob WITH
(
LOCATION = 'abs://visits@cspolybaseblob.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
Right off the bat, the data source definition is different. TYPE = HADOOP was needed for the V1 connector but became irrelevant with V2, as well as with the data virtualization technology in SQL Server 2022. Instead, V2 moved us to the idea of protocols determining the right driver to use. Here, abs:// means that we connect to Azure Blob Storage. For Azure Data Lake Storage Gen2, it would be adls://, and for Amazon S3, it's s3://.
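To make the protocol idea concrete, hedged sketches for the other two targets might look like this; the account names, endpoints, and credential names below are placeholders rather than anything from the original post:
-- Azure Data Lake Storage Gen2 uses the dfs endpoint rather than blob.
CREATE EXTERNAL DATA SOURCE AzureDataLakeGen2Source WITH
(
LOCATION = 'adls://visits@cspolybaseadls.dfs.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
GO
-- Amazon S3: LOCATION points at the S3 endpoint; bucket and folder paths
-- typically go into the external table's LOCATION instead.
CREATE EXTERNAL DATA SOURCE AmazonS3Source WITH
(
LOCATION = 's3://s3.us-east-1.amazonaws.com/',
CREDENTIAL = S3Credential
);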
The external file format is exactly the same as before, so no changes there. The same goes for creating external tables to read data.
Our biggest difference is in how we insert data. Instead of INSERT operations, we use something familiar to Azure Synapse Analytics users: CREATE EXTERNAL TABLE AS SELECT, or CETAS. Here's an example:
CREATE EXTERNAL TABLE [dbo].[SomeTable]
WITH
(
DATA_SOURCE = AzureFireIncidentsBlob,
LOCATION = N'SomeTable/',
FILE_FORMAT = ParquetFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 1
) AS
SELECT
[SomeLocalColumn] AS SomeColumn
FROM [dbo].MyLocalTable
WHERE
NeedsToBeExported = 1;
We are creating a new external table, defining where we will write the data. In practice, this is very similar to the INSERT technique, in that we're still going to create a series of Parquet files for our data insertion operation. There are two big differences here, however. First, we don't need to define the table structure: the SELECT query defines the structure of how we write data to file, including data types, column names, and data lengths. The second difference is that this is a one-time trick: if we're inserting data monthly, we can't keep this external table around and insert into it like the 2016-2019 version. Instead, what we would do is (optionally) drop this external table, create a new external table writing to a new subfolder, and (optionally) get rid of that external table afterward. If we want to query the whole thing in one go, such as reading all historical data, we could create one external table which reads from N'/' or whatever our top-level directory is for that data.
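Here's a sketch of that monthly pattern; the dated subfolder and the table names are illustrative, not from the original post:
-- Write this month's rows to a dated subfolder, dropping any leftover shell first.
-- Dropping the external table does not delete the Parquet files behind it.
IF OBJECT_ID(N'dbo.SomeTableExport') IS NOT NULL
DROP EXTERNAL TABLE dbo.SomeTableExport;
GO
CREATE EXTERNAL TABLE [dbo].[SomeTableExport]
WITH
(
DATA_SOURCE = AzureFireIncidentsBlob,
LOCATION = N'SomeTable/2022-09/',
FILE_FORMAT = ParquetFileFormat
) AS
SELECT
[SomeLocalColumn] AS SomeColumn
FROM [dbo].MyLocalTable
WHERE
NeedsToBeExported = 1;
GO
-- One external table over the parent folder reads every month's files at once.
CREATE EXTERNAL TABLE [dbo].[SomeTableHistory]
(
[SomeColumn] [nvarchar](50) NULL
)
WITH
(
DATA_SOURCE = AzureFireIncidentsBlob,
LOCATION = N'SomeTable/',
FILE_FORMAT = ParquetFileFormat
);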
Conclusion
I'm pretty happy about getting CETAS in on-premises SQL Server. I think it's a reasonable way of helping people understand that neither PolyBase nor data virtualization in SQL Server 2022 is intended for transactional processing: you don't insert rows onesie-twosie; these features are intended for bulk insert operations on a periodic or one-time basis. I think creating an external table to write out to Azure Blob Storage is a good reminder of this.
Coda: Hail ORC
One last thing I want to mention here is that, in the process of updating my Data Virtualization with PolyBase talk to become Data Virtualization in SQL Server 2022, I learned that we cannot write to files using ORC format in SQL Server 2022 RC0: we can write to delimited files or Parquet. We can read from ORC files but cannot write to them. This is a shame, though admittedly, Parquet has become a much more popular format than ORC. It would be great to see ORC support for parity, but I'm not sure how much actual value that would bring to customers, so we'll see if it gets added later.
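For reference, a delimited-text external file format (the other writable option alongside Parquet) might look like this sketch; the format name and options here are my own:
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'),
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);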