Getting Back to Basics
To date, I’ve always used the MongoDB API for Cosmos DB when demonstrating PolyBase. This is true in PolyBase Revealed, as well as blog posts here. Today, that changes: we are going to use PolyBase to get data from a Cosmos DB using the Core API, also known as the SQL API. This is particularly egregious because you can actually connect to the Core API in SQL Server 2019, not just 2022.
PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.
Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.
Creating the External Objects
One of the interesting differences between the MongoDB API and the Core API is that the Core API does not use database scoped credentials—even if you specify the credential in the external data source definition, it won’t understand what you’re sending. Therefore, we won’t create a database scoped credential here and will jump right into it.
First Try
Here is the code to create an external data source pointing to Cosmos DB.
CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
LOCATION = 'odbc://cspolybasecore.documents.azure.com:443/',
CONNECTION_OPTIONS = 'Driver={Microsoft Azure DocumentDB ODBC Driver};Host={https://cspolybasecore.documents.azure.com:443};AuthenticationKey={<Your Auth Key>}'
);
By contrast, here’s what we used to create an external data source using the MongoDB API:
CREATE EXTERNAL DATA SOURCE CosmosDB WITH
(
LOCATION = 'mongodb://cspolybase.mongo.cosmos.azure.com:10255',
CONNECTION_OPTIONS = 'ssl=true',
CREDENTIAL = CosmosCredential,
PUSHDOWN = ON
);
The CONNECTION_OPTIONS
section for the Core API requires both the host (the HTTPS URL) and the authentication key. That’s…not a good thing. The reason why is that anybody with access to read sys.external_data_sources
gets your auth details:
As far as I’m concerned, this route’s a no-go. Fortunately, there’s a better way.
Second Try
This time around, we’re going to create a DSN and specify it instead of using the driver directly. To do so, open the ODBC Data Sources (64-bit) app in Windows. In the Drivers tab, you should see a driver called Microsoft Azure DocumentDB ODBC Driver. If you have that, you can create a new System DSN and fill in the relevant host and key details.
Once you’ve created the DSN, the script to create an external data source is a lot better in terms of security risks.
CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
LOCATION = 'odbc://MACCLOUD/',
CONNECTION_OPTIONS = 'DSN=CosmosCore'
);
One thing I should point out is that the LOCATION
attribute on the external data source is necessary for PolyBase, as this attribute tells the engine what kind of driver to look for. Therefore, for an ODBC driver, we need to start with odbc://
. After that, however, PolyBase doesn’t even look at the location for host info, so we can put whatever we want in there. The real details are in CONNECTION_OPTIONS
, in which we connect to the CosmosCore DSN. That said, LOCATION
is what shows up when you query sys.dm_exec_external_work
, so if you do use something silly, at least make it unique if you have multiple Cosmos DB accounts.
Once we have the external data source created, the last step is to create an external table. I have a database called PolyBaseTest in my Core API and a table inside there named Employee. Interestingly, although I need to specify PolyBaseTest
in the LOCATION
attribute of the external table for the MongoDB API, I do not specify it for the Core API.
CREATE EXTERNAL TABLE dbo.EmployeeCore
(
_rid NVARCHAR(100) NOT NULL,
FirstName NVARCHAR(50) NOT NULL,
LastName NVARCHAR(50) NOT NULL
)
WITH
(
LOCATION='Employee',
DATA_SOURCE = CosmosCore
);
Investigative minds may wonder what happens if you have two databases with the same Employee table name. Investigative minds should try that out, as I’m wrapping this post up pretty soon.
Aggregate Pushdowns
There was a warning message in Microsoft’s documentation for the Cosmos DB ODBC driver:
The current ODBC driver doesn’t support aggregate pushdowns, and has known issues with some analytics tools.
This led me to wonder if this issue would break any PolyBase queries. It appears the answer may be “no.” I tried out the following queries:
SELECT * FROM dbo.EmployeeCore;
SELECT COUNT(*) FROM dbo.EmployeeCore;
SELECT COUNT(*) FROM dbo.EmployeeCore where lastname = 'Aubrey';
SELECT COUNT(*) FROM dbo.EmployeeCore where lastname = 'Aubrey' option(force externalpushdown);
GO
All of them worked:
Granted, this is a simplistic example, so perhaps more complex scenarios will break the engine. In the worst case scenario, if you do experience pushdown problems, I’d recreate the external data source as:
CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
LOCATION = 'odbc://MACCLOUD/',
CONNECTION_OPTIONS = 'DSN=CosmosCore',
PUSHDOWN = OFF
);
That way, the engine wouldn’t even ask for pushdown on the Cosmos side. But that’s something I’d do as a last resort after getting weird failures. Given that the ODBC driver still has “DocumentDB” in the name and was last released in 2019, this is a, uh, very stable driver in the sense that it doesn’t get updated all that often.
Conclusion
In today’s post, we learned that we can connect to Cosmos DB from PolyBase even when the Comsos account is using the Core API. This is great, as the vast majority of Cosmos DB accounts in my experience are using the Core API.
One thought on “PolyBase 2022 and Cosmos DB (Core API)”