Getting Back to Basics

To date, I’ve always used the MongoDB API for Cosmos DB when demonstrating PolyBase. This is true in PolyBase Revealed, as well as blog posts here. Today, that changes: we are going to use PolyBase to get data from a Cosmos DB using the Core API, also known as the SQL API. This is particularly egregious because you can actually connect to the Core API in SQL Server 2019, not just 2022.

PolyBase comes with a few built-in drivers, including Oracle, Teradata, MongoDB, and SQL Server. For everything else in the 2019 “style” of things, there is a generic ODBC route. In this route, you need to obtain a valid ODBC driver, configure it, and let PolyBase know how to access data from that remote source.

Cosmos DB’s Core API just happens to have a working ODBC driver, so the first step is to grab the relevant version of that driver and install it on the machine running SQL Server.

Creating the External Objects

One of the interesting differences between the MongoDB API and the Core API is that the Core API does not use database scoped credentials—even if you specify the credential in the external data source definition, it won’t understand what you’re sending. Therefore, we won’t create a database scoped credential here and will jump right into it.

First Try

Here is the code to create an external data source pointing to Cosmos DB.

CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
    LOCATION = 'odbc://cspolybasecore.documents.azure.com:443/',
    CONNECTION_OPTIONS = 'Driver={Microsoft Azure DocumentDB ODBC Driver};Host={https://cspolybasecore.documents.azure.com:443};AuthenticationKey={<Your Auth Key>}'
);

By contrast, here’s what we used to create an external data source using the MongoDB API:

CREATE EXTERNAL DATA SOURCE CosmosDB WITH
(
    LOCATION = 'mongodb://cspolybase.mongo.cosmos.azure.com:10255',
    CONNECTION_OPTIONS = 'ssl=true',
    CREDENTIAL = CosmosCredential,
    PUSHDOWN = ON
);

The CONNECTION_OPTIONS section for the Core API requires both the host (the HTTPS URL) and the authentication key. That’s…not a good thing. The reason why is that anybody with access to read sys.external_data_sources gets your auth details:

And let’s face it: Cosmos DB is way too expensive to let people just run around with your auth keys.

As far as I’m concerned, this route’s a no-go. Fortunately, there’s a better way.

Second Try

This time around, we’re going to create a DSN and specify it instead of using the driver directly. To do so, open the ODBC Data Sources (64-bit) app in Windows. In the Drivers tab, you should see a driver called Microsoft Azure DocumentDB ODBC Driver. If you have that, you can create a new System DSN and fill in the relevant host and key details.

Creating a DSN to hide those access key details.

Once you’ve created the DSN, the script to create an external data source is a lot better in terms of security risks.

CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
    LOCATION = 'odbc://MACCLOUD/',
    CONNECTION_OPTIONS = 'DSN=CosmosCore'
);

One thing I should point out is that the LOCATION attribute on the external data source is necessary for PolyBase, as this attribute tells the engine what kind of driver to look for. Therefore, for an ODBC driver, we need to start with odbc://. After that, however, PolyBase doesn’t even look at the location for host info, so we can put whatever we want in there. The real details are in CONNECTION_OPTIONS, in which we connect to the CosmosCore DSN. That said, LOCATION is what shows up when you query sys.dm_exec_external_work, so if you do use something silly, at least make it unique if you have multiple Cosmos DB accounts.

Once we have the external data source created, the last step is to create an external table. I have a database called PolyBaseTest in my Core API and a table inside there named Employee. Interestingly, although I need to specify PolyBaseTest in the LOCATION attribute of the external table for the MongoDB API, I do not specify it for the Core API.

CREATE EXTERNAL TABLE dbo.EmployeeCore
(
    _rid NVARCHAR(100) NOT NULL,
	FirstName NVARCHAR(50) NOT NULL,
	LastName NVARCHAR(50) NOT NULL
)
WITH
(
    LOCATION='Employee',
    DATA_SOURCE = CosmosCore
);

Investigative minds may wonder what happens if you have two databases with the same Employee table name. Investigative minds should try that out, as I’m wrapping this post up pretty soon.

Aggregate Pushdowns

There was a warning message in Microsoft’s documentation for the Cosmos DB ODBC driver:

The current ODBC driver doesn’t support aggregate pushdowns, and has known issues with some analytics tools.

This led me to wonder if this issue would break any PolyBase queries. It appears the answer may be “no.” I tried out the following queries:

SELECT * FROM dbo.EmployeeCore;
SELECT COUNT(*) FROM dbo.EmployeeCore;
SELECT COUNT(*) FROM dbo.EmployeeCore where lastname = 'Aubrey';
SELECT COUNT(*) FROM dbo.EmployeeCore where lastname = 'Aubrey' option(force externalpushdown);
GO

All of them worked:

No errors on pushdown.

Granted, this is a simplistic example, so perhaps more complex scenarios will break the engine. In the worst case scenario, if you do experience pushdown problems, I’d recreate the external data source as:

CREATE EXTERNAL DATA SOURCE CosmosCore WITH
(
    LOCATION = 'odbc://MACCLOUD/',
    CONNECTION_OPTIONS = 'DSN=CosmosCore',
    PUSHDOWN = OFF
);

That way, the engine wouldn’t even ask for pushdown on the Cosmos side. But that’s something I’d do as a last resort after getting weird failures. Given that the ODBC driver still has “DocumentDB” in the name and was last released in 2019, this is a, uh, very stable driver in the sense that it doesn’t get updated all that often.

Conclusion

In today’s post, we learned that we can connect to Cosmos DB from PolyBase even when the Comsos account is using the Core API. This is great, as the vast majority of Cosmos DB accounts in my experience are using the Core API.

One thought on “PolyBase 2022 and Cosmos DB (Core API)

Leave a comment