With the announcement back at Ignite that SQL Server 2022 will have support for Synapse Link, I figured it would be a good idea to put together a quick post showing how you can use Synapse Link today, using the already-available connector for Cosmos DB as an example. I don’t know if the behavior will be exactly the same with SQL Server 2022, but this should give us some idea.

If you’ve never heard of it before, Synapse Link is an easy way to feed data into Azure Synapse Analytics. At its best, it makes data available with little to no ETL/ELT code. In this post, we will see how we can combine Synapse Link for Cosmos DB with the synapsesql() function in Spark pools to take data from Cosmos DB and load it into a dedicated SQL pool.


Before we get started with Synapse Link, there are a few things you’ll need. First, you’ll need to enable the analytical store in Cosmos DB. This is because Synapse Link for Cosmos DB uses the analytical store for OLAP operations. The typical use case for Cosmos DB is OLTP, writing and reading one document at a time (with the capability to work with multiple documents). Considering that Azure Synapse Analytics is a warehousing solution, this doesn’t sound like a good mix. That’s where the analytical store comes into play: with it enabled, replication from OLTP to OLAP happens behind the scenes, meaning that we don’t need to write any code to do it. Furthermore, when we do use Synapse Link for Cosmos DB, we can access data in the analytical store without affecting end users. Synapse Link for Cosmos DB will give us data in a Spark DataFrame.

From there, we will need to use the Spark to Synapse SQL Connector to migrate data from a Spark DataFrame into a dedicated SQL pool. We can write to a new table in the dedicated SQL pool, so think of this as populating a staging table rather than the end product. That is, unless your final table comes entirely from this solution and you don’t need to do any cleanup, enrichment, or other processing beyond what you do in Spark.

In order to run this, you’ll need a few important permissions:

  • In order to run a Spark notebook, we will need either the Synapse Compute Operator or the Synapse Contributor Synapse RBAC role.
  • In order to write to our own dedicated SQL pool (that is, a dedicated SQL pool in the same workspace), Synapse will use the Azure AD credentials of the notebook runner. If you are writing to a pool outside of the current workspace, you can use SQL authentication. For the sake of simplicity, I’ll assume that you’re writing to a dedicated SQL pool in the same workspace as your Spark pool.
  • The dedicated SQL pool login you use must have the db_exporter role on the dedicated SQL pool.
  • The Azure AD user must have Storage Blob Data Contributor on the Azure Storage Account associated with Azure Synapse Analytics.

Now that we have the requirements out of the way, let’s get started.

I’ll warn you up-front that it’s going to be a bit easier to work with Azure Synapse Link for Cosmos DB on a new collection rather than an existing one. You can enable the analytical store for an existing container, but there’s no UI option for it as of the time of writing. So let’s start with a new container and work from there. First up, in the Settings menu, choose Features and then select Azure Synapse Link. This will allow you to enable Azure Synapse Link for Cosmos DB.

Enable Azure Synapse Link

Once you have this enabled, you’ll be able to create new containers with the analytical store turned on. Navigate to the Data Explorer and choose New Container. Fill out the relevant details, but be sure to enable the Analytical store, as that’s the whole point of the post.

Enable the analytical store for that sweet OLAP goodness.

From there, you’ll need to load some data into the Cosmos DB container. There’s a helpful Upload Item option if you happen to have a JSON file handy.

From Cosmos to Synapse

We now have an analytical store in Cosmos DB, but it’s not hooked up to a Synapse workspace. In order to do this, we need to do a bit of setup. I’m going to do this in hard mode, that is, using a Synapse workspace in a managed virtual network with data exfiltration protection enabled. The upshot of this is that we can only access that Cosmos DB source if we first create a managed private endpoint. As a quick note, if you did not select the option to create a managed virtual network when you created the Synapse workspace, you won’t be able to create a managed private endpoint, and you can skip ahead to the part where we create a linked service.

Create a Managed Private Endpoint

Supposing that you do have a managed virtual network for Synapse, navigate to the Manage blade and then select Managed private endpoints from the Security menu. This will allow you to create a new managed private endpoint.

All of my managed private endpoints.

Create a new managed private endpoint using the Azure Cosmos DB (SQL API) option—be sure not to select the MongoDB API, or you won’t be able to find the proper collection. You’ll want to select the appropriate Cosmos DB account and target the Analytical sub-resource. The Sql sub-resource will allow you to access the OLTP side, and Analytical the OLAP side.

Choose the right sub-resource!

Once you select Create, it actually requests a private endpoint connection with Cosmos DB. This will take a few minutes to complete. To finish the job, navigate back to Cosmos DB and then, in the Settings menu, choose Private Endpoint Connections. You’ll see the connection associated with your Synapse account. Select it and choose Approve. This step will take a few minutes to complete.

Time to approve the endpoints!

Now that you’ve waited a few minutes a few times, you should have a private endpoint.

Create a Linked Service

Welcome back, people who chose not to use a managed virtual network. We’re now ready to create a linked service. Return to the Synapse workspace if you aren’t already there. Enter the Manage blade and select Linked services from the External connections menu. Choose to add a new Cosmos DB (SQL API) linked service and fill in the details. Note that I have a managed private endpoint, so it automatically gets used—there’s nothing you need to do to enable this.

Create a new linked service.

Once you have that in place, it’s time to get notebooking!

Data Migration, Notebook-Style

Next up, we need to create a new Synapse notebook. I should note that you’ll want to create this in Scala (that is, a Spark notebook, not a PySpark notebook), as some of the work requires Scala. Select the Develop blade and then choose to create a new notebook from the + menu. Attach this to a Spark cluster and change the language (not pictured due to my zoom settings). You can then use the following code to interact with Cosmos DB:

val df = spark.read.format("cosmos.olap").
    option("spark.synapse.linkedservice", "CosmosDb1").
    option("spark.cosmos.container", "GameData2020").
    load()

Fill in the appropriate values as needed, but this will read in data from the Cosmos DB analytical store, using the CosmosDb1 linked service and reading from the GameData2020 container. The end result is a Spark DataFrame.
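Before writing anywhere, it can be worth a quick sanity check on what came back. The columns you see will of course depend on the shape of your documents:

```scala
// Inspect the schema Synapse Link inferred from the analytical store,
// then peek at a sample of rows.
df.printSchema()
df.show(10, truncate = false)
```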

If we want to feed the data into a dedicated SQL pool, that’s pretty easy as well. First up, we need to import a couple of things to make life easier.

import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

With these imported, we can actually write the data to our dedicated SQL pool:

df.write.synapsesql("briwh.dbo.GameData2020", Constants.INTERNAL)

The Constants enumeration has two relevant values: INTERNAL and EXTERNAL. An internal table is one which lives in the dedicated SQL pool. An external table uses PolyBase to access the Spark DataFrame (but not the Cosmos DB collection!) remotely. I chose to make this an internal table, so there is a one-time write as I execute this statement and the end result is that I now have a new table in the dedicated SQL pool:

GameData2020 sounds like a cheap action game from the mid-90s.
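As an aside, the same connector can read that table back into Spark, which makes for an easy verification that the load worked (this assumes the same imports as above):

```scala
// Read the freshly-loaded table back from the dedicated SQL pool
// and confirm the row count matches what we wrote.
import org.apache.spark.sql.SqlAnalyticsConnector._

val check = spark.read.synapsesql("briwh.dbo.GameData2020")
println(check.count())
```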

As a note, because I created an internal table, this is a one-time load. I could always schedule a process to drop the dedicated SQL pool table (if it exists), load data from Cosmos DB, move the data into a staging table on the dedicated SQL pool, and merge the data together. But that’s a little outside the scope of today’s post.
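As a sketch, the Spark side of that scheduled process might look something like this. The stage schema and table names here are hypothetical, and the drop and merge steps would live in T-SQL (for example, in a pipeline script activity), since the connector only creates new tables:

```scala
// Hypothetical sketch of a repeatable load: re-read from the Cosmos DB
// analytical store and land the result in a staging table. A separate
// T-SQL step would drop the staging table beforehand and merge it into
// dbo.GameData2020 afterward.
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

val staged = spark.read.format("cosmos.olap").
    option("spark.synapse.linkedservice", "CosmosDb1").
    option("spark.cosmos.container", "GameData2020").
    load()

// Constants.INTERNAL creates a brand-new table, which is why the
// staging table must not already exist when this runs.
staged.write.synapsesql("briwh.stage.GameData2020", Constants.INTERNAL)
```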


In this post, we saw how to enable Cosmos DB’s Analytical store, access data using Synapse Link for Cosmos DB, and use the Spark to Synapse SQL Connector to move that data into a dedicated SQL pool. We saw how to do this in a workspace using a managed virtual network with data exfiltration protection enabled, meaning this is the largest number of steps necessary.

