PolyBase Revealed: SQL Server to SQL Server

Today I’m going to cover something I’ve been waiting to do for just about three years: use PolyBase to connect one SQL Server instance to another SQL Server instance. In this post, we’ll cover the basic setup and a couple points of interest I’ve found in writing this chapter of the book.

Setup: One Less Amigo

Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.

With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.

External Data Source

As I create an external data source, I’m also going to use a database-scoped credential. In this case, I’m going to use SQL authentication though you can also use Windows authentication.

IF NOT EXISTS
(
	SELECT 1
	FROM sys.database_scoped_credentials dsc
	WHERE
		dsc.name = N'SqlWin10Credentials'
)
BEGIN
	CREATE DATABASE SCOPED CREDENTIAL SqlWin10Credentials
	WITH IDENTITY = 'PolyBaseUser', Secret = '<<Some Password>>';
END
GO
IF NOT EXISTS
(
	SELECT 1
	FROM sys.external_data_sources e
	WHERE
		e.name = N'SQLWIN10'
)
BEGIN
	CREATE EXTERNAL DATA SOURCE SQLWIN10 WITH
	(
		LOCATION = 'sqlserver://SQLWIN10',
		PUSHDOWN = ON,
		CREDENTIAL = SqlWin10Credentials
	);
END
GO

There are a couple of things worth noting here. First, the LOCATION parameter for our external data source is a URI which uses the sqlserver protocol. PolyBase V1 used a configuration setting to determine what the external data source was, but for the new connectors, we define it in the location itself. SQLWIN10 is the name of my SQL Server instance, and I’m going to use SqlWin10Credentials to access that remote box.

Second, we have the option to disable predicate pushdown if we so desire. I’d recommend keeping it on in general and don’t know of a scenario where I’d definitely turn it off.

External Table

Here is a simple external table definition:

CREATE EXTERNAL TABLE [dbo].[CityPopulationCenter]
(
	[CityName] [varchar](120) NOT NULL,
	[PopulationCenterName] [varchar](30) NOT NULL
)
WITH
(
	LOCATION = 'Scratch.dbo.CityPopulationCenter',
	DATA_SOURCE = SQLWIN10
);

There are a couple points of note here as well. In the LOCATION parameter, we need to use three-part naming: database, schema, table. Our external data source did not specify a database, so we therefore need to specify it here.

Second, the data types need to be compatible with the remote server’s data types. You have a table or view defined on the other side, so that’s pretty simple copy-pasta. Just note that we don’t need (and aren’t allowed to use) any types of constraints, including primary keys, foreign keys, unique keys, default constraints, check constraints, and the like.

The good news is that once you’ve defined this external table, you can directly query it as dbo.CityPopulationCenter and end users won’t know or care that it’s an external table. Note that you do need to open your database master key in sessions which intend to query external tables, as that is how SQL Server can decrypt your password to send over to the remote SQL Server instance.

Quick Notes

I’m not going to hit too much more because I’m saving it for the book, but here are a couple interesting things I’ve found while working with PolyBase to SQL Server:

  • Performance is generally around the same as when working with linked servers. Sometimes it’s better, sometimes it’s worse.
  • When using scale-out groups, performance tends to be a little better, even though the other side is just a single SQL Server instance.
  • PolyBase can create an external table to a remote view.
  • PolyBase can not create an external table to another external table.

Conclusion

Even though it’s still in CTP and I think there are quite a few things Microsoft can do to make it better, I’m happy that direct SQL-to-SQL support is here.

Advertisements