Today I’m going to cover something I’ve been waiting to do for just about three years: use PolyBase to connect one SQL Server instance to another SQL Server instance. In this post, we’ll cover the basic setup and a couple points of interest I’ve found in writing this chapter of the book.
Setup: One Less Amigo
Historically, PolyBase has three separate external entities: external data sources, external file formats, and external tables. External data sources tell SQL Server where the remote data is stored. External file formats tell SQL Server what the shape of that data looks like—in other words, CSV, tab-separated, Parquet, ORC, etc. External tables tell SQL Server the structure of some data of a particular external file format at a particular external data source.
With PolyBase V2—connectivity with SQL Server, Cosmos DB, Oracle, Spark, Hive, and a boatload of other external data sources—we no longer need external file formats because we ingest structured data. Therefore, we only need an external data source and an external table. You will need SQL Server 2019 to play along and I’d recommend keeping up on CTPs—PolyBase is under active development so being a CTP behind may mean hitting bugs which have subsequently been fixed.
External Data Source
As I create an external data source, I’m also going to use a database-scoped credential. In this case, I’m going to use SQL authentication though you can also use Windows authentication.
IF NOT EXISTS ( SELECT 1 FROM sys.database_scoped_credentials dsc WHERE dsc.name = N'SqlWin10Credentials' ) BEGIN CREATE DATABASE SCOPED CREDENTIAL SqlWin10Credentials WITH IDENTITY = 'PolyBaseUser', Secret = '<<Some Password>>'; END GO IF NOT EXISTS ( SELECT 1 FROM sys.external_data_sources e WHERE e.name = N'SQLWIN10' ) BEGIN CREATE EXTERNAL DATA SOURCE SQLWIN10 WITH ( LOCATION = 'sqlserver://SQLWIN10', PUSHDOWN = ON, CREDENTIAL = SqlWin10Credentials ); END GO
There are a couple of things worth noting here. First, the
LOCATION parameter for our external data source is a URI which uses the
sqlserver protocol. PolyBase V1 used a configuration setting to determine what the external data source was, but for the new connectors, we define it in the location itself.
SQLWIN10 is the name of my SQL Server instance, and I’m going to use
SqlWin10Credentials to access that remote box.
Second, we have the option to disable predicate pushdown if we so desire. I’d recommend keeping it on in general and don’t know of a scenario where I’d definitely turn it off.
Here is a simple external table definition:
CREATE EXTERNAL TABLE [dbo].[CityPopulationCenter] ( [CityName] [varchar](120) NOT NULL, [PopulationCenterName] [varchar](30) NOT NULL ) WITH ( LOCATION = 'Scratch.dbo.CityPopulationCenter', DATA_SOURCE = SQLWIN10 );
There are a couple points of note here as well. In the
LOCATION parameter, we need to use three-part naming: database, schema, table. Our external data source did not specify a database, so we therefore need to specify it here.
Second, the data types need to be compatible with the remote server’s data types. You have a table or view defined on the other side, so that’s pretty simple copy-pasta. Just note that we don’t need (and aren’t allowed to use) any types of constraints, including primary keys, foreign keys, unique keys, default constraints, check constraints, and the like.
The good news is that once you’ve defined this external table, you can directly query it as
dbo.CityPopulationCenter and end users won’t know or care that it’s an external table. Note that you do need to open your database master key in sessions which intend to query external tables, as that is how SQL Server can decrypt your password to send over to the remote SQL Server instance.
I’m not going to hit too much more because I’m saving it for the book, but here are a couple interesting things I’ve found while working with PolyBase to SQL Server:
- Performance is generally around the same as when working with linked servers. Sometimes it’s better, sometimes it’s worse.
- When using scale-out groups, performance tends to be a little better, even though the other side is just a single SQL Server instance.
- PolyBase can create an external table to a remote view.
- PolyBase can not create an external table to another external table.
Even though it’s still in CTP and I think there are quite a few things Microsoft can do to make it better, I’m happy that direct SQL-to-SQL support is here.