PolyBase and External Column Names

The SQL Server 2019 CTP 3.2 release notes includes a couple lines of text which are easy to miss:

External table column names are now used for querying SQL Server, Oracle, Teradata, MongoDB, and ODBC data sources. In previous CTP releases, the columns were bound only based on ordinal on the destination and column names in external table definition was not used.

Let’s explain what this means.

So You Have a Table

I’m going to use external two tables in this experiment. In the left corner, we have some ORC files stored in Azure Blob Storage which we’ll represent as FireIncidents2017. In the right corner, we have data stored in a remote SQL Server instance which we’ll call LineItem. The data doesn’t really matter that much, but to give you an idea of where we’re going, I’ll show each table. First, FireIncidents2017:

USE [Scratch]
GO
CREATE EXTERNAL TABLE [dbo].[FireIncidents2017]
(
	[X] [float] NULL,
	[Y] [float] NULL,
	[OBJECTID] [int] NULL,
	[incident_number] [nvarchar](500) NULL,
	[incident_type] [smallint] NULL,
	[incident_type_description] [nvarchar](500) NULL,
	[arrive_date_time] [datetime2](7) NULL,
	[cleared_date_time] [datetime2](7) NULL,
	[dispatch_date_time] [datetime2](7) NULL,
	[exposure] [tinyint] NULL,
	[platoon] [nvarchar](50) NULL,
	[station] [tinyint] NULL,
	[address] [nvarchar](500) NULL,
	[address2] [nvarchar](500) NULL,
	[apt_room] [nvarchar](500) NULL,
	[GlobalID] [nvarchar](50) NULL,
	[CreationDate] [datetime2](7) NULL,
	[Creator] [nvarchar](50) NULL,
	[EditDate] [datetime2](7) NULL,
	[Editor] [nvarchar](50) NULL
)
WITH (DATA_SOURCE = [AzureFireIncidentsBlob],LOCATION = N'FireIncidents2017/',FILE_FORMAT = [OrcFileFormat],REJECT_TYPE = VALUE,REJECT_VALUE = 1)
GO

And now, LineItem:

USE [Scratch]
GO
CREATE EXTERNAL TABLE [ELT].[LineItem]
(
	[LineItemID] [int] NOT NULL,
	[BusID] [int] NOT NULL,
	[VendorID] [int] NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
	[EmployeeID] [int] NOT NULL,
	[LineItemDate] [date] NOT NULL,
	[Amount] [decimal](13, 2) NOT NULL
)
WITH (DATA_SOURCE = [SQLCONTROL],LOCATION = N'ForensicAccounting.dbo.LineItem')
GO

These two external tables follow their sources exactly: the raw data in the FireIncidents2017 folder have exactly the same structure as our dbo.FireIncidents2017 external table, down to column ordering and data types. And for LineItem, here’s what it looks like on the remote server named SQLCONTROL:

USE [ForensicAccounting]
GO
CREATE TABLE [dbo].[LineItem]
(
	[LineItemID] [int] IDENTITY(1,1) NOT NULL,
	[BusID] [int] NOT NULL,
	[VendorID] [int] NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
	[EmployeeID] [int] NOT NULL,
	[LineItemDate] [date] NOT NULL,
	[Amount] [decimal](13, 2) NOT NULL,
CONSTRAINT [PK_LineItem] PRIMARY KEY CLUSTERED 
(
	[LineItemID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

Again, you can see data types and columns are exactly the same. The only difference is that we can’t have identity integers on external table definitions.

Shuffling Things Up

Let’s say that we just don’t like this order because it reminds us of an old ex-girlfriend or something. Instead, we want to move EmployeeID and LineItemDate above BusID because that’s clearly the best thing to do.

With SQL Server 2019 CTP 3.2 and later, you can do just that:

CREATE EXTERNAL TABLE [ELT].[LineItemTest]
(
	[LineItemID] [int] NOT NULL,
	[EmployeeID] [int] NOT NULL,
	[LineItemDate] [date] NOT NULL,
	[BusID] [int] NOT NULL,
	[VendorID] [int] NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
	[Amount] [decimal](13, 2) NOT NULL
)
WITH (DATA_SOURCE = [SQLCONTROL],LOCATION = N'ForensicAccounting.dbo.LineItem')
GO

Querying the table gives us rows back in our new order.

Finally, the column order for the civilized man systemically incapable of selecting anything other than *.

This is what the release notes mean by binding by column name rather than ordinal value.

But the notes are also specific about the data sources which support this. Let’s try with Azure Blob Storage, moving two columns up between the X and Y values:

CREATE EXTERNAL TABLE [dbo].[FireIncidents2017Test]
(
	[X] [float] NULL,
	[apt_room] [nvarchar](500) NULL,
	[GlobalID] [nvarchar](50) NULL,
	[Y] [float] NULL,
	[OBJECTID] [int] NULL,
	[incident_number] [nvarchar](500) NULL,
	[incident_type] [smallint] NULL,
	[incident_type_description] [nvarchar](500) NULL,
	[arrive_date_time] [datetime2](7) NULL,
	[cleared_date_time] [datetime2](7) NULL,
	[dispatch_date_time] [datetime2](7) NULL,
	[exposure] [tinyint] NULL,
	[platoon] [nvarchar](50) NULL,
	[station] [tinyint] NULL,
	[address] [nvarchar](500) NULL,
	[address2] [nvarchar](500) NULL,
	[CreationDate] [datetime2](7) NULL,
	[Creator] [nvarchar](50) NULL,
	[EditDate] [datetime2](7) NULL,
	[Editor] [nvarchar](50) NULL
)
WITH (DATA_SOURCE = [AzureFireIncidentsBlob],LOCATION = N'FireIncidents2017/',FILE_FORMAT = [OrcFileFormat],REJECT_TYPE = VALUE,REJECT_VALUE = 1)
GO

This create statement will succeed, but as soon as you try to query the table, you get a nastygram:

This error message is almost useful, so it’s got that going for it.

This tells us that we can’t switch column order around on ORC files in SQL Server 2019. That’s also true for Parquet and flat files. There needs to be metadata which includes column names for this to work.

Goofing About

Now that we know we can shuffle around column order, let’s try a couple other experiments. Put on your safety goggles and lab coats.

Removing Columns

Let’s say we only want to include some of the columns. A targeted layoff of some of our columns—EmployeeID and LineItemDate—will bring up our stock price a bit, so let’s give that a try.

DROP EXTERNAL TABLE [ELT].[LineItemTest];
CREATE EXTERNAL TABLE [ELT].[LineItemTest]
(
	[LineItemID] [int] NOT NULL,
	[BusID] [int] NOT NULL,
	[VendorID] [int] NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
	[Amount] [decimal](13, 2) NOT NULL
)
WITH (DATA_SOURCE = [SQLCONTROL],LOCATION = N'ForensicAccounting.dbo.LineItem')
GO

This works:

Line Items, now with 28% less filler.

Furtive Additions

Now let’s do something sneaky and add a column to the dbo.LineItem table on SQLCONTROL.

ALTER TABLE dbo.LineItem ADD Chaff BIT NULL;

If we query the LineItemTest external table, we still get exactly the columns we expected based on the initial definition—we don’t get the new column, but we also don’t get an error.

Renaming Columns

What if we want to call BusID VehicleID? We’ll bring back the other two columns and keep everything in its original order.

CREATE EXTERNAL TABLE [ELT].[LineItemTest]
(
	[LineItemID] [int] NOT NULL,
	[VehicleID] [int] NOT NULL,
	[VendorID] [int] NOT NULL,
	[ExpenseCategoryID] [tinyint] NOT NULL,
	[EmployeeID] [int] NOT NULL,
	[LineItemDate] [date] NOT NULL,
	[Amount] [decimal](13, 2) NOT NULL
)
WITH (DATA_SOURCE = [SQLCONTROL],LOCATION = N'ForensicAccounting.dbo.LineItem')
GO

Trying to run this gives us a useful error:

Msg 105083, Level 16, State 1, Line 6

105083;The following columns in the user defined schema are incompatible with the external table schema for table ‘LineItemTest’: user defined column: ‘VehicleID’ was not found in the external table. The detected external table schema is: ([LineItemID] INT NOT NULL, [BusID] INT NOT NULL, [VendorID] INT NOT NULL, [ExpenseCategoryID] TINYINT NOT NULL, [EmployeeID] INT NOT NULL, [LineItemDate] DATE NOT NULL, [Amount] DECIMAL(13, 2) NOT NULL).

Therefore, renaming columns is right out.

Conclusion

Here’s what we learned today:

  • “Metadata-rich” external data sources like SQL Server, Oracle, Teradata, etc., offer a bit more flexibility as of SQL Server 2019 CTP 3.2.
  • You may re-order the columns when defining an external table and this will not cause failure.
  • You may add columns to remote tables without breaking queries against external tables.
  • You may not re-order the columns or leave columns out of the specification when dealing with data sources like flat files, ORC/Parquet files, etc., at least as of CTP 3.2.
  • You may not use column names other than the ones used on the remote tables.

One thought on “PolyBase and External Column Names

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s