At this point, we’ve got Polybase configured and have created external data sources and file formats (and by the way, if you haven’t been keeping up, I am tracking the entire Polybase series in one place).  Now we want to create an external table to do something with all of this.  As usual, my first stop is MSDN, which gives us the following syntax for creating an external table from within SQL Server or Azure SQL Database.

Looking At Syntax

CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name   
    ( <column_definition> [ ,...n ] )  
    WITH (   
        LOCATION = 'folder_or_filepath',  
        DATA_SOURCE = external_data_source_name,  
        FILE_FORMAT = external_file_format_name  
        [ , <reject_options> [ ,...n ] ]  
    )  
[;]  
  
<reject_options> ::=  
{  
    | REJECT_TYPE = value | percentage  
    | REJECT_VALUE = reject_value  
    | REJECT_SAMPLE_VALUE = reject_sample_value  
  
}  
  
-- Create a table for use with Elastic Database query  
CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name   
    ( <column_definition> [ ,...n ] )  
    WITH ( <sharded_external_table_options> )  
[;]  
  
<sharded_external_table_options> ::=  
        DATA_SOURCE = external_data_source_name,   
        SCHEMA_NAME = N'nonescaped_schema_name',  
        OBJECT_NAME = N'nonescaped_object_name',  
        [DISTRIBUTION = SHARDED(sharding_column_name) | REPLICATED | ROUND_ROBIN]  

For now, we’re going to care about the first version; I’ll look at Elastic Database later on in this series, but for this post, we’re looking at Hadoop.

The create statement itself is pretty easy, and the first half of the statement looks like what you’d expect from any other table, save for the term “EXTERNAL” in the create statement.  It’s when you get down to the WITH clause that things start getting a bit different.

First, you have to define a location, data source, and file format.  LOCATION is the path to the file or folder.  In my case, the data lives in HDFS, so I specify that path; for example, I have a file at ‘/tmp/ootp/secondbasemen.csv’ and that is exactly what goes into LOCATION.  Note that you don’t need to put in the IP address, port, or anything else, because those details are already defined in the external data source.  If you don’t remember what they are, you can check the catalog view:

SELECT *
FROM sys.external_data_sources eds
WHERE
	eds.name = 'HDP2';

The location column on that row contains the first half of the full path (the protocol, IP address, and port), so we don’t need to specify those details again in the external table’s LOCATION.
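
If you’re wondering what that first half looks like, here’s a minimal sketch of how an external data source like HDP2 might be defined; the host name and port below are assumptions for illustration, not my actual cluster settings.

CREATE EXTERNAL DATA SOURCE [HDP2] WITH
(
	TYPE = HADOOP,
	LOCATION = N'hdfs://sandbox.hortonworks.com:8020'
);

With a data source like that in place, an external table whose LOCATION is ‘/tmp/ootp/secondbasemen.csv’ ends up reading from hdfs://sandbox.hortonworks.com:8020/tmp/ootp/secondbasemen.csv.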

The DATA_SOURCE and FILE_FORMAT options are easy:  pick your external data source and external file format of choice.

The last major section deals with rejection.  We’re going from a semi-structured system to a structured system, and sometimes there are bad rows in our data, as there are no strict structural checks before records get inserted.  The Hadoop mindset is that there are two places in which you can perform data quality checks:  in the original client (pushing data into HDFS) and in any clients reading data from HDFS.  To make things simpler for us, the Polybase engine will outright reject any records which do not adhere to the quality standards you define when you create the table.  For example, let’s say that we have an Age column for each of our players, and that each age is an integer.  If the first row of our file has headers, then the first row will literally read “Age” and conversion to integer will fail.  Polybase rejects this row (removing it from the result set stream) and increments a rejection counter.  What happens next depends upon the reject options.
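
To make that concrete, here’s a hypothetical look at the first couple of lines of such a file; the data itself is made up for illustration:

FirstName,LastName,Age,Throws,Bats
Sam,Smith,28,R,R

The literal string “Age” in the header row cannot convert to an integer, so Polybase rejects that row, while the data row beneath it converts cleanly.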

There are two rejection types, value and percentage.  When you set REJECT_TYPE to VALUE, it’s pretty easy:  a query fails as soon as the number of failed records passes REJECT_VALUE.  For example, if you set REJECT_VALUE to 10, your query will succeed as long as it encounters no more than 10 bad rows, but as soon as you hit that 11th bad row, the query fails with an error.

Msg 8680, Level 17, State 26, Line 35
Internal Query Processor Error: The query processor encountered an unexpected error during the processing of a remote query phase.

Looking at the compute node errors, we can see the issue:

SELECT TOP(10) *
FROM sys.dm_exec_compute_node_errors decne
ORDER BY
	decne.create_time DESC;

Looking at the details column, we can see why this query failed:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed. (/tmp/ootp/secondbasemen.csv) Column ordinal: 0, Expected data type: INT, Offending value: Casey  (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

I set the reject value to 0 and the first row caused me to fail.

If I set the REJECT_TYPE to PERCENTAGE, then the REJECT_VALUE becomes the percent of records which need to be bad before the query fails, and I need to specify REJECT_SAMPLE_VALUE.  I ran a test with a sample value of 100 and a reject percentage of 1, and here’s what the results look like:

107091;Query aborted-- the maximum reject threshold (1 %) was reached while reading from an external source: 777 rows rejected out of total 777 rows processed.

The MSDN documentation states that it should try to run a check every REJECT_SAMPLE_VALUE rows, but in this case, it pulled all 777 and rejected all of them.

Focus On Data Types

The linked MSDN document above shows how each SQL data type converts to a Hadoop data type.  The mappings are fairly straightforward, though note that the money and smallmoney types convert to doubles in Hadoop, meaning that money is imprecise; decimal types in SQL Server convert to decimals in Hadoop, so if you want to avoid floating point irregularities, stick to decimal.
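
As a quick sketch of why that matters, here are two ways you might define a salary column; the table name, column names, and file path below are hypothetical, not part of the original data set.

CREATE EXTERNAL TABLE [dbo].[PlayerSalaries]
(
	[PlayerID] [INT] NULL,
	[SalaryMoney] [MONEY] NULL,            -- converts to a double on the Hadoop side, so precision can drift
	[SalaryDecimal] [DECIMAL](19, 4) NULL  -- converts to a decimal, so the value stays exact
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/salaries.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);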

Sample Script

Here is a sample to create a table called SecondBasemen.

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[Age] [INT] NULL,
	[Throws] [VARCHAR](5) NULL,
	[Bats] [VARCHAR](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/secondbasemen.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);

By this point, there’s nothing special about this table creation statement, and there aren’t that many other options you can choose.  You could change the reject type to be percentage-based:

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[Age] [INT] NULL,
	[Throws] [VARCHAR](5) NULL,
	[Bats] [VARCHAR](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/secondbasemen.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = PERCENTAGE,
	REJECT_VALUE = 15,
	REJECT_SAMPLE_VALUE = 500
);

Miscellany

External tables do not store their data in SQL Server; they are truly external, and every query against them is a remote query.  You can create statistics on them (which I’ve talked about in the past and will discuss in more detail later), but there’s no such thing as indexing an external table, so the only thing SQL Server holds locally about that table is metadata.
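
As a quick sketch, creating statistics looks the same as it does for a regular table; here I’m using the SecondBasemen table from above with a statistics name I made up.

CREATE STATISTICS [st_SecondBasemen_Age] ON [dbo].[SecondBasemen] ([Age]) WITH FULLSCAN;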

Speaking of metadata, when you create the table, it shows up in sys.tables like any other table.  If you filter on the is_external flag, you can see all external tables in the current database:

SELECT *
FROM sys.tables
WHERE
	is_external = 1;
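
If you want the Polybase-specific details rather than just the row in sys.tables, the sys.external_tables catalog view exposes them; here’s a sketch joining it back to the external data source and file format:

SELECT
	et.name,
	et.location,
	et.reject_type,
	et.reject_value,
	eds.name AS data_source_name,
	eff.name AS file_format_name
FROM sys.external_tables et
	INNER JOIN sys.external_data_sources eds
		ON et.data_source_id = eds.data_source_id
	INNER JOIN sys.external_file_formats eff
		ON et.file_format_id = eff.file_format_id;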

Conclusion

The script to create an external table is pretty easy, with relatively few options.  You will want to fine-tune those rejection counts, though.  The problem is that if you set the rejection count too low, you’ll run the risk of having queries fail after a very long time, likely without returning any data to you.  Imagine that you have a 20 billion row table and you’re returning 10 million records, of which 50 are bad.  If you set the reject value to 49, then you might have to do all of that work on 10 million records just to watch it fail at the very end.  But at the same time, if you have a large percentage of records failing, you’d want to know that (especially if you can fix the HDFS files) rather than setting a reject count to 100% and blissfully ignoring the problem.

If you need to drop an external table, it’s easy:  use the DROP EXTERNAL TABLE command.
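
For example, to get rid of the SecondBasemen table from above:

DROP EXTERNAL TABLE [dbo].[SecondBasemen];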
