Today, we are going to explore inserting data into a Hadoop cluster using Polybase. As always, I am using a Hortonworks sandbox, but as long as you are using a supported distribution, that’s not particularly relevant.
Step One: Configuration
The first thing we need to do is enable data modification using Polybase. I’m not quite sure why this is a configuration setting, but maybe we’ll find out as the series continues… Fortunately, configuration is easy: it’s just a setting in sp_configure.
USE OOTP GO EXEC sp_configure 'allow polybase export', 1; GO RECONFIGURE GO
This does not require a server reboot, so as soon as I run the script, we’re ready to go. Note that all of my work will be in the OOTP database, but you can use whichever database you’d like.
Step Two: Working On A New Table
Create A Table
The next thing I’d like to do is create a new external table to support my insertions. I could use an existing external table, but just in case I mess things up, I want to create a new one… In this scenario, I’m going to suppose that I want to archive information on second basemen that I already have in a Player.SecondBasemen table in SQL Server. I’d like to use the same structure as my SQL Server table (which was itself a table whose data I retrieved from Hadoop, but never mind that):
CREATE EXTERNAL TABLE [dbo].[SecondBasemenTest] ( [FirstName] [varchar](50) NULL, [LastName] [varchar](50) NULL, [Age] [int] NULL, [Throws] [varchar](5) NULL, [Bats] [varchar](5) NULL ) WITH ( DATA_SOURCE = [HDP2], LOCATION = N'/tmp/ootp/SecondBasemenTest/', FILE_FORMAT = [TextFileFormat], REJECT_TYPE = VALUE, REJECT_VALUE = 5 ) GO
After running the script above, I can check Ambari and see that I have a new folder:
That folder is currently empty, and its owner is pdw_user, which is the account that Polybase tries to use by default.
Load Some Data
Once I have the table, I would like to run an insert statement to load data from Player.SecondBasemen into my SecondBasemenTest external table.
INSERT INTO dbo.SecondBasemenTest ( FirstName, LastName, Age, Throws, Bats ) SELECT sb.FirstName, sb.LastName, sb.Age, sb.Bats, sb.Throws FROM Player.SecondBasemen sb;
This query successfully inserts 777 records into the dbo.SecondBasemenTest folder. If I check Ambari, I can see that the Polybase engine created eight separate files, only one of which it used this time:
If I click on the first file, I can see a preview of that file:
As we expected, the file is in comma-separated values format, as that’s what I defined TextFileFormat to be.
Loading More Data
Continuing on, I can insert more rows into this table. To separate the old rows from the new rows, I added a few characters to each first name and last name:
INSERT INTO dbo.SecondBasemenTest ( FirstName, LastName, Age, Throws, Bats ) SELECT sb.FirstName + 'elo', sb.LastName + 'elo', sb.Age, sb.Bats, sb.Throws FROM Player.SecondBasemen sb;
The results are pretty interesting. First, Polybase created eight new files rather than updating existing files:
As if that weren’t interesting enough, this time it decided to split the data between two files instead of pushing it all into one file. This says to me that the insert pattern is not trivial, that there are several factors that go into how the Polybase engine decides to write data to disk.
Reading Data From Disk
The next question is, does this affect how I query this external table?
SELECT sb.FirstName, sb.LastName, sb.Age, sb.Bats, sb.Throws FROM dbo.SecondBasemenTest sb;
The query is the same as if I had loaded the data manually (or using some other process), and the results are what we would expect:
Something of mild interest is that it does not iterate through all of the files one-by-one; if it did that, I would have expected the first “elo” players starting at row 778. It does read in chunks, though, instead of interleaving records from each file.
Testing Data Modification
Now that we’ve played around with inserting data, I’d like to see if I can update or delete data. Let’s try updating my data:
UPDATE sb SET FirstName = FirstName + 'reddy' FROM dbo.SecondBasemenTest sb;
The answer is apparent in the subsequent message:
Msg 46519, Level 16, State 16, Line 70
DML Operations are not supported with external tables.
DELETE FROM dbo.SecondBasemenTest;
End result: same error message. In short, you can insert data but you cannot update or delete that data through Polybase. If you need to modify data later, use other methods. For example, if you have partitioned data out into separate files and want to delete older partitions of data, you can delete the files themselves. Alternatively, you could migrate rows that you want to keep into a new table and drop and re-create your external table script.
Execution Plan Differences
I was curious at this point what the execution plan would look like for an insert statement, and here it is:
The new operator here is a Put operator, and we can hover over it to get more details:
Note the number of distributions, which is equal to number of nodes * distributions per node. I believe the number of distributions is what controls the number of files which get created.
Inserting Into A Single File
The last thing I wanted to try was to insert into a single file rather than a folder. Ideally, I’d like one big file to which I’d keep appending data. It was obvious that creating an external table and pointing it to a folder wouldn’t do the trick, so how about the following?
CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest] ( [FirstName] [varchar](50) NULL, [LastName] [varchar](50) NULL, [Age] [int] NULL, [Throws] [varchar](5) NULL, [Bats] [varchar](5) NULL ) WITH ( DATA_SOURCE = [HDP2], LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemenfile.csv', FILE_FORMAT = [TextFileFormat], REJECT_TYPE = VALUE, REJECT_VALUE = 5 ) GO --Insert into new table INSERT INTO dbo.SecondBasemenFileTest ( FirstName, LastName, Age, Throws, Bats ) SELECT sb.FirstName, sb.LastName, sb.Age, sb.Bats, sb.Throws FROM Player.SecondBasemen sb;
The results were…not what I expected.
Instead of creating a CSV file, it created a folder with the name secondbasemenfile.csv. So let’s teach it a lesson. Instead of letting it create a new folder, I would create a file called secondbasemen.csv and have it point to that file.
DROP EXTERNAL TABLE dbo.SecondBasemenFileTest; --Moved an actual file into SecondBasemenFileTest/secondbasemen.csv CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest] ( [FirstName] [varchar](50) NULL, [LastName] [varchar](50) NULL, [Age] [int] NULL, [Throws] [varchar](5) NULL, [Bats] [varchar](5) NULL ) WITH (DATA_SOURCE = [HDP2],LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv', FILE_FORMAT = [TextFileFormat], REJECT_TYPE = VALUE, REJECT_VALUE = 5) GO --First, show that we have records in here. SELECT sb.FirstName, sb.LastName, sb.Age, sb.Throws, sb.Bats FROM dbo.SecondBasemenFileTest sb; --Now try to insert: INSERT INTO dbo.SecondBasemenFileTest ( FirstName, LastName, Age, Throws, Bats ) SELECT sb.FirstName, sb.LastName, sb.Age, sb.Bats, sb.Throws FROM Player.SecondBasemen sb;
In this case, it successfully dropped the old table and created a new external table based off of my CSV. I ran the select statement to prove that I could query the data, and did in fact get 777 results. When I tried to run the insert statement, however:
Msg 7320, Level 16, State 102, Line 162
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “SQLNCLI11”. EXTERNAL TABLE access failed because the specified path name ‘hdfs://sandbox.hortonworks.com:8020/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv’ does not exist. Enter a valid path and try again.
What’s interesting is the error message itself is correct, but could be confusing. Note that it’s looking for a path with this name, but it isn’t seeing a path; it’s seeing a file with that name. Therefore, it throws an error.
This proves that you cannot control insertion into a single file by specifying the file at create time. If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files. You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.
In this post, we looked at inserting data into external tables which live on HDFS. The only DML operation we can run is INSERT, and whenever we insert data, the Polybase engine creates new files with the inserted data.