Let’s Get Some Data!

Yesterday, I showed how to install and configure Polybase.  Today, we’re going to do two things:  get a small data set (for Polybase testing) and get a large data set (for some real testing).  Tomorrow, we’re going to use Polybase (finally!) to grab some data.

Get Data!

We are going to build a data set of second basemen from the BattingRatings.csv file in my Hadoop Integration GitHub repo.  This is a small data set with a lot of columns, so we’re going to upload the file, create a Hive table, and then grab a subset of the data.

Upload via Ambari

Connect to Ambari, which runs on port 8080 of your Hortonworks sandbox.  Then, go to the HDFS Files section.

[Screenshot: the HDFS Files view in Ambari]

Once you’re there, let’s create a directory and upload a file.  Go to the tmp directory and then click the “New Directory” button near the top right-hand side of the screen.  We’ll first create a directory called ootpoutput (which we’ll use soon), and then create a directory called ootp (noting that this is case sensitive!).  Inside that directory, we’ll upload BattingRatings.csv.
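If you’d rather script these steps than click through the UI, Hive’s dfs pass-through command can create the directories and copy the file from a Hive CLI session on the sandbox (assuming you have shell access to it).  This is just an alternative sketch of the Ambari steps above, and the local file path is hypothetical:

-- Create the two HDFS directories (names are case sensitive).
dfs -mkdir -p /tmp/ootp;
dfs -mkdir -p /tmp/ootpoutput;

-- Copy BattingRatings.csv from the sandbox's local filesystem into HDFS.
-- Adjust the local path to wherever you placed the file.
dfs -copyFromLocal /home/hdfs/BattingRatings.csv /tmp/ootp/;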

Create BattingRatingsPerm Table

We’re going over to Hive next to create a table.  In the same Ambari drop-down where you found HDFS Files, go to Hive View.  This gives us a web interface for running Hive scripts.  It’s no SSMS, but it’ll do.

[Screenshot: the Hive View in Ambari]

In the main box, we’re going to create a new table called battingratingsperm.  Run the following code:

CREATE TABLE `battingratingsperm`(
`firstname` string,
`lastname` string,
`age` int,
`bats` string,
`throws` string,
`speed` int,
`stealing` int,
`baserunning` int,
`position` string,
`contact` int,
`gap` int,
`power` int,
`eye` int,
`strikeouts` int,
`hbp` int,
`babip` int,
`contactr` int,
`gapr` int,
`powerr` int,
`eyer` int,
`strikeoutsr` int,
`hbpr` int,
`babipr` int,
`contactl` int,
`gapl` int,
`powerl` int,
`eyel` int,
`strikeoutsl` int,
`hbpl` int,
`babipl` int,
`contactpotential` int,
`gappotential` int,
`powerpotential` int,
`eyepotential` int,
`strikeoutspotential` int,
`infieldrange` int,
`infieldarm` int,
`infielderror` int,
`turndoubleplay` int,
`outfieldrange` int,
`outfieldarm` int,
`outfielderror` int,
`catcherability` int,
`fieldexperiencec` int,
`fieldexperience1b` int,
`fieldexperience2b` int,
`fieldexperience3b` int,
`fieldexperiencess` int,
`fieldexperiencelf` int,
`fieldexperiencecf` int,
`fieldexperiencerf` int,
`fieldratingc` int,
`fieldrating1b` int,
`fieldrating2b` int,
`fieldrating3b` int,
`fieldratingss` int,
`fieldratinglf` int,
`fieldratingcf` int,
`fieldratingrf` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES("skip.header.line.count"="1");

LOAD DATA INPATH '/tmp/ootp/BattingRatings.csv' OVERWRITE INTO TABLE battingratingsperm;

So let’s walk through this code.  The CREATE TABLE statement looks pretty similar to a T-SQL CREATE TABLE statement, except that I am using backticks rather than brackets to delimit field names.  The backticks aren’t absolutely necessary here (because none of the field names contain special characters), but The More You Know™.
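For comparison, here’s a minimal side-by-side sketch (the table and column names are just for illustration); backticks play the same role in HiveQL that square brackets play in T-SQL:

-- HiveQL: backticks quote identifiers (optional when the names need no escaping).
CREATE TABLE `demo` (`firstname` string, `lastname` string);

-- T-SQL: square brackets serve the same purpose.
CREATE TABLE [demo] ([FirstName] NVARCHAR(50), [LastName] NVARCHAR(50));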

Next, the ROW FORMAT clause says that we have a comma-delimited file with newlines separating records.  It’s a text file, so we can use TextInputFormat.  The good thing about TextInputFormat is that it automatically handles certain compression codecs (like gzip and bzip2), which is quite helpful for giant files.
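For example, if you had uploaded a bzip2-compressed copy of the file instead (the path below is hypothetical; this walkthrough doesn’t need it), the load statement would look exactly the same, and Hive would decompress the data transparently when you query it:

-- Hypothetical: a compressed copy loads the same way, with no changes to the table definition.
LOAD DATA INPATH '/tmp/ootp/BattingRatings.csv.bz2' OVERWRITE INTO TABLE battingratingsperm;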

Our output format is HiveIgnoreKeyTextOutputFormat.  Here’s where I plead ignorance and admit that I don’t actually know why this format exists or what its importance is.  I do want to skip the first line in the file, though, because it’s just a header; that’s what the skip.header.line.count table property handles.
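Before moving on, you can spot-check the load with a couple of quick queries.  This is just a hypothetical sanity check, not a required step:

-- Confirm that the header row was skipped and the columns line up.
SELECT firstname, lastname, age, position
FROM battingratingsperm
LIMIT 10;

-- A row count makes a decent smoke test as well.
SELECT COUNT(*) FROM battingratingsperm;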

At this point, you’ll have a table called battingratingsperm.  From that table, we’re going to create a file in /tmp/ootp called SecondBasemen.csv.  First, we’ll create a Hive job to write out data to /tmp/ootpoutput:

INSERT OVERWRITE DIRECTORY '/tmp/ootpoutput/'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT FirstName, LastName, Age, Bats, Throws FROM BattingRatingsPerm WHERE Position = '2B';

Once you execute that job, go to HDFS Files and then /tmp/ootpoutput.  Inside that folder, you’ll see a file that is 21.8KB in size.

[Screenshot: the /tmp/ootpoutput directory in the HDFS Files view]

We’re going to rename that file to SecondBasemen.csv, and then delete the .hive-staging file.  To delete a file, click the trash can.  To rename a file, right-click on the file name and you’ll see a Rename option.  Rename that to SecondBasemen.csv and you’re good to go.

Get More Data!

The 21.8KB file of second basemen isn’t exactly the type of thing you spin up a Hadoop cluster to analyze.  That’s just going to act as a test file, to ensure that we’re doing this correctly.  For the main event, I recommend grabbing a bigger data set.  For a laptop running a single-node sandbox, the airline on-time performance data set is a good candidate, and it’s what we’ll use.  There are a few reasons for this:

  1. The data set is reasonably clean.  The biggest problem with grabbing data sets from the Internet is that so many of them are full of messy data.  Real life is messy, but it’s hard to explain a point when you have to spend 20 hours cleaning up data oddities.
  2. The data set is reasonably large.  In total, the data set is about 1.65 GB compressed (bz2 format).  There’s a 10:1 compression ratio, meaning that uncompressed the data set is over 16 GB.  That’s large enough to start talking about Hadoop.  In total, we have something like 160 million rows.  That’s not a huge quantity of data for a production cluster, but again, works well on a one-node sandbox.
  3. The data set is reasonably interesting.  It’s easy to understand and although the data set stops in 2008, airline arrivals and delays are still an issue.
  4. The supplemental data helps tie everything together.  I grabbed the airports.csv and carriers.csv files and loaded them into tables in my SQL Server database (see the sketch after this list).  That way, I can join the large data set to my local data set and (hopefully) show off some of the value of predicate pushdown.
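For reference, here’s a minimal T-SQL sketch of how those supplemental tables might look.  The table shapes and file paths are my own assumptions, so adjust the column names to match the files you download; the source files may also quote their text fields, in which case you might need to clean them up or use a format file rather than a bare BULK INSERT:

-- Assumed layouts for the two lookup files; tweak to match what you actually download.
CREATE TABLE dbo.Airports
(
    IATA VARCHAR(5),
    Airport VARCHAR(100),
    City VARCHAR(100),
    State VARCHAR(50),
    Country VARCHAR(50),
    Latitude DECIMAL(9, 6),
    Longitude DECIMAL(9, 6)
);

CREATE TABLE dbo.Carriers
(
    Code VARCHAR(10),
    Description VARCHAR(200)
);

-- Hypothetical local paths; FIRSTROW = 2 skips each file's header row.
BULK INSERT dbo.Airports
FROM 'C:\Temp\airports.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

BULK INSERT dbo.Carriers
FROM 'C:\Temp\carriers.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);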

Grab the files and move them (compressed) into /tmp/airport (a directory you’ll need to create).  You can upload them one at a time through the Ambari interface or whip up a quick program to upload files to HDFS.  Check out my Hadoop presentation for tips on how to write a .NET program to upload files (and then check out the code).

Conclusion

So far, we’ve done the basic installation work for Polybase and have loaded a couple of data sets that we’ll use.  In the next post, I’m going to hook everything together and we’ll pull data from Hadoop using T-SQL.
