Understanding Blob Storage Behavior

This is a continuation of my Polybase series.

In the last chapter of the Polybase series, I looked at creating external tables pointing to files in Azure Blob Storage.  Today, we’re going to take a deeper look at how the Polybase engine interacts with Azure Blob Storage.  Doing this is as simple as running a basic query.  Here is the query I will run:

SELECT TOP(1000)
	f.month,
	f.arrdelay
FROM dbo.Flights2008 f
WHERE
	f.dest = N'CMH'
	AND f.dayofmonth = 14;

Execution Plan Details

The execution plan was pretty simple:

[Image: execution plan]

As far as the XML goes, here’s the full plan if you’re curious:

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.5" Build="13.0.1722.0" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="67.2391" StatementId="1" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" CardinalityEstimationModelVersion="130" StatementSubTreeCost="0.032413" StatementText="SELECT TOP(1000)
 f.month,
 f.arrdelay
FROM dbo.Flights2008 f
WHERE
 f.dest = N'CMH'
 AND f.dayofmonth = 14" StatementType="SELECT" QueryHash="0x955C4C3E381C844B" QueryPlanHash="0x53368F42C59FAD79" RetrievedFromCache="true" SecurityPolicyApplied="false">
          <StatementSetOptions ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="true" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="true" />
          <QueryPlan DegreeOfParallelism="0" NonParallelPlanReason="NoParallelForPDWCompilation" CachedPlanSize="120" CompileTime="50" CompileCPU="49" CompileMemory="208">
            <Warnings>
              <PlanAffectingConvert ConvertIssue="Seek Plan" Expression="CONVERT_IMPLICIT(nvarchar(100),[f].[dest],0)=N'CMH'" />
            </Warnings>
            <MemoryGrantInfo SerialRequiredMemory="0" SerialDesiredMemory="0" />
            <OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="417021" EstimatedPagesCached="104255" EstimatedAvailableDegreeOfParallelism="2" />
            <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Compute Scalar" NodeId="0" Parallel="false" PhysicalOp="Compute Scalar" EstimatedTotalSubtreeCost="0.032413">
              <OutputList>
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
              </OutputList>
              <ComputeScalar>
                <DefinedValues>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[month] as [f].[month]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                  <DefinedValue>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                    <ScalarOperator ScalarString="[OOTP].[dbo].[Flights2008].[arrdelay] as [f].[arrdelay]">
                      <Identifier>
                        <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                      </Identifier>
                    </ScalarOperator>
                  </DefinedValue>
                </DefinedValues>
                <RelOp AvgRowSize="65" EstimateCPU="0.032413" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimatedExecutionMode="Row" EstimateRows="67.2391" LogicalOp="Remote Query" NodeId="1" Parallel="false" PhysicalOp="Remote Query" EstimatedTotalSubtreeCost="0.032413">
                  <OutputList>
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="month" />
                    <ColumnReference Database="[OOTP]" Schema="[dbo]" Table="[Flights2008]" Alias="[f]" Column="arrdelay" />
                  </OutputList>
                  <RunTimeInformation>
                    <RunTimeCountersPerThread Thread="0" ActualRebinds="1" ActualRewinds="0" ActualRows="1000" Batches="0" ActualEndOfScans="1" ActualExecutions="1" ActualExecutionMode="Row" ActualElapsedms="265523" ActualCPUms="80822" ActualScans="0" ActualLogicalReads="0" ActualPhysicalReads="0" ActualReadAheads="0" ActualLobLogicalReads="0" ActualLobPhysicalReads="0" ActualLobReadAheads="0" />
                  </RunTimeInformation>
                  <RemoteQuery RemoteSource="Polybase_ExternalComputation" RemoteQuery="&lt;?xml version=&quot;1.0&quot; encoding=&quot;utf-8&quot;?&gt;
&lt;dsql_query number_nodes=&quot;1&quot; number_distributions=&quot;8&quot; number_distributions_per_node=&quot;8&quot;&gt;
 &lt;sql&gt;ExecuteMemo explain query&lt;/sql&gt;
 &lt;dsql_operations total_cost=&quot;1&quot; total_number_operations=&quot;10&quot;&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_23&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;RND_ID&quot;&gt;
 &lt;identifier&gt;TEMP_ID_24&lt;/identifier&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;MULTI&quot;&gt;
 &lt;dsql_operation operation_type=&quot;PARTITION_MOVE&quot;&gt;
 &lt;operation_cost cost=&quot;1.678287936&quot; accumulative_cost=&quot;1.678287936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;AllDistributions&quot; /&gt;
 &lt;source_statement&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
 [T3_1].[month] AS [month],
 [T3_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_23] AS T3_1
 WHERE ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
 WHERE ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1&lt;/source_statement&gt;
 &lt;destination&gt;Control&lt;/destination&gt;
 &lt;destination_table&gt;[TEMP_ID_24]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ExternalRoundRobinMove&quot;&gt;
 &lt;operation_cost cost=&quot;964.621896&quot; accumulative_cost=&quot;966.300183936&quot; average_rowsize=&quot;208&quot; output_rows=&quot;45211&quot; /&gt;
 &lt;external_uri&gt;wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2&lt;/external_uri&gt;
 &lt;destination_table&gt;[TEMP_ID_23]&lt;/destination_table&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;AllDistributions&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_23]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;STREAMING_RETURN&quot;&gt;
 &lt;operation_cost cost=&quot;1&quot; accumulative_cost=&quot;967.300183936&quot; average_rowsize=&quot;104&quot; output_rows=&quot;67.2391&quot; /&gt;
 &lt;location distribution=&quot;Control&quot; /&gt;
 &lt;select&gt;SELECT [T1_1].[month] AS [month],
 [T1_1].[arrdelay] AS [arrdelay]
FROM (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
 [T2_1].[arrdelay] AS [arrdelay]
 FROM [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1&lt;/select&gt;
 &lt;/dsql_operation&gt;
 &lt;dsql_operation operation_type=&quot;ON&quot;&gt;
 &lt;location permanent=&quot;false&quot; distribution=&quot;Control&quot; /&gt;
 &lt;sql_operations&gt;
 &lt;sql_operation type=&quot;statement&quot;&gt;DROP TABLE [tempdb].[dbo].[TEMP_ID_24]&lt;/sql_operation&gt;
 &lt;/sql_operations&gt;
 &lt;/dsql_operation&gt;
 &lt;/dsql_operations&gt;
&lt;/dsql_query&gt;" />
                </RelOp>
              </ComputeScalar>
            </RelOp>
          </QueryPlan>
        </StmtSimple>
      </Statements>
    </Batch>
  </BatchSequence>
</ShowPlanXML>

And here is the decoded remote query:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="1" total_number_operations="10">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_23</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_23] ([month] INT, [dayofmonth] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS, [dest] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_23'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_23] WITH ROWCOUNT = 5651, PAGECOUNT = 143</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_24</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_24] ([month] INT, [arrdelay] VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="PARTITION_MOVE">
        <operation_cost cost="1.678287936" accumulative_cost="1.678287936" average_rowsize="104" output_rows="67.2391" />
        <location distribution="AllDistributions" />
        <source_statement>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   (SELECT CONVERT (NVARCHAR (100), [T3_1].[dest], 0) COLLATE SQL_Latin1_General_CP1_CI_AS AS [col],
                       [T3_1].[month] AS [month],
                       [T3_1].[arrdelay] AS [arrdelay]
                FROM   [tempdb].[dbo].[TEMP_ID_23] AS T3_1
                WHERE  ([T3_1].[dayofmonth] = CAST ((14) AS INT))) AS T2_1
        WHERE  ([T2_1].[col] = CAST (N'CMH' COLLATE SQL_Latin1_General_CP1_CI_AS AS NVARCHAR (3)) COLLATE SQL_Latin1_General_CP1_CI_AS)) AS T1_1</source_statement>
        <destination>Control</destination>
        <destination_table>[TEMP_ID_24]</destination_table>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="964.621896" accumulative_cost="966.300183936" average_rowsize="208" output_rows="45211" />
        <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2008.csv.bz2</external_uri>
        <destination_table>[TEMP_ID_23]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_23]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="STREAMING_RETURN">
      <operation_cost cost="1" accumulative_cost="967.300183936" average_rowsize="104" output_rows="67.2391" />
      <location distribution="Control" />
      <select>SELECT [T1_1].[month] AS [month],
       [T1_1].[arrdelay] AS [arrdelay]
FROM   (SELECT TOP (CAST ((1000) AS BIGINT)) [T2_1].[month] AS [month],
                                             [T2_1].[arrdelay] AS [arrdelay]
        FROM   [tempdb].[dbo].[TEMP_ID_24] AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_24]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

There are a couple of interesting details here.  First, this plan looks pretty similar to the Hadoop plan when I did not induce a MapReduce job.  We can see 8 distributions per node, just like in the on-prem Hadoop scenario.  Second, as expected, our external URI points to the WASBS (secure WASB) blob location.  Given that I’m using the WASBS protocol, I expect the connection that pulls this data down to use SSL/TLS.  And there’s one easy way to find out.

Wireshark Details

While running this query, I also pulled a packet capture.  After figuring out the IP address (which wasn’t hard, given that there were about 93,000 packets in my packet capture associated with it), I was able to get the following details.

Overall Picture

[Image: packet capture]

We can see in the image above that Polybase uses TLS 1.2 to communicate with Azure Blob Storage when you access Blob Storage using the WASBS protocol.  The connection is encrypted, so we don’t get much useful information out of it unless we configure Wireshark to decrypt the traffic.  For what we want to do, that’s not necessary.

The Download

What I do want to see is whether my statement that the Polybase engine needs to download the entire file is correct.  To do that, I can filter on the TCP stream associated with my file download and look for any packets with a length of more than 1000 bytes.

[Image: download size]

In this case, all of those packets were 1514 bytes, so it’s an easy multiplication problem to see that we downloaded approximately 113 MB.  The 2008.csv.bz2 file itself is 108 MB, so factoring in TCP packet overhead and that there were additional, smaller packets in the stream, I think that’s enough to show that we did in fact download the entire file.  Just like in the Hadoop scenario without MapReduce, the Polybase engine needs to take all of the data and load it into a temp table (or set of temp tables if you’re using a Polybase scale-out cluster) before it can pull out the relevant rows based on our query.

Conclusion

Looking at the execution plan and packet capture, it’s pretty clear that using Polybase with Azure Blob Storage behaves similarly to Polybase with Hadoop as long as you don’t use MapReduce.

Connecting To Blob Storage With Polybase

This is a continuation of my Polybase series.

To this point, I have focused my Polybase series on interactions with on-premises Hadoop, as it’s the use case most apropos to me.  I want to start expanding that out to include other interaction mechanisms, and I’m going to start with one of the easiest:  Azure Blob Storage.

Ayman El-Ghazali has a great blog post on the topic, which he turned into a full-length talk.  As such, this post will fill in the gaps rather than start from scratch.  In today’s post, my intention is to retrieve data from Azure Blob Storage and get an idea of what’s happening.  From there, we’ll spend a couple more posts on Azure Blob Storage, looking a bit deeper into the process.  That said, my expectation going into this series is that much of what we do with Azure Blob Storage will mimic what we did with Hadoop, as there are no Polybase core concepts unique to Azure Blob Storage, at least none of which I am aware.

Starting Point:  Files In Blob Storage

I have copied the flight data from 2003 through 2008 up to Azure Blob Storage using Azure Management Studio.  Note that I’ve put my uploads into a folder called historical; that way, when I create external tables later, I can segregate those files from other files I upload.

[Image: uploaded files]

I also have SQL Server 2016 and the Polybase engine installed on my local machine, configured the same way I did for on-prem Hadoop.  Note that the Polybase Connectivity Configuration option 7 will work for Azure Blob Storage as well as modern versions of Hortonworks, so I don’t need to change that.
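
Just in case you need to set or verify that value, it lives in the ‘hadoop connectivity’ option in sp_configure.  Here is a quick sketch; unlike most sp_configure options, this one only takes effect after you restart SQL Server and the two Polybase services.

-- Option 7 covers Azure Blob Storage as well as recent Hortonworks versions.
EXEC sp_configure 'hadoop connectivity', 7;
GO
RECONFIGURE;
GO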

External Data Source

My first step is to set up an external data source for Azure Blob Storage.  If we go back to the MSDN documentation, we can see that the syntax for creating an external data source which uses Azure Blob Storage is pretty straightforward:

-- PolyBase only: Azure Storage Blob as data source   
-- (on SQL Server 2016 and Azure SQL Data Warehouse)  
CREATE EXTERNAL DATA SOURCE data_source_name  
    WITH (   
        TYPE = HADOOP,  
        LOCATION = 'wasb[s]://container@account_name.blob.core.windows.net'
        [, CREDENTIAL = credential_name ]
    )  
[;]

We specify the type to be HADOOP and point to the storage account, along with a credential.  In my case, the container is called csflights and my account name is cspolybase, as you can see in the picture above.

But before I can create the data source, I need to create a master key for my database (if I don’t have one already) and create a database-scoped credential which contains my secret for Azure Blob Storage.  To get the secret, I need to go to the Azure Portal and select the “Access keys” option in settings and copy one of the keys they generate.

[Image: access key]

Once I have that, I can fill out the T-SQL statements to create my data source:

USE OOTP
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential  
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (   
    TYPE = HADOOP,  
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

Something important to note here is that Azure Blob Storage does not have a resource manager location.  This makes sense, as the resource manager (YARN) exists to allow you to push code to a Hadoop cluster to perform MapReduce jobs remotely.  All we can do with Azure Blob Storage is pull the data over the wire and perform all filters and joins within SQL Server.
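
For contrast, an on-prem Hadoop data source such as my HDP2 source looks roughly like the following.  Treat this as a sketch:  the host name and ports are Hortonworks sandbox defaults and will differ on your cluster.  The RESOURCE_MANAGER_LOCATION option is the piece that lets Polybase push MapReduce work down to the cluster, and it has no equivalent for Blob Storage.

-- Sketch of a Hadoop data source, for contrast.  Host name and ports are sandbox defaults.
CREATE EXTERNAL DATA SOURCE [HDP2]
WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020',
    -- The YARN resource manager; Azure Blob Storage has nothing like this.
    RESOURCE_MANAGER_LOCATION = 'sandbox.hortonworks.com:8050'
);
GO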

External File Format

The external file format is the same as what we would use for on-premises Hadoop.  Just in case you don’t have one set up, here it is:

USE [OOTP]
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);
GO

External Tables

Now that we have the data source and file format set up, we can create some external tables.  The first table I want to create revolves around 2008 data.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

My first table just looks at flight information for 2008 so I can double-check my work.  A quick look at the top 10 records looks good:

SELECT TOP(10) *
FROM Flights2008 f;

[Image: top 10 results]

The second table I want to create covers all of the data, not just a particular year.

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

We can run a check to ensure that records look okay here. In this scenario, instead of just pulling down 10 results, I want to make sure that we get data for each year.

SELECT
	fa.[year],
	COUNT(1) AS NumberOfRecords
FROM dbo.FlightsAll fa
GROUP BY
	fa.[year]
ORDER BY
	fa.[year];

Naturally, this query took a bit longer to run, as I had to download the files, decompress them, process them, and aggregate their results locally.  During this process, the mpdwsvc service was my biggest resource user, followed by a stretch in which SQL Server ate up a healthy amount of CPU time.

[Image: mpdwsvc resource usage]
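
If you want to see what the engine is doing while a long-running query like this executes, the Polybase DMVs can help.  Here is a quick sketch against sys.dm_exec_external_work; I’m only picking out a handful of columns, and the exact column list may vary a bit by version.

-- Peek at the external reads the Polybase engine is performing.
SELECT
	w.execution_id,
	w.input_name,
	w.bytes_processed,
	w.status,
	w.start_time
FROM sys.dm_exec_external_work w
ORDER BY
	w.start_time DESC;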

When the query finished, it gave me my expected results:

[Image: aggregate results]

This particular query took me just over 5 minutes to run, getting data for approximately 42 million rows.

Conclusion

At this point, we’ve set up a Polybase connection to Azure Blob Storage.  In the next Polybase series post, we’re going to look at execution plans, packet captures, and other tools to gain a better understanding of the process.  After that, I will put together another post on inserting data into Azure Blob Storage.

Brave New World

For the past three years, I’ve worked as a Database Engineer—in other words, as a database developer—at ChannelAdvisor.  2 1/2 years of that time was spent working in the digital marketing space.  Coming into this job, I had worked at small-scale organizations:  the smallest cabinet-level department in Ohio, followed by a relatively small subsidiary of a large insurance company.  Working at ChannelAdvisor helped me build up skills as a database developer, figuring out that things which work well with a hundred thousand rows in a table don’t necessarily work well when that table hits a billion rows.

Well, come January 1st, I will no longer be a Database Engineer.  That’s because I’m going to be the Engineering Manager of a new predictive analytics team.

Wait, Management?

Yeah, this feels a little crazy for me as well.  The me of five years ago would never have wanted to be a manager, and the reasoning would have been the same as for other technical people:  I enjoy being on the front line, doing things rather than filling out paperwork.

Since then, I would not say that my priorities have changed much:  I still want to be on the front line, using technology to solve business problems.  What I get, though, is a force multiplier:  I now have two more people who can help me accomplish great things.

Vision

Something I’ve observed during the last few years of work is that we have a tremendous amount of interesting data at the company, and we throw away even more due to space and budget constraints.  What we have not been so good at is taking full advantage of that data to help customers.  Most of our systems are designed around a single customer’s data.  Obviously, our transactional systems are keyed toward individual customers, rather than aggregating their results.  What’s interesting is that even our warehouses tend to be customer-focused rather than company-focused.

My vision on predictive analytics is to blow out our company’s disk budget.

It is also to take advantage of this data and solve problems for customers, between customers, and for the company as a whole.  We have the data, and it will be my job to put together the tools to collect (in a place which does not harm our transactional processes), process, aggregate, and analyze the data.  Without getting into specifics, I want to close the internal gap between what we could conceivably do and what we can do in practice.

Plan of Action:  Data Engineering

In order to pull off my vision, I’ve got to build up skills on a number of fronts, all at the same time.  There are four major quadrants I need to hit; the good news is that I’m building a team to help me with two of them.  I’m going to start with the Data Engineering Crucible, in which I (hopefully) take people with complementary skills across the following three axes and build them up into people with strong skills across all three.

Statistics

Doing a few analytics projects has reminded me that I need to re-take some stats courses.  My last statistics course was in graduate school, and that was mostly statistics for economists, meaning lots of regressions.  I’m bringing in an employee who has a pretty strong background in this, and so I plan to lean on that person pretty heavily (and push that person to get better at the same time).

Development

My .NET skills have stagnated the last few years.  That makes sense, as I don’t write as much .NET code as before.  The good news is that by hanging around the .NET user group and working on projects both at work and for presentations, I haven’t forgotten much there.  I also want to have my other employee bring in a strong development background to help the team get better.

Aside from .NET development (F# for life!), we’ll use other languages too.  I have some experience with R and Python, and that experience is about to grow significantly.  I have a lot of experience with SQL that I will share with the team, as well as some Java/Scala experience and whatever other crazy languages we decide on.

Subject Matter Expertise

I’ve worked on digital marketing for the past 2 1/2 years, but that’s only a part of what the company does.  My job will be to work with my team to train them on DM but also learn more about the rest of the company’s data and processes.

Plan of Action:  Management

Aside from hitting the Data Engineering trifecta, it’s time to ramp up management skills.  I have some ideas, and I’ll probably share more as I do a bit more.  Right now, it involves reading some books and thinking through various plans, like how I want to run meetings or drive sprint work.  After a few months, I hope to have a post up which describes some of this and see how things work out.

Conclusion

Over the past year, I have been itching for a team lead position.  Now that I have it, I’ll be like a dog with a chew toy, though probably a little less destructive.

New Talks For 2017

2017 is just around the corner, and my intent is to revamp my presentation lineup for the new year.  Here is a quick look at how my lineup shapes up for 2017, with the proviso that by August, it will probably change.

Headline Talks

These are the talks I lead with.  They’re fun and fit a particular niche that not too many people are hitting.

  1. R for the SQL Developer.  Already finished.  I may tweak the talk a little bit, but I think at its core, it’s a good talk.
  2. Kafka for .NET Developers.  Already finished.  I’ve given this a couple of times and have an idea of how it flows.  I’m really looking forward to pushing it next year.
  3. APPLY Yourself.  Already finished.  I’ve been giving this talk for a few years now, and I’ll be giving it a few years from now too.
  4. Much Ado About Hadoop.  Need to revamp.  The old version of my talk was written in 2014 and is woefully out of date now.  I intend this to be a high-level overview of the Hadoop ecosystem, focusing on Spark and streaming more than “classic” MapReduce operations.
  5. Polybase.  Need to write.  I also need a good title for this.  I think Polybase will be a big talk for me in the second half of 2017.

Midline Talks

These are the talks that I’m still happy to give, or which I want to create but don’t think they’ll see huge play next year.

  1. Peanut Butter and Chocolate:  Integrating Hadoop and SQL Server.  Already finished.  It’s a stable talk and still quite relevant, but I gave it 14 times last year, so that limits the number of places I can submit it.
  2. Securing SQL Server.  Already finished.  It’s nice having a security talk.  I (semi-)jokingly describe my set of talks as the weird side of SQL Server, but having this talk and the APPLY operator talk keeps me somewhat grounded.
  3. Big Data, Small Data, and Everything In Between.  Already finished.  I will probably need to revamp the lists, but I’m still happy with this high-level overview of the data platform space.
  4. Rent-A-Cluster:  HDInsight In Action.  Need to write.  I wanted to have a talk specifically on HDInsight, as most of my Hadoop talks focus around using the local sandbox.
  5. Who Needs A Cluster?  AWS Aurora and Azure Data Lake In Action.  Need to write.  This feeds off of my Rent-A-Cluster talk and focuses on two Platform-as-a-Service offerings which give you parts of Hadoop without dealing with the administration effort.

Other Talks

These are still good talks (at least in my biased opinion!), but I see them as more niche offerings that I’d submit only under certain specific circumstances.

  1. Client Migration with Biml.  Already finished.  I’ve enjoyed this talk and it’s still a viable topic, as familiarity with Biml is still surprisingly low among Integration Services developers.
  2. Working Effectively with Legacy SQL.  Needs revisions.  I tried pushing this one in 2016 and other talks were more popular.  I think that’s fair, although this does fit into a mainline SQL topic:  cleaning up older code which might have been a good decision for SQL Server 2005 but doesn’t make as much sense in SQL Server 2016.  Speaking of which, I’m going to revamp the talk a bit and include some SQL Server 2016 functionality as well.
  3. Power BI Custom Visuals.  Already finished.  I gave this talk a couple of times and it was fun.  The problem is that it can’t crack the top ten talks, so it probably would be a niche topic for user groups.

Inserting Into Hadoop

See the entire Polybase series.

Today, we are going to explore inserting data into a Hadoop cluster using Polybase.  As always, I am using a Hortonworks sandbox, but as long as you are using a supported distribution, that’s not particularly relevant.

Step One:  Configuration

The first thing we need to do is enable data modification using Polybase.  I’m not quite sure why this is a configuration setting, but maybe we’ll find out as the series continues…  Fortunately, configuration is easy:  it’s just a setting in sp_configure.

USE OOTP
GO
EXEC sp_configure 'allow polybase export', 1;
GO
RECONFIGURE
GO

This does not require a server reboot, so as soon as I run the script, we’re ready to go.  Note that all of my work will be in the OOTP database, but you can use whichever database you’d like.

Step Two:  Working On A New Table

Create A Table

The next thing I’d like to do is create a new external table to support my insertions.  I could use an existing external table, but just in case I mess things up, I want to create a new one…  In this scenario, I’m going to suppose that I want to archive information on second basemen that I already have in a Player.SecondBasemen table in SQL Server.  I’d like to use the same structure as my SQL Server table (which was itself a table whose data I retrieved from Hadoop, but never mind that):

CREATE EXTERNAL TABLE [dbo].[SecondBasemenTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenTest/',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

As a quick reminder, my data source and file format are Polybase settings I have already created.

After running the script above, I can check Ambari and see that I have a new folder:

[Image: SecondBasemenTest folder created]

That folder is currently empty, and its owner is pdw_user, which is the account that Polybase tries to use by default.

Load Some Data

Once I have the table, I would like to run an insert statement to load data from Player.SecondBasemen into my SecondBasemenTest external table.

INSERT INTO dbo.SecondBasemenTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb;

This query successfully inserts 777 records into the dbo.SecondBasemenTest folder.  If I check Ambari, I can see that the Polybase engine created eight separate files, only one of which it used this time:

[Image: files created on insert]

If I click on the first file, I can see a preview of that file:

[Image: saved as CSV]

As we expected, the file is in comma-separated values format, as that’s what I defined TextFileFormat to be.
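
I haven’t shown the definition of TextFileFormat in this post.  It’s a delimited text format along the lines of the CsvFileFormat I used for Azure Blob Storage; a rough sketch follows, though the options on my instance may differ slightly.

-- Approximation of my TextFileFormat definition; your options may vary.
CREATE EXTERNAL FILE FORMAT [TextFileFormat] WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS
	(
		FIELD_TERMINATOR = N',',
		USE_TYPE_DEFAULT = True
	)
);
GO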

Loading More Data

Continuing on, I can insert more rows into this table.  To separate the old rows from the new rows, I added a few characters to each first name and last name:

INSERT INTO dbo.SecondBasemenTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName + 'elo',
	sb.LastName + 'elo',
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb;

The results are pretty interesting.  First, Polybase created eight new files rather than updating existing files:

[Image: new files]

As if that weren’t interesting enough, this time it decided to split the data between two files instead of pushing it all into one file.  This says to me that the insert pattern is not trivial, that there are several factors that go into how the Polybase engine decides to write data to disk.

Reading Data From Disk

The next question is, does this affect how I query this external table?

SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM dbo.SecondBasemenTest sb;

The query is the same as if I had loaded the data manually (or using some other process), and the results are what we would expect:

[Image: query results]

Something of mild interest is that it does not iterate through all of the files one by one; if it did that, I would have expected the first “elo” players to start at row 778.  It does read in chunks, though, rather than interleaving records from each file.

Testing Data Modification

Now that we’ve played around with inserting data, I’d like to see if I can update or delete data.  Let’s try updating my data:

UPDATE sb
SET
	FirstName = FirstName + 'reddy'
FROM dbo.SecondBasemenTest sb;

The answer is apparent in the subsequent message:

Msg 46519, Level 16, State 16, Line 70
DML Operations are not supported with external tables.

DELETE
FROM dbo.SecondBasemenTest;

End result: the same error message.  In short, you can insert data but you cannot update or delete that data through Polybase.  If you need to modify data later, use other methods.  For example, if you have partitioned data out into separate files and want to delete older partitions of data, you can delete the files themselves.  Alternatively, you could migrate the rows that you want to keep into a new table and then drop and re-create your external table; a sketch of that approach follows.
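
Here is a rough sketch of that migrate-and-keep approach.  The new table name, folder, and the filter on Age are all made up for illustration; the point is that the “update” happens by writing out a new data set rather than modifying the old one.

-- Rough sketch of migrating the rows we want to keep; names and the Age filter are illustrative.
CREATE EXTERNAL TABLE [dbo].[SecondBasemenKeep]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenKeep/',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

--Copy over only the rows we still want.
INSERT INTO dbo.SecondBasemenKeep
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM dbo.SecondBasemenTest sb
WHERE
	sb.Age <= 30;
GO

--Dropping the external table removes only the metadata; clean up the old folder in HDFS separately.
DROP EXTERNAL TABLE dbo.SecondBasemenTest;
GO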

Execution Plan Differences

I was curious at this point what the execution plan would look like for an insert statement, and here it is:

[Image: execution plan]

The new operator here is a Put operator, and we can hover over it to get more details:

[Image: Put operator details]

Note the number of distributions, which is equal to the number of nodes times the number of distributions per node; in my case, that’s 1 node with 8 distributions per node, or 8 distributions.  I believe the number of distributions is what controls the number of files which get created, which lines up with the eight files we saw above.
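
If you want to watch those distributions at work during an insert, the data movement DMVs give a per-distribution view.  Here is a sketch against sys.dm_exec_dms_workers; I’m only picking a few columns, and the exact column list may vary by version.

-- One row per DMS worker, which lets you watch how the insert spreads across distributions.
SELECT
	w.execution_id,
	w.step_index,
	w.distribution_id,
	w.rows_processed,
	w.bytes_processed,
	w.status
FROM sys.dm_exec_dms_workers w
ORDER BY
	w.execution_id DESC,
	w.distribution_id;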

Inserting Into A Single File

The last thing I wanted to try was to insert into a single file rather than a folder.  Ideally, I’d like one big file to which I’d keep appending data.  It was obvious that creating an external table and pointing it to a folder wouldn’t do the trick, so how about the following?

CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [HDP2],
	LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemenfile.csv',
	FILE_FORMAT = [TextFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

--Insert into new table
INSERT INTO dbo.SecondBasemenFileTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb;

The results were…not what I expected.

[Image: second basemen first file attempt]

Instead of creating a CSV file, it created a folder with the name secondbasemenfile.csv.  So let’s teach it a lesson.  Instead of letting it create a new folder, I would create a file called secondbasemen.csv and have it point to that file.

DROP EXTERNAL TABLE dbo.SecondBasemenFileTest;

--Moved an actual file into SecondBasemenFileTest/secondbasemen.csv
CREATE EXTERNAL TABLE [dbo].[SecondBasemenFileTest]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH (DATA_SOURCE = [HDP2],LOCATION = N'/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv', FILE_FORMAT = [TextFileFormat], REJECT_TYPE = VALUE, REJECT_VALUE = 5)
GO

--First, show that we have records in here.
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Throws,
	sb.Bats
FROM dbo.SecondBasemenFileTest sb;

--Now try to insert:
INSERT INTO dbo.SecondBasemenFileTest
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb;

In this case, it successfully dropped the old table and created a new external table based off of my CSV.  I ran the select statement to prove that I could query the data, and did in fact get 777 results.  When I tried to run the insert statement, however:

Msg 7320, Level 16, State 102, Line 162
Cannot execute the query “Remote Query” against OLE DB provider “SQLNCLI11” for linked server “SQLNCLI11”. EXTERNAL TABLE access failed because the specified path name ‘hdfs://sandbox.hortonworks.com:8020/tmp/ootp/SecondBasemenFileTest/secondbasemen.csv’ does not exist. Enter a valid path and try again.

What’s interesting is that the error message is technically correct but potentially confusing.  It is looking for a path (that is, a directory) with this name, but what actually exists there is a file with that name, so it throws an error.

This proves that you cannot control insertion into a single file by specifying the file at create time.  If you do want to keep the files nicely packed (which is a good thing for Hadoop!), you could run a job on the Hadoop cluster to concatenate all of the results of the various files into one big file and delete the other files.  You might do this as part of a staging process, where Polybase inserts into a staging table and then something kicks off an append process to put the data into the real tables.

Conclusion

In this post, we looked at inserting data into external tables which live on HDFS.  The only DML operation we can run is INSERT, and whenever we insert data, the Polybase engine creates new files with the inserted data.

Speaking: 2016 In Review

This has been a busy year for me as a speaker.  With my last public talk in the books, I get to take a few weeks of rest and preparation for next year.  Now, let’s see how close I got to my 2016 goals.  For reference, here’s how things looked on May 30th.

Here are my four goals:

  1. Present four new talks.
  2. Present to two brand new audiences.
  3. Speak at 12 SQL Saturdays and eight user groups.
  4. Get paid to speak at a conference.

Present Four New Talks

I had already succeeded by May 30th.  Since then, I added the following repeatable talks:

  1. Securing SQL Server
  2. R For the SQL Server Developer
  3. Getting Started with Custom Visuals in Power BI
  4. Kafka for .NET Developers

I have plans for three more talks next year.  I’ve actually gotten to the point that I’m archiving some older talks so that my active presentations page isn’t quite as large.

Present to Two Brand New Audiences

By brand new audience, I meant getting into places which almost never see relational database professionals.  I did speak at the Raleigh Future of Data group in August.  Aside from that, I spoke at the Carolina IT Pro Group in November; they’re typically sysadmins rather than developers or database administrators, and it was nice to be able to walk into a room full of people who could school me on most infrastructure items and talk to them about infrastructure (specifically, securing a SQL Server instance).  I helped them learn some, they taught me some things, and that’s how it should be.

Speak at 12 SQL Saturdays and 8 User Groups

Yep, done that.  Here’s the final breakdown for SQL Saturdays:

  1. Cleveland (2016-02-06)
  2. Tampa (2016-02-27)
  3. Chicago (2016-03-05)
  4. Richmond (2016-03-19)
  5. Madison (2016-04-09)
  6. Baltimore (2016-04-30)
  7. Jacksonville (2016-05-07)
  8. Rochester (2016-05-14)
  9. Atlanta (2016-05-21)
  10. Portland, ME (2016-06-04)
  11. Chattanooga (2016-06-25)
  12. Columbus, OH (2016-07-16)
  13. Louisville (2016-08-06)
  14. Indianapolis (2016-08-13)
  15. Spartanburg (2016-08-20)
  16. Columbus, GA (2016-08-27)
  17. Charlotte (2016-09-17)
  18. Kansas City (2016-09-24)
  19. Pittsburgh (2016-10-01)
  20. Orlando (2016-11-12)
  21. Washington, DC (2016-12-03)
  22. Providence (2016-12-10)

That was a lot of Saturdays away from home.  I have passed my user group goal as well.

All in all, I gave 53 public presentations at 45 separate events.

Get Paid to Speak at a Conference

Okay, really, that goal should be “speak at a paid conference.”  I did speak at DevTeach 2016; it was a great experience and I plan to submit again.  I also want to apply to a few more paid conferences as well, now that I have two years as a speaker under my belt.

Conclusion

I was able to succeed at all four goals, and I’m excited about next year.

Hall of Fame Coverage

I deeply apologize to you, dear readers. I have been quite busy with other projects and have failed you. We have two Hall of Fame ballots to discuss for baseball and I haven’t talked about either one! Shall we? Yes, I believe we shall!

Author’s note: They went and elected people to the HOF while this article was in draft mode, the jerks. Therefore, I will keep my original Veterans’ Committee piece, but will say who won at the end (so you don’t try to cheat).

First, the Veterans’ Committee will consider the “Today’s Game” Ballot, which includes players from 1988 to the Present. Like the normal HOF, you need 75% of the vote, which means 12 ballots. There are ten candidates:

Harold Baines: Baines was a very good player for a very long time, but if you’re a guy (or gal) obsessed with peak, he’s not your pick. He’s well regarded, which is a point in his favor, but he barely cracked the 5% mark. He’s a better version of Tony Perez without Joe Morgan in his corner. Compared to other OFs, he’s terrible, and he’s not a good enough DH to make it ahead of Edgar Martinez. Offensively, Baines just wasn’t a huge force; a 121 OPS+ isn’t hugely impressive. Pass.

Albert Belle: Belle is the opposite of Baines in many ways. Belle was awesome at his peak, especially 1995. He was also an asshole and made few friends. Belle’s peak is noteworthy. By rate statistics, Belle was an amazing hitter: he had only two seasons with an OPS+ under 100, and in one of those, he had all of 25 PAs. That said, his career was extremely short. If he’d played five more years, he could have approached 600 home runs, and this would be a much more interesting conversation. The combination of his personality and short career will doom him. Pass.

Will Clark: I kind of like the idea of Clark in the Hall of Fame. He was actually surprisingly good (I genuinely didn’t remember much about him before going to B-Ref). He’s not outstanding, but he’s arguably a better candidate than Baines or Belle. He was even pretty decent defensively, winning a Gold Glove. He’s even better than the average 1B in the HOF (although Perez is one of them, so…). That said, he’s not remarkable enough to really make the Hall. He lacked overwhelming power, and that’s almost a sine qua non for a Hall of Fame 1B. He had one season with more than 30 home runs, a career slugging percentage under .500, and an OPS+ of only 137. Pass.

Orel Hershiser: Hershiser was solid. He peaked young (1987-1989) but was a perfectly serviceable innings eater for much of the rest of his career. He’s well regarded and pitched on a memorable team in 1988. That said, it’s hard to get excited about a pitcher whose ERA was only a bit better than league average and who put up some frankly terrible FIP numbers in the 1990s. Pass.

Davey Johnson: Johnson, as a player, is not in the conversation. As a manager, he had an amazing career with the Mets, but settled into being quite good overall. In 17 seasons, he had 14 seasons over .500. He got into the playoffs seven times, but never quite reached the heights of 1986. His playoff record is a significant negative, however. Pass.

Mark McGwire: Big Mac was really, really good at hitting homers. Everything, good and bad, feeds off of that. He’s done a decent job of rehabilitating his name, especially after his bizarre interview with Bob Costas. His offensive capabilities are undeniable, and in a neutral world, he’s a definite Hall of Famer. The question is the slippery slope argument: if McGwire gets in, you’ll have to let in other confirmed cheaters who were better players (see Clemens and Bonds). I would let him in, but I’m not sure the Hall will. Hit… but likely a pass from the actual committee.

Lou Piniella: A long career of barely above-average managing. He managed some bad teams and some mediocre teams, and won a single World Series title. He’s a better case than Davey Johnson, but only because of his lengthier career. Pass.

John Schuerholz: As a GM, he’s one of the best of all time. Hit.

Bud Selig: I can’t imagine him not getting in. He essentially ended labor disputes, presided over significant expansion in the number of teams and playoff spots, and saw baseball’s popularity explode. Hit.

George Steinbrenner: As an owner, it’s hard to think of a more successful individual. He took the Yankees when they were a joke and made them into a juggernaut again. Hit.

My ballot: McGwire, Steinbrenner, Selig, Schuerholz.

So who actually won? Click below!

Continue reading