T-SQL Tuesday 95: Polybase And HDInsight

Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series.  If you haven’t seen it already, I have a twenty-some-part series on Polybase in SQL Server 2016 and now SQL Server 2017.  Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately.  That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have in rough reading order.  I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic.  Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start.  Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well.  Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to look at trying to connect a SQL Server instance to Azure HDInsight and see if I can take advantage of predicate pushdown.  The short answer is, not really.  The long answer is, read on.

 

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster.  We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster.  Here are my cluster configuration options:

[Screenshot: cluster configuration options]

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else.  Note that Windows is an option for the operating system, but nobody in their right mind runs Hadoop on Windows.  It’s dead; leave it be.

Next, fill out the basics.  Here’s my setup:

[Screenshot: cluster basics]

You can configure your storage options after that.  I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage.  If you want Hive or Oozie metadata to persist after you kill the cluster—something that you might want to do in production as you shut down the HDInsight cluster when not in use—you can choose an Azure SQL Database in the dropdowns at the bottom.

[Screenshot: storage settings]

Once you’ve got storage set up, you want to edit the cluster size.  By default, you get two D12 head nodes and four D4 worker nodes.  This will run you $3.63 an hour in East US.  That’s about $87 a day, or roughly $2600 per month.

[Screenshot: editing the cluster size]

You probably want to trim that down a bit if you’re just playing around.  Here’s what I ended up with:

[Screenshot: smaller node sizes]

Just using 2 D3 nodes cuts the price in half.  It’d be a terrible demo for Hadoop and completely unfair in a performance test, but our purpose here is just to get a simple Polybase demo working, not provide a comprehensive study on Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

[Screenshot: HDInsight deployment in progress]

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation.  To get the right version of Hadoop, I need to SSH to the NameNode:

[Screenshot: SSH session to the NameNode]

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

[Screenshot: HDP version output]

That means that in the yarn-site.xml file in my SQL Server instance’s Polybase Hadoop configuration directory, all of the classpath references will point to /usr/hdp/2.6.2.2-5.
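For reference, here is a rough sketch of the instance-level setup plus the sort of external data source we would want for pushdown.  The data source name and hosts are placeholders, the ports are typical HDP defaults, and connectivity option 7 covers Hortonworks HDP 2.x on Linux:

--Tell Polybase which type of Hadoop cluster to expect.  Option 7 covers
--Hortonworks HDP 2.x on Linux; changing this requires a restart of the
--SQL Server and Polybase services.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO

--Setting RESOURCE_MANAGER_LOCATION is what allows Polybase to push
--predicates down by submitting MapReduce jobs to YARN.
CREATE EXTERNAL DATA SOURCE HDInsightCluster
WITH
(
	TYPE = HADOOP,
	LOCATION = 'hdfs://<active NameNode>:8020',
	RESOURCE_MANAGER_LOCATION = '<ResourceManager host>:8050'
);
GO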

The Problem

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  the necessary ports are not exposed to the outside world.  If you’ve put your HDInsight cluster into an Azure VNet and are willing to monkey around with ports, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

One Workaround

There’s one consistent workaround that I see people use:  have Azure Blob Storage act as an intermediary.  In this scenario, you’d use Polybase to write to Azure Blob Storage and then pull that data into the Azure Storage account attached to the HDInsight cluster.  This works for data transfer, but it doesn’t allow for predicate pushdown.
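Here is a rough sketch of what that intermediary setup looks like on the SQL Server side.  The credential, data source, file format, and table names are all made up for illustration, and remember that writing to an external table also requires turning on the allow polybase export setting:

--A database master key is required before creating a database scoped credential.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<some strong password>';
GO
--The secret is the Azure storage account access key; the identity string is arbitrary.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';
GO
--Point at a blob container rather than at the Hadoop NameNode.  There is no
--RESOURCE_MANAGER_LOCATION here, which is why we give up predicate pushdown.
CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH
(
	TYPE = HADOOP,
	LOCATION = 'wasbs://<container>@<storage account>.blob.core.windows.net',
	CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH
(
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (FIELD_TERMINATOR = ',', USE_TYPE_DEFAULT = TRUE)
);
GO
--We can INSERT into this table from SQL Server and then read the files from HDInsight.
CREATE EXTERNAL TABLE dbo.SalesStaging
(
	SaleID INT NOT NULL,
	SaleAmount DECIMAL(12, 2) NOT NULL
)
WITH
(
	LOCATION = '/staging/sales/',
	DATA_SOURCE = AzureBlobStorage,
	FILE_FORMAT = CsvFileFormat
);
GO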

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy.  Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour.  Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money.  You will be charged for an HDInsight cluster even if you’ve shut off all of the machines.  There is no free pause option; the only way not to pay for the service is to delete the cluster.

Unfortunately, I was not able to get this template working.  I got through step one of the deployment but got stuck with a status of Action Required:

[Screenshot: deployment stuck at Action Required]

When I clicked on the deployment link, the page would sit there spinning, waiting for something to load.  In my browser console, I saw the problem:

[Screenshot: API error in the browser console]

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above.  I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point.  There’s a lot of effort to get started, and frankly, if I’m going to go through all of this effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control.  I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node.  There are cases where I can see HDInsight being useful, but I don’t think the time and administrative savings over building your own Hadoop cluster are enough to make it a worthwhile choice.  Maybe in a few years they’ll have it nailed down, but for now, I’d stick to custom-built Hadoop either on-prem or as VMs in a cloud provider like Azure or AWS.

New Video: Much Ado About Hadoop

I’ve decided to start recording my presentations to ensure that I have them in video form.  They’re much closer to one-take videos than polished, professional videos, but you’ve got to start somewhere.

Here is my first video, Much Ado About Hadoop.

The main audience for this video is people who are interested in learning a bit about the Hadoop ecosystem from a high level but not sure where to begin.  I cover this from a historical perspective, explaining what the major components of Hadoop are, why they made sense in the context of contemporary hardware and software paradigms, and how the Hadoop ecosystem has evolved in response to changes in hardware and software.  This talk is approximately 90 minutes in length and is an expanded version of the talk I’ve given at SQL Saturdays.

From here, as I get a bit more familiar with my software and equipment—and as I get more time—I will post additional videos.  I already know what my second video will be and want to get it out in the next two or three weeks, depending upon circumstances.  From there, who knows?

NDC Sydney Talks Up

NDC has released videos for NDC Sydney.  The NDC Conferences channel has the full set, and there were some fantastic talks this year.

I gave two talks at the conference.

They were surprisingly not awful!  Probably due to all of the post-production…

My favorite talk from NDC Sydney was Laura Bell talking about security for startups.  Definitely watch that one too; she gives four easy tips for us to use to improve our companies’ security postures.

Feasel’s Law

I’ve had this idea on my mind for a while and figured I might as well write it down…

Feasel’s Law – Any sufficiently advanced data retrieval process will eventually have a SQL interface.

This is a bit tongue-in-cheek, but it plays out pretty well; typically, a major sign of a database product’s maturity is its reimplementation (at least in part) of SQL.  Examples:

  • Hadoop started out as writing map-reduce jobs in Java.  Yahoo then created Pig to work with ETL processes.  After that, Facebook created Hive, an implementation of SQL for Hadoop.
  • Spark started out using Java and Scala, and then Python.  Not too long after, we had Shark and then Spark SQL.
  • HBase is a NoSQL database on top of Hadoop.  Now we have Phoenix, which is a SQL interface to HBase.
  • The primary method of operation with Cassandra is CQL, a set of SQL-like statements.
  • Riak TS has a SQL interface.  Incidentally, I love the author’s comment that SQL isn’t dead yet…
  • The easiest way to access CosmosDB data?  SQL.  That interface isn’t fully baked yet—it doesn’t allow grouping, for example—but it’s getting there.

One of the few strong counter-examples is MongoDB, which doesn’t have a SQL interface but does have a SQL translation guide.  DynamoDB also does not offer a SQL interface, though there are third-party interfaces and management tools which give you the same effect.

Otherwise, if you’re using a database that was created after 1990, has any significant user base, and is a mature platform, chances are it’s got a native SQL interface.

Temporal Tables As SCD2 Tables

In this post, I am going to look at using SQL Server 2016 temporal tables as slowly changing dimension tables.  We will walk through the setup and then look at what I consider to be the biggest disappointment with temporal tables; I will then show a workaround for this issue.

Act One:  The Setup

Temporal tables were introduced in SQL Server 2016.  They take a lot of the busy work out of creating history tables, handling things like data movement and building interfaces to query historical data for you.  This makes them very interesting from a T-SQL developer’s perspective, as it potentially allows us to keep our primary table smaller while still retaining historical data for user query.

In my example, I am going to build a table to predict quantity sold for a series of products.  In the real version of these tables, I have some fancy R code generating product models, and I store them in a table called ProductModel.  I want to have the latest version of the model readily available to me, but I would also like to store historical models for a product; that way, if I see a persistent problem with the model, I can load an older version of the model from SQL Server back into R and analyze the model further.

For this post, I’m going to do something a bit simpler; I’ll create a ProductModel table with a product ID and a model number.  This model number represents my R model and allows me to simplify the problem considerably by skipping the whole model creation part.

Here’s my ProductModel table:

CREATE TABLE [dbo].[ProductModel]
(
	ProductID INT NOT NULL,
	ModelNumber INT NOT NULL,
	StartDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW START CONSTRAINT [DF_ProductModel_StartDateGMT] DEFAULT (SYSUTCDATETIME()),
	EndDateGMT DATETIME2(3) GENERATED ALWAYS AS ROW END CONSTRAINT [DF_ProductModel_EndDateGMT] DEFAULT(SYSUTCDATETIME()),
	PERIOD FOR SYSTEM_TIME (StartDateGMT, EndDateGMT),
	CONSTRAINT [PK_ProductModel] PRIMARY KEY CLUSTERED
	(
		ProductID
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
) WITH(SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));
GO

Note that I am creating a history table called dbo.ProductModelHistory, and the valid date range for each model will be from StartDateGMT until EndDateGMT.

Next up, I want to create a table to store my predictions.  In this table, I store the product ID, the date when I made the prediction, and how much I predicted we would sell of the product.  Note that this table does not have a model number:  the people who care about the predictions don’t have any idea what an R model is and don’t care how we store that model.

CREATE TABLE [dbo].[QuantitySoldPrediction]
(
	ProductID INT NOT NULL,
	DatePredictionMade DATE NOT NULL,
	PredictedQuantitySold INT NOT NULL,
	CONSTRAINT [PK_QuantitySoldPrediction] PRIMARY KEY CLUSTERED
	(
		ProductID,
		DatePredictionMade
	)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100, DATA_COMPRESSION = PAGE)
);
GO

Now I want to populate these tables. Because I want to load historical data, I need to set system versioning off temporarily.

--Insert some historical data
ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = OFF);

INSERT INTO dbo.ProductModelHistory
(
	ProductID,
	ModelNumber,
	StartDateGMT,
	EndDateGMT
)
VALUES
(1, 1, '2017-06-01', '2017-06-14'),
(1, 2, '2017-06-14', '2017-06-28'),
(1, 3, '2017-06-28', '2017-07-03 14:29:21'),
(2, 1, '2017-07-01', '2017-07-04 08:00:00');

ALTER TABLE dbo.ProductModel SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));

INSERT INTO dbo.ProductModel
(
	ProductID,
	ModelNumber
)
VALUES
(1, 4),
(2, 2),
(3, 1);

--Insert predictions, some for the historical data and some for now.
INSERT INTO dbo.QuantitySoldPrediction
(
	ProductID,
	DatePredictionMade,
	PredictedQuantitySold
)
VALUES
(1, '2017-06-02', 8),
(1, '2017-06-15', 11),
(1, '2017-06-19', 9),
(1, '2017-06-29', 21),
(1, '2017-07-02', 23),
(2, '2017-07-02', 2),
(2, '2017-07-03', 1),
(1, GETUTCDATE(), 44),
(2, GETUTCDATE(), 6),
(3, GETUTCDATE(), 16);

At this point, we have the following:

  • Three products in our ProductModel table
  • Four historical ProductModel records.  Note that there is a gap between historical end dates and the start dates in the ProductModel table; that’s okay for what we’re doing.
  • Ten predictions, including seven historical predictions and three current predictions (a quick verification query follows this list).
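If you want to eyeball that state yourself, here is one quick way to see current and historical model rows side by side (this is just a sanity check; any similar query works):

SELECT
	'Current' AS Source,
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel pm

UNION ALL

SELECT
	'History' AS Source,
	pmh.ProductID,
	pmh.ModelNumber,
	pmh.StartDateGMT,
	pmh.EndDateGMT
FROM dbo.ProductModelHistory pmh
ORDER BY
	ProductID,
	StartDateGMT;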

Act Two:  The Falling Action

Now that I have my data, I want to do something simple:  I want to show the model number associated with each prediction.  This involves looking at the ProductModel and ProductModelHistory tables based on StartDateGMT.

The documentation for temporal tables shows how you can use FOR SYSTEM_TIME AS OF to look at time frames.  For example:

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF '2017-07-02 08:00:00' pm;

This brings back the following result:

[Screenshot: query results as of 2017-07-02 08:00]

We can also use a variable, allowing us to look at what the table looked like at a user-defined point in time:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm;

This gives us back the same result. Thinking about this syntax, what I want to do is join to the dbo.QuantitySoldPrediction table. Let’s start simple:

DECLARE
	@InterestingTime DATETIME = '2017-07-02 08:00:00';

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF @InterestingTime pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;

This query succeeds but returns results we don’t really want:

[Screenshot: join results, every row showing the same start and end dates]

This brings back all 9 records tied to products 1 and 2 (because product 3 didn’t exist on July 2nd at 8 AM UTC). But it gives us the same model, start date, and end date for every prediction, so that’s not right. What I really want to do is replace @InterestingTime with qsp’s DatePredictionMade, so let’s try that:

[Screenshot: the attempted query with a syntax error]
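In text form, the attempted query looks something like this (a reconstruction, and one which will not compile):

SELECT
	pm.ProductID,
	pm.ModelNumber,
	pm.StartDateGMT,
	pm.EndDateGMT
FROM dbo.ProductModel FOR SYSTEM_TIME AS OF qsp.DatePredictionMade pm
	INNER JOIN dbo.QuantitySoldPrediction qsp
		ON pm.ProductID = qsp.ProductID;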

This returns a syntax error. It would appear that at the point where FOR SYSTEM_TIME is resolved, the qsp alias does not yet exist, so we cannot reference its columns. This stops us dead in our tracks.

Act Three: The Denouement

We don’t have an easy way of solving the problem, but we do have a complicated way. Under the covers, dbo.ProductModelHistory is just a table, which means we can still query it like any other table. What we want, for each quantity sold prediction, is the single product model which was in effect when we made that prediction. To solve this, we pull one of my favorite rabbits out of the hat: the APPLY operator.

SELECT
	q.ProductID,
	q.DatePredictionMade,
	q.PredictedQuantitySold,
	pm.ModelNumber
FROM dbo.QuantitySoldPrediction q
	CROSS APPLY
	(
		SELECT TOP(1)
			pmbuild.ModelNumber
		FROM
		(
			SELECT
				pm.ModelNumber,
				pm.StartDateGMT
			FROM dbo.ProductModel pm
			WHERE
				pm.ProductID = q.ProductID

			UNION ALL

			SELECT TOP(1)
				pmh.ModelNumber,
				pmh.StartDateGMT
			FROM dbo.ProductModelHistory pmh
			WHERE
				pmh.ProductID = q.ProductID
				AND pmh.StartDateGMT <= q.DatePredictionMade
			ORDER BY
				pmh.StartDateGMT DESC
		) pmbuild
		ORDER BY
			pmbuild.StartDateGMT ASC
	) pm;

This query is quite a bit longer than I’d like, as I need to build a sub-query to union together current and historical data, and then grab the top row for each quantity sold prediction. On the plus side, the results are what I want:

[Screenshot: each prediction joined to its model number]

This is one of two reasons that make me think that temporal tables do not make for great slowly changing dimension implementations.  The other reason is that you can’t insert into the history table as long as you have SYSTEM_VERSIONING on.  These tables can be quite useful for storing fact table history, as the FOR SYSTEM_TIME AS OF point-in-time snapshot makes a lot more sense in that case.  And as we saw above, if you’re willing to try hard enough when writing your retrieval queries, you can make temporal tables work.  In my case, I had just one query like this to write, so the benefits still outweighed the costs.
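As a quick illustration of that second point, an insert like the following (with arbitrary values) fails as long as SYSTEM_VERSIONING is on:

INSERT INTO dbo.ProductModelHistory
(
	ProductID,
	ModelNumber,
	StartDateGMT,
	EndDateGMT
)
VALUES
(3, 0, '2017-05-01', '2017-05-15');
--This fails with an error stating that you cannot insert rows into a temporal
--history table; you have to turn SYSTEM_VERSIONING off first, as we did
--during the initial load above.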

Companero Conference

[Image: speaker promo badge]

I am extremely pleased to have the opportunity to speak at Compañero Conference. Compañero Conference (or CompaCon, as I can’t pronounce “Compañero” because I can’t roll that n) is a paid conference in Norfolk, Virginia, starting the evening of October 3rd and going on through the 5th.

The purpose of Compañero Conference is to give lone DBAs, accidental DBAs, and new DBAs a place to meet compatriots and learn some useful skills.  Carlos and Steve have a pattern in mind for an intimate affair:  a single track of sessions, social events to give people a chance to get to know someone in a similar situation, and unfettered access to great speakers (and me).

I’m going to do two sessions at CompaCon:  one on database security and one offering a gentle introduction to the Hadoop ecosystem.  Because it’s a small group, you’ll have plenty of chances to get one-on-one discussion time with any of the speakers, so if you need help convincing your boss to attend, there’s a built-in benefit:  you have some really sharp people (and me—oh, wait, I did that one already) trapped in a room and you’ll get to pick our brains about work topics.  And unlike a SQL Saturday, you’ll still get the whole weekend to yourself.

Tickets are currently available at a discounted price of $400, but that’s only for a limited time.

What Makes A Data Platform Professional?

Eugene Meidinger hates the broadness of the term “data platform professional”:

During PASS Summit, I wrote a post about the broadening data platform. I talked about the term Data Professional, and how I feel it describes the changes going on in the SQL space. Here’s the problem: It’s a terrible guide. It’s a great description, it’s a wonderful attitude, it’s an ambitious goal; but it’s a terrible guide.

Being a data professional means being a jack of all trades. It means being a renaissance man (or woman). It means a career plan that looks like this:

Here’s my summary of Eugene’s argument:

  1. The concept of “data platform” is too broad to be meaningful, because
  2. nobody can gain expertise in the entire data platform.  Therefore,
  3. focus on a particular slice of the platform.

Before I go on to make a counter-argument, let me start by saying that I agree almost whole-heartedly with this summary.  But that never stopped me from arguing before…

So here’s my counter-argument:  the concept of “data platform” is quite broad and nobody will master it all.  Within that group, there are core skills, position-specific core skills, secondary skills, and tertiary skills.  I recommend one of two paths:

  1. The Polyglot:  be great at the core and position-specific core skills, good at secondary skills, and aware of tertiary skills.  Be in the top 25% of your field at one to two skills, ideally one core and one secondary.
  2. The Specialist:  be outstanding (in the top 1-3% of your field) at one to two skills.

With that in mind, let’s flesh this argument out.

Gauss and Mathematics

Carl Friedrich Gauss was the last true polyglot mathematician.  He was able to make significant contributions to pretty much every branch of mathematics at the time, something no mathematician has been able to do since.  The reason for this is not that Gauss was that much smarter than any other person since him, but rather that Gauss himself helped expand the world of mathematics considerably, so a 21st century Gauss would need to know about everything Gauss did plus what his contemporaries did plus what their chronological successors did.  This spider-webbed growth of knowledge makes it impossible for one person to repeat what Gauss did.

Even though nobody can be a “true” polyglot mathematician—in the sense of knowing everything about mathematics at the present time—anymore, it doesn’t mean that “mathematics” is so broad a term as to be meaningless as a career plan.  Instead, it means that we all have to specialize to some increasingly greater extent relative to the entire body of knowledge.

What’s The Right Level Of Specialization?

One of my colleagues, Brian Carrig, was working the SQL Clinic at SQL Saturday Raleigh.  When he lost his original “I can help you with…” badge, he created his own.

This…might not be the right level of specialization.  Brian’s a sharp enough guy that he knows more than the average practitioner on a wide range of topics, but I don’t think the choices are to be Brian or to be replaced by robo-developers; there are few enough people who can reach Brian’s level of skill that if these were the only choices, it’d be a dystopian nightmare for IT practitioners (and I’m not just saying that because I want Brian to provision me some more SQL Server instances).

So there has to be a middle ground between “know everything” and “exit the industry.”  I agree with Eugene that we have to specialize, and here’s what I see, at least based off the current landscape.

Sub-Data Platform Job Categories

To save this from being a 5000-word essay, let’s pick four very broad categories for data platform jobs.  These share some overlap and there are certainly people who don’t fit in any of these roles, so this is not a complete taxonomy.  It should serve as a guide for us, however.

The four broad categories I have in mind are as follows:  database developer, database administrator, Business Intelligence specialist, and data analyst.  Database developers focus on writing and tuning queries and tend to specialize in performance tuning.  Database administrators focus on backup and recovery, dealing with database corruption, and availability; they tend to specialize in process automation.  Business Intelligence specialists build warehouses and migrate data from different systems into warehouses; this is a broad enough term that it’s hard to say what they specialize in, but pick one piece of the puzzle (cubes, warehouse modeling, ETL) and you’ll find people who specialize there.  Finally, data analysts apply concepts of statistical analysis to business problems and come up with explanations or predictions of behavior.

Choosing Your Skills

I see four major categories of skill, but the specific details of what fits into each category will differ based on the role.  Again, this is not intended to be a taxonomy but rather a conceptual description.  We have the following concepts:  core skills, position-specific core skills, secondary skills, and tertiary skills.

Core skills are skills which are common to all data platform professionals.  There are relatively few of these, but they tend to be fundamental to all positions.  Think of things such as an understanding of SQL and relatively basic query tuning (which includes figuring out when to tune a query and what information is available on your platform for tuning queries).  But really, when we think of core skills, we’re thinking of position-specific core skills.

As an example of a position-specific core skill, administrators need to know how to back up and restore the databases under their care.  How you do this will differ based on the product, but if you administer a database without knowing how to recover it, you’re running a major risk and have a major skill gap.  So basically, position-specific core skills are the things that you train juniors to do and expect mid-levels to know already.

Secondary and tertiary skills are even more nebulous, but I see them as skills which are ever-more-distant from the position-specific core skills.  For a database administrator, the ability to write .NET code is a secondary skill:  relatively few employers or practitioners think of a database administrator as someone who needs to write C# or F# code, but they can see how it’d apply to the position.  A language like R would be a tertiary skill:  a skill which the average practitioner has trouble tying back to day-to-day life.  Most DBAs never think of using R for anything (although I’m trying to change that in my own small way).

Now, skills move over time.  As Eugene points out in his post, I sincerely believe that administrators who don’t understand Powershell are at a serious disadvantage and that there will come a time that database administrators entirely lacking in Powershell scripts will be superfluous.  We’re probably the better part of a decade away from containerization technologies like Docker having the same impact, but it’s ramping up as well.  On the other side, an example of a technique that differentiated good from great database administrators a long time ago was the ability to lay out files on disk to minimize drive latency.  SANs and later SSDs killed that skill altogether.

I wouldn’t describe these skill shifts as fluid, but rather tectonic; they don’t change overnight but they do significantly alter the landscape when they happen, and you don’t want to be on the wrong side of that tectonic shift.

So What’s The Answer?

In my eyes, the answer is to build your skills along one of two paths:  the polyglot or the specialist.  The polyglot knows a little about a lot but has a few major points of depth.  A polyglot database developer might know a lot about writing PL/SQL and tuning Postgres queries, but also has enough experience to query Lucene, write some basic Informatica scripts, and maintain a Python-based ETL project.  At many companies, a broad slice with good depth in a couple skills and relatively little depth in several skills is enough, and for our polyglot developer, it keeps doors open in case the market for Postgres developers flattens out for a few years or our developer wants to go down a new road.

In contrast to the polyglot, a specialist developer is elite at certain skills and knowingly ignorant of most others.  A specialist SQL Server query tuner is in the top 1-3% of all developers at tuning SQL Server queries and knows all kinds of language and configuration tricks to squeeze percentages off of queries which take milliseconds or even microseconds, but might not know (or care) much about the right way to automate taking backups.  You go to the polyglot to solve general, overarching problems but go to the specialist because you have a particular problem which is beyond the polyglot’s skill level.

In case the parallel isn’t completely clear, this model fits with the model for medical doctors:  you have Primary Care Physicians/General Practitioners (PCPs or GPs) and you have specialists.  The PCP knows how to diagnose issues and sees patients with a wide range of maladies.  Sometimes, the PCP refers a patient to a specialist for further diagnosis or action. As an example, a PCP might stitch up a patient with a nasty gash, but that same PCP won’t rebuild a shattered femur; that PCP will refer the patient to a specialist in that area.

Is This Really The Right Direction?

A couple days before Eugene’s post, I had a discussion with a person about this topic.  She was getting back into development after a couple years doing something a bit different, and one thing she noticed was the expectation of employees being more and more polyglot.  Her argument is that we, as IT professionals, have a lot to do with this, as there’s a bit of a race to the bottom with developers wanting to learn more and willing to spend more and more time learning things.  This makes IT jobs feel like running on a treadmill:  you expend a lot of effort just trying to keep up.  And this shows in how job titles and job role expectations have changed, including the concept of a data scientist (which I’ll touch upon at the end).

I’m not sure I agree with this assessment, but it does seem that more positions require (or at least request) knowledge of a range of skills and technologies, that it’s not enough to be “just” a T-SQL stored procedure developer in most shops.  So to that extent, there seems to be a combination of developers innately moving this direction as well as job roles shifting in this direction.

To the extent that she is correct, there’s a good question as to how sustainable this strategy is, as the platform is expanding ever-further but we don’t have any more hours in the day.  But at the same time, take a step back and this is nothing new:  database developers are already a subset of all developers (much as we are loath to admit this sometimes), so these categories are themselves cases of specialization.  But let’s shelve that for a moment.

Anecdote:  ElasticSearch And Me

It’s time for an anecdote.  A few months ago, I started running a predictive analytics team.  Our first project was to perform predictions of disk growth based on historical data.  No big deal at all, except that all the data was stored in ElasticSearch and our DBA team wanted the results in ElasticSearch as well.  My experience with ElasticSearch prior to this assignment was practically nil, but I went into it eager…at the beginning…

There were days that I wasted just figuring out how to do things that would take me five or ten minutes in SQL Server (particularly around aggregating data).  In that sense, it was a complete waste of time to use ElasticSearch, and throughout that time period I felt like an idiot for struggling so hard to do things that I intuitively knew were extremely simple.  It took a while, but I did muddle through the project, which means that I picked up some ElasticSearch experience.  I’m definitely not good at writing ElasticSearch queries, but I’m better than I was three months ago, and that experience can help me out elsewhere if I need to use ElasticSearch in other projects or even to give me an idea of other ways to store and query data.

This is one of the most common ways that people learn:  they muddle through things because they need to, either because the work requires it or because it’s a challenge or growth opportunity.  If you’re able to take the results of that muddling through and apply it to other areas, you’ve got a leg up on your competition.  And I think it’s easier to form quality systems when you have a few examples—it’s easier to reason accurately from several scenarios rather than just one scenario.

Conclusion

Summarizing a rather long blog post, I do agree with Eugene that “data platform” is a very broad space, and expecting someone to know everything about it would be folly.  But that’s not unique.  “Programmer” is extremely broad as well, but we don’t expect embedded systems developers to write databases (or even write database queries) or design responsive web applications.  Doctors and lawyers specialize to extreme degrees, as do plenty of other professionals, and I see no reason to expect anything different from data platform professionals.  I do believe that unless you are at the extreme right end of the distribution for certain skills (and can thus be a top-end specialist), you want to err in the direction of being broader than deeper, as it reduces the chances of getting caught in a sudden paradigm shift (remember how cool Web Forms was for about 4 years?) and risking your livelihood as a result.

One other point I want to make is that the broadness of this space shows the power of teamwork and complementary skills.  There’s an argument that a good scrum team is made up of a bunch of generalists who can all fill in each other’s roles on demand.  I think that concept’s garbage for several reasons, one of which is that you often need specialists because specialists fix problems that generalists can’t.  So instead of having a team of generalists, you have a team of people with different skills, some of which overlap and some of which complement each other:  you have one or two data specialists, one or two UI specialists, one or two “backbone” specialists (usually .NET or Java developers), one or two QA specialists, etc.  This says to me that there’s less a difference in kind than a difference in degree, even between the polyglot type and the specialist type:  you can be a polyglot with respect to other data professionals (because you’re using several data platform technologies and are working across multiple parts of the stack) while being a specialist with respect to your development team (because you’re the database person).

Coda:  Data Science

One bit at the end of Eugene’s post is that he’s interested in digging into data science.  Coming at the end of a post criticizing a definition as impossible to live up to, this did elicit a knowing chuckle.  The problem is that the term “data scientist” is a microcosm of the issues with “data platform professional.”  To be a data scientist, you should have development skills (preferably in multiple languages, including but not limited to SQL, R, and Python), a strong background in statistics (ideally having worked through PhD level courses), and a deep knowledge of the context of data (as in spending years getting to know the domain).  I saw the perfect t-shirt today to describe these people.

[Image: “Trust me, I’m a unicorn” t-shirt]

There are very few people who have all three skill sets and yet that’s what being a data scientist requires.  It’s the same problem as “data platform professional” but at a slightly smaller scale.