# Building A Genetic Algorithm

This is part of a series entitled Genetics In Action.

So far, we have learned a little bit about evolutionary algorithms and taken a look at just enough high school biology to review the basics of genetics.  Today, I want to look at one particular type of evolutionary algorithm called a genetic algorithm.

Although John Holland did not invent the concept of genetic algorithms, he is the man most responsible for popularizing and developing the concept.  Holland’s *Adaptation in Natural and Artificial Systems* is a classic of the field and ties together the genetic metaphor.  I highly recommend this book if you’re interested in the topic.

## Digging Into The Algorithm

In the last post, we looked at how the genetic metaphor ties development concepts to biological concepts, and today we will move beyond that high-level description and cover the specifics of how genetic algorithms work.  We will start by looking at the simplest type of genetic algorithm:  a single chromosome with a fixed number of genes, each of which has two alleles.  In computer science terms, think about a fixed-length array of boolean values.

Two sample chromosomes

In the image above, we can see two sample chromosomes, each of which has eight genes.  We want to build up a population of these chromosomes at random—one of the interesting parts of the genetic algorithm is that it (usually) doesn’t matter where in the total search space we begin, so starting at a bunch of random points is just as good as anything else.

When it comes to choosing the size of the population, there aren’t too many hard-and-fast rules.  I have read recommendations that you should have at least 2 * N, where N is the number of genes that each chromosome has.  If you’re looking at 10-30 genes, I’ve typically had good luck with a population size of somewhere between 100 and 500.  You’ll find out that there is a maximum interesting population size, after which you don’t really get any benefit:  you won’t converge on a solution any faster, and it will take longer to do all of the processing.
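To make this concrete, here is a minimal Python sketch of building a random starting population (the names here are my own invention, not any library’s API):

```python
import random

GENES = 8        # genes per chromosome; each gene has two alleles (0 or 1)
POP_SIZE = 100   # within the 100-500 range mentioned above

def random_chromosome(n_genes=GENES):
    """A chromosome is just a fixed-length array of boolean-style genes."""
    return [random.randint(0, 1) for _ in range(n_genes)]

# Build the population at a bunch of random points in the search space.
population = [random_chromosome() for _ in range(POP_SIZE)]
```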

Once we have our population, the next step is to score each organism in it.  To score an organism, we apply a fitness function.  In computer science terms, this is usually an actual function, which takes the organism’s chromosomal representation as its input and returns a score for that chromosome.

Scoring each organism

In the image above, we have defined a score for each organism.  This score is typically one of two things:  either a distance from an ideal point, or a valuation.  In the first case, think of a set of (x, y) coordinates.  We want to define a chromosome that, given an x coordinate, will generate its appropriate y coordinate.  We calculate some distance between the predicted y and the actual y (for example, the Root Mean Square Deviation), where a perfect match has a deviation score of 0.  In the second case, suppose that we can produce N separate products.  Each product has a cost and a sale price.  Our genetic algorithm might describe the combination of goods we create, and the score would be the net margin (revenue minus cost) of those products.  In that case, a higher number is better for us.
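Both flavors of scoring are easy to sketch in Python.  These two functions are illustrative stand-ins of my own, not part of any particular library:

```python
import math

def rmsd_fitness(predicted, actual):
    """Distance-style score:  Root Mean Square Deviation from the actual
    values.  A perfect match scores 0, and lower is better."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

def margin_fitness(chromosome, margins):
    """Valuation-style score:  the net margin of the products we chose to
    produce (gene = 1 means "make this product").  Higher is better."""
    return sum(m for gene, m in zip(chromosome, margins) if gene == 1)
```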

## It’s A Generational Thing

Now that we have the basic idea of a fitness score behind us, let’s go to the next step:  making babies.

We use a complicated matrix with dozens of measures to find the best fit

Now I am showing the entire population, which has four members.  Each member of the population has its own score, and we will use those scores to help us figure out the next generation of organisms.  The mechanism I am showing in the picture above is the simplest selection mechanism for genetic algorithms:  roulette wheel selection.  Sum the fitness values of every member of the population and you get a total score—that’s the 508 above.  Figure out each member’s percentage of that total, and you have a set of values which sums to 1.  Pick a random number between 0 and 1, and wherever you land on the cumulative distribution function, you have your choice of parent.  Note that we draw with replacement, meaning that we can pull the same organism more than once.

To motivate this example, let’s suppose that red owns numbers from 0 up to .335, blue owns numbers from .335 up to .587, green owns .587 up to .815, and yellow owns .815 through 1.0.  Our first random number drawn is .008, so the lucky winner is red.  Then we draw a second parent and pull .661, which happens to be squarely in green territory.  We now have our two parents.
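A roulette wheel selector can be sketched in a few lines of Python.  The scores below are my own guess at numbers matching the picture:  [170, 128, 116, 94] sums to 508 and yields cumulative shares of roughly .335, .587, .815, and 1.0.

```python
import random

def roulette_select(population, scores):
    """Pick one parent, with probability proportional to fitness."""
    r = random.uniform(0, sum(scores))
    cumulative = 0.0
    for organism, score in zip(population, scores):
        cumulative += score
        if r <= cumulative:
            return organism
    return population[-1]  # guard against floating-point edge cases

# Drawing two parents, with replacement:
parents = [roulette_select(["red", "blue", "green", "yellow"],
                           [170, 128, 116, 94]) for _ in range(2)]
```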

## Crossover

Now that we have our two parents, we are going to generate two children.  I need to introduce a new concept:  crossover.

When a mommy chromosome loves a daddy chromosome very much…

Crossover is the recombination of segments of two chromosomes.  In the example above, we swap genes 3-5 between the parents to produce each of the children (though child #2 is shy and is hiding off-camera).

This action is part of the genius behind genetic algorithms.  We’ve already taken some of the fittest organisms (in a large population, we’re going to pull successful organisms much more frequently than unfit organisms, and there are other selection techniques which bias toward successful chromosomes even more than roulette wheel does), and by recombining slices of their genes, we are able to test the waters with new combinations to see if we can find something even more successful.

Of course, there’s no guarantee that the new combination will be more successful than its parents were, so we have a concept known as the crossover percentage.  That is, we only perform crossover a certain percentage of the time.  In practice, this is often anywhere from 60% to 90%.  If we don’t perform crossover, then the two chromosomes just mosey on into the next step of the process.  But if we do roll the dice and land on the ol’ chop-and-swap, then we have two more RNG rounds to play.

The first of these bonus random number pulls determines where we start the chop, and the second pull determines where we stop.  In the picture above, we start at gene 3 and end at gene 5, inclusive.  In genetic algorithms, we typically have fixed-size chromosomes (though that’s not always true!) and therefore symmetrical swaps.
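Here is one way to sketch the chop-and-swap in Python, with the crossover percentage and the two cut-point pulls described above built in (again, the function name and defaults are my own choices):

```python
import random

def crossover(parent1, parent2, rate=0.7):
    """Two-point crossover:  swap an inclusive slice of genes between
    the parents a certain percentage of the time."""
    if random.random() > rate:
        return parent1[:], parent2[:]  # no crossover; mosey on unchanged
    start = random.randint(0, len(parent1) - 1)    # where the chop starts
    end = random.randint(start, len(parent1) - 1)  # where it stops (inclusive)
    child1 = parent1[:start] + parent2[start:end + 1] + parent1[end + 1:]
    child2 = parent2[:start] + parent1[start:end + 1] + parent2[end + 1:]
    return child1, child2
```

Because the chromosomes are fixed-size, the swap is symmetrical:  at every position, the two children between them carry exactly the genes their parents had there.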

## Turtle Power

The last step in our genetic factory floor is mutation.  One problem with survival-of-the-fittest is that, especially in later generations, we might run low on genetic diversity.  At an extreme, we end up with a population full of exactly the same chromosome, so no matter how you slice them, you get no novel patterns.  If we’ve reached the global maximum, that’s acceptable, but what if we ended up at only a local maximum and can’t jump over to the global max?  That’s where mutation comes into play.

Not pictured:  martial artist rat or antagonistic, anthropomorphic pig & rhino combo

Mutation works by modifying a particular gene’s value.  For each gene in each new chromosome, mutate with a probability p.  Usually p is somewhere between 0.001% and 1%, though I’ve read papers that argue in certain circumstances, you might need a mutation rate of 20% to 30%.  Those would be cases with very flat local minima/maxima where you can get trapped in that flat spot and need a kick out.
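In code, mutation is nearly a one-liner.  This sketch flips boolean genes, assuming the two-allele chromosomes from earlier in the post:

```python
import random

def mutate(chromosome, p=0.01):
    """Flip each gene independently with probability p."""
    return [1 - gene if random.random() < p else gene for gene in chromosome]
```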

If you want a fitting metaphor for flat spots, I had an old Toyota Corolla with an old starter.  I’d be able to start the car up successfully several times in a row, but then I’d hit a dead spot in the flywheel and it just wouldn’t start.  Eventually, my dad bought me the Official Starter Repair Kit:  a crowbar.  His advice was to apply sufficient percussive force until I nudged the starter out of its dead spot, and then the car could start successfully.  Mutation provides benefits in much the same way.  And just like my beater bar, mutation is not a technique you want to rely upon constantly.  At the extreme, mutation is just random search, losing all of the important information that a genetic algorithm learns during the process.

## Finishing The Process

At this point, we have a finished organism.

All grown up and ready to take on the world

We do this for each slot in the population and then repeat the process:  score, choose, cross over (or not), mutate.  We have a few potential end conditions.  Some of them are:

1. Stop after a fixed number of generations
2. Stop after we reach a known ideal solution (e.g., 0 distance from the actual values)
3. Stop after we get close enough to a known ideal solution
4. Stop after we have stasis for a certain number of generations, where we have not improved the fitness score for the best agent in a while
5. Stop after a certain amount of time

There are other stop conditions as well, but these are the most common.
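Putting the whole factory floor together, here is a hedged end-to-end sketch in Python—score, choose, cross over (or not), mutate—using a fixed number of generations as the stop condition.  Everything here is my own minimal implementation, not a library:

```python
import random

def run_ga(fitness, n_genes=8, pop_size=60, generations=40,
           crossover_rate=0.7, mutation_rate=0.01):
    """Score, choose, cross over (or not), mutate; stop after a fixed
    number of generations (end condition #1 above)."""
    population = [[random.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]

    def select(scores):
        # Roulette wheel selection, drawing with replacement.
        r = random.uniform(0, sum(scores))
        cumulative = 0.0
        for organism, score in zip(population, scores):
            cumulative += score
            if r <= cumulative:
                return organism
        return population[-1]

    for _ in range(generations):
        scores = [fitness(c) for c in population]
        next_gen = []
        while len(next_gen) < pop_size:
            p1, p2 = select(scores), select(scores)
            if random.random() < crossover_rate:
                # Two-point crossover over an inclusive slice.
                start = random.randint(0, n_genes - 1)
                end = random.randint(start, n_genes - 1)
                c1 = p1[:start] + p2[start:end + 1] + p1[end + 1:]
                c2 = p2[:start] + p1[start:end + 1] + p2[end + 1:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):
                # Per-gene mutation with a small probability.
                next_gen.append([1 - g if random.random() < mutation_rate
                                 else g for g in child])
        population = next_gen[:pop_size]

    scores = [fitness(c) for c in population]
    return max(zip(scores, population))[1]
```

With a “count the ones” fitness function like `sum`, this converges toward all-ones chromosomes within a few dozen generations.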

## Conclusion

Today we covered the basics of genetic algorithms and how the process works.  Tomorrow, we’ll look at using a genetic algorithm library to solve different types of problems.

# The Basics Of Genetics

This is part of a series entitled Genetics In Action.

In today’s post, I want to cover some extremely basic genetic concepts.  We use these concepts in genetic algorithms and genetic programming, and understanding the biological explanations will help us apply the metaphor to software development.

## Evolutionary Biology In 400 Words

We have a population of self-contained entities called organisms.  Each organism is entirely independent of the other organisms, in the sense that one organism in our population does not depend upon another organism for survival.  In evolutionary algorithms, organisms are our candidate solutions to problems:  we have a number of solutions in our population, and each solution is independent of the other solutions (although they will interact in complex ways, as we’ll talk about later).

Each organism has at least one chromosome.  The purpose of a chromosome is to carry genes.  In evolutionary algorithms, we often simplify the problem by saying that an organism has one chromosome, and it can become easy to conflate organisms and chromosomes for that reason.

Digging deeper, each gene has a number of alternate forms, called alleles.  Combining genes and alleles gives us DNA sequences called genotypes.  A genotype is a possible genetic structure.  Suppose that we have 32 genes, each of which can have 2 alleles.  This would give us 2^32 possible combinations, or a total of 4,294,967,296 possible genotypes.

Genotypes determine phenotypes.  Phenotypes are observable physical characteristics.  Classic examples of phenotypes include eye color, hair color, height, size, and beak structure.  But it’s important to understand that there is no 1:1 correspondence between a genotype and a phenotype; some number of genotypes can end up generating the same phenotype.  Phenotypes depend upon a certain combination of alleles but not the entire genotype.

This goes the other way as well:  a phenotype may only appear when a particular combination of alleles holds a particular value.  Close may count in horseshoes and hand grenades, but it doesn’t count with genetics:  if there are seven alleles which combine to cause a phenotype and only six are set, you will not see the physical characteristics.  Think of this like a password:  unlike in TV and movies, if you have 18 of the 19 characters in a password correct, you shouldn’t get any indicator that you’re ever-so-close.  If we’re using an evolutionary process to find that password (which probably isn’t a good use for an evolutionary algorithm), we won’t get positive feedback until we get it absolutely correct.

The last aspect of genetics that I want to cover today is the environmental niche.  A niche is a set of features which certain phenotypes can exploit.  Think about life in the Arctic:  features like thick fur and layers of fat help animals survive in these frigid climes.

## Applying These To Evolutionary Algorithms

To apply biological concepts to evolutionary algorithms, I’m going to flop around a little bit and start at the end.

First, think about an environmental niche.  The programming concept we’re describing here is a fitness function.  An example of a fitness function may be a set of (x, y) coordinates, and our goal is to find something which maps as closely as possible to those coordinates.

An example of a fitness function

To solve the fitness function, we want to build up a population of organisms.  Our organisms are actually candidate solutions.  In the example above, these would be mathematical functions which fit the fitness function to a greater or lesser extent.

For the sake of simplicity, each organism will have one chromosome.  In its simplest form, a chromosome is an array which contains a set of elements.  Those elements are akin to genes in our biological nomenclature.  Each element has a set of alleles, that is, possible values.  Therefore, the genotype is the particular combination of elements in the chromosomal array.

A trivial example of a chromosome is a binary array:  [ 0, 0, 1, 0, 1, 1, 0, 1 ] would be a sample chromosome.  Here we have 8 genes, each with 2 alleles, for a total of 256 genotypes.  We apply this to the fitness function and get back a score, and that score tells us how fit the organism is.
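For instance, with a toy fitness function of my own choosing—count the ones—scoring that sample chromosome is a single pass over the array:

```python
chromosome = [0, 0, 1, 0, 1, 1, 0, 1]

def fitness(chromosome):
    """A toy fitness function:  the number of genes set to 1."""
    return sum(chromosome)

score = fitness(chromosome)  # 4
```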

In the next post, we’ll go deeper into this metaphor, diving into genetic algorithms.  I’ll explain more about fitness functions and explain what makes these evolutionary algorithms rather than biological algorithms.

# Evolutionary Algorithms: The Basics

This is part of a series entitled Genetics In Action.

I’m in the process of creating a new talk, entitled Genetics In Action:  Evolutionary Algorithms.  I have no idea where, outside of work, I’m going to give this talk; still, it’s a topic I love and maybe I’ll sneak it in at some conference somewhere.

Here’s my abstract:

Evolutionary algorithms help us solve optimization problems where we know what the answer is, but don’t know how to get that answer.  In this talk, we will look at how different evolutionary algorithms apply the concepts of genetics to discover formulas and patterns.  Specifically, we will look at genetic algorithms and genetic programming, digging into how they work and solving a number of problems with each.  We will also include a crash course on basic genetics, just in case high school biology isn’t fresh in your mind.

With that in mind, the next several posts will relate to evolutionary algorithms, how they work, and a few of the examples I’m going to use in my demos.

## What Makes For A Good Solution?

Revamping a blog post from a decade ago:  good solutions to problems tend to have seven positive qualities.  They are:

1. Correctness – Abstracting away from imprecisions in measurement, the goal is to find the a priori correct solution.
2. Consistency – Do the same problem, get the same answer.
3. Justifiability – Cause implies effect; results come from premises through the application of logical rules.
4. Certainty – Chance should ideally not be involved in the solution.
5. Orderliness – The process is a logical, step-by-step process in which each step leads to the next.
6. Parsimony – Given several possible solutions which have the same explanatory power, the simplest is usually the best.
7. Decisiveness – Once a solution is found, the problem is solved.  In particular, there is a well-defined solution concept.

These are principles that John Holland talks about in his groundbreaking work, *Adaptation in Natural and Artificial Systems*.  They are seven properties that I think we can all agree are good things to have, and we’d expect them to be important in coming up with solutions to problems.

Evolutionary algorithms fail every one of these criteria.  Despite that, they are very useful tools for solving problems.

## What Evolutionary Algorithms Are Out There?

There are four major classes of evolutionary algorithms.  They are:

1. Genetic algorithms
2. Genetic programming
3. Evolutionary programming
4. Gene expression programming

For this series, I’m going to focus on the two that I’m most familiar with:  genetic algorithms and genetic programming.

## What Is An Evolutionary Algorithm?

Evolutionary algorithms are programming techniques which take advantage of biological metaphors to solve a particular class of problem.  At a high level, the algorithmic mechanisms aren’t that difficult to understand, although in practice, you can go significantly deeper than what I’m going to cover in this series.

All of my code for this series will be in R.  There are evolutionary algorithm libraries in a number of languages, including .NET and Java.

## What Kinds Of Problems Does An Evolutionary Algorithm Solve?

There is a particular class of problems that evolutionary algorithms are great at solving.  These are problems with a number of characteristics:

1. They include very large search spaces.  Let’s suppose you can set the values of variables x1…x20, and each variable can be an integer between 0 and 9.  One way of finding the best solution would be to loop through each potential solution, starting from (0, 0, 0, …, 0, 0, 0) up through (9, 9, 9, …, 9, 9, 9).  There’s just one problem:  doing this would take 10^20 trials.  10^20 is 100,000,000,000,000,000,000.  That’s a pretty big number.
2. There is a known way to score answers.  Evolutionary algorithms are supervised learning algorithms; in order to evolve a solution, we need an end goal.
3. The solution is too complex to do by hand.  If you can write it out yourself (or calculate it easily), you don’t need a complex algorithm!
4. We expect that we have the tools necessary to come up with a solution.  As we’ll see, evolutionary algorithms only work if we provide the right building blocks.
5. The environment is regularly changing, meaning the appropriate solution changes as well.  This is not a requirement, but in this scenario, evolutionary algorithms perform much better than many of their static counterparts.
6. The type of problem is a “hill-climbing” problem, where fitness is approximately continuous.  Evolutionary algorithms can handle “wavy” fitness functions, so we can definitely deal with multi-peaked fitness functions.  But if the fitness function is overly discrete (mostly made up of non-continuous jumps), the likelihood of success goes down.

## Okay, So What Kinds Of Problems Does An Evolutionary Algorithm Solve?

Hey, wait, I already answered that!  Oh, fine, you want a list of specific examples…  Very well.  The short answer is that evolutionary algorithms are most useful for solving NP-Hard problems, where we don’t have a one-size-fits-all solution.

• Optimization Problems – Finding the minimum or maximum value of a mathematical function, circuit layout, antenna design, the traveling salesman problem, the knapsack problem.
• Machine Learning Problems – Finding optimal weights for neural nets, generating rules for classification engines, training robotic sensors.
• Economic Models – Portfolio bidding, game theory.
• Ecological Models – Host-parasite co-evolution, symbiosis, resource flow.

These are particular examples of cases in which we see the characteristics I described above.

## Why Not Use Evolutionary Algorithms For Everything?

Evolutionary algorithms aren’t suitable for all types of problems.  If you don’t have a way of programmatically judging correctness (e.g., if you’re performing an unsupervised learning task such as clustering), then evolutionary algorithms simply won’t work.  Also, evolutionary algorithms share a set of negative tendencies:

• There is no guarantee that an evolutionary algorithm will find a solution, even if there really is one.
• There is no guarantee of good performance or that the algorithm will finish in a reasonable time.
• The answers may differ each time you run the algorithm.
• It is often hard to tell when the algorithm should stop:  I’ve seen cases where there’s stasis for 40-50 generations and then a sudden breakthrough.
• You can get stuck in local minima/maxima with evolutionary algorithms (though EAs are usually better about breaking free of these than other techniques like simulated annealing).
• There is no guarantee that the solution your algorithm will provide will be in the easiest-to-understand form.  There may be ways to simplify the formula or decision tree further, as evolutionary algorithms do not prevent things like contradictory branches or vestigial code.  The “best” solution is equivalent to the solution which comes the closest to our desired solution, and usually the size of the evolved result is irrelevant in that regard.

Despite all of these issues, evolutionary algorithms are very powerful techniques.  Over the course of the next few blog posts, I’m going to go into more detail.  The next post will provide a crash course on genetics.  After that, we’ll focus on genetic algorithms and then genetic programming.

# What Makes A Data Platform Professional?

During PASS Summit, I wrote a post about the broadening data platform.  I talked about the term Data Professional and how I feel it describes the changes going on in the SQL space.  Here’s the problem:  it’s a terrible guide.  It’s a great description, it’s a wonderful attitude, it’s an ambitious goal; but it’s a terrible guide.

Being a data professional means being a jack of all trades.  It means being a renaissance man (or woman).  It means a career plan that looks like this:

Here’s my summary of Eugene’s argument:

1. The concept of “data platform” is too broad to be meaningful, because
2. nobody can gain expertise in the entire data platform.  Therefore,
3. focus on a particular slice of the platform.

Before I go on to make a counter-argument, let me start by saying that I agree almost whole-heartedly with this summary.  But that never stopped me from arguing before…

So here’s my counter-argument:  the concept of “data platform” is quite broad and nobody will master it all.  Within that group, there are core skills, position-specific core skills, secondary skills, and tertiary skills.  I recommend one of two paths:

1. The Polyglot:  be great at the core and position-specific core skills, good at secondary skills, and aware of tertiary skills.  Be in the top 25% of your field at one to two skills, ideally one core and one secondary.
2. The Specialist:  be outstanding (in the top 1-3% of your field) at one to two skills.

With that in mind, let’s flesh this argument out.

### Gauss and Mathematics

Carl Friedrich Gauss was the last true polyglot mathematician.  He was able to make significant contributions to pretty much every branch of mathematics at the time, something no mathematician has been able to do since.  The reason for this is not that Gauss was that much smarter than any other person since him, but rather that Gauss himself helped expand the world of mathematics considerably, so a 21st century Gauss would need to know about everything Gauss did plus what his contemporaries did plus what their chronological successors did.  This spider-webbed growth of knowledge makes it impossible for one person to repeat what Gauss did.

Even though nobody can be a “true” polyglot mathematician anymore—in the sense of knowing everything about mathematics at the present time—it doesn’t mean that “mathematics” is so broad a term as to be meaningless as a career plan.  Instead, it means that we all have to specialize to an increasingly greater extent relative to the entire body of knowledge.

### What’s The Right Level Of Specialization?

One of my colleagues, Brian Carrig, was working the SQL Clinic at SQL Saturday Raleigh.  When he lost his original “I can help you with…” badge, he created his own.

This…might not be the right level of specialization.  Brian’s a sharp enough guy that he knows more than the average practitioner on a wide range of topics, but I don’t think the choices are to be Brian or to be replaced by robo-developers; there are few enough people who can reach Brian’s level of skill that if these were the only choices, it’d be a dystopian nightmare for IT practitioners (and I’m not just saying that because I want Brian to provision me some more SQL Server instances).

So there has to be a middle ground between “know everything” and “exit the industry.”  I agree with Eugene that we have to specialize, and here’s what I see, at least based off the current landscape.

### Sub-Data Platform Job Categories

To save this from being a 5000-word essay, let’s pick four very broad categories for data platform jobs.  These share some overlap and there are certainly people who don’t fit in any of these roles, so this is not a complete taxonomy.  It should serve as a guide for us, however.

The four broad categories I have in mind are as follows:  database developer, database administrator, Business Intelligence specialist, and data analyst.  Database developers focus on writing and tuning queries and tend to specialize in performance tuning.  Database administrators focus on backup and recovery, dealing with database corruption, and availability; they tend to specialize in process automation.  Business Intelligence specialists build warehouses and migrate data from different systems into warehouses; this is a broad enough term that it’s hard to say what they specialize in, but pick one piece of the puzzle (cubes, warehouse modeling, ETL) and you’ll find people who specialize there.  Finally, data analysts apply concepts of statistical analysis to business problems and come up with explanations or predictions of behavior.

I see four major categories of skill, but the specific details of what fits into each category will differ based on the role.  Again, this is not intended to be a taxonomy but rather a conceptual description.  We have the following concepts:  core skills, position-specific core skills, secondary skills, and tertiary skills.

Core skills are skills which are common to all data platform professionals.  These are relatively uncommon but tend to be fundamental to all positions.  Think of things such as an understanding of SQL and relatively basic query tuning (which includes figuring out when to tune a query and what information is available on your platform for tuning queries).  But really, when we think of core skills, we’re thinking of position-specific core skills.

As an example of a position-specific core skill, administrators need to know how to back up and restore the databases under their care.  How you do this will differ based on the product, but if you administer a database without knowing how to recover it, you’re running a major risk and have a major skill gap.  So basically, position-specific core skills are the things that you train juniors to do and expect mid-levels to know already.

Secondary and tertiary skills are even more nebulous, but I see them as skills which are ever-more-distant from the position-specific core skills.  For a database administrator, the ability to write .NET code is a secondary skill:  relatively few employers or practitioners think of a database administrator as someone who needs to write C# or F# code, but they can see how it’d apply to the position.  A language like R would be a tertiary skill:  a skill which the average practitioner has trouble tying back to day-to-day life.  Most DBAs never think of using R for anything (although I’m trying to change that in my own small way).

Now, skills move over time.  As Eugene points out in his post, I sincerely believe that administrators who don’t understand PowerShell are at a serious disadvantage and that there will come a time when database administrators entirely lacking in PowerShell scripts will be superfluous.  We’re probably the better part of a decade away from containerization technologies like Docker having the same impact, but it’s ramping up as well.  On the other side, an example of a technique that differentiated good from great database administrators a long time ago was the ability to lay out files on disk to minimize drive latency.  SANs and later SSDs killed that skill altogether.

I wouldn’t describe these skill shifts as fluid, but rather tectonic; they don’t change overnight but they do significantly alter the landscape when they happen, and you don’t want to be on the wrong side of that tectonic shift.

In my eyes, the answer is to build your skills along one of two paths:  the polyglot or the specialist.  The polyglot knows a little about a lot but has a few major points of depth.  A polyglot database developer might know a lot about writing PL/SQL and tuning Postgres queries, but also has enough experience to query Lucene, write some basic Informatica scripts, and maintain a Python-based ETL project.  At many companies, a broad slice with good depth in a couple skills and relatively little depth in several skills is enough, and for our polyglot developer, it keeps doors open in case the market for Postgres developers flattens out for a few years or our developer wants to go down a new road.

In contrast to the polyglot, a specialist developer is elite at certain skills and knowingly ignorant of most others.  A specialist SQL Server query tuner is in the top 1-3% of all developers at tuning SQL Server queries and knows all kinds of language and configuration tricks to squeeze percentages off of queries which take milliseconds or even microseconds, but might not know (or care) much about the right way to automate taking backups.  You go to the polyglot to solve general, overarching problems but go to the specialist because you have a particular problem which is beyond the polyglot’s skill level.

In case the parallel isn’t completely clear, this model fits with the model for medical doctors:  you have Primary Care Physicians/General Practitioners (PCPs or GPs) and you have specialists.  The PCP knows how to diagnose issues and sees patients with a wide range of maladies.  Sometimes, the PCP refers a patient to a specialist for further diagnosis or action. As an example, a PCP might stitch up a patient with a nasty gash, but that same PCP won’t rebuild a shattered femur; that PCP will refer the patient to a specialist in that area.

### Is This Really The Right Direction?

A couple days before Eugene’s post, I had a discussion with a person about this topic.  She was getting back into development after a couple years doing something a bit different, and one thing she noticed was the expectation of employees being more and more polyglot.  Her argument is that we, as IT professionals, have a lot to do with this, as there’s a bit of a race to the bottom with developers wanting to learn more and willing to spend more and more time learning things.  This makes IT jobs feel like running on a treadmill:  you expend a lot of effort just trying to keep up.  And this shows in how job titles and job role expectations have changed, including the concept of a data scientist (which I’ll touch upon at the end).

I’m not sure I agree with this assessment, but it does seem that more positions require (or at least request) knowledge of a range of skills and technologies, that it’s not enough to be “just” a T-SQL stored procedure developer in most shops.  So to that extent, there seems to be a combination of developers innately moving this direction as well as job roles shifting in this direction.

To the extent that she is correct, there’s a good question as to how sustainable this strategy is, as the platform is expanding ever-further but we don’t have any more hours in the day.  But at the same time, take a step back and this is nothing new:  database developers are already a subset of all developers (much as we are loath to admit this sometimes), so these categories are themselves cases of specialization.  But let’s shelve that for a moment.

### Anecdote:  ElasticSearch And Me

It’s time for an anecdote.  A few months ago, I started running a predictive analytics team.  Our first project was to perform predictions of disk growth based on historical data.  No big deal at all, except that all the data was stored in ElasticSearch and our DBA team wanted the results in ElasticSearch as well.  My experience with ElasticSearch prior to this assignment was practically nil, but I went into it eager…at the beginning…

There were days that I wasted just figuring out how to do things that would take me five or ten minutes in SQL Server (particularly around aggregating data).  In that sense, it was a complete waste of time to use ElasticSearch, and throughout that time period I felt like an idiot for struggling so hard to do things that I intuitively knew were extremely simple.  It took a while, but I did muddle through the project, which means that I picked up some ElasticSearch experience.  I’m definitely not good at writing ElasticSearch queries, but I’m better than I was three months ago, and that experience can help me out elsewhere if I need to use ElasticSearch in other projects or even to give me an idea of other ways to store and query data.

This is one of the most common ways that people learn:  they muddle through things because they need to, either because the work requires it or because it’s a challenge or growth opportunity.  If you’re able to take the results of that muddling through and apply it to other areas, you’ve got a leg up on your competition.  And I think it’s easier to form quality systems when you have a few examples—it’s easier to reason accurately from several scenarios rather than just one scenario.

### Conclusion

Summarizing a rather long blog post, I do agree with Eugene that “data platform” is a very broad space, and expecting someone to know everything about it would be folly.  But that’s not unique.  “Programmer” is extremely broad as well, but we don’t expect embedded systems developers to write databases (or even write database queries) or design responsive web applications.  Doctors and lawyers specialize to extreme degrees, as do plenty of other professionals, and I see no reason to expect anything different from data platform professionals.  I do believe that unless you are at the extreme right end of the distribution for certain skills (and can thus be a top-end specialist), you want to err in the direction of being broader rather than deeper, as it reduces the chances of getting caught in a sudden paradigm shift (remember how cool Web Forms was for about 4 years?) and risking your livelihood as a result.

One other point I want to make is that the broadness of this space shows the power of teamwork and complementary skills.  There’s an argument that a good scrum team is made up of a bunch of generalists who can all fill in each other’s roles on demand.  I think that concept’s garbage for several reasons, one of which is that you often need specialists because specialists fix problems that generalists can’t.  So instead of having a team of generalists, you have a team of people with different skills, some of which overlap and some of which complement each other:  you have one or two data specialists, one or two UI specialists, one or two “backbone” specialists (usually .NET or Java developers), one or two QA specialists, etc.  This says to me that there’s less a difference in kind than a difference in degree, even between the polyglot type and the specialist type:  you can be a polyglot with respect to other data professionals (because you’re using several data platform technologies and are working across multiple parts of the stack) while being a specialist with respect to your development team (because you’re the database person).

### Coda:  Data Science

One bit at the end of Eugene’s post is that he’s interested in digging into data science.  For a post criticizing the impossibility of living up to a definition, this did elicit a knowing chuckle.  The problem is that the term “data scientist” is a microcosm of the issues with “data platform professional.”  To be a data scientist, you should have development skills (preferably in multiple languages, including but not limited to SQL, R, and Python), a strong background in statistics (ideally having worked through PhD level courses), and a deep knowledge of the context of data (as in spending years getting to know the domain).  I saw the perfect t-shirt today to describe these people.

There are very few people who have all three skill sets and yet that’s what being a data scientist requires.  It’s the same problem as “data platform professional” but at a slightly smaller scale.

# Analyzing DBA Salaries

Recently, Brent Ozar put together a data professional salary survey.  He did some early analysis and, now that the results are in, opened the floor for others to dig into the data.  In this post, I’m going to take that data set and do some fairly simple descriptive statistics with R.  So let’s fire up RTVS and get started.

### Getting A Picture Of The Data

The first thing we want to do is grab the Excel file and load it into R.  There’s a nice package called XLConnect which makes working with Excel a breeze. Aside from that, I’m going to load dplyr and tidyr to help with data cleansing, as well as the Penn World Table package (pwt9) to get some economic data we’ll use later in conjunction with salary data.

```install.packages("XLConnect")
install.packages("tidyr")
install.packages("dplyr")
install.packages("pwt9", repos = 'http://cran.rstudio.com/')
library(XLConnect)
library(tidyr)
library(dplyr)
library(pwt9)

#Load the survey workbook; adjust the file name to wherever you saved your copy.
wb <- loadWorkbook("salary_survey.xlsx")
salary.data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")
```

The salary survey worksheet really starts on row 4, with the first three rows being pre-header descriptive information. The first thing we want to do is check out the summary of our data frame to see if there are any immediate goodies:

```summary(salary.data)
```

From the summary, we can see that there’s a wide range of salary values, from \$430 a year up to \$1.45 million. There are certainly some values which seem suspicious, and that’s a common problem with data sets like this: misreported values, typos, bad translations, etc.  We’re going to have to do the best we can with the data we have, but in a practical data science scenario, we might want to investigate these values to see if they’re legit and clean them up if not.

Because salary is our primary fact value here, I figured it’d be interesting to see the distribution of salaries in a histogram. That’s easy to do with the hist() function, but if you just build a histogram of salaries, you’ll get skewed results. There are several orders of magnitude difference in salaries in this data set, and the way we smooth out those differences and capture meaningful trend differences is to take the log of salary. It doesn’t really matter which log base you use, so long as you’re consistent. I’m going to use log10 because that gives us an intuitive understanding of values: 10^4 is \$10,000 per annum, 10^5 is \$100,000, and so on.

```hist(log10(salary.data\$SalaryUSD))
```

The resulting histogram looks like this:

This is not really a normal distribution; it skews left (thanks in part to that \$430 per year value, but there are plenty of very low values in the data set).  It also shows that a strong majority of people answered between approximately 10^4.8 and 10^5.2.  If you want to see what the exact bin widths are (assuming that, like me, you didn’t define them in the first place), that’s easy:

```h <- hist(log10(salary.data\$SalaryUSD))
h\$breaks
```
```> h\$breaks
[1] 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2```

To give you some context, 10^4.6 is \$39,810 per year, 10^4.8 is \$63,095 per year, 10^5.0 is \$100,000 per year, and 10^5.2 is \$158,489 per year.  Most data professionals surveyed make between \$63K and \$159K per year.
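If you want to sanity-check those dollar figures, exponentiating the break points directly is a one-liner:

```#Convert log10 salary break points back into annual salaries
round(10 ^ c(4.6, 4.8, 5.0, 5.2))
#39811  63096 100000 158489
```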

### Breakdowns By Country

Now let’s look at some country-level breakdowns.  For this, I’m going to use some dplyr function chains, and my first step is to see what we’ve got.

```entrants.by.country <- data.frame(salary.data %>% group_by(Country) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))
```

This will create a new data frame called entrants.by.country, which includes the country name, the number of entrants (n), and the mean salary. We can see that Anglosphere countries dominate the survey results, which makes sense given that it was on an English-language blog:

```entrants.by.country[order(entrants.by.country\$n, decreasing = TRUE),]
```
```> entrants.by.country[order(entrants.by.country\$n, decreasing = TRUE),]
Country mean.salary    n
64         United States   106741.99 1879
63        United Kingdom    62570.73  293
26                 India    22507.34   70
3              Australia    91087.17   69
20               Germany    76408.07   45
39           Netherlands    65543.22   36
57                Sweden    60151.21   34
55          South Africa    54313.75   28
40           New Zealand    78336.92   24
15               Denmark   101767.35   23
27               Ireland    80177.25   20
46                Poland    32028.70   20
58           Switzerland   120138.89   18```

That said, we have data for people from 66 countries; I’m only including some of the results in the list above for space reasons. If we want to see where the highest mean salary is for data professionals, we just need to reorder the sort criteria:

```entrants.by.country[order(entrants.by.country\$mean.salary, decreasing = TRUE),]
```
```> entrants.by.country[order(entrants.by.country\$mean.salary, decreasing = TRUE),]
Country mean.salary    n
58           Switzerland   120138.89   18
6                Bermuda   120000.00    1
64         United States   106741.99 1879
25               Iceland   105500.00    2
15               Denmark   101767.35   23
42                Norway    97820.89    9
3              Australia    91087.17   69```

These results say that Switzerland and Bermuda are nice places to work if you want to maximize expected salary. But there’s a problem: we only have one entrant who chose Bermuda, so that’s not very helpful for data comparison. What if we picked the one guy in Bermuda who makes more than \$100K a year? Or maybe everybody makes more than this guy? We don’t know, and there just aren’t enough entrants in this sample to give us good information, so I want to filter out any country with fewer than 10 entrants. Why 10? Either because I hate Norway or I like round numbers; your call.

```filtered.entrants.by.country <- entrants.by.country %>% filter(n > 9)
count(filtered.entrants.by.country)
filtered.entrants.by.country[order(filtered.entrants.by.country\$mean.salary, decreasing = TRUE),]
```
```> count(filtered.entrants.by.country)
Source: local data frame [1 x 1]

n
(int)
1 23```
```> filtered.entrants.by.country[order(filtered.entrants.by.country\$mean.salary, decreasing = TRUE),]
Country mean.salary    n
21    Switzerland   120138.89   18
23  United States   106741.99 1879
5         Denmark   101767.35   23
1       Australia    91087.17   69
10        Ireland    80177.25   20
14    New Zealand    78336.92   24
11         Israel    77200.00   10
8         Germany    76408.07   45
13    Netherlands    65543.22   36
6         Finland    63640.00   10
22 United Kingdom    62570.73  293
20         Sweden    60151.21   34
2         Belgium    59354.55   11
18   South Africa    54313.75   28
7          France    53063.64   11
12          Italy    51410.00   15
19          Spain    40233.33   15
15         Poland    32028.70   20
3          Brazil    31152.18   11
16        Romania    27284.00   15
9           India    22507.34   70
17         Russia    21693.54   13```

We’re now down to 23 countries, so I included the full set here. Honestly, I think even 10 entrants is way too low to make an accurate guess; I’d want 30-50 before I start feeling at all confident, and even that’s low considering all the factors which go into salary within a country.

Looking at these countries, we see Switzerland in its own tier, followed by the United States and Denmark.  Australia falls between that tier and the next one, which includes Ireland, New Zealand, Israel, and Germany.  Down and down it goes, until we land on India and Russia at the bottom.

Looking at salary data in a vacuum can be interesting, but I think we all have an intuitive understanding that it’s more expensive to live in the United States, UK, or Germany than in Romania, Russia, or India.  Economists have done a lot of research on the topic, particularly around purchasing power parity.  The quick idea is, supposing I have a basket of goods that I can purchase in one area, how much would it cost me to purchase an equivalent basket of goods in another area?  Within one country, there can be vast differences:  compare prices in New York City versus Iowa City.  Between countries, there is also a significant difference, and PPP tries to adjust for that difference.
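As a toy illustration of the PPP idea (all numbers made up), the conversion factor is just the ratio of the local-currency basket cost to the reference basket cost:

```#Hypothetical basket prices: \$100 in the US, 3500 rubles in Russia
basket.usd <- 100
basket.rub <- 3500
ppp.factor <- basket.rub / basket.usd   #35 rubles buys what \$1 buys
#A 1,400,000 ruble salary, converted at PPP rather than at market exchange rates:
1400000 / ppp.factor                    #\$40,000 in purchasing power terms
```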

In our case, I want the purchasing power parity-adjusted Gross Domestic Product of each country; I’ll treat that as the norm.  Before anybody jumps down my throat about how per capita GDP isn’t the same as average income, the various ways in which GDP is flawed as a calculation metric, that per-capita GDP includes people outside the workforce, that workforce compositions can differ between countries, etc. etc., I know and I agree.  My hope here is that most of the differences will balance out without having to go through major analytic hurdles. We’re futzing about, not writing academic papers.

Anyhow, how do we get this data?  Well, that’s the Penn World Table package we loaded early on. Building the data set is pretty easy. In this case, I just want to compare my data set against the latest year in the Penn World Table, which is 2014.

```data("pwt9.0")
countries.2014 <- pwt9.0 %>% filter(year == 2014)
```

In this data set, there are two PPP GDP measures: an expenditure-side measure and an output-side measure. The expenditure-side standard is better for comparing living standards for people in those countries, whereas the output-side measure is better for understanding productivity differences between countries. We’re going to use the expenditure-side GDP values, although the differences aren’t huge. Also, I should note that population and GDP are both measured in millions, so to get per capita GDP, we can do straight division.

```country.gdp <- countries.2014 %>% select(Country = country, rgdpe, pop)
country.gdp\$gdp.per.capita <- country.gdp\$rgdpe / country.gdp\$pop
#Rename two countries so they match the country names in the salary data set.
country.gdp <- lapply(country.gdp, gsub, pattern = "United States of America", replacement = "United States")
country.gdp <- as.data.frame(lapply(country.gdp, gsub, pattern = "Russian Federation", replacement = "Russia"))```

I also did some data cleanup, as there are two countries whose names do not match the values in the salary data set. To allow me to join these two sets together, I’m going to rename the countries in the GDP data set. That allows me to run a query like the following:

``` country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
```
```> country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
Country gdp.per.capita
1 United States 52292.281832077```

This says that in 2014, US per-capita GDP (with PPP adjustments) was \$52,292.28. My goal is to compare that per-capita GDP with the mean salary for a country and get a ratio for data professionals versus a normalized salary per country, still understanding that this is a very imprecise way of solving the problem.

What I need to do is join the data sets together, and dplyr lets me do that quite nicely. I’m going to use the left_join function to link the data sets, and then do some additional work. First, I need to convert gdp.per.capita from a factor to a numeric value so I can perform computations, like calculating salary ratios.

```salary.versus.gdp <- left_join(filtered.entrants.by.country, country.gdp, by = "Country")
#The GDP decimal values became factors, so let's defactorize them.
salary.versus.gdp\$gdp.per.capita <- as.numeric(as.character(salary.versus.gdp\$gdp.per.capita))
#Now compute a salary ratio
salary.versus.gdp\$salary.ratio <- salary.versus.gdp\$mean.salary / salary.versus.gdp\$gdp.per.capita
salary.versus.gdp[order(salary.versus.gdp\$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)
```

Here’s the full result set:

```> salary.versus.gdp[order(salary.versus.gdp\$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)
Country    n salary.ratio
18   South Africa   28     4.478221
9           India   70     4.308437
11         Israel   10     2.320406
5         Denmark   23     2.265324
14    New Zealand   24     2.255296
1       Australia   69     2.114833
3          Brazil   11     2.094886
21    Switzerland   18     2.054748
23  United States 1879     2.041257
8         Germany   45     1.662452
10        Ireland   20     1.644101
6         Finland   10     1.575217
22 United Kingdom  293     1.554880
12          Italy   15     1.435772
13    Netherlands   36     1.387466
2         Belgium   11     1.359226
20         Sweden   34     1.348734
7          France   11     1.347673
16        Romania   15     1.310630
15         Poland   20     1.273202
19          Spain   15     1.188078
17         Russia   13     0.902426```

A couple things stand out to me. First, in most of the countries on this list, data professionals make roughly 1.5-2x the “expected” value, suggesting that being a data professional is a relatively lucrative line of work. In South Africa and India, those ratios are even higher. I could expect it from India, where there are still large pockets of poverty and where IT jobs tend to be among the best-paying jobs in the country. South Africa is an odd duck to me here, but that might be my ignorance of their circumstances.

The other thing that stands out is Russia. Russia is the only country in which our salary ratio is under 1.0, indicating that being a data professional is a sub-par job. I decided to take a look at what makes Russia so special here.

#### What Makes Russia So Special?

The first thing I want to do is filter the Russian data set and take a quick look at their results to see if anything stands out.

```russians <- salary.data %>% filter(Country == "Russia")
summary(russians\$SalaryUSD)
```

Immediately, I see something pop out:

```> summary(russians\$SalaryUSD)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1050 12000 20000 21690 35000 39570```

Wait, there’s a person making \$1050 a year? Let’s look at that person’s details:

```salary.data %>% filter(SalaryUSD == 1050)
```
```> salary.data %>% filter(SalaryUSD == 1050)
Timestamp SalaryUSD Country      PrimaryDatabase YearsWithThisDatabase OtherDatabases   EmploymentStatus                          JobTitle
1 2017-01-10 00:29:51      1050  Russia Microsoft SQL Server                     4            Full time employee Developer: App code (C#, JS, etc)
ManageStaff YearsWithThisTypeOfJob OtherPeopleOnYourTeam DatabaseServers           Education EducationIsComputerRelated   Certifications
1          No                      4           More than 5               4 Bachelors (4 years)                        Yes No, I never have
HoursWorkedPerWeek            TelecommuteDaysPerWeek EmploymentSector LookingForAnotherJob Counter
1                 40 None, or less than 1 day per week Private business                   No       1```

According to this, the person is a full-time app developer with 4 years of experience and a Bachelor’s degree. He’s not looking for another job, so if this is legit data, somebody might want to snatch him up at \$2000 a year. I’d take a risk on a developer for \$2K.

Realistically, this is probably bad data—maybe it’s \$1050 a month? Or \$10,500? In any event, if you take this guy out, the mean jumps from \$21,690 to \$23,413. That’s still below the Russian per-capita GDP of \$24,039, but is close enough to be almost 1:1…but that’s still not a great ratio in comparison to the other countries on this list.
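A quick back-of-envelope check on that jump, using the rounded figures from above:

```#13 Russian entrants averaging ~\$21,690; drop the \$1050 entry and re-average the other 12
(13 * 21690 - 1050) / 12   #~23,410, in line with the \$23,413 figure above
```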

### By Primary Database

The next thing I looked at was primary database.

```primary.database <- data.frame(salary.data %>% group_by(PrimaryDatabase) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))
primary.database[order(primary.database\$mean.salary, decreasing = TRUE),]
```
```> primary.database[order(primary.database\$mean.salary, decreasing = TRUE),]
PrimaryDatabase mean.salary    n
1  Amazon RDS (any flavor)   325000.00    3
5            Elasticsearch   156500.00    2
8                  MongoDB   134000.00    3
3                Cassandra   130000.00    1
11                   Other   103418.44   16
13                     SAP   101357.14    7
10                  Oracle    90506.67   89
7     Microsoft SQL Server    90395.03 2729
12              PostgreSQL    86613.41   17
2             Azure SQL DB    84050.00    6
4                      DB2    72941.67    6
6         Microsoft Access    61700.00    3
14                  SQLite    44000.00    1```

Given that 94% of entrants chose SQL Server, there’s no way we’re getting anything good from this slice. That said, it does feed my biases toward DB2 and Access…

### By Job Title

I now want to look at salary by job title. Again, I’m going to filter out any job titles with fewer than 10 entries.

```job.title <- data.frame(salary.data %>% group_by(JobTitle) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))
filtered.job.title <- job.title %>% filter(n > 9)
filtered.job.title[order(filtered.job.title\$mean.salary, decreasing = TRUE),]
```
```> filtered.job.title[order(filtered.job.title\$mean.salary, decreasing = TRUE),]
JobTitle mean.salary    n
3                                         Data Scientist   123317.55   11
2                                              Architect   120086.86  244
9                                                Manager   116003.52  142
8                                               Engineer    99983.54  147
4                                                    DBA    87697.75 1514
10                                                 Other    83594.62   53
7                                       Developer: T-SQL    83310.70  264
6  Developer: Business Intelligence (SSRS, PowerBI, etc)    80021.84  203
1                                                Analyst    79985.09  128
5                      Developer: App code (C#, JS, etc)    78001.04  181```

Self-described data scientists have the highest mean salary (go team!), followed by architects and managers. The one thing that surprised me a bit was that T-SQL developers (AKA Database Engineers) make slightly less in this data set than Database Administrators. There could be a number of factors, such as country-level differences, but the difference isn’t vast, so I’ll go with it.

At the bottom of the list are app developers, data analysts, and BI developers/report writers. This more or less fits my preconceived notions, and the numbers don’t seem outlandish.

### Significance Testing

The last thing I want to do is perform a few significance tests to get an idea of how robust our findings are—in other words, are we able to draw specific conclusions from this data?

#### Data Scientist Versus DBA

The first question I want to ask is, based on this data set, can we expect the wide world of data scientists to make more than the whole population of DBAs? Intuitively, I’d say yes, and we can see that there’s a \$35K a year difference in the results, but the problem is that the number of data scientists in the sample is only 11.

The way that I’m going to compare these data sets is by running a t-test on the log10 salaries of each data set. I know the distributions are not normal, so I can’t simply assume normality. That said, t-tests are fairly robust to departures from normality: although they were designed for normally distributed data, they can still perform reasonably well without that requirement.

```data.scientists <- salary.data %>% filter(JobTitle == "Data Scientist")
dbas <- salary.data %>% filter(JobTitle == "DBA")
t.test(log10(data.scientists\$SalaryUSD), log10(dbas\$SalaryUSD))
```
```> t.test(log10(data.scientists\$SalaryUSD), log10(dbas\$SalaryUSD))

Welch Two Sample t-test

data:  log10(data.scientists\$SalaryUSD) and log10(dbas\$SalaryUSD)
t = 1.9772, df = 10.173, p-value = 0.07574
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01828365  0.31237969
sample estimates:
mean of x mean of y
5.031427  4.884379```

I’m taking a two sample t-test (because I have two samples: a sample of data scientists and a sample of database administrators) and get a t score of 1.9772 and a p value of 0.076. Neither of these gives us enough confidence to state that, based solely on this data set, a data scientist in the overall population earns more than a database administrator. We can visualize a big part of the reason just by looking at the two histograms:

3 of the 11 data scientists are in the 10^5.2–5.4 range, which is well above the peak for DBAs. But most data scientists fit into the upper end of the DBA salary range, so even though the means are quite different, it’s hard to confirm—with so few data scientists in the sample—that there’s a real difference rather than just sample-related noise.
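The histogram images themselves didn’t make it inline here, but a minimal sketch to draw them (reusing the data.scientists and dbas data frames from the t-test code above, and overlaying the two distributions) might look like this:

```#Overlay the two log10 salary distributions; DBAs in gray, data scientists in semi-transparent blue
hist(log10(dbas\$SalaryUSD), col = "gray", xlab = "log10(SalaryUSD)", main = "DBAs vs. data scientists")
hist(log10(data.scientists\$SalaryUSD), col = rgb(0, 0, 1, 0.5), add = TRUE)
```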

But let’s do a quick thought experiment and add three more data scientists to the survey, each of whom has a salary of our mean, \$123,317.50. If we do that, the results become quite different:

```ds.thought.experiment <- c(data.scientists\$SalaryUSD, 123317.5, 123317.5, 123317.5)
t.test(log10(ds.thought.experiment), log10(dbas\$SalaryUSD))
```
```> t.test(log10(ds.thought.experiment), log10(dbas\$SalaryUSD))

Welch Two Sample t-test

data:  log10(ds.thought.experiment) and log10(dbas\$SalaryUSD)
t = 2.7378, df = 13.368, p-value = 0.01658
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.0340569 0.2855808
sample estimates:
mean of x mean of y
5.044198  4.884379```

Now the t score is 2.74 and the p value is under 0.02. This makes sense: by adding a few more people to the sample (and assuming that the sample mean is anywhere near accurate), we see a clearer bifurcation.

#### DBA Versus DBE

We saw that we didn’t have enough data scientist entries to differentiate data scientists from DBAs in terms of salary, so what can we do about this debate? Spoilers: not much.

```dbes <- salary.data %>% filter(JobTitle == "Developer: T-SQL")
t.test(log10(dbes\$SalaryUSD), log10(dbas\$SalaryUSD))
```
```> t.test(log10(dbes\$SalaryUSD), log10(dbas\$SalaryUSD))

Welch Two Sample t-test

data:  log10(dbes\$SalaryUSD) and log10(dbas\$SalaryUSD)
t = -1.6626, df = 336.98, p-value = 0.09733
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.072605456  0.006090578
sample estimates:
mean of x mean of y
4.851122  4.884379```

There’s a t score of -1.66, indicating that DBE salaries are lower, but the p value is nowhere near good enough for us to be confident that there’s a statistically significant difference.

#### US Versus Russia

Okay, with two failed tests, let’s go for a gimme: American data professionals earn more than Russian data professionals. This isn’t quite a layup because there are only 13 Russians in the sample, so we might end up with something like the data scientist scenario.

```americans <- salary.data %>% filter(Country == "United States")
t.test(log10(americans\$SalaryUSD), log10(russians\$SalaryUSD))
```
```> t.test(log10(americans\$SalaryUSD), log10(russians\$SalaryUSD))

Welch Two Sample t-test

data:  log10(americans\$SalaryUSD) and log10(russians\$SalaryUSD)
t = 6.613, df = 12.019, p-value = 2.472e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.5325389 1.0557546
sample estimates:
mean of x mean of y
5.001766  4.207620```

Success! We have a t score over 6.6 and a p value way below 0.05, meaning that based on our sample, it’s safe to say that Americans earn more.

#### US Versus UK

```uk <- salary.data %>% filter(Country == "United Kingdom")
t.test(log10(americans\$SalaryUSD), log10(uk\$SalaryUSD))
```
```> t.test(log10(americans\$SalaryUSD), log10(uk\$SalaryUSD))

Welch Two Sample t-test

data:  log10(americans\$SalaryUSD) and log10(uk\$SalaryUSD)
t = 22.971, df = 363.35, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2171545 0.2578154
sample estimates:
mean of x mean of y
5.001766  4.764281```

We have enough members of both samples to be very confident in the difference. Note that even though the means are closer between the US and UK samples, the t score is much higher; this is because we have so many more members in the sample and can draw more confident conclusions.

#### US Versus Australia

Australia’s a tougher call because the sample is a bit smaller and the mean is a lot closer.

```aussies <- salary.data %>% filter(Country == "Australia")
t.test(log10(americans\$SalaryUSD), log10(aussies\$SalaryUSD))
```
```> t.test(log10(americans\$SalaryUSD), log10(aussies\$SalaryUSD))

Welch Two Sample t-test

data:  log10(americans\$SalaryUSD) and log10(aussies\$SalaryUSD)
t = 3.9512, df = 73.376, p-value = 0.0001771
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.03386076 0.10277441
sample estimates:
mean of x mean of y
5.001766  4.933449```

Even with that hedging, there are enough members of the Australian sample for us to be confident that there is a statistically significant difference between the populations.

#### US Versus Denmark

So how about the US’s nearest neighbor in terms of average data professional salary?

```denmark <- salary.data %>% filter(Country == "Denmark")
t.test(log10(americans\$SalaryUSD), log10(denmark\$SalaryUSD))
```
```> t.test(log10(americans\$SalaryUSD), log10(denmark\$SalaryUSD))

Welch Two Sample t-test

data:  log10(americans\$SalaryUSD) and log10(denmark\$SalaryUSD)
t = 1.1847, df = 22.086, p-value = 0.2487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.06728552  0.24668863
sample estimates:
mean of x mean of y
5.001766  4.912065```

The t score is small, the p value is large, and the 95% confidence interval crosses the 0 boundary. In other words, there aren’t enough people in the sample to show a statistically significant difference between the US and Denmark.

### Conclusions

I wouldn’t draw too many conclusions from this sample, but it is rich enough to get some ideas.  We were able to see some common trends across most western countries regarding the relative pricing of data professionals—even though Americans make more in the data set than people in all other countries save for Switzerland (after filtering out the low-count countries), the earnings ratio relative to other Americans is nearer to the middle of the pack, meaning that data professionals tend to be at similar relative levels of affluence.

We also learned that you should avoid DB2…but I don’t think you needed me to tell you that…

Hopefully this inspires you to grab the data set yourself and play around with it.  There are plenty of other tests that people can run on the data; I’m just scratching the surface.

# Let’s Build A Portable Hadoop Cluster, Part 1

A little while back, I had a mini-series on building a Hadoop cluster.  Part 1 covered what I bought.  Part 2 introduced the idea of a Dockerized Hadoop cluster.  Part 3 covered installation and setup.

That’s all well and good, but one of my goals is to make this a portable Hadoop cluster so I can present with it.  The Intel NUC does not have a built-in battery like a laptop, so if you pull the plug (hint:  don’t), it’ll just shut itself right off.  When you power it back on, you’ll find that your Hadoop cluster has gone into hiding:

I’ll show how to bring the cluster back to life in the next blog post, but I will say that it takes about 10-15 minutes for everything to come up, and I might not have 10-15 minutes before my talk to set things up.  I’d prefer to be able to attach the NUC to an external battery pack an hour or two before the talk begins and let it coast from there.

In addition, I also need to have a network connection so I can talk to the NUC.  I don’t want to trust that my presentation location will have a good internet connection and I don’t want my NUC exposed to the network, so I need a miniature router as well.

Here’s what I landed on:

The TP-Link router was already in my bag, so I didn’t buy it specifically for this project.  It’s an alright travel router, but it’s probably the weak link here; if I were buying new, I’d go with something a little more modern and powerful.

I did a lot more research on rechargeable power packs, and the BP220 seemed to be the best for the job.  The Intel NUC that I have draws about 17 watts when idling and can spike up to 77 at load (and I’ve even read that it could spike into the 80s when you push it hard).  The BP220 supports that load and provides 223 watt-hours of juice per charge.  That means I could stretch out a battery charge for up to 13 hours (223 / 17), although a more realistic figure would be an average of about 35 watts, so maybe 6-7 hours.  Still, that’s more than I need to get me through a one-hour presentation and two hours of prep.
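The back-of-the-envelope runtime math above is just watt-hours divided by average draw; a quick sketch using the wattage figures quoted:

```python
# Battery capacity and draw figures quoted above
capacity_wh = 223      # BP220 capacity in watt-hours
idle_watts = 17        # NUC draw at idle
realistic_watts = 35   # assumed average draw while presenting

best_case_hours = capacity_wh / idle_watts       # ~13.1 hours
realistic_hours = capacity_wh / realistic_watts  # ~6.4 hours
```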

The battery pack itself is a little heavy, weighing in at a little over 3 pounds—in other words, it’s heavier than my laptop, especially if you pack the power brick as well.  Combined with the NUC, it’s about 7-8 pounds of extra gear, meaning that I’m fine taking it with me to present but wouldn’t want to schlep it around all the time.  That said, it’s also pretty compact.  At 10.6″ long, it fits nicely into my laptop bag, and it and the NUC can share the inside pocket while my laptop fits into the outside pocket.  At that point, I’m essentially carrying two laptops, but I did that for a while anyhow, so no big deal.

Finally, the power strip makes it so that I can plug in these devices along with my laptop.  Power outlets aren’t always conveniently located, and you rarely get more than one or maybe two outlets, so that’s in my bag just in case I do run low on battery power and need to plug everything in.

# Working With Azure SQL DW

This is a continuation of my Polybase series.

The first step to working with Polybase in Azure SQL Data Warehouse is to provision an Azure SQL Data Warehouse instance.  I’m going to follow along with the data I used in my Azure Blob Storage examples and load that data into Azure SQL Data Warehouse using Polybase.

The best way to connect to Azure SQL Data Warehouse is not SQL Server Management Studio, but rather SQL Server Data Tools.  So we’ll connect using SSDT.

In my scenario, I have an Azure SQL Data Warehouse instance named csdwdb.

After putting in the correct details, I can see my instance in the SQL Server Object Explorer in Visual Studio.

My plan is to do two things:  first, I want to pull all of my flight data into Azure SQL Data Warehouse, putting it into a table with a clustered columnstore index.  Second, I want to perform some processing of the data and put the results back into Azure Blob Storage to allow me to migrate data back to an on-prem SQL Server instance.

#### External Types

The first thing I need to do is to create an external data source, external file format, and external table for my Azure Blob Storage flights folder.  Looking back at my post on Azure Blob Storage, I can repurpose most of that code for this scenario as well:

```
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (
LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS
(
FIELD_TERMINATOR = N',',
USE_TYPE_DEFAULT = True
)
);
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
[year] int NULL,
[month] int NULL,
[dayofmonth] int NULL,
[dayofweek] int NULL,
deptime VARCHAR(100) NULL,
crsdeptime VARCHAR(100) NULL,
arrtime VARCHAR(100) NULL,
crsarrtime VARCHAR(100) NULL,
uniquecarrier VARCHAR(100) NULL,
flightnum VARCHAR(100) NULL,
tailnum VARCHAR(100) NULL,
actualelapsedtime VARCHAR(100) NULL,
crselapsedtime VARCHAR(100) NULL,
airtime VARCHAR(100) NULL,
arrdelay VARCHAR(100) NULL,
depdelay VARCHAR(100) NULL,
origin VARCHAR(100) NULL,
dest VARCHAR(100) NULL,
distance VARCHAR(100) NULL,
taxiin VARCHAR(100) NULL,
taxiout VARCHAR(100) NULL,
cancelled VARCHAR(100) NULL,
cancellationcode VARCHAR(100) NULL,
diverted VARCHAR(100) NULL,
carrierdelay VARCHAR(100) NULL,
weatherdelay VARCHAR(100) NULL,
nasdelay VARCHAR(100) NULL,
securitydelay VARCHAR(100) NULL,
lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
LOCATION = N'historical/2008.csv.bz2',
DATA_SOURCE = WASBFlights,
FILE_FORMAT = CsvFileFormat,
-- Up to 5000 rows can have bad values before Polybase returns an error.
REJECT_TYPE = Value,
REJECT_VALUE = 5000
);
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
[year] int NULL,
[month] int NULL,
[dayofmonth] int NULL,
[dayofweek] int NULL,
deptime VARCHAR(100) NULL,
crsdeptime VARCHAR(100) NULL,
arrtime VARCHAR(100) NULL,
crsarrtime VARCHAR(100) NULL,
uniquecarrier VARCHAR(100) NULL,
flightnum VARCHAR(100) NULL,
tailnum VARCHAR(100) NULL,
actualelapsedtime VARCHAR(100) NULL,
crselapsedtime VARCHAR(100) NULL,
airtime VARCHAR(100) NULL,
arrdelay VARCHAR(100) NULL,
depdelay VARCHAR(100) NULL,
origin VARCHAR(100) NULL,
dest VARCHAR(100) NULL,
distance VARCHAR(100) NULL,
taxiin VARCHAR(100) NULL,
taxiout VARCHAR(100) NULL,
cancelled VARCHAR(100) NULL,
cancellationcode VARCHAR(100) NULL,
diverted VARCHAR(100) NULL,
carrierdelay VARCHAR(100) NULL,
weatherdelay VARCHAR(100) NULL,
nasdelay VARCHAR(100) NULL,
securitydelay VARCHAR(100) NULL,
lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
LOCATION = N'historical/',
DATA_SOURCE = WASBFlights,
FILE_FORMAT = CsvFileFormat,
-- Up to 5000 rows can have bad values before Polybase returns an error.
REJECT_TYPE = Value,
REJECT_VALUE = 5000
);
GO
```

Note that, unlike in the on-prem examples, there is no separate database specified anywhere in this script. I’m running the query against my Azure SQL DW database directly, so there is no separate database to name.

I want to make sure that everything’s working, so a quick select statement is in order:

```
SELECT TOP(100) * FROM dbo.FlightsAll;
```

So, “quick” might be an overstatement, but I did get results I wanted and proof that we have data available.

My next step is to take this data and load it into Azure SQL Data Warehouse.  The smart way to create tables in Azure SQL Data Warehouse is to use the CREATE TABLE AS SELECT syntax.  I’m going to use that, as well as the advice they give on getting a fairly unique distribution key, and create my 2008 flights table as follows:

```
CREATE TABLE [dbo].[Flights2008DW]
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.Flights2008;
GO
```

After a little over a minute, I get my results:

```
Query completed. Rows were rejected while reading from external source(s).
1 row rejected from external table [Flights2008] in plan step 3 of query execution:
Location: '/historical/2008.csv.bz2' Column ordinal: 0, Expected data type: INT, Offending value: Year (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.
```

That’s a header row, and I’m okay with it not making its way in.  As a quick aside, I should note that I picked tailnum as my distribution key.  The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values.  After loading the 2008 data, I loaded all years’ data the same way, except selecting from dbo.Flights instead of Flights2008.
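Azure SQL Data Warehouse hash-distributes a table’s rows across 60 distributions, which is why a good distribution key needs well over 60 distinct, mostly non-NULL values. A toy sketch of the idea (the engine’s actual hash function is internal; CRC32 and the tail numbers here are purely illustrative):

```python
import zlib

NUM_DISTRIBUTIONS = 60  # Azure SQL DW always uses 60 distributions

def distribution_for(key: str) -> int:
    """Illustrative stand-in for the engine's internal hash function."""
    return zlib.crc32(key.encode()) % NUM_DISTRIBUTIONS

# Hypothetical tail numbers; with far more than 60 distinct values,
# nearly every distribution receives rows and the load stays balanced.
tail_numbers = [f"N{i:05d}" for i in range(500)]
occupied = {distribution_for(t) for t in tail_numbers}
occupied_count = len(occupied)
```

With only a handful of distinct key values (or lots of NULLs, which all hash to the same distribution), most of the 60 distributions would sit empty and a few would do all the work.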

#### Querying The Data

So let’s do some quick queries and make sure that everything looks alright.  Going back to earlier this year, I had an example of a query against a larger data set.  My subset of flights is not exactly the same here as it was there—I only have flight data for the years 2003 through 2008 in Azure SQL Data Warehouse—but I figured it’d be a good start.

Here’s my slightly revised query:

```
SELECT
f.origin,
f.distance
FROM dbo.FlightsAllDW f
WHERE
f.year = 2005
AND f.dest = 'CMH';
```

And against an Azure SQL Data Warehouse instance with 200 DWUs, it took 6 seconds to load the results.

Again, that’s not a fair comparison:  I’m getting 200 DWUs against data distributed across 60 distributions versus running a query on my laptop.  That said, it was pleasant seeing results come back that quickly.

### Migrating Data Out Using Polybase

I’m going to show one last trick here:  migrating data out using CETAS, or CREATE EXTERNAL TABLE AS SELECT.

```
CREATE EXTERNAL TABLE [dbo].[CMHFlights]
WITH
(
LOCATION = N'columbus/',
DATA_SOURCE = WASBFlights,
FILE_FORMAT = CsvFileFormat,
-- Up to 5000 rows can have bad values before Polybase returns an error.
REJECT_TYPE = Value,
REJECT_VALUE = 5000
)
AS SELECT * FROM dbo.FlightsAllDW WHERE dest = 'CMH';
GO
```

This query takes data from my FlightsAllDW Azure SQL DW table and pumps it into a new external table backed by Azure Blob Storage.

In eight seconds, it figured out the 220K records that interested me and wrote them to a new external table called CMHFlights, and in a folder marked columbus.  If I look at the folder, I can see that the Polybase engine created 60 files:

Remember that when we did this ourselves using on-prem Polybase (e.g., writing to Hadoop), we ended up with the data split into 8 files.  Azure SQL Data Warehouse is special.

### Conclusion

Polybase inside Azure SQL Data Warehouse isn’t that different from on-prem Polybase.  Its primary focus is to get data into and out of Azure SQL Data Warehouse for cases in which you don’t want to use Azure Data Factory.  One really cool thing you can do with Polybase in Azure SQL DW that you can’t do on-prem is CREATE EXTERNAL TABLE AS SELECT.  Otherwise, if you understand how the on-prem product works, you’ll get how it works in SQL Data Warehouse.