Genetic Programming: DBA Salaries

This is part of a series entitled Genetics In Action.

Today’s post shows off another variant of genetic programming.  Last time around, we looked at mathematical functions using rgp.  Today, we are going to build conditional programs using evtree, another evolutionary library in R.

Time To Buy Me A DBA

The data set we will use is the survey of data professional salaries that Brent Ozar put together in 2016.  I’ve already looked at the data once, but now I want to apply a different approach:  I want to see how well I can explain and predict variance in DBA salaries.

Starting out, I’m going to load three packages:  evtree (which lets me build evolutionary trees), XLConnect (which lets me read Excel files), and tidyverse (which is the Swiss army knife of R packages):

if(!require(evtree)) {
    install.packages("evtree", repos = "http://cran.us.r-project.org")
    library(evtree)
}

if(!require(XLConnect)) {
  install.packages("XLConnect", repos = "http://cran.us.r-project.org")
  library(XLConnect)
}

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

Let’s pull in the salary data using XLConnect.

wb <- loadWorkbook("Data/2017_Data_Professional_Salary_Survey_Responses.xlsx")
salary_data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")

Next up, I would like to look at the structure of the salary data, which I can do easily by running str(salary_data).

We have a sample of approximately 3000 data professionals, as well as 18 variables per observation—I’m ignoring Timestamp and Counter, as they are not relevant to the discussion.

What To Do, What To Include

Last time, I looked at purchasing power parity to try to find countries where data professionals are relatively better off. This time around, I’m going to limit my search to data professionals in the United States. Even with the United States, there are differences in PPP—after all, a DBA living in San Francisco or New York City making $100K a year is quite a different story than a DBA in Peoria or Richmond making $100K a year—but there’s a smaller spread. It’s a spread that I cannot control, so I’m expecting that I won’t be able to explain a healthy chunk of the variance, but we proceed apace.

Next, I will focus only on those who selected Microsoft SQL Server as their primary database. I don’t want to deal with level differences between Oracle, DB2, MySQL, and SQL Server. Also, a big majority of respondents selected SQL Server, so there’s probably some bias in the non-SQL Server participants.

After that, I want to include only full-time employees, filtering out part-time, freelance, or contract employees. That’s not a value judgment, but it does constrain my sample a bit further.

Finally, I want to filter out anybody making at least $300K a year. There are only a couple of those in the data set and they skew the results pretty hard. If you’re making $300K+ a year as a full-time employee, great; but you’re so atypical a case that I don’t care about you. Alternatively, at least one of those results might be a typo, but we’re not sure.

Now that I’ve explained the filters, I want to discuss my interesting variables. I want to explain SalaryUSD using five variables: YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, and ManageStaff. The other variables are probably helpful as well, but for my demo, I want to keep the number of variables in play relatively low.

Now that I’ve discussed what I plan to do, let’s actually do it in many fewer lines in R:

salary_pred <- salary_data %>%
  filter(Country == "United States") %>%
  filter(PrimaryDatabase == "Microsoft SQL Server") %>%
  filter(EmploymentStatus == "Full time employee") %>%
  filter(SalaryUSD <= 300000) %>%
  select(SalaryUSD, YearsWithThisDatabase, HoursWorkedPerWeek, Education, EmploymentSector, ManageStaff)

salary_pred$Education <- as.factor(salary_pred$Education)
salary_pred$EmploymentSector <- as.factor(salary_pred$EmploymentSector)
salary_pred$ManageStaff <- as.factor(salary_pred$ManageStaff)

I now have a data set called salary_pred which includes the interesting variables. After putting in all of my filters, I still have 1624 observations, so I’m satisfied that I should have enough data to find something interesting.

Something Interesting

I am now going to build a regression tree with evtree. A regression tree is a fairly simple idea. It’s a decision tree, except that we build a regression for each leaf node. The syntax for building a genetic program which gives us a regression tree is pretty simple:

ev <- evtree(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
             data = salary_pred, minbucket = 10, maxdepth = 4)

This churns for a little bit and then, when it’s done, we can call plot(ev) to get the following image:

This image is a little busy; I actually find it easier to read the results by running ev instead:

Model formula:
SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + 
    EmploymentSector + ManageStaff

Fitted party:
[1] root
|   [2] YearsWithThisDatabase < 8
|   |   [3] YearsWithThisDatabase < 5: 73507.922 (n = 244, err = 117381058953.5) | | [4] YearsWithThisDatabase >= 5: 90571.821 (n = 274, err = 126256913002.2)
|   [5] YearsWithThisDatabase >= 8
|   |   [6] EmploymentSector in Local government, State/province government, Student: 91592.034 (n = 117, err = 53016079035.9)
|   |   [7] EmploymentSector in Federal government, Private business
|   |   |   [8] ManageStaff in No
|   |   |   |   [9] YearsWithThisDatabase < 16: 105536.492 (n = 453, err = 253588315377.2) | | | | [10] YearsWithThisDatabase >= 16: 117556.179 (n = 285, err = 172506498805.9)
|   |   |   [11] ManageStaff in Yes: 123739.729 (n = 251, err = 200634714045.6)

Number of inner nodes:    5
Number of terminal nodes: 6

Let’s sum this up in a list, where each selection is exclusive of the others:

  1. If you have spent less than 5 years as a SQL Server professional, we expect your salary to be around $73,507.
  2. If you have spent between 5 and 8 years as a SQL Server professional, we expect your salary to be around $90,571.
  3. If you have spent at least eight years as a SQL Server professional and you are in local/state government or you are a student, we expect your salary to be around $91,592.
  4. If you are not a manager and have spent 8-16 years as a SQL Server professional, we expect your salary to be around $105,536.
  5. If you are not a manager and have spent at least 16 years as a SQL Server professional, we expect your salary to be around $117,556.
  6. If you are a manager and have spent at least 8 years as a SQL Server professional, we expect your salary to be around $123,739.

On the plus side, each of these nodes has at least 100 members, and most have 250+, so it’s a fairly broad distribution in each. On the minus side, the error terms seem rather large.

Thinking Linearly

To give a contrast to our regression tree, let’s try a simple linear regression.

salary_lm <- lm(SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + Education + EmploymentSector + ManageStaff,
                data = salary_pred)

There’s nothing fancy to this regression, and when I call summary(salary_lm), I get the following:

Call:
lm(formula = SalaryUSD ~ YearsWithThisDatabase + HoursWorkedPerWeek + 
    Education + EmploymentSector + ManageStaff, data = salary_pred)

Residuals:
    Min      1Q  Median      3Q     Max 
-110294  -15585   -1307   12911  144645 

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                                67621.12    6151.45  10.993  < 2e-16
YearsWithThisDatabase                       2262.35     100.98  22.403  < 2e-16
HoursWorkedPerWeek                           155.67      93.77   1.660 0.097090
EducationBachelors (4 years)                7288.25    1958.67   3.721 0.000205
EducationDoctorate/PhD                     17428.29    8201.28   2.125 0.033733
EducationMasters                           12071.38    2279.88   5.295 1.36e-07
EducationNone (no degree completed)         4931.55    2375.89   2.076 0.038083
EmploymentSectorLocal government          -21701.26    5074.87  -4.276 2.01e-05
EmploymentSectorPrivate business           -5810.89    4357.09  -1.334 0.182502
EmploymentSectorState/province government -23477.37    4985.01  -4.710 2.70e-06
EmploymentSectorStudent                   -29033.34   10683.44  -2.718 0.006646
ManageStaffYes                             11518.96    1452.14   7.932 3.99e-15
                                             
(Intercept)                               ***
YearsWithThisDatabase                     ***
HoursWorkedPerWeek                        .  
EducationBachelors (4 years)              ***
EducationDoctorate/PhD                    *  
EducationMasters                          ***
EducationNone (no degree completed)       *  
EmploymentSectorLocal government          ***
EmploymentSectorPrivate business             
EmploymentSectorState/province government ***
EmploymentSectorStudent                   ** 
ManageStaffYes                            ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23930 on 1612 degrees of freedom
Multiple R-squared:  0.3228,	Adjusted R-squared:  0.3182 
F-statistic: 69.86 on 11 and 1612 DF,  p-value: < 2.2e-16

What we see is that the linear model explains approximately 32% of the variance in salaries. As I mentioned above, I believe that relative purchasing power plays a significant role in explaining variance, but because we don’t have the data at that grain, I cannot test this conjecture.

That aside, we can see some interesting contrasts. Note that the linear model picked up local/state government and student as major negatives, knocking off at least $21K and calling local/state government statistically significant. The linear model, meanwhile, does not consider hours worked per week significant, either statistically or in terms of power. Each extra hour of work per week leads to $155 in salary per year. I’d hazard a guess that this is because most full-time data professionals are salaried employees.

Let’s Compare

Now that we have a genetic model and a linear model, we want to compare the two and see which turns out better.

First, let’s use our models to figure out predictions over the data set. Then I want to calculate the root mean square deviance for each of the two.

salary_pred$EVPrediction <- predict(ev)
salary_pred$LMPrediction <- predict(salary_lm)

sqrt(mean((salary_pred$SalaryUSD - salary_pred$EVPrediction)^2))
sqrt(mean((salary_pred$SalaryUSD - salary_pred$LMPrediction)^2))

The evolutionary regression tree has an RMSD of $23,845 whereas the linear regression has an RMSD of $23,841. In other words, they’re practically the same.

But let’s not allow aggregates to conceal the important mechanisms of each model. Let’s see how each predicts for different people.

The Lives Of Others (But Mostly Me)

In this test, I want to build up three test cases. The first test is a new graduate, right out of school with a 4-year Bachelor’s. The second test is me, after having spent 5 years working for the state. The final test is me today.

test_cases <- data.frame(
    YearsWithThisDatabase = c(0, 5, 11),
    HoursWorkedPerWeek = c(40, 40, 40),
    Education = c("Bachelors (4 years)", "Masters", "Masters"),
    EmploymentSector = c("Private business", "State/province government", "Private business"),
    ManageStaff = c("No", "No", "Yes")
)
YearsWithThisDatabase HoursWorkedPerWeek Education EmploymentSector ManageStaff
0 40 Bachelors (4 years) Private business No
5 40 Masters State/province government No
11 40 Masters Private business Yes

We can see how each of the models would predict the people in this sample.

predict(ev, test_cases)
predict(salary_lm, test_cases)

In the first case, for the new person, both models are in harmony: the evolutionary regression tree came up with a value of $73,507 and the linear regression was $75,325.

In the second case, the evolutionary regression tree came up with $90,571 and the linear regression was $73,753. That’s a huge difference between the two!

In the third case, we see another fairly big difference: $123,739 for the evolutionary model versus $116,513 for the linear model.

Even though the two models have the same top-line deviance from the mean, it’s clear which model I prefer!

Conclusion

This wraps up genetic programming.  There’s a lot more to the topic—after all, John Koza wrote numerous books on it!—but hopefully that foments some consideration on where it might be useful to use a genetic programming algorithm.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s