# How Much Can We Earn? Implementing A Model

This is part six of a series on launching a data science project.

Last time around, we walked through what building a model entails. Earlier in the series, we built a clean(er) data set and did some analysis, and in this post I’m going to build on that.

### Modeling

Because our question is a “how much?” question, we want to use regression to solve the problem. The most common form of regression you’ll see in demonstrations is linear regression, because it is easy to teach and easy to understand. In today’s demo, however, we’re going to build a neural network with Keras. Although our demo is in R, Keras actually uses Python on the back end to run TensorFlow. There are other libraries out there which can run neural networks strictly within R (for example, Microsoft Machine Learning’s R implementation has the `RxNeuralNet()` function), but we will use Keras in this demo because it is a popular library.

Now that we have an algorithm and implementation in mind, let’s split the data out into training and test subsets. I want to use Country as the partition variable because I want to ensure that we retain some data from each country in the test set. To make this split, I am using the `createDataPartition()` function in `caret`. I’ll then split out the data into training and test data.

```
trainIndex <- caret::createDataPartition(survey_2018$Country, p = 0.7, list = FALSE, times = 1)
train_data <- survey_2018[trainIndex, ]
test_data <- survey_2018[-trainIndex, ]
```

We will have 1976 training rows and 841 testing rows.
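As a quick sanity check, those row counts line up with the 70/30 split we asked `createDataPartition()` for (a sketch with the counts hard-coded from above):

```
# Sanity check: the train/test row counts should reflect a 70/30 split.
n_train <- 1976
n_test  <- 841
n_total <- n_train + n_test        # 2817 survey responses in total

round(n_train / n_total, 3)        # 0.701 -- close to the requested p = 0.7
```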

Once I have this data split, I want to perform some operations on the training data. Specifically, I want to think about the following:

- One-hot encode the categorical data
- Mean-center the data, so that the mean of each numeric value is 0
- Scale the data, so that the standard deviation of each value is 1

The bottom two operations together are called normalizing the data. This is a valuable technique with many algorithms, including neural networks, as it helps gradient descent converge.
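To make the centering and scaling steps concrete, here is what they do to a single numeric column in base R (a sketch with made-up numbers, not the survey data):

```
x <- c(40, 45, 50, 60, 80)           # e.g., hours worked per week

centered <- x - mean(x)              # mean-centering: subtract the mean
scaled   <- centered / sd(centered)  # scaling: divide by the standard deviation

mean(scaled)                         # 0 after centering
sd(scaled)                           # 1 after scaling
```

Base R’s `scale(x)` does both steps in one call; `step_center()` and `step_scale()` in `recipes` do the same thing, but learn the means and standard deviations from the training data so they can be re-applied to new data.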

In order to perform all of these operations, I will create a `recipe`, using the `recipes` package.

NOTE: It turns out that normalizing the features results in a slightly worse outcome in this case, so I’m actually going to avoid that. You can uncomment the two sections and run it yourself if you want to try. In some problems, normalization is the right answer; in others, it’s better without normalization.

```
rec_obj <- recipes::recipe(SalaryUSD ~ ., data = train_data) %>%  # Build out a set of rules we want to follow (a recipe)
  step_dummy(all_nominal(), -all_outcomes()) %>%                  # One-hot encode categorical data
  #step_center(all_predictors(), -all_outcomes()) %>%             # Mean-center the data
  #step_scale(all_predictors(), -all_outcomes()) %>%              # Scale the data
  prep(data = train_data)

rec_obj
```
```
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor         17

Training data contained 1976 data points and no missing data.

Operations:

Dummy variables from Country, EmploymentStatus, JobTitle, ... [trained]
```

Now we can `bake` our data based on the recipe above. Note that I performed all of these operations only on the training data. If we normalized using the combined training and test data, our optimization function would get a sneak peek at the distribution of the test data, and that would bias our result.

After building up the `x_` series of data sets, I’ll build vectors which contain the salaries for the training and test data. I need to make sure to remove the SalaryUSD variable; we don’t want to make that available to the trainer as an independent variable!

```
x_train_data <- recipes::bake(rec_obj, newdata = train_data)
x_test_data  <- recipes::bake(rec_obj, newdata = test_data)
y_train_vec <- pull(x_train_data, SalaryUSD)
y_test_vec  <- pull(x_test_data, SalaryUSD)
# Remove the SalaryUSD variable.
x_train_data <- x_train_data[, -1]
x_test_data  <- x_test_data[, -1]
```

At this point, I want to build the Keras model. I’m creating a `build_model` function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the `RMSProp` optimizer. This is a good default optimizer for neural networks, although you might try `Adagrad`, `Adam`, or `AdaMax` as well. Our loss function is Mean Squared Error, which is good for finding the error in a regression. Finally, I’m interested in the Mean Absolute Error: the dollar amount difference between our function’s prediction and the actual salary. The closer this is to $0, the better.

```
build_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(units = 256, input_shape = c(ncol(x_train_data)), activation = "relu") %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 512, activation = "relu") %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 1, activation = "linear") # No activation --> linear layer

  # RMSProp is a nice default optimizer for a neural network.
  # Mean Squared Error is a classic loss function for regression-style problems, whether with a neural network or otherwise.
  # Mean Absolute Error gives us a metric which translates directly to the number of dollars we are off with our predictions.
  model %>% compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae")
  )
}
```

Building out this model can take some time, so be patient.

```
model <- build_model()
model %>% fit(as.matrix(x_train_data), y_train_vec, epochs = 100, batch_size = 16, verbose = 0)
result <- model %>% evaluate(as.matrix(x_test_data), y_test_vec)
result
```
```
$loss
863814393.60761

$mean_absolute_error
19581.9413644471
```

What this tells us is that, after generating our model, we are an average of `mean_absolute_error` dollars off from reality. In my case, that was just under $20K. That’s not an awful amount off. In fact, it’s an alright start, though I wouldn’t trust this model as-is for my negotiations. With a few other enhancements, we might see that number drop a bit and start getting into trustworthy territory.
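To make that metric concrete, mean absolute error is just the average of the absolute differences between predictions and actuals (a sketch with made-up salaries, not model output):

```
actual    <- c(100000, 75000, 120000)
predicted <- c( 90000, 95000, 110000)

mae <- mean(abs(predicted - actual))
mae    # 13333.33: on average, we're about $13,333 off per person
```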

With a real data science project, I would dig further, seeing if there were better algorithms available, cross-validating the training set, etc. As-is, this result isn’t good enough for a production scenario, but we can pretend that it is.

Now let’s test a couple of scenarios. First up: my salaries over time, as well as a case where I moved to Canada last year. There might be some exchange rate shenanigans, but there were quite a few Canadian entrants in the survey, so it should be a pretty fair comparison.

```
test_cases <- test_data[1:4, ]

test_cases$SalaryUSD = c(1, 2, 3, 4)
test_cases$Country = c("United States", "United States", "United States", "Canada")
test_cases$YearsWithThisDatabase = c(0, 5, 11, 11)
test_cases$EmploymentStatus = c("Full time employee", "Full time employee", "Full time employee", "Full time employee")
test_cases$JobTitle = c("Developer: App code (C#, JS, etc)", "DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)", "Manager", "Manager")
test_cases$ManageStaff = c("No", "No", "Yes", "Yes")
test_cases$YearsWithThisTypeOfJob = c(0, 5, 0, 0)
test_cases$OtherPeopleOnYourTeam = c(5, 0, 2, 2)
test_cases$DatabaseServers = c(8, 12, 150, 150)
test_cases$Education = c("Bachelors (4 years)", "Masters", "Masters", "Masters")
test_cases$EducationIsComputerRelated = c("Yes", "Yes", "Yes", "Yes")
test_cases$Certifications = c("No, I never have", "Yes, and they're currently valid", "Yes, but they expired", "Yes, but they expired")
test_cases$HoursWorkedPerWeek = c(40, 40, 40, 40)
test_cases$TelecommuteDaysPerWeek = c("None, or less than 1 day per week", "None, or less than 1 day per week", "None, or less than 1 day per week", "None, or less than 1 day per week")
test_cases$LookingForAnotherJob = c("No", "Yes", "No", "No")
test_cases$CareerPlansThisYear = c("Stay with the same employer, same role", "Stay with the same role, but change employers", "Stay with the same employer, same role", "Stay with the same employer, same role")
test_cases$Gender = c("Male", "Male", "Male", "Male")

# Why is this only letting me fit two objects at a time?
x_test_cases_1 <- recipes::bake(rec_obj, newdata = head(test_cases, 2))
x_test_cases_2 <- recipes::bake(rec_obj, newdata = tail(test_cases, 2))
x_test_cases <- rbind(x_test_cases_1, x_test_cases_2)
x_test_cases <- x_test_cases %>% select(-SalaryUSD)

model %>% predict(as.matrix(x_test_cases))
```
```
58330.6 75734.8 109290 78821.7
```

The first prediction was pretty close to right, but the next two were off. Also compare them to my results from last year. The Canadian rate is interesting: the exchange rate at the time was about 75-78 US cents per Canadian dollar, and the model’s Canadian prediction is about 72% of the equivalent US prediction.
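That 72% figure comes straight from the two manager predictions above:

```
us_manager  <- 109290.0   # US manager prediction from the output above
cad_manager <- 78821.7    # identical profile, but in Canada

round(cad_manager / us_manager, 2)   # 0.72
```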

Note that I had a bit of difficulty running the bake function against these data sets.  When I tried to build up more than two rows, I would get a strange off-by-one error in R.  For example, here’s what it looks like when I try to use `head(test_cases, 3)` instead of 2:

```
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
Traceback:

1. recipes::bake(rec_obj, newdata = head(test_cases, 3))
2. bake.recipe(rec_obj, newdata = head(test_cases, 3))
3. bake(object$steps[[i]], newdata = newdata)
4. bake.step_dummy(object$steps[[i]], newdata = newdata)
5. cbind(newdata, as_tibble(indicators))
6. cbind(deparse.level, ...)
7. data.frame(..., check.names = FALSE)
8. stop(gettextf("arguments imply differing number of rows: %s",
.     paste(unique(nrows), collapse = ", ")), domain = NA)
```

I haven’t figured out the answer to that yet, but we’ll hand-wave that problem away for now and keep going with our analysis.

Next, what happens if we change me from Male to Female in these examples?

```
test_cases$Gender = c("Female", "Female", "Female", "Female")

# Why is this only letting me fit two objects at a time?
x_test_cases_1 <- recipes::bake(rec_obj, newdata = head(test_cases, 2))
x_test_cases_2 <- recipes::bake(rec_obj, newdata = tail(test_cases, 2))
x_test_cases <- rbind(x_test_cases_1, x_test_cases_2)
x_test_cases <- x_test_cases %>% select(-SalaryUSD)

model %>% predict(as.matrix(x_test_cases))
```
```
52563.5 69958.5 103513 73491.9
```

In my scenario, there is a $5,776.65 difference between male and female salaries. There is no causal explanation here (nor will I venture one in this post), but based on the data in this survey, the model predicts higher salaries for men than for women with otherwise identical profiles.

### Conclusion

In today’s post, we used Keras to build up a decent first attempt at a model for predicting data professional salaries.  In reality, there’s a lot more to do before this is ready to roll out, but we’ll leave the subject here and move on to the next topic, so stay tuned.

## 2 thoughts on “How Much Can We Earn? Implementing A Model”

1. Ron Hendrickson says:

Hi, I am following your excellent tutorial, but I am getting this error on this line, in the model build out section, where x_test_data is a data frame and y_test_vec is a set of values:

> result <- model %>% evaluate(as.matrix(x_test_data), y_test_vec)
Error in do.call(object$evaluate, args) :
'what' must be a function or character string

2. Kevin Feasel says:

Hi, Ron. Apologies for the late response. It looks like maybe you copied an extra word, as “what” doesn’t get used in any of the code but it is the next line in the explanatory text.