GitPitch: Revamping My Slide Decks

Over the past few months, I’ve started transitioning my slide decks to GitPitch.  I’ve used reveal.js for years and enjoyed being able to put together presentations in HTML format.  I can run these slides in any browser (or even on a phone) and don’t have to worry about PowerPoint messing up when I go to presentation mode.  I also get the ability to transition up and down as well as left and right, which lets me “drill into” topics.  For example, from my Genetic Algorithms talk:

If I want to go into a topic, I can slide down; if I need to skip it for time, I can bypass the section.

GitPitch also uses Reveal, so many of the skills that I’ve built up still work for me.  But there are a few compelling reasons for me to migrate.

Reasons For Moving To GitPitch

So why did I move to GitPitch?  It started with an offer of a free year of GitPitch Pro for Microsoft MVPs.  For transparency purposes, I am taking advantage of that free year.  After the year is up, I’m going to start paying for the service, but I did get it for free.

So why did I make the jump?  There are a few reasons.

Background Images

One of the best features of GitPitch (and a reason I went with Pro over the Free edition) is the ability to use a background image and set the transparency level.  This lets me add flavor to my slide decks without overwhelming the audience:

There are a number of good sources for getting free images you can use for these purposes, including Unsplash, Creative Commons Search, and SkyPixel for drone photos.  I’m also working to introduce my own photography when it makes sense.

Markdown

Reveal.js uses HTML to build slides, but GitPitch uses Markdown.  Markdown is a fairly easy syntax to pick up and is a critical part of Jupyter Notebooks.

To build the slide above, the markdown looks like this:

---?image=presentation/assets/background/Background_13.jpg&size=cover&opacity=15

### Outlook

|Outlook|Yes|No|P(Yes)|P(No)|
|-------|---|--|------|-----|
|Sunny|2|3|2/9|3/5|
|Overcast|4|0|4/9|0/5|
|Rainy|3|2|3/9|2/5|
|Total|9|5|100%|100%|

No HTML, and it’s pretty easy to make changes.  Because it’s a text-based format, major changes are also pretty easy to find-and-replace, something hard to do with PowerPoint.

Math

Recently, I did a presentation on Naive Bayes classifiers.  To display math, we can use MathJax.  Here’s an example slide:

---

### Applying Bayes' Theorem

Supposing multiple inputs, we can combine them together like so:

`$P(B|A) = \dfrac{P(x_1|B) * P(x_2|B) * ... * P(x_n|B)}{P(A)}$`

This is because we assume that the inputs are **independent** from one another.

Given `$B_1, B_2, ..., B_N$` as possible classes, we want to find the `$B_i$` with the highest probability.

And here’s how it looks:

I’m definitely not very familiar with MathJax, but there are online editors to help you out.

Code Samples

Another cool feature is the ability to include code samples:

What’s really cool is the ability to walk through step by step:

---

```r
if(!require(naivebayes)) {
  install.packages("naivebayes")
  library(naivebayes)
}
if(!require(caret)) {
  install.packages("caret")
  library(caret)
}

data(iris)

set.seed(1773)
irisr <- iris[sample(nrow(iris)),]
irisr <- irisr[sample(nrow(irisr)),]

iris.train <- irisr[1:120,]
iris.test <- irisr[121:150,]

nb <- naivebayes::naive_bayes(Species ~ ., data = iris.train)
#plot(nb, ask=TRUE)

iris.output <- cbind(iris.test,
                     prediction = predict(nb, iris.test))
table(iris.output$Species, iris.output$prediction)
confusionMatrix(iris.output$prediction, iris.output$Species)
```

@[1-4](Install the naivebayes package to generate a Naive Bayes model.)
@[5-8](Install the caret package to generate a confusion matrix.)
@[12-14](Pseudo-randomize the data set. This is small so we can do it by hand.)
@[16-17](Generate training and test data sets.)
@[19](Building a Naive Bayes model is as simple as a single function call.)
@[22-25](Generate predictions and analyze the resulting predictions for accuracy.)

I haven’t used this as much as I want to, as my talks historically have not included much code—I save that for the demos.  But with the ability to walk through code a section at a time, it makes it easier to present code directly.

Up On GitHub

Something I wasn’t a huge fan of but grew to be okay with was that the markdown files and images are stored in your GitHub repo.  For example, my talk on the data science process is just integrated into my GitHub repo for the talk.  This has upsides and downsides.  The biggest upside is that I don’t have to store the slide decks anywhere else, but I’ll describe the downside in the next section.

Tricky Portions

There are a few things that I’ve had to work out in the process.

Handling Updates

Testing things out can be a bit annoying because you have to push changes to GitHub first.  Granted, this isn’t a big deal—commit and push, test, commit and push, test, rinse and repeat.  But previously I could save the HTML file, open it locally, and test the results.  That was a smoother process, though as I’ve built up a couple of decks, I have patterns I can follow so that reduces the pain quite a bit.

Online Only (Mostly)

There is an offline slideshow mode in GitPitch Pro and a desktop version, but I haven’t really gotten into those.  It’s easier for me just to work online and push to my repos.

When presenting, I do need to be online to grab the presentation, but that’s something I can queue up in my hotel room the night before if I want—just keep the browser window open and I can move back and forth through the deck.

Down Transitions And Background Images

One thing I’ve had to deal with when migrating slide decks is that, although I can still use the same up-down-left-right mechanics as in Reveal.js, images with transparency tend to bleed together when moving downward.  I’ve dealt with the problem by going back to a “classic” left-right-only slide transition scheme.

Wrapping Up

I’ve become enough of a fan of GitPitch to decide that I’m going to use that as my default.  I think the decks are more visually compelling:  for example, there’s my original data science process slide deck and my GitPitch version.  As far as content goes, both decks tell the same story, but I think the GitPitch version retains interest a bit better.  As I give these talks in the new year, I’ll continue transitioning to GitPitch, and new talks will go there by default.


Enter The Tidyverse: Full-Day R Training In Charlotte

I am giving a full-day R training on Friday, October 19th in Charlotte, North Carolina.  Tickets are still available for this training event.  Here’s the abstract:

DESCRIPTION

SQL Saturday Charlotte and the Charlotte BI Group are pleased to offer this full day workshop with Kevin Feasel.

Enter The Tidyverse: R for the Data Professional

In this day-long training, you will learn about R, the premier language for data analysis. We will approach the language from the standpoint of data professionals: database developers, database administrators, and data scientists. We will see how data professionals can translate existing skills with SQL to get started with R. We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development. We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code. In addition, we will use ggplot2 to create production-quality data visualizations.
Over the course of the day, we will look at several problem domains. For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases. We will also look at areas of general interest, including analysis of open source data sets.
No experience with R is necessary. The only requirements are a laptop and an interest in leveling up your data professional skillset.

With Azure Notebooks, teaching and learning R becomes a lot easier—rather than fighting computers trying to install all of the pre-requisite software, we can jump straight to the notebooks.

Upcoming Speaking Engagements

It’s been a busy few months, so I’m going to keep radio silent for a little while longer (though Curated SQL is going strong).  In the meantime, here’s where I’m going to be over the next month:

  1. On Saturday, May 19th, I’ll be in New York City for SQL Saturday NYC, where I’m going to do two presentations:  Using Kafka for Real-Time Data Ingestion with .NET and Much Ado About Hadoop.
  2. On Tuesday, May 22nd, I’ll be in Charlotte presenting for the Enterprise Developers Guild, giving a talk entitled A .NET Developer’s View of the OWASP Top 10.
  3. On Thursday, May 31st, I’m going to give a paid pre-con entitled Enter the Tidyverse for SQL Saturday Mexico.  Click the link for instructions on signing up.
  4. Then, on Saturday, June 2nd, I’ll be in Mexico City for SQL Saturday Mexico, where I will present two talks:  Data Cleansing with SQL and R and Working Effectively with Legacy SQL.
  5. On Thursday, June 7th, I will present my R for the SQL Server Developer talk to the Roanoke Valley .NET User Group.
  6. Then, on June 9th, I’m going to give 3 sessions at SQL Saturday South Florida.  Those talks are Eyes on the Prize, Much Ado About Hadoop, and APPLY Yourself.
  7. I’ll be in Houston, Texas on June 23rd for SQL Saturday Houston.  There’s no official confirmation on talk(s) just yet, but I can confirm that I’ll be there and will do at least one session.

I have a few more irons in the fire as well, but this wraps up my May and June.

Enter The Tidyverse, Columbus Edition

In conjunction with SQL Saturday Columbus, I am giving a full-day training session entitled Enter the Tidyverse:  R for the Data Professional on Friday, July 27th.  This is a training that I did earlier in the year in Madison, Wisconsin, and aside from having no voice at the end, I think it went really well.  I’ve tweaked a couple of things to make this training even better; it’s well worth the low, low price of $100 for a full day of training on the R programming language.

I use the term “data professional” on purpose:  part of what I do with this session is show attendees how, even if they are database administrators, it can pay to know a bit about the R programming language.  Database developers, application developers, and budding data scientists will also pick up a good bit of useful information during this training, so it’s fun for the whole data platform.

Throughout the day, we will use a number of data sources which should be familiar to database administrators:  wait stats, database backup times, Reporting Services execution log metrics, CPU utilization statistics, and plenty more.  These are the types of things which database administrators need to deal with on a daily basis, and I’ll show you how you can use R to make your life easier.

If you sign up for the training in Columbus, the cost is only $100 and you’ll walk away with a better knowledge of how you can level up your database skills with the help of a language specially designed for analysis.  Below is the full abstract for my training session.  If this sounds interesting to you, sign up today!  I’m not saying you should go out and buy a couple dozen tickets today, but you should probably buy one dozen today and maybe a dozen more tomorrow; pace yourself, that’s all I’m saying.

Course Description

In this day-long training, you will learn about R, the premier language for data analysis.  We will approach the language from the standpoint of data professionals:  database developers, database administrators, and data scientists.  We will see how data professionals can translate existing skills with SQL to get started with R.  We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development.  We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code.  In addition, we will use ggplot2 to create production-quality data visualizations.

Over the course of the day, we will look at several problem domains.  For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases.  We will also look at areas of general interest, including analysis of open source data sets.

No experience with R is necessary.  The only requirements are a laptop and an interest in leveling up your data professional skillset.

Intended Audience

  • Database developers looking to tame unruly data
  • Database administrators with an interest in visualizing SQL Server metrics
  • Data analysts and budding data scientists looking for an overview of the R landscape
  • Business intelligence professionals needing a powerful language to cleanse and analyze data efficiently

Contents

Module 0 — Prep Work

  • Review data sources we will cover during the training
  • Ensure laptops are ready to go

Module 1 — Basics of R

  • What is R?
  • Basic mechanics of R
  • Embracing functional programming in R
  • Connecting to SQL Server with R
  • Identifying missing values, outliers, and obvious errors

Module 2 — Intro To The Tidyverse

  • What is the Tidyverse?
  • Tidyverse principles
  • Tidyverse basics:  dplyr, tidyr, readr, tibble

Module 3 — Dive Into The Tidyverse

  • Data loading:  rvest, httr, readxl, jsonlite, xml2
  • Data wrangling:  stringr, lubridate, forcats, broom
  • Functional programming:  purrr

Module 4 — Plotting

  • Data visualization principles
  • Chartjunk
  • Types of plots:  good, bad, and ugly
  • Plotting data with ggplot2
    • Exploratory plotting
    • Building professional quality plots

Module 5 — Putting it Together:  Analyzing and Predicting Backup Performance

  • A capstone notebook which covers many of the topics we covered today, focusing on Database Administration use cases
  • Use cases include:
    • Gathering CPU statistics
    • Analyzing Disk Utilization
    • Analyzing Wait Stats
    • Investigating Expensive Reports
    • Analyzing Temp Table Creation Stats
    • Analyzing Backup Times

Course Objectives

Upon completion of this course, attendees will be able to:

  • Perform basic data analysis with the R programming language
  • Take advantage of R functions and libraries to clean up dirty data
  • Build a notebook using Jupyter Notebooks
  • Create data visualizations with ggplot2

Pre-Requisites

No experience with R is necessary, though it would be helpful.  Please bring a laptop to follow along with exercises and get the most out of this course.

What Comes After Go-Live?

This is part eight of a series on launching a data science project.

At this point in the data science process, we’ve launched a product into production.  Now it’s time to kick back and hibernate for two months, right?  Yeah, about that…

Just because you’ve got your project in production doesn’t mean you’re done.  First of all, it’s important to keep checking the efficacy of your models.  Shift happens:  a model might have been good at one point in time but become progressively worse as circumstances change.  Some models are fairly stable, where they can last for years without significant modification; others have unstable underlying trends, to the point that you might need to retrain such a model continuously.  You might also find out that your training and testing data was not truly indicative of real-world data; in particular, the real world tends to be a lot messier than what you trained against.

The best way to guard against unnoticed model shift is to take new production data and retrain the model.  This works best if you can keep track of your model’s predictions versus actual outcomes; that way, you can tell the actual efficacy of the model, figuring out how frequently and by how much your model was wrong.
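
As a minimal sketch of what that tracking might look like (my own illustration, not from the original post), suppose we log each prediction along with the actual outcome once it is known, in a hypothetical prediction_log data frame with prediction_date, predicted, and actual columns.  Watching the error by month makes shift visible:

```r
# Hedged sketch: monitoring model efficacy over time from a hypothetical prediction log.
library(dplyr)
library(lubridate)

monthly_error <- prediction_log %>%
  mutate(month = floor_date(prediction_date, "month")) %>%
  group_by(month) %>%
  summarize(
    mae = mean(abs(predicted - actual)),  # average error for predictions made that month
    n = n()
  ) %>%
  arrange(month)

monthly_error
# If mae creeps upward month over month, the model is shifting and it is time to retrain.
```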

Depending upon your choice of algorithm, you might be able to update the existing model with this new information in real time.  Models like neural networks and online passive-aggressive algorithms allow for continuous training, and when you’ve created a process which automatically feeds learned data back into your continuously-training model, you now have true machine learning. Other algorithms, however, require you to retrain from scratch.  That’s not a show-stopper by any means, particularly if your underlying trends are fairly stable.
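
For models which do support continuous training, the retraining step can be as simple as calling fit again on newly labeled data.  Here is a hedged sketch using Keras in R (the library used elsewhere in this series); model, new_x, and new_y are assumed to already exist, with the new data in the same shape as the original training set:

```r
# Hedged sketch: incrementally updating an already-trained Keras model with new production data.
# `model` is an existing, compiled keras model; new_x / new_y are newly observed inputs and outcomes.
library(keras)

model %>% fit(
  as.matrix(new_x), new_y,
  epochs = 5,        # a few passes over just the new observations
  batch_size = 16,
  verbose = 0
)

# Algorithms which cannot be updated in place would instead append the new observations
# to the historical training set and retrain from scratch.
```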

Regardless of model selection, efficacy, and whether you get to call what you’ve done machine learning, you will want to confer with your stakeholders and ensure that your model actually fits their needs; as I mentioned before, you can have the world’s best regression, but if the people with the sacks of cash want a recommendation engine, you’re not getting the goods.  But that doesn’t mean you should try to solve all the problems all at once; instead, you want to start with a Minimum Viable Product (MVP) and gauge interest.  You’ve developed a model which solves the single most pressing need, and from there, you can make incremental improvements.  This could include relaxing some of the assumptions you made during initial model development, making more accurate predictions, improving the speed of your service, adding new functionality, or even using this as an intermediate engine to derive some other result.

Using our data platform survey results, assuming the key business personnel were fine with the core idea, some of the specific things we could do to improve our product would be:

  • Make the model more accurate.  Our MAE was about $19-20K, and reducing that error makes our model more useful for others.  One way to do this would be to survey more people.  What we have is a nice starting point, but there are too many gaps to go much deeper than a national level.
  • Introduce intra-regional cost of living.  We all know that $100K in Manhattan, NY and $100K in Manhattan, KS are quite different.  We would want to take into account cost of living, assuming we have enough data points to do this.
  • Use this as part of a product helping employers find the market rate for a new data professional, where we’d ask questions about the job location, relative skill levels, etc. and gin up a reasonable market estimate.

There are plenty of other things we could do over time to add value to our model, but I think that’s a nice stopping point.

What’s Old Is New Again

Once we get to this phase, the iterative nature of this process becomes clear.

The Team Data Science Project Lifecycle (Source)

On the micro level, we bounce around within and between steps in the process.  On the macro level, we iterate through this process over and over again as we develop and refine our models.  There’s a definite end game (mostly when the sacks of cash empty), but how long that takes and how many times you cycle through the process will depend upon how accurate and how useful your models are.

In wrapping up this series, if you want to learn more, check out my Links and Further Information on the topic.

Deploying A Model: The Microservice Approach

This is part seven of a series on launching a data science project.

Up to this point, we’ve worked out a model which answers important business questions.  Now our job is to get that model someplace where people can make good use of it.  That’s what today’s post is all about:  deploying a functional model.

Back in the day (by which I mean, say, a decade ago), one team would build a solution using an analytics language like R, SAS, Matlab, or whatever, but you’d almost never take that solution directly to production.  These were analytical Domain-Specific Languages with a set of assumptions that could work well for a single practitioner but wouldn’t scale to a broad solution.  For example, R had historically made use of a single CPU core and was full of memory leaks.  Those didn’t bother analysts too much because desktops tended to be single-core and you could always reboot the machine or restart R.  But that doesn’t work so well for a server—you need something more robust.

So instead of using the analytics DSL directly in production, you’d use it indirectly.  You’d use R (or SAS or whatever) to figure out the right algorithm and determine weights and construction and toss those values over the wall to an implementation team, which would rewrite your model in some other language like C.  The implementation team didn’t need to understand all of the intricacies of the problem, but did need to have enough practical statistics knowledge to understand what the researchers meant and translate their code to fast, efficient C (or C++ or Java or whatever).  In this post, we’ll look at a few changes that have led to a shift in deployment strategy, and then cover what this shift means for practitioners.

Production-Quality Languages

The first shift is the improvement in languages.  There are good libraries for Java, C#, and other “production” languages, so that’s a positive.  But that’s not one of the two positives I want to focus on today.  The first positive is the general improvement in analytical DSLs like R.  Over the past several years, we’ve gone from R being not so great for running a business to being production-quality (although not without its foibles).  Revolution Analytics (now owned by Microsoft) played a nice-sized role in that, focusing on building a stable, production-ready environment with multi-core support.  The same goes for RStudio, another organization which has focused on making R more useful in the enterprise.

The other big positive is the introduction of Python as a key language for data science.  With libraries like NumPy, scikit-learn, and Pandas, you can build quality models.  And with Cython, a data scientist can compile those models down to C to make them much faster.  I think the general acceptance of Python in this space has helped spur developers of other languages (whether open-source like R or closed-source commercial languages like SAS) to get better.

The Era Of The Microservice

The other big shift is a move away from single, large services which try to solve all of the problems.  Instead, we’ve entered the era of the microservice:  a small service dedicated to providing a single answer to a single problem.  A microservice architecture lets us build smaller applications geared toward solving the domain problem rather than the integration problem.  Although you can definitely configure other forms of interoperation, microservices are typically exposed via web calls, and that’s the scenario I’ll discuss today.  The biggest benefit to setting up a microservice this way is that I can write my service in R, you can call it from your Python service, and then some .NET service could call yours, and nobody cares about the particular languages used because they all speak over a common, known protocol.
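
To make that concrete, here is a minimal sketch of exposing an R model over HTTP.  The post doesn’t prescribe a specific package; I’m using the plumber package purely for illustration, and salary_model.rds is a hypothetical saved model file:

```r
# plumber_api.R -- illustrative sketch only; the package choice and file names are assumptions.
library(plumber)
library(jsonlite)

model <- readRDS("salary_model.rds")  # a previously trained model saved to disk

#* Return a salary prediction for a JSON payload of inputs
#* @post /predict
function(req) {
  inputs <- jsonlite::fromJSON(req$postBody)
  data.frame(prediction = predict(model, newdata = as.data.frame(inputs)))
}

# To run the service from an R session on the server:
# plumber::plumb("plumber_api.R")$run(port = 8000)
```

Any C#, Python, or .NET caller can then POST JSON to /predict without knowing or caring that the model is implemented in R.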

One concern here is that you don’t want to waste your analysts’ time learning how to build web services, and that’s where data science workbenches and deployment tools like DeployR come into play.  These make it easier to deploy scalable predictive services, allowing practitioners to build their R scripts, push them to a service, and let that service host the models and turn function calls into API calls automatically.

But if you already have application development skills on your team, you can make use of other patterns.  Let me give two examples of patterns that my team has used to solve specific problems.

Machine Learning Services

The first pattern involves using SQL Server Machine Learning Services as the core engine.  We built a C# Web API which calls ML Services, passing in details on what we want to do (e.g., generate predictions for a specific set of inputs given an already-existing model).  A SQL Server stored procedure accepts the inputs and calls ML Services, which farms out the request to a service which understands how to execute R code.  The service returns results, which we interpret as a SQL Server result set, and we can pass that result set back up to C#, creating a return object for our users.

In this case, SQL Server is doing a lot of the heavy lifting, and that works well for a team with significant SQL Server experience.  This also works well if the input data lives on the same SQL Server instance, reducing data transit time.

APIs Everywhere

The second pattern that I’ll cover is a bit more complex.  We start once again with a C# Web API service.  On the opposite end, we’re using Keras in Python to make predictions against trained neural network models.  To link the two together, we have a couple more layers:  first, a Flask API (and Gunicorn as the production implementation).  Then, we stand nginx in front of it to handle load balancing.  The C# API makes requests to nginx, which feeds the request to Gunicorn, which runs the Keras code, returning results back up the chain.

So why have the C# service if we’ve already got nginx running?  That way I can cache prediction results (under the assumption that those results aren’t likely to change much given the same inputs) and integrate easily with the C#-heavy codebase in our environment.

Notebooks

If you don’t need to run something as part of an automated system, another deployment option is to use notebooks like Jupyter, Zeppelin, or knitr.  These notebooks tend to work with a variety of languages and offer you the ability to integrate formatted text (often through Markdown), code, and images in the same document.  This makes them great for pedagogical purposes and for reviewing your work six months later, when you’ve forgotten all about it.

Using a Jupyter notebook to review Benford’s Law.

Interactive Visualization Products

Another good way of getting your data into users’ hands is Shiny, an R package for building interactive web applications which can also pull in Javascript libraries like D3 to visualize your data.  Again, this is not the type of technology you’d use to integrate with other services, but if you have information that you want to share directly with end users, it’s a great choice.
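
As a quick illustration (my own sketch, not from the post), a Shiny app for exploring salary data might look like this, assuming a hypothetical salary_data data frame with Country and SalaryUSD columns:

```r
# Minimal Shiny sketch: an interactive salary histogram filtered by country.
library(shiny)
library(ggplot2)

ui <- fluidPage(
  selectInput("country", "Country", choices = sort(unique(salary_data$Country))),
  plotOutput("salary_plot")
)

server <- function(input, output) {
  output$salary_plot <- renderPlot({
    ggplot(subset(salary_data, Country == input$country), aes(x = SalaryUSD)) +
      geom_histogram(bins = 30) +
      labs(title = paste("Salary distribution:", input$country), x = "Salary (USD)", y = "Respondents")
  })
}

shinyApp(ui, server)
```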

Conclusion

Over the course of this post, I’ve looked at a few different ways of getting model results and data into the hands of end users, whether via other services (like using the microservice deployment model) or directly (using notebooks or interactive applications).  For most scenarios, I think that we’re beyond the days of needing to have an implementation team rewrite models for production, and whether you’re using R or Python, there are good direct-to-production routes available.

How Much Can We Earn? Implementing A Model

This is part six of a series on launching a data science project.

Last time around, we walked through the idea of what building a model entails.  We built a clean(er) data set and did some analysis earlier, and in this post, I’m going to build on that.

Modeling

Because our question is a “how much?” question, we want to use regression to solve the problem. The most common form of regression that you’ll see in demonstrations is linear regression, because it is easy to teach and easy to understand. In today’s demo, however, we’re going to build a neural network with Keras. Although our demo is in R, Keras actually uses Python on the back end to run TensorFlow. There are other libraries out there which can run neural networks strictly within R (for example, Microsoft Machine Learning’s R implementation has the RxNeuralNet() function), but we will use Keras in this demo because it is a popular library.

Now that we have an algorithm and implementation in mind, let’s split the data out into training and test subsets. I want to use Country as the partition variable because I want to ensure that we retain some data from each country in the test set. To make this split, I am using the createDataPartition() function in caret. I’ll then split out the data into training and test data.

trainIndex <- caret::createDataPartition(survey_2018$Country, p = 0.7, list = FALSE, times = 1)
train_data <- survey_2018[trainIndex,]
test_data <- survey_2018[-trainIndex,]

We will have 1976 training rows and 841 testing rows.

Once I have this data split, I want to perform some operations on the training data. Specifically, I want to think about the following:

  • One-Hot Encode the categorical data
  • Mean-center the data, so that the mean of each numeric value is 0
  • Scale the data, so that the standard deviation of each value is 1

The bottom two are called normalizing the data. This is a valuable technique when dealing with many algorithms, including neural networks, as it helps with optimizing gradient descent problems.
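
As a quick base-R illustration of what those last two bullet points mean (the recipe below handles this via step_center and step_scale):

```r
# Mean-centering and scaling a numeric vector by hand.
x <- c(10, 20, 30, 40, 50)
centered <- x - mean(x)      # mean of `centered` is now 0
scaled <- centered / sd(x)   # standard deviation of `scaled` is now 1
mean(scaled)                 # ~0
sd(scaled)                   # 1
```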

In order to perform all of these operations, I will create a recipe, using the recipes package.

NOTE: It turns out that normalizing the features results in a slightly worse outcome in this case, so I’m actually going to avoid that. You can uncomment the two sections and run it yourself if you want to try. In some problems, normalization is the right answer; in others, it’s better without normalization.

rec_obj <- recipes::recipe(SalaryUSD ~ ., data = train_data) %>%       # Build out a set of rules we want to follow (a recipe)
  step_dummy(all_nominal(), -all_outcomes()) %>%              # One-hot encode categorical data
  #step_center(all_predictors(), -all_outcomes()) %>%          # Mean-center the data
  #step_scale(all_predictors(), -all_outcomes()) %>%           # Scale the data
  prep(data = train_data)

rec_obj
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor         17

Training data contained 1976 data points and no missing data.

Operations:

Dummy variables from Country, EmploymentStatus, JobTitle, ... [trained]

Now we can bake our data based on the recipe above. Note that I performed all of these operations only on the training data. If we normalize the training + test data, our optimization function can get a sneak peek at the distribution of the test data based on what is in the training set, and that will bias our result.

After building up the x_ series of data sets, I’ll build vectors which contain the salaries for the training and test data. I need to make sure to remove the SalaryUSD variable; we don’t want to make that available to the trainer as an independent variable!

x_train_data <- recipes::bake(rec_obj, newdata = train_data)
x_test_data <- recipes::bake(rec_obj, newdata = test_data)
y_train_vec <- pull(x_train_data, SalaryUSD)
y_test_vec  <- pull(x_test_data, SalaryUSD)
# Remove the SalaryUSD variable.
x_train_data <- x_train_data[,-1]
x_test_data <- x_test_data[,-1]

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add layers to the model. The first is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism; this is a decent first guess for an activation mechanism. Each hidden dense layer is followed by a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a single-unit dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try Adagrad, Adam, or AdaMax as well. Our loss function is Mean Squared Error, which is a good choice for measuring the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.

build_model <- function() {
  model <- keras_model_sequential() %>%
    layer_dense(units = 256, input_shape = c(ncol(x_train_data)), activation = "relu") %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 512, activation = "relu") %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 1, activation = "linear") # No activation --> linear layer

  # RMSProp is a nice default optimizer for a neural network.
  # Mean Squared Error is a classic loss function for dealing with regression-style problems, whether with a neural network or otherwise.
  # Mean Absolute Error gives us a metric which directly translates to the number of dollars we are off with our predictions.
  model %>% compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mae")
  )
}

Building out this model can take some time, so be patient.

model <- build_model()
model %>% fit(as.matrix(x_train_data), y_train_vec, epochs = 100, batch_size = 16, verbose = 0)
result <- model %>% evaluate(as.matrix(x_test_data), y_test_vec)
result
$loss
863814393.60761

$mean_absolute_error
19581.9413644471

What this tells us is that, after generating our model, we are an average of mean_absolute_error dollars off from reality. In my case, that was just under $20K. That’s not an awful amount off. In fact, it’s an alright start, though I wouldn’t trust this model as-is for my negotiations. With a few other enhancements, we might see that number drop a bit and start getting into trustworthy territory.
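
To make the metric concrete, the mean absolute error is just the average absolute dollar gap between prediction and reality, which we can verify by hand against the test set:

```r
# Sanity check: recompute MAE directly from the model's test-set predictions.
predictions <- model %>% predict(as.matrix(x_test_data))
mean(abs(predictions - y_test_vec))  # should land close to the reported ~19,582
```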

With a real data science project, I would dig further, seeing if there were better algorithms available, cross-validating the training set, etc. As-is, this result isn’t good enough for a production scenario, but we can pretend that it is.
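
As one example of that digging (a sketch of my own, not something from the original analysis), cross-validating a simple linear-regression baseline with caret would tell us whether the neural network is actually buying us anything over a much simpler model:

```r
# Hedged sketch: 5-fold cross-validation of a linear-regression baseline using caret.
library(caret)

cv_control <- trainControl(method = "cv", number = 5)
lm_fit <- train(
  SalaryUSD ~ .,
  data = train_data,
  method = "lm",
  trControl = cv_control,
  metric = "MAE"
)
lm_fit$results  # compare the cross-validated MAE here against the neural network's ~$19.6K
```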

Now let’s test a couple of scenarios. First up, my salaries over time, as well as a case where I moved to Canada last year.  There might be some exchange rate shenanigans but there were quite a few Canadian entrants in the survey so it should be a pretty fair comp.

test_cases <- test_data[1:4, ]

test_cases$SalaryUSD = c(1,2,3,4)
test_cases$Country = c("United States", "United States", "United States", "Canada")
test_cases$YearsWithThisDatabase = c(0, 5, 11, 11)
test_cases$EmploymentStatus = c("Full time employee", "Full time employee", "Full time employee", "Full time employee")
test_cases$JobTitle = c("Developer: App code (C#, JS, etc)", "DBA (General - splits time evenly between writing & tuning queries AND building & troubleshooting servers)", "Manager", "Manager")
test_cases$ManageStaff = c("No", "No", "Yes", "Yes")
test_cases$YearsWithThisTypeOfJob = c(0, 5, 0, 0)
test_cases$OtherPeopleOnYourTeam = c(5, 0, 2, 2)
test_cases$DatabaseServers = c(8, 12, 150, 150)
test_cases$Education = c("Bachelors (4 years)", "Masters", "Masters", "Masters")
test_cases$EducationIsComputerRelated = c("Yes", "Yes", "Yes", "Yes")
test_cases$Certifications = c("No, I never have", "Yes, and they're currently valid", "Yes, but they expired", "Yes, but they expired")
test_cases$HoursWorkedPerWeek = c(40, 40, 40, 40)
test_cases$TelecommuteDaysPerWeek = c("None, or less than 1 day per week", "None, or less than 1 day per week", "None, or less than 1 day per week", "None, or less than 1 day per week")
test_cases$EmploymentSector = c("State/province government", "State/province government", "Private business", "Private business")
test_cases$LookingForAnotherJob = c("No", "Yes", "No", "No")
test_cases$CareerPlansThisYear = c("Stay with the same employer, same role", "Stay with the same role, but change employers", "Stay with the same employer, same role", "Stay with the same employer, same role")
test_cases$Gender = c("Male", "Male", "Male", "Male")

# Why is this only letting me fit two objects at a time?
x_test_cases_1 <- recipes::bake(rec_obj, newdata = head(test_cases,2))
x_test_cases_2 <- recipes::bake(rec_obj, newdata = tail(test_cases,2))
x_test_cases <- rbind(x_test_cases_1, x_test_cases_2)
x_test_cases <- x_test_cases %>% select(-SalaryUSD)

model %>% predict(as.matrix(x_test_cases))
58330.57
75734.77
109289.84
78821.73

The first prediction was pretty close to right, but the next two were off.  Also compare them to my results from last year.  The Canadian rate is interesting considering the exchange rate for this time was about 75-78 US cents per Canadian dollar, and the Canadian rate is about 72%.
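
For reference, the 72% figure comes from dividing the Canadian prediction by the otherwise-identical US Manager prediction above:

```r
78821.73 / 109289.84  # roughly 0.72
```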

Note that I had a bit of difficulty running the bake function against these data sets.  When I tried to build up more than two rows, I would get a strange off-by-one error in R.  For example, here’s what it looks like when I try to use head(test_cases, 3) instead of 2:

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
Traceback:

1. recipes::bake(rec_obj, newdata = head(test_cases, 3))
2. bake.recipe(rec_obj, newdata = head(test_cases, 3))
3. bake(object$steps[[i]], newdata = newdata)
4. bake.step_dummy(object$steps[[i]], newdata = newdata)
5. cbind(newdata, as_tibble(indicators))
6. cbind(deparse.level, ...)
7. data.frame(..., check.names = FALSE)
8. stop(gettextf("arguments imply differing number of rows: %s", 
 .     paste(unique(nrows), collapse = ", ")), domain = NA)

I haven’t figured out the answer to that yet, but we’ll hand-wave that problem away for now and keep going with our analysis.

Next, what happens if we change me from Male to Female in these examples?

test_cases$Gender = c("Female", "Female", "Female", "Female")
# Why is this only letting me fit two objects at a time?
x_test_cases_1 <- recipes::bake(rec_obj, newdata = head(test_cases,2))
x_test_cases_2 <- recipes::bake(rec_obj, newdata = tail(test_cases,2))
x_test_cases <- rbind(x_test_cases_1, x_test_cases_2)
x_test_cases <- x_test_cases %>% select(-SalaryUSD)

model %>% predict(as.matrix(x_test_cases))
52563.52
69958.53
103513.19
73491.90

For my third scenario (the Manager case), there is a $5,776.65 difference between the male and female salary predictions. There is no causal explanation here (nor will I venture one in this post), but we can see that men earn more than women based on data in this survey.
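
Lining the two prediction runs up side by side (values copied from the outputs above) shows the gap for each of the four test cases:

```r
# Per-scenario gap between the Male and Female prediction runs.
male_predictions   <- c(58330.57, 75734.77, 109289.84, 78821.73)
female_predictions <- c(52563.52, 69958.53, 103513.19, 73491.90)
male_predictions - female_predictions
# 5767.05 5776.24 5776.65 5329.83
```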

Conclusion

In today’s post, we used Keras to build up a decent first attempt at a model for predicting data professional salaries.  In reality, there’s a lot more to do before this is ready to roll out, but we’ll leave the subject here and move on to the next topic, so stay tuned.