This is part five of a series on launching a data science project.
At this point, we have done some analysis and cleanup on a data set. It might not be perfect, but it’s time for us to move on to the next step in the data science process: modeling.
Modeling has five major steps, and we’ll look at each step in turn. Remember that, like the rest of the process, I may talk about “steps” but these are iterative and you’ll bounce back and forth between them.
Feature engineering involves creating relevant features from raw data. A few examples of feature engineering include:
- Creating indicator flags, such as IsMinimumAge: Age >= 21, or IsManager: NumberOfEmployeesManaged > 0. These are designed to help you slice observations and simplify model logic, particularly if you’re building something like a decision tree.
- Calculations, such as ClickThroughRate = Clicks / Impressions. Note that this definition doesn’t imply multicollinearity, though, as ClickThroughRate isn’t linearly related to either Clicks or Impressions.
- Geocoding latitude and longitude from a street address.
- Aggregating data. That could be aggregation by day, by week, by hour, by 36-hour period, whatever.
- Text processing: turning words into arbitrary numbers for numeric analysis. Common techniques for this include TF-IDF and word2vec.
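To make the first two bullets concrete, here is a minimal sketch in Python. The rows and the `engineer_features` helper are hypothetical, made up for illustration; only the feature names come from the examples above. Treating zero impressions as a click-through rate of 0.0 is my assumption, not a rule.

```python
# Hypothetical raw observations; the field names mirror the examples above.
raw_rows = [
    {"Age": 34, "NumberOfEmployeesManaged": 3, "Clicks": 12, "Impressions": 400},
    {"Age": 19, "NumberOfEmployeesManaged": 0, "Clicks": 0, "Impressions": 0},
]

def engineer_features(row):
    """Return a copy of the row with derived features added."""
    features = dict(row)
    # Indicator flags: simple boolean slices of the observations.
    features["IsMinimumAge"] = row["Age"] >= 21
    features["IsManager"] = row["NumberOfEmployeesManaged"] > 0
    # Calculated feature; assume a rate of 0.0 when there are no impressions
    # so that we never divide by zero.
    impressions = row["Impressions"]
    features["ClickThroughRate"] = row["Clicks"] / impressions if impressions else 0.0
    return features

engineered = [engineer_features(row) for row in raw_rows]
```

Each engineered row keeps its raw columns and gains the three derived ones, so you can still decide later which of them to feed into the model.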
Once we’ve engineered interesting features, we want to use feature selection to winnow down the available set, removing redundant, unnecessary, or highly correlated features. There are a few reasons that we want to perform feature selection:
- If one explanatory variable can predict another, we have multicollinearity, which can make it harder to give credit to the appropriate variable.
- Feature selection makes it easier for a human to understand the model by removing irrelevant or redundant features.
- We can perform more efficient training with fewer variables.
- We reduce the risk of an irrelevant or redundant feature causing spurious correlation.
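One simple way to catch highly correlated features is a pairwise correlation filter. The sketch below is just one possible approach, not the only way to do feature selection. The column names, the data, and the 0.95 threshold are all assumptions for illustration, and it assumes no column is constant (a constant column would cause a division by zero).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Keep features in order, skipping any that correlate strongly
    (in absolute value) with a feature we have already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns.
columns = {
    "YearsOfExperience": [1, 3, 5, 8, 12],
    "MonthsOfExperience": [12, 36, 60, 96, 144],  # exactly 12x the column above
    "HoursWorkedPerWeek": [40, 45, 38, 50, 42],
}
selected = drop_correlated(columns)
```

Here `MonthsOfExperience` carries no information beyond `YearsOfExperience`, so the filter drops it and keeps the other two.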
Now that we have some data and a clue of what we’re going to feed into an algorithm, it’s time to step up our training regimen. First up, we’re going to take some percentage of our total data and designate it for training and validation, leaving the remainder for evaluation (aka, test). There are no hard rules on percentages, but a typical reserve rate is about 70-80% for training/validation and 20-30% for test. We ideally want to select the data randomly but also include the relevant spreads and distributions of observations by pertinent variables in our training set; fortunately, there are tools available which can help us do just this, and we’ll look at them in a bit.
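As a bare-bones sketch of the random split, here is a hypothetical `train_test_split` helper that uses only Python's standard library. The 25% test fraction and the fixed seed are assumptions for illustration. Note that it shuffles purely at random and does nothing to preserve the distribution of any particular variable, which is where purpose-built tools earn their keep.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split off a test set.

    Returns (train_and_validation_rows, test_rows).
    """
    shuffled = list(rows)
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

# Stand-in data: 100 row identifiers.
records = list(range(100))
train_validation, test = train_test_split(records)
```

Every record lands in exactly one of the two sets, and we leave the test set alone until evaluation time.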
First up, though, I want to cover the four major branches of algorithms.
The vast majority of problems are supervised learning problems. The idea behind a supervised learning problem is that we have some set of known answers (labels). We then train a model to map input data to those labels in order to have the model predict the correct answer for unlabeled records.
Going back to the first post in this series, I pointed out that you have to listen to the questions people ask. Here’s where that pays off: the type of algorithm we want to choose depends in part on the nature of those questions. Major supervised learning classes and their pertinent driving questions include:
- Regression — How many / how much?
- Classification — Which?
- Recommendation — What next?
For example, in our salary survey, we have about 3000 labeled records: 3000(ish) cases where we know the salary in USD based on what people have reported. My goal is to train a model which can then take some new person’s inputs and spit out a reasonable salary prediction. Because my question is “How much money should we expect a data professional will make?” we will solve this using regression techniques.
With unsupervised learning, we do not know the answers beforehand, so we’re trying to derive answers within the data. Typically, we’ll use unsupervised learning to gain more insight about the data set, which can hopefully give us some labels we can use to convert this into a relevant supervised learning problem. The top forms of unsupervised learning include:
- Clustering — How can we segment?
- Dimensionality reduction — What of this data is useful?
Typically your business users won’t know or care about dimensionality reduction (that is, techniques like Principal Component Analysis) but we as analysts can use dimensionality reduction to narrow down on useful features.
Wait, isn’t self-supervised learning just a subset of supervised learning? Sure, but it’s pretty useful to look at on its own. Here, we use heuristics to guesstimate labels and train the model based on those guesstimates. For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us. The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.
We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet. So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.
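Here is a tiny sketch of the Markov chain flavor of this idea, where the label for each word is simply the word that follows it. The one-word context and the `predict_next` helper are assumptions for illustration; a real generator would use longer chains and far more text.

```python
from collections import Counter, defaultdict

text = "to be or not to be that is the question"
words = text.split()

# The "label" for each word is the word which follows it in the text.
pairs = list(zip(words[:-1], words[1:]))

# Count how often each word follows each other word.
transitions = defaultdict(Counter)
for current_word, next_word in pairs:
    transitions[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of a word, or None if we never saw one."""
    followers = transitions.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]
```

Nobody labeled anything by hand here: the text itself supplied both the inputs and the answers.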
Reinforcement learning is where we train an agent to observe its environment and use those environmental clues to make a decision. For example, SethBling has a really cool video about MariFlow.
The idea, if you don’t want to watch the video, is that he trained a recurrent neural network based on hours and hours of his Mario Kart gameplay. The neural network has no clue what a Mario Kart is, but the screen elements in the video show how it represents the field of play and state of the game, and how it uses those inputs to determine which action to take next.
Choose An Algorithm
Once you understand the nature of the problem, you can choose the form of your algorithm. There are often several potential algorithms which can solve your problem, so you will want to try different algorithms and compare. There are a few major trade-offs between algorithms, so each one will have some better or worse combination of the following features:
- Accuracy and susceptibility to overfitting
- Training time
- Ability for a human to be able to understand the result
- Number of hyperparameters
- Number of features allowed. For example, a model like ARIMA doesn’t give you many features—it’s just the label behavior over time.
It is, of course, not comprehensive, but it does set you in the right direction. For example, we already know that we want to predict values, and so we’re going into the Regression box in the bottom-left. From there, we can see some of the trade-offs between different algorithms. If we use linear regression, we get fast training, but the downside is that if our dependent variable is not a linear function of the independent variables, then we won’t end up with a good result.
By contrast, a neural network regression tends to be fairly accurate, but can take a long time to finish or require expensive hardware to finish in any reasonable time.
Once you have an algorithm, features, and labels (if this is a supervised learning problem), you can train the model. Training a model involves solving a system of equations, minimizing a loss function. For example, consider a plot with a linear regression thrown in:
In this chart, I have a straight line which represents the best fitting line for our data points, where best fit is defined as the line which minimizes the sum of the squares of errors (i.e., the sum of the squares of the vertical distances between each dot and our line). Computers are great at this kind of math, so as long as we set up the problem the right way and tell the computer what we want it to do, it can give us back an answer.
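For a single input variable, the best fitting line has a well-known closed-form solution, which the sketch below implements. The data points and the function names are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable.

    Returns the (slope, intercept) which minimize the sum of squared errors.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """The loss: squared vertical distance from each point to the line, summed."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points which roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
loss = sum_squared_errors(xs, ys, slope, intercept)
```

If you nudge `slope` or `intercept` away from the fitted values and call `sum_squared_errors` again, the loss can only go up. That is what "best fit" means here.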
But we’ve got to make sure it’s a good answer. That’s where the next section helps.
Validate The Model
Instead of using up all of our data for training, we typically want to perform some level of validation within our training data set to ensure that we are on the right track and are not overfitting our model. Overfitting happens when a model latches onto the particulars of a data set, leaving it at risk of not being able to generalize to new data. The easiest way to tell if you are overfitting is to test your model against unseen data. If there is a big dropoff in model accuracy between the training and testing phases, you are likely overfitting.
Here’s one technique for validation: let’s say that we reserved 70% of our data for training. Of that 70%, we might want to slice off 10% of the total data for validation, leaving 60% for actual training. We feed that 60% to our algorithm, generating a model. Then we predict the outcomes for our validation data set and see how close we were to reality, and how far apart the accuracy rates are for our validation set versus our training set.
Another technique is called cross-validation. Cross-validation is a technique where we slice and dice the training data, training our model with different subsets of the total data. The purpose here is to find a model which is fairly robust to the particulars of a subset of training data, thereby reducing the risk of overfitting. Let’s say that we cross-validate with 4 slices. In the first step, we train with the first 3/4 of the data, and then validate with the final 1/4. In the second step, we train with slices 1, 2, and 4 and validate against slice 3. In the third step, we train with 1, 3, and 4 and validate against slice 2. Finally, we train with 2, 3, and 4 and validate against slice 1. We’re looking to build up a model which is good at dealing with each of these scenarios, not just a model which is great at one of the four but terrible at the other three.
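The slicing logic can be sketched as a small generator. The `k_fold_splits` name is hypothetical, and the sketch assumes the rows have already been shuffled. It walks the slices from first to last rather than last to first, but the order doesn't matter: each slice gets exactly one turn as the validation set.

```python
def k_fold_splits(n_rows, k=4):
    """Yield (train_indices, validation_indices) once for each of k slices."""
    indices = list(range(n_rows))
    # Spread any remainder across the first few slices so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# Eight rows cut into four slices of two rows each.
splits = list(k_fold_splits(8, k=4))
```

You would train and score a model once per pair, then compare the four validation scores. A model which does well on all of them is less likely to be overfitting to one particular subset.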
Oftentimes, we won’t get everything perfect on the first try. That’s when we move on to the next step.
Tune The Model
Most models have hyperparameters. For example, a neural network has a few hyperparameters, including the number of training epochs, the number of layers, the density of each layer, and dropout rates. For another example, random forests have hyperparameters like the maximum size of each decision tree and the total number of decision trees in the forest.
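To show what tuning looks like mechanically, here is a minimal sketch which uses a toy k-nearest-neighbors regression as a stand-in model, because it has exactly one hyperparameter: k, the number of neighbors. The data, the candidate values, and the helper names are all assumptions for illustration.

```python
def knn_predict(train, x, k):
    """Average the labels of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def mean_squared_error(train, holdout, k):
    """Score one hyperparameter value against held-out (x, y) pairs."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in holdout) / len(holdout)

# Hypothetical (x, y) pairs; the validation rows are never used for fitting.
train = [(1, 1.2), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 6.0), (7, 6.9), (8, 8.2)]
validation = [(2.5, 2.4), (4.5, 4.6), (6.5, 6.4)]

# Try each candidate value and keep the one with the lowest validation error.
candidate_ks = [1, 2, 4, 8]
scores = {k: mean_squared_error(train, validation, k) for k in candidate_ks}
best_k = min(scores, key=scores.get)
```

This is a grid search in miniature. With more hyperparameters, you would loop over combinations of values rather than over a single list.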
We tune our model’s hyperparameters using the validation data set. With cross-validation, we’re hoping that our tuning will not accidentally lead us down the road to spurious correlation, but we have something a bit better than hope: we have secret data.
Evaluate The Model
Model evaluation happens when we send new, never before seen data to the model. Remember that 20-30% that we reserved early on? This is where we use it.
Now, we want to be careful and make sure not to let any information leak into the training data. That means that we want to split this data out before normalizing or aggregating the training data set, and then we want to apply those same rules to the test data set. Otherwise, if we normalize the full data set and then split into training and test, a smart model can surreptitiously learn things about the test data set’s distribution and could train toward that, leading to overfitting our model to the test data and leaving it less suited for the real world.
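Here is a minimal sketch of the leak-free ordering for normalization: compute the statistics on the training data alone, then apply those same numbers to the test data. The salary figures and the helper names are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the normalization parameters from the training data only."""
    return mean(values), pstdev(values)

def apply_scaler(values, center, spread):
    """Apply previously learned parameters to any data set."""
    return [(value - center) / spread for value in values]

# Hypothetical salaries which were split BEFORE any normalization happened.
train_salaries = [50000, 60000, 70000, 80000]
test_salaries = [65000, 120000]

center, spread = fit_scaler(train_salaries)
train_scaled = apply_scaler(train_salaries, center, spread)
# The test set reuses the training parameters; we never call fit_scaler on it.
test_scaled = apply_scaler(test_salaries, center, spread)
```

Notice that the scaled test values need not have a mean of zero. That is fine, and it is the point: the model never got to peek at the test set's distribution.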
Another option, particularly useful for unlabeled or self-learning examples, is to build a fitness function to evaluate the model. Genetic algorithms (for a refresher, check out my series) are a common tool for this. For example, MarI/O uses a genetic algorithm to train a neural network how to play Super Mario World.
Just like with data processing, I’m going to split this into two parts. Today, we’ve looked at some of the theory behind modeling. Next time around, we’re going to implement a regression model to try to predict salaries.