This is part three of a series on launching a data science project.

The Three Steps Of Data Processing

Data processing is made up of three activities:  data gathering, data cleansing, and data analysis.  Most estimates are that data scientists spend about 80% of their time in data processing (particularly in data cleansing).  If anything, I consider this an underestimate:  based on my experience, I would put that number closer to 90%.

As with everything else, these separate concepts tend to jumble together and you’ll bounce back and forth between them as you learn about new resources, determine the viability of sources, and try to link data sources together.  But for the purposes of elucidation, we’ll treat them independently.  So let’s start with data gathering.

Data Gathering

At first, you might not know exactly what data you need, but as you flesh out your models and gain a better understanding of the problem, you will go back to the well several times and probably end up getting data from a variety of sources.  These sources can include, but are not limited to, the following:

  • Internal proprietary data
  • Open data sources (often governmental or academic, but can also be private sources like our data professional survey)
  • Paid APIs or data sources from third parties
  • Data from surveys that you commission

Data Gathering In Our Example

For our example, we will stick to using just the data professional survey.  But if we wanted to expand our analysis a bit, there are some places we could take it.  For example, we could use the Penn World Table and calculate Purchasing Power Parity GDP per capita to normalize salaries across countries.  This could help us determine whether there are countries in which database professionals are relatively highly paid compared to the norm, and is pretty similar to work I did last year.

We could get a geocoding data set to visualize results on a map, or get a data set which has cost of living by ZIP Code.  This would help us normalize salaries within the United States, as long as the ZIP Code data in our set is good enough (spoilers:  it’s not).  We could also use census information to build out more data by ZIP Code or greater metropolitan area.  Finally, we could use data from other surveys to add more information to the sample.  Our sample is pretty small, at just 6000 entries across two years, so supplementing this data could help us a lot.

Data Cleansing

Once we have relevant (or at least potentially relevant) data sets, we probably will want to join them together to gain insight from the mashup of these data sets.  It is rare, however, to have all of your data sets match up exactly on the first try.  Instead, we’re going to have to work with those data sets to get the puzzle pieces to fit together.

Soft Corinthian Data With A Beautiful Grain

For example, we probably need to change the grain of some of our data sets.  Suppose we have a data set which is at the daily grain and another which is at the monthly grain.  We can connect these two together, but we have to make a decision.  Do we take the daily grain and aggregate it up to the monthly level?  Do we take the monthly level and create some kind of daily allocation rule?  Do we figure out the month to which each date belongs and include month-level aggregations for each of those daily values?  Each of these is a viable strategy, but each has different implications for what you can do with the data.
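
As a quick illustration of the first option, here is a minimal sketch in R which rolls a hypothetical daily_sales data frame up to the monthly grain (the table and column names are made up for the example):

```r
library(dplyr)
library(lubridate)

# Hypothetical daily-grain data:  one row per day with a sales amount.
daily_sales <- tibble(
  sale_date = as.Date("2018-01-01") + 0:59,
  amount    = runif(60, min = 100, max = 500)
)

# Aggregate the daily grain up to the monthly grain.
monthly_sales <- daily_sales %>%
  mutate(sale_month = floor_date(sale_date, unit = "month")) %>%
  group_by(sale_month) %>%
  summarize(total_amount = sum(amount))
```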

Next up, you might have to define join criteria.  There won’t always be an obvious natural join key like a date.  You might have a data set from one source which contains product titles and manufacturer codes, and another data set which contains UPCs and gussied-up titles.  In this case, you might be able to map manufacturer codes to UPCs using a third data set, or maybe you can write some regular expressions to get the product titles to match up, but there’s work involved.
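
Here is a minimal sketch of the mapping-table approach, with hypothetical products_a, products_b, and code_to_upc data frames standing in for the two sources and the bridge between them:

```r
library(dplyr)

# Hypothetical sources:  one keyed by manufacturer code, one keyed by UPC,
# plus a third data set which maps between the two keys.
products_a  <- tibble(mfc = c("M1", "M2"), title = c("Widget", "Gadget"))
products_b  <- tibble(upc = c("0001", "0002"), fancy_title = c("The Widget (TM)", "Gadget Deluxe"))
code_to_upc <- tibble(mfc = c("M1", "M2"), upc = c("0001", "0002"))

# Bridge the two sources through the mapping table.
combined <- products_a %>%
  inner_join(code_to_upc, by = "mfc") %>%
  inner_join(products_b, by = "upc")
```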

We may also need to reshape the data.  As mentioned above with grains, we might need to aggregate or disaggregate data to fit a join.  We may also need to perform operations like pivoting or unpivoting data (or spreading and gathering, in tidyverse parlance).
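
Here’s a minimal sketch of both operations using tidyr, with a hypothetical salaries_wide data frame that has one salary column per survey year:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data:  one row per respondent, one salary column per year.
salaries_wide <- tibble(
  respondent = c(1, 2),
  y2017      = c(70000, 85000),
  y2018      = c(74000, 90000)
)

# Gather (unpivot) the year columns into key/value pairs...
salaries_long <- salaries_wide %>%
  gather(key = "year", value = "salary", y2017, y2018)

# ...and spread (pivot) them back out into one column per year.
salaries_wide_again <- salaries_long %>%
  spread(key = year, value = salary)
```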

Dealing With Mislabels, Mismatches, And Just Plain Wrong Data

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again distort reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we miss connections and potentially drop useful data out of our sample.
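
For the missed-connection side of the problem, an anti-join is a handy diagnostic.  Here’s a minimal sketch with hypothetical products_a and code_to_upc data frames, where one manufacturer code has no mapping:

```r
library(dplyr)

# Hypothetical data:  three products, but only two mapping entries.
products_a  <- tibble(mfc = c("M1", "M2", "M3"), title = c("Widget", "Gadget", "Gizmo"))
code_to_upc <- tibble(mfc = c("M1", "M2"), upc = c("0001", "0002"))

# anti_join returns the rows from products_a with no match in code_to_upc --
# in this case, the M3 row -- so we can review them before dropping them.
unmatched <- products_a %>%
  anti_join(code_to_upc, by = "mfc")
```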

Finally, there is what I’m calling incorrect data:  cases where something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create 1.25 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.

Missing Data

People don’t always fill out the entirety of every form, and when you make fields optional, people are liable not to fill them in.  The flipside to this is that if you make optional fields required, people might input junk data just to get the form to submit, so simply marking fields as required isn’t the answer.

When we are missing data, we have a few options available to us.  First of all, if we’re looking at a small percentage of the total, we might just get rid of those records with important, missing data.  Once we start removing more than 2-3% of the data set for this reason, though, I’m liable to get a bit edgy about removing more.  If we’re missing information from non-vital columns like middle initial or address line 3, I’d just remove those features altogether and save the hassle.

Otherwise, if we can’t remove the feature and don’t want to remove the row, we can use statistical techniques to fill in the blanks.  We might substitute missing values with a dummy value, such as “Unknown” or a reasonable default like 0 (assuming 0 is a reasonable default in this case).  We might substitute with an “average” value like mean, median, or mode.  The default here is probably median, but there are times when you want to use mean or mode.  We could also do a quick regression and replace our missing value with the regressed value, assuming that there’s some correlation (positive or negative) between the filled-in features and our missing feature.
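
Here’s a minimal sketch of the first two options (dropping rows with missing vital values versus imputing the median), using a hypothetical slice of survey data:

```r
library(dplyr)

# Hypothetical survey slice with some missing hours-per-week values.
survey <- tibble(
  salary_usd     = c(70000, 85000, 95000, 60000),
  hours_per_week = c(40, NA, 45, NA)
)

# Option 1:  drop the rows which are missing a vital value.
survey_complete <- survey %>%
  filter(!is.na(hours_per_week))

# Option 2:  impute the median for the missing values.
survey_imputed <- survey %>%
  mutate(hours_per_week = if_else(is.na(hours_per_week),
                                  median(hours_per_week, na.rm = TRUE),
                                  hours_per_week))
```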

These statistical techniques can help fill in gaps, but I wouldn’t use them for more than a few percent of the data; if you’re seeing 15+ percent of the feature values missing, this is probably a compromised feature and you don’t want to use it in analysis.  Also, by applying these statistical techniques, we are knowingly introducing noise, making our models that much less accurate.  That is, we are saying that the value is X, but it’s probably somewhere in the range [X - y, X + y], and we are giving undue precision based on our own made-up calculations in the hopes that y is small enough that it won’t matter too much.  If that hope is borne out, the benefit of salvaging incomplete records can outweigh the cost of adding more noise to our analysis.

Inconsistent Data

As I mentioned earlier in the series, one of my data adages is that independent systems tend to end up with inconsistent data.  Sometimes people forget to fill out both systems the same way.  Sometimes there are typos in one data source but not the other (or the typos are different in each).  Sometimes there are subtle differences in data sets which lead to differing results.  Sometimes one data source is newer than the other, or at least one particular record is newer than what’s in the other system.

Regardless of the reason, we need to be able to handle these differences.  One way to handle differences is to make one data source canonical and trust it over the other whenever a discrepancy arises.  If there’s an authoritative source, it makes sense to behave this way, and your business team might be able to explain which sources (if any) are authoritative.

More commonly, you will have to institute one or more rules to handle data discrepancies.  For example, if both records have a timestamp, pick the later timestamp…although wall clocks lie.  Another example of a rule might be, if the discrepancy is over a numeric value, go with the lower number as it is more likely to be correct.  These rules will be specific to your data sources and systems, and will often require quite a bit of sussing out.
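
Here’s a minimal sketch of a timestamp-based rule, with a hypothetical merged data frame containing a balance and an update timestamp from each of two systems:

```r
library(dplyr)

# Hypothetical records from systems A and B, already joined on customer_id.
merged <- tibble(
  customer_id = c(1, 2),
  balance_a   = c(100, 250),
  updated_a   = as.POSIXct(c("2018-05-01 10:00:00", "2018-05-02 09:00:00")),
  balance_b   = c(110, 240),
  updated_b   = as.POSIXct(c("2018-05-01 12:00:00", "2018-05-01 08:00:00"))
)

# Rule:  trust whichever system carries the later timestamp.
resolved <- merged %>%
  mutate(balance = if_else(updated_a >= updated_b, balance_a, balance_b))
```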

Misshapen Data

If you happen to have data stored in a textual format, you can end up with misshapen data.  For example, you might have XML or JSON data where some of the records are in an invalid format.  Or maybe you have newlines in the middle of a record.  Maybe you have too many delimiters in that CSV, or maybe not enough.  Maybe the file starts as a CSV and then moves to pipe-delimited in the middle because the source system appended data over time using different output mechanisms.

Regular expressions can help out here sometimes.  You might also be able to get satisfaction from the owner of the source data and have them fix the problem.  Sadly, it’s just as likely that your team becomes the owner of the data and gets the pleasure of figuring it out.  If you store data in relational databases and adhere to first normal form (or better), data shape is already enforced, so you can pat yourself on the back for this one and move on to the next problem.
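
As one small example of the figuring-out process, here’s a minimal sketch which counts delimiters per line of a hypothetical flat file (the file name and the comma delimiter are assumptions) and flags the rows that don’t match the most common count:

```r
library(stringr)

# Read the raw lines of a hypothetical delimited file.
lines <- readLines("input.csv")

# Count commas on each line and treat the most common count as the expected one.
comma_counts <- str_count(lines, ",")
expected     <- as.integer(names(which.max(table(comma_counts))))

# Lines with too many or too few delimiters are candidates for manual review.
bad_lines <- which(comma_counts != expected)
```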

Data Shaping

Speaking of data shapes, we often need to perform some kinds of data shaping activities to turn our raw data into something useful for modeling.  Examples of data shaping include:

  • Vectorizing words.  That is, turning words in a document into numbers for analysis.
  • Categorizing data.  That is, turning strings (or maybe numeric values) into factors.
  • Normalizing values.  That is, transforming numeric data to have a mean of 0 and a standard deviation of 1.  The benefit of normalization is that you can talk about two features which have radically different scales in the context of the same model.  Let’s say that we want to measure a person’s desire to purchase a car based on its gas mileage and its sticker price.  Gas mileage typically ranges from about 15 to 50 miles per gallon, but sticker price can range from $9,000 to $90,000 (or more).  Both the sheer magnitude and the wide range of sticker price will overwhelm gas mileage in many models, so the way we get around this is to normalize both features so that each has a mean of 0 and standard deviation of 1.  That way, one feature’s effect cannot swamp another’s.
  • Binning data.  That is, converting continuous values to discrete ones.  For example, we might have the age of each person in our data set, but we’re more interested in age cohorts, like 18-25 or 30-34.  (A quick sketch of normalization and binning follows this list.)
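
Here’s a minimal sketch of normalization and binning in R, using hypothetical salary and age vectors:

```r
# Hypothetical values standing in for survey features.
salaries <- c(52000, 70000, 85000, 95000, 110000, 125000)
ages     <- c(23, 31, 38, 45, 52, 60)

# Normalize salaries to a mean of 0 and a standard deviation of 1.
salaries_scaled <- as.numeric(scale(salaries))

# Bin ages into cohorts.
age_cohorts <- cut(ages,
                   breaks = c(18, 25, 35, 45, 55, 65),
                   labels = c("18-25", "26-35", "36-45", "46-55", "56-65"))
```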

Data Analysis

As part of building a data dictionary, I’ve already brought up a couple of simple data analysis concepts.  Here, I’ll go into a few more, focusing on techniques common in Exploratory Data Analysis (EDA).  The goal of EDA is to summarize and visualize the data in a way that furthers your understanding of it.

Cardinality Analysis

The first simple technique you can use is finding the cardinality of variables.  For example, in the salary survey, there were four options available for Employment Status.

Not included: use multiverse versions of self to do work, thereby freeing up time to lounge on a beach in Cabo.

Cardinality tells us a few things.  For a text field like employment status, it indicates that we probably got the data from a prescribed list—often a dropdown list or set of radio buttons—and we probably want to treat this as categorical data, as people choose one option to the exclusion of the others.

For free-form entry, like number of hours worked, looking at the cardinality lets us see how many different values people chose, and if you compare the number of distinct values against the range from minimum to maximum, you can get a fair idea of how densely packed the set is.
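
Here’s a minimal sketch of a cardinality check with dplyr, using a hypothetical slice of survey data:

```r
library(dplyr)

# Hypothetical survey slice:  one categorical field, one free-form numeric field.
survey <- tibble(
  employment_status = c("Full time", "Full time", "Part time", "Contractor"),
  hours_per_week    = c(40, 45, 20, 50)
)

# Distinct counts per feature hint at which fields are categorical and how
# densely packed the free-form fields are.
survey %>%
  summarize(status_values = n_distinct(employment_status),
            hours_values  = n_distinct(hours_per_week))
```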

Summarization And Visualization

Next up, we can run a five-number summary of a feature.

The 5-Number Summary: buy five numbers, get one free.

This summary, which is easy to generate in R, gives us five pertinent values:  the minimum, the 25th percentile, the 50th percentile (i.e., the median), the 75th percentile, and the maximum.  R also gives you the mean, which isn’t technically part of the five-number summary but is helpful, as the difference between median and mean clues you in to how skewed the distribution of your sample data is.
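
In base R, fivenum() and summary() cover this.  A minimal sketch with a hypothetical salary vector:

```r
# Hypothetical salary values standing in for SalaryUSD.
salaries <- c(45000, 62000, 70000, 85000, 95000, 120000, 250000)

fivenum(salaries)   # minimum, lower hinge, median, upper hinge, maximum
summary(salaries)   # quartiles plus the mean
```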

You can also visualize these five-number summaries with a box plot.

A box plot.  Buy five boxes, get one free?

This type of plot shows you the min, max, and the 25th, 50th, and 75th percentiles, where the 25th-50th-75th make up the box and the min and max are the furthest points in the plot.  The whiskers mark the “expected” values:  those within 1.5 times the interquartile range (the 75th percentile minus the 25th percentile) of the box, extended to the furthest data point in each direction.  Points beyond the whiskers are outliers—those which fall more than 1.5 times the interquartile range beyond the box.  Box plots are great at visualizing differences by category.
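
Here’s a minimal sketch of a category-level box plot with ggplot2, using a hypothetical survey data frame:

```r
library(ggplot2)

# Hypothetical survey data:  salaries for two employment statuses.
set.seed(1)
survey <- data.frame(
  EmploymentStatus = rep(c("Full time", "Independent"), each = 50),
  SalaryUSD        = c(rnorm(50, mean = 90000, sd = 15000),
                       rnorm(50, mean = 110000, sd = 30000))
)

# One box per employment status makes the category-level differences easy to compare.
ggplot(survey, aes(x = EmploymentStatus, y = SalaryUSD)) +
  geom_boxplot()
```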

When looking at the distribution of data within a category, you can easily build a histogram.

A histogram using base-10 logged values because if you can’t do exponential math in your head, get out of my classroom.

This distribution visualization lets us see the skew—in this case, our plot skews left, meaning that the tail stretches out further on the left-hand side of the median than on the right-hand side.  This is fairly common when looking at income data on a logarithmic scale, so I wasn’t surprised to see this shape when I created the above graph using the 2017 survey data.
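
Here’s a minimal sketch of such a histogram in ggplot2, plotting hypothetical salary data on a base-10 log scale:

```r
library(ggplot2)

# Hypothetical salary data with a long right tail in dollar terms.
set.seed(1)
survey <- data.frame(SalaryUSD = rlnorm(500, meanlog = 11.2, sdlog = 0.5))

# Plotting on a base-10 log scale makes the shape of the distribution easier to read.
ggplot(survey, aes(x = SalaryUSD)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
```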

Correlation Checking

Another part of data analysis involves testing for correlation between explanatory variables.  You want correlation between your independent (explanatory) variables and your dependent variable—that’s what data analysis is all about, after all—but tightly related explanatory variables can lead to analytical problems down the road.  This correlation among explanatory variables is called multicollinearity and is one of the bugbears of linear regression analysis.

As an example of this, suppose that we want to regress plant growth rates using two variables:  amount of rainfall in inches and amount of precipitation in centimeters.  There is collinearity between rainfall and precipitation:  precipitation is rainfall + snowfall, so we can easily get from one to the other and their movements over time will be linked.  If we use both of these variables to try to model plant growth, it becomes harder for the algorithm we choose to differentiate between the two variables and assign each the correct weight.

The cor() function in R gives us correlation between two variables.

Diamond dimension correlations.

In this case, depth and table have a mild, negative correlation.  By contrast, the x and y dimensions are highly correlated.  If I were performing a linear regression, I would have no qualms about including both depth and table, but would think twice about including both x and y.
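
A correlation matrix like the diamond-dimension one above is easy to generate with the diamonds data set that ships with ggplot2; here’s a minimal sketch:

```r
library(ggplot2)   # for the diamonds data set
library(dplyr)

# Pairwise correlations among the depth, table, x, and y columns.
diamonds %>%
  select(depth, table, x, y) %>%
  cor()
```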

Conclusion (For Now)

In the next post, I will perform some data cleansing & analysis work on the survey, as this post is long enough as-is.  As mentioned, this is by far the longest phase of any data science project; what I’ve touched on in this post is just a brief smattering of the things you should expect to try when working through a project.
