This is part four of a series on launching a data science project.

An Example Of Data Processing

Last time around, I spent a lot of time talking about data acquisition, data cleansing, and basic data analysis.  Today, we're going to walk through a little bit of that process with the Data Professional Salary Survey.

First, let’s install some packages:

if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

if(!require(XLConnect)) {
  install.packages("XLConnect", repos = "http://cran.us.r-project.org")
  library(XLConnect)
}

if(!require(caret)) {
  install.packages("caret", repos = "http://cran.us.r-project.org")
  library(caret)
}

if(!require(recipes)) {
  install.packages("recipes", repos = "http://cran.us.r-project.org")
  library(recipes)
}

if(!require(data.table)) {
  install.packages("data.table", repos = "http://cran.us.r-project.org")
  library(data.table)
}

if(!require(devtools)) {
  install.packages("devtools", repos = "http://cran.us.r-project.org")
  library(devtools)
}

if(!require(keras)) {
  devtools::install_github("rstudio/keras")
  library(keras)
  install_keras(method = "auto", conda = "auto", tensorflow = "default", extra_packages = NULL)
}

The tidyverse package is a collection of incredibly useful R libraries, and I can't imagine doing a data science project in R without it. The XLConnect package lets me read an Excel workbook easily and grab the salary data without much hassle. The caret library provides helpful tooling for working with data, including splitting out test versus training data, as we'll do below. The recipes package will be useful for normalizing data later, and we will use data.table to get a look at some of our uneven data. We need the devtools package to install keras from GitHub. Keras is a deep learning library which provides a high-level interface over several neural network backends, including TensorFlow, which we will use later in this series. We also need to install TensorFlow on the machine. Because this is a small data set, and because I want this to run on machines without powerful GPUs, I am using the CPU-based version of TensorFlow; performance should still be adequate for our purposes.
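
Assuming the installation above went smoothly, a quick sanity check is the is_keras_available function in the keras package (just a sketch to confirm the backend is reachable):

# Should return TRUE once Keras and its TensorFlow backend are installed
keras::is_keras_available()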

Once we have the required packages loaded, we will then load the Excel workbook. I have verified the Excel worksheet and data region are correct, so we can grab the survey from the current directory and load it into salary_data.

wb <- XLConnect::loadWorkbook("2018_Data_Professional_Salary_Survey_Responses.xlsx")
salary_data <- XLConnect::readWorksheet(wb, sheet = "Salary Survey", region = "A4:Z6015")

We can use the glimpse function in the tidyverse to get a quick idea of what our salary_data dataframe looks like. In total, we have 6,011 observations of 26 variables, but this covers two survey years: 2017 and 2018. Looking at the variable names, we can see that there are some which don't matter very much (like Timestamp, which is when the user filled out the form, and Counter, which is just a 1 for each record).

glimpse(salary_data)
Observations: 6,011
Variables: 26
$ Survey.Year                <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017...
$ Timestamp                  <dttm> 2017-01-05 05:10:20, 2017-01-05 05:26:2...
$ SalaryUSD                  <chr> "200000", "61515", "95000", "56000", "35...
$ Country                    <chr> "United States", "United Kingdom", "Germ...
$ PostalCode                 <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ PrimaryDatabase            <chr> "Microsoft SQL Server", "Microsoft SQL S...
$ YearsWithThisDatabase      <dbl> 10, 15, 5, 6, 10, 15, 16, 4, 3, 8, 4, 22...
$ OtherDatabases             <chr> "MySQL/MariaDB", "Oracle, PostgreSQL", "...
$ EmploymentStatus           <chr> "Full time employee", "Full time employe...
$ JobTitle                   <chr> "DBA", "DBA", "Other", "DBA", "DBA", "DB...
$ ManageStaff                <chr> "No", "No", "Yes", "No", "No", "No", "No...
$ YearsWithThisTypeOfJob     <dbl> 5, 3, 25, 2, 10, 15, 11, 1, 2, 10, 4, 8,...
$ OtherPeopleOnYourTeam      <chr> "2", "1", "2", "None", "None", "None", "...
$ DatabaseServers            <dbl> 350, 40, 100, 500, 30, 101, 20, 25, 3, 5...
$ Education                  <chr> "Masters", "None (no degree completed)",...
$ EducationIsComputerRelated <chr> "No", "N/A", "Yes", "No", "Yes", "No", "...
$ Certifications             <chr> "Yes, and they're currently valid", "No,...
$ HoursWorkedPerWeek         <dbl> 45, 35, 45, 40, 40, 35, 40, 36, 40, 45, ...
$ TelecommuteDaysPerWeek     <chr> "1", "2", "None, or less than 1 day per ...
$ EmploymentSector           <chr> "Private business", "Private business", ...
$ LookingForAnotherJob       <chr> "Yes, but only passively (just curious)"...
$ CareerPlansThisYear        <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ Gender                     <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ OtherJobDuties             <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ KindsOfTasksPerformed      <chr> "Not Asked", "Not Asked", "Not Asked", "...
$ Counter                    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

Our first data cleansing activity will be to filter our data to include just 2018 results, which gives us a sample size of 3,113 participants. There are also results for 2017, but they asked a different set of questions and we don’t want to complicate the analysis or strip out the new 2018 questions.

survey_2018 <- filter(salary_data, Survey.Year == 2018)
nrow(survey_2018) # << 3113 records returned

Looking at the survey, there are some interesting data points that we want:

  • SalaryUSD (our label, that is, what we are going to try to predict)
  • Country
  • YearsWithThisDatabase
  • EmploymentStatus
  • JobTitle
  • ManageStaff
  • YearsWithThisTypeOfJob
  • OtherPeopleOnYourTeam
  • DatabaseServers
  • Education
  • EducationIsComputerRelated
  • Certifications
  • HoursWorkedPerWeek
  • TelecommuteDaysPerWeek
  • EmploymentSector
  • LookingForAnotherJob
  • CareerPlansThisYear
  • Gender

For each of these variables, we want to see the range of options and perform any necessary cleanup. The first thing I'd look at is the cardinality of each variable, followed by a detailed analysis of the smaller ones.

PrimaryDatabase is another variable which looks interesting, but it skews so heavily toward SQL Server that there's more noise than signal in it. Because there are so many platforms with 10 or fewer entries and about 92% of respondents selected SQL Server, we'll throw it out.

rapply(survey_2018, function(x) { length(unique(x)) })

Survey.Year – 1
Timestamp – 3112
SalaryUSD – 865
Country – 73
PostalCode – 1947
[… continue for a while]

unique(survey_2018$Country)
  1. ‘United States’
  2. ‘Australia’
  3. ‘Spain’
  4. ‘United Kingdom’
    [… continue for a while]

 

unique(survey_2018$EmploymentStatus)
  1. ‘Full time employee’
  2. ‘Full time employee of a consulting/contracting company’
  3. ‘Independent consultant, contractor, freelancer, or company owner’
  4. ‘Part time’

We can use the setDT function from data.table to see just how many records we have for each level of a particular factor. For example, we can see the different entries for PrimaryDatabase and EmploymentSector below. Both of these are troublesome for our modeling because they each have a number of levels with only one or two entries. That makes it likely that some of those levels won't appear in our training data set at all, and that will mess up our model later. To rectify this, I am going to remove PrimaryDatabase as a feature and remove the two students from our sample.

data.table::setDT(survey_2018)[, .N, keyby=PrimaryDatabase]
To the three MongoDB users: you have my sympathy.
data.table::setDT(survey_2018)[, .N, keyby=EmploymentSector]
In a way, aren’t we all students? No. Only two of us are.
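
To put a number on just how lopsided PrimaryDatabase is, here's a quick sketch using dplyr's count; this is where the roughly 92% figure above comes from:

survey_2018 %>%
  count(PrimaryDatabase, sort = TRUE) %>%
  mutate(Share = n / sum(n))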

Most of these columns came from dropdown lists, so they’re already fairly clean. But there are some exceptions to the rule. They are:

  • SalaryUSD
  • YearsWithThisDatabase
  • YearsWithThisTypeOfJob
  • DatabaseServers
  • HoursWorkedPerWeek
  • Gender

All of these were text fields, and whenever a user gets to enter text, you can assume that something will go wrong. For example:

survey_2018 %>%
  distinct(YearsWithThisDatabase) %>%
  arrange(desc(YearsWithThisDatabase)) %>%
  slice(1:10)
Some are older than they seem.

Someone with 53,716 years working with their primary database of choice? That’s commitment! You can also see a couple of people who clearly put in the year they started rather than the number of years working with it, and someone who maybe meant 10 years? But who knows, people type in weird stuff.
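
As an aside, one alternative to the blanket median replacement we'll use in the cleanup block later would be to treat four-digit values that look like calendar years as start years and convert them. This is just a sketch of the idea, not what we actually do below:

# Sketch only: convert apparent start years (e.g. 2008) into years of experience as of 2018
survey_2018 %>%
  mutate(YearsWithThisDatabase = case_when(
    YearsWithThisDatabase >= 1950 & YearsWithThisDatabase <= 2018 ~ 2018 - YearsWithThisDatabase,
    TRUE ~ YearsWithThisDatabase
  )) %>%
  distinct(YearsWithThisDatabase) %>%
  arrange(desc(YearsWithThisDatabase)) %>%
  slice(1:10)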

Anyhow, let’s see how much that person with at least 10 thousand years of experience makes:

survey_2018 %>%
  filter(YearsWithThisDatabase > 10000)
Experience doesn’t pay after the first century or two.

That’s pretty sad, considering their millennia of work experience. $95-98K isn’t even that great a number.

Looking at years of experience with their current job roles, people tend to be more reasonable:

survey_2018 %>%
  distinct(YearsWithThisTypeOfJob) %>%
  arrange(desc(YearsWithThisTypeOfJob)) %>%
  slice(1:10)

Next up, we want to look at the number of database servers owned. 500,000+ database servers is a bit excessive. Frankly, I'm suspicious of any number greater than 5000, but because I can't prove otherwise, I'll leave those be for now.

survey_2018 %>%
  distinct(DatabaseServers) %>%
  arrange(desc(DatabaseServers)) %>%
  slice(1:5)

survey_2018 %>%
  filter(DatabaseServers >= 5000) %>%
  arrange(desc(DatabaseServers))
500K servers is a lot of servers.

The first entry looks like bogus data: a $650K salary, a matching postal code, and 500K database servers, primarily in RDS? Nope, I don’t buy it.

The rest don’t really look out of place, except that I think they put in the number of databases and not servers. For these entrants, I’ll change the number of servers to the median to avoid distorting things.
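
For reference, here is what that median looks like before any replacement (a quick sketch; the actual replacement happens in the cleanup block further down):

# Median number of database servers across the 2018 respondents
median(survey_2018$DatabaseServers, na.rm = TRUE)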

Now let’s look at hours per week:

survey_2018 %>%
  distinct(HoursWorkedPerWeek) %>%
  arrange(desc(HoursWorkedPerWeek)) %>%
  slice(1:10)
One of these numbers is not like the others.  The rest of them are just bad.

To the person who works 200 hours per week: find a new job. Your ability to pack more than 7*24 hours of work into 7 days is too good to waste on a job making just $120K per year.

survey_2018 %>%
  filter(HoursWorkedPerWeek >= 168) %>%
  arrange(desc(HoursWorkedPerWeek))
What would I do with an extra day and a half per week? Sleep approximately an extra day and a half per week.

As far as Gender goes, there are only three with enough records to be significant: Male, Female, and Prefer not to say. We’ll take Male and Female and bundle the rest under “Other” to get a small but not entirely insignificant set there.

survey_2018 %>%
  group_by(Gender) %>%
  summarize(n = n())
To the one Reptilian in the survey, I see you and I will join forces with Rowdy Roddy Piper to prevent you from taking over our government.
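
One compact way to do that bundling is fct_other from forcats, which the tidyverse loads for us (a sketch; the cleanup block below accomplishes the same thing with case_when):

survey_2018 %>%
  mutate(Gender = forcats::fct_other(Gender, keep = c("Male", "Female"), other_level = "Other")) %>%
  count(Gender)
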
survey_2018 %>%
  group_by(Country) %>%
  summarize(n = n()) %>%
  filter(n >= 20)
Probably the most surprising country on this list is The Netherlands.  India is a close second, but for the opposite reason.

There are only fifteen countries with at least 20 data points and just eight with at least 30. This means that we won't get much information from cross-country comparisons outside of the largest countries. Frankly, I might want to limit this to just the US, UK, Canada, and Australia, as the rest are marginal, but for this survey analysis, I'll keep the other eleven. A quick sketch to double-check those counts:
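
survey_2018 %>%
  count(Country) %>%
  summarize(CountriesWithAtLeast20 = sum(n >= 20), CountriesWithAtLeast30 = sum(n >= 30))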

Building Our Cleaned-Up Data Set

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in delimiters like commas or periods (the period serves as the thousands separator in countries like Germany). I want to strip them out, particularly because a stray period might interfere with my analysis, turning 100.000 into $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

valid_countries <- survey_2018 %>%
                    group_by(Country) %>%
                    summarize(n = n()) %>%
                    filter(n >= 20)

# Data cleanup
survey_2018 <- salary_data %>%
  filter(Survey.Year == 2018) %>%
  filter(HoursWorkedPerWeek < 200) %>%
  # There were only two students in the survey, so we will exclude them here.
  filter(EmploymentSector != "Student") %>%
  inner_join(valid_countries, by="Country") %>%
  mutate(
    SalaryUSD = stringr::str_replace_all(SalaryUSD, "\\$", "") %>%
      stringr::str_replace_all(., ",", "") %>%
      stringr::str_replace_all(., " ", "") %>%
      # Some people put in pennies.  Let's remove anything with a decimal point followed by exactly two digits.
      stringr::str_replace_all(., stringr::regex("\\.[0-9]{2}$"), "") %>%
      # Now any decimal points remaining are formatting characters.
      stringr::str_replace_all(., "\\.", "") %>%
      as.numeric(.),
    # Some people have entered bad values here, so set them to the median.
    YearsWithThisDatabase = case_when(
      (YearsWithThisDatabase > 32) ~ median(YearsWithThisDatabase),
      TRUE ~ YearsWithThisDatabase
    ),
    # Some people apparently entered number of databases rather than number of servers.
    DatabaseServers = case_when(
      (DatabaseServers >= 5000) ~ median(DatabaseServers),
      TRUE ~ DatabaseServers
    ),
    EmploymentStatus = as.factor(EmploymentStatus),
    JobTitle = as.factor(JobTitle),
    ManageStaff = as.factor(ManageStaff),
    OtherPeopleOnYourTeam = as.factor(OtherPeopleOnYourTeam),
    Education = as.factor(Education),
    EducationIsComputerRelated = as.factor(EducationIsComputerRelated),
    Certifications = as.factor(Certifications),
    TelecommuteDaysPerWeek = as.factor(TelecommuteDaysPerWeek),
    EmploymentSector = as.factor(EmploymentSector),
    LookingForAnotherJob = as.factor(LookingForAnotherJob),
    CareerPlansThisYear = as.factor(CareerPlansThisYear),
    Gender = as.factor(case_when(
      (Gender == "Male") ~ "Male",
      (Gender == "Female") ~ "Female",
      TRUE ~ "Other"
    ))
  ) 
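
As a quick sanity check on that SalaryUSD replacement chain, here is the same sequence applied to a few made-up strings (not actual survey values):

c("$100,000.00", "100.000", "85000.50", "95 000") %>%
  stringr::str_replace_all("\\$", "") %>%
  stringr::str_replace_all(",", "") %>%
  stringr::str_replace_all(" ", "") %>%
  stringr::str_replace_all(stringr::regex("\\.[0-9]{2}$"), "") %>%
  stringr::str_replace_all("\\.", "") %>%
  as.numeric()
# All four should come out as plain numbers: 100000, 100000, 85000, and 95000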

Now we can pare out variables we don't need. Some of these, like postal code, are interesting, but we just don't have enough data for them to be useful. Others, like Kinds of Tasks Performed or Other Job Duties, have too many distinct values for us to make much sense of on a first pass. They might be interesting in a subsequent analysis, though.

survey_2018 <- survey_2018 %>%
  # One person had a salary of zero.  That's just not right.
  filter(SalaryUSD > 0) %>%
  select(-Counter, -KindsOfTasksPerformed, -OtherJobDuties, -OtherDatabases, -Timestamp, -Survey.Year, 
         -PostalCode, -n, -PrimaryDatabase)

Now that we have our salary data fixed, we can finally look at outliers. I’d consider a salary of $500K a year to be a bit weird for this field. It’s not impossible, but I am a little suspicious. I am very suspicious of the part-timer making $1.375 million, the federal employee making $1 million, or the New Zealander making $630K at a non-profit.

I’m kind of taking a risk by removing these, but they’re big enough outliers that they can have a real impact on our analysis if they’re bad data.

survey_2018 %>%
  filter(SalaryUSD > 500000) %>%
  arrange(desc(SalaryUSD))
I think I’d be willing to accept $1.4 million a year to be a manager of none.

On the other side, there are 12 people who say they earned less than $5K a year. Those also seem wrong. Some of them look like dollars per hour, and maybe some are monthly salary. I’m going to strip those out.

survey_2018 %>%
  filter(SalaryUSD < 5000) %>%
  arrange(desc(SalaryUSD))
For just over a dollar a week, you can hire a data architect.
survey_2018 <- filter(survey_2018, SalaryUSD >= 5000 & SalaryUSD <= 500000)

Data Analysis

We did some of the data analysis up above. We can do additional visualization and correlation studies. For example, let’s look at a quick distribution of salaries after our cleanup work:

summary(survey_2018$SalaryUSD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5000   70000   92000   95186  115000  486000
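
Since I mentioned correlation studies above, here is a quick sketch of pairwise correlations among the remaining numeric columns (complete observations only):

survey_2018 %>%
  select(SalaryUSD, YearsWithThisDatabase, YearsWithThisTypeOfJob, DatabaseServers, HoursWorkedPerWeek) %>%
  cor(use = "complete.obs")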

We can also build a histogram pretty easily using the ggplot2 library. This shows the big clump of database professionals earning between $70K and $115K per year. This salary distribution does skew right a bit, as you can see.

ggplot(data = survey_2018, mapping = aes(x = SalaryUSD)) +
  geom_histogram() +
  theme_minimal() +
  scale_x_log10(labels = scales::dollar)
Not including that guy making $58 a year.

We can also break this down by primary job title, though I'll limit it to a couple of summaries instead of showing the full picture.

survey_2018 %>% filter(JobTitle == "Data Scientist") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 45000  
 1st Qu.: 76250  
 Median :111000  
 Mean   :102000  
 3rd Qu.:122000  
 Max.   :160000
survey_2018 %>% filter(JobTitle == "Developer: App code (C#, JS, etc)") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 22000  
 1st Qu.: 60000  
 Median : 84000  
 Mean   : 84341  
 3rd Qu.:105000  
 Max.   :194000
survey_2018 %>% filter(JobTitle == "Developer: T-SQL") %>% select(SalaryUSD) %>% summary(.)
   SalaryUSD     
 Min.   : 12000  
 1st Qu.: 66000  
 Median : 87000  
 Mean   : 88026  
 3rd Qu.:110000  
 Max.   :300000

This fits pretty well with my biases, although the max Data Scientist salary seems rather low.

Conclusions

This is only a tiny sample of what I’d want to do with a real data set, but it gives you an idea of the kinds of things we look at and the kinds of things we need to fix before a data set becomes useful.

In the next post, we will get started with the wide world of modeling.
