Recently, Brent Ozar put together a data professional salary survey. He did some early analysis himself and, now that the results are in, opened the floor for others. In this post, I’m going to take that data set and do some fairly simple descriptive statistics with R. So let’s fire up RTVS and get started.
Getting A Picture Of The Data
The first thing we want to do is grab the Excel file and load it into R. There’s a nice package called XLConnect which makes working with Excel a breeze. Aside from that, I’m going to load dplyr and tidyr to help with data cleansing, and pwt9 to get some data we’ll use later in conjunction with the salary data.
install.packages("XLConnect")
install.packages("tidyr")
install.packages("dplyr")
install.packages("pwt9", repos = 'http://cran.rstudio.com/')

library(XLConnect)
library(tidyr)
library(dplyr)
library(pwt9)

wb <- loadWorkbook("2017_Data_Professional_Salary_Survey_Responses.xlsx")
salary.data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")
The salary survey worksheet really starts on row 4, with the first three rows being pre-header descriptive information. The first thing we want to do is check out the summary of our data frame to see if there are any immediate goodies:
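Against the survey data, the call is simply summary(salary.data). If you haven’t used it, summary() gives the five-number summary plus the mean for each numeric column; here’s a quick sketch against a small, made-up salary vector (the values are hypothetical, chosen only to mimic the spread in the real data):

```r
# Toy example of summary() on a made-up salary vector.
# The real call against the survey data is just summary(salary.data).
toy.salaries <- c(430, 45000, 87000, 95000, 110000, 1450000)
s <- summary(toy.salaries)
s[["Min."]]    # smallest reported value
s[["Median"]]  # middle of the distribution
s[["Max."]]    # largest reported value
```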
From the summary, we can see that there’s a wide range of salary values, from $430 a year up to $1.45 million. There are certainly some values which seem suspicious, and that’s a common problem with data sets like this: misreported values, typos, bad translations, etc. We’re going to have to do the best we can with the data we have, but in a practical data science scenario, we might want to investigate these values to see if they’re legit and clean them up if not.
Because salary is our primary fact value here, I figured it’d be interesting to see the distribution of salaries in a histogram. That’s easy to do with the hist() function, but if you just build a histogram of salaries, you’ll get skewed results. There are several orders of magnitude difference in salaries in this data set, and the way we smooth out those differences and capture meaningful trend differences is to take the log of salary. It doesn’t really matter which log base you use, so long as you’re consistent. I’m going to use log10 because that gives us an intuitive understanding of values: 10^5 is $10,000 per annum, 10^6 is $100,000, and so on.
The resulting histogram looks like this:
This is not really a normal distribution; it skews left (thanks in part to that $430 per year value, but there are plenty of very low values in the data set). It also shows that a strong majority of people answered between approximately 10^4.8 and 10^5.2. If you want to see what the exact bin widths are (assuming that, like me, you didn’t define them in the first place), that’s easy:
h <- hist(log10(salary.data$SalaryUSD))
h$breaks
> h$breaks
 [1] 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2
To give you some context, 10^4.6 is $39,810 per year, 10^4.8 is $63,095 per year, 10^5.0 is $100,000 per year, and 10^5.2 is $158,489 per year. Most data professionals surveyed make between $63K and $159K per year.
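Those dollar figures are nothing more than powers of ten, so you can sanity-check them directly in the console:

```r
# Converting log10(salary) bin edges back into dollar amounts.
10^4.6  # roughly $39,811 per year
10^4.8  # roughly $63,096
10^5.0  # exactly $100,000
10^5.2  # roughly $158,489
```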
Breakdowns By Country
Now let’s look at some country-level breakdowns. For this, I’m going to use some dplyr function chains, and my first step is to see what we’ve got.
entrants.by.country <- data.frame(
  salary.data %>%
    group_by(Country) %>%
    summarize(mean.salary = mean(SalaryUSD), n = n())
)
This will create a new data frame called entrants.by.country, which includes the country name, the number of entrants (n), and the mean salary. We can see that Anglosphere countries dominate the survey results, which makes sense given that it was on an English-language blog:
entrants.by.country[order(entrants.by.country$n, decreasing = TRUE),]
> entrants.by.country[order(entrants.by.country$n, decreasing = TRUE),]
          Country mean.salary    n
64  United States   106741.99 1879
63 United Kingdom    62570.73  293
10         Canada    74600.53  105
26          India    22507.34   70
3       Australia    91087.17   69
20        Germany    76408.07   45
39    Netherlands    65543.22   36
57         Sweden    60151.21   34
55   South Africa    54313.75   28
40    New Zealand    78336.92   24
15        Denmark   101767.35   23
27        Ireland    80177.25   20
46         Poland    32028.70   20
58    Switzerland   120138.89   18
That said, we have data for people from 66 countries; I’m only including some of the results in the list above for space reasons. If we want to see where the highest mean salary is for data professionals, we just need to reorder the sort criteria:
entrants.by.country[order(entrants.by.country$mean.salary, decreasing = TRUE),]
> entrants.by.country[order(entrants.by.country$mean.salary, decreasing = TRUE),]
         Country mean.salary    n
58   Switzerland   120138.89   18
6        Bermuda   120000.00    1
64 United States   106741.99 1879
25       Iceland   105500.00    2
15       Denmark   101767.35   23
42        Norway    97820.89    9
3      Australia    91087.17   69
These results say that Switzerland and Bermuda are nice places to work if you want to maximize expected salary. But there’s a problem: we only have one entrant who chose Bermuda, so that’s not very helpful for data comparison. What if we picked the one guy in Bermuda who makes more than $100K a year? Or maybe everybody makes more than this guy? We don’t know, and there just aren’t enough entrants in this sample to give us good information, so I want to filter out any country with fewer than 10 entrants. Why 10? Either because I hate Norway or I like round numbers; your call.
filtered.entrants.by.country <- entrants.by.country %>% filter(n > 9)
count(filtered.entrants.by.country)
filtered.entrants.by.country[order(filtered.entrants.by.country$mean.salary, decreasing = TRUE),]
> count(filtered.entrants.by.country)
Source: local data frame [1 x 1]

      n
  (int)
1    23
> filtered.entrants.by.country[order(filtered.entrants.by.country$mean.salary, decreasing = TRUE),]
          Country mean.salary    n
21    Switzerland   120138.89   18
23  United States   106741.99 1879
5         Denmark   101767.35   23
1       Australia    91087.17   69
10        Ireland    80177.25   20
14    New Zealand    78336.92   24
11         Israel    77200.00   10
8         Germany    76408.07   45
4          Canada    74600.53  105
13    Netherlands    65543.22   36
6         Finland    63640.00   10
22 United Kingdom    62570.73  293
20         Sweden    60151.21   34
2         Belgium    59354.55   11
18   South Africa    54313.75   28
7          France    53063.64   11
12          Italy    51410.00   15
19          Spain    40233.33   15
15         Poland    32028.70   20
3          Brazil    31152.18   11
16        Romania    27284.00   15
9           India    22507.34   70
17         Russia    21693.54   13
We’re now down to 23 countries, so I included the full set here. Honestly, I think even 10 entrants is way too low to make an accurate guess; I’d want 30-50 before I start feeling at all confident, and even that’s low considering all the factors which go into salary within a country.
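To see why tiny samples are so untrustworthy, you can simulate it: draw many samples of different sizes from the same (made-up) salary distribution and watch how much the sample mean bounces around. The distribution parameters below are hypothetical, chosen only to look vaguely salary-like:

```r
# How much does a sample mean vary at n = 1 versus n = 30?
# rlnorm() draws from a log-normal, a rough stand-in for a salary distribution.
set.seed(1)
pop.median <- 80000
one.person    <- replicate(1000, mean(rlnorm(1,  meanlog = log(pop.median), sdlog = 0.5)))
thirty.people <- replicate(1000, mean(rlnorm(30, meanlog = log(pop.median), sdlog = 0.5)))
sd(one.person)     # wide spread: a single entrant tells you very little
sd(thirty.people)  # much tighter: the mean of 30 is far more stable
```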
Looking at these countries, we see Switzerland in a tier of its own, followed by the United States and Denmark. Australia sits between that group and the next tier, which includes Ireland, New Zealand, Israel, Germany, and Canada. Down and down it goes, until we land on India and Russia at the bottom.
Introducing Purchasing Power Parity
Looking at salary data in a vacuum can be interesting, but I think we all have an intuitive understanding that it’s more expensive to live in the United States, UK, or Germany than in Romania, Russia, or India. Economists have done a lot of research on the topic, particularly around purchasing power parity. The quick idea is, supposing I have a basket of goods that I can purchase in one area, how much would it cost me to purchase an equivalent basket of goods in another area? Within one country, there can be vast differences: compare prices in New York City versus Iowa City. Between countries, there is also a significant difference, and PPP tries to adjust for that difference.
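The basket-of-goods idea reduces to simple arithmetic. Here is a sketch with entirely made-up numbers (not from the survey or the Penn World Table), just to show the mechanics:

```r
# Hypothetical PPP illustration: a basket costing $1,000 in the US costs
# the equivalent of $700 abroad at market exchange rates.
us.basket.price    <- 1000
local.basket.price <- 700
ppp.factor <- local.basket.price / us.basket.price  # 0.7: goods are cheaper there

# A local salary of $35,000 therefore buys what $50,000 would buy in the US.
local.salary <- 35000
ppp.adjusted <- local.salary / ppp.factor
```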
In our case, I want the purchasing power parity-adjusted Gross Domestic Product of each country; I’ll treat that as the norm. Before anybody jumps down my throat about how per capita GDP isn’t the same as average income, the various ways in which GDP is flawed as a calculation metric, that per-capita GDP includes people outside the workforce, that workforce compositions can differ between countries, etc. etc., I know and I agree. My hope here is that most of the differences will balance out without having to go through major analytic hurdles. We’re futzing about, not writing academic papers.
Anyhow, how do we get this data? Well, that’s where the Penn World Table package we loaded early on comes in. Building the data set is pretty easy. In this case, I just want to compare my data set against the latest year in the data set, which is 2014.
data("pwt9.0")
countries.2014 <- pwt9.0 %>% filter(year == 2014)
In this data set, there are two PPP GDP measures: an expenditure-side measure and an output-side measure. The expenditure-side standard is better for comparing living standards for people in those countries, whereas the output-side measure is better for understanding productivity differences between countries. We’re going to use the expenditure-side GDP values, although the differences aren’t huge. Also, I should note that population and GDP are both measured in millions, so to get per capita GDP, we can do straight division.
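Because both figures are in millions, the millions cancel and the division gives dollars per person directly. A quick sketch with made-up round numbers (not the actual Penn World Table values):

```r
# rgdpe is in millions of USD, pop is in millions of people,
# so rgdpe / pop is already USD per person.
rgdpe <- 16000000  # $16 trillion, expressed in millions
pop   <- 320       # 320 million people, expressed in millions
gdp.per.capita <- rgdpe / pop  # $50,000 per person
```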
country.gdp <- countries.2014 %>% select(Country = country, rgdpe, pop)
country.gdp$gdp.per.capita <- country.gdp$rgdpe / country.gdp$pop
country.gdp <- lapply(country.gdp, gsub, pattern = "United States of America", replacement = "United States")
country.gdp <- as.data.frame(lapply(country.gdp, gsub, pattern = "Russian Federation", replacement = "Russia"))
I also did some data cleanup, as there are two countries whose names do not match the values in the salary data set. To allow me to join these two sets together, I’m going to rename the countries in the GDP data set. That allows me to run a query like the following:
country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
> country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
        Country  gdp.per.capita
1 United States 52292.281832077
This says that in 2014, per-capita GDP (with PPP adjustments) was $52,292.28. My goal is to compare that per-capita GDP with the mean salary for a country and get a ratio for data professionals versus a normalized salary per country, still understanding that this is a very imprecise way of solving the problem.
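To make the ratio concrete before doing the join, here it is for the US using the two numbers we already have, the survey mean and the per-capita GDP above:

```r
# US mean salary from the survey and 2014 PPP-adjusted per-capita GDP:
us.mean.salary    <- 106741.99
us.gdp.per.capita <- 52292.28
salary.ratio <- us.mean.salary / us.gdp.per.capita  # roughly 2.04
```

In other words, the average American data professional in this sample earns about twice the US per-capita norm.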
What I need to do is join the data sets together, and dplyr lets me do that quite nicely. I’m going to use the left_join function to link the data sets, and then do some additional work. First, I need to convert gdp.per.capita from a factor to a numeric value so I can perform computations, like calculating salary ratios.
salary.versus.gdp <- left_join(filtered.entrants.by.country, country.gdp, by = "Country")

#The GDP decimal values became factors, so let's defactorize them.
salary.versus.gdp$gdp.per.capita <- as.numeric(as.character(salary.versus.gdp$gdp.per.capita))

#Now compute a salary ratio
salary.versus.gdp$salary.ratio <- salary.versus.gdp$mean.salary / salary.versus.gdp$gdp.per.capita

salary.versus.gdp[order(salary.versus.gdp$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)
Here’s the full result set:
> salary.versus.gdp[order(salary.versus.gdp$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)
          Country    n salary.ratio
18   South Africa   28     4.478221
9           India   70     4.308437
11         Israel   10     2.320406
5         Denmark   23     2.265324
14    New Zealand   24     2.255296
1       Australia   69     2.114833
3          Brazil   11     2.094886
21    Switzerland   18     2.054748
23  United States 1879     2.041257
4          Canada  105     1.761447
8         Germany   45     1.662452
10        Ireland   20     1.644101
6         Finland   10     1.575217
22 United Kingdom  293     1.554880
12          Italy   15     1.435772
13    Netherlands   36     1.387466
2         Belgium   11     1.359226
20         Sweden   34     1.348734
7          France   11     1.347673
16        Romania   15     1.310630
15         Poland   20     1.273202
19          Spain   15     1.188078
17         Russia   13     0.902426
A couple of things stand out to me. First, in a bare majority of the countries in this list, data professionals make 1.5-2x the “expected” value, suggesting that being a data professional is a relatively lucrative business. In South Africa and India, those ratios are even higher. I would expect it in India, where there are still large pockets of poverty and where IT jobs tend to be among the best-paying jobs in the country. South Africa is an odd duck to me here, but that might be my ignorance of their circumstances.
The other thing that stands out is Russia. Russia is the only country in which our salary ratio is under 1.0, indicating that data professionals there earn less than the PPP-adjusted per-capita norm. I decided to take a look at what makes Russia so special here.
What Makes Russia So Special?
The first thing I want to do is filter the Russian data set and take a quick look at their results to see if anything stands out.
russians <- salary.data %>% filter(Country == "Russia")
summary(russians$SalaryUSD)
Immediately, I see something pop out:
> summary(russians$SalaryUSD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1050   12000   20000   21690   35000   39570
Wait, there’s a person making $1050 a year? Let’s look at that person’s details:
salary.data %>% filter(SalaryUSD == 1050)
> salary.data %>% filter(SalaryUSD == 1050)
            Timestamp SalaryUSD Country      PrimaryDatabase YearsWithThisDatabase OtherDatabases
1 2017-01-10 00:29:51      1050  Russia Microsoft SQL Server                     4
    EmploymentStatus                          JobTitle ManageStaff YearsWithThisTypeOfJob
1 Full time employee Developer: App code (C#, JS, etc)          No                      4
  OtherPeopleOnYourTeam DatabaseServers           Education EducationIsComputerRelated   Certifications
1           More than 5               4 Bachelors (4 years)                        Yes No, I never have
  HoursWorkedPerWeek            TelecommuteDaysPerWeek EmploymentSector LookingForAnotherJob Counter
1                 40 None, or less than 1 day per week Private business                   No       1
According to this, the person is a full-time app developer with 4 years of experience and a Bachelor’s degree. He’s not looking for another job, so if this is legit data, somebody might want to snatch him up by doubling his salary to $2,100 a year. I’d take a risk on a developer for $2K.
Realistically, this is probably bad data. Maybe it’s $1050 a month? Or $10,500 a year? In any event, if you take this guy out, the mean jumps from $21,690 to $23,413. That’s still below the Russian per-capita GDP of $24,039, but it’s close enough to be almost 1:1. Even so, that’s still not a great ratio in comparison to the other countries on this list.
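You can rebuild that outlier-removal arithmetic from the summary numbers alone: 13 Russian respondents with a mean of $21,693.54 (the unrounded mean from the country table earlier).

```r
# Recompute the Russian mean after dropping the $1,050 outlier,
# working backwards from the group size and group mean.
n.russians   <- 13
russian.mean <- 21693.54
total.salary <- n.russians * russian.mean
mean.without.outlier <- (total.salary - 1050) / (n.russians - 1)  # about $23,414
```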
By Primary Database
The next thing I looked at was primary database.
primary.database <- data.frame(
  salary.data %>%
    group_by(PrimaryDatabase) %>%
    summarize(mean.salary = mean(SalaryUSD), n = n())
)
primary.database[order(primary.database$mean.salary, decreasing = TRUE),]
> primary.database[order(primary.database$mean.salary, decreasing = TRUE),]
           PrimaryDatabase mean.salary    n
1  Amazon RDS (any flavor)   325000.00    3
5            Elasticsearch   156500.00    2
8                  MongoDB   134000.00    3
3                Cassandra   130000.00    1
11                   Other   103418.44   16
15                Teradata   101521.43    7
13                     SAP   101357.14    7
10                  Oracle    90506.67   89
7     Microsoft SQL Server    90395.03 2729
12              PostgreSQL    86613.41   17
2             Azure SQL DB    84050.00    6
9            MySQL/MariaDB    76800.00    8
4                      DB2    72941.67    6
6         Microsoft Access    61700.00    3
14                  SQLite    44000.00    1
Given that 94% of entrants chose SQL Server, there’s no way we’re getting anything good from this slice. That said, it does feed my biases toward DB2 and Access…
By Job Title
I now want to look at salary by job title. Again, I’m going to filter out any job titles with fewer than 10 entries.
job.title <- data.frame(
  salary.data %>%
    group_by(JobTitle) %>%
    summarize(mean.salary = mean(SalaryUSD), n = n())
)
filtered.job.title <- job.title %>% filter(n > 9)
filtered.job.title[order(filtered.job.title$mean.salary, decreasing = TRUE),]
> filtered.job.title[order(filtered.job.title$mean.salary, decreasing = TRUE),]
                                                 JobTitle mean.salary    n
3                                          Data Scientist   123317.55   11
2                                               Architect   120086.86  244
9                                                 Manager   116003.52  142
8                                                Engineer    99983.54  147
4                                                     DBA    87697.75 1514
10                                                  Other    83594.62   53
7                                        Developer: T-SQL    83310.70  264
6   Developer: Business Intelligence (SSRS, PowerBI, etc)    80021.84  203
1                                                 Analyst    79985.09  128
5                       Developer: App code (C#, JS, etc)    78001.04  181
Self-described data scientists have the highest mean salary (go team!), followed by architects and managers. The one thing that surprised me a bit was that T-SQL developers (AKA Database Engineers) make slightly less in this data set than Database Administrators. There could be a number of factors, such as country-level differences, but the difference isn’t vast, so I’ll go with it.
At the bottom of the list are app developers, data analysts, and BI developers/report writers. This more or less fits my preconceived notions, and the numbers don’t seem outlandish.
The last thing I want to do is perform a few significance tests to get an idea of how robust our findings are—in other words, are we able to draw specific conclusions from this data?
Data Scientist Versus DBA
The first question I want to ask is, based on this data set, can we expect the wide world of data scientists to make more than the whole population of DBAs? Intuitively, I’d say yes, and we can see that there’s a $35K a year difference in the results, but the problem is that the number of data scientists in the sample is only 11.
The way that I’m going to compare these data sets is by running a t-test on the log10 salaries of each data set. I know that the distributions are not normal, so I can’t lean on the assumption of normality. That said, t-tests are fairly robust to departures from normality: although they were designed for data following a normal distribution, they can still function reasonably well without that requirement.
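If you haven’t run one before, R’s t.test() defaults to the Welch flavor, which does not assume equal variances between the two groups. A minimal sketch on simulated log-salary data (the means, standard deviations, and sample sizes below are made up to roughly mirror the real groups):

```r
# Minimal Welch two-sample t-test on simulated log10-salary data.
set.seed(7)
group.a <- rnorm(15,  mean = 5.03, sd = 0.15)  # small group, log10 salaries near 10^5
group.b <- rnorm(150, mean = 4.88, sd = 0.20)  # larger group, slightly lower center
result <- t.test(group.a, group.b)  # Welch is the default (var.equal = FALSE)
result$method   # "Welch Two Sample t-test"
result$p.value  # probability of a gap this large under equal population means
```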
data.scientists <- salary.data %>% filter(JobTitle == "Data Scientist")
dbas <- salary.data %>% filter(JobTitle == "DBA")
t.test(log10(data.scientists$SalaryUSD), log10(dbas$SalaryUSD))
> t.test(log10(data.scientists$SalaryUSD), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(data.scientists$SalaryUSD) and log10(dbas$SalaryUSD)
t = 1.9772, df = 10.173, p-value = 0.07574
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01828365  0.31237969
sample estimates:
mean of x mean of y
 5.031427  4.884379
I’m running a two-sample t-test (because I have two samples: a sample of data scientists and a sample of database administrators) and get a t score of 1.9772 and a p value of 0.076. Neither of these gives us enough confidence to state that, based solely on this data set, a data scientist in the overall population earns more than a database administrator. We can see a big part of the reason just by looking at the two histograms:
3 of the 11 data scientists are in the 10^5.2–5.4 range, which is well above the peak for DBAs. But most data scientists fit into the upper end of the DBA salary range, so even though the means differ quite a bit, with so few data scientists in the sample we can’t confidently conclude that there’s a real difference rather than just sample-related noise.
But let’s do a quick thought experiment and add three more data scientists to the survey, each of whom has a salary of our mean, $123,317.50. If we do that, the results become quite different:
ds.thought.experiment <- c(data.scientists$SalaryUSD, 123317.5, 123317.5, 123317.5)
t.test(log10(ds.thought.experiment), log10(dbas$SalaryUSD))
> t.test(log10(ds.thought.experiment), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(ds.thought.experiment) and log10(dbas$SalaryUSD)
t = 2.7378, df = 13.368, p-value = 0.01658
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.0340569 0.2855808
sample estimates:
mean of x mean of y
 5.044198  4.884379
Now the t score is 2.74 and the p value is under 0.02. This makes sense: by adding a few more people to the sample (and assuming that the sample mean is anywhere near accurate), we see a clearer bifurcation.
DBA Versus DBE
We saw that we didn’t have enough data scientist entries to differentiate data scientists from DBAs in terms of salary, so what can we do about this debate? Spoilers: not much.
dbes <- salary.data %>% filter(JobTitle == "Developer: T-SQL")
t.test(log10(dbes$SalaryUSD), log10(dbas$SalaryUSD))
> t.test(log10(dbes$SalaryUSD), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(dbes$SalaryUSD) and log10(dbas$SalaryUSD)
t = -1.6626, df = 336.98, p-value = 0.09733
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.072605456  0.006090578
sample estimates:
mean of x mean of y
 4.851122  4.884379
There’s a t score of -1.66, indicating that DBE salaries are lower, but the p value is nowhere near good enough for us to be confident that there’s a statistically significant difference.
US Versus Russia
Okay, with two failed tests, let’s go for a gimme. American data professionals earn more than Russian data professionals. This isn’t quite a layup because there are only 13 Russians in the sample, so we might end up with something like the data scientist scenario.
americans <- salary.data %>% filter(Country == "United States")
t.test(log10(americans$SalaryUSD), log10(russians$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(russians$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(russians$SalaryUSD)
t = 6.613, df = 12.019, p-value = 2.472e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5325389 1.0557546
sample estimates:
mean of x mean of y
 5.001766  4.207620
Success! We have a t score over 6.6 and a p value of about 2.5e-05, way below any reasonable significance threshold, meaning that based on our sample, it’s safe to say that Americans earn more.
US Versus UK
So how about the UK?
uk <- salary.data %>% filter(Country == "United Kingdom")
t.test(log10(americans$SalaryUSD), log10(uk$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(uk$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(uk$SalaryUSD)
t = 22.971, df = 363.35, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2171545 0.2578154
sample estimates:
mean of x mean of y
 5.001766  4.764281
We have enough members of both samples to be very confident in the difference. Note that even though the means are closer between the US and UK samples, the t score is much higher; this is because we have so many more members in the sample and can draw more confident conclusions.
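That sample-size effect is easy to demonstrate with deterministic toy data: hold the gap between group means fixed and repeat the same values more times, and the t score climbs.

```r
# Same gap between means, different sample sizes: the t score grows with n.
small.a <- rep(c(4.9, 5.0, 5.1), times = 5)    # 15 values, mean 5.0
small.b <- rep(c(4.6, 4.7, 4.8), times = 5)    # 15 values, mean 4.7
big.a   <- rep(c(4.9, 5.0, 5.1), times = 100)  # 300 values, same mean and spread
big.b   <- rep(c(4.6, 4.7, 4.8), times = 100)

t.small <- t.test(small.a, small.b)$statistic
t.big   <- t.test(big.a, big.b)$statistic
# t.big is several times larger than t.small despite an identical mean gap.
```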
US Versus Australia
Australia’s a tougher call because the sample is a bit smaller and the mean is a lot closer.
aussies <- salary.data %>% filter(Country == "Australia")
t.test(log10(americans$SalaryUSD), log10(aussies$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(aussies$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(aussies$SalaryUSD)
t = 3.9512, df = 73.376, p-value = 0.0001771
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03386076 0.10277441
sample estimates:
mean of x mean of y
 5.001766  4.933449
Even with that hedging, there are enough members of the Australian sample for us to be confident that there is a statistically significant difference between the populations.
US Versus Denmark
So how about the US’s nearest neighbor in terms of average data professional salary?
denmark <- salary.data %>% filter(Country == "Denmark")
t.test(log10(americans$SalaryUSD), log10(denmark$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(denmark$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(denmark$SalaryUSD)
t = 1.1847, df = 22.086, p-value = 0.2487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06728552  0.24668863
sample estimates:
mean of x mean of y
 5.001766  4.912065
The t score is small, the p value is large, and the 95% confidence interval crosses the 0 boundary. In other words, there aren’t enough people in the sample to show a statistically significant difference between the US and Denmark.
I wouldn’t draw too many conclusions from this sample, but it is rich enough to give us some ideas. We were able to see some common trends across most Western countries regarding the relative pay of data professionals: even though Americans in the data set make more than people in all other countries save for Switzerland (after filtering out the low-count countries), the American salary ratio relative to per-capita GDP is nearer to the middle of the pack, meaning that data professionals across these countries tend to sit at similar relative levels of affluence.
We also learned that you should avoid DB2…but I don’t think you needed me to tell you that…
Hopefully this inspires you to grab the data set yourself and play around with it. There are plenty of other tests that people can run on the data; I’m just scratching the surface.