Life and times of Avindian

I have once again been absent for some time. In particular, I’ve yet to write on the Hall of Fame election, which I plan to do in the not-too-distant future. Similarly, I have some overdue thoughts on Madden 17 to share.

I am thinking of properly starting my YouTube channel soon. I’ve gotten a little bit of experience doing some videos; once I’m a bit more comfortable, I will start posting videos. Until then, my friends!

Analyzing DBA Salaries

Recently, Brent Ozar put together a data professional salary survey.  He did some early analysis and, now that the results are in, has opened the floor for others.  In this post, I’m going to take that data set and do some fairly simple descriptive statistics with R.  So let’s fire up RTVS and get started.

Getting A Picture Of The Data

The first thing we want to do is grab the Excel file and load it into R.  There’s a nice package called XLConnect which makes working with Excel a breeze.  Aside from that, I’m going to load dplyr and tidyr to help with data cleansing, as well as the Penn World Table package (pwt9) to get some data we’ll use later in conjunction with salary data.

install.packages("XLConnect")
install.packages("tidyr")
install.packages("dplyr")
install.packages("pwt9", repos = 'http://cran.rstudio.com/')
library(XLConnect)
library(tidyr)
library(dplyr)
library(pwt9)

wb <- loadWorkbook("2017_Data_Professional_Salary_Survey_Responses.xlsx")
salary.data <- readWorksheet(wb, sheet = "Salary Survey", region = "A4:T2902")

The salary survey worksheet really starts on row 4, with the first three rows being pre-header descriptive information. The first thing we want to do is check out the summary of our data frame to see if there are any immediate goodies:

summary(salary.data)

Summary.png

From the summary, we can see that there’s a wide range of salary values, from $430 a year up to $1.45 million. There are certainly some values which seem suspicious, and that’s a common problem with data sets like this: misreported values, typos, bad translations, etc.  We’re going to have to do the best we can with the data we have, but in a practical data science scenario, we might want to investigate these values to see if they’re legit and clean them up if not.
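If I did want to investigate, a first pass might look something like the following sketch.  Note that the $500 and $500,000 cutoffs are arbitrary assumptions on my part, not anything from the survey:

# Flag salaries outside an arbitrary plausibility band for manual review.
suspicious.salaries <- salary.data %>%
    filter(SalaryUSD < 500 | SalaryUSD > 500000) %>%
    select(Country, JobTitle, SalaryUSD)
nrow(suspicious.salaries)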

Because salary is our primary fact value here, I figured it’d be interesting to see the distribution of salaries in a histogram. That’s easy to do with the hist() function, but if you just build a histogram of raw salaries, you’ll get skewed results. There are several orders of magnitude of difference in salaries in this data set, and the way we smooth out those differences and capture meaningful trends is to take the log of salary. It doesn’t really matter which log base you use, so long as you’re consistent. I’m going to use log10 because that gives us an intuitive understanding of values: 10^4 is $10,000 per annum, 10^5 is $100,000, and so on.

hist(log10(salary.data$SalaryUSD))

The resulting histogram looks like this:

histsalary

This is not really a normal distribution; it skews left (thanks in part to that $430 per year value, though there are plenty of other very low values in the data set).  It also shows that a strong majority of people answered between approximately 10^4.8 and 10^5.2.  If you want to see what the exact bin widths are (assuming that, like me, you didn’t define them in the first place), that’s easy:

h <- hist(log10(salary.data$SalaryUSD))
h$breaks
> h$breaks
 [1] 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2

To give you some context, 10^4.6 is $39,810 per year, 10^4.8 is $63,095 per year, 10^5.0 is $100,000 per year, and 10^5.2 is $158,489 per year.  Most data professionals surveyed make between $63K and $159K per year.
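Those dollar figures come straight from inverting the log, if you want to double-check them:

10 ^ c(4.6, 4.8, 5.0, 5.2)   # roughly $39.8K, $63.1K, $100K, and $158.5K per year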

Breakdowns By Country

Now let’s look at some country-level breakdowns.  For this, I’m going to use some dplyr function chains, and my first step is to see what we’ve got.

entrants.by.country <- data.frame(salary.data %>% group_by(Country) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))

This will create a new data frame called entrants.by.country, which includes the country name, the number of entrants (n), and the mean salary. We can see that Anglosphere countries dominate the survey results, which makes sense given that it was on an English-language blog:

entrants.by.country[order(entrants.by.country$n, decreasing = TRUE),]
> entrants.by.country[order(entrants.by.country$n, decreasing = TRUE),]
                 Country mean.salary    n
64         United States   106741.99 1879
63        United Kingdom    62570.73  293
10                Canada    74600.53  105
26                 India    22507.34   70
3              Australia    91087.17   69
20               Germany    76408.07   45
39           Netherlands    65543.22   36
57                Sweden    60151.21   34
55          South Africa    54313.75   28
40           New Zealand    78336.92   24
15               Denmark   101767.35   23
27               Ireland    80177.25   20
46                Poland    32028.70   20
58           Switzerland   120138.89   18

That said, we have data for people from 66 countries; I’m only including some of the results in the list above for space reasons. If we want to see where the highest mean salary is for data professionals, we just need to reorder the sort criteria:

entrants.by.country[order(entrants.by.country$mean.salary, decreasing = TRUE),]
> entrants.by.country[order(entrants.by.country$mean.salary, decreasing = TRUE),]
                 Country mean.salary    n
58           Switzerland   120138.89   18
6                Bermuda   120000.00    1
64         United States   106741.99 1879
25               Iceland   105500.00    2
15               Denmark   101767.35   23
42                Norway    97820.89    9
3              Australia    91087.17   69

These results say that Switzerland and Bermuda are nice places to work if you want to maximize expected salary. But there’s a problem: we only have one entrant who chose Bermuda, so that’s not very helpful for data comparison. What if we picked the one guy in Bermuda who makes more than $100K a year? Or maybe everybody makes more than this guy? We don’t know, and there just aren’t enough entrants in this sample to give us good information, so I want to filter out any country with fewer than 10 entrants. Why 10? Either because I hate Norway or I like round numbers; your call.

filtered.entrants.by.country <- entrants.by.country %>% filter(n > 9)
count(filtered.entrants.by.country)
filtered.entrants.by.country[order(filtered.entrants.by.country$mean.salary, decreasing = TRUE),]
> count(filtered.entrants.by.country)
Source: local data frame [1 x 1]

      n
  (int)
1    23
> filtered.entrants.by.country[order(filtered.entrants.by.country$mean.salary, decreasing = TRUE),]
          Country mean.salary    n
21    Switzerland   120138.89   18
23  United States   106741.99 1879
5         Denmark   101767.35   23
1       Australia    91087.17   69
10        Ireland    80177.25   20
14    New Zealand    78336.92   24
11         Israel    77200.00   10
8         Germany    76408.07   45
4          Canada    74600.53  105
13    Netherlands    65543.22   36
6         Finland    63640.00   10
22 United Kingdom    62570.73  293
20         Sweden    60151.21   34
2         Belgium    59354.55   11
18   South Africa    54313.75   28
7          France    53063.64   11
12          Italy    51410.00   15
19          Spain    40233.33   15
15         Poland    32028.70   20
3          Brazil    31152.18   11
16        Romania    27284.00   15
9           India    22507.34   70
17         Russia    21693.54   13

We’re now down to 23 countries, so I included the full set here. Honestly, I think even 10 entrants is way too low to make an accurate guess; I’d want 30-50 before I start feeling at all confident, and even that’s low considering all the factors which go into salary within a country.

Looking at these countries, we see Switzerland in its own tier, followed by the United States and Denmark.  Australia sits just above the next tier, which includes Ireland, New Zealand, Israel, Germany, and Canada.  Down and down it goes, until we land on India and Russia at the bottom.

Introducing Purchasing Power Parity

Looking at salary data in a vacuum can be interesting, but I think we all have an intuitive understanding that it’s more expensive to live in the United States, UK, or Germany than in Romania, Russia, or India.  Economists have done a lot of research on the topic, particularly around purchasing power parity.  The quick idea is, supposing I have a basket of goods that I can purchase in one area, how much would it cost me to purchase an equivalent basket of goods in another area?  Within one country, there can be vast differences:  compare prices in New York City versus Iowa City.  Between countries, there is also a significant difference, and PPP tries to adjust for that difference.

In our case, I want the purchasing power parity-adjusted Gross Domestic Product of each country; I’ll treat that as the norm.  Before anybody jumps down my throat about how per capita GDP isn’t the same as average income, the various ways in which GDP is flawed as a calculation metric, that per-capita GDP includes people outside the workforce, that workforce compositions can differ between countries, etc. etc., I know and I agree.  My hope here is that most of the differences will balance out without having to go through major analytic hurdles. We’re futzing about, not writing academic papers.

Anyhow, how do we get this data?  Well, that’s what the Penn World Table package we loaded early on is for. Building the data set is pretty easy. In this case, I just want to compare my data set against the latest year in the data set, which is 2014.

data("pwt9.0")
countries.2014 <- pwt9.0 %>% filter(year == 2014)

In this data set, there are two PPP GDP measures: an expenditure-side measure and an output-side measure. The expenditure-side standard is better for comparing living standards for people in those countries, whereas the output-side measure is better for understanding productivity differences between countries. We’re going to use the expenditure-side GDP values, although the differences aren’t huge. Also, I should note that population and GDP are both measured in millions, so to get per capita GDP, we can do straight division.
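If you want to sanity-check the claim that the two measures are close, a quick sketch like this (assuming the standard pwt9.0 column names rgdpe and rgdpo) puts the expenditure-side and output-side per-capita figures side by side for a few countries:

countries.2014 %>%
    filter(country %in% c("United States of America", "United Kingdom", "Germany")) %>%
    mutate(gdp.e.per.capita = rgdpe / pop, gdp.o.per.capita = rgdpo / pop) %>%
    select(country, gdp.e.per.capita, gdp.o.per.capita)

With that out of the way, the per-capita calculation and country name cleanup look like this: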

country.gdp <- countries.2014 %>% select(Country = country, rgdpe, pop)
country.gdp$gdp.per.capita <- country.gdp$rgdpe / country.gdp$pop
country.gdp <- lapply(country.gdp, gsub, pattern = "United States of America", replacement = "United States")
country.gdp <- as.data.frame(lapply(country.gdp, gsub, pattern = "Russian Federation", replacement = "Russia"))

I also did some data cleanup, as there are two countries whose names do not match the values in the salary data set. To allow me to join these two sets together, I’m going to rename the countries in the GDP data set. That allows me to run a query like the following:

country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
> country.gdp %>% filter(Country == "United States") %>% select(Country, gdp.per.capita)
        Country  gdp.per.capita
1 United States 52292.281832077

This says that in 2014, per-capita GDP (with PPP adjustments) was $52,292.28. My goal is to compare that per-capita GDP with the mean salary for a country and get a ratio for data professionals versus a normalized salary per country, still understanding that this is a very imprecise way of solving the problem.

What I need to do is join the data sets together, and dplyr lets me do that quite nicely. I’m going to use the left_join function to link the data sets, and then do some additional work. First, I need to convert gdp.per.capita from a factor to a numeric value so I can perform computations, like calculating salary ratios.

salary.versus.gdp <- left_join(filtered.entrants.by.country, country.gdp, by = "Country")
#The GDP decimal values became factors, so let's defactorize them.
salary.versus.gdp$gdp.per.capita <- as.numeric(as.character(salary.versus.gdp$gdp.per.capita))
#Now compute a salary ratio
salary.versus.gdp$salary.ratio <- salary.versus.gdp$mean.salary / salary.versus.gdp$gdp.per.capita
salary.versus.gdp[order(salary.versus.gdp$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)

Here’s the full result set:

> salary.versus.gdp[order(salary.versus.gdp$salary.ratio, decreasing = TRUE),] %>% select(Country, n, salary.ratio)
          Country    n salary.ratio
18   South Africa   28     4.478221
9           India   70     4.308437
11         Israel   10     2.320406
5         Denmark   23     2.265324
14    New Zealand   24     2.255296
1       Australia   69     2.114833
3          Brazil   11     2.094886
21    Switzerland   18     2.054748
23  United States 1879     2.041257
4          Canada  105     1.761447
8         Germany   45     1.662452
10        Ireland   20     1.644101
6         Finland   10     1.575217
22 United Kingdom  293     1.554880
12          Italy   15     1.435772
13    Netherlands   36     1.387466
2         Belgium   11     1.359226
20         Sweden   34     1.348734
7          France   11     1.347673
16        Romania   15     1.310630
15         Poland   20     1.273202
19          Spain   15     1.188078
17         Russia   13     0.902426

A couple of things stand out to me. First, in a bare majority of the countries in this list, data professionals make roughly 1.5-2x the “expected” value, meaning that being a data professional is a relatively lucrative business. In South Africa and India, those ratios are even higher. I would expect it from India, where there are still large pockets of poverty and where IT jobs tend to be among the best-paying jobs in the country. South Africa is an odd duck to me here, but that might be my ignorance of their circumstances.

The other thing that stands out is Russia. Russia is the only country in which our salary ratio is under 1.0, indicating that data professionals there earn less than the country’s per-capita GDP. I decided to take a look at what makes Russia so special here.

What Makes Russia So Special?

The first thing I want to do is filter the Russian data set and take a quick look at their results to see if anything stands out.

russians <- salary.data %>% filter(Country == "Russia")
summary(russians$SalaryUSD)

Immediately, I see something pop out:

> summary(russians$SalaryUSD)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1050   12000   20000   21690   35000   39570

Wait, there’s a person making $1050 a year? Let’s look at that person’s details:

salary.data %>% filter(SalaryUSD == 1050)
> salary.data %>% filter(SalaryUSD == 1050)
            Timestamp SalaryUSD Country      PrimaryDatabase YearsWithThisDatabase OtherDatabases   EmploymentStatus                          JobTitle
1 2017-01-10 00:29:51      1050  Russia Microsoft SQL Server                     4            Full time employee Developer: App code (C#, JS, etc)
  ManageStaff YearsWithThisTypeOfJob OtherPeopleOnYourTeam DatabaseServers           Education EducationIsComputerRelated   Certifications
1          No                      4           More than 5               4 Bachelors (4 years)                        Yes No, I never have
  HoursWorkedPerWeek            TelecommuteDaysPerWeek EmploymentSector LookingForAnotherJob Counter
1                 40 None, or less than 1 day per week Private business                   No       1

According to this, the person is a full-time app developer with 4 years of experience and a Bachelor’s degree. He’s not looking for another job, so if this is legit data, somebody might want to snatch him up at $2000 a year. I’d take a risk on a developer for $2K.

Realistically, this is probably bad data—maybe it’s $1050 a month? Or $10,500? In any event, if you take this guy out, the mean jumps from $21,690 to $23,413. That’s still below the Russian GDP of $24,039, but is close enough to be almost 1:1…but that’s still not a great ratio in comparison to the other countries on this list.
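Here’s a quick sketch of that arithmetic, simply dropping the $1,050 row (the $24,039 figure is the PPP-adjusted Russian per-capita GDP from the section above):

russians.sans.outlier <- russians %>% filter(SalaryUSD > 1050)
# The mean moves from roughly $21,690 to roughly $23,413.
mean(russians.sans.outlier$SalaryUSD)
# And the salary ratio creeps up toward, but stays under, 1.0.
mean(russians.sans.outlier$SalaryUSD) / 24039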

By Primary Database

The next thing I looked at was primary database.

primary.database <- data.frame(salary.data %>% group_by(PrimaryDatabase) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))
primary.database[order(primary.database$mean.salary, decreasing = TRUE),]
> primary.database[order(primary.database$mean.salary, decreasing = TRUE),]
           PrimaryDatabase mean.salary    n
1  Amazon RDS (any flavor)   325000.00    3
5            Elasticsearch   156500.00    2
8                  MongoDB   134000.00    3
3                Cassandra   130000.00    1
11                   Other   103418.44   16
15                Teradata   101521.43    7
13                     SAP   101357.14    7
10                  Oracle    90506.67   89
7     Microsoft SQL Server    90395.03 2729
12              PostgreSQL    86613.41   17
2             Azure SQL DB    84050.00    6
9            MySQL/MariaDB    76800.00    8
4                      DB2    72941.67    6
6         Microsoft Access    61700.00    3
14                  SQLite    44000.00    1

Given that 94% of entrants chose SQL Server, there’s no way we’re getting anything good from this slice. That said, it does feed my biases against DB2 and Access…
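As a quick check on that 94% figure, it falls straight out of the primary.database data frame we just built:

# Share of all entrants whose primary database is SQL Server; about 0.94.
sum(primary.database$n[primary.database$PrimaryDatabase == "Microsoft SQL Server"]) / sum(primary.database$n)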

By Job Title

I now want to look at salary by job title. Again, I’m going to filter out any job titles with fewer than 10 entries.

job.title <- data.frame(salary.data %>% group_by(JobTitle) %>% summarize(mean.salary = mean(SalaryUSD), n = n()))
filtered.job.title <- job.title %>% filter(n > 9)
filtered.job.title[order(filtered.job.title$mean.salary, decreasing = TRUE),]
> filtered.job.title[order(filtered.job.title$mean.salary, decreasing = TRUE),]
                                                JobTitle mean.salary    n
3                                         Data Scientist   123317.55   11
2                                              Architect   120086.86  244
9                                                Manager   116003.52  142
8                                               Engineer    99983.54  147
4                                                    DBA    87697.75 1514
10                                                 Other    83594.62   53
7                                       Developer: T-SQL    83310.70  264
6  Developer: Business Intelligence (SSRS, PowerBI, etc)    80021.84  203
1                                                Analyst    79985.09  128
5                      Developer: App code (C#, JS, etc)    78001.04  181

Self-described data scientists have the highest mean salary (go team!), followed by architects and managers. The one thing that surprised me a bit was that T-SQL developers (AKA Database Engineers) make slightly less in this data set than Database Administrators. There could be a number of factors at play, such as country-level differences, but the difference isn’t vast, so I’ll go with it.

At the bottom of the list are app developers, data analysts, and BI developers/report writers. This more or less fits my preconceived notions, and the numbers don’t seem outlandish.

Significance Testing

The last thing I want to do is perform a few significance tests to get an idea of how robust our findings are—in other words, are we able to draw specific conclusions from this data?

Data Scientist Versus DBA

The first question I want to ask is, based on this data set, can we expect the wide world of data scientists to make more than the whole population of DBAs? Intuitively, I’d say yes, and we can see that there’s a $35K a year difference in the results, but the problem is that the number of data scientists in the sample is only 11.

The way that I’m going to compare these data sets is by running a t-test on the log10 salaries of each data set. I know that the distributions are not normal, so I can’t assume normality. That said, t-tests are fairly robust to departures from normality, meaning that although they were designed to work with data following a normal distribution, they can still function reasonably well without that requirement.

data.scientists <- salary.data %>% filter(JobTitle == "Data Scientist")
dbas <- salary.data %>% filter(JobTitle == "DBA")
t.test(log10(data.scientists$SalaryUSD), log10(dbas$SalaryUSD))
> t.test(log10(data.scientists$SalaryUSD), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(data.scientists$SalaryUSD) and log10(dbas$SalaryUSD)
t = 1.9772, df = 10.173, p-value = 0.07574
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01828365  0.31237969
sample estimates:
mean of x mean of y 
 5.031427  4.884379

I’m running a two sample t-test (because I have two samples: a sample of data scientists and a sample of database administrators) and get a t score of 1.9772 and a p value of 0.076. Neither of these gives us enough confidence to state that, based solely on this data set, a data scientist in the overall population earns more than a database administrator. We can visualize a big part of the reason just by looking at the two histograms:

histdatascihistdba
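For reference, those two histograms came from calls along the lines of the following, using the default hist() bins as before; this is a sketch of the approach rather than the exact plotting code:

hist(log10(data.scientists$SalaryUSD))
hist(log10(dbas$SalaryUSD))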

3 of the 11 data scientists are in the 10^5.2–5.4 range, which is well above the peak for DBAs. But most data scientists fit into the upper end of the DBA salary range, so even though the means are quite different, with so few data scientists in the sample it’s hard to say confidently that there’s a real difference rather than just sample-related noise.

But let’s do a quick thought experiment and add three more data scientists to the survey, each of whom has a salary of our mean, $123,317.50. If we do that, the results become quite different:

ds.thought.experiment <- c(data.scientists$SalaryUSD, 123317.5, 123317.5, 123317.5)
t.test(log10(ds.thought.experiment), log10(dbas$SalaryUSD))
> t.test(log10(ds.thought.experiment), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(ds.thought.experiment) and log10(dbas$SalaryUSD)
t = 2.7378, df = 13.368, p-value = 0.01658
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.0340569 0.2855808
sample estimates:
mean of x mean of y 
 5.044198  4.884379

Now the t score is 2.74 and the p value is under 0.02. This makes sense: by adding a few more people to the sample (and assuming that the sample mean is anywhere near accurate), we see a clearer bifurcation.

DBA Versus DBE

We saw that we didn’t have enough data scientist entries to differentiate data scientists from DBAs in terms of salary, so what can we do about the DBA-versus-DBE debate? Spoilers: not much.

dbes <- salary.data %>% filter(JobTitle == "Developer: T-SQL")
t.test(log10(dbes$SalaryUSD), log10(dbas$SalaryUSD))
> t.test(log10(dbes$SalaryUSD), log10(dbas$SalaryUSD))

	Welch Two Sample t-test

data:  log10(dbes$SalaryUSD) and log10(dbas$SalaryUSD)
t = -1.6626, df = 336.98, p-value = 0.09733
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.072605456  0.006090578
sample estimates:
mean of x mean of y 
 4.851122  4.884379

There’s a t score of -1.66, indicating that DBE salaries are lower, but the p value is nowhere near good enough for us to be confident that there’s a statistically significant difference.

US Versus Russia

Okay, with two failed tests, let’s go for a gimme: American data professionals earn more than Russian data professionals. This isn’t quite a layup because there are only 13 Russians in the sample, so we might end up with something like the data scientist scenario.

americans <- salary.data %>% filter(Country == "United States")
t.test(log10(americans$SalaryUSD), log10(russians$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(russians$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(russians$SalaryUSD)
t = 6.613, df = 12.019, p-value = 2.472e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.5325389 1.0557546
sample estimates:
mean of x mean of y 
 5.001766  4.207620

Success! We have a t score over 6.6 and a p value far below 0.05, meaning that based on our sample, it’s safe to say that Americans earn more.

US Versus UK

So how about the UK?

uk <- salary.data %>% filter(Country == "United Kingdom")
t.test(log10(americans$SalaryUSD), log10(uk$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(uk$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(uk$SalaryUSD)
t = 22.971, df = 363.35, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2171545 0.2578154
sample estimates:
mean of x mean of y 
 5.001766  4.764281

We have enough members of both samples to be very confident in the difference. Note that even though the means are closer between the US and UK samples, the t score is much higher; this is because we have so many more members in the sample and can draw more confident conclusions.

US Versus Australia

Australia’s a tougher call because the sample is a bit smaller and the mean is a lot closer.

aussies <- salary.data %>% filter(Country == "Australia")
t.test(log10(americans$SalaryUSD), log10(aussies$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(aussies$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(aussies$SalaryUSD)
t = 3.9512, df = 73.376, p-value = 0.0001771
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03386076 0.10277441
sample estimates:
mean of x mean of y 
 5.001766  4.933449

Even with that hedging, there are enough members of the Australian sample for us to be confident that there is a statistically significant difference between the populations.

US Versus Denmark

So how about the US’s nearest neighbor in terms of average data professional salary?

denmark <- salary.data %>% filter(Country == "Denmark")
t.test(log10(americans$SalaryUSD), log10(denmark$SalaryUSD))
> t.test(log10(americans$SalaryUSD), log10(denmark$SalaryUSD))

	Welch Two Sample t-test

data:  log10(americans$SalaryUSD) and log10(denmark$SalaryUSD)
t = 1.1847, df = 22.086, p-value = 0.2487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.06728552  0.24668863
sample estimates:
mean of x mean of y 
 5.001766  4.912065

The t score is small, the p value is large, and the 95% confidence interval crosses the 0 boundary. In other words, there aren’t enough people in the sample to show a statistically significant difference between the US and Denmark.

Conclusions

I wouldn’t draw too many conclusions from this sample, but it is rich enough to give us some ideas.  We were able to see some common trends across most western countries regarding the relative pay of data professionals:  even though Americans in the data set make more than people in all other countries save for Switzerland (after filtering out the low-count countries), the American salary ratio relative to per-capita GDP sits nearer the middle of the pack, meaning that data professionals across these countries tend to be at similar relative levels of affluence.

We also learned that you should avoid DB2…but I don’t think you needed me to tell you that…

Hopefully this inspires you to grab the data set yourself and play around with it.  There are plenty of other tests that people can run on the data; I’m just scratching the surface.

Let’s Build A Portable Hadoop Cluster, Part 1

A little while back, I had a mini-series on building a Hadoop cluster.  Part 1 covered what I bought.  Part 2 introduced the idea of a Dockerized Hadoop cluster.  Part 3 covered installation and setup.

That’s all well and good, but one of my goals is to make this a portable Hadoop cluster so I can present with it.  The Intel NUC does not have a built-in battery like a laptop, so if you pull the plug (hint:  don’t), it’ll just shut itself right off.  When you power it back on, you’ll find that your Hadoop cluster has gone into hiding:

docker

I’ll show how to bring the cluster back to life in the next blog post, but I will say that it takes about 10-15 minutes for everything to come up, and I might not have 10-15 minutes before my talk to set things up.  I’d prefer to be able to attach the NUC to an external battery pack an hour or two before the talk begins and let it coast from there.

In addition, I also need to have a network connection so I can talk to the NUC.  I don’t want to trust that my presentation location will have a good internet connection and I don’t want my NUC exposed to the network, so I need a miniature router as well.

Here’s what I landed on:

The TP-Link router was already in my bag, so I didn’t buy it specifically for this project.  It’s an alright travel router, but it’s probably the weak link here; if I were buying new, I’d go with something a little more modern and powerful.

I did a lot more research on rechargeable power packs, and the BP220 seemed to be the best for the job.  The Intel NUC that I have draws about 17 watts when idling and can spike up to 77 at load (and I’ve even read that it could spike into the 80s when you push it hard).  The BP220 supports that load and provides 223 watt-hours of juice per charge.  That means I could stretch a battery charge out to as long as 13 hours (223 / 17), although a more realistic figure would assume an average draw of about 35 watts, so maybe 6-7 hours.  Still, that’s more than I need to get through a one-hour presentation and two hours of prep.
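To spell out that back-of-the-envelope math (17 W idle and an assumed 35 W average draw are the figures above), the arithmetic is simple:

battery.watt.hours <- 223
battery.watt.hours / 17   # about 13 hours at idle draw
battery.watt.hours / 35   # about 6.4 hours at the assumed average draw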

The battery pack itself is a little heavy, weighing in at a little over 3 pounds—in other words, it’s heavier than my laptop, especially if you pack the power brick as well.  Combined with the NUC, it’s about 7-8 pounds of extra gear, meaning that I’m fine taking it with me to present but wouldn’t want to schlep it around all the time.  That said, it’s also pretty compact.  At 10.6″ long, it fits nicely into my laptop bag, and it and the NUC can share the inside pocket while my laptop fits into the outside pocket.  At that point, I’m essentially carrying two laptops, but I did that for a while anyhow, so no big deal.

Finally, the power strip makes it so that I can plug in these devices along with my laptop.  Power outlets aren’t always conveniently located, and you rarely get more than one or maybe two outlets, so that’s in my bag just in case I do run low on battery power and need to plug everything in.

Working With Azure SQL DW

This is a continuation of my Polybase series.

The first step to working with Polybase in Azure SQL Data Warehouse is to provision an Azure SQL Data Warehouse instance.  I’m going to follow along with the data I used in my Azure Blob Storage examples and load that data into Azure SQL Data Warehouse using Polybase.

The best way to connect to Azure SQL Data Warehouse is not SQL Server Management Studio, but rather SQL Server Data Tools.  So we’ll connect using SSDT.

In my scenario, I have an Azure SQL Data Warehouse instance named csdwdb.

connection

After putting in the correct details, I can see my instance in the SQL Server Object Explorer in Visual Studio.

connected

My plan is to do two things:  first, I want to pull all of my flight data into Azure SQL Data Warehouse, putting it into a table with a clustered columnstore index.  Second, I want to perform some processing of the data and put the results back into Azure Blob Storage to allow me to migrate data back to an on-prem SQL Server instance.

Step One:  Loading Data

External Types

The first thing I need to do is to create an external data source, external file format, and external table for my Azure Blob Storage flights folder.  Looking back at my post on Azure Blob Storage, I can repurpose most of that code for this scenario as well:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '<access key>';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://csflights@cspolybase.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
GO
CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = True
    )
);
GO
CREATE EXTERNAL TABLE [dbo].[Flights2008]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO
CREATE EXTERNAL TABLE [dbo].[FlightsAll]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);
GO

Note that there is no database defined anymore. I’m running the query against my Azure SQL DW database, so I’m not going to specify a separate database here.

I want to make sure that everything’s working, so a quick select statement is in order:

SELECT TOP(100) * FROM dbo.FlightsAll;

top100

So, “quick” might be an overstatement, but I did get the results I wanted and proof that we have data available.

Loading SQL Data Warehouse

My next step is to take this data and load it into Azure SQL Data Warehouse.  The smart way to create tables in Azure SQL Data Warehouse is to use the CREATE TABLE AS SELECT syntax.  I’m going to use that, as well as the advice they give on getting a fairly unique distribution key, and create my 2008 flights table as follows:

CREATE TABLE [dbo].[Flights2008DW]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.Flights2008;
GO

After a little over a minute, I get my results:

Query completed. Rows were rejected while reading from external source(s).
1 row rejected from external table [Flights2008] in plan step 3 of query execution:
Location: ‘/historical/2008.csv.bz2’ Column ordinal: 0, Expected data type: INT, Offending value: Year (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

That’s just the header row, and I’m okay with it not making its way in.  As a quick aside, I should note that I picked tailnum as my distribution key.  The airplane’s tail number is unique to that craft, so there absolutely will be more than 60 distinct values, and as I recall, this data set didn’t have too many NULL values.  After loading the 2008 data, I loaded all years’ data the same way into FlightsAllDW, except selecting from dbo.FlightsAll instead of dbo.Flights2008.

Querying The Data

So let’s do some quick queries and make sure that everything looks alright.  Going back to earlier this year, I had an example of a query against a larger data set.  My subset of flights is not exactly the same here as it was there—I only have flight data for the years 2003 through 2008 in Azure SQL Data Warehouse—but I figured it’d be a good start.

Here’s my slightly revised query:

SELECT
    f.origin,
    f.distance
FROM dbo.FlightsAllDW f
WHERE
    f.year = 2005
    AND f.dest = 'CMH';

And against an Azure SQL Data Warehouse instance with 200 DWUs, it took 6 seconds to load the results.

2005flights

Again, that’s not a fair comp:  I’m getting 200 DWUs of compute against data spread across 60 distributions versus running a query on my laptop.  That said, it was pleasant seeing results come back that quickly.

Migrating Data Out Using Polybase

I’m going to show one last trick here:  outmigration using CETAS:  CREATE EXTERNAL TABLE AS SELECT.

CREATE EXTERNAL TABLE [dbo].[CMHFlights]
WITH
(
	LOCATION = N'columbus/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
)
AS SELECT * FROM dbo.FlightsAllDW WHERE dest = 'CMH';
GO

This query takes data from my FlightsAllDW Azure SQL DW table and pumps it back out to Azure Blob Storage as a new external table.

cmhflights

In eight seconds, it found the roughly 220K records that interested me and wrote them to a new external table called CMHFlights, stored in a folder named columbus.  If I look at the folder, I can see that the Polybase engine created 60 files:

cetas

Remember that when we did this ourselves using on-prem Polybase (e.g., writing to Hadoop), we ended up with the data split out into 8 files.  Azure SQL Data Warehouse is special.

Conclusion

Polybase inside Azure SQL Data Warehouse isn’t that different from on-prem Polybase.  Its primary focus is to get data into and out of Azure SQL Data Warehouse for cases in which you don’t want to use Azure Data Factory.  One really cool thing you can do with Polybase in Azure SQL DW that you can’t do on-prem is CREATE EXTERNAL TABLE AS SELECT.  Otherwise, if you understand how the on-prem product works, you’ll get how it works in SQL Data Warehouse.

Partitioned Views With Polybase

This is a continuation of my Polybase series.

In today’s Polybase post, I want to see if I can replicate “SQL 2000 partitioning” using Polybase.  Back before SQL Server 2005, there was no concept of partitioned tables; the only way you could partition a table was to separate each logical subset of data into its own table and create a view which unioned those subsets together.  The reason you might want to do this is that if there were constraints on each individual table, you would be able to perform partition elimination, only querying the table(s) with potentially relevant data.

With that in mind, let’s take a look at what we can do.

Statistics, Not Constraints

The first important thing to note is that we cannot create constraints after the fact.  Taking an external table with a NOT NULL column:

ALTER TABLE dbo.RemoteFixed ADD CONSTRAINT [PK_RemoteFixed] PRIMARY KEY CLUSTERED (ID);

I get the following error:

Msg 46518, Level 16, State 2, Line 3
The feature ‘ALTER TABLE’ is not supported with external tables.

In fact, I can’t create a primary key constraint on an external table either:

CREATE EXTERNAL TABLE [dbo].[RemoteFixedPK]
(
    ID INT NOT NULL PRIMARY KEY CLUSTERED,
    SomeNum INT NOT NULL,
    SomeChar CHAR(12) NOT NULL
)
WITH
(
    DATA_SOURCE = [WASBFlights],
    LOCATION = N'fixed/',
    FILE_FORMAT = [CsvFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);

Msg 103010, Level 16, State 1, Line 1
Parse error at line: 3, column: 21: Incorrect syntax near ‘PRIMARY’.

I leave it as an exercise to the reader to look at check constraints; I did and got the same basic error messages when trying to modify or create a table with a check constraint.

Why This Makes Sense

At first, you might think that this is a terrible limitation and something that needs “fixed.”  I don’t think I agree with that assessment, as these are external tables.  The value in primary key and check constraints comes from the fact that the database engine controls the rows which go into a table and can validate these constraints whenever you try to insert rows.  With external tables, however, there are multiple ways for data to get into that table:  it can come directly from SQL Server insert statements, but (much) more often, it will come from some external process.  That external process likely will not have the concepts of primary key or check constraints, and so data could get in which fails these constraints.

That said, I think there’s a fair argument in rejecting rows upon data retrieval based on constraints, but that behavior would differ from “normal” tables, so I’m not sure it would be the best choice.

Are Statistics The Key?

So knowing that we cannot create check constraints but we can create statistics, will that help?  I’ve created a series of flight tables, one for each year, which follow this pattern:

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[Flights2006]
(
    [year] int NULL,
    [month] int NULL,
    [dayofmonth] int NULL,
    [dayofweek] int NULL,
    deptime VARCHAR(100) NULL,
    crsdeptime VARCHAR(100) NULL,
    arrtime VARCHAR(100) NULL,
    crsarrtime VARCHAR(100) NULL,
    uniquecarrier VARCHAR(100) NULL,
    flightnum VARCHAR(100) NULL,
    tailnum VARCHAR(100) NULL,
    actualelapsedtime VARCHAR(100) NULL,
    crselapsedtime VARCHAR(100) NULL,
    airtime VARCHAR(100) NULL,
    arrdelay VARCHAR(100) NULL,
    depdelay VARCHAR(100) NULL,
    origin VARCHAR(100) NULL,
    dest VARCHAR(100) NULL,
    distance VARCHAR(100) NULL,
    taxiin VARCHAR(100) NULL,
    taxiout VARCHAR(100) NULL,
    cancelled VARCHAR(100) NULL,
    cancellationcode VARCHAR(100) NULL,
    diverted VARCHAR(100) NULL,
    carrierdelay VARCHAR(100) NULL,
    weatherdelay VARCHAR(100) NULL,
    nasdelay VARCHAR(100) NULL,
    securitydelay VARCHAR(100) NULL,
    lateaircraftdelay VARCHAR(100) NULL
)
WITH
(
    LOCATION = N'historical/2006.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = Value,
    REJECT_VALUE = 5000
);

CREATE STATISTICS [st_flights2006_year] ON [dbo].[Flights2006]([year]);
GO

For each year, I created the table, pointed it to the relevant year file, and created a single statistics histogram based on the year.  For 2006, that histogram is simple:

flights2006statistics

We can see that there’s data just for one year.  So when I run the following query, my hope would be that the contradiction checker would say that this is not going to return any results and turn the query into a constant scan:

SELECT TOP 10
	*
FROM dbo.Flights2006
WHERE
	year = 2008;

The answer, a couple of minutes later, was “close, but no cigar.”  As Benjamin Nevarez points out, contradiction detection works off of constraints, so just because our statistics say there are no records outside of 2006 doesn’t mean there really are no records outside of 2006, and so the contradiction detection logic won’t come into play here.

Creating A View

The next idea is to create a view, specifying our “constraints” in the WHERE clause of each segment:

CREATE VIEW dbo.AllFlights AS
SELECT * FROM dbo.Flights2006 WHERE year = 2006
UNION ALL
SELECT * FROM dbo.Flights2007 WHERE year = 2007
UNION ALL
SELECT * FROM dbo.Flights2008 WHERE year = 2008;

This does pass the obvious contradiction test:

contradiction

For a year outside of our range, we get no results back.  How about when we look at a contradiction in terms of two separate years?

contradiction2

I was a little concerned that we’d hit something like the scenario Conor Cunningham described back in 2008, but it looks like the engine was smart enough to figure this out.

So now for the big question:  can we get the equivalent of partition elimination?  The answer is yes.  When I ran a single-year query against the view (filtering on year = 2006), the execution plan’s remote query made reference to only one file:  wasbs://csflights@cspolybase.blob.core.windows.net/historical/2006.csv.bz2.  And my Wireshark packet capture only found 79,116 packets of at least 1400 bytes.

flights2006packets

As we saw earlier, that is consistent with reading just a single file.  So how about we look at two years; will we see two files?

SELECT
	year,
	COUNT(1) AS NumberOfFlights
FROM dbo.AllFlights af
WHERE
	year IN (2006, 2007)
GROUP BY
	year
ORDER BY
	year;

The Wireshark capture looks a bit different now:

multiyearflightspackets

And my execution plan now includes two wasbs blocks.  Here’s the remote query XML block which shows that:

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="8" number_distributions_per_node="8">
  <sql>ExecuteMemo explain query</sql>
  <dsql_operations total_cost="696.59944" total_number_operations="15">
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_40</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_40] ([year] INT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ExternalRoundRobinMove">
      <operation_cost cost="695.59944" accumulative_cost="695.59944" average_rowsize="4" output_rows="45715" />
      <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2006.csv.bz2</external_uri>
      <destination_table>[TEMP_ID_40]</destination_table>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_41</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_41] ([year] INT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">EXEC [tempdb].[sys].[sp_addextendedproperty] @name=N'IS_EXTERNAL_STREAMING_TABLE', @value=N'true', @level0type=N'SCHEMA', @level0name=N'dbo', @level1type=N'TABLE', @level1name=N'TEMP_ID_41'</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">UPDATE STATISTICS [tempdb].[dbo].[TEMP_ID_41] WITH ROWCOUNT = 6024, PAGECOUNT = 2</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE STATISTICS [year] ON [tempdb].[dbo].[TEMP_ID_41] ([year]) WITH STATS_STREAM = 0x010000000100000000000000000000...</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="RND_ID">
      <identifier>TEMP_ID_42</identifier>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[dbo].[TEMP_ID_42] ([year] INT NOT NULL, [col] BIGINT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="MULTI">
      <dsql_operation operation_type="SHUFFLE_MOVE">
        <operation_cost cost="0.00576" accumulative_cost="695.6052" average_rowsize="12" output_rows="4" />
        <source_statement>SELECT [T1_1].[year] AS [year],
       [T1_1].[col] AS [col]
FROM   (SELECT   COUNT_BIG(CAST ((1) AS INT)) AS [col],
                 [T3_1].[year] AS [year]
        FROM     [tempdb].[dbo].[TEMP_ID_40] AS T3_1
        WHERE    ([T3_1].[year] = CAST ((2006) AS INT))
        GROUP BY [T3_1].[year]
        UNION ALL
        SELECT   COUNT_BIG(CAST ((1) AS INT)) AS [col],
                 [T3_1].[year] AS [year]
        FROM     [tempdb].[dbo].[TEMP_ID_41] AS T3_1
        WHERE    ([T3_1].[year] = CAST ((2007) AS INT))
        GROUP BY [T3_1].[year]) AS T1_1</source_statement>
        <destination_table>[TEMP_ID_42]</destination_table>
        <shuffle_columns>year;</shuffle_columns>
      </dsql_operation>
      <dsql_operation operation_type="ExternalRoundRobinMove">
        <operation_cost cost="733.274256" accumulative_cost="1428.879456" average_rowsize="4" output_rows="48191" />
        <external_uri>wasbs://csflights@cspolybase.blob.core.windows.net/historical/2007.csv.bz2</external_uri>
        <destination_table>[TEMP_ID_41]</destination_table>
      </dsql_operation>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_41]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_40]</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="STREAMING_RETURN">
      <operation_cost cost="1" accumulative_cost="1429.879456" average_rowsize="8" output_rows="2" />
      <location distribution="AllDistributions" />
      <select>SELECT [T1_1].[year] AS [year],
       [T1_1].[col] AS [col]
FROM   (SELECT CONVERT (INT, [T2_1].[col], 0) AS [col],
               [T2_1].[year] AS [year]
        FROM   (SELECT ISNULL([T3_1].[col], CONVERT (BIGINT, 0, 0)) AS [col],
                       [T3_1].[year] AS [year]
                FROM   (SELECT   SUM([T4_1].[col]) AS [col],
                                 [T4_1].[year] AS [year]
                        FROM     [tempdb].[dbo].[TEMP_ID_42] AS T4_1
                        GROUP BY [T4_1].[year]) AS T3_1) AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="AllDistributions" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[dbo].[TEMP_ID_42]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>

We see wasbs references on lines 16 and 76 of the plan above.  What’s interesting is that each individual file goes into its own temp table (TEMP_ID_40 and TEMP_ID_41 for 2006 and 2007, respectively), and then the Polybase engine creates a third temp table (TEMP_ID_42) and inserts the individual aggregations from 40 and 41 into this third temp table.

Conclusion:  “Stretch” Database

Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000.  This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it.  The thing to remember, though, is that if you store this in Azure Blob Storage, you will need to pull down the entire table’s worth of data to do any processing.

This leads to a concept I first heard from Ginger Grant:  pseudo-StretchDB.  Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price.  If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month.  For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine.  But if you need to query this data frequently, performance might be a killer.

Presentation Goals For 2017

In light of my 2016 goals, I’ve decided to make my 2017 speaking goals a little bit more ambitious and keep them public. To wit:

Speak at 20 SQL Saturday and 10 User Groups

Last year, I broke both of these barriers, so you might wonder how this is particularly ambitious. The way I figure it, when combined with the other goals, getting out to 20 separate SQL Saturdays will be a bit harder, as there are only so many weeks (and weekends) in the conference season.

So far, I am locked into three SQL Saturdays: Nashville, Vienna, and Cleveland.

Speak at 2 Paid Conferences

Last year, my goal was to speak at one paid conference. This year, I’m pushing it up to two. This is a harder goal than SQL Saturday-related goals mostly because paid conferences usually mean travel during the week, and being in management—even as low a level of management as I’m in—makes that a little harder.

Give 6 Webinars

Here’s the big one for me. I have presented two webinars to date. Webinars are an area I have neglected in the past, focusing instead on in-person presentations. Speaking in person is a bit easier than giving a webinar for a few reasons. First, you get to warm up the audience before the session, priming them for a good talk. You also get to see how well the material is going over: if they’re laughing at your jokes and looking interested, you know you’re in the groove; if they seem confused, you can slow down and repeat things.

Without those kinds of audience cues, I want to create more of a subdued webinar experience. My goal is still to be educational and entertaining (to some extent, at least!), but I have stripped out some of the humor. I haven’t figured out how exactly to replace those gaps entirely smoothly, so that’s going to be part of my goal. After all, if I end up being terrible at webinars, I probably won’t hit my goal of 6!

Do A Full-Length, Pictures-Only Talk

I created a couple lightning talks this year which incorporate hand-drawn pictures, stealing the idea from David Neal, who does a much better job at it than I do. Regardless, I like the format and would like to do an entire talk in that style. It’s a bit of a risk, as I really want to do everything—including demos—in hand-drawn picture format. I think it would be an interesting, one-off style.

This isn’t something that I would want to do for every talk, but as I go over the talks I have today, I know that I’m a bit short on graphics, and I know of several places in talks where a simple picture can help me explain the concept better. Maybe this talk would help me get to that point.

Inserting Into Azure Blob Storage

This is a continuation of my Polybase series.

My goal in today’s post is to insert data into Azure Blob Storage using Polybase and see what happens.  My expectation is that it will behave the same as inserting into Hadoop.

Creating A New External Table

Our first step is to create a new external table.  In this case, I want to store information on second basemen.  I already have the data locally in a table called Player.SecondBasemen, and I’d like to move it up to Azure Blob Storage into a folder called ootp.

secondbasemen

To do this, we’ll create a new external table:

USE [OOTP]
GO
CREATE EXTERNAL TABLE [dbo].[SecondBasemenWASB]
(
	[FirstName] [varchar](50) NULL,
	[LastName] [varchar](50) NULL,
	[Age] [int] NULL,
	[Throws] [varchar](5) NULL,
	[Bats] [varchar](5) NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'ootp/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
)
GO

Inserting Data

Inserting data is pretty easy.

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb;

A few seconds later and we have some rows in the table:

rowsinserted

We can check to make sure that everything looks fine:

secondbasemenwasb

I get back the same 777 results.

Looking at Azure Management Studio, I can see that the Polybase engine created a logical folder named ootp and 8 separate files, of which 4 have data.

filesuploaded

One interesting finding is that there is also an empty block blob with the name ootp:

emptyblockblob

The timestamp for it is the same as for the initial upload, so these were created at the same time.

Uploading A Larger Data Set

I decided to take advantage of my Azure Blob Storage connection and create a larger data set.  In this case, I’m going to cross join the second basemen table, creating 777 records for each of the 777 second basemen.  Those will all go into Azure Blob Storage and I’ll be able to see how the Polybase engine deals with larger data sets.  I’m running two tests, one which will generate 777 * 777 = 603,729 records, and one which will generate 777 * 777 * 777 = 469,097,433 records.  Here’s my query for the latter:

INSERT INTO dbo.SecondBasemenWASB
(
	FirstName,
	LastName,
	Age,
	Throws,
	Bats
)
SELECT
	sb.FirstName,
	sb.LastName,
	sb.Age,
	sb.Bats,
	sb.Throws
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb2
	CROSS JOIN Player.SecondBasemen sb3;

Generating and uploading 600K records was pretty fast, but building 469 million records was a little slower.  During this time, the Polybase engine was streaming results out, meaning it didn’t appear to write everything to a local file.  The mpdwsvc service did take up about 2 GB of RAM during this time and was pushing out at about 25-30 Mbps, which is probably my Wifi connection’s limitation.  All in all, it took my laptop approximately 80 minutes to generate and push those results.

The end results were not particularly shocking:

multipleuploads

I’ve sliced the first, second, and third uploads into separate batches to make it clearer.  In the first batch, as I mentioned above, only four of the files ended up with data in them.  In the second batch, each file had somewhere around 2MB of write activity.  For the third batch, each was about 1.5 GB.  I should note that for the second run, the spread between the smallest and largest files was approximately 17.6%, whereas it was 12.3% for the third run.

What About Fixed-Width Lines?

One additional question I have involves whether the process for loading data is round-robin on a row-by-row basis.  My conjecture is that it is not (particularly given that our first example had 4 files with zero records in them!), but I figured I’d create a new table and test.  In this case, I’m using three fixed-width data types and loading 10 million identical records.  I chose to use identical record values to make sure that the text length of each line was exactly the same; the reason is that we’re taking data out of SQL Server (where an int is stored in a 4-byte block) and converting that int to a string (where each numeric value in the int is stored as a one-byte character).  I chose 10 million because I know that’s well above the cutoff point for data to go into each of the eight files, so if there’s special logic to handle tiny row counts, I’d get past it.

CREATE TABLE dbo.LocalFixed
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
);

INSERT INTO dbo.LocalFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT TOP (10000000)
	1,
	0,
	'abcdefghijkl'
FROM Player.SecondBasemen sb
	CROSS JOIN Player.SecondBasemen sb1
	CROSS JOIN Player.SecondBasemen sb2;

From here, we can create another external table and insert our 10 million rows into it:

CREATE EXTERNAL TABLE [dbo].[RemoteFixed]
(
	ID INT NOT NULL,
	SomeNum INT NOT NULL,
	SomeChar CHAR(12) NOT NULL
)
WITH
(
	DATA_SOURCE = [WASBFlights],
	LOCATION = N'fixed/',
	FILE_FORMAT = [CsvFileFormat],
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 5
);

INSERT INTO dbo.RemoteFixed
(
	ID,
	SomeNum,
	SomeChar
)
SELECT
	lf.ID,
	lf.SomeNum,
	lf.SomeChar
FROM dbo.LocalFixed lf;

 

And here are the results in Blob Storage:

fixedwidthfiles

Even when I guarantee that rows are exactly the same size, there are still minor differences:  here it’s 20.48 MB versus 20.02 MB.  This indicates to me that the Polybase file write system is not round robin on a row-by-row basis, as there is a difference of 28,193 records separating the largest from the smallest file (1,262,965 rows with data in the largest file versus 1,234,772 in the smallest file).

Conclusion

Just like the Hadoop scenario, inserting into Azure Blob Storage splits the data into a set of files and uploads those files separately.  There weren’t any surprises here, so you can read the Hadoop version to get the rest of the story.  We did take the opportunity to learn a bit more about the way the Polybase file writer engine operates and can conclusively reject the idea that it evenly distributes records based either on data length or number of rows.

This wraps up my Polybase investigations of Azure Blob Storage for now.  Next up, I’m going to look at Azure SQL Data Warehouse.