Getting Started With Azure Machine Learning

I’m going to be giving a talk on Azure Machine Learning for my employer, so you’ll probably see a few ML-related topics on the blog as I put together some thoughts.

My first post on the topic will be linking to David Crook’s ML talk.  His slides and code are freely available.  David moves a bit quickly in this video, but because he has made everything available, you can work through the problem at your own pace.

Jitter And Color In R

While working through Practical Data Science with R, I picked up on a jitter function in chapter 3 (a free download).  I’m going to explore that jitter function a bit in this post.


If you want to follow along, grab the data set from the authors’ GitHub repo.  The relevant data file is Custdata/custdata.tsv.  Custdata is an artificial data set designed to give a variety of data points, making graphing and analysis a bit easier.  This data set has 1000 observations of 11 variables, and the idea is to plot health insurance likelihood against certain demographic characteristics.  In this case, we’ll look at marital status.
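
If you’d like to run the code below, loading the file looks something like this (a minimal sketch; the path assumes you downloaded the repo into your working directory, so adjust it to wherever you saved the file):

library(ggplot2)
custdata <- read.delim("Custdata/custdata.tsv")	#read.delim defaults to tab-separated values with a header row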

Different Bar Charts

It’s pretty easy to build bar charts in R using ggplot2.  Here’s a simple example using the custdata set:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins))

The resulting bar chart looks like this:


This type of bar chart helps us see overall trends, but makes it difficult to see the details within a marital status type.  To do that, we can create a side-by-side bar chart by changing one little bit of code:  add position="dodge" to the geom_bar function, like so:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge")

The resulting bar chart looks like this:


This gives us a much clearer look at the relative breakdown across all marital status and insurance combinations, but we lose the knowledge we gained looking at the first bar chart.

At this point, I’m going to take a brief aside.


Color selection is vital to graphical design.  We want colors which complement the medium—be that a monitor, a trade magazine, or a newspaper.  We also need to be cognizant that people interpret colors differently:  red-green colorblindness is the most common form, but there are several forms and we want to design with colorblind individuals in mind.  That means that the red-yellow-green combo probably doesn’t work as well as you’d like.

In Cookbook R, there’s a good colorblind-friendly palette which works for us.
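That palette is just a vector of hex codes you can hand to scale_fill_manual or scale_colour_manual; here’s a sketch of it (double-check the hex values against the Cookbook R page itself):

#Colorblind-friendly palette (grey first); Cookbook R also lists a variant that starts with black
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
#Example usage:  ... + scale_fill_manual(values=cbPalette)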

Picking Our Own Colors

In the graphs above, I let R choose the colors for me.  That can work for some scenarios, but graphic designers are going to want to end up choosing their own colors.  Fortunately, doing this in ggplot2 is very easy, especially with factor variables.  All we need to do is add on a scale_fill_manual function.  Here’s our side-by-side bar chart using blue and vermillion:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("#D55E00","#0072B2"))

In this case, I specified hex codes for the colors.  If I wanted to use simpler colors, I could just write “blue” and “red,” but these won’t be quite the same colors.
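If you did want to go the named-color route, the call would look something like this (a quick sketch; the chart below still uses the hex codes):

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("red","blue"))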

Here is the new bar chart with our selected colors:


The image is the same as before, but the new colors make it much easier for a colorblind reader to distinguish the two groups in our chart.

Fill Bars And Jitter

A third type of bar chart is the fill bar.  The fill bar lets us see very clearly the percentage of individuals with health insurance across different marital statuses:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="fill") + scale_fill_manual(values=c("#D55E00","#0072B2"))

Notice that the only thing we did was change the position to “fill” and we have a completely different chart. Here is how the fill bar chart looks:


The problem with a fill bar chart is that we completely lose context of how many people are in each category. The authors re-introduce this concept with the rug: a set of points below each bar that shows relative density. In order to do this, we add on a geom_point function that looks like so:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") + geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) + scale_fill_manual(values=c("#D55E00","#0072B2"))

Here is how the chart looks:


The geom_point function has all of our changes. It starts by moving y down to -0.05, which is far enough down on the plot that we can see those points without them interfering with our bar chart. The size and alpha (transparency) values are set to maximize visibility, especially for the Married option. You can see that the Married option is densest, but there’s still some gap in there. The last option is position_jitter, a function which “jitter[s] points to avoid overplotting.” Basically, without jittering, identical points would simply overlay one another; by jittering the height in this case, we distribute the points across a wider space, letting us see the sheer number of points more clearly than we otherwise could.
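
To see the effect in isolation, here is a toy sketch (not part of the custdata example) plotting ten identical points with and without jittering:

#Ten identical points: without jitter they collapse into a single blob
toy <- data.frame(x=rep(1, 10), y=rep(1, 10))
ggplot(toy, aes(x=x, y=y)) + geom_point(size=3, alpha=0.3)
#With jitter, the ten points spread out and become countable
ggplot(toy, aes(x=x, y=y)) + geom_point(size=3, alpha=0.3, position=position_jitter(w=0.1, h=0.1))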

Jitter gives us two options: width and height. In the example above, we see what it looks like with a height value of 0.01 but no width specified. In the next chart, I’m overlaying different jitter width levels on our plot. You can see that the default is closest to a width of 0.4:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) +
  geom_point(aes(y=0.05), size=0.75, alpha=0.3, position=position_jitter(w=0.4,h=0.01)) +
  geom_point(aes(y=0.25), size=0.75, alpha=0.3, position=position_jitter(w=0.3,h=0.01)) +
  geom_point(aes(y=0.45), size=0.75, alpha=0.3, position=position_jitter(w=0.2,h=0.01)) +
  geom_point(aes(y=0.65), size=0.75, alpha=0.3, position=position_jitter(w=0.1,h=0.01))

Here’s the resulting plot:


Of course, another option we could use would be to print the total number of elements in each section…but that’s for another day.


Different types of bar charts have different advantages and disadvantages.  Each provides us a certain view of the data, but the downside to each view is that we lose some other aspects.  Which graphs we show will depend upon the message we want to send, but it’s important to realize that there are methods to get around some of the deficiencies in certain graphs, such as generating a rug using the jitter function.

Also, be cognizant of color choices.  I’m not always good at that, but it’s something I want to think more about as I visualize more data with R.


Beware Functions In SQL Server

Aren’t Functions A Good Thing?

As developers, we want to modularize code in order to prevent duplication.  There are several laudable reasons to prevent code duplication.  The simplest reason is that we all have better things to do than type the same lines of code over and over.  Putting code into a common module or function means that we make calling that code simpler and we type less.

Another reason to prevent code duplication is to prevent errors.  If we have five versions of a calculation and that calculation later changes, it’s easy to accidentally miss one of those versions and end up with two separate versions of the calculation.  The best case scenario is that this will cause users to doubt the veracity of your data; the worst case scenario is that you will permanently damage your application by unknowingly storing incorrect data that you cannot later fix.

Yet another reason developers want to use functions is to make testing easier.  When a piece of logic exists in one place, we can write a single set of tests to cover that logic.  If you have logic strewn about in many different functions, it makes testing harder for the same reason as above:  you might miss one version of the calculation or operation.

All of this shows that there are definite advantages to writing modular code using functions or other language constructs.

So Why Are They A Bad Thing?

In SQL Server, modularization usually comes in three forms:  stored procedures, functions, and views.  I am a huge fan of stored procedures, am somewhat wary of views, and am strongly anti-function.  The reason comes down to performance.  Stored procedures have no performance overhead compared to ad hoc SQL, and so there is no performance reason not to use them.  Views theoretically should not have any performance impact on an environment, but when you nest views deeply enough or hit complex enough views, the query optimizer can get confused and produce an unnecessarily complex execution plan which performs poorly.  Oftentimes, an easy fix for a poorly-performing query involving a view is to bring the relevant view code in-line.

Functions are like views, except much, much worse.  The problem with functions tends to be the same, regardless of whether you use scalar, multi-statement table-valued, or inline table-valued functions:  they perform poorly, much worse than their alternatives.  If you create a user-defined scalar function and use it in your WHERE clause or in a JOIN condition, you can pretty much guarantee that SQL Server will go through your table row by row, running that function every time.  This is bad enough if your function is a fairly simple calculation, but if you need to look up data from another table, you’re making a separate table join every single time.  Suppose you have 10 million rows in your main table and a function which does a quick lookup against a tiny table with just 30 rows.  To make the example starker, let’s say that the 10 million rows are ordered in such a way that we could profit most from a merge join against the 30-row table.  It would make sense to grab those 30 rows and stream the 10 million rows through, matching up against the relevant lookup row along the way.  But if we put this lookup operation into a function, it gets called once for each of the 10 million rows.  This most likely will not be a merge join and will lead to a huge amount of I/O.  The worst part is that your execution plan will show what one lookup looks like but fail to mention that it’s going to happen 10 million times.  If you don’t understand the performance problems of functions, it’s easy to look at the plan and conclude that the function will actually perform better…until you see the total time elapsed!

A Case Study:  User Total Time

A customer of mine produces software which optimizes employee schedules.  They track shift times and figure out when an employee hits the overtime cutoff, and can track shift data from the past, present, and future.  The way that they did this was with a function called UserTotalTime().  This function called a few scalar functions to check parameters, and then performed a somewhat expensive set of table joins, aggregating the data into a result set with the user ID, time period, and number of hours worked + scheduled, leading to a determination of whether that employee has overtime hours or not.

When you look at one user, or even all users over a single time period, the process doesn’t perform that badly, and so this function was integrated into more and more code and more and more complex procedures.  At that point, they started to notice the solution beginning to burst at the seams.  Queries for larger customers were taking an unacceptably long time and they weren’t sure what they could do to improve the situation.  It got to the point where they started deleting some data to limit how many rows were in the tables; that provided a temporary boost to performance, but wasn’t a good long-term strategy because it limited future analytics.

What To Do?

Taking a look at the function, the first thing I realized was that this generally did not need to be a function, nor did it need to be as complex as it was.  The parameter-checking scalar functions could be simplified away, as could part of the main calculation.  At that point, I could inline the remaining code and modify procedures to build the result set directly instead of calling a function.

In this case, however, I went a step further.  This customer’s database is really closer to a warehouse than a true transactional system:  the read-to-write ratio is in the thousands or tens of thousands; people rarely update data, but they definitely read it a lot.  As a result, I ended up creating a brand new reporting table which holds the aggregated results of this query, updating it as part of an ETL process which runs whenever rows get updated, as well as periodically (just in case somebody modifies data in a non-standard manner).  This works well for their environment, although it certainly isn’t a solution to every function-based performance problem.

What Can You Do?

The easiest way to fix performance problems with functions is to remove the function and bring the function’s code inline.  This doesn’t feel very nice for a developer, as it means copying code and violates the Don’t Repeat Yourself mantra, but if you’re running into performance problems, a little repetition is a small cost to pay for code which performs up to business needs.

Bonus Answer:  APPLY Yourself

Another tool to help get rid of pesky functions is the APPLY operator.  I just happen to have a presentation on the topic.  The great thing about the APPLY operator is that it operates like a table-valued function, but because the code is inline, you don’t get the performance problems that you would with a function.

Stay Tuned

In tomorrow’s blog post, I’m going to circle back around to UserTotalTime, showing why a function appears useful, how it can perform poorly, and how we can deconstruct a function and get back an acceptable level of performance.

Using SQL Server R Services To Connect R To SQL Server

Yesterday, we looked at ways to connect R to SQL Server using RODBC.  Today, we’re going to take a look at how SQL Server 2016 is going to make R processing a little bit easier.

Warning:  Time-Sensitive Material

Hey, so this blog post is written for SQL Server 2016 CTP 3.  If you’re a time traveler from the distant future, you might want to check to see if there’s an easier way to do this, and that everything still works as expected.  This post went live on November 11th, 2015.

Why Use SQL Server R Services?

Hey, we can use R to connect to SQL Server, so why mess with a good thing?  Well, there are a few reasons to look into SQL Server R Services (AKA, Don’t-Call-It-SSRS):

Better R

R works entirely in memory, is single-threaded, is not particularly optimized, and has memory leaks.  R is a fantastic language for modeling and working with relatively small data sets, but open source R is not something you really want to run in a production environment.  Microsoft’s play with Revolution Analytics is to make Revolution R Enterprise a viable server technology able to run in production.  Will that be the case in SQL Server 2016?  Well, ask me in a year…

Keep Data Local

In addition to improving R’s stability, the new method will allow us to keep data local to the SQL Server instance.  That way, we can do all of the processing on a beefy server designed for data processing and send a small result set or image back to the user, saving network traffic.

Integrate With SSRS (No, The Other SSRS)

SQL Server R Services lets you return images, which you can easily display in SQL Server Reporting Services.  This extends SSRS in nice ways, as R has a fantastic set of packages built around visualization, way beyond anything base SSRS ever had.  We won’t show that in this post, but I do plan on getting around to it later.

Let’s Do Something!

In yesterday’s post, we connected to SQL Server via RStudio.  Today, we’re going to do it the new way, using a bit of new functionality:  sp_execute_external_script.  For SQL Server 2016, the only language that is available with sp_execute_external_script is R.  At PASS Summit, the product manager stated outright that there will be additional languages, specifying Python.

The sp_execute_external_script command has a number of parameters available.  Here’s a fake call with those parameters:

execute sp_execute_external_script
	@language = 'R',
	@script = N'[some R script]',
	@input_data_1 = N'SELECT * FROM SomeTable',
	@input_data_1_name = N'rDataFrameName',		--data frame name in R
	@output_data_1_name = N'rOutputDataFrameName',	--data frame returned to SQL Server
	@params = N'@model varbinary(max)',		--model from R.  Declared sort of like sp_executesql
	@model = @iris_model				--Declared variable used in query
	WITH RESULT SETS ([result set]);		--needed if you want named, typed columns in the result set

So we can see a few parameters here. First is @language, which needs to be “R” to run an R script. Speaking of scripts, the R script itself is the next parameter. We populate the input data frame with @input_data_1, and we can either give that data frame a name (using @input_data_1_name) or take the default of InputDataSet. Once the script finishes, we’re expecting an output data set, so we can name it using the @output_data_1_name parameter or just call our output frame OutputDataSet. If we have any parameters we want to use, we can specify them in @params, sort of like sp_executesql’s @params parameter. Then, we specify the values of those parameters. Finally, if we want the result set to come back with named, typed columns, we add a WITH RESULT SETS clause.

Okay, that’s a sample call, but let’s look at a somewhat-realistic case. Using yesterday’s data set, let’s build a model on pincp versus agep.  Note that pincp is PUMS Persons Income.  In honor of our resident Penguatroll’s recent birthday, I’m going to run our model and predict income based on a certain age which I’m sure has absolutely no relevance here.

execute sp_execute_external_script
	@language = N'R',
	@script = N' 
InputDataSet$sex = as.factor(ifelse(InputDataSet$sex==1, ''M'', ''F''))
InputDataSet$sex = relevel(InputDataSet$sex, ''M'')
cowmap <- c("Employee of a private for-profit", "Private not-for-profit employee", "Local government employee", "State government employee", "Federal government employee", "Self-employed not incorporated", "Self-employed incorporated")
InputDataSet$cow = as.factor(cowmap[InputDataSet$cow])
InputDataSet$cow = relevel(InputDataSet$cow, cowmap[1])
schlmap = c(rep("no high school diploma",15), "Regular high school diploma", "GED or alternative credential", "some college credit, no degree", "some college credit, no degree", "Associate''s degree", "Bachelor''s degree", "Master''s degree", "Professional degree", "Doctorate degree")
InputDataSet$schl = as.factor(schlmap[InputDataSet$schl])
InputDataSet$schl = relevel(InputDataSet$schl, schlmap[1])

model <- lm(pincp~agep, data=InputDataSet)
OutputDataSet <- data.frame(predict(model,data.frame(agep=33),interval="predict"))
	',
	@input_data_1 = N'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);';

The result here is $27,488…but R^2 is so low that the 95% prediction interval runs from -$50,664 through $105,640…so yeah, not very meaningful.


For right now, there are some restrictions you need to be aware of:

  1. There are a number of SQL Server data types which the R integration does not support, including time, datetime2, nchar, nvarchar(!), hierarchyid, geometry, geography, and sql_variant
  2. The install.packages command in R is blocked by default.  There are ways to install packages, but you’ll want an administrator to do that rather than doing it in an R script.
  3. For CTP 3, we get up to 20% of available machine memory for R.  I’m hoping that we’ll get to see Resource Governor pools.


I love the fact that we’re getting integration with R in SQL Server.  I’m not loving the implementation, honestly.  I think it’s a pretty complex stored procedure which will turn off some people less familiar with SQL Server.  It’s also impossible to debug R scripts inside the procedure, meaning that you’ll need to debug them outside of it, which gives you the opportunity to introduce mistakes when you copy the script back in.

So, what’s the major use of SQL Server R Services?  Early on, I see batch processing as the main driver here.  The whole point of getting involved with Revolution R is to create server-quality R, so imagine a SQL Agent job which runs this procedure once a night against some raw data set.  The R job could build a model, process that data, and return a result set.  You take that result set and feed it into a table for reporting purposes.  I’d like to see more uses, but this is probably the first one we’ll see in the wild.


Using RODBC To Connect R To SQL Server

So far in my series on R, I’ve only looked at reading data from flat files.  Flat files are very useful constructs and are nice ways to get data sets to people with arbitrary end systems, but there’s this awesome thing called a database which allows you to store data, and I want to get on that bandwagon.

The purpose of this post is to walk you through using RODBC to connect a Linux system (I’m using Elementary OS, but any Ubuntu-based distro will follow from these basic steps) to a SQL Server installation.  In this case, my SQL Server instance is 2016 CTP 3, but these instructions should work for any version of SQL Server 2005 or later.

Warning:  Potentially Time-Sensitive Material

Hey, so this blog post is written for specific installations of software and worked when the blog post went live.  If you’re a time traveler from the distant future, you might want to double-check that the installation and configuration instructions are still valid.  Maybe you’re lucky and installation got way easier…but given that most of my resources are dated in 2013, I doubt it…  This post went live on November 10th, 2015.

Check Firewall and Network Settings

Before we begin, make sure that your R machine can see your SQL Server machine.  My SQL Server box is named SQL2016, and both machines are VMs on the same PC, and they have IP addresses on the same subnet.  I can ping SQL2016 from my Linux box.  I also added a firewall rule opening up TCP port 1433 on the SQL2016 box.  I’m not using a named instance of SQL Server, and I am okay with using the default port here.  You may not be okay with that, and I’m okay with you not being okay; just make sure that the machines you want to talk are allowed to talk and we’ll call it square.

Configure, Make, Make Install, Rinse, Repeat

The first thing to do is read.  Zhixian’s Tech Blog has some utterly fantastic blog posts on the topic:  install unixODBC, install FreeTDS, and configure FreeTDS.  The note on the unixODBC blog post stating that Ubuntu’s version of unixODBC is out of date is still valid—the version on the website is from 2013, and the packaged Ubuntu version is from 2012.

Here are the basic steps, skimmed from those three blog posts:

  1. Download unixODBC and run configure, make, & make install.
  2. touch /etc/odbc.ini & /etc/odbcinst.ini.  In Zhixian’s blog posts, you’re instructed to create files in /usr/local/etc.  You can do that as well, but in my case, it turns out that RStudio was reading the /etc/ files, and I’m on a VM where it’s okay to have global DSNs.
  3. Download FreeTDS and run configure, make, & make install.
  4. Run sudo apt-get install tdsodbc.  This was not included in the blog posts, but I needed to do that to get the TDS ODBC drivers.
  5. Configure /etc/odbcinst.ini, /etc/odbc.ini, and /usr/local/etc/freetds.conf.  Here are my settings, again, pointing out that the virtual machine name is sql2016.


[sql2016]
APP = unixodbc
Description = Scratch on sql2016
Driver = FreeTDS
Server = sql2016
Database = Scratch
Port = 1433
TDS Version = 8.0

ODBC.INI allows you to declare DSNs.  In this case, I’m using the FreeTDS driver to connect to sql2016’s Scratch database.  It is vital that you use “TDS Version = 8.0” in the DSN entry; otherwise, you will not be able to connect to the SQL 2016 instance.  From what I gather, you might not be able to connect to anything later than 2008 if you don’t use TDS Version 8.0.


[FreeTDS]
Description = FreeTDS unixODBC driver
Driver = /usr/lib/x86_64-linux-gnu/odbc/
Setup = /usr/lib/x86_64-linux-gnu/odbc/

ODBCInst explains where FreeTDS lives.  We declared a FreeTDS entry in ODBC.INI, and this is the underlying driver.


[global]
tds version = 8.0

[sql2016]
host = sql2016
port = 1433
tds version = 8.0

FreeTDS also has its own configuration file.  In this config file, I changed the TDS version from 4.2 to 8.0.  I also added a DSN for sql2016.  It wasn’t absolutely necessary to change the global entry, but honestly, I’m not connecting to any SQL 2000 boxes, so setting it to 8.0 by default saves me heartache later on.

Testing FreeTDS

FreeTDS has a couple different tools you can use to test connections:  tsql and isql.


The tsql command (not to be confused with Microsoft’s Transact-SQL language) is a basic testing tool.  If I want to test that my Linux box can connect to the SQL Server box—that firewalls and network rules aren’t getting in the way, that the SQL Server instance is up, that kind of thing—I can use tsql to do a quick direct connection:

tsql -S sql2016:1433 -D Scratch -U RUser

This particular call tries to connect to a server named sql2016 using the RUser login and the Scratch database context. Note that with tsql, I needed to specify the port number as part of the server name. When I specified -P 1433 and stripped the :1433 off of the server name, I wasn’t able to connect; most likely that’s because tsql’s port flag is the lowercase -p, while the uppercase -P specifies the password.


In contrast to tsql, isql lets you test a connection using a DSN.

isql sql2016 RUser <<PASSWORD>>

If you can connect to the instance using tsql and isql, you know everything’s set up correctly.

Connecting Via R

Up to this point, we’ve been working on plumbing issues.  Now it’s time to connect via RStudio and try to do some data analysis.

The first thing we need to do is install RODBC.  You can do this one of two ways:  install the RODBC package within R, or install the CRAN package using apt-get.

If you want to install the RODBC package within R, the syntax is the same as any other installation:
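
install.packages("RODBC")	#installs from CRAN; you may be prompted to pick a mirror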


My preference is actually to run apt-get:

sudo apt-get install r-cran-rodbc

Note that I have an apt source pointing to trusty, which I set up in the first post in this series.

Regardless of which method you use to install the package, we load it the same way: library(RODBC). Once the library is loaded, connecting to SQL Server is as simple as referencing a DSN and running a SQL query.  In this case, I’m taking 2011 ACS 1-year PUMS data from the US Census Bureau.  You can get all kinds of PUMS data from the Census Bureau’s website.  I downloaded the CSV data and loaded it into SQL Server.


dbhandle <- odbcConnect("sql2016", "RUser", "<<PASSWORD>>")
res <- sqlQuery(dbhandle, 'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);')

We create dbhandle, which is how we connect to the sql2016 instance via the sql2016 DSN.  After that connection is made, we execute a SQL query and store its result set into res.
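
One housekeeping note: RODBC keeps the ODBC connection open until you close it, so once you’ve pulled everything you need, it’s worth closing the handle:

head(res)		#quick sanity check on the result set
odbcClose(dbhandle)	#release the ODBC connection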

From here, we can do any analysis we want on the data set.  The analysis portion builds from chapter 2 of Practical Data Science with R, specifically with regard to data cleanup.  I’m going to make the data a little easier to read, and then we’ll build a model and grab summary information off of that model.

#Install RODBC -- sudo apt-get install r-cran-rodbc
library(RODBC)

dbhandle <- odbcConnect("sql2016", "RUser", "<<PASSWORD>>")
res <- sqlQuery(dbhandle, 'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);')

#Practical Data Science with R, chapter 2
res$sex = as.factor(ifelse(res$sex==1, 'M', 'F'))
res$sex = relevel(res$sex, 'M')
cowmap <- c("Employee of a private for-profit", "Private not-for-profit employee", "Local government employee", "State government employee", "Federal government employee", "Self-employed not incorporated", "Self-employed incorporated")
res$cow = as.factor(cowmap[res$cow])
res$cow = relevel(res$cow, cowmap[1])
schlmap = c(rep("no high school diploma",15), "Regular high school diploma", "GED or alternative credential", "some college credit, no degree", "some college credit, no degree", "Associate's degree", "Bachelor's degree", "Master's degree", "Professional degree", "Doctorate degree")
res$schl = as.factor(schlmap[res$schl])
res$schl = relevel(res$schl, schlmap[1])


#Sign of a bad modeler:  lots of commented-out models...
#model <- lm(pincp~agep, data=res)
#model <- lm(pincp~sex, data=res)
#model <- lm(pincp~cow, data=res)
#model <- lm(pincp~schl, data=res)
model <- lm(pincp~., data=res)
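
To get the model’s fit statistics, including the R^2 figure mentioned below, run summary against it:

summary(model)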

The findings here aren’t necessarily all that important, although if you’re curious, an OLS model of age, sex, employment sector, and level of education versus income has an R^2 of approximately 0.31 in my data sample.  Your data sample will probably differ.


Getting a Linux machine to talk to a SQL Server instance is harder than it should be.  Yes, Microsoft has a Linux ODBC driver and some easy setup instructions…if you’re using Red Hat or SuSE.  Hopefully this helps you get connected.

Next up in the series, I’m going to show how SQL Server 2016 changes this story.


R And The Housing Market

Last month, at SQL Saturday Charlotte, I had the opportunity to meet Kiran Math, who gave a great talk introducing R to SQL developers.  His talk was the basis for my blog post on dplyr and TidyR, and I want to walk through his housing market example in some detail here.  A lot of this will follow the code he made available in his talk, but I also want to take this opportunity to move in a couple of directions he couldn’t due to time constraints.

Getting A Data Set

Basically, Kiran has a co-worker who wanted to move from one ZIP code (29605) to another (29615), and that co-worker wanted to figure out how much he would likely get for his house when selling and how much a similarly-sized house would cost in the new ZIP code.  To figure this out, he pulled together data for all home sales in those two ZIP codes over a period of several months and saved it.  For this post, I’m going to use his CSV file, which has five fields:  ID (auto-incrementing identity integer), SaleDate (date type), Area (number of square feet), ZIP_29615 (how much it sold for in ZIP code 29615, in thousands of dollars), and ZIP_29605 (how much it sold for in ZIP code 29605, in thousands of dollars).  This isn’t a particularly good data structure, but developers will be developers…

Improving The Data Set

The first thing we need to do is import that data into R.  Grab Kiran’s file and make sure you have dplyr, tidyr, and ggplot2 installed:
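
install.packages(c("dplyr", "tidyr", "ggplot2"))	#skip any packages you already have installed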



Now that we have those packages installed, let’s load the data set from CSV. I’ll assume that you’re running this in Linux and the CSV is in your home directory’s Downloads folder:

path2csv <- file.path('~/Downloads/vHomeSales.csv')
dat <- read.csv(path2csv, header = TRUE, stringsAsFactors = FALSE)	#keep SaleDate as a character column rather than a factor


Right off the bat, we can see some interesting things. First of all, looking at the data set, we see a lot of NA records.  This is because the initial data set is in a wide, pivoted shape:  an individual sale takes place in one of the two ZIP codes, but we have a separate column for each ZIP code.  This means that only one of the two ZIP code columns will ever have a value.

> head(dat)
  ID SaleDate Area ZIP_29615 ZIP_29605
1  1 6/3/2015 1300        NA        62
2  2 6/3/2015 1310        NA        67
3  3 6/3/2015 1320        NA        62
4  4 6/3/2015 1350        NA        57
5  5 6/3/2015 1370        NA        72
6  6 6/3/2015 1300        NA        62
> str(dat)
'data.frame':	432 obs. of  5 variables:
 $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ SaleDate : chr  "6/3/2015" "6/3/2015" "6/3/2015" "6/3/2015" ...
 $ Area     : int  1300 1310 1320 1350 1370 1300 1400 1400 1440 1450 ...
 $ ZIP_29615: int  NA NA NA NA NA NA 158 NA 160 NA ...
 $ ZIP_29605: int  62 67 62 57 72 62 NA 82 NA 85 ...

Also, when running str, we see that SaleDate is a character type, based on how it was read in.  So let’s fix that now:

dat$SaleDate <- as.Date(dat$SaleDate, format="%m/%d/%Y")

Remember that last time we tried this, I needed to do some fancy formatting to get dates to load.  In Kiran’s code, he does not specify a format, but I noticed that the dates looked a bit off when I ran it, so I am specifying a format.  Kiran’s code also has us converting the ZIP Code data to integers, but as you see above, when I ran the code in RStudio, the ZIP Code fields were already integers, so I’m skipping that.

From here, we get into using dplyr and TidyR.  My post on using dplyr and TidyR should serve as a useful introduction to these two tools.  The first thing we want to do is grab a subset of variables.  In this case, we don’t care about the ID.  We also want to get all sales since January 1st of 2015, but don’t need to keep the sale date itself.  dplyr has filter and select functions which work great for this scenario:

tDat <- dat %>%
       filter(SaleDate > as.Date('2015-01-01')) %>%
       select(Area, ZIP_29615, ZIP_29605)	#keep just the three columns we care about

Now that we have the three variables we want, it’s time to reshape the data set. We really want a single Zipcode variable (along with Area and SalePrice) instead of one column per ZIP code. Gathering (unpivoting) this data set will get rid of the NA values quite nicely and make it easier for us to chart, plot, and manipulate the data. We’re going to gather the two ZIP code columns together and leave the other variable alone:

gDat <- gather(tDat, Zipcode, SalePrice, 2:3, na.rm=TRUE)

This leaves us an interesting intermediate result set:

> head(gDat)
   Area   Zipcode SalePrice
7  1400 ZIP_29615       158
9  1440 ZIP_29615       160
11 1460 ZIP_29615       160
13 1460 ZIP_29615       157
15 1560 ZIP_29615       168
17 1580 ZIP_29615       173
> str(gDat)
'data.frame':	432 obs. of  3 variables:
 $ Area     : int  1400 1440 1460 1460 1560 1580 1600 1620 1700 1800 ...
 $ Zipcode  : Factor w/ 2 levels "ZIP_29615","ZIP_29605": 1 1 1 1 1 1 1 1 1 1 ...
 $ SalePrice: int  158 160 160 157 168 173 168 173 188 198 ...

Basically, we moved the variable name (ZIP_29605 or ZIP_29615) into a variable called Zipcode and took the value of that variable and moved it into SalePrice. We left Area alone, so it’s the same as in tDat. But here’s the thing about the remaining, gathered data set: the ZIP codes start with ZIP_, and we’d rather get rid of that and change it to “29615” or “29605.” Also, we want to make sure that Zipcode ends up as a factor: 29615 and 29605 are discrete labels, not continuous variables. The following bits of code do this:

mDat <- mutate(gDat,Zipcode=extract_numeric(Zipcode))
mDat$Zipcode <- as.factor(mDat$Zipcode)

Playing With The Data

At this point, we have our final data set, with ZIP Code, square footage, and sale price (in thousands of dollars).

> head(mDat)
  Area Zipcode SalePrice
1 1400   29615       158
2 1440   29615       160
3 1460   29615       160
4 1460   29615       157
5 1560   29615       168
6 1580   29615       173
> str(mDat)
'data.frame':	432 obs. of  3 variables:
 $ Area     : int  1400 1440 1460 1460 1560 1580 1600 1620 1700 1800 ...
 $ Zipcode  : Factor w/ 2 levels "29605","29615": 2 2 2 2 2 2 2 2 2 2 ...
 $ SalePrice: int  158 160 160 157 168 173 168 173 188 198 ...

The first thing you typically want to do with a data set is plot it.  Our minds are pretty good at making sense of visualizations, and we can use a basic plot to see if there are any trends we can discern.
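
A basic ggplot2 scatter plot does the job; this is a sketch rather than Kiran’s exact plotting code, showing sale price against square footage with points colored by ZIP code:

ggplot(mDat, aes(x=Area, y=SalePrice, color=Zipcode)) + geom_point()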



We can definitely see a trend here.  It looks like the houses in ZIP Code 29615 are more expensive than similarly-sized houses in 29605, except for one red outlier, which might be miscoded data or could just be somebody who got a great deal selling his house.

Modeling The Data

So now that we see a bit of a trend, the next question is to ask, how much money could we expect to get if we sold a house in ZIP Code 29605?  To do that, we’re going to build a linear model using Ordinary Least Squares.  R has a great built-in function for that, namely lm.

dat29605 <- filter(mDat, Zipcode == 29605)
model_sale <- lm(SalePrice~Area,data=dat29605)
predict(model_sale,data.frame(Area=3000),interval="predict" )

The above code does three things. First, it filters out everything whose ZIP Code is not 29605, leaving the relevant data set. Next, we build a linear model in which we predict sale price based on the house’s square footage, using dat29605 as our data set. Finally, we make a prediction based off of this model, asking how much we might expect to get for a house whose size is 3000 square feet. The result:

> predict(model_sale,data.frame(Area=3000),interval="predict" )
       fit      lwr      upr
1 198.0433 146.3831 249.7035

The mean point is approximately $198,000, and a 95% prediction interval gives us a range of $146,000 through $249,000.  That’s not an extremely precise estimate, but it gives us a starting point for negotiations.

Similarly, let’s say that we want to buy a house in ZIP Code 29615 which is the same size.  Here’s what we come up with:

dat29615 <- filter(mDat, Zipcode == 29615)
model_sale_29615 <- lm(SalePrice~Area,data=dat29615)
predict(model_sale_29615,data.frame(Area=3000),interval="predict" )

And now the prediction:

> predict(model_sale_29615,data.frame(Area=3000),interval="predict" )
       fit      lwr      upr
1 315.4288 267.0714 363.7861

So we can sell a house in ZIP Code 29605 for about $198K (give or take $50K) and purchase an equally-sized house in ZIP Code 29615 for about $315K (give or take $40K).

If we want more information on the linear model results, we can use the built-in summary() function.  Running summary against each model gives us the following results:


> summary(model_sale)

Call:
lm(formula = SalePrice ~ Area, data = dat29605)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.180 -18.361   1.147  19.668 108.437 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -34.356979   4.938838  -6.956 4.14e-11 ***
Area          0.077467   0.001573  49.251  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.15 on 215 degrees of freedom
Multiple R-squared:  0.9186,	Adjusted R-squared:  0.9182 
F-statistic:  2426 on 1 and 215 DF,  p-value: < 2.2e-16


> summary(model_sale_29615)

Call:
lm(formula = SalePrice ~ Area, data = dat29615)

Residuals:
   Min     1Q Median     3Q    Max 
-67.45 -19.12   1.39  19.05  71.78 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.197209   5.240978   1.564    0.119    
Area        0.102411   0.001561  65.608   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.47 on 213 degrees of freedom
Multiple R-squared:  0.9528,	Adjusted R-squared:  0.9526 
F-statistic:  4304 on 1 and 213 DF,  p-value: < 2.2e-16

Both of these models show that there is a strong relationship between Area and SalePrice.  Intuitively, we understand this well:  bigger houses will, all other things being equal, sell for more money.  In this case, differences in size accounted for 91% of the variance in sale price in ZIP Code 29605 and 95% in ZIP Code 29615.  This says that, within a ZIP Code, size is the primary consideration.  So how about a model which combines the two ZIP Codes together?

model_sale_all <- lm(SalePrice~Area,data=mDat)

When we look at the linear model results, we get:

> summary(model_sale_all)

Call:
lm(formula = SalePrice ~ Area, data = mDat)

Residuals:
     Min       1Q   Median       3Q      Max 
-143.134  -62.336    2.531   62.334  138.647 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.296646   9.303846  -3.041   0.0025 ** 
Area          0.095344   0.002863  33.306   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 65.85 on 430 degrees of freedom
Multiple R-squared:  0.7207,	Adjusted R-squared:   0.72 
F-statistic:  1109 on 1 and 430 DF,  p-value: < 2.2e-16

When looking at both data sets together, square footage is a dominant factor, explaining 72% of the variance in price.  But notice that we went from explaining 90-95% of the variance down to 72%.  This tells us that there are other factors at play here.  We can speculate on some of these factors—maybe better schools, lower tax rates, superior amenities, or maybe ZIP Code 29605 is full of jerks and nobody wants to live near them.
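
One quick way to see how much the ZIP code itself soaks up is to add Zipcode to the regression; this wasn’t part of Kiran’s walkthrough, just a sketch to illustrate the point:

model_sale_zip <- lm(SalePrice~Area+Zipcode, data=mDat)
summary(model_sale_zip)	#compare its R-squared against the Area-only model above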

Wrapping Up

Thanks again to Kiran Math for the work he put into this introduction to R.  You can grab all of his code, as well as PowerPoint slides, at the link above.

Visual Studio Database Test Extensions

I introduced a new project last night:  Visual Studio Database Test Extensions, or VSDBTestExtensions.  You can get the code from GitHub.

What Is This?

The short version is that this project makes Visual Studio database test projects a little bit better by making it easier to compare result sets.  This is very helpful when refactoring code, testing different versions of procedures for performance problems, or testing procedures on databases in which the underlying data changes regularly and you can’t hard-code scalar values or use the sadly terrible checksum test option.

How Do I Use It?

Installation instructions are available in the readme file.  Setting up database tests is pretty straightforward, and you’ll know you’ve succeeded if you see a few extra options in Visual Studio:


You need to have SQL Server Data Tools installed on your machine as well as Visual Studio in order for this to work.

How Do I Contribute?

This code is open source and I accept pull requests.  I’ll clean up the code, tidy up the documentation, and try to make it a bit nicer over the next few days, but I want to get the product out right now, even if it isn’t quite perfect.  I consider it practically feature-complete, at least for version 1.0.  I’ll gladly accept questions and feedback on additional tests to add, and there are a couple of issues I need to fix.  I’m also going to look at a way of making it a little easier to install, and I probably need to put together a real signed certificate.