So far in my series on R, I’ve only looked at reading data from flat files. Flat files are very useful constructs and are nice ways to get data sets to people with arbitrary end systems, but there’s this awesome thing called a database which allows you to store data, and I want to get on that bandwagon.
The purpose of this post is to walk you through using RODBC to connect a Linux system (I’m using Elementary OS, but any Ubuntu-based distro will follow from these basic steps) to a SQL Server installation. In this case, my SQL Server instance is 2016 CTP 3, but these instructions should work for any version of SQL Server 2005 or later.
Warning: Potentially Time-Sensitive Material
Hey, so this blog post is written for specific installations of software and worked when the blog post went live. If you’re a time traveler from the distant future, you might want to double-check that the installation and configuration instructions are still valid. Maybe you’re lucky and installation got way easier…but given that most of my resources are dated in 2013, I doubt it… This post went live on November 10th, 2015.
Check Firewall and Network Settings
Before we begin, make sure that your R machine can see your SQL Server machine. My SQL Server box is named SQL2016, and both machines are VMs on the same PC, and they have IP addresses on the same subnet. I can ping SQL2016 from my Linux box. I also added a firewall rule opening up TCP port 1433 on the SQL2016 box. I’m not using a named instance of SQL Server, and I am okay with using the default port here. You may not be okay with that, and I’m okay with you not being okay; just make sure that the machines you want to talk are allowed to talk and we’ll call it square.
Configure, Make, Make Install, Rinse, Repeat
The first thing to do is read. Zhixian’s Tech Blog has some utterly fantastic blog posts on the topic: install unixODBC, install FreeTDS, and configure FreeTDS. The note on the unixODBC blog post stating that Ubuntu’s version of unixODBC is out of date is still valid—the version on the website is from 2013, and the packaged Ubuntu version is from 2012.
Here are the basic steps, skimmed from those three blog posts:
- Download unixODBC and run configure, make, & make install.
- touch /etc/odbc.ini & /etc/odbcinst.ini. In Zhixian’s blog posts, you’re instructed to create files in /usr/local/etc. You can do that as well, but in my case, it turns out that RStudio was reading the /etc/ files, and I’m on a VM where it’s okay to have global DSNs.
- Download FreeTDS and run configure, make, & make install.
- Run sudo apt-get install tdsodbc. This was not included in the blog posts, but I needed to do that to get libtdsodbc.so, the TDS ODBC drivers.
- Configure /etc/odbcinst.ini, /etc/odbc.ini, and /usr/local/etc/freetds.conf. Here are my settings, again, pointing out that the virtual machine name is sql2016.
[sql2016] APP = unixodbc Description = Scratch on sql2016 Driver = FreeTDS Server = sql2016 Database = Scratch Port = 1433 TDS Version = 8.0
ODBC.INI allows you to declare DSNs. In this case, I’m using the FreeTDS driver to connect to sql2016’s Scratch database. It is vital that you use “TDS Version = 8.0” in the DSN entry; otherwise, you will not be able to connect to the SQL 2016 instance. From what I gather, you might not be able to connect to anything later than 2008 if you don’t use TDS Version 8.0.
[FreeTDS] Description = FreeTDS unixODBC driver Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
ODBCInst explains where FreeTDS lives. We declared a FreeTDS entry in ODBC.INI, and this is the underlying driver.
[global] tds version = 8.0 [sql2016] host = sql2016 port = 1433 tds version = 8.0
FreeTDS also has its own configuration file. In this config file, I changed the TDS version from 4.2 to 8.0. I also added a DSN for sql2016. It wasn’t absolutely necessary to change the global entry, but honestly, I’m not connecting to any SQL 2000 boxes, so setting it to 8.0 by default saves me heartache later on.
FreeTDS has a couple different tools you can use to test connections: tsql and isql.
The tsql command (not to be confused with Microsoft’s Transact-SQL language) is a basic testing tool. If I want to test that my Linux box can connect to the SQL Server box—that firewalls and network rules aren’t getting in the way, that the SQL Server instance is up, that kind of thing—I can use tsql to do a quick direct connection:
tsql -S sql2016:1433 -D Scratch -U RUser
This particular call tries to connect to a server named sql2016 using the RUser login, and the Scratch database context. Note that with tsql, I needed to specify the port number. I’m not sure if this was a SQL Server 2016 thing or what, but when I specified -P 1433 and stripped the :1433 off of the server name, I wasn’t able to connect.
In contrast to tsql, isql lets you test a connection using a DSN.
isql sql2016 RUser <<PASSWORD>>
If you can connect to the instance using tsql and isql, you know everything’s set up correctly.
Connecting Via R
Up to this point, we’ve been working on plumbing issues. Now it’s time to connect via RStudio and try to do some data analysis.
The first thing we need to do is install RODBC. You can do this one of two ways: install the RODBC package within R, or install the CRAN package using apt-get.
If you want to install the RODBC package within R, the syntax is the same as any other installation:
My preference is actually to run apt-get:
sudo apt-get install r-cran-odbc
Regardless of which method you use to install the package, we load the package the same way: library(RODBC). One the library is loaded, connecting to SQL Server is as simple as referencing a DSN and running a SQL query. In this case, I’m taking 2011 ACS 1-year PUMS data from the US Census Bureau. You can get all kinds of PUMS data from the Census Bureau’s website. I downloaded the CSV data and loaded it into SQL Server.
library(RODBC) dbhandle <- odbcConnect("sql2016", "RUser", "<<PASSWORD>>") res <- sqlQuery(dbhandle, 'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);')
We create dbhandle, which is how we connect to the sql2016 instance via the sql2016 DSN. After that connection is made, we execute a SQL query and store its result set into res.
From here, we can do any analysis we want on the data set. The analysis portion builds from chapter 2 of Practical Data Science with R, specifically with regard to data cleanup. I’m going to make the data a little easier to read, and then we’ll build a model and grab summary information off of that model.
#Install r-cran-odbc -- sudo apt-get install r-cran-odbc library(RODBC) dbhandle <- odbcConnect("sql2016", "RUser", "<<PASSWORD>>") res <- sqlQuery(dbhandle, 'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);') #Practical Data Science with R, chapter 2 res$sex = as.factor(ifelse(res$sex==1, 'M', 'F')) res$sex = relevel(res$sex, 'M') cowmap <- c("Employee of a private for-profit", "Private not-for-profit employee", "Local government employee", "State government employee", "Federal government employee", "Self-employed not incorporated", "Self-employed incorporated") res$cow = as.factor(cowmap[res$cow]) res$cow = relevel(res$cow, cowmap) schlmap = c(rep("no high school diploma",15), "Regular high school diploma", "GED or alternative credential", "some college credit, no degree", "some college credit, no degree", "Associate's degree", "Bachelor's degree", "Master's degree", "Professional degree", "Doctorate degree") res$schl = as.factor(schlmap[res$schl]) res$schl = relevel(res$schl, schlmap) summary(res$cow) str(res) #Sign of a bad modeler: lots of commented-out models... #model <- lm(pincp~agep, data=res) #model <- lm(pincp~sex, data=res) #model <- lm(pincp~cow, data=res) #model <- lm(pincp~schl, data=res) model <- lm(pincp~., data=res) summary(model) plot(model)
The findings here aren’t necessarily all that important, although if you’re curious, an OLS model of age, sex, employment sector, and level of education versus income has an R^2 of approximately 0.31 in my data sample. Your data sample will probably differ.
Getting a Linux machine to talk to a SQL Server instance is harder than it should be. Yes, Microsoft has a Linux ODBC driver and some easy setup instructions…if you’re using Red Hat or SuSE. Hopefully this helps you get connected.
Next up in the series, I’m going to show how SQL Server 2016 changes this story.
- Zhixian’s Tech Blog was an invaluable resource for this. Definitely read the three-part series (part 1, part 2, part 3).
- Revolution R has another method for doing some of this work. The end instructions look pretty similar.
- My configuration files were based off of this StackOverflow post. Note that my names were slightly different.
- If you don’t have a copy of Practical Data Science with R, get it. It’s one of my primary tools for learning the language.