Yesterday, we looked at ways to connect R to SQL Server using RODBC. Today, we’re going to take a look at how SQL Server 2016 is going to make R processing a little bit easier.
Warning: Time-Sensitive Material
Hey, so this blog post is written for SQL Server 2016 CTP 3. If you’re a time traveler from the distant future, you might want to check to see if there’s an easier way to do this, and that everything still works as expected. This post went live on November 11th, 2015.
Why Use SQL Server R Services?
Hey, we can use R to connect to SQL Server, so why mess with a good thing? Well, there are a few reasons to look into SQL Server R Services (AKA, Don’t-Call-It-SSRS):
R is memory-only, single-threaded, non-optimized, and has memory leaks. R is a fantastic language for modeling and working with relatively small data sets, but open source R is not something you really want to run in a production environment. Microsoft’s play with Revolution Analytics is to make Revolution R Enterprise a viable server technology able to run in production. Will that be the case in SQL Server 2016? Well, ask me in a year…
Keep Data Local
In addition to improving R’s stability, the new method will allow us to keep data local to the SQL Server instance. That way, we can do all of the processing on a beefy server designed for data processing and sending a small result set or image back to the user, saving network traffic.
Integrate With SSRS (No, The Other SSRS)
SQL Server R Services lets you return images, which you can easily display in SQL Server Reporting Services. This extends SSRS in nice ways, as R has a fantastic set of packages built around visualization, way beyond anything base SSRS ever had. We won’t show that in this post, but I do plan on getting around to it later.
Let’s Do Something!
In yesterday’s post, we connected to SQL Server via RStudio. Today, we’re going to do it the new way, using a bit of new functionality: sp_execute_external_script. For SQL Server 2016, the only language that is available with sp_execute_external_script is R. At PASS Summit, the product manager stated outright that there will be additional languages, specifying Python.
The sp_execute_external_script command has a number of parameters available. Here’s a fake call with those parameters:
execute sp_execute_external_script @language = 'R', @script = N'[some R script]', @input_data_1 = N'SELECT * FROM SomeTable', @input_data_1_name = N'rDataFrameName', --data frame name in R @output_data_1_name = N'rOutputDataFrameName', --data frame returned to SQL Server @params = N'@model varbinary(max)', --model from R. Declared sort of like sp_executesql @model = @iris_model, --Declared variable used in query WITH RESULT SETS ([result set]); --mandatory if returning a result set
So we can see a few parameters here. First is @language, which needs to be “R” to run an R script. Speaking of scripts, the R script is the next parameter. We populate the input data frame with @input_data_1, and we can either give that data frame a name (using @input_data_1_name) or take the default of InputDataSet. Once the script finishes, we’re expecting an output data set, so we can name it using the @output_data_1_name parameter or just call our output frame OutputDataSet. If we have any parameters we want to use, we can specify them in @params, sort of like sp_executesql has a @params attribute. Then, we specify the values of those parameters. Finally, if we have a result set, we need to specify it.
Okay, that’s a sample call, but let’s look at a somewhat-realistic case. Using yesterday’s data set, let’s build a model on pincp versus agep. Note that pincp is PUMS Persons Income. In honor of our resident Penguatroll’s recent birthday, I’m going to run our model and predict income based on a certain age which I’m sure has absolutely no relevance here.
execute sp_execute_external_script @language = N'R', @script = N' InputDataSet$sex = as.factor(ifelse(InputDataSet$sex==1, ''M'', ''F'')) InputDataSet$sex = relevel(InputDataSet$sex, ''M'') cowmap <- c("Employee of a private for-profit", "Private not-for-profit employee", "Local government employee", "State government employee", "Federal government employee", "Self-employed not incorporated", "Self-employed incorporated") InputDataSet$cow = as.factor(cowmap[InputDataSet$cow]) InputDataSet$cow = relevel(InputDataSet$cow, cowmap) schlmap = c(rep("no high school diploma",15), "Regular high school diploma", "GED or alternative credential", "some college credit, no degree", "some college credit, no degree", "Associate''s degree", "Bachelor''s degree", "Master''s degree", "Professional degree", "Doctorate degree") InputDataSet$schl = as.factor(schlmap[InputDataSet$schl]) InputDataSet$schl = relevel(InputDataSet$schl, schlmap) model <- lm(pincp~agep, data=InputDataSet) OutputDataSet <- data.frame(predict(model,data.frame(agep=33),interval="predict")) ', @input_data_1 = N'SELECT TOP(1000) agep, cow, schl, sex, pincp from ss11pusa WHERE cow IN (1,2,3,4,5,6,7);' WITH RESULT SETS ((fit DECIMAL(19,6) NOT NULL, lwr DECIMAL(19,6) NOT NULL, upr DECIMAL(19,6) NOT NULL));
The result here is $27,488…but r^2 is so so low that the range is -$50,664 through $105,640…so yeah, not very meaningful.
For right now, there are some restrictions you need to be aware of:
- There are a number of data types which R does not support, including time, datetime2, nchar, nvarchar(!), hierarchyid, geometry, geography, and sql_variant
- The install.packages command in R is blocked by default. There are ways to install packages, but you’ll want an administrator to do that rather than doing it in an R script.
- For CTP 3, we get up to 20% of available machine memory for R. I’m hoping that we’ll get to see Resource Governor pools.
I love the fact that we’re getting integration with R in SQL Server. I’m not loving the implementation, honestly. I think it’s a pretty complex stored procedure which will turn off some people less familiar with SQL Server. It’s also impossible to debug R scripts, meaning that you’ll need to debug them outside of the procedure, and that gives you the opportunity to make extra mistakes.
So, what’s the major use of SQL Server R Services? Early on, I see batch processing as the main driver here. The whole point of getting involved with Revolution R is to create sever-quality R, so imagine a SQL Agent job which runs this procedure once a night against some raw data set. The R job could build a model, process that data, and return a result set. You take that result set and feed it into a table for reporting purposes. I’d like to see more uses, but this is probably the first one we’ll see in the wild.
- Henk van der Valk shows us how to install and use SQL Server R Services