Could Not Find Function rxSqlUpdateLibPaths

I ran into an interesting SQL Server Machine Learning Services issue the other day.  We decided to upgrade our SQL Server 2016 machine running R Services to SQL Server 2017.  This was an in-place upgrade rather than a rebuild and reinstall.

The Problem

After installation completed, the DBA enabled SQL Server 2017 Machine Learning Services, but as soon as I tried to run a simple R script, it stalled for about 30 seconds and then I got an error:

Msg 39012, Level 16, State 1, Line 0
Unable to communicate with the runtime for 'R' script. Please check the requirements of 'R' runtime.
STDERR message(s) from external script:
Error: could not find function "rxSqlUpdateLibPaths"
Execution halted
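The script itself was nothing exotic.  I don’t have the exact statement anymore, but a minimal sketch along these lines is enough to trigger the failure:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT 1 AS Col1;'
WITH RESULT SETS ((Col1 INT));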

There were exactly 0 Google hits for rxSqlUpdateLibPaths, so that was an inauspicious beginning.  Fortunately, I found a solution relatively quickly.

The Solution

It turns out that on this SQL Server 2016 machine, we had installed Microsoft R Server 9.1.0 and bound SQL Server 2016 R Services to the newer version of Microsoft R Server.  One of the things this binding does is create an R_LIBS_USER environment variable pointing to the Microsoft R Server library.  After changing that variable to point to the R Services library (generally located in {version}.MSSQLSERVER\R_SERVICES\library) and restarting the SQL Server 2017 Launchpad service, everything worked as expected.
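If you hit something similar, one quick sanity check is to ask the R runtime what it actually sees.  This sketch just prints the R_LIBS_USER variable and the library search paths; there’s nothing environment-specific in it:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        print(Sys.getenv("R_LIBS_USER"))  # the environment variable in question
        print(.libPaths())                # the library paths R will actually search
    ';

If the first path printed still points at the Microsoft R Server library rather than the R Services library, the variable change has not taken effect yet.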


OWASP Top 10 For 2017

I have been following the OWASP Top 10 for 2017 for a while and have decided to create a talk on the topic.  It was interesting watching the discussion on the OWASP GitHub repo, and now that the list appears to be settled, I can publish my talk.

This talk is meant to provide an overview of the OWASP Top 10 from a .NET developer’s perspective.  To support a 60-minute talk, I’m providing a large number of links to additional resources, as I know I can’t go in depth on any single topic.  This blog post will serve as my Links and Further Information for the talk, so here goes:

General Links

Injection

Authentication and Session Management

Sensitive Data Exposure

XML External Entity Injection

T-SQL Tuesday 95: Polybase And HDInsight

Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series.  If you haven’t seen this already, I have a twenty-some part series on Polybase in SQL Server 2016 and now SQL Server 2017.  Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately.  That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have in rough reading order.  I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic.  Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start.  Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well.  Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to try connecting a SQL Server instance to Azure HDInsight and see whether I can take advantage of predicate pushdown.  The short answer is, not really.  The long answer is, read on.

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster.  We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster.  Here are my cluster configuration options:

[Screenshot: cluster configuration options]

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else.  Note that Windows is an option for the operating system, but nobody in their right mind runs Hadoop on Windows.  It’s dead; leave it be.

Next, fill out the basics.  Here’s my setup:

[Screenshot: cluster basics]

You can configure your storage options after that.  I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage.  If you want Hive or Oozie metadata to persist after you kill the cluster (something you might want in production, where you’d shut down the HDInsight cluster when not in use), you can choose an Azure SQL Database in the dropdowns at the bottom.

[Screenshot: storage options]

Once you’ve got storage set up, you want to edit the cluster size.  By default, you get two D12 head nodes and four D4 worker nodes.  This will run you $3.63 an hour in East US.  That’s about $87 a day, or roughly $2,600 per month.

[Screenshot: editing the cluster size]

You probably want to trim that down a bit if you’re just playing around.  Here’s what I ended up with:

[Screenshot: smaller node sizes]

Just using two D3 nodes cuts the price in half.  It’d be a terrible showcase for Hadoop and completely unfair in a performance test, but our purpose here is just to get a simple Polybase demo working, not to provide a comprehensive study of Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

[Screenshot: HDInsight deployment in progress]

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation.  To get the right version of Hadoop, I need to SSH to the NameNode:

[Screenshot: SSH connection details]

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

[Screenshot: HDP version listing]

That means that in my yarn-site.xml file, all references will be to /usr/hdp/2.6.2.2-5.
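Concretely, that means the yarn.application.classpath property in the yarn-site.xml under the Polybase configuration directory (typically ...\MSSQL\Binn\Polybase\Hadoop\conf on the SQL Server side) ends up with entries shaped like the following.  This is a trimmed illustration, not the full list:

<property>
  <name>yarn.application.classpath</name>
  <!-- every entry gets the version we found above; the real list is longer -->
  <value>/usr/hdp/2.6.2.2-5/hadoop/*,/usr/hdp/2.6.2.2-5/hadoop/lib/*,/usr/hdp/2.6.2.2-5/hadoop-hdfs/*,/usr/hdp/2.6.2.2-5/hadoop-yarn/*</value>
</property>

Remember that the instance also needs the ‘hadoop connectivity’ sp_configure option set to a value covering Hortonworks on Linux (option 7, if memory serves), followed by a service restart.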

The Problem

But now we run into a problem:  there are certain ports which need to be open for Polybase to work.  This includes port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  the ports are not available.  If you’ve put your HDInsight cluster into an Azure VNet and are willing to monkey around with the network rules, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.
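For what it’s worth, if the networking did cooperate, the Hadoop-facing setup would look something like this sketch.  The name and addresses are hypothetical; RESOURCE_MANAGER_LOCATION is the piece that enables predicate pushdown, and 8050 is the usual resource manager port on HDP:

-- Hypothetical addresses: what the setup would look like with reachable ports.
CREATE EXTERNAL DATA SOURCE HDInsightCluster WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.0.0.4:8020',
    RESOURCE_MANAGER_LOCATION = '10.0.0.4:8050'
);

From there, a query against an external table built on that data source could use OPTION (FORCE EXTERNALPUSHDOWN) to demand a MapReduce job rather than streaming all of the data across.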

One Workaround

There’s one consistent workaround that I see people use:  have Azure Blob Storage act as an intermediary.  In this scenario, you use Polybase to write to Azure Blob Storage and then pull that data into the Azure Storage account local to the HDInsight cluster.  This works for data transfer, but it doesn’t allow for predicate pushdown.
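Sketching that workaround out, the Polybase side looks roughly like this.  The container, account, and credential names are all hypothetical, and you’d need a database master key in place before creating the credential:

-- Hypothetical names throughout; requires an existing database master key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE WasbDataSource WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

You can then create external tables against that data source and write to them, but because there’s no resource manager on the blob storage side, predicate pushdown is off the table.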

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy.  Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour.  Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money.  You will be charged for an HDInsight cluster even if you’ve shut off all of the machines.  There is no free pause option; the only way to stop paying for the service is to delete the cluster.

Unfortunately, I was not able to get this template working.  I got through step one of the deployment but got stuck with a status of Action Required:

[Screenshot: deployment stuck at Action Required]

When clicking on the deployment link, it’d sit there spinning, waiting for something to load.  In my browser console, I saw the problem:

[Screenshot: browser console showing the API error]

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above.  I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point.  It takes a lot of effort to get started, and frankly, if I’m going to go through all of that effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control.  I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node.  There are cases where I can see HDInsight being useful, but I don’t think it saves enough time and administration over building your own Hadoop cluster to make it a worthwhile choice.  Maybe in a few years they’ll have it nailed down, but for now, I’d stick with custom-built Hadoop, either on-prem or as VMs in a cloud provider like Azure or AWS.

New Video: Much Ado About Hadoop

I’ve decided to start recording my presentations to ensure that I have them in video form.  They’re much closer to one-take videos than polished, professional videos, but you’ve got to start somewhere.

Here is my first video, Much Ado About Hadoop.

The main audience for this video is people who are interested in learning a bit about the Hadoop ecosystem from a high level but not sure where to begin.  I cover this from a historical perspective, explaining what the major components of Hadoop are, why they made sense in the context of contemporary hardware and software paradigms, and how the Hadoop ecosystem has evolved in response to changes in hardware and software.  This talk is approximately 90 minutes in length and is an expanded version of the talk I’ve given at SQL Saturdays.

From here, as I get a bit more familiar with my software and equipment—and as I get more time—I will post additional videos.  I already know what my second video will be and want to get it out in the next two or three weeks, depending upon circumstances.  From there, who knows?