Video: Applying Forensic Accounting Techniques With SQL And R

I have a new video up:

For more information: http://csmore.info/on/outliers

Forensic accountants and fraud examiners use a range of techniques to uncover fraudulent journal entries and illegal activities. As data professionals, most of us will never unravel a Bernie Madoff scheme, but we can apply these same techniques in our own environments to uncover dirty data. This session will use a combination of SQL Server and R to apply these fraud detection techniques, which include Benford’s Law, outlier analysis, time series analysis, and cohort analysis.
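
As a taste of the techniques in that abstract, here is a minimal sketch of a Benford’s Law check in T-SQL. The dbo.JournalEntry table and its Amount column are hypothetical stand-ins; the expected share of leading digit d is log10(1 + 1/d):

WITH digits AS
(
    SELECT LEFT(CAST(ABS(je.Amount) AS VARCHAR(30)), 1) AS LeadingDigit
    FROM dbo.JournalEntry je
    WHERE ABS(je.Amount) >= 1  -- assumes a DECIMAL Amount column; skip sub-unit values
)
SELECT
    d.LeadingDigit,
    COUNT(*) AS Observed,
    COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS ObservedFrequency,
    LOG10(1.0 + 1.0 / CAST(d.LeadingDigit AS INT)) AS ExpectedFrequency
FROM digits d
GROUP BY d.LeadingDigit
ORDER BY d.LeadingDigit;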

Click through for the video, which turns out to be exactly the same number of seconds as my first video.  That was not planned but I thought it was pretty cool.

Could Not Find Function rxSqlUpdateLibPaths

I ran into an interesting SQL Server Machine Learning Services issue the other day.  We decided to upgrade our SQL Server 2016 machine running R Services to SQL Server 2017.  This was an in-place upgrade rather than rebuild and reinstall.

The Problem

After installation completed, the DBA enabled SQL Server 2017 Machine Learning Services, but as soon as I tried to run a simple R script, it stalled for about 30 seconds and then I got an error:

Msg 39012, Level 16, State 1, Line 0
Unable to communicate with the runtime for ‘R’ script. Please check the requirements of ‘R’ runtime.
STDERR message(s) from external script:
Error: could not find function “rxSqlUpdateLibPaths”
Execution halted
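
For reference, the sort of smoke test that trips this is nothing exotic.  Something along these lines (not the exact script, but representative) is enough, because the failure happens while the R runtime starts up, before any user code runs:

-- A trivial "is the R runtime alive?" check.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT 1 AS Col;'
WITH RESULT SETS ((Col INT));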

There were exactly 0 Google hits for rxSqlUpdateLibPaths, so that was an auspicious beginning.  Fortunately, I found a solution relatively quickly.

The Solution

It turns out that on this SQL Server 2016 machine, we had installed Microsoft R Server 9.1.0 and bound SQL Server 2016 R Services to the newer version of Microsoft R Server.  One of the things this does is create an R_LIBS_USER variable pointing to Microsoft R Server.  After changing that to point to the R Services library (generally located in {version}.MSSQLSERVER\R_SERVICES\library) and restarting the SQL Server 2017 Launchpad service, everything worked as expected.
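
After making a change like that and restarting Launchpad, one quick sanity check is to ask the R runtime which library folders it sees; the result should point at the instance’s R_SERVICES\library folder rather than the standalone Microsoft R Server library:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(LibPath = .libPaths(), stringsAsFactors = FALSE);'
WITH RESULT SETS ((LibPath NVARCHAR(500)));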

OWASP Top 10 For 2017

I have been following the OWASP Top 10 for 2017 for a while and have decided to create a talk on the topic.  It was interesting watching the discussion on the OWASP GitHub repo, and now that the list appears to be settled, I can publish my talk.

This talk is meant to provide an overview of the OWASP Top 10 from a .NET developer’s perspective.  To support a 60-minute talk, I’m providing a large number of links to additional resources, as I know I can’t go in depth on any single topic.  This blog post will serve as my Links and Further Information for the talk, so here goes:

General Links

Injection

Authentication and Session Management

Sensitive Data Exposure

XML External Entity Injection

T-SQL Tuesday 95: Polybase And HDInsight

Today is T-SQL Tuesday #95, hosted by Derik Hammer, and the topic is Big Data.

This is a continuation of my Polybase series.  If you haven’t seen this already, I have a twenty-some part series on Polybase in SQL Server 2016 and now SQL Server 2017.  Polybase hasn’t changed with the RTM release of SQL Server 2017, so I haven’t needed to write much new about the topic lately.  That’s my excuse for being lazy, at least.

Additional Resources

If you’re new here, you might be interested in the rest of my Polybase series, which I have arranged in rough reading order.  I also turned this series into a presentation.

You may also like my Kafka series, which led to a presentation on the topic.  Kafka is a fantastic distributed message broker which lives in the Hadoop ecosystem, and I can’t say enough good things about it.

Maybe you’re curious about this Hadoop thing but don’t know where to start.  Go check out my presentation introducing Hadoop, which I have recorded as well.

After that, if you want to see several methods you could use to connect Hadoop to SQL Server (including Polybase), check out my presentation on the topic, which the folks at NDC Sydney helpfully recorded.

Finally, if you just want to get an idea of what the modern data platform space looks like, I have a presentation on that as well.  Carlos L Chacon interviewed me on the SQL Data Partners podcast in Episode 47 and we talked about this in depth, including where Hadoop fits into this ever-growing combination of technologies.

Now, On With The Show

In today’s post, I want to look at trying to connect a SQL Server instance to Azure HDInsight and see if I can take advantage of predicate pushdown.  The short answer is, not really.  The long answer is, read on.

 

Let’s Spin Up HDInsight

The first step here is to start up an HDInsight cluster.  We won’t need a powerful cluster for this demo.

First, in the Azure Portal, search for HDInsight and you’ll get the option to create a new cluster.  Here are my cluster configuration options:

[Image: cluster type configuration]

I just wanted a classic Hadoop cluster and didn’t need to install HBase, Kafka, or anything else.  Note that Windows is an option for operating system, but nobody in their right mind runs Hadoop on Windows.  It’s dead; leave it be.

Next, fill out the basics.  Here’s my setup:

[Image: cluster basics]

You can configure your storage options after that.  I’m choosing the default of Azure Storage, though you can also read from Azure Data Lake Storage.  If you want Hive or Oozie metadata to persist after you kill the cluster—something that you might want to do in production as you shut down the HDInsight cluster when not in use—you can choose an Azure SQL Database in the dropdowns at the bottom.

[Image: storage settings]

Once you’ve got storage set up, you want to edit the cluster size.  By default, you get two D12 head nodes and four D4 worker nodes.  This will run you $3.63 an hour in East US.  That’s about $87 a day, or roughly $2600 per month.

[Image: edit cluster size]

You probably want to trim that down a bit if you’re just playing around.  Here’s what I ended up with:

[Image: smaller node sizes]

Using just two D3 nodes cuts the price in half.  It’d be a terrible demo for Hadoop and completely unfair in a performance test, but our purpose here is just to get a simple Polybase demo working, not to provide a comprehensive study of Hadoop performance.

Once you’ve got everything set up, expect to stare at this for a while:

[Image: deploying HDInsight]

Configuring For Predicate Pushdown

Next up, we want to configure Polybase to use our Hadoop installation.  To find out which version of Hadoop the cluster is running (I’ll need it for the Polybase configuration files), I need to SSH to the NameNode:

[Image: SSH connection details]

My HDInsight cluster is using Hortonworks Data Platform version 2.6.2.2-5:

[Image: HDP version]

That means that in my yarn-site.xml file, all references will be to /usr/hdp/2.6.2.2-5.
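
For completeness, the SQL Server side of this is the standard Polybase setup.  The following is a sketch with hypothetical names and addresses (8020 and 8050 are typical HDP ports for the NameNode and ResourceManager); RESOURCE_MANAGER_LOCATION is the piece that makes predicate pushdown possible:

-- Connectivity level 7 covers Hortonworks HDP 2.x on Linux, which is what HDInsight runs.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- Restart SQL Server and the Polybase services after changing this.

CREATE EXTERNAL DATA SOURCE HDInsightCluster WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://<head node address>:8020',
    RESOURCE_MANAGER_LOCATION = '<head node address>:8050'
);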

The Problem

But now we run into a problem:  there are certain ports which need to be open for Polybase to work, including port 50010 on each of the data nodes against which we want to run MapReduce jobs.  This goes back to the issue we see with spinning up data nodes in Docker:  the ports simply are not accessible.  If you’ve put your HDInsight cluster into an Azure VNet and are willing to monkey around with network rules, you might be able to open all of the ports necessary to get this working, but that’s a lot more than I’d want to mess with, as somebody who hasn’t taken the time to learn much about cloud networking.

One Workaround

There’s one consistent workaround that I see people use:  have Azure Blob Storage work as an intermediary.  In this scenario, you’d use Polybase to write to Azure Blob Storage and then pull that data into your Azure Storage local to the HDInsight cluster.  This works for data transfer, but it doesn’t allow for predicate pushdown.
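
The Polybase side of that workaround is just a blob storage external data source.  Here is a sketch with hypothetical account and container names; note the lack of a RESOURCE_MANAGER_LOCATION, which is exactly why pushdown is off the table:

-- Assumes a database master key already exists in this database.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account access key>';

CREATE EXTERNAL DATA SOURCE AzureBlobStorage WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storage account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
-- To write to blob storage through an external table, also enable 'allow polybase export'.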

Using A Deployment Template

Robert Alexander has created a deployment template to spin up a SQL Server instance and an HDInsight cluster with a sample data set, so if you have an Azure subscription and want to play along, it’s pretty easy.  Just hit the “Deploy” button, fill in a few details, and be prepared to wait half an hour.  Note that the average daily cost is just over $150 for this demo, so you probably want to delete the objects as soon as you can to save money.  You will be charged for an HDInsight cluster even if you’ve shut off all of the machines.  There is no free pause option; the only way not to pay for the service is to delete the cluster.

Unfortunately, I was not able to get this template working.  I got through step one of the deployment but got stuck with a status of Action Required:

[Image: deployment stuck with a status of Action Required]

When clicking on the deployment link, it’d sit there spinning, waiting for something to load.  In my browser console, I saw the problem:

[Image: API error in the browser console]

Trying to go to that link gives an error message:

{"message":"You don't have access to this deployment ({deploymentID})."}

In this case, {deploymentID} is the exposed GUID after /deployments/ in the link above.  I tried this with a couple of accounts, but was unable to spin up this demo with either of them.

Conclusion

I’m not particularly fond of HDInsight at this point.  There’s a lot of effort to get started, and frankly, if I’m going to go through all of this effort, I’ll just build my own Cloudera or Hortonworks cluster and get full control.  I’ve also been unable to use Polybase on my on-prem SQL Server instance against an HDInsight node.  There are cases where I can see HDInsight being useful, but I don’t think there’s enough time and administrative savings over building your own Hadoop cluster to make it a worthwhile choice.  Maybe in a few years they’ll have it nailed down, but for now, I’d stick to custom-built Hadoop either on-prem or as VMs in a cloud provider like Azure or AWS.

 

New Video: Much Ado About Hadoop

I’ve decided to start recording my presentations to ensure that I have them in video form.  They’re much closer to one-take videos than polished, professional videos, but you’ve got to start somewhere.

Here is my first video, Much Ado About Hadoop.

The main audience for this video is people who are interested in learning a bit about the Hadoop ecosystem from a high level but not sure where to begin.  I cover this from a historical perspective, explaining what the major components of Hadoop are, why they made sense in the context of contemporary hardware and software paradigms, and how the Hadoop ecosystem has evolved in response to changes in hardware and software.  This talk is approximately 90 minutes in length and is an expanded version of the talk I’ve given at SQL Saturdays.

From here, as I get a bit more familiar with my software and equipment—and as I get more time—I will post additional videos.  I already know what my second video will be and want to get it out in the next two or three weeks, depending upon circumstances.  From there, who knows?

NDC Sydney Talks Up

NDC has released videos for NDC Sydney.  The NDC Conferences channel has the full set, and there were some fantastic talks this year.

I gave two talks at the conference:

They were surprisingly not awful!  Probably due to all of the post-production…

My favorite talk from NDC Sydney was Laura Bell talking about security for startups.  Definitely watch that one too; she gives four easy tips for us to use to improve our companies’ security postures.

Compress Those Models

I’m currently working on a project to generate a model in SQL Server R Services.  There are some nice demos on how to store a model (like this one that Microsoft has involving ski resort room rental), but there’s one thing that most of these leave out:  scale.

When you serialize a model to the database, it’s going to take up some amount of space.  In our case, we’re using a few different types of models.  Neural nets are relatively svelte, at roughly 40K per model.  On the other end, decision trees can take 2-4 MB per model.  When you’re storing a few models in SQL Server, size isn’t generally much of an issue.  Even if you’re using temporal tables to store model history, it’s probably not a big problem.  But when you’re storing a lot of models, the space adds up in a hurry.
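
As a rough sketch of the storage pattern (the table and column names here are hypothetical, not our actual schema), the serialized model lands in a VARBINARY(MAX) column and system versioning takes care of the history table:

CREATE TABLE dbo.ProductModel
(
    ProductID INT NOT NULL,
    ModelType VARCHAR(30) NOT NULL,    -- e.g., neural net or decision tree
    Model VARBINARY(MAX) NOT NULL,     -- the serialized model object
    PeriodStart DATETIME2(2) GENERATED ALWAYS AS ROW START NOT NULL,
    PeriodEnd DATETIME2(2) GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (PeriodStart, PeriodEnd),
    CONSTRAINT PK_ProductModel PRIMARY KEY CLUSTERED (ProductID)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductModelHistory));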

How much is a lot?  Well, let’s start with this:

[Image: table sizes before compression]

These are some numbers for an early beta.  We have 91,130 products modeled and a model history of 194,830 products.  Each model, on its own, averages out to 177K in our ProductModel table and 212K in ProductModelHistory.  As mentioned, different types of models take up different amounts of space (and I didn’t elaborate upon all the types of models we’re using), so we wouldn’t expect the aggregates to be exactly the same.  I included a contrast:  we’ve generated 16.4 million quantity sold predictions during this time, but the total disk space is under 1 GB and 2/3 of that is a unique nonclustered index.  The 372MB is columnstore.  I love columnstore…

Compression

So right now, we’re burning roughly 200K per model.  My stated goal is to be able to store several years worth of data for 10 million products.  Let’s say that I need 10 million products in ProductModel and 1 billion rows in ProductModelHistory.  That means that we’d end up with 1.86 TB of data in the ProductModel table and 186 TB in ProductModelHistory.  This seems…excessive.

As a result, I decided to try using the COMPRESS() function in SQL Server 2016.  The COMPRESS function simply uses GZip compression.  Yeah, there are compression algorithms which tend to be more compact (e.g., bz2 or 7z), but GZip is relatively CPU efficient and I can wrap my SQL statements with COMPRESS() and DECOMPRESS() and not have to change any calling code; I just need to update the two stored procedures I use to insert and then retrieve product models.
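
In practice the change is tiny.  Against the hypothetical table sketched above, the write and read paths look something like this:

DECLARE @ProductID INT = 1,
        @ModelType VARCHAR(30) = 'DecisionTree',
        @Model VARBINARY(MAX) = 0x0123456789ABCDEF;  -- stand-in for a serialized model

-- On the way in: GZip the serialized model.
INSERT INTO dbo.ProductModel (ProductID, ModelType, Model)
VALUES (@ProductID, @ModelType, COMPRESS(@Model));

-- On the way out: DECOMPRESS returns the original bytes, so calling code never notices.
SELECT pm.ProductID, pm.ModelType, DECOMPRESS(pm.Model) AS Model
FROM dbo.ProductModel pm
WHERE pm.ProductID = @ProductID;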

With compression in place, my data on disk looks much less unreasonable:

[Image: table sizes after compression]

That’s a drop from 200K per model down to about 45K per model, or an overall 4.7x compression ratio.  The type of model made a big difference in how well it compressed.  Our neural nets compressed roughly 2x, going from approximately 40K per model down to 20K per model.  Meanwhile, those decision trees went from 2-4MB down to about 200K per model.

At 45K per model, storing 10 million models is about 429 GB, and 1 billion rows is just shy of 43 TB.  For my needs, that saves approximately 140 TB of disk space.  Your mileage will probably vary, but if you’re generating a large number of models, think about compression.