ConvertFrom-Gzip

Not too long ago, I needed to figure out how many total lines I had in a series of files.  Normally, this would not be difficult:  I can use the Measure-Object PowerShell cmdlet with the -Line parameter and get a line count.  The tricky part here, though, was two-fold:  first, I had approximately 9400 files to parse; and second, all of the files were gzipped.  This meant that I would need to unzip each file before counting its lines.

Fortunately, I found a great function on the TechNet gallery:  ConvertFrom-Gzip.  This function is well-written enough that it reads items from the pipeline, does the unzipping in memory, and passes the contents along the pipeline, meaning that I never needed to write an unzipped file out to disk.  That saved me a lot of time.  Here’s my PowerShell one-liner to solve the problem:

Get-ChildItem "C:\temp\CallLog\" -Recurse -Filter "*.gz" | Get-Item | ConvertFrom-Gzip | Measure-Object -Line

The thing I like most about this function is that it really fits in the spirit of PowerShell:  I can use it in a pipeline without needing to add extra cruft.  Note that the function is not signed, so if you want to use it in your own code, you may need to unblock the downloaded file or adjust your execution policy.
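
If you would rather avoid the unsigned function entirely, a rough equivalent using .NET’s GZipStream class directly looks something like the following.  This is a sketch rather than a drop-in replacement for ConvertFrom-Gzip, but it does the same in-memory decompression and line counting.

# Decompress each .gz file in memory and count lines, without writing anything to disk.
$totalLines = 0
Get-ChildItem "C:\temp\CallLog\" -Recurse -Filter "*.gz" | ForEach-Object {
    $fileStream = [System.IO.File]::OpenRead($_.FullName)
    $gzipStream = New-Object System.IO.Compression.GZipStream($fileStream, [System.IO.Compression.CompressionMode]::Decompress)
    $reader = New-Object System.IO.StreamReader($gzipStream)
    while ($null -ne $reader.ReadLine()) { $totalLines++ }
    $reader.Dispose(); $gzipStream.Dispose(); $fileStream.Dispose()
}
$totalLines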

Cross-Database Memory Optimized Queries Still Not Available

In SQL Server 2014, we learned that memory-optimized tables cannot join to tables outside the current database.  I wanted to see if this had changed in SQL Server 2016 CTP 3.2.

CREATE TABLE [dbo].[MOTest]
(
	MOTestID BIGINT IDENTITY(1,1) NOT NULL,
	ProductSubcategoryName VARCHAR(50) NOT NULL,
	CONSTRAINT [PK_MOTest] PRIMARY KEY NONCLUSTERED
	(
		MOTestID ASC
	)
) WITH(MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY)
GO

INSERT INTO dbo.MOTest
(
	ProductSubcategoryName
)
VALUES
('Mountain Bikes'),
('Road Bikes'),
('Touring Bikes'),
('Cranksets'),
('Chains'),
('Something Else');

SELECT TOP (100)
	*
FROM AdventureWorks2014.Production.ProductSubcategory ps
	INNER JOIN dbo.MOTest m
		ON m.ProductSubcategoryName = ps.Name;

My result:

(6 row(s) affected)
Msg 41317, Level 16, State 5, Line 24
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

In other words, we still cannot cross databases with memory-optimized tables.

New Talk Planned

I’m introducing a new talk for 2016:  Securing SQL Server.  Here’s my abstract:

A default SQL Server installation is reasonably secure, but “reasonably secure” doesn’t cut it anymore in an era in which one bad line of code, one weak password, or one open port can result in your customer database ending up on Pastebin.  In this talk, we will look at different methods of securing a SQL Server instance, from venerable (principle of least privilege, Transparent Data Encryption) to novel (Always Encrypted, row-level security).  These tools and techniques will show us ways for developers, database administrators, and network specialists to work together to secure corporate assets and prevent CISOs from ending up on the nightly news.

My plan is to hit on a few topics, introduce some resources to learn more about the topic, and move on.  The goal is to provide an overview of some of the tools and technologies available, including:

  • Applying the principle of least privilege to SQL Server logins, users, and roles.
  • Implementing Transparent Data Encryption.
  • Implementing Always Encrypted in SQL Server 2016.
  • Implementing row-based security, both in SQL Server 2016 and prior to 2016.
  • Using network segmentation to limit access to SQL Server instances.
  • Writing good, parameterized SQL to prevent SQL injection (see the sketch below).
  • Reducing the SQL Server surface area.

There are entire books on SQL Server security, so a one-hour talk can’t cover everything.  Nonetheless, my intent with this talk is to give everybody who attends at least one new resource.
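
As a small taste of the parameterized SQL bullet above, here is a minimal sketch of running a parameterized query from PowerShell via ADO.NET.  The connection string and query are illustrative only; the point is that the value travels to SQL Server as a typed parameter rather than being concatenated into the query text.

$connection = New-Object System.Data.SqlClient.SqlConnection("Server=localhost;Database=AdventureWorks2014;Integrated Security=true")
$command = $connection.CreateCommand()
$command.CommandText = "SELECT ps.Name FROM Production.ProductSubcategory ps WHERE ps.Name = @Name;"
# Bind the value as a typed parameter instead of building the SQL string by hand.
[void]$command.Parameters.Add("@Name", [System.Data.SqlDbType]::VarChar, 50)
$command.Parameters["@Name"].Value = "Mountain Bikes"
$connection.Open()
$reader = $command.ExecuteReader()
while ($reader.Read()) { $reader["Name"] }
$reader.Close()
$connection.Close()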

Azure ML: Custom Data Sets

Last time around, I built a simple Azure ML experiment based on flight delay data made available to us as a Microsoft sample.  Today, I want to load a custom data set and build an experiment against it.

The Data Set

I’m using SQL Saturday data for this demo.  I came up with a set of data for each SQL Saturday.  This comma-delimited data file includes:

  1. City name
  2. Region.  This is something I personally coded, breaking each SQL Saturday down into the following regions:  Southeast, Mid-Atlantic, Northeast, Midwest, South, West, Canada, Latin America, Europe, Africa, and Asia.
  3. Date
  4. Month
  5. International.  1 if this event took place outside of the United States.
  6. Personal Connection.  1 if I personally knew at least one of the speaker committee members.
  7. Spoke.  1 if I spoke at this event.

My plan is to regress Spoke against Region, Month, International, and Personal Connection to see if there is a meaningful pattern.  (Hint:  in reality, there is.)
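
Roughly speaking, and assuming Azure ML dummy-codes the Region strings (and treats Month as numeric), the linear regression being fit is a linear probability model along these lines:

$$\widehat{\text{Spoke}} = \beta_0 + \sum_{r \in \text{Regions}} \beta_r \, \mathbb{1}[\text{Region} = r] + \beta_M \cdot \text{Month} + \beta_I \cdot \text{International} + \beta_P \cdot \text{PersonalConnection}$$

Because Spoke is either 0 or 1, the predicted value reads naturally as an estimated likelihood of speaking, which is how I interpret the web service results later on.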

Uploading A Data Set

Uploading a data set to Azure ML is easy.  In the Azure ML Studio, go to My Experiments.  Then, on the left-hand side of the screen, click the “Datasets” option.  To add a new dataset, click the [+] NEW button at the bottom and select Dataset –> From Local File.

NewDataset

Doing this brings up a modal dialog:

UploadDataset

Uploading my hundred-row dataset took a few seconds, and now it’s available in the datasets section.

MyDatasets

Now let’s experiment.

Create The Experiment

Our experiment is going to look similar to the one we did yesterday:  a simple linear regression.  Because the structure is fundamentally the same, I won’t walk through it in as much detail as I did yesterday.

Build the Data

I dropped my 2015 SQL Saturdays custom dataset onto the canvas.  After that, I dragged on a Project Columns component and selected the columns I want to use in my regression model:  Region, Month, International, PersonalConnection, and Spoke.

SQLSaturdayColumns

From there, I dragged on a Split Data component and split the rows 70-30 using random seed 21468.

Build the Model

My model is a Linear Regression, so I dragged that onto the canvas.  I also dragged on a Train Model component, a Score Model component, and an Evaluate Model component.  The only thing which needed tweaking was the Train Model component, whose selected column is “Spoke” because I want to try to determine whether I’m likely to speak at a particular event.  Otherwise, the final model looks like:

SQLSaturdayModel

Check the Model

When I ran this model, I got results back in approximately 30 seconds.  The model shows an R^2 of 0.48, which is pretty good for a social sciences result.

SQLSaturdayModelResults

An interesting line of questioning would be:  what would improve the R^2 here?  Maybe variables around how many other events there were, the size of the event, and whether I had just gone to a SQL Saturday the week before would help, but I’m satisfied with these results.  So now I’m going to take the next step:  setting up a web service.

Setting Up A Service

Setting up a web service is pretty easy:  click the “Set Up Web Service” button at the bottom of the dashboard.

SetUpWebService

Doing this changes our canvas and gives us a new tab:  Predictive experiment.

PredictiveExperiment

After running that predictive experiment, you can click “Deploy Web Service” at the bottom of the screen to deploy it.  From there, you’ll get a web service screen which gives you information on how to connect to and test the web service.
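
For reference, calling the deployed request/response endpoint from PowerShell looks roughly like the sketch below.  The URI and API key are placeholders, and the exact column list depends on how the predictive experiment’s input is defined; the payload shape follows the sample request that Azure ML Studio displays on the web service page.

$apiKey = "<your API key>"
$uri = "https://<your region>.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true"
$body = @{
    Inputs = @{
        input1 = @{
            ColumnNames = @("Region", "Month", "International", "PersonalConnection")
            Values = @(, @("Midwest", "2", "0", "1"))
        }
    }
    GlobalParameters = @{}
} | ConvertTo-Json -Depth 5
# POST the request with the API key in the Authorization header.
Invoke-RestMethod -Uri $uri -Method Post -Body $body -ContentType "application/json" -Headers @{ Authorization = "Bearer $apiKey" }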

Testing the Web Service

I want to do seven tests for the web service:

  1. Cleveland in February, and I know the committee.  I expect “Spoke” to be very close to 1, meaning nearly a 100% likelihood of speaking.  Actual Result:  0.73.
  2. Baltimore in August, and I know the committee.  I expect “Spoke” to be somewhere around 0.5 because I do some Mid-Atlantic SQL Saturdays but not too many of them.  Actual Result:  0.52.
  3. Dallas, Texas in May.  I don’t know the committee.  I expect “Spoke” to be relatively close to 0 because I didn’t do any SQL Saturdays in the South.  Actual Result:  0.009.
  4. Kansas City in October.  I know the committee.  I expect “Spoke” to be relatively close to 0, but that’s because the model is missing some information (like how delicious the barbecue crawl is).  In reality, I want to go and would expect the probability to be closer to 1.  Actual Result:  0.67.
  5. Berlin in June.  I don’t know the committee.  I expect “Spoke” to be very close to 0.  Actual Result:  0.01.
  6. Raleigh in October.  I am the law.  I expect “Spoke” to be nearly 1.  Actual Result:  0.91.
  7. Raleigh in December.  I am the law.  I expect “Spoke” to be high, but less than case 6.  Actual Result:  0.898.

TestResults

In these cases, the model was accurate 7 out of 7 times, and it actually tripped me up once when I expected it to miss (because I had actually missed Kansas City in October).  If you’re a harsh grader, you could ding it for Cleveland, which I expected to be higher.

Conclusions

This is obviously a toy model with a small data set, but it shows the power of what you can do with Azure ML.  With just a few clicks, you can be up and running with a data set, and after building a web service, you can easily consume it in your own application.  This significantly shortens the development time necessary to go from a basic plan to a production process.

Azure ML: Getting Started

Machine Learning Setup

Go to your old Azure portal.  Then, go to New –> Data Services –> Machine Learning –> Quick Create.  You’ll need an Azure account and an Azure storage account.

CreateML

Once we have our workspace, we can connect to Azure ML Studio to continue.

Build An Experiment

We want to build the following Azure ML experiment:

FirstAzureMLModel

Let’s build it one step at a time.  First, click on the [+] NEW link and select Blank Experiment.

BlankExperiment

This will give you a blank canvas.  We want to look for Flight Delay Data, which is a sample data set Microsoft provides.  We’re going to take this data set and drag it onto the canvas.

FlightDelaysData

Visualize The Data

Before we build a model, we need to understand the data set to get an idea of what kind of information is available to us.  In this case, we get a set of data dealing with flight delays, but to learn a little bit more about this data set, we can right-click on the set and go to Dataset –> Visualize.  That will let us get some basic details:

FlightDelayVisualization

In this data set, we have a few attributes like year of flight, day of flight, carrier, origin airport ID, and destination airport ID.  We also have a few metrics, including departure delay, arrival delay, and whether the flight was cancelled.  I clicked ArrDelay (arrival delay) and got some basic statistics and a bar chart.  This result set shows that the average flight (mean) is 6.6 minutes late, but the median flight is 3 minutes early.

Determine the Goal

My goal here is to see if I can predict how late a flight will be based on this data set.  If I can make a good prediction, I might be able to use it for my personal goals—figuring out my flight schedule a little bit better.

In other words, what I want to do is predict the arrival delay (ArrDelay).  The rest of my experiment will go toward that goal.

Add More Components

Ignore Cancelled Flights

The first thing I want to do is throw out cancelled flights.  I expect that those flights won’t be marked as delayed (because they were cancelled), and so I don’t want them to throw my predictions off.

The way that I can do this is to drag on a Split Data component and select "Relative Expression" as the Splitting mode.

SplitData

I also need to set a Relational expression, specifying that I want to split data based on whether Cancelled is greater than 0.  As a quick note, my first attempt with this Split Data component had me set \"Cancelled" = 0, but that caused an error when I tried to run the ML experiment, so instead of using an equality operator, I needed to use an inequality operator.

The Split Data component gives you two outputs, but I only want to use the second, non-cancelled flights output.

Specify Columns

I’ve decided what my goal is, and now I need to throw out any unnecessary columns so that the model I choose doesn’t try to use them.  In my case, I want to throw out the following columns:

  1. CRSDepTime and CRSArrTime — These are the scheduled departure and arrival times, respectively.  I could make use of them, but I figured they’d have little predictive power.
  2. DepDelay — The number of minutes late the flight is at departure.  This probably correlates pretty well to arrival lateness, but I want to see if the other columns predict late arrival times, as I won’t know how late the flight departs until I’m on the plane.
  3. DepDel15 and ArrDel15 — These tell us whether the flight was at least 15 minutes late for departure or arrival.  They’re just a grouping mechanism for DepDelay and ArrDelay, so I don’t want my analysis to include them.
  4. Cancelled — Was the flight cancelled?  Well, I just filtered out those flights, so no cancelled flights remain.  Therefore, we don’t need this column.

ProjectColumns

The Project Columns component is nice in that it gives me a drop-down list with the column names in my data set.

Separate Training From Test Data

The next thing that I want to do is add another Split Data component.  This time, I want to split a percentage of rows off into a training set (for the model) and a test set (to test the efficacy of the model).  This Split Data component has a different splitting mode:  Split rows.

SplitRows

In this case, 70% of my rows will go to the training set, with the remaining 30% held back for the test set.  I’m specifying a “random” seed because I want others to be able to replicate my work.

The Model

Adding A Model

Now that I have the data set laid out how I want, I need to hook up a model.  In this case, I’ll use a linear regression model.  In practice, we’ll want to use a model which makes sense based on our expectations, but I’m starting with a very simple model.

Select the Linear Regression component and drag it onto the canvas.  We won’t need to set any options here, although there are a couple of parameters we can tweak.

One thing to note is that our model has zero inputs and one output.  This means we don’t directly move a data set to a model.  Instead, we need to use the model as an input for something else.

Train A Model

That “something else” is a Train Model component.  We drag one of those onto the canvas and hook up the Linear Regression and Split Data as the inputs for this component.

TrainModel

On the Train Model component, I’ve selected arrival delay (ArrDelay) as the variable we’d like to explain, and Azure ML takes care of the rest.

Score the Model

We have a trained model at this point, but now it’s time to see how useful that model actually is.  To do this, we’re going to drag on a Score Model component and set its inputs.  The Score Model component requires two inputs:  a model and a data set.  We’ll use the Train Model component as our model, and the test data for our data set.  We want to use the test data because we’ve already used training data to build our model, and testing against data we’ve already used will give us a false image of how good our model really is.

Evaluate the Model

The very last component on this experiment is an Evaluate Model component.  This lets us see how well the model predicted test data.  After putting this component on the canvas, right-click on it and go to Evaluation results –> Visualize.  You’ll get a bar graph as well as a summary data set which looks like this:

EvaluateModel

For a linear regression model, the Coefficient of Determination is better known as the R-squared value; it ranges from 0 (the model explains none of the variation in the dependent variable) up to 1 (the model explains all of it).  In this case, our R-squared is a pitiful 0.009792, meaning that these inputs do a very poor job of predicting how late a flight will arrive.
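
For reference, the coefficient of determination compares the model’s squared prediction errors against the total variation in the data:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the actual arrival delay, $\hat{y}_i$ the predicted delay, and $\bar{y}$ the mean delay.  A value near 0 means the model does little better than simply predicting the mean every time.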

Conclusions

Although our first model was an abysmal failure at predicting flight delay time, it did show just how easy putting together an experiment in Azure ML can be.  We were able to cleanse a data set, project specific columns, split out training and test data, train a model, score that model, and evaluate the model using little more than a canvas.  Anyone familiar with SQL Server Integration Services or Informatica will be right at home with Azure ML.

The problem that database people will run into, however, is that although creating a model is easy, creating a good model is hard.  We can slap together a few components, but it takes more than that to generate a useful model.

Presentation Year In Review

This year was my breakout year for presenting.  Here’s a quick look by the numbers.

User Group Meetings

I presented at the following user group meetings this year:

  1. TriNUG Data SIG, January 21st.
  2. TriPASS, March 17th.
  3. Lynchburg SQL Server User Group, March 26th.
  4. TriNUG Data SIG, April 22nd.
  5. F#/Analytics SIG, July 28th.
  6. TriNUG lightning talk, December 9th.
  7. TriPASS, December 15th.

SQL Saturdays

I presented at the following SQL Saturdays this year:

  1. Cleveland, February 7th.
  2. Tampa, February 28th.
  3. Richmond, March 21st.
  4. Jacksonville, May 9th.
  5. Rochester, May 16th.
  6. Nova Scotia, June 8th.
  7. Chattanooga, June 27th.
  8. Columbus, July 11th.
  9. Indianapolis, August 8th.
  10. Omaha, August 15th.
  11. Pittsburgh, October 3rd.
  12. Raleigh, October 10th.  — Yeah, I didn’t technically present here, but I ran it, so I’m counting it!
  13. Charlotte, October 17th.

Podcasts

I participated in my first podcast this year, when Carlos Chacon interviewed me for the SQL Data Partners podcast.