ConvertFrom-Gzip

Not too long ago, I needed to figure out how many total lines I had in a series of files.  Normally, this would not be difficult:  I can use the Measure-Object PowerShell cmdlet with the -Line parameter and get a line count.  The tricky part here, though, was two-fold:  first, I had approximately 9400 files to parse; and second, all of the files were gzipped.  This meant that I would need to unzip each file before counting its lines.

Fortunately, I found a great function in the TechNet Gallery:  ConvertFrom-Gzip.  This function is well-written enough that it can read items from memory, do the unzipping in-memory, and pass the contents along the pipeline in-memory, meaning that I never needed to write an unzipped file out to disk, saving me a lot of time.  Here’s my PowerShell one-liner to solve the problem:

Get-ChildItem "C:\temp\CallLog\" -Recurse -Filter "*.gz" | ConvertFrom-Gzip | Measure-Object -Line

The thing I like most about this function is that it really fits the spirit of PowerShell—I can use it in a pipeline without needing to add extra cruft.  Note that if you want to use the function in your own code, the script is not signed, so your execution policy will need to allow unsigned scripts.
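
If you’re curious how the in-memory part works, the core pattern is to wrap the file stream in a GZipStream and read decompressed lines straight off of it.  Here’s a minimal sketch of that idea (my own simplification, not the TechNet function itself):

function ConvertFrom-GzipSketch {
    # Minimal sketch:  stream gzipped file contents line-by-line without
    # ever writing a decompressed file to disk.
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [System.IO.FileInfo]$InputObject
    )
    process {
        $fileStream = [System.IO.File]::OpenRead($InputObject.FullName)
        $gzipStream = New-Object System.IO.Compression.GZipStream($fileStream, [System.IO.Compression.CompressionMode]::Decompress)
        $reader = New-Object System.IO.StreamReader($gzipStream)
        try {
            # Emit each decompressed line onto the pipeline.
            while (-not $reader.EndOfStream) {
                $reader.ReadLine()
            }
        }
        finally {
            # Disposing the reader also closes the underlying streams.
            $reader.Dispose()
        }
    }
}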

Cross-Database Memory Optimized Queries Still Not Available

In SQL Server 2014, we learned that memory-optimized tables cannot join to tables outside the current database.  I wanted to see if this had changed in SQL Server 2016 CTP 3.2.

CREATE TABLE [dbo].[MOTest]
(
	MOTestID BIGINT IDENTITY(1,1) NOT NULL,
	ProductSubcategoryName VARCHAR(50) NOT NULL,
	CONSTRAINT [PK_MOTest] PRIMARY KEY NONCLUSTERED
	(
		MOTestID ASC
	)
) WITH(MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY)
GO

INSERT INTO dbo.MOTest
(
	ProductSubcategoryName
)
VALUES
('Mountain Bikes'),
('Road Bikes'),
('Touring Bikes'),
('Cranksets'),
('Chains'),
('Something Else');

SELECT TOP (100)
	*
FROM AdventureWorks2014.Production.ProductSubcategory ps
	INNER JOIN dbo.MOTest m
		ON m.ProductSubcategoryName = ps.Name;

My result:

(6 row(s) affected)
Msg 41317, Level 16, State 5, Line 24
A user transaction that accesses memory optimized tables or natively compiled modules cannot access more than one user database or databases model and msdb, and it cannot write to master.

In other words, we still cannot cross databases with memory-optimized tables.

Don’t let the door hit you in the ass on the way out?

Jason Cole of Bleacher Report suggests that the firing of Ray Farmer is imminent, whereas Mike Pettine will be retained for at least another season. I, most notably, turned on Ray Farmer a few weeks ago when he tried to trade Joe Thomas.

That said, I’m still not convinced the right guy is going to be fired. Pettine’s hissy fits about not playing certain guys because he didn’t like them (Justin Gilbert and Dwayne Bowe, the near-apocalypse it took to get Manziel a starting spot) rub me the wrong way. When you are as talent-starved as Cleveland, you play guys with talent, even if they pissed in your cornflakes or ran over your dog or whatever. If you want to bench them in favor of guys with approximately equal talent, okay, great, but the rotting carcass of Tramon Williams and special teams superstar Johnson Bademosi do not have talent equal to Gilbert’s, for example.

I’m hoping that, if Farmer gets fired, the new GM will work more closely with Pettine and prevent any kind of power struggle. However, I’m not optimistic. I would almost prefer that either nobody got fired or both guys got fired. Farmer and Pettine are toxic for different reasons, and there’s no way to know whether their toxicity exists only when both are around.

If Farmer is the only guy fired this offseason, I will be very unhappy. Jim O’Neil has to go. I don’t know why his pass defenses worked in 2014 but not in 2015, but the front seven was about the same. Either Phil Taylor and Jabaal Sheard were secretly the greatest players in the NFL (Sheard has performed well for the Patriots, but nobody picked up Taylor after he was cut), or there’s something he tried to change to stop the run that has blown up in his face.

If Farmer is going to be fired, he has to be fired ASAP so the new guy can get to work on the draft.

New Talk Planned

I’m introducing a new talk for 2016:  Securing SQL Server.  Here’s my abstract:

A default SQL Server installation is reasonably secure, but “reasonably secure” doesn’t cut it anymore in an era in which one bad line of code, one weak password, or one open port can result in your customer database ending up on Pastebin.  In this talk, we will look at different methods of securing a SQL Server instance, from venerable (principle of least privilege, Transparent Data Encryption) to novel (Always Encrypted, row-level security).  These tools and techniques will show us ways for developers, database administrators, and network specialists to work together to secure corporate assets and prevent CISOs from ending up on the nightly news.

My plan is to hit on a few topics, introduce some resources to learn more about the topic, and move on.  The goal is to provide an overview of some of the tools and technologies available, including:

  • Applying the principle of least privilege to SQL Server logins, users, and roles.
  • Implementing Transparent Data Encryption.
  • Implementing Always Encrypted in SQL Server 2016.
  • Implementing row-level security, both in SQL Server 2016 and in versions prior to 2016.
  • Using network segmentation to limit access to SQL Server instances.
  • Writing good, parameterized SQL to protect against SQL injection (see the sketch after this list).
  • Reducing the SQL Server surface area.
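
As a taste of that parameterization point, here is a minimal PowerShell sketch using System.Data.SqlClient.  The server, database, and Customers table are hypothetical, but the core pattern of passing user input as a SqlParameter instead of concatenating it into the query text is exactly what keeps injection attempts out:

# Hedged sketch:  the connection string and dbo.Customers table are made up
# for illustration; the SqlParameter usage is the part that matters.
$conn = New-Object System.Data.SqlClient.SqlConnection("Server=localhost;Database=Sandbox;Integrated Security=true")
$cmd = $conn.CreateCommand()
$cmd.CommandText = "SELECT CustomerName FROM dbo.Customers WHERE CustomerID = @CustomerID;"
# User input goes in as a typed parameter, never into the query string itself.
$cmd.Parameters.Add("@CustomerID", [System.Data.SqlDbType]::Int).Value = 42
$conn.Open()
try {
    $reader = $cmd.ExecuteReader()
    while ($reader.Read()) {
        $reader["CustomerName"]
    }
}
finally {
    $conn.Close()
}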

There are entire books on SQL Server security, so a one-hour talk can’t cover everything.  Nonetheless, my intent with this talk is to give everybody who attends at least one new resource.

The complete clusterfuck of the Browns season

I know I owe you guys some HOF coverage — it is my shtick, after all — but it may have to wait until I find some spare time to write. In the meantime, I feel the need to post about the Browns because Kevin is whining about the Bills “only” being 7-9. Whiner.

The Browns are, at present, 3-11. Our last two games are at Kansas City and home vs. the Steelers. I do not expect either of those to be wins.

The Good

— The offense, especially the passing game, has worked fairly well given a lack of playmakers and tiny receivers.

— The offensive line continues to be a strength, except for Cameron Erving (see below).

— I’m pleasantly surprised by Duke Johnson; I don’t know why he doesn’t get more carries, but he’s a nice receiving threat. I think his upside is Brian Westbrook.

— Johnny Manziel has played well when he’s been on the field. He doesn’t look overwhelmed, even in bad situations. I wish they’d let him run a little more, because that’s his strength, but I’m no longer convinced his size is an issue.

— Gary. Fuckin’. Barnidge.

— Travis Coons is an accurate kicker when he isn’t getting blocked.

— Andy Lee is a great punter.

— Karlos Dansby and Donte Whitner are pretty much the only reason the defense isn’t dead last. Nate Orchard is a nifty little pass rusher. Paul Kruger tries really hard; sometimes he is successful, sometimes he is not. The defensive line needs a lot of work, but there’s some talent there, especially Shelton. Shelton was a first round pick; because he’s a defensive tackle in a 3-4 defense, his stats will almost never be impressive. Watching him, I rarely notice him. That’s both good and bad.

The Bad

— Pretty much everything else about the defense. The secondary is godawful. Pettine refused to play Gilbert because reasons, then Gilbert got hurt. There’s no depth. Tramon Williams is a waste of a roster spot at this point. Tashaun Gipson hasn’t looked great all year. Our linebackers get hurt. Etc.

— Isaiah Crowell couldn’t break a tackle to save his life. He usually gets positive yardage, but unless the offensive line makes a big hole for him, he won’t get more than four yards or so.

— Cameron Erving is young, but gets bowled over consistently by opposing defensive linemen. Every time Coons has been blocked (three times this season), he’s been knocked on his ass.

— Jim O’Neil seems strangely reluctant to blitz. The Browns don’t have a lot of speed on defense, but you can’t keep rushing three or four and expect to get to most QBs in the NFL. I highly doubt he keeps his job; either he and Pettine will be fired, or Pettine will throw O’Neil under the bus.

— Ray Farmer and Mike Pettine need to end this bullshit power struggle, or both of them should be fired. Pettine is benching random players (Dwayne Bowe) and punishing others (Johnny Manziel, Justin Gilbert) for reasons only apparent to him. Farmer has made some serious mistakes in the first round, true, but burying Gilbert for no reason doesn’t help. They either need to get on the same page or one or both needs to go.

The Draft

Cleveland is going to get a top five pick, maybe better. Here are the positions we should not draft:

QB, K, P

If Alex Mack opts out, we’ll need a new center. Either we shift over Erving or pick up somebody new. Mitchell Schwartz is probably also gone. That makes offensive line an easy target. We also need wide receivers and defensive backs. And linebackers. And defensive linemen. And probably safeties (if we don’t re-sign Gipson, and Whitner is aging). Maybe another running back if Crowell is unfixable and/or Duke can’t pick up the slack.

I do not want Cleveland to draft a QB. Not in the first round, not in the seventh round. Unless Manziel totally shits the bed over the next two weeks (and I thought he did okay against Seattle), he’ll start next season, especially if Pettine gets fired. None of the QB prospects in this draft intrigue me. Paxton Lynch has some upside, especially since he’s huge with a good arm. Jared Goff isn’t very exciting. Connor Cook doesn’t complete enough passes.

If Manziel does shit the bed, I could see Cleveland taking Lynch. Maybe.

I would vastly prefer we go either defensive or offensive line. Wide receiver is a position of need, but I’m not convinced that it’s worth spending a top five pick on one. I like Michael Thomas from OSU, who is large but still fast (which Cleveland desperately needs). Again, though, is he top five worthy?

We’ll have to see where we’re at once the draft rolls around.

Buffalo’s Failed Season

Coming into 2015, Buffalo fans were hopeful that Rex Ryan would be the difference maker and lead Buffalo to their first playoff bid this millennium.  And if you had told me that Buffalo would have the 7th-rated pass offense and 4th-rated rush offense, I’d say there would be a good chance that they were fighting the Patriots for the lead in the AFC East…but instead, the Bills are 6-8.

Here’s a quick rundown of what went wrong and what has gone right this year:

What Went Wrong

  1. Stupid penalties.  Buffalo has the most penalty yards in the NFL.  Against the Redskins, they had “only” 8 penalties for 56 yards.  When you’re averaging 9.5 penalties for 84 yards a game, you are automatically giving away points to your opponents.  Coming into week 15, the Bills had almost twice as many penalty yards as the Arizona Cardinals.  Good teams do not commit that many penalties.
  2. Coming into the Redskins game, the Bills had the 24th-rated defense according to DVOA, and the 35-point thumping will probably push them down a little bit.
  3. Specifically, on defense, the pass rush has been non-existent.  Buffalo came into the Washington game 30th in the league in sacks, bringing the quarterback down a mere 19 times.  By contrast, the team had 54 sacks in 2014, including 14.5 by Mario Williams alone.  This year, Williams and Jerry Hughes have (coming into week 15) 4 apiece.
  4. This defense is a Jim Schwartz defense, not a Rex Ryan defense.  I think that speaks negatively of Ryan rather than acting as an excuse for him.  Mike Pettine and Jim Schwartz both found great success with this defense.
  5. Injuries.  Top cornerback Stephon Gilmore was injured coming into the season and needed a few weeks to shake off the rust.  Top safety Aaron Williams missed pretty much the entire season, leaving the Bills in the not-so-capable hands of Bacarri Rambo.  Tyrod Taylor missed two games, both of which EJ Manuel lost.  LeSean McCoy and Karlos Williams have missed time, and McCoy has played hurt as well.  We also saw Kyle Williams, Ty Powell, and Tony Steward miss significant time.
  6. Percy Harvin was a major bust.  Instead of being a rushing, receiving, and special teams threat, Harvin played five games and scored just one touchdown.  With his failure, Robert Woods and Chris Hogan needed to step up, and neither really did.

What Went Right

  1. Tyrod Taylor is the biggest pleasant surprise for this year’s Bills.  Tyrod started camp as the #3 quarterback, won the starting job, and responded with a great season.  Through week 15, Taylor has nearly 2700 yards passing and over 400 yards rushing.  His 20-5 TD-INT ratio is excellent, although his 63.5% completion percentage isn’t fantastic.
  2. Karlos Williams has also stepped up in a noticeable way.  Williams only has 417 rushing yards on the season, but he scored in his first five games of the season and averaged more than 5.7 yards per rush.  Matching him up with a healthy LeSean McCoy gives the Bills a powerful 1-2 tandem.
  3. Charles Clay and Sammy Watkins are good receivers.  After a weak start to the season, Watkins has really turned it on over the past four games, averaging over 100 yards per start with 5 TDs during that stretch.  He’s got a shot at 1000 yards, which he just missed in 2014.  Clay, meanwhile, has a solid 528 yards on the season.  It’s not quite as good as what he did as a Dolphin, and he’s almost certainly not worth the top-shelf TE cap hit, but Clay has been a dependable part of this Bills offense.
  4. Ronald Darby is a serious defensive rookie of the year candidate.

The most likely end game for Buffalo is 7-9, but they do have a chance to scrape out an 8-8 season, which would be a disappointment but at least not as bad as a losing season.  We’re a long way from the off-season, but it’s already time to start thinking about what will make this Bills club more competitive next year.  If you wanted quick answers right now, I’d start with a linebacker or two, a good right tackle, and a good wide receiver to give Tyrod Taylor another pass catching threat.

Fixing Double-Typed Letters On A Lenovo Laptop

I recently purchased a Lenovo Yoga 900.  Overall, I like it a lot, although there are a couple of irksome things that I’ll probably cover in a detailed blog post.

One big problem that I had with it is that, on occasion, the “s” and “l” keys would double-enter when I struck them.  For example, typing “is” would, about 30% of the time, end up entering “iss” instead.  It seems that only those two keys were affected, but that could just have been my particular machine.

Fortunately, a bit of research led me to a similar issue and, more importantly, a solution.  I changed my registry settings to match Yael’s and that seems to do the trick.

Azure ML: Custom Data Sets

Last time around, I built a simple Azure ML experiment based on flight delay data made available to us as a Microsoft sample.  Today, I want to load a custom data set and build an experiment against it.

The Data Set

I’m using SQL Saturday data for this demo.  I came up with a set of data for each SQL Saturday.  This comma-delimited data file includes:

  1. City name
  2. Region.  This is something I personally coded, breaking each SQL Saturday down into the following regions:  Southeast, Mid-Atlantic, Northeast, Midwest, South, West, Canada, Latin America, Europe, Africa, and Asia.
  3. Date
  4. Month
  5. International.  1 if this event took place outside of the United States.
  6. Personal Connection.  1 if I personally knew at least one of the speaker committee members.
  7. Spoke.  1 if I spoke at this event.

My plan is to regress Spoke against Region, Month, International, and Personal Connection to see if there is a meaningful pattern.  (Hint:  in reality, there is.)
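
For concreteness, a couple of rows in that format might look like this (hypothetical values, not lines from my actual file):

City,Region,Date,Month,International,PersonalConnection,Spoke
Cleveland,Midwest,2015-02-07,2,0,1,1
Berlin,Europe,2015-06-13,6,1,0,0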

Uploading A Data Set

Uploading a data set to Azure ML is easy.  In the Azure ML Studio, go to My Experiments.  Then, on the left-hand side of the screen, click the “Datasets” option.  To add a new dataset, click the [+] NEW button at the bottom and select Dataset –> From Local File.

NewDataset

Doing this brings up a modal dialog:

UploadDataset

Uploading my hundred-row dataset took a few seconds, and now it’s available in the datasets section.

MyDatasets

Now let’s experiment.

Create The Experiment

Our experiment is going to look similar to the one we did yesterday:  a simple linear regression.  Because the structure is fundamentally the same, I won’t walk through it in as much detail as I did yesterday.

Build the Data

I dropped my 2015 SQL Saturdays custom dataset onto the canvas.  After that, I dragged on a Project Columns component and selected the columns I want to use in my regression model:  Region, Month, International, PersonalConnection, and Spoke.

SQLSaturdayColumns

From there, I dragged on a Split Data component and split the rows 70-30 using random seed 21468.

Build the Model

My model is a Linear Regression, so I dragged that onto the canvas.  I also dragged on a Train Model component, a Score Model component, and an Evaluate Model component.  The only thing that needed tweaking was the Train Model component, whose selected column is “Spoke” because I want to try to determine whether I’m likely to speak at a particular event.  Otherwise, the final model looks like:

SQLSaturdayModel

Check the Model

When I ran this model, I got back my results in approximately 30 seconds.  My model results show an R^2 of 0.48, which is pretty good for a social sciences result.

SQLSaturdayModelResults

An interesting line of questioning would be, what would improve the R^2 here?  Maybe variables around how many other events there were, the size of the event, and whether I had just gone to a SQL Saturday the week before would help, but I’m satisfied with these results.  So now I’m going to go the next step:  setting up a web service.

Setting Up A Service

Setting up a web service is pretty easy:  click the “Set Up Web Service” button at the bottom of the dashboard.

SetUpWebService

Doing this changes our canvas and gives us a new tab:  Predictive experiment.

PredictiveExperiment

After running that service, you can then click “Deploy Web Service” at the bottom of the screen to deploy it.  From there, you’ll get a web service screen which gives you information on how to connect to the web service.  It also gives you information on how to test the web service.
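
To give an idea of what consuming the service looks like, here is a hedged PowerShell sketch of calling a classic Azure ML request-response endpoint.  The URI, API key, and input schema below are placeholders; the web service page shows the real values (and generates sample code) for your specific service:

# Hedged sketch:  $apiKey and $uri are placeholders, and the input column
# names assume the SQL Saturday experiment described above.
$apiKey = "YOUR_API_KEY"
$uri = "https://REGION.services.azureml.net/workspaces/WORKSPACE_ID/services/SERVICE_ID/execute?api-version=2.0&details=true"

$body = @{
    Inputs = @{
        input1 = @{
            ColumnNames = @("Region", "Month", "International", "PersonalConnection")
            Values = @(, @("Midwest", "2", "0", "1"))
        }
    }
    GlobalParameters = @{}
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Method Post -Uri $uri -ContentType "application/json" `
    -Headers @{ Authorization = "Bearer $apiKey" } -Body $body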

Testing the Web Service

I want to do seven tests for the web service:

  1. Cleveland in February, and I know the committee.  I expect “Spoke” to be very close to 1, meaning nearly a 100% likelihood of speaking.  Actual Result:  0.73.
  2. Baltimore in August, and I know the committee.  I expect “Spoke” to be somewhere around 0.5 because I do some Mid-Atlantic SQL Saturdays but not too many of them.  Actual Result:  0.52.
  3. Dallas, Texas in May.  I don’t know the committee.  I expect “Spoke” to be relatively close to 0 because I didn’t do any SQL Saturdays in the South.  Actual Result:  0.009.
  4. Kansas City in October.  I know the committee.  I expect “Spoke” to be relatively close to 0, but that’s because the model is missing some information (like how delicious the barbecue crawl is).  In reality, I want to go and would expect the probability to be closer to 1.  Actual Result:  0.67.
  5. Berlin in June.  I don’t know the committee.  I expect “Spoke” to be very close to 0.  Actual Result:  0.01.
  6. Raleigh in October.  I am the law.  I expect “Spoke” to be nearly 1.  Actual Result:  0.91.
  7. Raleigh in December.  I am the law.  I expect “Spoke” to be high, but less than case 6.  Actual Result:  0.898.

TestResults

In these cases, the model was accurate 7 out of 7 times, and it actually tripped me up once when I expected failure (because I had actually missed Kansas City in October).  If you’re a harsh grader, though, I did expect Cleveland to be higher than 0.73.

Conclusions

This is obviously a toy model with a small data set, but it shows the power of what you can do with Azure ML.  With just a few clicks, you can be up and running with a data set, and after building a web service, you can easily consume it in your own application.  This significantly shortens the development time necessary to go from a basic plan to a production process.

Azure ML: Getting Started

Machine Learning Setup

Go to the old Azure portal.  Then, go to New –> Data Services –> Machine Learning –> Quick Create.  You’ll need an Azure account and an Azure storage account.

CreateML

Once we have our workspace, we can connect to Azure ML Studio to continue.

Build An Experiment

We want to build the following Azure ML experiment:

FirstAzureMLModel

Let’s build it one step at a time.  First, click on the [+] NEW link and select Blank Experiment.

BlankExperiment

This will give you a blank canvas.  We want to look for Flight Delay Data, which is a sample data set Microsoft provides.  We’re going to take this data set and drag it onto the canvas.

FlightDelaysData

Visualize The Data

Before we build a model, we need to understand the data set to get an idea of what kind of information is available to us.  In this case, we get a set of data dealing with flight delays, but to learn a little bit more about this data set, we can right-click on the set and go to Dataset –> Visualize.  That will let us get some basic details:

FlightDelayVisualization

In this data set, we have a few attributes like year of flight, day of flight, carrier, origin airport ID, and destination airport ID.  We also have a few metrics, including departure delay, arrival delay, and whether the flight was cancelled.  I clicked ArrDelay (arrival delay) and got some basic statistics and a bar chart.  This result set shows that the average flight (mean) is 6.6 minutes late, but the median flight is 3 minutes early.  In other words, the distribution skews right:  most flights arrive close to on time, but a minority of very late flights pulls the mean up.

Determine the Goal

My goal here is to see if I can predict how late a flight will be based on this data set.  If I can make a good prediction, I might be able to use it for my personal goals—figuring out my flight schedule a little bit better.

In other words, what I want to do is predict the arrival delay (ArrDelay).  The rest of my experiment will go toward that goal.

Add More Components

Ignore Cancelled Flights

The first thing I want to do is throw out cancelled flights.  I expect that those flights won’t be marked as delayed (because they were cancelled), and so I don’t want them to throw my predictions off.

The way that I can do this is to drag on a Split Data component and select “Relative Expression” as its Splitting mode.

SplitData

I also need to set a Relational expression, specifying that I want to split the data based on whether Cancelled is greater than 0.  As a quick note, my first attempt with this Split Data component had me set \"Cancelled" = 0, but that caused an error when I tried to run the ML experiment, so instead of using an equality operator, I needed to use an inequality operator.

The Split Data component gives you two outputs, but I only want to use the second, non-cancelled flights output.

Specify Columns

I’ve decided what my goal is, and now I need to throw out any unnecessary columns so that the model I choose doesn’t try to use them.  In my case, I want to throw out the following columns:

  1. CRSDepTime and CRSArrTime — These are the scheduled departure and arrival times, respectively.  I could make use of them, but I figured they’d have little predictive power.
  2. DepDelay — The number of minutes late the flight is at departure.  This probably correlates pretty well to arrival lateness, but I want to see if the other columns predict late arrival times, as I won’t know how late the flight departs until I’m on the plane.
  3. DepDel15 and ArrDel15 — These tell us whether the flight was at least 15 minutes late for departure or arrival.  They’re just a grouping mechanism for DepDelay and ArrDelay, so I don’t want my analysis to include them.
  4. Cancelled — Was the flight cancelled?  Well, I just filtered out those flights, so no cancelled flights remain.  Therefore, we don’t need this column.

ProjectColumns

The Project Columns component is nice in that it gives me a drop-down list with the column names in my data set.

Separate Training From Test Data

The next thing that I want to do is add another Split Data component.  This time, I want to split a percentage of rows off into a training set (for the model) and a test set (to test the efficacy of the model).  This Split Data component has a different splitting mode:  Split rows.

SplitRows

In this case, 70% of my rows will go to the training set, with the remaining 30% held back for the test set.  I’m specifying a “random” seed because I want others to be able to replicate my work.

The Model

Adding A Model

Now that I have the data set laid out how I want, I need to hook up a model.  In this case, I’ll use a linear regression model.  In practice, we’ll want to use a model which makes sense based on our expectations, but I’m starting with a very simple model.

Select the Linear Regression component and drag it onto the canvas.  We won’t need to set any options here, although there are a couple of parameters we can tweak.

One thing to note is that our model has zero inputs and one output.  This means we don’t directly move a data set to a model.  Instead, we need to use the model as an input for something else.

Train A Model

That “something else” is a Train Model component.  We drag one of those onto the canvas and hook up the Linear Regression and Split Data as the inputs for this component.

TrainModel

On the Train Model component, I’ve selected arrival delay as the variable we’d like to explain, and Azure ML takes care of the rest.

Score the Model

We have a trained model at this point, but now it’s time to see how useful that model actually is.  To do this, we’re going to drag on a Score Model component and set its inputs.  The Score Model component requires two inputs:  a model and a data set.  We’ll use the Train Model component as our model, and the test data for our data set.  We want to use the test data because we’ve already used training data to build our model, and testing against data we’ve already used will give us a false image of how good our model really is.

Evaluate the Model

The very last component on this experiment is an Evaluate Model component.  This lets us see how well the model predicted test data.  After putting this component on the canvas, right-click on it and go to Evaluation results –> Visualize.  You’ll get a bar graph as well as a summary data set which looks like this:

EvaluateModel

For a linear regression model, the Coefficient of Determination is better known as the R-squared value, and it ranges from 0 (no correlation between independent and dependent variables) up to 1 (absolute correlation).  In this case, our R-squared is a pitiful 0.009792, meaning that we cannot predict how late a flight will arrive based solely on the inputs we kept.
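
For reference, R-squared compares the model’s squared errors against the overall variance of the data.  This is the standard definition, with y_i the actual arrival delay, \hat{y}_i the model’s prediction, and \bar{y} the mean arrival delay:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

An R-squared near zero means the model’s predictions do barely better than guessing the mean delay every single time.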

Conclusions

Although our first model was an abysmal failure at predicting flight delay time, it did show just how easy putting together an experiment in Azure ML can be.  We were able to cleanse a data set, project specific columns, split out training and test data, train a model, score that model, and evaluate the model using little more than a canvas.  Anyone familiar with SQL Server Integration Services or Informatica will be right at home with Azure ML.

The problem that database people will run into, however, is that although creating a model is easy, creating a good model is hard.  We can slap together a few components, but it takes more than that to generate a useful model.