Getting Started With Azure Machine Learning

I’m going to be giving a talk on Azure Machine Learning for my employer, so you’ll probably see a few ML-related topics on the blog as I put together some thoughts.

My first post on the topic will be linking to David Crook’s ML talk.  His slides and code are freely available.  David moves a bit quickly in this video, but because he has made everything available, you can work through the problem at your own pace.

satRdays

Steph Locke is starting up satRdays, the R equivalent to SQL Saturdays.  Right now, she’s still in the very early stages of planning, but I’m already excited.  The RTP area is full of analysts (if you’re interested, join the Research Triangle Analysts Meetup group and attend some of their meetings), and I think we could get a conference of three to five tracks going in 2016.  Once things settle down a little bit, I plan to get involved with satRdays and see if I have enough time and enough interested people to get another annual event in RTP.

Making Music With SQL Server

This is an old post, but I’m turning Thomas Rushton’s idea into a lightning talk for the .NET User Group.  Basically, the idea is that you can call Console.Beep with different pitches and for different lengths of time, with the end result being music.

I’m also going to demo the Super Mario Brothers theme.  Here’s a C# version.  The quick translation to PowerShell is:

[console]::beep(659, 125);
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(523, 125);
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(784, 125);
Start-Sleep -m 375;
[console]::beep(392, 125);
Start-Sleep -m 375;
[console]::beep(523, 125);
Start-Sleep -m 250;
[console]::beep(392, 125);
Start-Sleep -m 250;
[console]::beep(330, 125);
Start-Sleep -m 250;
[console]::beep(440, 125);
Start-Sleep -m 125;
[console]::beep(494, 125);
Start-Sleep -m 125;
[console]::beep(466, 125);
Start-Sleep -m 42;
[console]::beep(440, 125);
Start-Sleep -m 125;
[console]::beep(392, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(784, 125);
Start-Sleep -m 125;
[console]::beep(880, 125);
Start-Sleep -m 125;
[console]::beep(698, 125);
[console]::beep(784, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(587, 125);
[console]::beep(494, 125);
Start-Sleep -m 125;
[console]::beep(523, 125);
Start-Sleep -m 250;
[console]::beep(392, 125);
Start-Sleep -m 250;
[console]::beep(330, 125);
Start-Sleep -m 250;
[console]::beep(440, 125);
Start-Sleep -m 125;
[console]::beep(494, 125);
Start-Sleep -m 125;
[console]::beep(466, 125);
Start-Sleep -m 42;
[console]::beep(440, 125);
Start-Sleep -m 125;
[console]::beep(392, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(784, 125);
Start-Sleep -m 125;
[console]::beep(880, 125);
Start-Sleep -m 125;
[console]::beep(698, 125);
[console]::beep(784, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 125;
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(587, 125);
[console]::beep(494, 125);
Start-Sleep -m 375;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(415, 125);
[console]::beep(440, 125);
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(440, 125);
[console]::beep(523, 125);
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(698, 125);
Start-Sleep -m 125;
[console]::beep(698, 125);
[console]::beep(698, 125);
Start-Sleep -m 625;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(415, 125);
[console]::beep(440, 125);
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(440, 125);
[console]::beep(523, 125);
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(622, 125);
Start-Sleep -m 250;
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(523, 125);
Start-Sleep -m 1125;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(415, 125);
[console]::beep(440, 125);
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(440, 125);
[console]::beep(523, 125);
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(698, 125);
Start-Sleep -m 125;
[console]::beep(698, 125);
[console]::beep(698, 125);
Start-Sleep -m 625;
[console]::beep(784, 125);
[console]::beep(740, 125);
[console]::beep(698, 125);
Start-Sleep -m 42;
[console]::beep(622, 125);
Start-Sleep -m 125;
[console]::beep(659, 125);
Start-Sleep -m 167;
[console]::beep(415, 125);
[console]::beep(440, 125);
[console]::beep(523, 125);
Start-Sleep -m 125;
[console]::beep(440, 125);
[console]::beep(523, 125);
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(622, 125);
Start-Sleep -m 250;
[console]::beep(587, 125);
Start-Sleep -m 250;
[console]::beep(523, 125);
Start-Sleep -m 625;

And let’s not stop there:  play the Imperial March if you wish.

Capturing Deadlocks With Extended Events

Extended Events is the replacement for Profiler.  There are a number of advantages to XEs:  they are much more lightweight than server-side traces, they can capture more information, and there are more methods for storing this information (including a ring buffer and writing out to disk).  My biggest problem with Extended Events is that even with the GUI, it’s still easier to set up a Profiler trace than futz about trying to set up an XE.

Deadlocks Are Different

Deadlocks are a different story:  the default system_health Extended Events session in SQL Server already tracks them.  You can get to this session pretty easily in SQL Server Management Studio by connecting to your instance and going to Management -> Extended Events -> Sessions -> system_health.  Inside this session, there are two targets:  a ring buffer which keeps track of recent events, and an event_file which holds a bit more detail on past events.  Depending upon how busy your server is, that event file might go back several days, or maybe just a few hours (or minutes on a very busy server).

[Screenshot: system_health Extended Events session]

Double-click on one of the session targets, depending upon whether you want to see recent data (ring buffer) or older data (event file).  Once you do that, you’ll see the Extended Events viewer, and SSMS will show a Filters button in the Extended Events menu.  Click the Filters button to enter a filter.

[Screenshot: Extended Events menu bar]

Select “name” from the dropdown and set its value equal to xml_deadlock_report.

[Screenshot: Extended Events filter]

Once you’re done with this, you’ll see only XML deadlock reports.  You can grab the XML and also see the deadlock graph.

[Screenshot: deadlock results]

Conclusion

Once you have deadlock graphs and you know how to read them, you can use that information to fix your deadlocking issues.

Jitter And Color In R

While working through Practical Data Science with R, I picked up on a jitter function in chapter 3 (free download).  I’m going to explore that jitter function a bit in this post.

Setup

If you want to follow along, grab the data set from the authors’ GitHub repo.  The relevant data file is in Custdata\custdata.tsv.  Custdata is an artificial set of data which is supposed to give a variety of data points to make graphing and analyzing data a bit easier.  This data set has 1000 observations of 11 variables, and the idea is to plot health insurance likelihood against certain demographic characteristics.  In this case, we’ll look at marital status.
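Here is a minimal sketch of loading the file, assuming you have downloaded the repo and set your working directory to its root.  The load_custdata helper name and the default path are my own choices for illustration, not something the authors provide.

```r
# Hypothetical helper: read the tab-separated custdata file into a data frame.
# The default path assumes the working directory is the root of the authors' repo.
load_custdata <- function(path = file.path("Custdata", "custdata.tsv")) {
  read.table(path, header = TRUE, sep = "\t")
}

# Only attempt the load when the file is actually present.
if (file.exists(file.path("Custdata", "custdata.tsv"))) {
  custdata <- load_custdata()
  str(custdata)  # the post expects 1000 observations of 11 variables
}
```

If the file lives somewhere else, pass the full path to load_custdata instead of relying on the default.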

Different Bar Charts

It’s pretty easy to build bar charts in R using ggplot2.  Here’s a simple example using the custdata set:

library(ggplot2)
ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins))

The resulting bar chart looks like this:

[Figure: stacked bar chart]

This type of bar chart helps us see overall trends, but makes it difficult to see the details within a marital status type.  To do that, we can create a side-by-side bar chart by changing one little bit of code:  add position="dodge" to the geom_bar function, like so:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge")

The resulting bar chart looks like this:

[Figure: side-by-side bar chart]

This gives us a much clearer look at the relative breakdown across all marital status and insurance combinations, but we lose the knowledge we gained looking at the first bar chart.

At this point, I’m going to take a brief aside.

Color

Color selection is vital to graphical design.  We want colors which complement the medium—be that a monitor, a trade magazine, or a newspaper.  We also need to be cognizant that people interpret colors differently:  red-green colorblindness is the most common form, but there are several forms and we want to design with colorblind individuals in mind.  That means that the red-yellow-green combo probably doesn’t work as well as you’d like.

In Cookbook for R, there’s a good colorblind-friendly palette which works for us.
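For reference, here is that palette as an R vector.  The hex codes are the ones Cookbook for R lists for its palette with grey; the element names and the cb_palette variable name are my own additions for readability.

```r
# Colorblind-friendly palette (with grey), as listed in Cookbook for R.
# The element names are added here for readability; they are an assumption of this sketch.
cb_palette <- c(
  grey = "#999999", orange = "#E69F00", sky_blue = "#56B4E9", bluish_green = "#009E73",
  yellow = "#F0E442", blue = "#0072B2", vermillion = "#D55E00", reddish_purple = "#CC79A7"
)

# Pull out two colors by name.
# unname() matters: scale_fill_manual() matches a *named* vector against the data values.
two_colors <- unname(cb_palette[c("vermillion", "blue")])
print(two_colors)
```

Naming the entries is just a convenience so that I don’t have to remember which hex code is which.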

Picking Our Own Colors

In the graphs above, I let R choose the colors for me.  That can work for some scenarios, but graphic designers will want to choose their own colors.  Fortunately, doing this in ggplot2 is very easy, especially with factor variables.  All we need to do is add on a scale_fill_manual function.  Here’s our side-by-side bar chart using blue and vermillion:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("#D55E00","#0072B2"))

In this case, I specified hex codes for the colors.  If I wanted to use simpler colors, I could just write “blue” and “red,” but these won’t be quite the same colors.

Here is the new bar chart with our selected colors:

[Figure: side-by-side bar chart with selected colors]

The chart shows the same data as before, but the new colors make it much more likely that a colorblind person will be able to read it.
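To see how far the hex codes are from plain “blue” and “red,” here is a quick sketch using base R’s col2rgb function.  The variable names are mine, purely for illustration.

```r
# Compare the hex codes from the chart with the plain named colors (base R, grDevices).
chosen <- c(vermillion = "#D55E00", blue = "#0072B2")
named  <- c(red = "red", blue = "blue")

rgb_chosen <- col2rgb(chosen)  # 3 x 2 matrix: rows red/green/blue, one column per color
rgb_named  <- col2rgb(named)

print(rgb_chosen)
print(rgb_named)
```

The palette’s blue mixes in a fair amount of green and tones down the blue channel, which is why it doesn’t look like the pure “blue” you get by name.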

Fill Bars And Jitter

A third type of bar chart is the fill bar.  The fill bar lets us see very clearly the percentage of individuals with health insurance across different marital statuses:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="fill") + scale_fill_manual(values=c("#D55E00","#0072B2"))

Notice that the only thing we did was change the position to “fill” and we have a completely different chart. Here is how the fill bar chart looks:

[Figure: fill bar chart]

The problem with a fill bar chart is that we completely lose context of how many people are in each category. The authors re-introduce this concept with the rug: a set of points below each bar that shows relative density. In order to do this, we add on a geom_point function that looks like so:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) +
  scale_fill_manual(values=c("#D55E00","#0072B2"))

Here is how the chart looks:

[Figure: fill bar chart with rug]

The geom_point function has all of our changes. It starts by moving y down to -0.05, which is far enough down on the plot that we can see those points without them interfering with our bar chart. The size and alpha values are set to maximize visibility, especially for the Married option. You can see that the Married option is densest, but there’s still some gap in there. The last option is position_jitter, a function which “jitter[s] points to avoid overplotting.” Basically, without jittering the points, we would have the same points overlaying one another, but by jittering the height in this case, we distribute the points across a wider space, letting us see the sheer number more clearly than otherwise.
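To make the idea concrete, here is a rough sketch of what jittering does to a set of identical values.  This is a simplified stand-in I wrote in base R, assuming uniform noise; jitter_uniform is a hypothetical helper, not ggplot2’s own code.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Simplified stand-in for the vertical part of position_jitter(h = 0.01):
# shift every value by uniform noise drawn from [-amount, amount].
jitter_uniform <- function(values, amount) {
  values + runif(length(values), min = -amount, max = amount)
}

y_raw      <- rep(-0.05, 1000)            # every rug point starts at the same height
y_jittered <- jitter_uniform(y_raw, 0.01)

length(unique(y_raw))  # 1: all of the points sit on top of one another
range(y_jittered)      # the jittered points spread out between roughly -0.06 and -0.04
```

The total spread is twice the amount you specify, because the noise goes in both the positive and negative directions.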

The position_jitter function gives us two options: width and height. In the example above, we see what it looks like with a height value of 0.01 but no width specified. In the next chart, I’m overlaying different jitter width levels on our plot. You can see that the default looks closest to a width of 0.4, which fits the ggplot2 documentation: an omitted width or height defaults to 40% of the resolution of the data.

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) +
  geom_point(aes(y=0.05), size=0.75, alpha=0.3, position=position_jitter(w=0.4,h=0.01)) +
  geom_point(aes(y=0.25), size=0.75, alpha=0.3, position=position_jitter(w=0.3,h=0.01)) +
  geom_point(aes(y=0.45), size=0.75, alpha=0.3, position=position_jitter(w=0.2,h=0.01)) +
  geom_point(aes(y=0.65), size=0.75, alpha=0.3, position=position_jitter(w=0.1,h=0.01)) +
  scale_fill_manual(values=c("#D55E00","#0072B2"))

Here’s the resulting plot:

[Figure: fill bar chart with different jitter widths]

Of course, another option we could use would be to print the total number of elements in each section…but that’s for another day.

Conclusion

Different types of bar charts have different advantages and disadvantages.  Each provides us a certain view of the data, but the downside to each view is that we lose some other aspects.  Which graphs we show will depend upon the message we want to send, but it’s important to realize that there are methods to get around some of the deficiencies in certain graphs, such as generating a rug using the jitter function.

Also, be cognizant of color choices.  I’m not always good at that, but it’s something I want to think more about as I visualize more data with R.

Podcasts I’m Enjoying

With winter comes colder temperatures, and that means I have to put the top up on the Miata.  That is unfortunate for many reasons, but one of the plus sides is that I can comfortably listen to podcasts in my car, squeezing in a few extra hours of learning each week.

Here are the main podcasts I’m listening to today, in alphabetical order:

  • Away From the Keyboard.  Richie Rump and Cecil Phillip keep it a bit lighter, focusing more on stories than hardcore learning.  It’s a good podcast for unwinding after a long day.
  • Paul’s Security Weekly.  This used to be pauldotcom but whatever the name, Paul Asadoorian & co do a great job making security news entertaining.
  • SQL Data Partners podcast.  Carlos Chacon started podcasting not too long ago, and almost all of his podcasts are in interview format.
  • WOxPod.  Chris Bell also has interview-style podcasts, but he mixes in contemplative monologues as well.

At one point in time, I had dozens of podcasts, but it got to be too much.  I’m starting over again with these four.

Pluralsight Reviews: Ethical Hacking: Reconnaissance/Footprinting

This review covers Dale Meredith’s Ethical Hacking:  Reconnaissance/Footprinting.  The course follows the Certified Ethical Hacker material on the topic pretty closely, and I think Meredith’s rendition has many of the same benefits and flaws that I found with the CEH literature.

The course is 3 1/2 hours long, so it might take a couple plane flights to get through this, and fortunately, it’s not the type of course where you need to be sitting in front of a computer typing away with the instructor to get anything out of it.  Meredith covers topics at a fairly high level and then drills into tools and techniques for collecting information on a target.  The primary emphasis of this course is looking at methods with no or low visibility to defenders, such as doing Google searches, researching employees through social media sites, looking at the public website, and finding ambient information from sources like job postings, news articles, and company blogs.  This is recon, so we’re collecting information that we’ll use later in a penetration test.  Meredith does show off some tools like WinHTTrack, and running that tool might raise the ire of a capable defender, but otherwise it’s smooth sailing for an attacker.  Meredith finishes up the course discussing methods to mitigate exposure in public records.

I’m assuming that many of the people watching this course are preparing for the CEH, and I think Meredith tailored his material to the topic.  The big downside to this is that the CEH hits you with lots of tools and techniques all at once, with relatively little focus on any of them.  I think Meredith did a good job focusing on some of the most important of these (such as Googledorks), but it felt like there were too many things happening all at once in this course, and there were times when I really wished we could have seen a deep dive on more techniques than just Googledorks.  I hope that the later classes in this series offer much deeper looks at tools like nmap rather than a shotgun blast of software (which is how the CEH presents a lot of its material).

My Personality Insights and Tone Analyzer

Based on Kevin’s experiment from a few days ago, I decided to try out the personality analyzer. I started with a section of the first chapter of my dissertation and got the following:

You are inner-directed, skeptical and can be perceived as insensitive.

You are philosophical: you are open to and intrigued by new ideas and love to explore them. You are calm under pressure: you handle unexpected events calmly and effectively. And you are calm-seeking: you prefer activities that are quiet, calm, and safe.

You are motivated to seek out experiences that provide a strong feeling of organization.

You are relatively unconcerned with taking pleasure in life: you prefer activities with a purpose greater than just personal enjoyment. You consider achieving success to guide a large part of what you do: you seek out opportunities to improve yourself and demonstrate that you are a capable person.

To be as accurate as possible, I did the same with a later section, and I got this:

You are shrewd, skeptical and tranquil.

You are adventurous: you are eager to experience new things. You are imaginative: you have a wild imagination. And you are independent: you have a strong desire to have time to yourself.

You are motivated to seek out experiences that provide a strong feeling of prestige.

You are relatively unconcerned with both tradition and taking pleasure in life. You care more about making your own path than following what others have done. And you prefer activities with a purpose greater than just personal enjoyment.

I think these are absolutely fair points to make: shrewd, skeptical, inner-directed, and tranquil fit me to a tee most days.

The later section got me these results on the tone analyzer:

[Screenshot: Tone Analyzer results]

Because of the nature of my dissertation (which involves military history), I get a pretty angry score.

I had fun testing both of them, and as Kevin noted, I can’t gripe too much about the results.