Jitter And Color In R

As I work through Practical Data Science with R, I picked up on the jitter function in chapter 3 (which is available as a free download).  I’m going to explore that function a bit in this post.


If you want to follow along, grab the data set from the authors’ GitHub repo.  The relevant data file is in Custdata\custdata.tsv.  Custdata is an artificial data set designed to provide a variety of data points and make graphing and analysis a bit easier.  It has 1000 observations of 11 variables, and the idea is to plot the likelihood of having health insurance against certain demographic characteristics.  In this case, we’ll look at marital status.
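If you want to run the code yourself, the setup is minimal.  Here’s a sketch of what I run first; the path assumes you cloned the repo into your working directory, so adjust it to wherever you saved the file:

library(ggplot2)
# Read the tab-separated custdata file into a data frame.
custdata <- read.table("Custdata/custdata.tsv", header=TRUE, sep="\t")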

Different Bar Charts

It’s pretty easy to build bar charts in R using ggplot2.  Here’s a simple example using the custdata set:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins))

The resulting bar chart looks like this:


This type of bar chart helps us see overall trends, but makes it difficult to see the details within a marital status type.  To do that, we can create a side-by-side bar chart by changing one little bit of code:  add position="dodge" to the geom_bar function, like so:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge")

The resulting bar chart looks like this:


This gives us a much clearer look at the relative breakdown across all marital status and insurance combinations, but we lose the sense of overall group sizes that the first, stacked bar chart gave us.

At this point, I’m going to take a brief detour to talk about color.


Color selection is vital to graphical design.  We want colors which complement the medium, be that a monitor, a trade magazine, or a newspaper.  We also need to be cognizant that people perceive colors differently:  red-green colorblindness is the most common form, but there are several, and we want to design with colorblind readers in mind.  That means that the classic red-yellow-green combo probably doesn’t work as well as you’d like.

In Cookbook R, there’s a good colorblind-friendly palette which works for us.
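For reference, here is roughly how that palette looks defined in R; the vector name cbPalette is just my own convention, and the blue and vermillion I use below are two of its entries:

# Colorblind-friendly palette from Cookbook R:
# grey, orange, sky blue, bluish green, yellow, blue, vermillion, reddish purple.
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")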

Picking Our Own Colors

In the graphs above, I let R choose the colors for me.  That can work for some scenarios, but graphic designers are going to want to choose their own colors.  Fortunately, doing this in ggplot2 is very easy, especially with factor variables.  All we need to do is add a scale_fill_manual function.  Here’s our side-by-side bar chart using blue and vermillion:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("#D55E00","#0072B2"))

In this case, I specified hex codes for the colors.  If I wanted simpler color specifications, I could just write "blue" and "red," but those named colors won’t be quite the same shades as the hex values.
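If you are curious about the difference, here is a quick sketch of the named-color version; the shades R picks for "red" and "blue" are more saturated than the hex values above:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("red","blue"))

For the rest of the post, though, I’ll stick with the hex codes.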

Here is the new bar chart with our selected colors:


The image is the same as before, but the new colors make it much more likely that a colorblind reader will be able to distinguish the two groups in our chart.

Fill Bars And Jitter

A third type of bar chart is the fill bar, which lets us see very clearly the percentage of individuals with health insurance across the different marital statuses:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="fill") + scale_fill_manual(values=c("#D55E00","#0072B2"))

Notice that the only thing we did was change the position to “fill” and we have a completely different chart. Here is how the fill bar chart looks:


The problem with a fill bar chart is that we completely lose any sense of how many people are in each category. The authors reintroduce that information with a rug: a set of points below each bar that shows relative density. To build it, we add a geom_point function that looks like so:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") + geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) + scale_fill_manual(values=c("#D55E00","#0072B2"))

Here is how the chart looks:


The geom_point function holds all of our changes. It starts by setting y to -0.05, which is far enough below the bars that we can see the points without them interfering with the bar chart. The size and alpha values are set to maximize visibility, especially for the Married category: you can see that it is the densest, but there are still some gaps in it. The last option is position_jitter, a function which "jitter[s] points to avoid overplotting." Without jittering, all of the points in a category would sit directly on top of one another; by jittering them, we distribute the points across a wider space, which gives us a much better sense of how many there are.
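If you want to see the overplotting problem for yourself, here is a comparison sketch (my own, not from the book) that drops position_jitter entirely.  With every observation landing on exactly the same spot, each marital status collapses into a single dark dot, no matter how many people are in it:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") + geom_point(aes(y=-0.05), size=0.75, alpha=0.3) + scale_fill_manual(values=c("#D55E00","#0072B2"))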

position_jitter gives us two options: width and height. In the rug chart above, we set a height of 0.01 but did not specify a width, so we got the default. In the next chart, I’m overlaying several different jitter widths on our plot, and you can see that the default is closest to a width of 0.4:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(h=0.01)) +
  geom_point(aes(y=0.05), size=0.75, alpha=0.3, position=position_jitter(w=0.4,h=0.01)) +
  geom_point(aes(y=0.25), size=0.75, alpha=0.3, position=position_jitter(w=0.3,h=0.01)) +
  geom_point(aes(y=0.45), size=0.75, alpha=0.3, position=position_jitter(w=0.2,h=0.01)) +
  geom_point(aes(y=0.65), size=0.75, alpha=0.3, position=position_jitter(w=0.1,h=0.01)) +
  scale_fill_manual(values=c("#D55E00","#0072B2"))

Here’s the resulting plot:


Of course, another option we could use would be to print the total number of elements in each section…but that’s for another day.
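As a quick sketch of that idea (my own addition, not something from the book), you could compute the counts ahead of time and layer them on with geom_text; the counts data frame and the y position of 1.05 are arbitrary choices on my part:

# Count the number of observations per marital status.
counts <- as.data.frame(table(custdata$marital.stat))
names(counts) <- c("marital.stat", "n")

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_text(data=counts, aes(y=1.05, label=n)) +
  scale_fill_manual(values=c("#D55E00","#0072B2"))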


Different types of bar charts have different advantages and disadvantages.  Each provides us a certain view of the data, but each view also hides some other aspect.  Which graphs we show will depend upon the message we want to send, but it’s important to realize that there are ways to work around some of the deficiencies of a particular graph, such as generating a rug using the jitter function.

Also, be cognizant of color choices.  I’m not always good at that, but it’s something I want to think more about as I visualize more data with R.

Additional Resources

Podcasts I’m Enjoying

With winter comes colder temperatures, and that means I have to put the top up on the Miata.  That is unfortunate for many reasons, but one of the plus sides is that I can comfortably listen to podcasts in my car, squeezing in a few extra hours of learning each week.

Here are the main podcasts I’m listening to today, in alphabetical order:

  • Away From the Keyboard.  Richie Rump and Cecil Phillip keep it a bit lighter, focusing more on stories than hardcore learning.  It’s a good podcast for unwinding after a long day.
  • Paul’s Security Weekly.  This used to be called pauldotcom, but whatever the name, Paul Asadoorian & co. do a great job making security news entertaining.
  • SQL Data Partners podcast.  Carlos Chacon started podcasting not too long ago, and almost all of his episodes are in interview format.
  • WOxPod.  Chris Bell also does interview-style episodes, but in addition to those, he records contemplative monologues.

At one point in time, I had dozens of podcasts, but it got to be too much.  I’m starting over again with these four.

Pluralsight Reviews: Ethical Hacking: Reconnaissance/Footprinting

This review covers Dale Meredith’s Ethical Hacking: Reconnaissance/Footprinting.  The material in this course follows the Certified Ethical Hacker material on the topic pretty closely, and I think Meredith’s rendition has many of the same benefits and flaws that I found in the CEH literature.

The course is 3 1/2 hours long, so it might take a couple of plane flights to get through, and fortunately, it’s not the type of course where you need to be sitting in front of a computer typing along with the instructor to get anything out of it.  Meredith covers topics at a fairly high level and then drills into tools and techniques for collecting information on a target.  The primary emphasis of this course is on methods with no or low visibility to defenders, such as doing Google searches, researching employees through social media sites, looking at the public website, and finding ambient information from sources like job postings, news articles, and company blogs.  This is recon, so we’re collecting information that we’ll use later in a penetration test.  Meredith does show off some tools like WinHTTrack, and running that tool might raise the ire of a capable defender, but otherwise it’s smooth sailing for an attacker.  Meredith finishes up the course by discussing methods to mitigate exposure in public records.

I’m assuming that many of the people watching this course are preparing for the CEH, and I think Meredith tailored his material accordingly.  The big downside is that the CEH hits you with lots of tools and techniques all at once, with relatively little focus on any of them.  I think Meredith did a good job focusing on some of the most important of these (such as Googledorks), but it still felt like there were too many things happening at once in this course, and there were times when I really wished we could have seen a deep dive into more techniques than just Googledorks.  I hope that the later courses in this series offer much deeper looks at tools like nmap rather than a shotgun blast of software (which is how the CEH presents a lot of its material).

My Personality Insights and Tone Analyzer

Based on Kevin’s experiment from a few days ago, I decided to try out the personality analyzer. I started with a section of the first chapter of my dissertation and got the following:

You are inner-directed, skeptical and can be perceived as insensitive.

You are philosophical: you are open to and intrigued by new ideas and love to explore them. You are calm under pressure: you handle unexpected events calmly and effectively. And you are calm-seeking: you prefer activities that are quiet, calm, and safe.

You are motivated to seek out experiences that provide a strong feeling of organization.

You are relatively unconcerned with taking pleasure in life: you prefer activities with a purpose greater than just personal enjoyment. You consider achieving success to guide a large part of what you do: you seek out opportunities to improve yourself and demonstrate that you are a capable person.

To be as accurate as possible, I did the same with a later section, and I got this:

You are shrewd, skeptical and tranquil.

You are adventurous: you are eager to experience new things. You are imaginative: you have a wild imagination. And you are independent: you have a strong desire to have time to yourself.

You are motivated to seek out experiences that provide a strong feeling of prestige.

You are relatively unconcerned with both tradition and taking pleasure in life. You care more about making your own path than following what others have done. And you prefer activities with a purpose greater than just personal enjoyment.

I think these are absolutely fair points: shrewd, skeptical, inner-directed, and tranquil describe me to a tee most days.

The later section got me these results on the tone analyzer:

(Screenshot: Tone Analyzer results)

Because of the nature of my dissertation (which involves military history), I get a pretty angry score.

I had fun testing both of them, and as Kevin noted, I can’t gripe too much about the results.


WinHTTrack

I picked up on a tool called WinHTTrack from Dale Meredith’s Pluralsight course on reconnaissance (which I will review in a couple of days).  The tool has a simple premise:  grab website files based on links.

When I ran it against Catallaxy Services, it followed each link and pulled back the associated files.  The app handles subdomains separately and presents a reasonable picture of the site.  The advantage to using a tool like this is that you can grab a website and browse it locally later.  This lets you perform site analysis without actually being on the website.

Also, this tool will preserve external links, but by default it will not grab files from external sites.  You don’t want to try to collect the whole internet (right?  Right?), so being able to target downloads to one domain, subdomain, or even directory is very helpful.

The Cost Of Synchronous Mirroring

About a month or so ago, I started dealing with a customer’s performance issues.  When I checked the wait stats using Glenn Berry’s fantastic set of DMV queries, I noticed that 99% of the wait time was mirroring-related.  In other words, 99% of the time SQL Server spent waiting to run queries was spent with the primary instance waiting for the secondary instance to synchronize changes.

The reason the mirroring waits were that high is that my customer is using Standard Edition of SQL Server, and unfortunately, Standard Edition only allows synchronous mirroring.  Now, I know that mirroring is deprecated, but my customer didn’t, and until SQL Server 2016 comes out and we get asynchronous (or synchronous) availability groups, they didn’t have many high availability options.

Because this customer was having performance problems, we ended up breaking the mirror.  We did this after I discussed their Recovery Time Objective and Recovery Point Objective (that is, how long they can afford to be down and how much data they can afford to lose), and it turned out that synchronous mirroring just wasn’t necessary given the company’s business model and RTO/RPO requirements.  Instead, I bumped up backup frequency and have a medium-term plan to introduce log shipping to reduce recovery time in the event of failure.

But let’s say that this option wasn’t available to me.  Here are other things you can do to improve mirroring performance:

  1. Switch to asynchronous mode.  If you’re using Enterprise Edition, you can switch mirroring to asynchronous mode, which improves performance considerably.  Of course, this comes at the risk of data loss in the event of failure—a transaction can commit on the primary node before it commits on the secondary, so in the event of primary failure immediately after a commit, it’s possible that the secondary doesn’t have that transaction.  If you need your secondary to be synchronous, this isn’t an option.
  2. Improve storage and network subsystems.  In my customer’s case, they’re using a decent NAS.  They’re a small company and don’t need SANs with racks full of SSDs or on-board flash storage, and there’s no way they could afford that.  But if they needed synchronous mirroring, getting those writes to the secondary more quickly would help performance.
  3. Review mirroring.  In an interesting blog post on mirroring, Graham Kent looks at the kind of information he wants when troubleshooting problems with database mirroring, and also points us to Microsoft guidance on the topic.  It’s possible that my customer could have tweaked mirroring somehow to keep it going.

In the end, after shutting off mirroring, we saw a significant performance improvement.  It wasn’t enough and I still needed to modify some code, but this at least helped them through the immediate crisis.  They lost the benefit of having mirrored instances (knowing that if one instance goes down, another can come up very quickly to take over), but because the RTO/RPO requirements were fairly loose, we decided that we could sacrifice that level of protection in order to obtain sufficient performance.