Jitter And Color In R

While working through Practical Data Science with R, I came across a jitter function in chapter 3 (available as a free download).  I’m going to explore that jitter function a bit in this post.

Setup

If you want to follow along, grab the data set from the authors’ GitHub repo.  The relevant data file is Custdata\custdata.tsv.  Custdata is an artificial data set designed to offer enough variety to make graphing and analysis a bit easier.  It has 1000 observations of 11 variables, and the idea is to plot health insurance likelihood against certain demographic characteristics.  In this case, we’ll look at marital status.
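
For anyone following along, loading the file might look something like the sketch below.  The load_tsv helper and the local path are assumptions for illustration, not something taken from the book or the repo, so adjust the path to wherever you saved the file.

```r
# Hypothetical helper (not from the book): read a tab-separated file
# that has a header row into a data frame.
load_tsv <- function(path) {
  read.delim(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}

# Assumed location of the downloaded file; change this to match your machine.
custdata_path <- file.path("Custdata", "custdata.tsv")

if (file.exists(custdata_path)) {
  custdata <- load_tsv(custdata_path)
  # Quick look at the two columns used in this post.
  str(custdata[, c("marital.stat", "health.ins")])
}
```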

Different Bar Charts

It’s pretty easy to build bar charts in R using ggplot2.  Here’s a simple example using the custdata set:

library(ggplot2)
ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins))

The resulting bar chart looks like this:

[Figure: stacked bar chart of health insurance status by marital status]

This type of bar chart helps us see overall trends, but makes it difficult to see the details within a marital status type.  To do that, we can create a side-by-side bar chart by changing one little bit of code:  add position="dodge" to the geom_bar function, like so:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge")

The resulting bar chart looks like this:

[Figure: side-by-side bar chart of health insurance status by marital status]

This gives us a much clearer look at the relative breakdown across all marital status and insurance combinations, but we lose the at-a-glance view of the total size of each marital status group that the first bar chart gave us.
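
To see the numbers behind both charts at once, a cross-tabulation helps.  The sketch below uses a tiny made-up data frame (toy) in place of custdata so that it runs on its own; the column names match the post, but the values are invented.

```r
# Invented stand-in for custdata: same column names, made-up values.
toy <- data.frame(
  marital.stat = c("Married", "Married", "Married", "Never Married",
                   "Never Married", "Widowed"),
  health.ins   = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Counts for every marital status / insurance combination
# (the heights of the side-by-side bars).
counts <- table(toy$marital.stat, toy$health.ins)

# Adding margins brings back the per-status totals
# (the heights of the stacked bars).
with_totals <- addmargins(counts)
print(with_totals)
```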

At this point, I’m going to take a brief aside.

Color

Color selection is vital to graphical design.  We want colors which complement the medium—be that a monitor, a trade magazine, or a newspaper.  We also need to be cognizant that people interpret colors differently:  red-green colorblindness is the most common form, but there are several forms and we want to design with colorblind individuals in mind.  That means that the red-yellow-green combo probably doesn’t work as well as you’d like.

The Cookbook for R has a good colorblind-friendly palette which works for us.
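
For reference, that palette (the version that starts with grey) can be stored as a plain character vector.  The variable names and the descriptive labels here are assumptions added for illustration; double-check the hex codes against the Cookbook page before relying on them.

```r
# Colorblind-friendly palette as a named character vector.
# The names are descriptive labels added for this sketch.
cb_palette <- c(
  grey = "#999999", orange = "#E69F00", sky_blue = "#56B4E9",
  bluish_green = "#009E73", yellow = "#F0E442", blue = "#0072B2",
  vermillion = "#D55E00", reddish_purple = "#CC79A7"
)

# Pull out two colors by label and drop the names,
# leaving a plain vector of hex codes.
chart_colors <- unname(cb_palette[c("vermillion", "blue")])
print(chart_colors)
```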

Picking Our Own Colors

In the graphs above, I let R choose the colors for me.  That can work for some scenarios, but graphic designers will want to choose their own colors.  Fortunately, doing this in ggplot2 is very easy, especially with factor variables.  All we need to do is add on a scale_fill_manual function.  Here’s our side-by-side bar chart using blue and vermillion:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="dodge") + scale_fill_manual(values=c("#D55E00","#0072B2"))

In this case, I specified hex codes for the colors.  If I wanted to use simpler colors, I could just write “blue” and “red,” but these won’t be quite the same colors.

Here is the new bar chart with our selected colors:

[Figure: side-by-side bar chart using the blue and vermillion colors]

The chart shows the same data as before, but the new colors should be much easier for a colorblind reader to tell apart.
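
One thing worth knowing: an unnamed values vector is matched to the fill values in order, so in the code above the first color goes to FALSE and the second to TRUE.  A named vector makes that mapping explicit.  The sketch below uses an invented stand-in data frame (toy) rather than custdata, so it runs without the download.

```r
library(ggplot2)

# Invented stand-in for custdata: same column names, made-up values.
toy <- data.frame(
  marital.stat = c("Married", "Married", "Never Married", "Widowed"),
  health.ins   = c(TRUE, FALSE, TRUE, TRUE)
)

# Names tie each color to a specific value of health.ins,
# regardless of the order in which they are listed.
ins_colors <- c("TRUE" = "#0072B2", "FALSE" = "#D55E00")

p <- ggplot(toy) +
  geom_bar(aes(x = marital.stat, fill = health.ins), position = "dodge") +
  scale_fill_manual(values = ins_colors)
print(p)
```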

Fill Bars And Jitter

A third type of bar chart is the fill bar.  The fill bar lets us see very clearly the percentage of individuals with health insurance across different marital statuses:

ggplot(custdata) + geom_bar(aes(x=marital.stat,fill=health.ins), position="fill") + scale_fill_manual(values=c("#D55E00","#0072B2"))

Notice that the only thing we did was change the position to “fill” and we have a completely different chart. Here is how the fill bar chart looks:

[Figure: fill bar chart of health insurance status by marital status]
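
In effect, position="fill" rescales each bar so that its segments sum to 1.  The same proportions can be computed directly with prop.table; the data frame below is an invented stand-in for custdata.

```r
# Invented stand-in for custdata: same column names, made-up values.
toy <- data.frame(
  marital.stat = c("Married", "Married", "Married", "Married",
                   "Never Married", "Never Married"),
  health.ins   = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

counts <- table(toy$marital.stat, toy$health.ins)

# margin = 1 divides each row by its own total, so every
# marital status sums to 1 no matter how many people are in it.
shares <- prop.table(counts, margin = 1)
print(shares)
```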

The problem with a fill bar chart is that we completely lose context of how many people are in each category. The authors re-introduce this concept with the rug: a set of points below each bar that shows relative density. In order to do this, we add on a geom_point function that looks like so:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") + geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(height=0.01)) + scale_fill_manual(values=c("#D55E00","#0072B2"))

Here is how the chart looks:

[Figure: fill bar chart with a jittered rug of points below each bar]

The geom_point function has all of our changes. It starts by moving y down to -0.05, which is far enough down on the plot that we can see those points without them interfering with our bar chart. The size and alpha values are set to maximize visibility, especially for the Married option. You can see that the Married option is densest, but there are still some gaps in there. The last option is position_jitter, a function which “jitter[s] points to avoid overplotting.” Basically, without jittering, all of the points in a category would be drawn on top of one another; by jittering them, we spread the points across a wider space, letting us see the sheer number of them more clearly than otherwise.
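
To make the idea concrete, here is a minimal base R sketch of what jittering does: every point gets a small random offset drawn uniformly from a fixed range. This illustrates the idea rather than reproducing ggplot2’s internals, and the variable names are made up for the example.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

# 200 points that would all be drawn at exactly the same spot:
# x = 1 (one category) and y = -0.05 (just below the bars).
n <- 200
x <- rep(1, n)
y <- rep(-0.05, n)

# Shift each point by a uniform random amount:
# up to 0.4 either way horizontally, up to 0.01 either way vertically.
x_jittered <- x + runif(n, min = -0.4, max = 0.4)
y_jittered <- y + runif(n, min = -0.01, max = 0.01)

# Compare the number of distinct x positions before and after.
print(length(unique(x)))
print(length(unique(x_jittered)))
print(range(y_jittered))
```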

Jitter gives us two options: width and height. In the example above, we set a height value of 0.01 but left width unspecified, so it falls back to its default. In the next chart, I’m overlaying different jitter width levels on our plot, with the default in the bottom row. You can see that the default matches a width of 0.4; that is 40% of the resolution of the data, and categories on a discrete axis sit one unit apart:

ggplot(custdata, aes(x=marital.stat)) + geom_bar(aes(fill=health.ins), position="fill") +
  geom_point(aes(y=-0.05), size=0.75, alpha=0.3, position=position_jitter(height=0.01)) +
  geom_point(aes(y=0.05), size=0.75, alpha=0.3, position=position_jitter(width=0.4,height=0.01)) +
  geom_point(aes(y=0.25), size=0.75, alpha=0.3, position=position_jitter(width=0.3,height=0.01)) +
  geom_point(aes(y=0.45), size=0.75, alpha=0.3, position=position_jitter(width=0.2,height=0.01)) +
  geom_point(aes(y=0.65), size=0.75, alpha=0.3, position=position_jitter(width=0.1,height=0.01)) +
  scale_fill_manual(values=c("#D55E00","#0072B2"))

Here’s the resulting plot:

[Figure: fill bar chart overlaid with rows of points at different jitter widths]
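
Because jitter is random, the points can land in slightly different places each time the script is run. Recent versions of ggplot2 accept a seed argument in position_jitter to make the result repeatable; the sketch below assumes such a version is installed, and it uses an invented stand-in data frame (toy) and a hypothetical make_rug helper instead of custdata.

```r
library(ggplot2)

# Invented stand-in for custdata: only the column needed for the rug.
toy <- data.frame(
  marital.stat = rep(c("Married", "Never Married"), times = c(30, 10))
)

# Hypothetical helper: build just the rug layer with a given seed.
make_rug <- function(seed) {
  ggplot(toy, aes(x = marital.stat)) +
    geom_point(aes(y = -0.05), size = 0.75, alpha = 0.3,
               position = position_jitter(width = 0.2, height = 0.01,
                                          seed = seed))
}

# Build the same plot twice with the same seed and compare
# the jittered x positions that ggplot2 computed.
first  <- layer_data(make_rug(42))
second <- layer_data(make_rug(42))
print(identical(first$x, second$x))
```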

Of course, another option we could use would be to print the total number of elements in each section…but that’s for another day.

Conclusion

Different types of bar charts have different advantages and disadvantages.  Each provides us a certain view of the data, but the downside to each view is that we lose some other aspects.  Which graphs we show will depend upon the message we want to send, but it’s important to realize that there are methods to get around some of the deficiencies in certain graphs, such as generating a rug using the jitter function.

Also, be cognizant of color choices.  I’m not always good at that, but it’s something I want to think more about as I visualize more data with R.
