DBAs: Come Learn R With Me

In conjunction with SQL Saturday Madison, I am giving my first full-day training session, entitled Enter the Tidyverse:  R for the Data Professional, on Friday, April 6th.  I’m using the term “data professional” deliberately because I want to reach a relatively under-served part of the community:  database administrators.  If you’re a database developer, a budding data scientist, or an application developer curious about R, this is a great training for you too, as the skills you’ll pick up transfer directly to a variety of other jobs.  But my pitch in this blog post is to DBAs, as they’re the most likely to wonder why they need anything more than T-SQL and maybe a bit of PowerShell.

One of my goals with this training is to show database administrators that they have a lot to gain from R as well.  We will see how to visualize DMV data more usefully and build a simple prediction model for messages in the SQL Server error log, and my capstone example will tie it all together by tuning backup performance and estimating the impact of a new database on your backup regimen.

If you sign up for the training in Madison, the cost is only $125 and you’ll walk away with a better knowledge of how you can level up your DBA skills with the help of a language specially designed for analysis.  Below is the full abstract for my training session.  If this sounds interesting to you, sign up today!

Course Description

In this day-long training, you will learn about R, the premier language for data analysis.  We will approach the language from the standpoint of data professionals:  database developers, database administrators, and data scientists.  We will see how data professionals can translate existing skills with SQL to get started with R.  We will also dive into the tidyverse, an opinionated set of libraries which has modernized R development.  We will see how to use libraries such as dplyr, tidyr, and purrr to write powerful, set-based code.  In addition, we will use ggplot2 to create production-quality data visualizations.

Over the course of the day, we will look at several problem domains.  For database administrators, areas of note will include visualizing SQL Server data, predicting error occurrences, and estimating backup times for new databases.  We will also look at areas of general interest, including analysis of open source data sets.

No experience with R is necessary.  The only requirements are a laptop and an interest in leveling up your data professional skillset.

Intended Audience

  • Database developers looking to tame unruly data
  • Database administrators with an interest in visualizing SQL Server metrics
  • Data analysts and budding data scientists looking for an overview of the R landscape
  • Business intelligence professionals needing a powerful language to cleanse and analyze data efficiently


Module 0 — Prep Work

  • Review data sources we will cover during the training
  • Ensure laptops are ready to go

Module 1 — Basics of R

  • What is R?
  • Basic mechanics of R
  • Embracing functional programming in R
  • Connecting to SQL Server with R
  • Identifying missing values, outliers, and obvious errors

Module 2 — Intro To The Tidyverse

  • What is the Tidyverse?
  • Tidyverse principles
  • Tidyverse basics:  dplyr, tidyr, readr, tibble

Module 3 — Dive Into The Tidyverse

  • Data loading:  rvest, httr, readxl, jsonlite, xml2
  • Data wrangling:  stringr, lubridate, forcats, broom
  • Functional programming:  purrr

Module 4 — Plotting

  • Data visualization principles
  • Chartjunk
  • Types of plots:  good, bad, and ugly
  • Plotting data with ggplot2
    • Exploratory plotting
    • Building professional quality plots

Module 5 — Putting it Together:  Analyzing and Predicting Backup Performance

  • A capstone notebook which covers many of the topics we covered today

Course Objectives

Upon completion of this course, attendees will be able to:

  • Perform basic data analysis with the R programming language
  • Take advantage of R functions and libraries to clean up dirty data
  • Build a notebook using Jupyter Notebooks
  • Create data visualizations with ggplot2


No experience with R is necessary, though it would be helpful.  Please bring a laptop to follow along with exercises and get the most out of this course.

Data Processing: The Other 90%

This is part three of a series on launching a data science project.

The Three Steps Of Data Processing

Data processing is made up of a few different activities:  data gathering, data cleansing, and data analysis.  Most estimates are that data scientists spend about 80% of their time in data processing (particularly in data cleansing).  If anything, I consider this an underestimation:  based on my experiences, I would probably put that number closer to 90%.

As with everything else, these separate concepts tend to jumble together and you’ll bounce back and forth between them as you learn about new resources, determine the viability of sources, and try to link data sources together.  But for the purposes of elucidation, we’ll treat them independently.  So let’s start with data gathering.

Data Gathering

At first, you might not know exactly what data you need, but as you flesh out your models and gain a better understanding of the problem, you will go back to the well several times and probably end up getting data from a variety of sources.  These sources can include, but are not limited to, the following:

  • Internal proprietary data
  • Open data sources (often governmental or academic, but can also be private sources like our data professional survey)
  • Paid APIs or data sources from third parties
  • Data from surveys that you commission

Data Gathering In Our Example

For our example, we will stick to using just the data professional survey.  But if we wanted to expand our analysis out a bit, there are some places we could take it.  For example, we could use the Penn World Table and calculate Purchasing Power Parity GDP per capita to normalize salaries across countries.  This could help us see whether there are countries in which database professionals are paid relatively better than the norm, and it’s pretty similar to work I did last year.

We could get a geocoding data set to visualize results on a map, or get a data set which has cost of living by ZIP Code.  This would help us normalize salaries within the United States, as long as the ZIP Code data in our set is good enough (spoilers:  it’s not).  We could also use census information to build out more data by ZIP Code or greater metropolitan area.  Finally, we could use data from other surveys to add more information to the sample.  Our sample is pretty small, at just 6000 entries across two years, so supplementing this data could help us a lot.

Data Cleansing

Once we have relevant (or at least potentially relevant) data sets, we probably will want to join them together to gain insight from the mashup of these data sets.  It is rare, however, to have all of your data sets match up exactly on the first try.  Instead, we’re going to have to work with those data sets to get the puzzle pieces to fit together.

Soft Corinthian Data With A Beautiful Grain

For example, we probably need to change the grain of some of our data sets.  Suppose we have a data set which is at the daily grain and another which is at the monthly grain.  We can connect these two together, but we have to make a decision.  Do we take the daily grain and aggregate it up to the monthly level?  Do we take the monthly level and create some kind of daily allocation rule?  Do we figure out the month to which each date belongs and include month-level aggregations for each of those daily values?  Each of these is a viable strategy, but each has different implications for what you can do with the data.
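As a quick sketch of the first strategy, with made-up data, dplyr and lubridate make rolling a daily grain up to the monthly level a few lines of work:

```r
library(dplyr)
library(lubridate)

# Hypothetical daily data set:  January 1 through February 28
daily <- tibble::tibble(
  sale_date = as.Date("2018-01-01") + 0:58,
  amount    = 100
)

# Aggregate the daily grain up to the monthly level
monthly <- daily %>%
  mutate(sale_month = floor_date(sale_date, unit = "month")) %>%
  group_by(sale_month) %>%
  summarize(total_amount = sum(amount), days_observed = n())
```

The daily allocation strategy would run the other direction:  join each daily row to its month and divide the monthly figure by the number of days.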

Next up, you might have to define join criteria.  There won’t always be an obvious natural join key like a date.  You might have a data set from one source which contains product titles and manufacturer codes, and another data set which contains UPCs and gussied up titles.  In this case, you might be able to map manufacturer codes to UPCs using a third data set, or maybe you can write some regular expressions to get the product titles to match up, but there’s work involved.
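Here’s a rough sketch of that last idea, using invented catalogs:  normalize both sets of titles into a crude join key with stringr, then join on it.  The product names and codes below are made up for illustration.

```r
library(dplyr)
library(stringr)

# Two hypothetical catalogs whose titles almost, but not quite, match
mfc_products <- tibble::tibble(
  mfc_code = c("A-100", "B-205"),
  title    = c("Widget, Deluxe (2-Pack)", "Gadget Pro")
)
upc_products <- tibble::tibble(
  upc   = c("012345678905", "098765432109"),
  title = c("widget deluxe 2 pack", "GADGET PRO")
)

# Lowercase, strip punctuation, and collapse whitespace into a crude join key
normalize_title <- function(x) {
  str_squish(str_replace_all(str_to_lower(x), "[^a-z0-9 ]", " "))
}

matched <- mfc_products %>%
  mutate(join_key = normalize_title(title)) %>%
  inner_join(
    upc_products %>% mutate(join_key = normalize_title(title)),
    by = "join_key",
    suffix = c("_mfc", "_upc")
  )
```

In practice you’d want to eyeball the matches afterward; fuzzy keys like this are exactly where mismatched data sneaks in.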

We may also need to reshape the data.  As mentioned above with grains, we might need to aggregate or disaggregate data to fit a join.  We may also need to perform operations like pivoting or unpivoting data (or gathering and spreading using the tidyverse parlance).
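For instance, tidyr’s gather and spread functions handle the unpivot and pivot cases; the column names below are invented for illustration:

```r
library(tidyr)

# A wide table:  one column per month
wide <- tibble::tibble(
  product = c("A", "B"),
  jan = c(10, 20),
  feb = c(15, 25)
)

# Unpivot (gather) down to one row per product-month combination
long <- gather(wide, key = "month", value = "units", jan, feb)

# Pivot (spread) back out to one column per month
wide_again <- spread(long, key = month, value = units)
```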

Dealing With Mislabels, Mismatches, And Just Plain Wrong Data

This next category of data cleansing has to do with specific values.  I want to look at three particular sub-categories:  mislabeled data, mismatched data, and incorrect data.

Mislabeled data happens when the label is incorrect.  In a data science problem, the label is the thing that we are trying to explain or predict.  For example, in our data set, we want to predict SalaryUSD based on various inputs.  If somebody earns $50,000 per year but accidentally types 500000 instead of 50000, it can potentially affect our analysis.  If you can fix the label, this data becomes useful again, but if you cannot, it increases the error, which means we have a marginally lower capability for accurate prediction.

Mismatched data happens when we join together data from sources which should not have been joined together.  Let’s go back to the product title and UPC/MFC example.  As we fuss with the data to try to join together these two data sets, we might accidentally write a rule which joins a product + UPC to the wrong product + MFC.  We might be able to notice this with careful observation, but if we let it through, then we will once again thwart reality and introduce some additional error into our analysis.  We could also end up with the opposite problem, where we have missed connections and potentially drop useful data out of our sample.

Finally, I’m using “incorrect data” to mean cases where something other than the label is wrong.  For example, in the data professional salary survey, there’s a person who works 200 hours per week.  While I admire this person’s dedication and ability to create 1.25 extra days per week that the rest of us don’t experience, I think that person should have held out for more than just $95K/year.  I mean, if I had the ability to generate spare days, I’d want way more than that.

Missing Data

People don’t always fill out the entirety of every form, and when you make fields optional, people are liable not to fill them in.  The flipside to this is that if you make optional fields required, people might input junk data just to get the form to submit, so simply marking fields as required isn’t the answer.

When we are missing data, we have a few options available to us.  First of all, if we’re looking at a small percentage of the total, we might just get rid of those records which are missing important data.  Once we start removing more than 2-3% of the data set for this reason, though, I’m liable to get a bit edgy about removing more.  If we’re missing information from non-vital columns like middle initial or address line 3, I’d just remove those features altogether and save the hassle.

Otherwise, if we can’t remove the feature and don’t want to remove the row, we can use statistical techniques to fill in the blanks.  We might substitute missing values with a dummy value, such as “Unknown” or a reasonable default like 0 (assuming 0 is a reasonable default in this case).  We might substitute with an “average” value like mean, median, or mode.  The default here is probably median, but there are times when you want to use mean or mode.  We could also do a quick regression and replace our missing value with the regressed value, assuming that there’s some correlation (positive or negative) between the filled-in features and our missing feature.
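As a small sketch with made-up values, median substitution is a one-liner in dplyr:

```r
library(dplyr)

# Hypothetical feature with a couple of missing values
people <- tibble::tibble(age = c(25, 34, NA, 41, NA, 29))

# Replace each NA with the median of the observed values
people_imputed <- people %>%
  mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age))
```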

These statistical techniques can help fill in gaps, but I wouldn’t use them for more than a few percent of the data; if you’re seeing 15+ percent of the feature values missing, this is probably a compromised feature and you don’t want to use it in analysis.  Also, by applying these statistical techniques, we are knowingly introducing noise, making our models that much less accurate.  That is, we are saying that the value is X, but it’s probably somewhere in the range [X-y, X+y], and we are giving undue precision based on our own made-up calculations in the hopes that y is small enough that it won’t matter too much.  If that hope pans out, the benefit of salvaging incomplete records can outweigh the cost of adding more noise to our analysis.

Inconsistent Data

As I mentioned earlier in the series, one of my data adages is that independent systems tend to end up with inconsistent data.  Sometimes people forget to fill out both systems the same way.  Sometimes there are typos in one data source but not the other (or the typos are different in each).  Sometimes there are subtle differences in data sets which lead to differing results.  Sometimes one data source is newer than the other, or at least one particular record is newer than what’s in the other system.

Regardless of the reason, we need to be able to handle these differences.  One way to handle differences is to make one data source canonical and trust it over the other whenever a discrepancy arises.  If there’s an authoritative source, it makes sense to behave this way, and your business team might be able to explain which sources (if any) are authoritative.

More commonly, you will have to institute one or more rules to handle data discrepancies.  For example, if both records have a timestamp, pick the later timestamp…although wall clocks lie.  Another example of a rule might be, if the discrepancy is over a numeric value, go with the lower number as it is more likely to be correct.  These rules will be specific to your data sources and systems, and will often require quite a bit of sussing out.
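One sketch of such a rule set, using invented source tables:  prefer the record with the later timestamp, and fall back to the lower number on a tie.

```r
library(dplyr)

source_a <- tibble::tibble(
  id = 1:3,
  qty = c(10, 7, 12),
  updated = as.Date(c("2018-01-05", "2018-01-03", "2018-01-04"))
)
source_b <- tibble::tibble(
  id = 1:3,
  qty = c(10, 9, 11),
  updated = as.Date(c("2018-01-04", "2018-01-06", "2018-01-04"))
)

# Rule:  prefer the later timestamp; on a tie, go with the lower number
reconciled <- inner_join(source_a, source_b, by = "id", suffix = c("_a", "_b")) %>%
  mutate(qty = case_when(
    updated_a > updated_b ~ qty_a,
    updated_b > updated_a ~ qty_b,
    TRUE                  ~ pmin(qty_a, qty_b)
  ))
```

The specific rules will differ per system; the point is to encode them explicitly rather than resolving discrepancies by hand each time.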

Misshapen Data

If you happen to have data stored in a textual format, you can end up with misshapen data.  For example, you might have XML or JSON data where some of the records are in an invalid format.  Or maybe you have newlines in the middle of a record.  Maybe you have too many delimiters in that CSV, or maybe not enough.  Maybe the file starts as a CSV and then moves to pipe-delimited in the middle because the source system appended data over time using different output mechanisms.

Regular expressions can help out here sometimes.  You might also be able to get satisfaction from the owner of the source data and have them fix the problem.  Sadly, it’s just as likely that your team becomes the owner of the data and gets the pleasure of figuring it out.  If you store data in relational databases and adhere to first normal form (or better), data shape is already enforced, so you can pat yourself on the back for this one and move on to the next problem.

Data Shaping

Speaking of data shapes, we often need to perform some kinds of data shaping activities to turn our raw data into something useful for modeling.  Examples of data shaping include:

  • Vectorizing words.  That is, turning words in a document into numbers for analysis.
  • Categorizing data.  That is, turning strings (or maybe numeric values) into factors.
  • Normalizing values.  That is, transforming numeric data to have a mean of 0 and a standard deviation of 1.  The benefit of normalization is that you can talk about two features which have radically different scales in the context of the same model.  Let’s say that we want to measure a person’s desire to purchase a car based on its gas mileage and its sticker price.  Gas mileage typically ranges from about 15-50, but sticker price can range from $9000-90,000 (or more).  Both the sheer numbers and the wide range of sticker price will overwhelm gas mileage in many models, so the way we get around this is to normalize both features so that each has a mean of 0 and standard deviation of 1.  That way, one feature’s effect cannot swamp another’s.
  • Binning data.  That is, converting continuous values to discrete.  For example, we might have the age of each person in our data set, but we’re more interested in age cohorts, like 18-25 or 30-34.
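The last two bullets are easy to sketch in base R:  scale() handles normalization and cut() handles binning.  The prices, mileages, and age cohorts below are made up for illustration.

```r
# Normalizing:  scale() transforms a vector to mean 0, standard deviation 1
sticker_price <- c(9000, 25000, 42000, 90000)
gas_mileage   <- c(15, 28, 35, 50)
price_z   <- as.numeric(scale(sticker_price))
mileage_z <- as.numeric(scale(gas_mileage))

# Binning:  cut() converts continuous ages into discrete cohorts
ages    <- c(19, 24, 31, 33, 45)
cohorts <- cut(ages,
               breaks = c(18, 25, 30, 35, 65),
               labels = c("18-25", "26-30", "31-35", "36-65"))
```

After scaling, sticker price and gas mileage live on the same footing, so neither can swamp the other in a model.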

Data Analysis

As part of building a data dictionary, I’ve already brought up a couple simple data analysis concepts.  Here, I’ll go into a few, focusing on techniques common in Exploratory Data Analysis (EDA).  The goal with EDA is to summarize and visualize the data in a way that furthers your understanding of the data.

Cardinality Analysis

The first simple technique you can use is finding the cardinality of variables.  For example, in the salary survey, there were four options available for Employment Status.

Not included: use multiverse versions of self to do work, thereby freeing up time to lounge on a beach in Cabo.

Cardinality tells us a few things.  For a text field like employment status, it indicates that we probably got the data from a prescribed list—often a dropdown list or set of radio buttons—and we probably want to treat this as categorical data, as people choose one to the exclusion of others.

For free-form entry, like number of hours worked, looking at the cardinality lets us see how many distinct values people chose, and if you compare that count against the range from minimum to maximum, you can get a fair idea of how densely packed the set is.
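A hypothetical sketch of both checks, using dplyr’s n_distinct:

```r
library(dplyr)

# Invented slice of survey responses
survey <- tibble::tibble(
  employment_status = c("Full time", "Part time", "Full time",
                        "Contractor", "Full time"),
  hours_per_week    = c(40, 20, 45, 40, 50)
)

# Cardinality of each feature, plus the range of the free-form one
cardinality <- survey %>%
  summarize(
    status_cardinality = n_distinct(employment_status),
    hours_cardinality  = n_distinct(hours_per_week),
    hours_range        = max(hours_per_week) - min(hours_per_week)
  )
```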

Summarization And Visualization

Next up, we can run a five-number summary of a feature.

The 5-Number Summary: buy five numbers, get one free.

This summary, which is easy to perform in R, gives us five pertinent values:  the minimum, the 25th percentile, the median (50th percentile), the 75th percentile, and the maximum.  R also gives you the mean, which isn’t technically part of the five-number summary but is helpful, as the difference between median and mean clues you into how skewed the distribution of your sample data is.
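In R, fivenum() and summary() both get you there; the hours below are invented for illustration:

```r
# Hypothetical hours-worked responses
hours <- c(35, 38, 40, 40, 42, 45, 50, 60, 80)

fivenum(hours)  # minimum, lower hinge, median, upper hinge, maximum
summary(hours)  # quartiles plus the mean
```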

You can also visualize these five-number summaries with a box plot.

A box plot.  Buy five boxes, get one free?

This type of plot draws the 25th, 50th, and 75th percentiles as the box, with whiskers extending to the furthest data points within 1.5 times the interquartile range (the 75th percentile minus the 25th) on either side.  Points beyond the whiskers are drawn individually as outliers.  Box plots are great at visualizing differences by category.
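As a sketch in ggplot2, using the built-in diamonds data set as a stand-in for survey data, one box per category takes a single geom:

```r
library(ggplot2)

# Price distribution by diamond cut, one box per category
p <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
```

Swap in your own category and measure columns to compare, say, salary by employment status.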

When looking at the distribution of data within a category, you can easily build a histogram.

A histogram using base-10 logged values because if you can’t do exponential math in your head, get out of my classroom.

This distribution visualization lets us see the skew.  In this case, our plot skews left, meaning that the tail stretches out further on the left-hand side of the median than on the right.  This is fairly common when looking at logged income data, so I wasn’t surprised to see this shape when I created the above graph using the 2017 survey data.
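A minimal ggplot2 sketch of this kind of plot, again using the diamonds data set as a stand-in and putting the x axis on a base-10 log scale:

```r
library(ggplot2)

# Histogram of diamond prices on a base-10 log scale
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
```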

Correlation Checking

Another part of data analysis involves testing for correlation between explanatory variables.  You want correlation between your independent (explanatory) variables and your dependent variable—that’s what data analysis is all about, after all—but having tightly related explanatory variables can lead to analytical problems down the road.  This correlation is called multicollinearity and is one of the bugbears of linear regression analysis.

As an example of this, suppose that we want to regress plant growth rates using two variables:  amount of rainfall in inches and amount of precipitation in centimeters.  There is collinearity between rainfall in inches and precipitation in centimeters:  precipitation is rainfall plus snowfall, so we can easily get from one to the other, and their movements over time will be linked.  If we use both of these variables to try to model plant growth, it becomes harder for our chosen algorithm to differentiate between the two variables and assign each one the correct weight.

The cor() function in R gives us correlation between two variables.

Diamond dimension correlations.

In this case, depth and table have a mild, negative correlation.  By contrast, the x and y dimensions are highly correlated.  If I were performing a linear regression, I would have no qualms about including both depth and table, but would think twice about including both x and y.
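To reproduce this kind of check yourself, you can hand cor() several columns at once and get back a full correlation matrix:

```r
library(ggplot2)  # for the built-in diamonds data set

# Pairwise correlations among a few diamond dimensions
cors <- cor(diamonds[, c("depth", "table", "x", "y")])
print(round(cors, 3))
```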

Conclusion (For Now)

In the next post, I will perform some data cleansing & analysis work on the survey, as this post is long enough as-is.  As mentioned, this is by far the longest phase of any data science project; what I’ve touched on in this post is just a brief smattering of the things you should expect to try when working through a project.

Building Business Understanding

This is part two of a series on launching a data science project.

How Is Babby Data Science Project Formed?

Behind each data science project, there is (hopefully) someone higher up on the business side who wants it done.  This person might have been the visionary behind the project or might simply be the sponsor who drives it because of its potential value.  Either way, the data science team needs to seek out and gather as much information about that champion’s vision as possible.  In a perfect scenario, this is the person handing out sacks of cash to you, and you want those sacks of cash because they buy pretty hardware and let you bring in smart people to work with (or for) you.  You might even have several people interested in your project; if so, you’ll want to build a composite vision, one which hopefully includes all of the members’ visions.  Just keep in mind that sometimes you can’t combine everybody’s dreams and get a coherent outcome, so you’ll need to drive the champion(s) toward clarity.  Here are a few clues to help.

Learn The Domain

The first clue is figuring out the domain.  This means asking questions like what the company does, what other companies in the industry do, what kind of jargon people use, and so on.  The more you know, the better prepared you are to understand the mind(s) of your champion(s).  But even if you’ve been at that company for a long time and have a detailed understanding of the business, you still want to interview your champion(s) and ask questions which expose the ideal outcome.

Stop, Collaborate, and Listen

The second clue is simple:  listen.  When interviewing people, listen for the following types of questions:

  • How much / how many?
  • Which category does this belong to?
  • How can we segment this data?
  • Is this weird?
  • Which option should I choose?

Each of these is a different type of problem with its own set of statistical techniques and rules.  For example, the “Is this weird?” question relates to anomaly detection:  finding outliers or “weird” results.  Figuring out which of these types of questions is most important to your champion is crucial to delivering the right product.  You can build the absolute best regression ever, but if the person was expecting a recommendation engine, you’re going to disappoint.

As you listen to these types of questions, your goal is to nail down a specific problem with a specific answer.  You want to narrow down the scope to something that your team can achieve, ideally something with a built-in measure for success.  For example, here are a few specific problems that we could go solve:

  • Find a model which predicts quarterly sales to within 5% no later than 30 days into the quarter.
  • Given a title and description for a product, tell me a listing category which Amazon will, with at least 90% confidence, consider valid for this product.
  • Determine the top three factors which most affect the number of years the first owner holds onto our mid-range sedan.

With a specific problem in mind, you can look for relevant data.  Of course, you’ll probably need to modify the scope of this problem over time as you gather new information, but this gives you a starting point for success.  Also, don’t expect something as clear-cut as the above early on; instead, people will hem and haw, not quite sure what they really want.  You can take a fuzzy goal into data acquisition, but as you acquire data, you will want to work with the champion to focus down to a targeted and valuable problem.

Dig For Data

Once you have an interesting question, or even the bones of a question, start looking for data.  Your champion likely has a decent understanding of what is possible given your data, so the first place to look is in-house data sources like databases, Excel, flat files, data accessible through internal APIs, and even reports (generated reports like PDFs or even printed-out copies).  Your champion will hopefully be able to point you in the right direction and tell you where some of this data is located—especially if it’s hidden in paper format or in a spreadmart somewhere—but you’re going to be doing a lot of legwork between here and the data processing phase, and you’ll likely bounce back and forth between the two phases a number of times.

As you gather data resources, you will probably want to build a data dictionary for your data sources.  A great data dictionary will include things like:

  • The data type of each attribute:  numeric, string, categorical, binary.
  • The data format:  CSV, SQL Server table, Hive table, JSON file, etc.
  • The size of the data and number of records.
  • The enumeration of valid categorical values.
  • Other domain rules (if known).

I’d love to say that I always do this…but that’d be a lie.  Still, as the saying goes, hypocrisy is the tribute that vice pays to virtue.

Learn Your Outputs

While you’re looking for data and focusing in on the critical problem, you also need to figure out the endgame for your product.  Will there be a different engineering team (or multiple teams?) expecting to call a microservice API and get results back?  Will you get a set of files each day and dump the results into a warehouse?  What’s the acceptable latency?

The Engineering team should help solve this technical problem, although your champion should have insight here depending upon how the business side will need to use your results.  If the business side is already getting files in once a day, they may be fine with your process running overnight and having results in a system by 8 AM, when the analysts start showing up at work.  By contrast, you may have a fraud detection system which needs to make a decision in milliseconds.  These two systems will have radically different sets of requirements, even if the output looks the same to the business side.

Going Through An Example

My motivating example, as mentioned yesterday, is data professional salaries—figuring out how to get more money does, after all, motivate me!

Let’s suppose we work for Data Platform Specialists, a company dedicated to providing DBAs and other data platform professionals with valuable market knowledge.  We have come into possession of a survey of data professionals and want to build insights that we can share with our client base.

If you haven’t seen the survey yet, I recommend checking it out.

You can tell that this is a quality survey because it’s in blue, which is the most serious of survey result colors.

Once we have the salary data, we want to start building a data dictionary and see what the shape of the data looks like.  We’d get information on the total number of rows, note that this is stored in Excel on a single worksheet, and then make some notes on the columns.  For example, a number of these features are categorical:  for example, TelecommuteDaysPerWeek has six options, ranging from “less than 1” to “5 or more.”  By contrast, hours worked per week is an integer, which ranges from 5 to 200 (umm…).

There are quite a few columns, most of which originally came from dropdowns rather than users typing the data in.  This is good, because users are typically the biggest enemy of clean data.  But even in this example, we can see some interesting results:  for example, about halfway through the image, you can see “111,000” in the SalaryUSD column.  It turns out that this was a string field rather than numeric.  If you simply try to turn it into numeric, it will fix “111,000”, but it would turn a German respondent’s entry of “111.000” from $111,000 into $111 if you parse it with US conventions.  But I’m getting a bit ahead of myself here…
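This is exactly the kind of locale problem readr’s parse_number was built for:  tell it which conventions the string follows and it sorts out the grouping and decimal marks.  A quick sketch:

```r
library(readr)

# A US-format entry:  comma as the grouping mark (the default locale)
parse_number("111,000")  # 111000

# A German-format entry:  period as the grouping mark, comma as the decimal mark
parse_number("111.000",
             locale = locale(grouping_mark = ".", decimal_mark = ","))  # 111000
```

Of course, this only works if you know (or can infer) which locale each respondent used, which is its own data cleansing problem.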

Where I want to go first is, what are the interesting types of questions we can ask given our vast wealth of domain knowledge and this compendium of valuable pricing insight?  (Too obsequious?  Maybe.)

  • How much money does a DBA make?  Similarly, how much does a data scientist make, or a developer who specializes in writing T-SQL?
  • Which category of DBA (e.g., junior, mid-level, senior) does a particular type of work?
  • How can we segment the DBAs in our survey?
  • Suppose I work 80 hours per week.  Compared to my peers, is this many hours per week weird?
  • Which option should I choose as a career path?  DBA?  Data scientist?  BI specialist?

Talking this through with our champion and other stakeholders, we can talk through some of the information we’ve already gathered, prime them toward asking focused questions, and narrow down to our question of interest:

How much money should we expect a data professional will make?

Well, that’s…broad.  But this early on in the process, that’s probably a reasonable question.  We might not want to put hard boundaries on ranges yet, but as we go along, we can narrow the question down further to something like, “Predict, within $5K USD, how much we should expect a data professional to make depending upon country of residence, education level, years of experience, and favorite Teenage Mutant Ninja Turtle (original series only).”

The end product that we want to support is a small website which allows people to build profiles and then takes that profile information and gets an estimate of how much they could make in different roles.  Our job is to build a microservice API which returns a dollar amount based on the inputs we define as important.


Today’s post was all about getting into the business users’ heads.  These are the people handing us big sacks of cash, after all, so we have to give them something nice in return.  In the next post, we’ll go a lot deeper into data processing and ask the question, if data platform professionals are the gatekeepers between good data and bad, why are we so bad at filling out forms?

The Microsoft Team Data Science Process

This is part one of a series on launching a data science project.

This is the beginning of a series of posts around growing a data science project from the germ of an idea to its fruition as a stable oak.  Before I get into the process, I want to start with a few data adages to which I stubbornly subscribe and which drive the need for quality processes.

Data Adages

You may disagree with any of these adages, but they drive my decision-making processes and I believe that they feed at least a little bit of the paranoia necessary to be an effective data architect.

Clean Data Is An Aspiration, Not A Reality

The concept of “clean” data is appealing to us—I have a talk on the topic and spend more time than I’m willing to admit trying to clean up data.  But the truth is that, in a real-world production scenario, we will never have truly clean data.  Whenever there is the possibility of human interaction, there is the chance of mistyping, misunderstanding, or misclicking, each of which can introduce invalid results.  Sometimes we can see these results—like if we allow free-form fields and let people type in whatever they desire—but other times, the error is a bit more pernicious, like an extra 0 at the end of a line or a 10-key operator striking 4 instead of 7.

Even with fully automated processes, we still run the risk of dirty data:  sensors have error ranges, packets can get dropped or sent out of order, and services fail for a variety of reasons.  Each of these can negatively impact your data, leaving you with invalid entries.

Data Source Disagreements Will Happen

If you have the same data stored in two systems, there will inevitably be disagreements.  These disagreements can happen for a number of reasons, including:

  1. Different business rules mean that different subsets of data go into each system.  For example, one system might track data on work items, and a separate system could track data on work items specifically for Cap-X projects.  If you don’t understand the business rules of each system, you might look at the difference in numbers and get confused.
  2. With manual data entry, people can make mistakes, and those mistakes might manifest in different ways in separate systems.  If a person has to type data into two systems, the likelihood of typos affecting each system exactly the same way is fairly low.
  3. Different systems might purport to be the same but actually are based on different data sources.  For example, the Penn World Table has two different mechanisms to calculate GDP:  expenditure-side GDP and output-side GDP.  In an ideal world, these are the same thing, but in reality, they’re a little bit different.  If I build one system based off of expenditure-side GDP and you build another system based off of output-side GDP, our calculations will clash even though they’re supposed to represent the same thing.
  4. Some systems get updated more frequently than others, so one side might have newer data, even if they both come from the same source and have the same rules and calculations applied.
  5. Even in a scenario where you are reading from a warehouse which gets its data from a single source system, there is still latency, meaning that you might get an extract of data from the warehouse which is out of date.  That too can lead to data discrepancies between sources.

You Will Always Have More Questions Than Data

This seems pretty self-explanatory—our ability to collect and process information is finite, whereas the set of questions we could ask is infinite.  You might be able to collect an exhaustive data set about a very particular incident or set of incidents, but there are always more and broader questions a person can ask for which the data is not available.  For example, let’s say that we have a comprehensive set of data about a single baseball game, including lineups, game actions, pitch locations, bat speeds on contact, and so on.  No matter how detailed the data you provide, someone will be able to ask questions that your data cannot answer.  One class of these questions involves trying to discern human behavior:  why the manager picked one reliever over another, why the runner decided to advance from first to third base on a single to left field in the 4th inning, why this pitcher followed up a four-seam fastball inside with a changeup outside, etc.

Decision-Makers Often Don’t Know The Questions They’ll Have

I’ve built a few data warehouses in my time.  The most frustrating part of building a data warehouse is that you have to optimize it for the questions people will ask, but it’s hard for people to imagine the questions they will have far enough in advance for you to develop the thing.  Decision-makers tend not to know the types of questions they can ask, including whether those questions are realistic or reasonable, until you prod them in a direction.

Even worse, you’ll be able to answer some of their questions, but invariably they will have questions which your system cannot answer, meaning that you either extend the system to answer those questions as well, users find some other way to satisfy their curiosity, or they forget about the question and potentially lose a valuable thread.

We have ways of coping with this, like storyboarding, iterative development, and storing vast amounts of semi-structured data, but it’s tough to figure out what to include in your data lake when you don’t have a prescribed set of required questions to answer (like, for example, a set of compliance forms you need to fill out regularly).

Data Abstracts The Particulars Of Time And Place

This is one that I have repeatedly stolen over the years from FA Hayek, who made this point in his essay The Use of Knowledge in Society (which ended up winning him a Nobel Prize three decades later).  We abstract and aggregate data in order to make sense of it, but that data covers up a lot of deeper information.  For example, we talk about “the” price of something, but price—itself an abstraction of information—depends upon the particulars of time and place, so the spot price of a gallon of gasoline can differ significantly over the course of just a few miles or a few days.  We can collect the prices of gasoline at different stations over the course of time and can infer and adduce some of the underlying causes for these levels and changes, but the data we have explains just a fragment of the underlying reality.

The Need For Process

I consider all of the adages above to be true, and yet it’s my job to figure something out.  To deal with these sorts of roadblocks, we build processes which give us structure and help us navigate some of the difficulties of managing this imperfect, incomplete, messy data.

Over the course of this series, I’m going to cover one particular process:  the Microsoft Team Data Science Process.  I don’t follow it strictly, but I do like the concepts behind it and I think it works well to describe how to increase the likelihood of launching a good data science project.

The Team Data Science Project Lifecycle (Source)

There are a couple of things that I like about this process.  First, it hits most of the highlights:  I think the combination of business understanding, data acquisition, modeling, and deployment is probably the right level of granularity.  Each of these has plenty of details that we can (and will) dig into, but I think it’s a good starting point.

The other thing that I like about this process is that it explicitly recognizes that you will bounce around between these items—it’s not like you perform data acquisition once and are done with it.  Instead, you may get into that phase, gather some data, start modeling, and then realize that you need to go back and ask more pointed business questions, or maybe you need to gather more data.  This is an explicitly iterative process, and I think that correctly captures the state of the art.

Our Sample Project

Through this series, I’m going to use the 2018 Data Professionals Salary Survey.  I took a look at the 2017 version when it came out, and then another look at the 2017 survey during my genetic programming series, but now I want to use the latest data.  As we walk through each step of the Team Data Science Process, we’ll cover implementation details and try to achieve a better understanding of data professional salaries.  Of course, the key word here is probably try.

Stripping Out HTML With StringR

This is a quick post today on removing HTML tags using the stringr package in R.

My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer.  I don’t need the resulting output to look pretty; I just want to get rid of the HTML characters.

descr <- "<div>This is a <b>Tag</b> page. <p align=\"true\">Something.  Something else. <span>Tag</span>.</div><div>8 < 3</div><div>14 <> 9 </div>"
descr <- stringr::str_replace_all(descr, '<[^<>]+?>', ' ')
descr <- stringr::str_replace_all(descr, ' {2,}', ' ')
# Expected results:
# " This is a Tag page. Something. Something else. Tag . 8 < 3 14 <> 9 "

I have a two-step process here. In the first step, I’m replacing any instance of text between angle brackets with a single space. The second replacement collapses runs of multiple spaces into one. I don’t do any checking for correctness, so there is a chance that fake tags which ought to be part of the document don’t make it past the doorman. That said, it works well enough when you don’t need perfection.

Solving For Boyce-Codd Normal Form

Last time, I pointed out that tidy data, at its core, embraces Third Normal Form.  But I left you hanging with a hint that there’s an even better place to be.  Today, we are going to look at that place:  Boyce-Codd Normal Form.

The reason we’re interested in this normal form is that it helps us solve the following data anomalies:

Update anomalies

If we have multiple rows in a table with the same value for a column, we might update one instance of it but forget to update the other rows.

Here we have some data:


Let’s say that Lindsay gets married and changes her last name.  We update her last name (and thus login) on one project but we might forget to update her login on the other project.


Duplicate data

Now let’s say that we bring Ann in off the bench to do all of the ETL, moving Lindsay and Julia off of it.  We update the database and we have a problem:


If our application simply updates rows that already exist, we now have two rows with the same data.  That’s trouble.

Missing data

By the way, where did Ann come from here?  She works for the company, but in the original data set, you would not have known.  If we want this table to have a row for Ann while she’s on the bench, we’d have to insert some sort of placeholder to bring her into existence:


This isn’t good either, because we’re now creating a bogus non-project in order to observe Ann’s existence.  There’s a better way to do this:  get our tables into Boyce-Codd Normal Form.

Defining Boyce-Codd Normal Form

Boyce-Codd Normal Form is a generalization of Second and Third Normal Forms.  There are a couple of requirements to be in Boyce-Codd Normal Form.  First, your table must be in First Normal Form.  This means that:

  • Every entity (row) has a consistent shape.  This is something relational databases do for you automatically:  you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
  • Every entity is unique:  you can uniquely identify any particular row.
  • Every attribute is atomic:  you don’t try to pack more than one value into a single attribute.
  • There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.

The other half of BCNF is that every determinant on an entity is a key.

Fun With Determinants

Okay, so what’s a determinant?  Simple, it’s:

An attribute or combination of attributes on which any other attribute or combination of attributes is functionally dependent.

Putting this a little more abstractly, if we have some attribute Z that depends upon the combination of X and Y, we call (X, Y) the determinant of Z and write it as (X, Y -> Z).  So here’s an example:

X Y Z
1 4 7
3 5 2
1 4 7
6 8 7

In this table, whenever we see X = 1 and Y = 4, we always have Z = 7.  In fact, we only see one Z for any combination of (X, Y), so we can say that (X, Y) determines Z.  In this simple example, it’d be a bit sketchy to make that determination after just four rows, but in practice, you’ll likely have some logical understanding that explains why there is a functional dependency.  For example, if we have UserID and UserName on the same table, any sane design will have UserID determine UserName, meaning that whenever we see a particular UserID, it corresponds to one and only one UserName.
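To make this concrete, here is a small sketch (in Python, using the four rows from the table above) of how you might test a candidate functional dependency against data.  The determines helper is hypothetical, not part of any library:

```python
def determines(rows, determinant, dependent):
    """Return True if every distinct value of the determinant attributes
    maps to exactly one value of the dependent attribute."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        if key in seen and seen[key] != row[dependent]:
            return False
        seen[key] = row[dependent]
    return True

rows = [
    {"X": 1, "Y": 4, "Z": 7},
    {"X": 3, "Y": 5, "Z": 2},
    {"X": 1, "Y": 4, "Z": 7},
    {"X": 6, "Y": 8, "Z": 7},
]

print(determines(rows, ["X", "Y"], "Z"))  # True
print(determines(rows, ["X"], "Z"))       # also True -- in only four rows,
                                          # even X alone appears to determine Z,
                                          # which is why small samples are sketchy
```

Note that the second check also passes here, which illustrates the caveat: data alone, especially a small sample of it, can suggest dependencies that the business logic does not actually guarantee.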

How Do We Determine Determinants?

Well, there are a couple of methods we can use.  I’ll separate them between deductive and inductive techniques…though if you’re feeling cheeky, you can use abductive techniques too.

Deductive techniques involve reasoning through the data model.  That includes things like:

  • Look for common prefixes like UserID and UserName
  • Look at what “belongs” to an object.  For example, a password belongs to a User object.  Does a password belong to a UserLoginAttempt object?  Maybe, if we’re tracking the thing the user typed in, but usually we mean the actual password associated with a particular user.
  • Look for equivalencies, like English versus inferior Metric measures.  If you can convert from one to the other, you don’t need both.
  • Discuss with product owners and stakeholders what things belong together.

Inductive techniques are all about querying the data.  For example:

  • Look for common patterns.  In my table above, I showed how we could determine Z based on X and Y.  I couldn’t tell you a priori that X and Y determine Z, but I could see it in the data.
  • Get a count of distinct attribute values for a given attribute.  This is a good way of finding 1:1 relationships like UserID to UserName.  It’s also a good way of seeing if there is dirty data somewhere, like where someone only updated some of the rows for a particular ID and missed others.
  • Build a correlation matrix for combinations of attributes.  This is the most time-consuming method, but helps you find multi-attribute determinants.
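As a sketch of the distinct-count technique, here is how the check might look in Python (the table contents and the missed-rename row are invented for illustration):

```python
from collections import defaultdict

# Hypothetical rows pulled from a Users table.
users = [
    {"UserID": 1, "UserName": "lindsay"},
    {"UserID": 1, "UserName": "lindsay"},
    {"UserID": 2, "UserName": "julia"},
    {"UserID": 2, "UserName": "julia_m"},  # a missed update on one row
]

# Collect the distinct UserNames observed for each UserID.
names_per_id = defaultdict(set)
for u in users:
    names_per_id[u["UserID"]].add(u["UserName"])

# An ID with more than one distinct name is either evidence against the
# functional dependency UserID -> UserName or a sign of dirty data.
suspect_ids = {uid for uid, names in names_per_id.items() if len(names) > 1}
print(suspect_ids)  # {2}
```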

One Easy Technique For Finding Boyce-Codd Normal Form

Jennifer Widom opened my mind to a great technique for turning a mega-table into an appropriately normalized data structure.  Let’s show you the technique via example.

We are building software to support a Digital Marketing campaign for our employer, a record company.  We want to advertise a new album, a death metal-blues mashup entitled Holes In My Heart.  The product owner has given us all of the data points we need to collect and has asked for a good data model.  We want to build a data model which satisfies Boyce-Codd Normal Form and prevents relevant data anomalies.

Here’s our starting mega-table, which includes every column they need:


We can get into First Normal Form by pointing out that the combination of CampaignID and AudienceTargetID is a primary key here, so we can guarantee that each row will be unique.  Showing that all other aspects of 1NF hold is trivial.

The process for getting us into BCNF is simple:

Step 1:  Find Functional Dependencies

We first need to find any functional dependencies.  I don’t have any data, but I have a savvy product owner who is able to give me the following dependencies:

  • CampaignID -> CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid
  • AgencyID -> AgencyName, TakeRate
  • AudienceTargetID -> TargetedAgeRange, TargetedGenrePreference
  • CampaignID, AudienceTargetID -> AudienceBidModifier

Some tips helped us figure out these dependencies, like how the columns prefixed with “Campaign” are determined by CampaignID.  But there were a couple more which we’d need to suss out in conversations.

Step 2:  Are all determinants candidate keys?

It’s a pretty simple question:  go through each determinant above and figure out if it is a candidate key to some entity.  We only have one table here and its only candidate key is (CampaignID, AudienceTargetID).  So we can definitively answer that question with a resounding “no.”

Step 3:  Break out a determinant if no

We have three determinants here which are not candidate keys, so let’s take one of them and break it out into its own table.  I’m going to pick AgencyID because it’s small.

Now we have an Agency table:


And we have our new DM campaign table, now minus AgencyName and TakeRate and plus a new foreign key constraint:


Go To Step 2, Collect $200

Then we go back to step 2:  checking if we’re done.  We clearly aren’t, so let’s pick the next functional dependency:  CampaignID.  Break those columns out into a table:


Notice that we moved AgencyID to the Campaign table, so there’s still a foreign key constraint, but it won’t be on our DM Campaign table anymore.  Anyhow, we once again strike the non-key columns from DM Campaign:


That table is getting a lot smaller, but we still fail step 2 because AudienceTargetID alone is a determinant, but our key is on the combination of CampaignID and AudienceTargetID.  So once more into the breach:


Once we strip out those columns, we’re done.  I decided to rename DM Campaign to something that makes more sense, and now we have the final data model:


This data model is in Boyce-Codd Normal Form:  we are in 1NF and each determinant is a candidate key.  Note that we don’t always need one table per determinant, however; there can be cases where a single table has multiple candidate keys.
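To see the mechanics of step 3 in miniature, here is a hypothetical sketch (column values invented for illustration) of projecting the Agency columns out of the mega-table and deduplicating them:

```python
# A slice of the hypothetical mega-table.
mega = [
    {"CampaignID": 1, "AudienceTargetID": 10, "AgencyID": 5,
     "AgencyName": "AdCo", "TakeRate": 0.15},
    {"CampaignID": 1, "AudienceTargetID": 11, "AgencyID": 5,
     "AgencyName": "AdCo", "TakeRate": 0.15},
    {"CampaignID": 2, "AudienceTargetID": 10, "AgencyID": 6,
     "AgencyName": "BuzzWorks", "TakeRate": 0.12},
]

# Break the AgencyID determinant out into its own "table," keyed by AgencyID.
agencies = {r["AgencyID"]: {"AgencyName": r["AgencyName"], "TakeRate": r["TakeRate"]}
            for r in mega}

# Strike the non-key Agency columns from the original table, keeping the
# AgencyID foreign key in place.
campaigns = [{k: v for k, v in r.items() if k not in ("AgencyName", "TakeRate")}
             for r in mega]

print(len(agencies))  # 2 -- AgencyName and TakeRate are no longer duplicated
```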


We just walked through a reasonably simple example of turning a mega-table into Boyce-Codd Normal Form.  By doing so, we eliminate several classes of data anomaly, and that’s critical when dealing with a transactional system which contains important data.

Tidy Data And Normalization

In Hadley Wickham’s paper on tidy data, he makes a few points that I really appreciated.

Data sets are made up of variables and observations.  In the database world, we’d call variables attributes and observations entities.  In the spreadsheet world, we’d call variables/attributes columns and observations/entities rows.

Each variable contains all values which measure the same underlying attribute across units.  For example, a variable might be the height of a person.  In that case, every instance of that variable should be the height of a person.  You should not intersperse measures, like having the variable contain height for a person, wingspan for a bird, and hind leg length for a dog.

Each observation contains all values measured on the same unit.  For example, we might have a person, and different variables which represent the person:  height, weight, wingspan, primary handedness, maximum number of ice cream cones consumed in a single sitting, etc.  We should not have data for two separate people in the same observation; each person, in this case, gets his own observation.

The reason we want to arrange our data this way is that it makes life easier for us.  First, it is easier to describe relationships between variables.  For example, your age is a function of your date of birth and the current date.  If we have date of birth and current date as two variables, we can easily calculate age.  Here we can see it in R and SQL:

people$age <- as.double(difftime(people$current_date, people$date_of_birth, units = "days")) / 365.25

SELECT
    DateOfBirth,
    CurrentDate,
    DATEDIFF(DAY, DateOfBirth, CurrentDate) / 365.25 AS Age
FROM dbo.Person;

By contrast, it is easier to make comparisons between observations.  For example, we can easily determine how many people are using a particular telephone number:

telephones %>%
    group_by(telephone_number) %>%
    summarize(number_of_users = n()) %>%
    select(telephone_number, number_of_users)

SELECT
    TelephoneNumber,
    COUNT(1) AS NumberOfUsers
FROM dbo.Telephone
GROUP BY TelephoneNumber;

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.
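That kind of denormalization can be sketched in a few lines (a minimal Python illustration with invented tables; in R you would reach for dplyr joins):

```python
# Two normalized "tables."
people = [
    {"person_id": 1, "name": "Ann",   "city_id": 10},
    {"person_id": 2, "name": "Julia", "city_id": 20},
]
cities = {10: "Madison", 20: "Durham"}

# The denormalized, analysis-ready structure: still one observation per row
# and one variable per column -- we just folded the city name in.
flat = [dict(p, city=cities[p["city_id"]]) for p in people]
print(flat[0]["city"])  # Madison
```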

Next time around, I’m going to make an argument that 3NF isn’t where we need to be, that there’s a better place for those analytics cool kids to hang out.

ggplot2: Radar Love

This is part eight of a series on ggplot2.

As I bring this series to a close, I want to show off one last geom:  the radar chart.  I’m a fan of radar charts and you can build them natively with ggplot, but there is also an extension called ggradar.

This brings me to a bit of a sidebar.  During the course of this series, I’ve looked at several ggplot2 extensions:  ggthemes, ggrepel, and cowplot.  These extensions and more are available at the ggplot2 extensions gallery.  There are some quite good extensions here and if you’re struggling to conceptualize a graph in ggplot2, this might give you other alternatives.

Anyhow, let’s get our radar chart on.

Building A Radar Chart

The first trick to a good radar chart is having normalized data, where everything is scaled on a common range.  Typically, we standardize data on a range from 0 to 1, where 1 is the largest value in the set and the point value for each entity is its value divided by the largest value in the set.  So for example, in the gapminder data set, Norway’s per-capita GDP in 2007 is $49,357.19.  The United States is 4th at $42,951.65.  Standardized values would set Norway = 1.0 and the US at 0.87.
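A quick sanity check on that arithmetic, using the two per-capita GDP figures just quoted:

```python
norway = 49357.19  # largest per-capita GDP in the 2007 gapminder data
usa = 42951.65     # 4th largest

print(round(norway / norway, 2))  # 1.0
print(round(usa / norway, 2))     # 0.87
```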

The second trick to a good radar chart is to have several variables of interest and a relatively small number of observations to track.  With gapminder, we can group by continent so we only have five observations, and can track a few variables for the year 2007:  number of countries, min and max life expectancy, average population, and min and max GDP per capita.

Here’s the code:

devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)

library(ggradar)
library(gapminder)
library(dplyr)

standardize <- function(x) {
    x / max(x)
}

radar_data <- gapminder %>%
    filter(year == 2007) %>%
    group_by(continent) %>%
    summarize(
        n = n(),
        minLife = min(lifeExp),
        maxLife = max(lifeExp),
        meanPop = mean(pop),
        minGdpPercap = min(gdpPercap),
        maxGdpPercap = max(gdpPercap)
    ) %>%
    mutate_each_(
        funs(standardize(.) %>% as.vector),
        vars = c("n", "minLife", "maxLife", "meanPop", "minGdpPercap", "maxGdpPercap")
    )

ggradar(
    plot.data = radar_data,
    font.radar = "Gill Sans MT",
    grid.label.size = 6,
    axis.label.size = 5,
    group.point.size = 3,
    group.line.width = 1,
    legend.text.size = 12
)

Things I can’t quit:  radar charts.

First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function which divides our variable by the max value of that variable in the vector. It doesn’t handle niceties like divide by 0, but we won’t have any zero values in our data frames.

The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to call standardize for each variable in the vars set. mutate_each_ is deprecated and I should use something different like mutate_at, but it does still work in the current version of dplyr at least.

Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.

What we end up with is a fun radar chart, letting us see how each continent stacks up.  We can see that Africa has the most countries and Oceania the fewest—Oceania only has two countries in the data set, Australia and New Zealand.

I should note that the “minimum” variables are for the lowest value within that continent, so the lowest GDP per capita in Oceania is significantly higher than the lowest GDP per capita in any other continent.  The easiest way to think about it is to consider these the best of the worst per continent.  The other variables work as you’d expect.


In today’s post, we took a look at a ggplot2 extension, ggradar.  These extensions provide us an easy way of getting functionality which we might possibly only get with great difficulty, or maybe not at all.  You don’t need to use extensions to create good visuals, but knowing where they are and when to use them can make a big difference.

ggplot2: cowplot

This is part seven of a series on ggplot2.

Up to this point, I’ve covered what I consider to be the basics of ggplot2.  Today, I want to cover a library which is still easy to use, but helps you create more advanced visuals:  cowplot.  I was excited by the name cowplot, but once I learned that it had nothing to do with cattle (instead, the author’s name is Claus O. Wilke), that did diminish the charm a little bit.  Nevertheless, there are a couple of great things you can do with this library and we’ll see one of them today.

If you’re interested in cowplot, I recommend reading the vignette first, as it provides several useful examples.  For our case, we are going to use cowplot to stack two related charts.

Charting Genocide

To this point, we have been using the gapminder data set to compare GDP and life expectancy across continents, but without looking at any countries in particular.  In today’s post, I want to show a comparison between one country and the world.

First up, let’s load our libraries:

library(tidyverse)
library(gapminder)
library(ggrepel)
library(cowplot)

# Load fonts so that Gill Sans MT renders properly.  This assumes you have
# already imported your system fonts with the extrafont package.
if (.Platform$OS.type == "windows") {
    extrafont::loadfonts(device = "win")
} else {
    extrafont::loadfonts()
}
Next up, I want to build a plot showing GDP and life expectancy changes over time across the globe.  The gapminder data set has a number of individual year-GDP-expectancy points, so we’re going to summarize them first in a data frame.  After I do that, I will plot them using ggplot.

global_avg <- gapminder %>%
    group_by(year) %>%
    summarize(m_lifeExp = mean(lifeExp), m_gdpPercap = mean(gdpPercap)) %>%
    select(year, m_lifeExp, m_gdpPercap)

plot_global <- ggplot(data = global_avg, mapping = aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        nudge_y = 0.7,
        nudge_x = -120,
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )
plot_global

Global changes in GDP and life expectancy over time

Notice that I used geom_path().  This is a geom I did not cover earlier in the series.  It’s not a common geom, though it does show up in charts like this where we want to display data for three variables.  The geom_line() geom follows the basic rules for a line:  that the variable on the y axis is a function of the variable on the x axis, which means that for each element of the domain, there is one and only one corresponding element of the range (and I have a middle school algebra teacher who would be very happy right now that I still remember the definition she drilled into our heads all those years ago).

But when you have two variables which change over time, there’s no guarantee that this will be the case, and that’s where geom_path() comes in.  The geom_path() geom does not plot y based on sequential x values, but instead plots values according to a third variable.  The trick is, though, that we don’t define this third variable—it’s implicit in the data set order.  In our case, our data frame comes in ordered by year, but we could decide to order by, for example, life expectancy by setting data = arrange(global_avg, m_lifeExp).  Note that in a scenario like these global numbers, geom_line() and geom_path() produce the same output because we’ve seen consistent improvements in both GDP per capita and life expectancy over the 55-year data set.  So let’s look at a place where that’s not true.

ggplot(data = filter(gapminder, country == "Cambodia"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

Cambodian changes in GDP and life expectancy over time

Cambodia starts out similar to the rest of the world, seeing some growth in GDP per capita and life expectancy through 1967, but a precipitous drop in both during the 1970s.  The reason was the Khmer Rouge, one of the nastiest communist governments.  This graph alone is evidence of disaster, but I really want to drive the point home:  I want a direct comparison between what happened in Cambodia versus the rest of the world at the same time, and that’s where cowplot comes in.

Plotting A Grid

We’ve seen facet_wrap() and facet_grid() already in ggplot2, but cowplot’s plot_grid() has something very helpful for us:  the rel_heights parameter.  This lets us state what percentage of the total visual space each chart should take.  Let’s take the global plot, attach the Cambodian plot, and clean up titles and axes.  Then we’ll call cowplot’s plot_grid() function.  Here’s the full code:

plot_cambodia <- ggplot(data = filter(gapminder, country == "Cambodia"), mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        title = "The Khmer Rouge Legacy",
        subtitle = "Charting Cambodian life expectancy and GDP over time, compared to global averages."
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

global_avg <- gapminder %>%
    group_by(year) %>%
    summarize(m_lifeExp = mean(lifeExp), m_gdpPercap = mean(gdpPercap)) %>%
    select(year, m_lifeExp, m_gdpPercap)

plot_global <-
    ggplot(data = global_avg, mapping = aes(x = m_gdpPercap, y = m_lifeExp)) +
    geom_point() +
    geom_path(color = "#999999") +
    scale_x_continuous(label = scales::dollar) +
    geom_text_repel(
        mapping = aes(label = year),
        nudge_y = 0.7,
        nudge_x = -120,
        segment.alpha = 0,
        family = "Gill Sans MT",
        size = 4
    ) +
    theme_minimal() +
    labs(
        x = "GDP (PPP, normalized to 2005 USD)",
        y = NULL,
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

cowplot::plot_grid(plot_cambodia, plot_global, rel_heights = c(0.55, 0.45), ncol = 1)

Comparing Cambodia to the rest of the world

We used relative heights of 55% versus 45% for this plot.  If you squeeze the world chart down, the line flattens out and distorts the image, so we want to keep these plots relatively similarly sized.

One thing I don’t like about this chart is that the year labels still end up overlapping the lines.  The ggrepel library will have text shift away from data points, but it doesn’t appear to prevent overlapping lines in a geom_path() geom.  I tried different nudge values but nothing quite worked right.

Keeping The Same X Axis

In this next chart, we’re going to look at Rwanda, another country which experienced a well-known genocide.  This time, instead of plotting both GDP per capita and life expectancy, we’re only going to look at life expectancy changes over time.  In the top chart, I’ll show Rwanda’s figures.  In the bottom chart, I’ll show a line chart with global averages over the same time frame.  Because we’ll use the same X axis, I don’t want two separate X axes for the two charts; I want them to blend.

plot_rwanda <- ggplot(data = filter(gapminder, country == "Rwanda"), mapping = aes(x = year, y = lifeExp)) +
    geom_point() +
    geom_line(color = "#999999") +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        title = "The Rwandan Genocide",
        subtitle = "Charting Rwandan life expectancy over time, compared to the global average."
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank()
    )

plot_global <-
    ggplot(data = global_avg, mapping = aes(x = year, y = m_lifeExp)) +
    geom_point() +
    geom_line(color = "#999999") +
    theme_minimal() +
    labs(
        x = NULL,
        y = NULL,
        subtitle = "Global Average",
        caption = "Source:  Gapminder data set, 2010"
    ) +
    theme(
        text = element_text(family = "Gill Sans MT"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 9),
        legend.title = element_text(size = 9),
        axis.title = element_text(size = 10)
    )

cowplot::plot_grid(plot_rwanda, plot_global, rel_heights = c(0.55, 0.45), ncol=1)

Seeing the Rwandan genocide in stark contrast to global averages

There are a couple of changes here.  Because I have a consistent X axis, I removed the X-axis ticks and labels from the top graph.  I also removed the text labels from the data points, as we are now showing year explicitly on the axis instead of implicitly along the data path.


This is only one of the uses for cowplot, but it’s a good one.  We are also not limited to two charts—we could just as easily stack an indefinite number of charts on top of one another and define relative sizes for each chart.  We can also combine cowplot’s plot_grid() with facet_wrap() to group together a set of charts and position it relative to another chart.  This would be helpful if, say, we wanted to compare one country’s change in life expectancy over time to plots of similar countries’ changes over time.
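As a rough sketch of that idea (assuming the gapminder data set is loaded; the choice of neighboring countries here is purely illustrative), we can facet the comparison countries and stack the result under the single-country chart:

```r
library(dplyr)
library(ggplot2)

# Illustrative set of countries near Cambodia in the gapminder data.
neighbors <- c("Thailand", "Vietnam", "Myanmar")

plot_cambodia <- ggplot(data = filter(gapminder, country == "Cambodia"),
                        mapping = aes(x = year, y = lifeExp)) +
    geom_line() +
    labs(title = "Cambodia") +
    theme_minimal()

# One small facet per neighboring country, sharing the same axes.
plot_neighbors <- ggplot(data = filter(gapminder, country %in% neighbors),
                         mapping = aes(x = year, y = lifeExp)) +
    geom_line() +
    facet_wrap(~ country, ncol = 3) +
    theme_minimal()

combined <- cowplot::plot_grid(plot_cambodia, plot_neighbors,
                               rel_heights = c(0.5, 0.5), ncol = 1)
```

The faceted panel behaves as a single chart inside plot_grid(), so rel_heights still controls the split between the top chart and the whole facet grid.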

We Speak Linux

I’m pleased to announce the launch of We Speak Linux, a site dedicated to helping Windows administrators and developers become familiar with Linux.  This has been Tracy Boggiano’s pet project for several months.  Along for the ride are Brian Carrig (who still needs to update his blog), Mark Wilkinson, Anthony Nocentino, and me.  I have the good fortune to work with Tracy, Brian, and Mark on a daily basis, but we haven’t recruited Anthony yet…

Why This?  Why Now?

All five of the founders are data platform professionals specializing in SQL Server.  Three years ago, that wouldn’t exactly have sounded like the kind of group to start up a Linux-oriented site, but then things started changing.  It had long been a dream of mine to have SQL Server Management Studio on Linux (e.g., this post from 2014) so that I could dump Windows and go full-time with Linux on my computers, but I figured the chances of SQL on Linux happening were nil.  Then Scott Guthrie went and announced SQL Server on Linux in March of 2016.  As soon as we heard about it, Mark and I began conspiring to get involved in the program preview.  We caught the eye of the SQL Server Customer Advisory Team (SQLCAT) and had an opportunity in early 2017 to test a workload on Linux.  Seeing how seriously the SQL Server team (especially Slava Oks) took the project sold it for me.  Half a year later, Microsoft released SQL Server 2017 on Linux.

Since then, Tracy and Anthony have been active in the SQL Server community, introducing Windows administrators and developers to Linux and showing that the transition from Windows to Linux for SQL Server is rather straightforward (especially for developers, who generally shouldn’t care what the underlying OS does).  We’re now taking this one step further.

What Are We Doing Here?

We Speak Linux is dedicated to building a virtual user group experience around Linux.  First, we’re hosting monthly webinars on various topics of interest.  Kellyn Pot’vin-Gorman will present at our inaugural event, and we’re working on filling out the schedule for the rest of the year.

Second, we’ve set up a Slack for We Speak Linux.  Joining this Slack is easy; just fill in your e-mail address and we’ll get you an automatic invitation.

Third, we have a Twitter account.

From there, we have other things planned but I don’t want to spoil everything just yet.

What Can You Do?

Are you a Windows developer or administrator interested in learning about Linux?  Check out our upcoming webinars; registration details will be available as we get closer to go-live for the first one.

Are you a seasoned Linux professional and willing to help Windows developers make the leap to Linux?  Contact us; we’d love to talk to you about doing a session.  This isn’t a SQL Server-specific group, so we want a broad range of talks.