In conjunction with SQL Saturday Madison, I am giving my first full-day training session, entitled "Enter the Tidyverse: R for the Data Professional," on Friday, April 6th. I'm using the term "data professional" in particular because I want to hit a relatively under-served part of the community: database administrators. I should note that if you're a…
Data Processing: The Other 90%
This is part three of a series on launching a data science project. The Three Steps Of Data Processing: Data processing is made up of a few different activities: data gathering, data cleansing, and data analysis. Most estimates are that data scientists spend about 80% of their time in data processing (particularly in data cleansing). …
Building Business Understanding
This is part two of a series on launching a data science project. How Is Babby Data Science Project Formed? Behind each data science project, there is (hopefully) someone higher up on the business side who wants it done. This person might have been the visionary behind this project or might simply be the sponsor…
The Microsoft Team Data Science Process
This is part one of a series on launching a data science project. This is the beginning of a series of posts around growing a data science project from the germ of an idea to its fruition as a stable oak. Before I get into the process, I want to start with a few data…
Stripping Out HTML With StringR
This is a quick post today on removing HTML tags using the stringr package in R. My purpose here is in taking some raw data, which can include HTML markup, and preparing it for a vectorizer. I don't need the resulting output to look pretty; I just want to get rid of the HTML characters.…
Solving For Boyce-Codd Normal Form
Last time, I pointed out that tidy data, at its core, embraces Third Normal Form. But I left you hanging with a hint that there's an even better place to be. Today, we are going to look at that place: Boyce-Codd Normal Form. The reason we're interested in this normal form is that it helps…
Tidy Data And Normalization
In Hadley Wickham's paper on tidy data, he makes a few points that I really appreciated. Data sets are made up of variables and observations. In the database world, we'd call variables attributes and observations entities. In the spreadsheet world, we'd call variables/attributes columns and observations/entities rows. Each variable contains all values which measure the…
ggplot2: Radar Love
This is part eight of a series on ggplot2. As I bring this series to a close, I want to show off one last geom: the radar chart. I'm a fan of radar charts and you can build them natively with ggplot, but there is also an extension called ggradar. This brings me to a…
ggplot2: cowplot
This is part seven of a series on ggplot2. Up to this point, I've covered what I consider to be the basics of ggplot2. Today, I want to cover a library which is still easy to use, but helps you create more advanced visuals: cowplot. I was excited by the name cowplot, but once I…
We Speak Linux
I'm pleased to announce the launch of We Speak Linux, a site dedicated to helping Windows administrators and developers become familiar with Linux. This has been Tracy Boggiano's pet project for several months. Along for the ride are Brian Carrig (who still needs to update his blog), Mark Wilkinson, Anthony Nocentino, and me. I have…