In Hadley Wickham’s paper on tidy data, he makes a few points that I really appreciated.
Data sets are made up of variables and observations. In the database world, we’d call variables attributes and observations entities. In the spreadsheet world, we’d call variables/attributes columns and observations/entities rows.
Each variable contains all values which measure the same underlying attribute across units. For example, a variable might be the height of a person. In that case, every instance of that variable should be the height of a person. You should not intersperse measures, like having the variable contain height for a person, wingspan for a bird, and hind leg length for a dog.
Each observation contains all values measured on the same unit. For example, we might have a person, and different variables which represent the person: height, weight, wingspan, primary handedness, maximum number of ice cream cones consumed in a single sitting, etc. We should not have data for two separate people in the same observation; each person, in this case, gets his own observation.
The reason we want to arrange our data this way is that it makes life easier for us. First, it is easier to describe relationships between variables. For example, your age is a function of your date of birth and the current date. If we have date of birth and current date as two variables, we can easily calculate age. Here we can see it in R and SQL:
people$age <- as.double(difftime(people$current_date, people$date_of_birth, units = "days")) / 365.25
SELECT DateOfBirth, CurrentDate, DATEDIFF(DAY, DateOfBirth, CurrentDate) / 365.25 AS Age FROM dbo.Person;
By contrast, it is easier to make comparisons between observations. For example, we can easily determine how many people are using a particular telephone number:
telephones %>% group_by(telephone_number) %>% summarize(number_of_users = n()) %>% select(telephone_number, number_of_users)
SELECT TelephoneNumber, COUNT(1) AS NumberOfUsers FROM dbo.Telephone GROUP BY Telephone;
The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data. Specifically, Wickham argues that tidy data should achieve third normal form.
Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure: we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.
Next time around, I’m going to make an argument that 3NF isn’t where we need to be, that there’s a better place for those analytics cool kids to hang out.