Last time, I pointed out that tidy data, at its core, embraces Third Normal Form. But I left you hanging with a hint that there’s an even better place to be. Today, we are going to look at that place: Boyce-Codd Normal Form.
The reason we’re interested in this normal form is that it helps us solve the following data anomalies:
If we have multiple rows in a table with the same value for a column, we might update one instance of it but forget to update the other rows.
Here we have some data:
Let’s say that Lindsay gets married and changes her last name. We update her last name (and thus login) on one project but we might forget to update her login on the other project.
Now let’s say that we bring Ann in off the bench to do all of the ETL, moving Lindsay and Julia off of it. We update the database and we have a problem:
If our application simply updates rows that already exist, we now have two rows with the same data. That’s trouble.
By the way, where did Ann come from here? She works for the company, but in the original data set, you would not have known. If we want this table to have a row for Ann while she’s on the bench, we’d have to insert some sort of placeholder to bring her into existence:
This isn’t good either, because we’re now creating a bogus non-project in order to observe Ann’s existence. There’s a better way to do this: get our tables into Boyce-Codd Normal Form.
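Since the tables in this post are images, here is a minimal Python sketch of the update anomaly. The column names and project assignments are assumptions based on the narrative, not the actual data set:

```python
# Hypothetical denormalized project-assignment table, as plain Python rows.
# Names come from the narrative; project names and logins are assumptions.
assignments = [
    {"project": "ETL", "employee": "Lindsay", "login": "lsmith"},
    {"project": "Reporting", "employee": "Lindsay", "login": "lsmith"},
    {"project": "ETL", "employee": "Julia", "login": "jdoe"},
]

# Update anomaly: we change Lindsay's login on one project
# and forget to update her row on the other project.
assignments[0]["login"] = "ljones"

# Now the same person has two different logins in the same table.
logins = {row["login"] for row in assignments if row["employee"] == "Lindsay"}
print(logins)  # two values where there should be one
```

Because the same fact (Lindsay's login) lives in multiple rows, nothing stops the rows from disagreeing. That is exactly the class of anomaly normalization removes.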
Defining Boyce-Codd Normal Form
Boyce-Codd Normal Form is a strengthening of Second and Third Normal Forms: any table in BCNF is automatically in 2NF and 3NF as well. There are a couple of requirements to be in Boyce-Codd Normal Form. First, your table must be in First Normal Form. This means that:
- Every entity (row) has a consistent shape. This is something relational databases do for you automatically: you can’t create a table where one entity has an attribute (column) but the next entity doesn’t.
- Every entity is uniquely identifiable: you can pick out any particular row.
- Every attribute is atomic: you don’t try to pack more than one value into a single attribute.
- There are no repeating groups of attributes, like PaymentMethod1, PaymentMethod2, PaymentMethod3, etc.
The other half of BCNF is that every determinant in the table is a candidate key.
Fun With Determinants
Okay, so what’s a determinant? Simple, it’s:
An attribute, or combination of attributes, on which some other attribute (or combination of attributes) is functionally dependent.
Putting this a little more abstractly, if we have some attribute Z that depends upon the combination of X and Y, we call (X, Y) the determinant of Z and write it as (X, Y) -> Z. So here’s an example:
In this table, whenever we see X = 1 and Y = 4, we always have Z = 7. In fact, we only see one Z for any combination of (X, Y), so we can say that (X, Y) determines Z. In this simple example, it’d be a bit sketchy to make that determination after just four rows, but in practice, you’ll likely have some logical understanding that explains why there is a functional dependency. For example, if we have UserID and UserName on the same table, any sane design will have UserID determine UserName, meaning that whenever we see a particular UserID, it corresponds to one and only one UserName.
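To make the idea concrete, here is a small Python sketch that checks whether a functional dependency holds over a set of rows, using toy (X, Y, Z) data in the spirit of the table above:

```python
def holds(rows, determinant, dependent):
    """Check whether `determinant` functionally determines `dependent`:
    every distinct determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = row[dependent]
        # setdefault records the first value seen for this key;
        # any later disagreement means the dependency is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {"X": 1, "Y": 4, "Z": 7},
    {"X": 1, "Y": 4, "Z": 7},
    {"X": 2, "Y": 4, "Z": 9},
    {"X": 1, "Y": 5, "Z": 8},
]
print(holds(rows, ("X", "Y"), "Z"))  # True: one Z per (X, Y)
print(holds(rows, ("X",), "Z"))      # False: X = 1 maps to both 7 and 8
```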
How Do We Determine Determinants?
Well, there are a couple of methods we can use. I’ll separate them between deductive and inductive techniques…though if you’re feeling cheeky, you can use abductive techniques too.
Deductive techniques involve reasoning through the data model. That includes things like:
- Look for common prefixes like UserID and UserName
- Look at what “belongs” to an object. For example, a password belongs to a User object. Does a password belong to a UserLoginAttempt object? Maybe, if we’re tracking the thing the user typed in, but usually we mean the actual password associated with a particular user.
- Look for equivalencies, like English versus inferior Metric measures. If you can convert from one to the other, you don’t need both.
- Discuss with product owners and stakeholders what things belong together.
Inductive techniques are all about querying the data. For example:
- Look for common patterns. In my table above, I showed how we could determine Z based on X and Y. I couldn’t tell you a priori that X and Y determine Z, but I could see it in the data.
- Get a count of distinct attribute values for a given attribute. This is a good way of finding 1:1 relationships like UserID to UserName. It’s also a good way of seeing if there is dirty data somewhere, like where someone only updated some of the rows for a particular ID and missed others.
- Build a correlation matrix for combinations of attributes. This is the most time-consuming method, but helps you find multi-attribute determinants.
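The inductive techniques above can be sketched as a brute-force search: try every small combination of attributes as a left-hand side and see whether the data contradicts it. The sample rows below are made up for illustration, and the same caveat as before applies: on a tiny sample, this will happily flag coincidental dependencies.

```python
from itertools import combinations

def determines(rows, lhs, rhs):
    """True if every distinct lhs value maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def find_dependencies(rows, max_lhs=2):
    """Brute-force search for functional dependencies with up to
    `max_lhs` attributes on the left-hand side."""
    attrs = sorted(rows[0])
    found = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs not in lhs and determines(rows, lhs, rhs):
                    found.append((lhs, rhs))
    return found

# Hypothetical sample data, not the post's actual tables.
rows = [
    {"UserID": 1, "UserName": "ann", "Project": "ETL"},
    {"UserID": 1, "UserName": "ann", "Project": "Reporting"},
    {"UserID": 2, "UserName": "lindsay", "Project": "ETL"},
]
for lhs, rhs in find_dependencies(rows, max_lhs=1):
    print(lhs, "->", rhs)  # finds the UserID <-> UserName 1:1 relationship
```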
One Easy Technique For Finding Boyce-Codd Normal Form
Jennifer Widom opened my mind to a great technique for turning a mega-table into an appropriately normalized data structure. Let me show you the technique via example.
We are building software to support a Digital Marketing campaign for our employer, a record company. We want to advertise a new album, a death metal-blues mashup entitled Holes In My Heart. The product owner has given us all of the data points we need to collect and has asked for a good data model. We want to build a data model which satisfies Boyce-Codd Normal Form and prevents relevant data anomalies.
Here’s our starting mega-table, which includes every column they need:
We can get into First Normal Form by pointing out that the combination of CampaignID and AudienceTargetID is a primary key here, so we can guarantee that each row will be unique. Showing that all other aspects of 1NF hold is trivial.
The process for getting us into BCNF is simple:
Step 1: Find Functional Dependencies
We first need to find any functional dependencies. I don’t have any data, but I have a savvy product owner who is able to give me the following dependencies:
- CampaignID -> CampaignName, CampaignOwnerName, AgencyID, MaximumDailyBudget, MinimumBid
- AgencyID -> AgencyName, TakeRate
- AudienceTargetID -> TargetedAgeRange, TargetedGenrePreference
- CampaignID, AudienceTargetID -> AudienceBidModifier
Some of the deductive tips above helped us figure out these details, like how the attributes prefixed with “Campaign” are determined by CampaignID. But there were a couple more which we’d need to suss out in conversations with the product owner.
Step 2: Are all determinants candidate keys?
It’s a pretty simple question: go through each determinant above and figure out if it is a candidate key to some entity. We only have one entity here and the only candidate key is (CampaignID, AudienceTargetID). So we can definitively answer that question with a resounding “no.”
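One way to answer this question mechanically is attribute closure: a determinant is a candidate key exactly when its closure under the functional dependencies covers every attribute in the table. Here is a sketch using the dependencies our product owner gave us:

```python
def closure(attrs, fds):
    """Compute the attribute closure of `attrs` under functional
    dependencies `fds`, given as (lhs_tuple, rhs_tuple) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left-hand side,
            # we also determine everything on the right-hand side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [
    (("CampaignID",), ("CampaignName", "CampaignOwnerName", "AgencyID",
                       "MaximumDailyBudget", "MinimumBid")),
    (("AgencyID",), ("AgencyName", "TakeRate")),
    (("AudienceTargetID",), ("TargetedAgeRange", "TargetedGenrePreference")),
    (("CampaignID", "AudienceTargetID"), ("AudienceBidModifier",)),
]
all_attrs = set().union(*(set(l) | set(r) for l, r in fds))

# (CampaignID, AudienceTargetID) reaches every attribute, so it is a key;
# AgencyID alone does not, so it is a determinant that is not a candidate key.
print(closure(("CampaignID", "AudienceTargetID"), fds) == all_attrs)  # True
print(closure(("AgencyID",), fds) == all_attrs)                       # False
```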
Step 3: Break out a determinant if no
We have three determinants here which are not candidate keys, so let’s take one of them and break it out into its own table. I’m going to pick AgencyID because it’s small.
Now we have an Agency table:
And we have our new DM campaign table, now minus AgencyName and TakeRate and plus a new foreign key constraint:
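The break-out step can be sketched as a projection: copy the determinant plus its dependent columns into a new table, deduplicate, and drop the dependent columns from the original. The sample rows here are hypothetical stand-ins for the mega-table:

```python
def break_out(rows, key, moved):
    """Split a determinant into its own table: project (key + moved)
    and deduplicate, then drop the `moved` columns from the original rows."""
    new_table = {tuple(r[a] for a in key): {a: r[a] for a in key + moved}
                 for r in rows}
    remaining = [{a: r[a] for a in r if a not in moved} for r in rows]
    return list(new_table.values()), remaining

# Hypothetical rows standing in for the mega-table's agency columns.
rows = [
    {"CampaignID": 1, "AudienceTargetID": 10, "AgencyID": 5,
     "AgencyName": "AdCo", "TakeRate": 0.15},
    {"CampaignID": 2, "AudienceTargetID": 10, "AgencyID": 5,
     "AgencyName": "AdCo", "TakeRate": 0.15},
]
agency, dm_campaign = break_out(rows, ("AgencyID",), ("AgencyName", "TakeRate"))
print(agency)       # one Agency row instead of one copy per campaign
print(dm_campaign)  # AgencyID stays behind as a foreign key
```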
Go To Step 2, Collect $200
Then we go back to step 2: checking if we’re done. We clearly aren’t, so let’s pick the next determinant, CampaignID, and break its dependent columns out into a table:
Notice that we moved AgencyID to the Campaign table, so there’s still a foreign key constraint, but it won’t be on our DM Campaign table anymore. Anyhow, we once again strike the non-key columns from DM Campaign:
That table is getting a lot smaller, but we still fail step 2 because AudienceTargetID alone is a determinant, but our key is on the combination of CampaignID and AudienceTargetID. So once more into the breach:
Once we strip out those columns, we’re done. I decided to rename DM Campaign to something that makes more sense, and now we have the final data model:
This data model is in Boyce-Codd Normal Form: we are in 1NF and each determinant is a candidate key. Note that we don’t always need one table per determinant, however; there can be cases where a single table has multiple candidate keys.
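As a sketch of where we end up, here is the final data model as SQLite DDL, built from the functional dependencies above. The name CampaignAudience for the renamed DM Campaign table is my assumption, since the post’s table images aren’t reproduced here:

```python
import sqlite3

# Build the final BCNF schema in an in-memory SQLite database.
# Column names follow the functional dependencies listed earlier;
# the CampaignAudience table name is a hypothetical choice.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Agency (
    AgencyID INTEGER PRIMARY KEY,
    AgencyName TEXT NOT NULL,
    TakeRate REAL NOT NULL
);
CREATE TABLE Campaign (
    CampaignID INTEGER PRIMARY KEY,
    CampaignName TEXT NOT NULL,
    CampaignOwnerName TEXT NOT NULL,
    AgencyID INTEGER NOT NULL REFERENCES Agency(AgencyID),
    MaximumDailyBudget REAL NOT NULL,
    MinimumBid REAL NOT NULL
);
CREATE TABLE AudienceTarget (
    AudienceTargetID INTEGER PRIMARY KEY,
    TargetedAgeRange TEXT NOT NULL,
    TargetedGenrePreference TEXT NOT NULL
);
CREATE TABLE CampaignAudience (
    CampaignID INTEGER NOT NULL REFERENCES Campaign(CampaignID),
    AudienceTargetID INTEGER NOT NULL REFERENCES AudienceTarget(AudienceTargetID),
    AudienceBidModifier REAL NOT NULL,
    PRIMARY KEY (CampaignID, AudienceTargetID)
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Each determinant now heads its own table, and every non-key column depends on that table’s key alone.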
We just walked through a reasonably simple example of turning a mega-table into Boyce-Codd Normal Form. By doing so, we eliminate several classes of data anomaly, and that’s critical when dealing with a transactional system which contains important data.