Avoiding NULL with Normalization

Aaron Bertrand sticks up for NULL:

A long time ago, I answered a question about NULL on Stack Exchange entitled, “Why shouldn’t we allow NULLs?” I have my share of pet peeves and passions, and the fear of NULLs is pretty high up on my list. A colleague recently said to me, after expressing a preference to force an empty string instead of allowing NULL:

“I don’t like dealing with nulls in code.”

I’m sorry, but that’s not a good reason. How the presentation layer deals with empty strings or NULLs shouldn’t be the driver for your table design and data model. And if you’re allowing a “lack of value” in some column, does it matter to you from a logical standpoint whether the “lack of value” is represented by a zero-length string or a NULL? Or worse, a token value like 0 or -1 for integers, or 1900-01-01 for dates?

I linked to this on Curated SQL, where I’d started to write out a response. After about four paragraphs of that, I decided that maybe it’d make sense to turn this into a full blog post rather than a mini-commentary, as I think it deserves the lengthier treatment. I’m going to assume that you’ve read Aaron’s post first, and it’s a well-done apologia in support of using NULLs pragmatically. I’ll start my response with a point of agreement, but then move to differences and alternatives before laying out where I see additional common ground between Aaron’s and my thoughts on the matter.

Token Values (Generally) Aren’t the Answer

Token values, like the ones Aaron describes, aren’t a good answer. And there’s a really good reason why: values represent something specific. An admission date of 2021-07-01 represents a specific fact: some person was admitted on 2021-07-01, and “2021-07-01” itself represents July 1, 2021. I’m belaboring this point because token values do not follow this same pattern. A person with an admission date of 1900-01-01 might follow this pattern, but might not. The problem is that there is an information sub-channel, where I need to know whether the value represents the thing in itself or in fact represents some other, unrelated notion. Quite frequently, a date of 1900-01-01 (or 1763-01-01 or 0001-01-01) means “I don’t know what the actual date was, but it was long enough ago that it shouldn’t really matter…I think.”

This is a problem for all of the reasons Aaron describes: there is typically no information on the fact that an information sub-channel exists, no clear representation of what each datum represents (is 1900-01-02 different from 1900-01-01 here?), and the risk of problems down the line if people don’t know that special cases exist.
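To make that sub-channel concrete, here’s a small hypothetical example (the table and values are mine, purely for illustration): once a sentinel date enters the table, ordinary aggregates silently lie unless the query author knows to filter the token value out.

```sql
-- Hypothetical table: 1900-01-01 secretly means "unknown admission date."
CREATE TABLE dbo.Admissions
(
    AdmissionID   int NOT NULL PRIMARY KEY,
    AdmissionDate date NOT NULL
);

INSERT INTO dbo.Admissions(AdmissionID, AdmissionDate)
VALUES (1, '2021-07-01'), (2, '2021-06-15'), (3, '1900-01-01');

-- Returns 1900-01-01 unless the query remembers the token value exists.
SELECT MIN(AdmissionDate) AS EarliestAdmission
FROM dbo.Admissions;
```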

One area where token values can be the answer is data warehousing, where we might use well-known sentinel values (such as 1900-01-01 or 9999-12-31) to keep warehouse queries NULL-free and to handle type-2 slowly changing dimensions. But in most places, this is a workaround rather than a solution, and not a good one.
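As a sketch of that warehousing exception (the table and column names here are mine, not from any particular warehouse), a type-2 dimension with a 9999-12-31 sentinel lets point-in-time queries use plain range predicates with no NULL handling:

```sql
-- Type-2 slowly changing dimension: ValidTo = 9999-12-31 marks the current row.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  int NOT NULL,
    CustomerName nvarchar(100) NOT NULL,
    ValidFrom    date NOT NULL,
    ValidTo      date NOT NULL
);

-- "Which version was current on 2021-07-01?" needs no IS NULL logic.
SELECT CustomerKey, CustomerName
FROM dbo.DimCustomer
WHERE ValidFrom <= '2021-07-01'
  AND ValidTo > '2021-07-01';
```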

Embrace Normalization

The proper answer here is normalization, specifically 6th Normal Form. Given that I need two hands to count up to 6th Normal Form, that’s a pretty big number. Fortunately, I happen to have a talk which goes into detail on database normalization, so I can swipe a slide from it:

Well, if that doesn’t clear everything up, I don’t know what will!

Okay, so what does this mean? Fortunately, I have an example on the next page. Suppose that we have some table (okay, relvar, but let’s not spend too much time on semantics here) with a few attributes: Product = { ProductKey, ProductName, ProductColor, ProductWeight, StorageCity }. Our natural key is ProductKey, and each attribute describes a fact about the specific product. Putting this into 6th Normal Form would mean creating four separate tables/relvars:

ProductKeyProductName = { ProductKey, ProductName }

ProductKeyProductColor = { ProductKey, ProductColor }

ProductKeyProductWeight = { ProductKey, ProductWeight }

ProductKeyStorageCity = { ProductKey, StorageCity }

Now, before you run me out of town, let’s stipulate that I don’t recommend doing this, just that if you wanted to be in 6th Normal Form, this is what you’d have to do.

What I do recommend, however, is that we use 6NF to avoid NULLs. Suppose that we might not always have a product weight, but that we always have the other attributes. Instead of making ProductWeight NULLable, we can instead create two tables:

Product = { ProductKey, ProductName, ProductColor, StorageCity }

ProductWeight = { ProductKey, ProductWeight }

That way, if we do have a product weight, we insert a row into the table. If we don’t have a product weight, we don’t insert a row.
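Assuming tables created to match the two relvars above, the pattern looks like this (the sample values are mine): the presence of a weight becomes the presence of a row, and NULL only reappears if we choose to reassemble the original shape with an outer join.

```sql
-- A product with a known weight gets a row in both tables.
INSERT INTO dbo.Product(ProductKey, ProductName, ProductColor, StorageCity)
VALUES (1, N'Widget', N'Red', N'Durham');
INSERT INTO dbo.ProductWeight(ProductKey, ProductWeight)
VALUES (1, 2.5);

-- A product with no known weight simply skips the second insert.
INSERT INTO dbo.Product(ProductKey, ProductName, ProductColor, StorageCity)
VALUES (2, N'Gizmo', N'Blue', N'Raleigh');

-- Reassembling the original shape reintroduces NULL only at query time.
SELECT p.ProductKey, p.ProductName, pw.ProductWeight
FROM dbo.Product p
    LEFT OUTER JOIN dbo.ProductWeight pw
        ON pw.ProductKey = p.ProductKey;
```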

Using Aaron’s example, where EnteredService is not always available, we end up with two tables:

CREATE TABLE dbo.Widgets
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_W6NF PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.WidgetEnteredService
(
    WidgetID int NOT NULL,
    EnteredService datetime NOT NULL,
    CONSTRAINT PK_WES PRIMARY KEY (WidgetID)
);

When you have a relatively small number of NULLable columns, this is a great solution.

How Does It Perform?

I think the above solution is better from a relational theory standpoint, but let’s do a quick performance comparison. As a note, my priors on this are that this will not always be great for performance, especially if you have several 6NF tables connected to your base table. Furthermore, because I’m not doing a comprehensive test of all possible scenarios (or even a reasonable subset of all interesting scenarios), if it does perform better, I’m not going to claim that the 6NF solution is better because of performance. Instead, my argument is based on relational architecture over performance. But if it runs like a dog, that’s a strong argument in favor of Aaron’s more pragmatic approach. So let’s give it a go and find out, shall we?

Here’s the setup script I used, adding in my tables and keeping things as similar to his setup as I could.

CREATE TABLE dbo.Widgets
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_W6NF PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.WidgetEnteredService
(
    WidgetID int NOT NULL,
    EnteredService datetime NOT NULL,
    CONSTRAINT PK_WES PRIMARY KEY (WidgetID)
);

CREATE TABLE dbo.Widgets_NULL
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WNULL PRIMARY KEY (WidgetID)
);
 
CREATE TABLE dbo.Widgets_Token
(
    WidgetID     int IDENTITY(1,1) NOT NULL,
    SerialNumber uniqueidentifier NOT NULL DEFAULT NEWID(),
    Description  nvarchar(500),
    CONSTRAINT   PK_WToken PRIMARY KEY (WidgetID)
);

INSERT dbo.Widgets_NULL(Description) 
OUTPUT inserted.Description INTO dbo.Widgets_Token(Description)
SELECT TOP (100000) LEFT(OBJECT_DEFINITION(o.object_id), 250)
FROM master.sys.all_objects AS o 
    CROSS JOIN (SELECT TOP (50) * FROM master.sys.all_objects) AS o2
WHERE o.[type] IN (N'P',N'FN',N'V')
	AND OBJECT_DEFINITION(o.object_id) IS NOT NULL;

INSERT INTO dbo.Widgets(Description) SELECT Description FROM dbo.Widgets_NULL;

ALTER TABLE dbo.Widgets_NULL  ADD EnteredService datetime;
ALTER TABLE dbo.Widgets_Token ADD EnteredService datetime;
GO
 
UPDATE dbo.Widgets_NULL  
SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
WHERE WidgetID > 90000;
 
UPDATE dbo.Widgets_Token 
SET EnteredService = DATEADD(DAY, WidgetID/250, '20200101') 
WHERE WidgetID > 90000;
 
UPDATE dbo.Widgets_Token 
SET EnteredService = '19000101'
WHERE WidgetID <= 90000;

INSERT INTO dbo.WidgetEnteredService(WidgetID, EnteredService)
SELECT
	WidgetID,
	EnteredService
FROM dbo.Widgets_NULL
WHERE
	EnteredService IS NOT NULL;

CREATE INDEX IX_EnteredService ON dbo.Widgets_NULL (EnteredService);
CREATE INDEX IX_EnteredService ON dbo.Widgets_Token(EnteredService);
CREATE INDEX IX_EnteredService ON dbo.WidgetEnteredService(EnteredService);

Here are the size totals for each of the three states: initial table loading, after adding EnteredService dates, and after adding an index on EnteredService.

Note that my values agree completely with Aaron’s for the first two sets, though my “After index” sizes are ever so slightly larger than his. It’s not enough to make a difference in the analysis, though.

Running Aaron’s sample query for all three tables, we can see that the 6NF form is just about the same as the NULL form:

In order, Token, 6NF, NULL.
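I won’t reproduce Aaron’s query here, but the shape of the comparison is along these lines (this is my sketch of the pattern, not his exact code): an aggregate filtered on EnteredService, where the 6NF version simply points at the satellite table instead.

```sql
-- NULL and Token versions: filter directly on the column.
SELECT COUNT(*) AS WidgetCount
FROM dbo.Widgets_NULL
WHERE EnteredService <= '20210101';

-- 6NF version: the filter moves to the satellite table.
SELECT COUNT(*) AS WidgetCount
FROM dbo.WidgetEnteredService
WHERE EnteredService <= '20210101';
```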

If we change the query a bit to include the serial number and drop the aggregation, we end up with queries like:

SELECT WidgetID, SerialNumber, EnteredService
FROM dbo.Widgets_NULL 
WHERE EnteredService <= '20210101';

SELECT wes.WidgetID, w.SerialNumber, wes.EnteredService
FROM dbo.WidgetEnteredService wes
	INNER JOIN dbo.Widgets w
		ON wes.WidgetID = w.WidgetID
WHERE wes.EnteredService <= '20210101';

In this case, both have durations of 3ms on my machine and both take 5607 reads (well, the NULL version takes 5608, but close enough). That’s because both of them ultimately have very similar-looking query plans:

Always with the nested loops.

In fairness, the NULLable version has a benefit that we won’t get from the 6NF version: if we know queries will include SerialNumber, we can add that column to the index and remove the key lookup, turning this into an 11-read operation versus 5607 reads. In other words, if you rely on the NULLable columns for filtering and you can create covering indexes, you should expect the 6NF version to perform worse.
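For reference, that covering index would look something like the following (the index name is mine):

```sql
-- Including SerialNumber makes the index covering for this query,
-- eliminating the key lookup against the clustered index.
CREATE INDEX IX_EnteredService_Covering
    ON dbo.Widgets_NULL (EnteredService)
    INCLUDE (SerialNumber);
```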

For other types of queries, the answer is a bit murkier. For example, pulling rows over a range filtered by WidgetID will give you differing results based on the size of the range: for smaller ranges, the 6NF form may end up with fewer reads; for larger ranges, the NULLable version may end up with fewer reads. For point lookups (WidgetID = x), it takes 5 reads for 6NF and 3 for NULL at these row counts.

SELECT WidgetID, SerialNumber, EnteredService
FROM dbo.Widgets_NULL 
WHERE WidgetID > 95000;

SELECT wes.WidgetID, w.SerialNumber, wes.EnteredService
FROM dbo.Widgets w
	LEFT OUTER JOIN dbo.WidgetEnteredService wes
		ON wes.WidgetID = w.WidgetID
WHERE w.WidgetID > 95000;
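And the point lookups behind the 5-read versus 3-read comparison look like this (the specific WidgetID value is arbitrary):

```sql
-- NULLable version: a single seek on the clustered index.
SELECT WidgetID, SerialNumber, EnteredService
FROM dbo.Widgets_NULL
WHERE WidgetID = 95001;

-- 6NF version: one seek per table.
SELECT w.WidgetID, w.SerialNumber, wes.EnteredService
FROM dbo.Widgets w
    LEFT OUTER JOIN dbo.WidgetEnteredService wes
        ON wes.WidgetID = w.WidgetID
WHERE w.WidgetID = 95001;
```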

In short, it’s a mixed bag, but I’m willing to say that 6NF will probably perform worse, especially on very large datasets.

When to Compromise?

In this last section, I want to talk about cases where I do agree with Aaron and have no real problem using NULLable columns. The relational model is a beautiful thing, but implementations aren’t perfect and we do need to keep performance in mind, so “No NULLs ever” isn’t a stance I can support. Also, note that I came up with this list late at night and I’m probably missing some cases. So think of it as illustrative rather than comprehensive. Sort of like the whole post, now that I think about it…

Performance-Critical Tables with Frequent Use of NULL Columns

The first example fits in with what we saw above. If you constantly query the Widgets table by EnteredService and you need the absolute best performance, then yes, making this a single table can make sense. But I want to emphasize that this applies to performance-critical tables, and frankly, I’d lay out the claim that query performance is rarely a critical need: there may be a few tables in an environment which absolutely need supporting queries to run as fast as possible, but for most other tables and queries, “good enough” is a fair distance from “absolute fastest possible.” Mind you, I’m not talking about completely ignoring performance or being okay with OLTP queries taking 5 minutes; instead, we’re talking about 3ms with 5000 reads versus 3ms with 11 reads. If you run this query a few million times per day, then yeah, use the NULL. If you run it a few hundred times per day (or run other variants in which the difference is smaller, or even in 6NF’s favor), I don’t think performance is a real deciding factor.

Staging Tables

Staging tables are a good example of a case where I’d happily use NULL, and that’s because I have no idea what kind of garbage I’m going to get from source systems. I need to develop ETL processes with garbage inputs in mind, and that means dealing with missing values or bad data that I might need to drop. Also, I might merge together data from multiple sources in separate chunks, loading 4 columns from a dimension based on this source, 3 from that source, and 5 more from yet another source. I don’t want to try to connect all of these together (perhaps for performance reasons), and although I could solve the problem without NULL, it’s pretty easy when you let the staging tables be NULLable. But when going from staging tables to real tables, I have much more control over the domain and this reason goes away.

Temp Tables

Similarly to staging tables, I don’t typically care about NULL values in temp tables. They’re not permanent tables and they’re typically intended to help work on a problem piecemeal. NULL away in these temp tables unless you need the benefits of NOT NULL (or need to add a primary key or something which requires non-NULL attributes).

Conclusion

Do you need NULL in your database? Nope. Using 6th Normal Form can eliminate the need for NULLs, but understand the implications of it.

Am I going to be insulted that you have NULL in your database? Nah.

Do I create tables with NULL? You betcha.

Should I have created all of those tables with NULL? Um…let’s not answer that one.
