In the previous T-SQL anti-pattern, I discussed zombie fields. One of the reasons to avoid zombie fields (aside from the risk of infection) is that you typically don’t get appropriate data constraints on your zombified attributes. Today’s anti-pattern is the generalization of this problem.
Suppose you had a set of tables that looked like this:
CREATE TABLE A ( Column1 VARCHAR(50) NOT NULL, Column2 VARCHAR(50) NOT NULL, Column3 VARCHAR(50) NOT NULL, Column4 VARCHAR(50) NOT NULL ); CREATE TABLE B ( Column2 VARCHAR(50) NOT NULL, ColumnA VARCHAR(50) NOT NULL, ColumnB VARCHAR(50) NOT NULL ); CREATE TABLE C ( Column3 VARCHAR(50) NOT NULL, ColumnB VARCHAR(50) NOT NULL, ColumnZ VARCHAR(50) NOT NULL );
Let me enumerate the problems here, even if we consider the actual table and attribute names to be okay:
- There are no primary keys. This means we could have duplicate rows, and that could mess with any analysis of the data set we try to do.
- There are no alternate keys, something especially important with tables using surrogate keys like identity integer columns. The reason for an alternate key is the same as a primary key: preventing duplicate rows from creeping in.
- We have no foreign keys. Does Column2 in table B link up to Column2 in table A? We can’t know that for sure.
- We have no check constraints. Are there any appropriate or inappropriate values for ColumnB? If we’re only supposed to accept values of 1, 2, or 3, we should have a check constraint telling us that.
- All of the columns are VARCHAR(50). I highly doubt that every one of the columns in a database really should be varchar(50). But then again, at least it’s better than varchar(max) for everything…
- There are no default constraints. Considering that all of these columns are non-nullable, it might be okay for the calling app to pass in all of the values. If so, great; otherwise, maybe we should have some reasonable defaults. If ColumnZ is supposed to be a “record created” timestamp, we should create a default constraint setting the value to CURRENT_TIMESTAMP. That way, any calling procedures can skip the created date column and still get a correct value.
In all of these cases, we’re taking flexibility out of the hands of higher-level developers and data entry people. Given that bad data comes from this flexibility, that’s a good thing.
Here’s the same set of tables, but this time with better names and all appropriate keys:
CREATE TABLE A ( Column1 INT IDENTITY(1,1) NOT NULL, Column2 TINYINT NOT NULL, Column3 DECIMAL(9,2) NOT NULL, Column4 CHAR(12) NOT NULL ); ALTER TABLE A ADD CONSTRAINT [PK_A] PRIMARY KEY CLUSTERED(Column1); ALTER TABLE A ADD CONSTRAINT [UKC_A] UNIQUE(Column2, Column4); ALTER TABLE A ADD CONSTRAINT [CK_Column3] CHECK (Column3 >= 0); ALTER TABLE A ADD CONSTRAINT [CK_Column4] CHECK (LEFT(Column4, 3) = 'RPT' AND RIGHT(Column4, 3) = 'USR'); CREATE TABLE B ( Column2 TINYINT NOT NULL, ColumnA SMALLINT NOT NULL, ColumnB VARCHAR(30) NOT NULL ); ALTER TABLE B ADD CONSTRAINT [PK_B] PRIMARY KEY CLUSTERED(Column2); CREATE TABLE C ( Column3 VARCHAR(50) NOT NULL, ColumnB CHAR(2) NOT NULL, ColumnZ DATETIME2(0) NOT NULL ); ALTER TABLE C ADD CONSTRAINT [PK_C] PRIMARY KEY CLUSTERED(Column3, ColumnB); ALTER TABLE C ADD CONSTRAINT [DF_ColumnZ] DEFAULT(CURRENT_TIMESTAMP) FOR ColumnZ; ALTER TABLE A ADD CONSTRAINT [FK_B_A] FOREIGN KEY(Column2) REFERENCES [B](Column2);
The differences here are apparent:
- We can see the reason for each table: those primary and alternate keys define entities.
- We can see how the tables relate. In this case, table B’s ColumnB was not the same as table C’s ColumnB; it was just happenstance that the two attributes are named the same. Also, we could use a tool like Visio to create a diagram of this table structure. It may not be important for such a small set of tables, but when you’re dealing with hundreds of tables, being able to see clusters of tables and relationships definitely helps.
- We can guarantee appropriate data values for many of the columns. Column3 is supposed to represent the amount of money a person has. In this case, they can’t have less than $0 available, so using a numeric value and a check constraint guarantee that nobody puts in “a lot” for the amount of money a user has.
It takes a bit more time to do, but once we have this in place, we’ve solved a lot of potential data errors without messy cleanup processes. It also makes reviewing the code significantly easier. Even if you’re the one who wrote the code, you’ll tend to forget certain things, and this sort of database structure makes it a lot easier to refresh your memory than trying to figure out which of the many VARCHAR(50) columns are supposed to join together.