Yesterday’s post was all about simplifying individual procedures. With today’s post, I’m going to bring it up one level and talk about inter-procedural design. For this topic, think about how you might re-design a class in an object-oriented language. In a class, you have private, protected, and public methods and members. The public methods and members form a contract and it’s harder to refactor that contract because you then need to check every bit of code which calls that class. For private and protected methods and members (and the internal how-does-it-work bits of public methods), however, you have free reign to refactor code. In the database world, our tables are akin to private members, stored procedures which are only called internally (i.e., by other stored procedures) are akin to private methods, and stored procedures which are called by external code are sort of like public methods. With this in mind, I’m going to dive into the topic of refactoring an entire design.
To understand this topic better, I’m going to use as a case study a recent change I made to one table and its adjoining stored procedures. I think the easiest way to describe the initial design starts with a picture, so here goes. Note that I’ve removed some attributes and anonymized/modified it a little bit to protect the innocent:
As a quick note, dotted lines represent stored procedures which update the underlying table. Using my original metaphor, here’s the breakdown of elements in this original design:
- CostLog — Table (private member)
- Insert — Stored procedure (private method)
- CreateWork — Stored procedure (public method)
- GetForService — Stored procedure (public method) — this procedure gets called from within CreateWork, but there are also schedules which call this procedure directly.
- Update — Stored procedure (public method)
- GetRecentActivity — Stored procedure (public method)
- Get — Stored procedure (private method)
- GetByDate — Stored procedure (public method)
- GetLastUpdate — Stored procedure (public method)
The reason I wanted to elaborate on this metaphor is to note that it’s relatively easy to change the “private method” style procedures, but a lot harder to change the “public” ones, as this would involve changing C# or website code somewhere.
Let me walk through the pain points in this design. You can probably see some of them just from the diagram, but I want to enumerate them all.
- The table is named CostLog, but is actually used for two purposes. In some sense, it actually is a log: it acts as a storage table of record for cost processing information over time. On the other hand, it is also a queue table, as new cost log records go in and get updated as the records go through different steps in the process. This leads to naming confusion: it is a log and it isn’t a log. The inability to come up with a good name for an entity is a code smell that should tell us that the entity may be trying to do too much and we should look at splitting out individual concerns.
- All “Get” procedures should be idempotent, but GetForService is not. An idempotent procedure is one whose outputs will always be the same so long as no external force changes the inputs. The GetForService procedure updates rows in the CostLog table and if you called GetForService twice, the update statement would actually preclude results from showing up in the second call.
- The Get procedure is never used. There is obsolete code we can eliminate.
- The Insert procedure gets called from the CreateWork procedure, but as a weird edge case when an external process is making the call and needs to re-queue work. It would be nice to simplify the design and eliminate this procedure, or at least make it an independent procedure call.
- We have a “garbage collection” procedure which doesn’t really garbage collect. It cancels specific queued records after a certain number of days. This kind of design feels a bit hacky; instead of having an external service tell us when to stop attempting to process, the service code which has intelligence on the matter should know when to stop processing.
- There is a column called “GUID” on the table. This is a very unhelpful name, as it describes the data type rather than the purpose of the attribute.
- There is another column called JDate, which is an integer value and represents a Julian date (of sorts). I’m going to get into this in more detail in a later part, but I want to replace this with a Date type called ReportDate. That makes it the data to understand and replaces a four-byte integer which acts essentially as a surrogate key for dates with a three-byte, easily-readable date type.
- The SendAlerts attribute goes entirely unused, so it’s gone.
So let’s take a moment to talk about how I got to these conclusions. The misnamed table and attribute names are basic code smells. If you can’t define the purpose of an entity or attribute based on its name, there is likely a problem with that entity or attribute. In the case of my entity, the problem is that we’re trying to do too much: we have a queue table and a log table jammed into one. A simpler design would split these two tables out and provide interfaces (stored procedures) to interact with these tables. Remember that “simpler” doesn’t always mean “fewer.” In this case, a simpler design will end up with more database objects, but each database object will be clearer to understand. As for the misnamed attribute, I learned after discussing with the software engineers that the GUID attribute actually gets used as a unique filename for the cost report we create. For this reason, I decided to rename GUID to FileName.
The SendAlerts and JDate attributes are also problematic on CostLog. As I did research on the procedures calling this table, I noticed that none of them ever mention SendAlerts, meaning that we are setting a value on a table but never using it. Attributes which serve no purpose should generally be eliminated, and this is no exception. As for JDate, this is a case in which prolonged exposure breeds contempt. This attribute exists in a number of tables I maintain and every time I need to know an actual date, I need to convert that value from an integer to a real date. This is a minor but persistent pain point, and incremental re-design helps solve exactly those types of pains.
With these thoughts in mind, here is my new design:
Before I start, a quick color note: yellow Get procedures read from the CostQueue table, blue Get procedures read from the CostLog table, and green Get procedures read from both tables.
This new model doesn’t look radically different at first glance, but there are several improvements that I was able to sneak in. First of all, I split out the tables into CostQueue and CostLog. The queue table tells us our current step in the process; the log table, meanwhile, tells us the status of this combination of profile, provider, and report date. Speaking of that combination, I’ve marked those attributes in bold to note that they will form a unique constraint on both tables. That may seem a little weird for a log table, but we aren’t concerned with storing data on how many times a process was run or the individual steps along the way; we only care about the final status of a particular cost run.
I made the other changes that I mentioned above in the design problems: JDate and GUID are now ReportDate and FileName, respectively. This makes records and result sets easier to understand and also cut down a little bit on processing to translate JDates back into real dates. I also eliminated the SendAlerts attribute.
Now let’s look at the procedure changes. The first change is that all Get procedures are idempotent. Instead of having two procedures (including a Get procedure) to create and update log records, I now have four procedures to handle the queue. GetNewRecords reads from CostLog and CostQueue to figure out which records should go on the queue, but it does not insert any rows. Instead, the CreateWork procedure inserts rows onto CostQueue. Then, the scheduler code will call the GetWork procedure to retrieve work from the queue. Finally, there is an Insert procedure in the event that some process needs to re-queue work automatically—which does happen from time to time, and which was handled as an edge case in the old CreateWork procedure.
From there, we move on to the C# service code. That service code will get a set of work items and start working on them. At each step in the process, the service code will call UpdateStep so the next bit of service code knows what is happening. Each step in the service code also has the opportunity to merge results into CostLog, updating the Status and RecordCount attributes as necessary. Once the final step in the process is complete, the service code will call Delete to delete the cost row from the queue and we’re done. From there, website and monitor code can query the CostLog table like before, thereby limiting the scope of changes somewhat.
With these changes in place, cost processing becomes a lot simpler. We can troubleshoot queue problems, easily re-queue data as necessary, and test the process much more easily.
Let’s wrap this section up with a higher-level discussion of how we get from point A to point B in terms of research. For a project of any magnitude—even one which has one table and nine stored procedures—it helps to be organized. If I had to do this over again, here’s what I would do to get organized:
- Get the list of affected tables, procedures, and processes. At this point, I would just sketch notes in a text editor.
- Use draw.io or some other tool to generate a high-level diagram like the ones I have above. Drawing these diagrams came late in the process and helped show me some flaws in my design. Start with the current design.
- Create a design document for the existing design. In this case, my design document started with a picture of the design process generated in step 2, followed by a breakdown of each table attribute. From there, I moved on to each affected stored procedure. For each procedure, I listed the input parameters, high-level actions, outputs (either descriptively or as an ordered list of attributes, depending upon the scenario), and processes which call that procedure.
- Start applying incremental design changes to the paper. I knew that I wanted to take care of some of these problems, and after each problem, I would go back to the document and think some more about it. When I came into this design, I knew I wanted to change the JDate, GUID, and SendAlerts attributes on the table, as well as the GetForService stored procedure. As I started working through those, I found a few other problems, so I wrapped those changes into my design. Once I was done with that, I went back over the design again and found a couple more issues.
- Collaborate. Even if you are the domain expert, it helps to get opinions from others related to the project. In this case, most of my discussion happened with software engineers whose code consumed the information and who would assist in troubleshooting incoming tickets related to this process. They helped me realize that “CostLog” was really two separate tables, which simplified operations significantly.
- Iterate, itereate, iterate. Even when you think you’re done, you probably aren’t. Keep iterating over that design. One of the late changes I made to the design was to eliminate the “GC” procedure altogether, instead moving that requirement to the service code. My design would have been fine either way, but I think this improves the design, making one process responsible for dequeueing records and determining when something is done, rather than having several overlapping processes.
- Simplify the code. Note that in this design, I didn’t spent much time talking about actual code. That doesn’t mean that you should ignore the current code as you change the structure. Several of the procedures had parts in which I could refactor logic and make processing easier for everybody to understand. Once you have a good idea of what the top-level design will look like, work out ways to simplify individual procedures within that design. You can sculpt your higher-level design with these ideas in mind, but I recommend not getting too deep into individual procedures’ problems until you are at a point in which you know that this code will be relevant. After all, why waste a few hours fixing up code that you’re going to deprecate or whose underlying table structure will be completely different?
This wraps up my section on simplifying code. Next up, I’m going to talk about obsolete code.