This is part one in a series on near-zero downtime deployments.
For the past 5+ years, I’ve worked at a company with a zero downtime policy for deployments. The last time we had a real downtime for code deployment was January of 2014. Mind you, we’ve had downtimes since then: servers go down for various reasons, sometimes we schedule infrastructure upgrades, that kind of thing. But the idea of taking our core application down to deploy new code? Nope. Not gonna do it.
What’s Wrong with Downtime?
If you work for a company where all employees and customers use corporate services from 8 AM to 6 PM Monday through Friday, you have plenty of room for downtime. In that case, this series probably isn’t going to be for you…at least until you switch jobs and go to a company where this doesn’t hold.
By contrast, we have systems running 24 hours a day because there are customers all across the world expecting our stuff to work. This means no matter what window we pick to take the site down for a couple of hours, it’s going to impact some of our customers’ bottom lines, and that affects our bottom line.
Speaking of windows, we never get nice deployment windows as developers. It’s never “Hey, we’re going to push at 1 PM after you’re back from lunch and everything should be done by 4 PM.” Nope, it’s “You get to come in at 7 AM on Sunday” or “You get to come in at 6 AM on a Tuesday” or “You don’t have to be in the office but we’re going to deploy at 3:30 AM Saturday, so try not to stay out too late.”
Our windows tend to be at such odd hours because we’re looking for ways to minimize disruption to the business. But what if I told you there’s a better way?
(Near) Zero-Downtime Releases
What if, instead of taking the application down, we keep it up but in a slightly degraded state? Deployments would still run, but if they didn’t block our end users, then as long as we avoid deploying during the busiest moments of the day, we could push code with almost no visible impact on the end user.
One technique for doing this is called blue-green deployment. The gist of it is that you have some excess capacity (or, if you’re on a cloud provider, you temporarily purchase excess capacity) and upgrade some of your servers to the new code first. Then, if everything passes validation, you upgrade the other servers. Supposing everything works nicely, either you’re done, or you spin down the excess capacity and now you’re done.
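To make the flow concrete, here is a minimal Python sketch of those steps. Everything in it is an assumption for illustration: the `Server` class, the `blue_green_deploy` function, the `validate` callback, and the server names are hypothetical, and a real rollout would talk to a load balancer or cloud API rather than build lists in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Server:
    name: str
    version: str


def blue_green_deploy(
    pool: list[Server],
    new_version: str,
    validate: Callable[[Server], bool],
    extra: int = 2,
) -> list[Server]:
    """Return the pool that serves traffic once the deployment finishes."""
    # 1. Bring up excess capacity that already runs the new code.
    first_wave = [Server(f"temp-{i}", new_version) for i in range(extra)]

    # 2. Validate the new servers before touching the existing ones.
    if not all(validate(server) for server in first_wave):
        # Rollback is cheap here: discard the new servers, keep the old pool.
        return pool

    # 3. Upgrade the remaining servers, then spin down the excess capacity,
    #    which in this sketch just means not returning the temporary servers.
    return [Server(server.name, new_version) for server in pool]
```

Calling `blue_green_deploy(pool, "v2", validate=my_check)` with a check that fails hands back the original pool untouched, which is the easy rollback story that application code enjoys and databases do not.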
This technique works really well for stateless code, like most application code. But where it runs into trouble is with stateful code. You know, the kind of stuff that we do all day as database developers. There are a few problems we run into with stateful code:
- Old code—that is, code without your changes—still needs to work during the transition period. This makes removing columns, stored procedures, and other database objects tougher.
- Data logic changes are harder, especially if your data layer comes from ORMs or direct SQL queries rather than through stored procedures. For example, suppose your business logic has you always insert rows into Table A and Table B. Now suppose the requirement changes: we insert into Table A and then decide between inserting into Table B or Table C. The old logic will still try to force everything into Table B, whereas new logic will understand the split. Unless you plan ahead, you will end up with weird combinations of rows in unexpected places and might have to perform a database cleanup later.
- If your blue-green deployment fails, you have the option to roll back your changes pretty easily: kill those servers with the new code and stick with the old code. If your database deployment code fails, you can end up with invalid or even missing data, leading to manual cleanup operations.
- Speaking of rollback, how do you roll back deleting a column? That data’s gone unless you restore from backups. It’s not as simple as reverting a Git commit and re-deploying. Even short of dropping columns or tables, one-way data transformations exist. For example, suppose you inherit some nasty data with weird markup in it. You schedule a set of find-and-replace operations but only discover after the fact that it replaced legitimate data as well. Undoing find and replace is easy in a text file, but not in a database.
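The Table A / Table B / Table C scenario from the list above is easy to reproduce. The sketch below uses Python’s built-in `sqlite3` module as a stand-in for a real database server. The table layouts, the `kind` column, and the rule that "special" rows belong in Table C are all assumptions I made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TableA (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE TableB (a_id INTEGER NOT NULL);
    CREATE TABLE TableC (a_id INTEGER NOT NULL);
    """
)


def old_insert(kind):
    # Old code: every row goes into TableA and TableB.
    cur = conn.execute("INSERT INTO TableA (kind) VALUES (?)", (kind,))
    conn.execute("INSERT INTO TableB (a_id) VALUES (?)", (cur.lastrowid,))


def new_insert(kind):
    # New code: TableA always, then TableB or TableC (hypothetical rule).
    cur = conn.execute("INSERT INTO TableA (kind) VALUES (?)", (kind,))
    target = "TableC" if kind == "special" else "TableB"
    conn.execute(f"INSERT INTO {target} (a_id) VALUES (?)", (cur.lastrowid,))


# During the transition, both versions of the code are serving requests.
old_insert("regular")
old_insert("special")  # lands in TableB, where the new code never looks for it
new_insert("special")
new_insert("regular")

# Cleanup query: 'special' rows that the old code forced into TableB.
misplaced = conn.execute(
    """
    SELECT a.id
    FROM TableA a
    JOIN TableB b ON b.a_id = a.id
    WHERE a.kind = 'special'
    ORDER BY a.id
    """
).fetchall()
print(misplaced)  # [(2,)]
```

Only one row is misplaced here, but on a busy system every request handled by an old server during the transition adds to that cleanup list.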
Why Near-Zero Instead of Zero?
Throughout this series, I’m going to use the phrase “near zero downtime” rather than “zero downtime.” The latter sounds nice, but in practice is almost impossible to implement past a certain level. For example, locking everybody out from the data is downtime. If you don’t believe me, drop and re-create a clustered columnstore index on that busy fact table during office hours and see how well that goes for you.
Even online operations still have minor amounts of downtime. One technique we’re going to use in the series is renaming tables using sp_rename. But this requires a lock, meaning that for the few milliseconds that we’re running the operation, people are locked out and we have system unavailability due to developers or administrators; that is, downtime. In most systems, we’re willing to experience a short amount of downtime occasionally, but when you start measuring downtime in terms of millions of dollars per minute, “a short amount” becomes rather short indeed.
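As a rough illustration of why a rename-based swap keeps the blocking window short, here is a sketch that again uses Python’s `sqlite3` module as a stand-in. SQLite has no sp_rename, so `ALTER TABLE ... RENAME TO` plays that role here, and SQLite’s locking behaves differently from SQL Server’s. The `Orders` tables and the `currency` column are hypothetical. Treat this as the shape of the idea, not the implementation the series will use.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN and COMMIT statements below are issued explicitly by this script.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO Orders (amount) VALUES (10), (20)")

# Slow part, done while Orders is still serving queries:
# build the new shape of the table and copy the data across.
conn.execute(
    """
    CREATE TABLE Orders_new (
        id INTEGER PRIMARY KEY,
        amount INTEGER,
        currency TEXT NOT NULL DEFAULT 'USD'
    )
    """
)
conn.execute("INSERT INTO Orders_new (id, amount) SELECT id, amount FROM Orders")

# Fast part: the only step that needs exclusive access is the pair of renames.
conn.execute("BEGIN")
conn.execute("ALTER TABLE Orders RENAME TO Orders_old")
conn.execute("ALTER TABLE Orders_new RENAME TO Orders")
conn.execute("COMMIT")

rows = conn.execute("SELECT id, amount, currency FROM Orders ORDER BY id").fetchall()
print(rows)  # [(1, 10, 'USD'), (2, 20, 'USD')]
```

The expensive copy happens outside the transaction, so the window in which readers and writers are blocked shrinks to the two renames. The sketch also ignores rows written to `Orders` between the copy and the swap, which a real deployment has to account for.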
For this series, I’m not going to go quite that far. Instead, we’ll focus on techniques which limit blocking and reduce the amount of database outage time to small levels, generally microseconds or milliseconds but sometimes even a few seconds.
Bonus Material: The Swart Series
Michael J. Swart put together a great series on 100% online deployments last year. My tactics and techniques will be similar, but we will have different implementation specifics. I consider this a good thing: it means that it’s not just my way or the highway; there are other ways to do the job too.
If you haven’t already read his series, I highly recommend it.