Last time around, I looked at the tragedy of the commons as it relates to server resources. Today, I want to introduce two patterns for optimal resource utilization given different sets of actors with independent goals.
The Microservice Solution
The first use pattern is the easier one: remove the contention by having independent teams own their own hardware. Each team gets its own servers, its own databases, and its own disk and network capacity. Teams develop on their own servers and communicate via prescribed APIs. Teams have budgets to purchase hardware and a few options: they can rent servers from a cloud provider like AWS or Azure, rent hardware from an internal operations team, or purchase and operate their own hardware. Most likely, organizations won’t like option #3, so teams are probably looking at some mixture of options 1 and 2.
If a team depends upon another team’s work—for example, suppose that we have a core service like inventory lookup, and other teams call the inventory lookup service—then the team owning the service has incentives to prevent system over-subscription. That team will be much more likely to throttle API calls, improve call efficiency by optimizing queries or introducing better caching, or find some other way to make best use of their limited resources.
This situation removes the commons by isolating teams onto their own resources and making the teams responsible for these resources.
There are a couple of downsides to this arrangement. First, doing your work through APIs can be a major latency and resource drag. For a not-so-extreme case, think about two tables within a common service boundary. The ProductInventory table tells us the quantity of each product we sell. The ProductAdvertisement table tells us which products we are currently advertising. Suppose we want to do a simple join of Product to ProductAdvertisement to ProductInventory and get the products where we have fewer than 5 in stock and are paying at least $1 per click on our digital marketing campaign.
In a standard SQL query running against a modern relational database, the query optimizer uses statistics to figure out which table to start from: ProductAdvertisement or ProductInventory. If we have relatively few products with fewer than 5 in stock, we drive from ProductInventory. If we have relatively few advertisements with a $1 CPC, we drive from ProductAdvertisement. If neither filter is selective on its own but the inner join of the two is tiny, the optimizer might start with ProductInventory and join to ProductAdvertisement first, taking the relatively small number of results and joining to Product to flesh out those product details. Regardless of which is the case, the important thing here is that the database engine has the opportunity to optimize this query because it has direct access to all three tables.
If we put up an API guarding ProductInventory, we can now run into an issue where our advertising team needs to drive from the ProductAdvertisement table. Suppose that there are tens of thousands of products with at least $1 CPC but only a few with fewer than 5 in stock. We don’t know which products have fewer than 5 in stock, and unless the inventory API has a “get all products with fewer than X in stock” method, we’re going to need to get the full list of products with $1+ CPC ads and work our way through the inventory API until we find those products with fewer than 5 in stock. So instead of a single, reasonably efficient database query, we could end up with tens of thousands of database lookups. Each lookup on its own is efficient, but the sum total is rather inefficient in comparison, both in terms of the time spent pushing data for tens of thousands of products across the network and then filtering in code (or another database) and in terms of total resource usage.
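To make the shape of the problem concrete, here is a small Python sketch of the access pattern the API boundary forces on the advertising team. The table names come from the example above; the in-memory dictionaries and the `get_quantity` function are stand-ins for the real table and the real inventory API.

```python
# Sketch of the N+1 access pattern an API boundary forces on the advertising
# team. The dictionaries stand in for the ProductAdvertisement table and the
# inventory service; in production, each get_quantity call is a network trip.

ADS = {"widget": 1.50, "gadget": 0.40, "gizmo": 2.25}   # product -> CPC
INVENTORY = {"widget": 3, "gadget": 2, "gizmo": 900}     # product -> on hand

def get_quantity(product_id):
    """Stand-in for one round trip to the inventory API."""
    return INVENTORY[product_id]

def low_stock_high_cpc(max_stock=5, min_cpc=1.00):
    # One local query gets every advertisement with CPC >= min_cpc...
    candidates = [p for p, cpc in ADS.items() if cpc >= min_cpc]
    # ...then one API call per candidate. With tens of thousands of ads,
    # that is tens of thousands of lookups to find the handful of rows a
    # single three-table join could have returned directly.
    return [p for p in candidates if get_quantity(p) < max_stock]

print(low_stock_high_cpc())  # ['widget']
```

The per-call cost is small, but the loop multiplies it by the size of the candidate list, which is exactly the work the query optimizer would have avoided.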
The other problem, aside from latency and resource drag, is that the throttling team does not always have the best signals available to determine what to throttle and why. If our inventory team sees the advertising team slamming the inventory API, the inventory team may institute code which starts returning 429 Too Many Requests response codes, forcing the advertising team to slow down their calls. This fixes the over-subscription problem, but it might be a chainsaw solution to a scalpel problem. In other words, suppose the advertising team has two different operations: a low-value operation generating a large number of requests and a high-value operation generating relatively few. The inventory team doesn’t necessarily know which operation is which, so without coordination between the teams, the inventory team might accidentally block high-value operations while letting low-value operations through. Or the teams may cooperate and block low-value operations, but do so much blocking that they starve the low-value operation instead of simply throttling it back. Neither of those answers is great.
The Price Signal
Instead of having different teams own their own hardware and try to live in silos, my preferred solution is to institute price signals. What follows is an ideal (some might say overkill) implementation.
In this setup, Operations owns the servers. Like the farmer in my previous example, the ops team wants its property to remain as useful as possible: Operations wants to make its servers enticing while preventing over-subscription. To do this, Operations prices resource utilization. On a server with a very low load, teams can use the server for pennies per hour; when the server is at 100% CPU utilization, the price might spike to $20 or $30 per hour. There are three important factors here:
- All teams have real-time (or very close to real-time) knowledge of the spot price of each server.
- Operations may set the price as they see fit. That might lead to out-of-equilibrium prices, but there’s a factor that counteracts that quite well:
- The prices are actual dollar prices.
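As a sketch of how Operations might compute such a price, here is one possible curve. The exponent and both rates are assumptions for illustration; the post doesn’t prescribe any particular formula, only that price stays low at low load and spikes near saturation.

```python
# One possible pricing curve (an assumption, not a prescription): cheap while
# the server is idle, rising steeply toward a cap as CPU load nears 100%.

BASE_RATE = 0.05   # dollars per CPU hour at low load
MAX_RATE = 30.00   # dollars per CPU hour at saturation

def spot_price(cpu_utilization):
    """Return the hourly spot price for a server at the given CPU load (0-1)."""
    # A fourth-power ramp keeps the price near BASE_RATE until the server
    # is genuinely busy, then lets it spike as contention sets in.
    return BASE_RATE + (MAX_RATE - BASE_RATE) * cpu_utilization ** 4

print(spot_price(0.10))  # pennies per hour at light load
print(spot_price(1.00))  # 30.0 at full utilization
```

Operations is free to tune the curve; the counterweight, as the next paragraph explains, is that teams can walk away if the prices stop making sense.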
Operations is no longer a cost center within the company; it now has the opportunity to charge other teams for resource utilization. If Operations does a good job keeping servers running efficiently and prices their products well, they end up earning the money to expand and improve; if they struggle, they shape up or ship out. That’s because teams have alternatives.
Suppose the Operations team can’t manage a server to save their lives. Product teams are free to go back to options #1 or #3: they can use cloud offerings and set up their own services there, or they can purchase their own hardware and build internal operations teams within the product team. These real forms of competition force the Operations team to perform at least well enough to keep their customers. If I’m going to pay more for servers from Operations than I am from AWS, I had better get something in return. Sure, lock-in value is real and will play a role in keeping me on the Operations servers, but ops needs to provide additional value: lower-latency connections, the ability to perform queries without going through APIs (one of the big downsides to the microservice approach above), people on staff when things go bump in the night, etc.
These viable alternatives will keep the prices that Operations charge fairly reasonable; if they’re charging $3 per gigabyte of storage per month, I’ll laugh at them and store in S3 or Azure Blob Storage for a hundredth of the price. If they offer 5 cents per gigabyte per month on local flash storage arrays, I’ll play ball.
Taking Pricing Into Account
F.A. Hayek explained the importance of price as a signal in his paper The Use of Knowledge in Society. Price is the mechanism people use to share disparate, situational, and sometimes contradictory information that cannot otherwise be aggregated or collected in a reasonable time or fashion. Tying Hayek’s work to our situation, we want to convey resource utilization to disparate teams at various points in time. We can return a bunch of numbers and hope for the best, but if I tell the inventory team that they’re using 48% of a database server’s CPU resources and that the current CPU utilization is 89%, what does that mean? Does it mean they can increase their load? That they should decrease their load? That things are doing just fine?
By contrast, suppose we tell the inventory team that the spot price of this server is currently $5 per CPU hour, when the normal price is 20 cents per CPU hour. That is a clear signal that the server is under heavy load, and maybe the team should cut back on those giant warehouse-style queries burning up 16 cores.
When teams know that the price has jumped like this, they now have a number of options available:
- Prioritize resource utilization. Are there any low-priority operations going on right now? Maybe it’d be wise to reschedule those for later, when the server is cheap again.
- Refocus efforts elsewhere. If there’s a server which regularly gets too expensive, maybe it’d be wise to relocate to someplace else, where the price is cheaper. This can spread the load among servers and make resource utilization more efficient. As a reminder, that server doesn’t need to be on-prem where Operations owns it; it’s a pretty big world out there with plenty of groups willing to rent some space.
- Tune expensive operations. When dollars and cents come into play, it’s easy to go to a product owner with an ROI. If my advertising team just got hit with a $5000 bill for a month’s worth of processing on this server and I know I can cut resource usage down to a tenth, I’m saving $4500 per month. If my next-best alternative does not bring my team at least that much (plus the incremental cost of resource utilization) in revenue, it’s a lot harder for product owners to keep engineering teams from doing tech debt cleanup and resource optimization work.
- Burn the money. Sometimes, a team just has to take it on the chin; all of this work is important and the team needs to get that work done.
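A team-side scheduler acting on the first of those options might look like the following sketch. The job names, the “2x normal” threshold, and the `plan` function are all hypothetical policy choices, not anything the pricing scheme itself dictates.

```python
# Sketch of a team-side scheduler reacting to the spot price. Low-priority
# work runs only while the server is near its normal price; high-priority
# work runs regardless and simply eats the cost.

NORMAL_RATE = 0.20  # dollars per CPU hour under typical load

def plan(jobs, current_price):
    """Split (name, priority) jobs into ones to run now and ones to defer."""
    run, defer = [], []
    for name, priority in jobs:
        if priority == "low" and current_price > 2 * NORMAL_RATE:
            defer.append(name)   # reschedule for when the server is cheap
        else:
            run.append(name)     # important enough to burn the money
    return run, defer

jobs = [("nightly-report", "low"), ("order-sync", "high")]
print(plan(jobs, current_price=5.00))  # defers the report, runs the sync
```

The point is that the decision rule is local to the team: nobody upstream needs to know which of the team’s operations is low-value.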
Getting There From Here
Okay, so now that I’ve spent some time going over what the end game looks like, how do we possibly get there from here? I’ll assume that “here” is like most companies I’ve worked at: there’s a fairly limited understanding of what’s causing server heartache and a limited amount of responsibility that product teams take.
Here are the steps as I see them:
- Implement resource tracking. Start with resource tracking as a whole if you don’t already have it. Cover per-minute (or per some other time period) measures of CPU load, memory utilization, disk queue length, network bandwidth, and disk utilization. Once those are in place, start getting resource usage by team. In SQL Server, that might mean tagging by application name.
- Figure out pricing. Without solid knowledge of exactly where to begin, there are still two pieces of interesting information: what other suppliers are charging and break-even cost for staff + servers + whatnot. Unless you’ve got junk servers and terrible ops staff, you should be able to charge at least a little bit more than AWS/Azure/Google/whatever server prices. And if your ops team is any good, you can charge a good bit more because you’re doing things like waking up at night when the server falls over.
- Figure out budgeting. This is something that has to come from the top, and it has to make sense. Your higher-value teams probably will get bigger budgets. You may not know at the start what the budgets “should” be for teams, but at least you can start with current resource utilization shares.
- Figure out the market. You’ll need an API to show current server price. Teams can call the API and figure out the current rate. Ideally, you’re also tracking per-team utilization and pricing like Azure or AWS does to limit surprise.
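The metering side of steps 1 and 4 can be sketched as follows. The `Meter` class and its method names are illustrative; a real implementation would pull usage numbers from monitoring (in SQL Server, perhaps keyed on application name, per step 1) and serve the current price over an API rather than a function call.

```python
# Minimal sketch of per-team metering against a spot price feed: sample each
# team's usage over an interval, multiply by the price during that interval,
# and accumulate the charge, much as AWS or Azure bills by usage.

from collections import defaultdict

class Meter:
    def __init__(self):
        self.charges = defaultdict(float)  # team -> dollars accrued

    def record(self, team, cpu_hours, price_per_cpu_hour):
        """Charge a team for an interval of measured CPU usage."""
        self.charges[team] += cpu_hours * price_per_cpu_hour

meter = Meter()
meter.record("inventory", cpu_hours=2.0, price_per_cpu_hour=0.20)
meter.record("advertising", cpu_hours=0.5, price_per_cpu_hour=5.00)
print(dict(meter.charges))
```

The same ledger doubles as the budget tracker in step 3: compare each team’s accrued charges against its allocation.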
Once this is in place, it gives teams a way of throttling their own utilization. There’s still a chance for over-subscription, though, so let’s talk about one more strategy: auctions.
Thus far, we’ve talked about this option as a specific decision that teams make. But when it comes to automated processes, companies like Google have proven that auctions work best. In the Google advertisement example, there is a limited resource—the number of advertisement slots on a particular search’s results—and different people compete for those slots. They compete by setting a maximum cost per click, and Google takes that (plus some other factors) and builds a ranking, giving the highest score the best slot, followed by the next-highest score, etc. until all the slots are filled.
So let’s apply that to our circumstance here. Instead of simply having teams work through their resource capacity issues—a long-term solution but one which requires human intervention—we could auction off resource utilization. Suppose the current spot price for a server is 5 cents per CPU hour because there’s little load. Each team has an auction price for each process—maybe we’re willing to pay $10 per CPU hour for the most critical requests, followed by $1 per hour for our mid-importance requests, followed by 50 cents per hour for our least-important requests. Other teams have their requests, too, and somebody comes in and floods the server with requests. As resource utilization spikes, the cost of the server jumps up to 75 cents per CPU hour, and our 50-cent job stops automatically. It jumps again to $4 per CPU hour and our $1 job shuts off.
That other team is running their stuff for a really long time, long enough that it’s important to run the mid-importance request. Our team’s internal API knows this and therefore automatically sets the bid rate up to $5 temporarily, setting it back down to $1 once we’ve done enough work to satisfy our team’s processing requirements.
Implementing this strategy requires a bit more sophistication, as well as an understanding on the part of the product teams of what happens when the spot price goes above the auction price—that jobs can stop, and it’s up to product teams to spin them down nicely.
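The auction behavior above can be sketched as follows, using the bid prices from the example. In a real system the jobs themselves would poll the spot price and wind down gracefully when their bids no longer clear it.

```python
# Sketch of bid-based throttling: each class of work carries a maximum bid,
# and a job keeps running only while the spot price is at or below its bid.
# The bids mirror the example in the text.

BIDS = {"critical": 10.00, "mid": 1.00, "low": 0.50}

def running_jobs(spot_price, bids=BIDS):
    """Return the jobs whose bids still clear the current spot price."""
    return sorted(job for job, bid in bids.items() if bid >= spot_price)

print(running_jobs(0.05))  # everything runs while the server is cheap
print(running_jobs(0.75))  # the 50-cent job drops out
print(running_jobs(4.00))  # only the critical job survives
```

Raising a bid temporarily, as in the mid-importance example above, is just a matter of updating the job’s entry until enough work has been done.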
Another Spin: Funbucks
Okay, so most companies don’t like the idea of giving product teams cash and having them transfer real money to an Operations team. So instead of doing this, you can still have a market system. It isn’t quite as powerful because there are fewer options available—you might not be able to threaten abandoning the current set of servers for AWS—but it can still work. Each team still has a budget, but the budget is in an internal scrip. If you run out of that internal scrip, it’s up to higher-ups to step in. This makes it a weaker solution, but still workable.
So, You’re Not Serious…Right?
Of course I am, doubting title. I’m so serious that I’ll even point out cases where what I’m talking about is already in place!
First of all, AWS offers spot pricing on EC2 instances. These prices tend to be less than the on-demand price and can be a great deal for companies which can afford to run processes at off hours. You can write code to check the spot price and, if the spot price is low enough, snag an instance and do some work.
As a great example of this, Databricks offers their Community Edition free of charge and uses AWS spot instances to host these. That keeps prices down for Databricks because they have a hard cap on how high they’re willing to go—I’ve had cases where I’ve tried to spin up a CE cluster and gotten a failure indicating that the spot price was too high and that I should try again later.
The graybeards in the audience will also appreciate this next example: mainframe time slicing. Time slicing was a common strategy for pricing computer utilization and is very similar to what I’m describing.
We’ve spent the past couple of days looking at how development teams can end up in a tragedy of the commons, and at different techniques we can use to extricate ourselves from it. The main purpose of these posts is to show that there are several options available, including creating markets internally. We haven’t talked about agorics yet, but let’s save that for another day.