I tend to see resource over-subscription problems frequently at work. We have a set of product teams, and each team has a manager, a product owner, and a set of employees. These teams share computing resources, though: they use the same servers, access the same databases, and use the same networks. This leads to a tragedy of the commons scenario.
Tragedy of the Commons
The tragedy of the commons is a classic concept in economics: the basic idea is that there exists some common area accessible to all and owned (directly) by none. Each member of a group is free to make use of this common area. Let’s get a little more specific and talk about a common grazing field that a group of shepherds share. Each shepherd has a total desired utilization of the field: shepherd A has x sheep which will consume f(x) of the field’s grass, where 0 <= f(x) <= 1. Shepherd B has y sheep consuming f(y) of the grass, and shepherd C has z sheep consuming f(z) of the grass.
If f(x) + f(y) + f(z) < 1, then the three shepherds can share the commons. But let’s say that f(x) = 0.5, f(y) = 0.4, and f(z) = 0.3. That adds up to 1.2, but we only have enough grass for 1.0. This means that at least one of the three shepherds will end up losing out on this arrangement—if shepherds A and B get there early, their sheep are going to consume 90% of the available vegetation, leaving shepherd C with 10% instead of his needed 30%.
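To make the arithmetic concrete, here is a minimal Python sketch of that first-come, first-served outcome. The function name `graze_first_come` is my own invention for illustration, and the only numbers in it are the ones from the example above.

```python
def graze_first_come(demands, capacity=1.0):
    """Hand out grass in arrival order until the field is used up."""
    remaining = capacity
    allocation = {}
    for shepherd, wanted in demands:
        granted = min(wanted, remaining)  # nobody can eat grass that is already gone
        allocation[shepherd] = granted
        remaining -= granted
    return allocation


# Arrival order matters: A and B get there before C.
demands = [("A", 0.5), ("B", 0.4), ("C", 0.3)]
allocation = graze_first_come(demands)

for shepherd, wanted in demands:
    print(f"{shepherd}: wanted {wanted:.0%}, got {allocation[shepherd]:.0%}")
```

Reordering the list changes who loses out, but as long as total demand is above 1.0, somebody does.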
The tragedy of the commons goes one step further: because there is overgrazing on this land, eventually the vegetation dies out and the land goes fallow for some time, leaving the shepherds without that place to graze.
There are different spins on the tragedy of the commons, but this is the scenario I want to walk through today.
Solutions to the Tragedy
There are three common solutions to a tragedy of the commons scenario: private ownership, a governing body, or commons rules.
Private ownership is the easiest solution to understand. Instead of the three shepherds grazing freely on the commons, suppose that farmer D purchases the land with the intent of leasing it out for grazing. Farmer D will then charge the shepherds some amount to graze on the land, with the goal being to earn as much as he can from use of the land. Right now, the shepherds are oversubscribed to the land at a cost of 0. Once the farmer starts charging the shepherds, we see movement.
As a simple example (without loss of generality), let’s say that the farmer knows that he can’t allow more than 80% of the land to be grazed; if he goes above that mark, we get into overgrazing territory and run the risk of the land going fallow for some time. Thus, the farmer wants to set prices such that no more than 80% of the land gets grazed and the farmer maximizes his profits.
We won’t set up the equilibrium equation here, but if you really want to go down that route, you’re welcome to. Let’s say that the equilibrium price is $10 per acre, and at $10 per acre, shepherd B realizes that he can go graze somewhere else for less. That leaves shepherds A and C, whose total use adds up to 80%. As an aside, we could just as easily have imagined a scenario where all three shepherds still use the plot but at least one uses less than he would at a price of $0, so that the sum is still somewhere around 80% utilization.
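This is not the equilibrium equation, but a toy Python sketch shows the mechanics of a price pushing usage down to the 80% mark. Everything in it is an assumption for illustration: the `outside_option` values (what each shepherd would pay to graze elsewhere) are made up, and the farmer simply searches for the lowest whole-dollar price that keeps usage at or under the cap rather than truly maximizing profit.

```python
# Hypothetical numbers: shares come from the example, outside options are invented.
shepherds = [
    {"name": "A", "share": 0.5, "outside_option": 15.0},
    {"name": "B", "share": 0.4, "outside_option": 9.5},
    {"name": "C", "share": 0.3, "outside_option": 12.0},
]


def usage_at_price(price, shepherds):
    """Share of the field grazed by the shepherds who stay at this price."""
    return sum(s["share"] for s in shepherds if s["outside_option"] >= price)


def lowest_sustainable_price(shepherds, cap=0.8, max_price=100):
    """Lowest whole-dollar price at which total grazing fits under the cap."""
    for price in range(max_price + 1):
        if usage_at_price(price, shepherds) <= cap + 1e-9:
            return price
    return None  # no price in range keeps usage under the cap


price = lowest_sustainable_price(shepherds)
print(price, usage_at_price(price, shepherds))
```

With these invented outside options, shepherd B is the one who walks away, which matches the story above.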
In contrast to the private property solution, central planning involves an individual or committee laying down edicts on use patterns. In this case, we have a central planner telling each shepherd how much of the land he is allowed to use. We can work from the assumption that the central planner also knows that overgrazing happens at more than 80% utilization, so the central planner won’t allow the shepherds to graze beyond this. How specifically the central planner allocates the resource is beyond the scope of discussion here, as are important issues like public choice theory.
The third method of commons resource planning comes from Elinor Ostrom’s work (with this being a pretty good summary in a short blog post). The gist of it is that local groups tend to formulate rules for behavior on the commons. This is group decision making in an evolutionary context, in contrast to spontaneous order (the finding of equilibrium using market prices) or enforced order (the central planner making choices).
All three of these mechanisms work in different contexts at different levels. Historically, much of the literature has argued in favor of centralization closer to the point of decision and decentralization further out—for example, look at Coase’s The Nature of the Firm. In his work, Coase works to explain why firms exist, and his argument is that it is more efficient for firms to internalize some transaction costs. His work also tries to explain when firms tend to be hierarchical in nature (closer to central planning) versus market-oriented or commons-oriented—though a careful reading of Coase shows that even within a firm, spontaneous orders can emerge and can be efficient…but now I’m going down a completely different track.
Let’s Get Back To Computers
Okay, so we have three working theories of how to manage resources in a commons scenario. Let’s shift our commons scenario around just a little bit: now our common resource is a database server. We have a certain number of CPU cores, a certain amount of RAM, a certain amount of disk, a certain disk throughput, and a certain network bandwidth. Instead of shepherds grazing, we have product teams using the database.
Each product team makes use of the database server differently, where product team A makes x calls and uses f(x) resources, product team B makes y calls and uses f(y) resources, and product team C makes z calls and uses f(z) resources, again, where f(x), f(y), and f(z) each lie between 0 and 1. We hope that f(x) + f(y) + f(z) < 1; if that’s the case, we don’t have a resource allocation problem: nobody has saturated the database server and all is well. But realistically, most of us end up with resource contention, where every team pushes more and more product out and eventually team desires overwhelm server resources. In this case, the equivalent of overgrazing is resource saturation, which leads to performance degradation across the board: when CPU is at 100% or the disk queue keeps growing, everybody suffers, not just the marginal over-user.
As a quick note, “1” in this case is a composite measure of multiple resources: CPU, memory, disk capacity, disk throughput, and network throughput. More realistically, f(x) describes a vector, each element of which corresponds to a specific resource and each element lies between 0 and 1. For the sake of simplicity, I’m just going to refer to this as one number, but it’s certainly possible for a team to kill a server’s CPU while hardly touching disk (ask me how I know).
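Here is a small Python sketch of the vector version, mostly to show why the single number hides things. The per-team figures are hypothetical, and the resource names are just my labels for the five resources listed above.

```python
RESOURCES = ("cpu", "memory", "disk_capacity", "disk_throughput", "network")

# Hypothetical utilization per team, each value a fraction of the server's total.
team_usage = {
    "A": {"cpu": 0.70, "memory": 0.2, "disk_capacity": 0.1, "disk_throughput": 0.05, "network": 0.1},
    "B": {"cpu": 0.25, "memory": 0.3, "disk_capacity": 0.4, "disk_throughput": 0.30, "network": 0.2},
    "C": {"cpu": 0.20, "memory": 0.3, "disk_capacity": 0.2, "disk_throughput": 0.30, "network": 0.2},
}


def total_load(team_usage):
    """Sum each resource across teams, giving one load figure per resource."""
    return {r: sum(usage.get(r, 0.0) for usage in team_usage.values()) for r in RESOURCES}


def saturated(team_usage, limit=1.0):
    """Resources where combined demand exceeds what the server has."""
    return [r for r, load in total_load(team_usage).items() if load > limit]


print(total_load(team_usage))
print(saturated(team_usage))
```

In this made-up case only CPU is oversubscribed, and team A is most of the reason, even though it barely touches disk.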
Thinking About Implications
In my experience, two factors have made servers increasingly like a commons: the Operations team’s loss of power and the rise of Agile product teams.
Gone are the days when an Operations team had a serious chance of spiking a project due to limited resources. Instead, they’re expected to make do with what they have and find ways to say yes. Maybe it’s always been like this (any curmudgeonly sysadmins want to chime in?), but it definitely seems to be the case now.
On the other side, the rise of Agile product teams has turned server resources into a commons problem for two reasons. First, far too many people interpret “Agile” as “I don’t need to plan ahead.” So they don’t plan ahead, instead just assuming that resource planning is some combination of premature optimization and something to worry about when we know we have a problem.
Second, Agile teams tend to have their own management structure: teams run on their own with their own product owners, their own tech leads, and their own hierarchies and practices. Those tech leads report to managers in IT/Engineering, sure, but they tend to be fairly independent. There often isn’t a layer above the Agile team circumscribing their resource utilization, and the incentives are stacked for teams to behave with the servers like our example shepherds did with the common grazing field: they focus on their own needs, often to the exclusion of the other shepherds and often to everybody’s detriment.
Thinking About Solutions
Given the server resource problem above, how would each of the three solutions we’ve covered apply to this problem? Let’s look at them one by one, but in a slightly different order.
Central planning is the easiest answer to understand. Here, you have an architect or top-level manager or somebody with authority to determine who gets what share of the resources. When there’s resource contention, the central planner says what’s less important and what needs to run. The central planner often has the authority to tell product teams what to do, including having different teams spend sprint time to make their applications more efficient. This is the first solution people tend to think about, and for good reason: it’s what we’re used to, and it works reasonably well in many cases.
The problem with this solution (again, excluding public choice issues, incomplete information issues, and a litany of other, similar issues central planning has) is that typically, it’s management by crisis. Lower-level teams act until they end up causing an issue, at which point the central planner steps in to determine who has a better claim on the resources. Sometimes these people proactively monitor and allocate resources, but typically, they have other, more important things to do and if the server’s not on fire, well, something else is, so this can wait.
In the “rules of the commons” solution, the different product teams work together with Operations to come up with rules of the game. The different teams can figure out what to do when there is resource over-subscription, and there are built-in mechanisms to reduce over-utilization. For example, Operations may put different teams into different resource groups, throttling the biggest users during a high-utilization moment. Operations can also publish resource utilization by team, showing which teams are causing problems.
Let’s give two concrete examples of this from work. First, each team has a separate application name that they use in SQL Server connection strings. That way, our Database Operations group can use Resource Governor to throttle CPU, memory, and I/O utilization by team. This reduces some of the contention in some circumstances, but it doesn’t solve all of the issues: even assuming that there is no cheating on the part of teams, there are still parts of the application we don’t want to throttle, particularly those parts which drive the user interface. It’s okay to throttle back-end batch operations, but if we’re slowing down the user experience, we are in for trouble.
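Resource Governor itself is configured in T-SQL, so the following is only a Python sketch of the classification idea: map each connection’s application name to a workload group, and give each group a cap. The application names, group names, and CPU percentages are all hypothetical (our real ones are not shown here), and splitting a team’s UI and batch traffic into separate application names is my assumption about how you might keep the UI unthrottled.

```python
# Hypothetical application names as they might appear in connection strings.
APP_TO_GROUP = {
    "TeamA.Web": "interactive",     # user-facing traffic we do not want to slow down
    "TeamA.Batch": "team_a_batch",
    "TeamB.Batch": "team_b_batch",
}

# Hypothetical CPU caps in percent; None means the group is left unthrottled.
CPU_CAP_PERCENT = {
    "interactive": None,
    "team_a_batch": 30,
    "team_b_batch": 20,
    "default": 10,  # anything unrecognized lands here with a tight cap
}


def classify(application_name):
    """Pick a workload group from the application name, falling back to default."""
    return APP_TO_GROUP.get(application_name, "default")


def cpu_cap(application_name):
    """CPU cap that applies to a connection using this application name."""
    return CPU_CAP_PERCENT[classify(application_name)]


print(classify("TeamA.Batch"), cpu_cap("TeamA.Batch"))
print(classify("SomeUnknownApp"), cpu_cap("SomeUnknownApp"))
```

The fallback group is where the "no cheating" assumption bites: a team that connects under somebody else’s name, or under the UI name, dodges its own cap.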
Another example of how these rules can work is in Splunk logging. We have a limited amount of logging we can do to Splunk per day, based on our license. Every day, there is a report which shows how much of our Splunk logging we’ve used up in the prior day, broken down by team. That way, if we go over the limit, the teams with the biggest logs know who they are and can figure out what’s going on.
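The report is nothing fancy. A rough Python sketch of the idea looks like this, with the caveat that the input format, the team names, the volumes, and the license limit are all invented for illustration and nothing here calls Splunk itself.

```python
from collections import defaultdict

# Hypothetical (team, gigabytes logged) records for the prior day.
events = [("A", 12.0), ("B", 40.5), ("C", 8.0), ("B", 22.5), ("A", 5.0)]
DAILY_LIMIT_GB = 80.0  # hypothetical license limit


def daily_usage_by_team(events):
    """Add up logged volume per team."""
    totals = defaultdict(float)
    for team, gigabytes in events:
        totals[team] += gigabytes
    return dict(totals)


def usage_report(events, daily_limit_gb):
    """Total usage, whether we went over, and teams ranked biggest first."""
    totals = daily_usage_by_team(events)
    used = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {"used_gb": used, "over_limit": used > daily_limit_gb, "ranked": ranked}


report = usage_report(events, DAILY_LIMIT_GB)
print(report)
```

Ranking by volume is the whole point: when the limit is blown, the team at the top of the list knows to look first.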
The downside to commons rules is similar to that of the central planner: teams don’t think about resource utilization until it’s too late and we’re struggling under load. Sure, Operations can name and shame, but the incentives are stacked in such a way that nobody cares until there’s a crisis. If you’re well under your logging limit, why work on minimizing logging when there are so many other things on the backlog?
Next Time: The Price Signal
Despite the downsides I listed, the two systems above are not failures when it comes to IT/Engineering organizations. They are both viable options, and I’m not saying to ditch what you currently do. In the next post, I will cover the price option and spend an inordinate number of words writing about marketplaces in computing.