Marketplaces In Computing, Part 2

Last time around, I looked at the tragedy of the commons as it relates to server resources.  Today, I want to introduce two patterns for optimal resource utilization given different sets of actors with independent goals.

The Microservice Solution

Concept

The first use pattern is the easier one:  remove the contention by having independent teams own their own hardware.  Each team gets its own servers, its own databases, and its own disk and network capacity.  Teams develop on their own servers and communicate via prescribed APIs.  Teams have budgets to purchase hardware and have a few options:  they can rent servers from a cloud provider like AWS or Azure, they can rent hardware from an internal operations team, or they can purchase and operate their own hardware.  Most likely, organizations won’t like option #3, so teams are probably looking at some mixture of options 1 and 2.

Benefits

If a team depends upon another team’s work—for example, suppose that we have a core service like inventory lookup, and other teams call the inventory lookup service—then the team owning the service has incentives to prevent system over-subscription.  That team will be much more likely to throttle API calls, improve call efficiency by optimizing queries or introducing better caching, or find some other way to make best use of their limited resources.

This situation removes the commons by isolating teams onto their own resources and making the teams responsible for these resources.

Downsides

There are a couple of downsides to this arrangement.  First, doing your work through APIs can be a major latency and resource drag.  In a not-so-extreme case, think about a few tables which today live within a common service boundary.  The ProductInventory table tells us the quantity of each product we sell.  The ProductAdvertisement table tells us which products we are currently advertising.  Suppose we want to do a simple join of Product to ProductAdvertisement to ProductInventory and get the products where we have fewer than 5 in stock and are paying at least $1 per click on our digital marketing campaign.

In a standard SQL query running against a modern relational database, the query optimizer uses statistics to figure out which table to start from:  ProductAdvertisement or ProductInventory.  If we have relatively few products with fewer than 5 in stock, we drive from ProductInventory.  If we have relatively few advertisements with a $1+ CPC, we drive from ProductAdvertisement.  If neither of those is selective on its own but the inner join of the two is tiny, the optimizer might start with ProductInventory and join to ProductAdvertisement first, taking the relatively small number of results and joining to Product to flesh out those product details.  Regardless of which is the case, the important thing here is that the database engine has the opportunity to optimize this query because the database engine has direct access to all three tables.

If we put up an API guarding ProductInventory, we can now run into an issue where our advertising team needs to drive from the ProductAdvertisement table.  Suppose that there are tens of thousands of products with at least $1 CPC but only a few with fewer than 5 in stock.  We don’t know which products have fewer than 5 in stock, and unless the inventory API has a “get all products with fewer than X in stock” method, we’re going to need to get the full list of products with ads at $1+ CPC and work our way through the inventory API until we find those products with fewer than 5 in stock.  So instead of a single, reasonably efficient database query, we could end up with upwards of tens of thousands of database lookups.  Each lookup on its own is efficient, but the sum total is rather inefficient in comparison, both in terms of the time spent pushing data for tens of thousands of products across the network and then filtering in code (or another database) and in terms of total resource usage.
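
To make that concrete, here is a minimal Python sketch of the shape the advertising team’s code takes once the inventory data sits behind an API.  All of the names are hypothetical, and the per-product inventory lookup is simulated with a dictionary standing in for an HTTP round trip:

    # A hypothetical sketch of the chatty-API pattern once ProductInventory
    # sits behind another team's service.
    def get_ads_with_min_cpc(min_cpc):
        # Stand-in for a query against the advertising team's own ProductAdvertisement
        # table; in practice this returns tens of thousands of product IDs.
        return [101, 102, 103, 104]

    def get_inventory_count(product_id):
        # Stand-in for GET /inventory/{product_id}; every call is a network round trip
        # plus a singleton lookup on the inventory team's database.
        fake_inventory = {101: 3, 102: 250, 103: 4, 104: 90}
        return fake_inventory[product_id]

    def low_stock_advertised_products(min_cpc=1.00, max_stock=5):
        # One efficient query on our side, then one chatty call per product to the other
        # team's API, versus a single three-table join when the optimizer sees everything.
        return [pid for pid in get_ads_with_min_cpc(min_cpc)
                if get_inventory_count(pid) < max_stock]

    print(low_stock_advertised_products())   # [101, 103]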

The other problem, aside from the latency and resource costs, is that the throttling team does not always have the best signals available to determine what to throttle and why.  If our inventory team sees the advertising team slamming the inventory API, the inventory team may institute code which starts returning 429 Too Many Requests response codes, forcing the advertising team to slow down their calls.  This fixes the over-subscription problem, but it might be a chainsaw solution to a scalpel problem.  In other words, suppose the advertising team has two different operations, each generating its own stream of requests:  a low-value operation and a high-value operation.  The inventory team doesn’t necessarily know which request belongs to which operation, so without coordination between the teams, the inventory team might accidentally block high-value operations while letting low-value operations through.  Or the two teams may cooperate and block low-value operations, but block so aggressively that the low-value operation starves instead of simply being throttled back.  Neither of those answers is great.

The Price Signal

Instead of having different teams own their own hardware and try to live in silos, my preferred solution is to institute price signals.  What follows is an ideal (some might say overkill) implementation.

In this setup, Operations owns the servers.  Like the farmer in my previous example, ops teams want their property to remain as useful as possible.  Operations wants to make their servers enticing and prohibit over-subscription.  To do this, they price resource utilization.  On a server with a very low load, teams can use the server for pennies per hour; when the server is at 100% CPU utilization, it might spike to $20 or $30 per hour to use that server.  There are three important factors here:

  1. All teams have real-time (or very close to real-time) knowledge of the spot price of each server.
  2. Operations may set the price as they see fit.  That might lead to out-of-equilibrium prices, but there’s a factor that counteracts that quite well:
  3. The prices are actual dollar prices.

Operations is no longer a cost center within the company; it now has the opportunity to charge other teams for resource utilization.  If Operations does a good job keeping servers running efficiently and prices their products well, they end up earning the money to expand and improve; if they struggle, they shape up or ship out.  That’s because teams have alternatives.

Viable Alternatives

Suppose the Operations team can’t manage a server to save their lives.  Product teams are free to go back to options #1 or #3:  they can use cloud offerings and set up their own services there, or they can purchase their own hardware and build internal operations teams within the product team.  These real forms of competition force the Operations team to perform at least well enough to keep their customers.  If I’m going to pay more for servers from Operations than I am from AWS, I had better get something in return.  Sure, lock-in value is real and will play a role in keeping me on the Operations servers, but ops needs to provide additional value:  lower-latency connections, the ability to perform queries without going through APIs (one of the big downsides to the microservice approach above), people on staff when things go bump in the night, etc.

These viable alternatives will keep the prices that Operations charges fairly reasonable; if they’re charging $3 per gigabyte of storage per month, I’ll laugh at them and store my data in S3 or Azure Blob Storage for a hundredth of the price.  If they offer 5 cents per gigabyte per month on local flash storage arrays, I’ll play ball.

Taking Pricing Into Account

F.A. Hayek explained the importance of price as a signal in his paper The Use of Knowledge in Society.  Price is the mechanism people use to share disparate, situational, and sometimes contradictory or at least paradoxical information that cannot otherwise be aggregated or collected in a reasonable time or fashion.  Tying this work to our situation, we want to communicate resource utilization to disparate teams at various points in time.  We can throw a bunch of numbers at them and hope for the best, but if I tell the inventory team that they’re using 48% of a database server’s CPU resources and that the current CPU utilization is 89%, what does that mean?  Does that mean they can increase their load?  That they should decrease their load?  That things are doing just fine?

By contrast, suppose we tell the inventory team that right now, the spot price of this server is $5 per CPU hour, when the normal price is 20 cents per CPU hour.  That is a clear signal that the server is under heavy load, and maybe they should cut back on those giant warehouse-style queries burning up 16 cores.

When teams know that the price has jumped like this, they now have a number of options available:

  1. Prioritize resource utilization.  Are there any low-priority operations going on right now?  Maybe it’d be wise to reschedule those for later, when the server is cheap again (see the sketch after this list).
  2. Refocus efforts elsewhere.  If there’s a server which regularly gets too expensive, maybe it’d be wise to relocate to someplace else, where the price is cheaper.  This can spread the load among servers and make resource utilization more efficient.  As a reminder, that server doesn’t need to be on-prem where Operations owns it; it’s a pretty big world out there with plenty of groups willing to rent some space.
  3. Tune expensive operations.  When dollars and cents come into play, it’s easy to go to a product owner with an ROI.  If my advertising team just got hit with a $5000 bill for a month’s worth of processing on this server and I know I can cut resource usage down to a tenth, I’m saving $4500 per month.  If my next-best alternative does not bring my team at least that much (plus the incremental cost of resource utilization) in revenue, it’s a lot harder for product owners to keep engineering teams from doing tech debt cleanup and resource optimization work.
  4. Burn the money.  Sometimes, a team just has to take it on the chin; all of this work is important and the team needs to get that work done.
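
As a rough illustration of option #1, here is a minimal Python sketch of a batch job wrapper that checks the spot price before running and defers itself when the server is expensive.  The get_spot_price function, the server name, and the price threshold are all assumptions made up for the example, not a real API:

    # Hypothetical sketch: defer low-priority work when the spot price is high.
    import datetime

    def get_spot_price(server):
        # Stand-in for a call to Operations' pricing API (the "figure out the market" step below).
        return 5.00    # dollars per CPU hour right now

    def run_or_defer(job, server, max_price_per_cpu_hour):
        price = get_spot_price(server)
        if price <= max_price_per_cpu_hour:
            job()    # cheap enough, so run it now
        else:
            resume_at = datetime.datetime.now() + datetime.timedelta(hours=2)
            print(f"Spot price ${price:.2f}/CPU-hour is over budget; retrying around {resume_at:%H:%M}")

    run_or_defer(lambda: print("rebuilding the reporting rollups"),
                 server="sqlprod01", max_price_per_cpu_hour=0.50)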

Getting There From Here

Okay, so now that I’ve spent some time going over what the end game looks like, how do we possibly get there from here?  I’ll assume that “here” is like most companies I’ve worked at:  there’s a fairly limited understanding of what’s causing server heartache and a limited amount of responsibility that product teams take.

Here are the steps as I see them:

  1. Implement resource tracking.  Start with resource tracking as a whole if you don’t already have it.  Cover per-minute (or per some other time period) measures of CPU load, memory utilization, disk queue length, network bandwidth, and disk utilization.  Once those are in place, start getting resource usage by team.  In SQL Server, that might mean tagging by application name.
  2. Figure out pricing.  Even without solid knowledge of exactly where to begin, there are still two interesting pieces of information:  what other suppliers are charging, and the break-even cost for staff + servers + whatnot.  Unless you’ve got junk servers and terrible ops staff, you should be able to charge at least a little bit more than AWS/Azure/Google/whatever server prices.  And if your ops team is any good, you can charge a good bit more because you’re doing things like waking up at night when the server falls over.
  3. Figure out budgeting.  This is something that has to come from the top, and it has to make sense.  Your higher-value teams probably will get bigger budgets.  You may not know at the start what the budgets “should” be for teams, but at least you can start with current resource utilization shares.
  4. Figure out the market.  You’ll need an API to show the current price of each server.  Teams can call the API and figure out the current rate; a sketch of what that might look like follows this list.  Ideally, you’re also tracking per-team utilization and spend the way Azure or AWS does, to limit end-of-month surprises.
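
To give a sense of scale for that API, here is a hedged Python sketch of one way Operations might compute a spot price:  a floor price that ramps up sharply as utilization climbs.  The base price, ceiling, and exponent are invented for illustration; the real numbers are whatever Operations decides, as noted above.

    # Hypothetical pricing curve for a single server: cheap when idle, painful at saturation.
    def spot_price(cpu_utilization, base_price=0.02, ceiling_price=30.00, exponent=4):
        # cpu_utilization is a share between 0.0 and 1.0; the result is dollars per CPU hour.
        utilization = min(max(cpu_utilization, 0.0), 1.0)
        return round(base_price + (ceiling_price - base_price) * utilization ** exponent, 2)

    for load in (0.10, 0.50, 0.80, 0.95, 1.00):
        print(f"{load:.0%} busy -> ${spot_price(load):.2f}/CPU-hour")
    # 10% busy -> $0.02, 50% -> $1.89, 80% -> $12.30, 95% -> $24.44, 100% -> $30.00

Whether the number comes from a curve like this or from Operations adjusting prices by hand matters less than the fact that every team can query the same number in close to real time.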

Once this is in place, it gives teams a way of throttling their own utilization.  There’s still a chance for over-subscription, though, so let’s talk about one more strategy:  auctions.

Auctions

Thus far, we’ve talked about this option as a specific decision that teams make.  But when it comes to automated processes, companies like Google have proven that auctions work best.  In the Google advertisement example, there is a limited resource—the number of advertisement slots on a particular search’s results—and different people compete for those slots.  They compete by setting a maximum cost per click, and Google takes that (plus some other factors) and builds a ranking, giving the highest score the best slot, followed by the next-highest score, etc. until all the slots are filled.

So let’s apply that to our circumstance here.  Instead of simply having teams work through their resource capacity issues—a long-term solution but one which requires human intervention—we could auction off resource utilization.  Suppose the current spot price for a server is 5 cents per CPU hour because there’s little load.  Each team has an auction price for each process—maybe we’re willing to pay $10 per CPU hour for the most critical requests, followed by $1 per hour for our mid-importance requests, followed by 50 cents per hour for our least-important requests.  Other teams have their requests, too, and somebody comes in and floods the server with requests.  As resource utilization spikes, the cost of the server jumps up to 75 cents per CPU hour, and our 50-cent job stops automatically.  It jumps again to $4 per CPU hour and our $1 job shuts off.

That other team is running their stuff for a really long time, long enough that it’s important to run the mid-importance request.  Our team’s internal API knows this and therefore automatically sets the bid rate up to $5 temporarily, setting it back down to $1 once we’ve done enough work to satisfy our team’s processing requirements.

Implementing this strategy requires a bit more sophistication, as well as an understanding on the part of the product teams of what happens when the spot price goes above the auction price—that jobs can stop, and it’s up to product teams to spin them down nicely.
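
Here is a rough Python sketch of what that per-process bidding might look like:  each job carries a normal bid, jobs whose bids fall below the spot price pause themselves, and a job that has been starved past its tolerance temporarily raises its bid, mirroring the $1-to-$5 bump described above.  Every name and number here is illustrative rather than a real system.

    # Hypothetical auction sketch: jobs pause when the spot price exceeds their bid.
    import datetime

    class BiddingJob:
        def __init__(self, name, bid, max_delay_hours, boosted_bid):
            self.name = name
            self.bid = bid                      # dollars per CPU hour we normally pay
            self.boosted_bid = boosted_bid      # temporary bid when work is overdue
            self.max_delay = datetime.timedelta(hours=max_delay_hours)
            self.last_ran = datetime.datetime.now()

        def effective_bid(self, now):
            # If we've been starved past our tolerance, bid higher until we catch up.
            overdue = (now - self.last_ran) > self.max_delay
            return self.boosted_bid if overdue else self.bid

        def should_run(self, spot_price, now=None):
            now = now or datetime.datetime.now()
            return spot_price <= self.effective_bid(now)

    jobs = [
        BiddingJob("customer-facing lookups", bid=10.00, max_delay_hours=0, boosted_bid=10.00),
        BiddingJob("mid-importance rollups", bid=1.00, max_delay_hours=6, boosted_bid=5.00),
        BiddingJob("nightly cleanup", bid=0.50, max_delay_hours=48, boosted_bid=0.75),
    ]

    spot_price = 4.00    # the other team has flooded the server
    for job in jobs:
        status = "run" if job.should_run(spot_price) else "pause"
        print(f"{job.name}: {status}")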

Another Spin:  Funbucks

Okay, so most companies don’t like the idea of giving product teams cash and having them transfer real money to an Operations team.  So instead of doing this, you can still have a market system.  It isn’t quite as powerful because there are fewer options available—you might not be able to threaten abandoning the current set of servers for AWS—but it can still work.  Each team still has a budget, but the budget is in an internal scrip.  If you run out of that internal scrip, it’s up to higher-ups to step in.  This makes it a weaker solution, but still workable.

So, You’re Not Serious…Right?

Of course I am, doubting title.  I’m so serious that I’ll even point out cases where what I’m talking about is already in place!

First of all, AWS offers spot pricing on EC2 instances.  These prices tend to be less than the on-demand price and can be a great deal for companies which can afford to run processes at off hours.  You can write code to check the spot price and, if the spot price is low enough, snag an instance and do some work.
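
If you want to experiment with that, here is a short Python sketch using boto3 to check recent spot prices and decide whether to do the work.  The instance type, region, and price cap are placeholders:

    # Check the recent EC2 spot price and only proceed if it's under our cap.
    import boto3

    MAX_PRICE = 0.05    # our cap in dollars per hour, purely illustrative

    ec2 = boto3.client("ec2", region_name="us-east-1")
    history = ec2.describe_spot_price_history(
        InstanceTypes=["m5.large"],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=5,
    )

    # The price history comes back newest-first; SpotPrice is a string in dollars per hour.
    latest = float(history["SpotPriceHistory"][0]["SpotPrice"])
    if latest <= MAX_PRICE:
        print(f"Spot price ${latest:.4f}/hour is under our cap; request an instance and do the work.")
    else:
        print(f"Spot price ${latest:.4f}/hour is too rich for us; try again later.")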

As a great example of this, Databricks offers their Community Edition free of charge and uses AWS spot instances to host these.  That keeps prices down for Databricks because they have a hard cap on how high they’re willing to go—I’ve had cases where I’ve tried to spin up a CE cluster and gotten a failure indicating that the spot price was too high and that I should try again later.

For the graybeards in the audience, you’ll also appreciate this next example:  mainframe time-sharing, where organizations billed users for slices of computer time.  That was a common strategy for pricing computer utilization and is very similar to what I’m describing.

Conclusion

We’ve spent the past couple of days looking at how development teams can end up in a tragedy of the commons, and different techniques we can use to extricate ourselves from it.  The main purpose of these posts is to show that there are several options available, including creating markets internally.  We still haven’t talked about agorics yet, but let’s save that for another day.

Marketplaces In Computing, Part 1

I tend to see resource over-subscription problems frequently at work.  We have a set of product teams, and each team has a manager, a product owner, and a set of employees.  These teams share computing resources, though:  they use the same servers, access the same databases, and use the same networks.  This leads to a tragedy of the commons scenario.

Tragedy of the Commons

The tragedy of the commons is a classic concept in economics:  the basic idea is that there exists some common area accessible to all and owned (directly) by none.  Each member of a group is free to make use of this common area.  Let’s get a little more specific and talk about a common grazing field that a group of shepherds shares.  Each shepherd has a total desired utilization of the field:  shepherd A has x sheep, which will consume f(x) of the grass, where 0 <= f(x) <= 1.  Shepherd B has y sheep consuming f(y) grass, and shepherd C has z sheep, consuming f(z) grass.

If f(x) + f(y) + f(z) < 1, then the three shepherds can share the commons.  But let’s say that f(x) = 0.5, f(y) = 0.4, and f(z) = 0.3.  That adds up to 1.2, but we only have enough grass for 1.0.  This means that at least one of the three shepherds will end up losing out on this arrangement—if shepherds A and B get there early, their sheep are going to consume 90% of the available vegetation, leaving shepherd C with 10% instead of his needed 30%.

The tragedy of the commons goes one step further:  because there is overgrazing on this land, eventually the vegetation dies out and the land goes fallow for some time, leaving the shepherds without that place to graze.

There are different spins on the tragedy of the commons, but this is the scenario I want to walk through today.

Solutions to the Tragedy

There are three common solutions to a tragedy of the commons scenario:  private ownership, a governing body, or commons rules.

Private Ownership

This is the easiest solution to understand.  Instead of the three shepherds grazing freely on the commons, suppose that farmer D purchases the land with the intent of subletting it for grazing.  Farmer D will then charge the shepherds some amount to graze on the land, with the goal being to earn as much as he can from use of the land.  Right now, the shepherds are oversubscribed to the land at a cost of 0.  Once the farmer starts charging the shepherds, then we see movement.

As a simple example (without loss of generality), let’s say that the farmer knows that he can’t allow more than 80% of the land to be grazed; if he goes above that mark, we get into overgrazing territory and run the risk of the land going fallow for some time.  Thus, the farmer wants to set prices such that no more than 80% of the land gets grazed and the farmer maximizes his profits.

We won’t set up the equilibrium equation here, but if you really want to go down that route, you’re welcome to.  Let’s say that the equilibrium price is $10 per acre, and at $10 per acre, shepherd B realizes that he can go graze somewhere else for less.  That leaves shepherds A and C, whose total use adds up to 80%.  As an aside, we could just as easily have imagined a scenario where all three shepherds still use the plot but at least one uses less than the f(x, $0) amount would indicate, so that the sum was still somewhere around 80% utilization.

Central Planning

In contrast to the private property solution, central planning involves an individual or committee laying down edicts on use patterns.  In this case, we have a central planner telling each shepherd how much of the land he is allowed to use.  We can work from the assumption that the central planner also knows that overgrazing happens at more than 80% utilization, so the central planner won’t allow shepherds to graze beyond this.  How specifically the central planner allocates the resource is beyond the scope of discussion here, as are important issues like public choice theory.

Commons Rules

The third method of commons resource planning comes from Elinor Ostrom’s work (with this being a pretty good summary in a short blog post).  The gist of it is that local groups tend to formulate rules for behavior on the commons.  This is group decision making in an evolutionary context, in contrast to spontaneous order (the finding of equilibrium using market prices) or enforced order (the central planner making choices).

All three of these mechanisms work in different contexts at different levels.  Historically, much of the literature has argued in favor of centralization closer to the point of decision and decentralization further out—for example, look at Coase’s The Nature of the Firm.  In his work, Coase works to explain why firms exist, and his argument is that it is more efficient for firms to internalize some transaction costs.  His work also tries to explain when firms tend to be hierarchical in nature (closer to central planning) versus market-oriented or commons-oriented—though a careful reading of Coase shows that even within a firm, spontaneous orders can emerge and can be efficient…but now I’m going down a completely different track.

Let’s Get Back To Computers

Okay, so we have three working theories of how to manage resources in a commons scenario.  Let’s shift our commons scenario around just a little bit:  now our common resource is a database server.  We have a certain number of CPU cores, a certain amount of RAM, a certain amount of disk, a certain disk throughput, and a certain network bandwidth.  Instead of shepherds grazing, we have product teams using the database.

Each product team makes use of the database server differently, where product team A makes x calls and uses f(x) resources, product team B makes y calls and uses f(y) resources, and product team C makes z calls and uses f(z) resources, where each of f(x), f(y), and f(z) again lies between 0 and 1.  We hope that f(x) + f(y) + f(z) < 1; if that’s the case, we don’t have a resource allocation problem:  nobody has saturated the database server and all is well.  But realistically, most of us end up with resource contention, where every team pushes more and more product out and eventually team desires overwhelm server resources.  In this case, the equivalent of overgrazing is resource saturation, which leads to performance degradation across the board:  when CPU is at 100% or the disk queue keeps growing and growing, everybody suffers, not just the marginal over-user.

As a quick note, “1” in this case is a composite measure of multiple resources:  CPU, memory, disk capacity, disk throughput, and network throughput.  More realistically, f(x) describes a vector, each element of which corresponds to a specific resource and each element lies between 0 and 1.  For the sake of simplicity, I’m just going to refer to this as one number, but it’s certainly possible for a team to kill a server’s CPU while hardly touching disk (ask me how I know).
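
To spell out that vector idea, here is a tiny Python sketch with made-up numbers that checks each resource dimension separately:

    # Made-up utilization vectors: each element is one team's share of one resource on the server.
    team_usage = {
        "team_a": {"cpu": 0.50, "memory": 0.20, "disk_cap": 0.10, "disk_io": 0.45, "network": 0.05},
        "team_b": {"cpu": 0.40, "memory": 0.15, "disk_cap": 0.05, "disk_io": 0.30, "network": 0.10},
        "team_c": {"cpu": 0.30, "memory": 0.10, "disk_cap": 0.05, "disk_io": 0.10, "network": 0.05},
    }

    for resource in ["cpu", "memory", "disk_cap", "disk_io", "network"]:
        total = sum(usage[resource] for usage in team_usage.values())
        flag = "SATURATED" if total >= 1.0 else "fine"
        print(f"{resource}: {total:.2f} ({flag})")
    # CPU comes out at 1.20 (saturated) while disk capacity sits at 0.20: you really can
    # kill a server's CPU while hardly touching disk.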

Thinking About Implications

In my experience, there are two factors which have led to servers becoming more and more like a commons:  Operations’ loss of power and Agile product teams.

Gone are the days when an Operations team had a serious chance of spiking a project due to limited resources.  Instead, they’re expected to make do with what they have and find ways to say yes.  Maybe it’s always been like this—any curmudgeonly sysadmins want to chime in?—but it definitely seems to be the case now.

On the other side, the rise of Agile product teams has turned server resources into a commons problem for two reasons.  First, far too many people interpret “Agile” as “I don’t need to plan ahead.”  So they don’t plan ahead, instead just assuming that resource planning is some combination of premature optimization and something to worry about when we know we have a problem.

Second, Agile teams tend to have their own management structure:  teams run on their own with their own product owners, their own tech leads, and their own hierarchies and practices.  Those tech leads report to managers in IT/Engineering, sure, but they tend to be fairly independent.  There often isn’t a layer above the Agile team circumscribing their resource utilization, and the incentives are stacked for teams to behave with the servers like our example shepherds did with the common grazing field:  they focus on their own needs, often to the exclusion of the other shepherds and often to everybody’s detriment.

Thinking About Solutions

Given the server resource problem above, how would each of the three solutions we’ve covered apply to this problem?  Let’s look at them one by one, but in a slightly different order.

Central Planning

This is the easiest answer to understand.  Here, you have an architect or top-level manager or somebody with authority to determine who gets what share of the resources.  When there’s resource contention, the central planner says what’s less important and what needs to run.  The central planner often has the authority to tell product teams what to do, including having different teams spend sprint time to make their applications more efficient.  This is the first solution people tend to think about, and for good reason:  it’s what we’re used to, and it works reasonably well in many cases.

The problem with this solution (again, excluding public choice issues, incomplete information issues, and a litany of other, similar issues central planning has) is that typically, it’s management by crisis.  Lower-level teams act until they end up causing an issue, at which point the central planner steps in to determine who has a better claim on the resources.  Sometimes these people proactively monitor and allocate resources, but typically, they have other, more important things to do and if the server’s not on fire, well, something else is, so this can wait.

Commons Rules

In the “rules of the commons” solution, the different product teams work together with Operations to come up with rules of the game.  The different teams can figure out what to do when there is resource over-subscription, and there are built-in mechanisms to reduce over-utilization.  For example, Operations may put different teams into different resource groups, throttling the biggest users during a high-utilization moment.  Operations can also publish resource utilization by team, showing which teams are causing problems.

Let’s give two concrete examples of this from work.  First, each team has a separate application name that it uses in SQL Server connection strings.  That way, our Database Operations group can use Resource Governor to throttle CPU, memory, and I/O utilization by team.  This reduces some of the contention in some circumstances, but it doesn’t solve all of the issues:  assuming that there is no cheating on the part of teams, there are still parts of the application we don’t want to throttle—particularly, those parts of the application which drive the user interface.  It’s okay to throttle back-end batch operations, but if we’re slowing down the user experience, we are in for trouble.
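
As a small illustration, tagging the application name is a one-line change in the connection string.  In a Python application using pyodbc it might look like the sketch below; the server, database, and team names are placeholders, and the APP keyword is what should surface as the session’s application name on the server side (what APP_NAME() returns), which is what a Resource Governor classifier function or a per-team usage report can key off of:

    # Hypothetical connection string: the APP keyword tags every session with the team name.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlprod01;"            # placeholder server
        "DATABASE=Inventory;"          # placeholder database
        "APP=InventoryTeam;"           # the only line that matters for this post
        "Trusted_Connection=yes;"
    )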

Another example of how these rules can work is in Splunk logging.  We have a limited amount of logging we can do to Splunk per day, based on our license.  Every day, there is a report which shows how much of our Splunk logging we’ve used up in the prior day, broken down by team.  That way, if we go over the limit, the teams with the biggest logs know who they are and can figure out what’s going on.

The downside to commons rules is similar to that of the central planner:  teams don’t think about resource utilization until it’s too late and we’re struggling under load.  Sure, Operations can name and shame, but the incentives are stacked in such a way that nobody cares until there’s a crisis.  If you’re well under your logging limit, why work on minimizing logging when there are so many other things on the backlog?

Next Time:  The Price Signal

Despite the downsides I listed, the two systems above are not failures when it comes to IT/Engineering organizations.  They are both viable options, and I’m not saying to ditch what you currently do.  In the next post, I will cover the price option and spend an inordinate number of words writing about marketplaces in computing.

The Marginal DBA

You Have A Performance Problem; What Do You Do?

Brent Ozar has a blog post about what hardware you can purchase for the price of two cores of Enterprise Edition and argues that you should probably spend some more money on hardware.  Gianluca Sartori has a blog post along similar lines.  By contrast, Andre Kamman has an entire SQLBits talk about not throwing hardware at the problem.  And finally, Jen McCown argues that you should spend the money on Enterprise Edition.  When you have a performance problem, who’s right?

The answer is, all of them.  And to head off any comments, my first paragraph is mostly a setup; ask any of these authors and I’m sure they’ll tell you that “it depends” is the right answer.  The purpose of this post is to dig a bit deeper and discuss when each of these points is the correct answer, looking at things from an economist’s perspective.

Detour Into Basic Economics

Before I get too deep into application of economic principles to the world of database administration, let’s cover these principles really quick so we’re all on the same page.

Scarcity

Your data center is only so big.  Your rack can only hold so many servers.  Your credit card has a limit, and so does your bank account.  At some point, your budget runs out.  And even if your budget is phenomenal, there are technological limits to how big a server can be, or even how big a cluster of servers can be.  No matter how much cash, how many top-tier experts, and how much time you have available, there is always a limit.

I’ve personally never had to worry about hitting the 128 TB RAM limit in Windows Server 2012 R2.  Even if you had a server capable of using 128 TB of RAM, I’m sure it’d be so expensive that there’s no way I’d ever be allowed near it.

This, in short, is scarcity:  you will always run out of resources before you run out of problems.

Opportunity Cost

In a world of scarcity, we can figure out the true cost of something:  what is the next-best alternative to this?  For example, let’s say we have the option to buy $100K worth of hardware.  What other alternatives exist?  We could bring in a consultant and pay that person $100K to tune queries.  We could send the current staff out for $100K worth of training.  We could hire a new person at $100K (although that is a recurring stream of payments rather than a one-time outlay, so it’s a bit trickier to compare).  Or we could buy $100K worth of software to help solve our problem.  Dollar values here simply help focus the mind; even without currencies, we can still understand the basic concept:  the actual cost of a good or service is what you forego when you decide to make the trade to obtain that good or service.

Understanding opportunity cost is critical to making good decisions.  It also helps lead into the third major topic today.

Marginal Utility

One of the toughest problems of pre-Marginal Revolution economics was the diamond paradox.  The short version of the diamond paradox is as follows:  water is more valuable than diamonds, in the sense that a person can live with water but no diamonds, but not vice versa.  So why is water so inexpensive, yet diamonds are so expensive?

The answer is in marginal utility.  The idea behind marginal utility is that your valuation of a thing changes as you obtain more of that thing.  The first glass of water for a thirsty man in the desert is worth quite a bit; the 900th glass of water, not so much.  In hardware terms, going from 8 GB of RAM to 16 GB of RAM is worth more to you than going from 192 GB to 200 GB.  The 8-16 jump allows you to do quite a bit that you might not have been able to do before; the jump from 192 to 200, even though it is the same total RAM difference, opens many fewer doors.

Detour’s Over; Back To The Main Point

Armed with these principles of economics, let’s dive into the problem.  To keep things simple, I’m going to think about three competing areas for our IT budgets:

  1. Purchasing newer, better, or more hardware
  2. Paying people to write better code or tune application performance
  3. Purchasing Enterprise Edition licenses rather than Standard Edition (or APS appliances rather than Enterprise Edition)

Applying the scarcity principle first, we know we can’t afford all of these things.  If your company can afford all of the items on the list, then this whole topic’s moot.  But honestly, I’ve never seen a company whose budget was so unlimited that they could keep hiring more and more people to fix code while buying more and more hardware and getting more and more expensive software.  At some point, you hit a limit and need to make the hard decision.

Here’s a simplistic interpretation of those limits:

[Figure:  LinearConstraints1.png, showing OLTP and OLAP capacity bounded by three linear constraints:  hardware, application code, and SQL Server version.  Hardware is the tightest constraint.]

In this example, I have a set of hardware, queries, and SQL Server versions that I allocate to OLTP processing and OLAP processing.  I have three constraints (which I’ve made linear for the sake of simplicity):  hardware, application code, and SQL Server versions.  The idea behind this two-dimensional linear constraint picture is that you are able to pick any point on the X-Y plane which is less than or equal to ALL constraints.  In other words, you can pick any spot in the purple-lined section of the image.

In this scenario, the application is tuned reasonably well and the version of SQL Server isn’t going to help us much; we simply don’t have powerful enough hardware.  This might be something like trying to run a production SQL Server instance on 8 GB of RAM, or running on 5400 RPM hard drives.  Even if we hire people to tune queries like mad and push the red boundary out further, it doesn’t matter:  we will still have the same bottlenecks.

By contrast, in this case, if we purchase new hardware, we push the hardware constraint outward and expand the feasible region.  Let’s say we get some fancy new hardware which solves the bottleneck.  Now our constraint problem might look something like the following:

[Figure:  LinearConstraints2.png, the same picture after a hardware upgrade:  the hardware constraint has moved out, leaving SQL Server version and application code as the potential binding constraints.]

We still need to pick a spot somewhere in the purple-lined section, but notice that our constraint is no longer hardware.  In fact, we have two potential constraints:  version limitations and application limitations.  The answer to “what do we need to do?” just got a bit more difficult.  If we are in a situation in which we lean heavily toward OLTP activity, our next bottleneck is the application:  now it’s time to rewrite queries to perform a bit better.  By contrast, if we’re in a big warehouse environment, we will want to upgrade our servers to take advantage of OLAP features we can’t get in Standard Edition (e.g., clustered columnstore indexes or partitioning).

Can We Make This Practical?

In the real world, you aren’t going to create a linear programming problem with defined constraints…probably.  Nevertheless, the important part of the above section is the principle:  solve your tightest constraint.
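
That said, if you ever did want to sketch it out, finding the binding constraint takes only a few lines with scipy.  The coefficients below are invented purely to mirror the first picture, where hardware binds before application code or edition:

    # Invented coefficients: x is OLTP capacity, y is OLAP capacity, and we want as much of
    # both as the three constraints allow.  linprog minimizes, so we negate the objective.
    from scipy.optimize import linprog

    c = [-1.0, -1.0]
    constraints = {
        "hardware":           ([1.0, 1.0], 100.0),
        "application code":   ([1.0, 1.5], 180.0),
        "SQL Server edition": ([1.5, 1.0], 190.0),
    }
    A_ub = [row for row, _ in constraints.values()]
    b_ub = [limit for _, limit in constraints.values()]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub)
    for name, slack in zip(constraints, res.slack):
        status = "BINDING" if abs(slack) < 1e-7 else f"slack of {slack:.1f}"
        print(f"{name}: {status}")
    # Only hardware comes out binding, which is the constraint worth spending money on first.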

The Easy Stuff

Here’s a bit of low-hanging fruit:

  1. If you’re running production SQL Server instances with less than 32 GB of RAM, you almost certainly have a hardware constraint.  Buy that RAM.  Going back to Brent’s post, you can buy a brand new server for $7K, but if you have a server that can hold at least 128 GB of RAM, it looks like you can buy that RAM for about a grand (and I’m guessing you can probably get it for less; that’s retail price).
  2. If you haven’t done any query tuning, you can probably find two or three costly queries and tune them easily.  They may not be the biggest, nastiest queries, but they help you reduce server load.  Similarly, find two or three missing indexes and add them.
  3. If you query relatively small sets of static data, put up a Redis server and cache those results (see the sketch after this list).  A client of mine did that and went from about 85 million SQL Server queries per day to 60,000.  Even if your queries are perfectly optimized, that’s an incredible difference.
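
Here is what that caching pattern can look like with redis-py:  a minimal cache-aside sketch in which the key name, TTL, and query function are placeholders for whatever your application actually needs.

    # Minimal cache-aside sketch: check Redis first, fall back to SQL Server, cache the result.
    import json
    import redis

    cache = redis.Redis(host="localhost", port=6379, db=0)
    CACHE_TTL_SECONDS = 15 * 60    # static-ish data, so a generous TTL is fine

    def query_product_categories():
        # Stand-in for the real (and relatively expensive) SQL Server query.
        return [{"id": 1, "name": "Widgets"}, {"id": 2, "name": "Sprockets"}]

    def get_product_categories():
        cached = cache.get("product:categories")
        if cached is not None:
            return json.loads(cached)    # served from Redis, no SQL round trip
        rows = query_product_categories()
        cache.setex("product:categories", CACHE_TTL_SECONDS, json.dumps(rows))
        return rows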

The nice thing about low-hanging fruit is that the opportunity cost is extremely low.  Spend a grand for RAM, 3-5 hours tuning queries, and a few hundred dollars a month on Premium Redis caching.  For a company of any reasonable size, those are rounding errors.

The Hard Stuff

Once you get beyond the low-hanging fruit, you start to see scarcity creeping back in.  You’ve got reasonable hardware specs, but maybe that Pure Storage array will help out…or maybe you need to bring in top-notch consultants to optimize queries in a way you could never do…or maybe it’s time to buy Enterprise Edition and take advantage of those sweet, sweet features like In-Memory OLTP.

Finding the right answer here is a lot harder of a problem.  One place to start looking is wait stats.  SQL Server keeps track of reasons why queries are waiting, be they hardware-related (disk, IO, CPU), scheduling related, network bandwidth related, or even because your application can’t handle how fast SQL Server’s pushing data across to it.  There’s a bit of interpretation involved with understanding wait stats, but Paul Randal is putting together the definitive resource for understanding wait stats.
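
As a starting point, the sketch below pulls the top cumulative waits from sys.dm_os_wait_stats via pyodbc.  The connection string is a placeholder, and in practice you would filter out the many benign wait types that resources like Paul Randal’s account for:

    # Quick look at cumulative waits since the last restart (or since stats were cleared).
    import pyodbc

    conn = pyodbc.connect("DSN=ProdSQL;Trusted_Connection=yes;")   # placeholder connection
    cursor = conn.cursor()
    cursor.execute("""
        SELECT TOP (10) wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;
    """)
    for wait_type, wait_ms, signal_ms, tasks in cursor.fetchall():
        # A high signal share suggests the waits are really about CPU scheduling pressure.
        signal_pct = 100.0 * signal_ms / wait_ms if wait_ms else 0.0
        print(f"{wait_type}: {wait_ms / 1000.0:,.0f}s over {tasks:,} waits ({signal_pct:.0f}% signal)")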

The rest of my advice comes down to experience and testing.  If you’re in an environment in which you can run tests on hardware before purchase, that’s an easy way to tell if hardware is your bottleneck.  One scenario from a few years back:  we got to play with a solid state disk array back when those were brand new.  We ran our production SQL Server instances on it for a week or so and saw almost no performance gain.  The reason was that our servers had 8 GB of RAM and 1-2 vCPUs, so faster disk simply exacerbated the CPU and memory issues.  By contrast, at my current company, we had a chance to play with some newer servers with more powerful CPUs than what we have in production and saw a nice performance boost because we’re CPU-limited.

The nice thing is that my graphics above aren’t quite accurate in one big sense:  application tweaks and hardware work in parallel, meaning that buying new hardware can also push out the application tweak curve a bit and vice versa.  This means that, in many cases, it’s better to find the cheaper answer, be that hardware or sinking hours into application development and tuning.

Emphasizing Sunk Costs

There’s one last thing to discuss here:  the concept of sunk costs.  If you want a long-winded, rambling discussion of sunk costs, here’s one I wrote while in grad school.  For the short version, “sunk costs are sunk.”

Why is it that a company is willing to pay you $X a year to tune queries but isn’t willing to pay 0.1 * $X to get hardware that would end up solving the constraint issue above so much more easily?  The answer is that, from an accounting standpoint, the new hardware is a new cost, whereas your salary is already factored into their budget.  From the standpoint of the in-house accountant, your salary is a fixed cost (which they have to pay regardless of whether you’re spending 80 hours a week tuning queries like a boss or 5 hours a week of work & 35 hours of Facebook + Pokemon Go…at least until they fire you…) and from the accountant’s perspective, your effort doesn’t get shifted around; you’re still a DBA or database developer or whatever it is you do.  So your salary is a sunk cost.

I should mention a huge, huge point here:  salaries are sunk costs in the short term.  In the long term, salaries are neither fixed nor sunk costs.  In the long term, your manager can re-assign you to a different department (meaning that your salary is not sunk), and your salary can change or even go away if it makes sense from the company’s perspective (meaning that salary is a variable cost rather than a fixed cost).  But for our purposes, we’re interested in the short-term ramifications here, as “long-term” could mean a year or even several years.

Also, there is one big exception to the rule that your salary is a sunk cost:  if you get paid hourly, your wages become a variable cost, and it’s a lot easier to sell the company on trading variable costs for fixed costs:  “If you buy this hardware for $X, you don’t have to pay me 3 * $X to tune queries.”  That’s an actual cost reduction, whereas in the salary case, you’re still arguing for a net increase in company cost incurred.

So what can you do in that case?  Selling the company on “reduced cost” doesn’t cut it so much for salaried employees because you generally aren’t really reducing costs from the company’s standpoint.  Instead, you have to sell on opportunity cost:  if you weren’t spending 15 hours a week trying to tune queries to get this ETL process to work adequately, you could focus on Project X, which could net the company $Y in revenue (where $Y is an expected revenue return conditional upon the probability of success of Project X).  If $Y is substantially higher than the cost of hardware, you now have a solid business case you can take to higher-ups to get newer, better hardware.

Similarly, if you’re spending all of your time on application development and the company’s throwing barrels of cash at vendors for new hardware, you could make the case that getting some time to tune queries might allow the company to save money on net by deferring the cost of hardware purchases.  This is an easier case to make because, again, your salary is a sunk cost (unless you’re a wage-earning employee), so the opportunity cost comes down to what the next-best alternative is with your time.

Conclusion

For the most part, I can’t tell you whether you’re better off buying more hardware, tuning those queries, upgrading to Enterprise Edition, or doing something entirely different (at least unless you’re willing to shovel cash in my direction and have me take a look…).  What I hoped to do in this blog post was to give you some basic economics tools and let you apply them to your own situation.  With these concepts in place, you’ll have ammunition when going to management to ease your biggest constraint.  Managers and accountants are going to be a bit more amenable to arguments framed around opportunity cost, and your pleas are less likely to fall on deaf ears once you realize that sunk costs shouldn’t affect future behavior.

100% of the people reading this post are reading this post (and other lies about statistics)

A friend recently shared an article about the Ice Bucket Challenge that claimed only 27% of the money raised is going towards research. Here’s the article. 

Here’s the headline: 

ICE BUCKET CHALLENGE: ALS FOUNDATION ADMITS LESS THAN 27% OF DONATIONS FUND RESEARCH & CURES

$95 Million Later: Only 27% Of Donations Actually Help ‘Research The Cure’

I was pretty angry. The tone of the article is really awful too, slamming the ALS foundation for these heinous crimes. Yet there are some additional facts tucked away in a pie chart that give the lie to the headline. 19% of the funds raised go to patient and community outreach; a viable use of funding, don’t you think? 32%, the largest chunk of the funding, goes to public education. How dare they spend the money trying to make people aware of the disease and its effects! That’s what Wikipedia and WebMD are for! Oh, and the $95 million figure they quote isn’t what they actually break down in the chart either — it’s only the expenses for the year ending January 31, 2014.

Given that pie chart, in fact, 79% of the donations go directly to aiding sufferers of the disease or increasing awareness; that’s pretty good. The foundation is rated very highly by Charity Navigator too. 

The salary for the CEO is pretty insane — $300k+ is nuts for a non-profit. However, it’s only a tiny slice of the total pie, and not nearly as bad as scaremongers would have you believe. If we, in the United States, don’t want to use tax dollars to contribute to health care, funding of organizations like this one is a great way to contribute. 

Elon Musk: World’s greatest pro wrestler? Or greatest human being?

Sadly, he is not a pro wrestler. But he is the CEO of Tesla Motors, and he just did something so awesome that he may qualify for greatest human being in the world. Let’s set the brilliant title of his post aside for a moment. (Okay, giggle for a moment or two at the title.)

Tesla renouncing all of their patents is more than just whipping it out and laying it on the table. Well, okay, whipping out their patents and laying them on the table is exactly what they did. It is, however, a gauntlet thrown down at the feet of every major motor vehicle manufacturer in the world. This is Tesla saying, to be blunt, “our product is so awesome that we will tell you how to make it, and you still won’t succeed.” It is also Tesla saying, “Yeah, we’re already super rich, but we think that better, safer, more reliable cars are good for everybody and good for the planet.”

Maybe you believe human caused global warming isn’t a thing. Maybe you think that fossil fuel dependency isn’t a big deal. I’m not judging you. I’m not sure either. However, let me share a story for a moment.

As long time readers know, I have been in St. Petersburg, Russia, for the past ten months. Russia has been in the news quite a bit in those ten months for violence both inside the country and outside. I’ve even had my wallet stolen. Yet the one time I felt most in danger was when I was in the car going from the airport to the hostel. Why? Because the air quality was so absolutely putrid (and I’m an asthmatic) that for a few moments here and there, I literally could not breathe.

That’s not good. While I’m generally a misanthrope, I would prefer that human beings not die for stupid reasons. I share DNA with them. So, auto companies: pick up the gauntlet. Mass produce the ever loving shit out of these fuckers. Bring the cost down to a level where the average citizen can afford one. You will make shit tons of money — humans like moving around. I’m pretty sure you like money. There is no downside.

Sorry, Mike Trout: should have gotten an economics degree

Regression is quickly becoming the most interesting site in the Deadspin network (if the least funny). A recent article actually quantifies a superstar General Manager as more valuable than the best player in the game.  You can get his actual paper here.

I’ve skimmed the paper — the writing itself could use some improvements, I would argue, but that’s a stylistic question — and it seems solid to me. Undervaluing things that should be overvalued and vice versa is a skill, perhaps a tradition, in MLB. (Hi RBI and Saves! We love you! Don’t change!)

Yet, while I agree with his conclusions, I raise a further point. Consider that an MLB roster consists of 25 players. Ideally, if all else is equal and nobody gets hurt, you will have 8 or 9 regular position players and five starting pitchers. So the “meat” of an MLB roster should be 13 to 14 players (depending on league). The rest are injury replacements, situational players, and relievers. Okay. Ignoring the minor leagues for the sake of simplicity (although one could argue that a GM’s influence on the minors is as important as, or more important than, his influence on the majors), that means he has 13 or 14 critical decisions to make where he can make an immediate, direct impact.

Now let’s look at the NFL. You have 53 players on a roster, and no minor leagues, so this is comparatively easier to make sense of. You, nominally, have 22 starters in the NFL. However, unlike MLB, the depth chart doesn’t stay static because, especially on defense, players move around and formations shift. So, out of those 22 starters, I’d offer the following players as “true starters”, in the sense that they’d play every snap in a perfect world where injuries aren’t an issue.

— The quarterback

— The offensive line

— The #1 and #2 WR

— The #1 and #2 CB

— The safeties

— The middle linebacker (or two)

— The nose tackle

That means that an NFL team should reasonably expect these 14 or 15 players to play every snap. The other positions are situational (your defensive front seven will shift positions quite frequently and take plays off) or based purely on formation (absence or presence of a TE or FB or nickelback). That’s… about the same as MLB. You can even argue it’s fewer, because there are nose tackles who don’t play every snap (although few of them). According to Bleacher Report, the average NFL GM makes about $2 million a year. We could argue NFL GMs are undervalued, and I think that’s accurate; but given that they have about the same number of impact decisions to make, spread across a much larger roster (and therefore with more margin for error), you would expect them to be valued below MLB GMs.

Consider, finally, the missing element: player development. When you hire a GM in any sport, but particularly MLB and the NFL, the public thinks their job is to make the team better now. It’s actually to make the team better tomorrow. To do this, MLB has, at minimum, six teams (one AAA, one AA, one High-A, one A, and two rookie-league teams) devoted to each major league team for this very purpose. In the NFL, you have the 53-man roster. That’s it. If you want to argue that the NCAA is the minors for football, I can’t argue on the whole, but recall that there’s no formal affiliation and that talent is much more randomized in this sense.

When we add the player development aspect back into the equation, I would actually consider the NFL executive to have the more difficult job, strictly based on having fewer options to work with. Therefore, I think that if MLB executives are undervalued (and they are), NFL executives are more so, and also deserve some more cash for their trouble.