In yesterday’s post, I made mention of the concept of spot instances. Today, I’ll give you an idea of what they are, how they work, and how you can save a boatload of cash in the cloud using them.
What Are Spot Instances?
Spot instances are an idea which came out of Amazon Web Services. Specifically, the people at AWS realized that they had excess capacity on their servers. In the cloud, excess capacity is typically a bad thing, as you’re paying for resources not in use. Going back to basic economics, when you have excess capacity, you have a surplus. There are two ways to deal with a surplus: decrease supply (shift the supply curve back) or decrease prices (move down the demand curve).
There are some complicating factors here which make it tough for AWS or other cloud vendors to do either. First, when it comes to supply, a decrease in supply would not help them much because they’re still experiencing major growth, and so even if they don’t need the servers right at this moment, they will need them in the next X months, where X is a small number like 3-6. Furthermore, demand is not constant—there are low points and high points in any cloud data center, times when there’s relatively little going on and other times when the center is at or near peak capacity. Reducing supply is not a winner.
On the other side, you have the notion of decreasing prices. In the long run, we do see price-setting activity which works out pretty well for consumers. Virtual machine prices tend to drop over time—in some cases, so much so that by the end of a 3-year reserved instance contract, the pay-as-you-go price is actually lower than the reserved price from 3 years back. As we get new classes of virtual machines on newer hardware, the prices tend to be about the same as what we saw when the prior generation was new, meaning that customers have some idea of how much they’re willing to pay and that amount is not necessarily increasing over time.
In an ideal world from the standpoint of AWS and not consumers, they would be able to perform perfect price discrimination, such that you would pay exactly the maximum amount you would possibly pay to use an EC2 instance and not a fraction of a penny less. In the industry, we call this “the Oracle pricing model.” In practice, there are strong limits to price discrimination, especially because people and firms want to have some level of cost certainty prior to making a buying decision. This is especially true when thinking about common cloud VM cases, such as my wanting an easy way to spin up a virtual machine to do some work for some amount of time before possibly spinning it down. That’s an action I won’t take if I need to have a conversation with a sales person, and AWS doesn’t have enough sales reps to handle those scenarios anyhow. So instead they offer up a fixed-price model which covers their expenses and which they expect will clear the market for compute—that is, maximize data center utilization.
But what about the times when the cloud vendor can’t maximize data center utilization? Enter spot instances. The name “spot instance” comes from the concept of spot markets, which are markets in which commodities or financial instruments are bought and sold for immediate delivery—that is, we aren’t buying goods at a discount for future delivery several months from now. The price of a commodity in this market is called, naturally, the spot price. What some smart cookies at AWS figured out was that they could open up a spot market for these unutilized resources and allow bidding on them. This gives us three levels of pricing:
- On-demand or pay-as-you-go virtual machines, in which you pay the fixed price on the cloud vendor’s website.
- Reserved instances, in which you get a discount for an up-front commitment of 1-3 years.
- Spot instances, in which you pay some amount up to the pay-as-you-go price for unutilized resources.
AWS started this trend, but we also have spot instances in Azure and preemptible VMs in GCP.
I named this section after the GCP version because that name gives away the game. You see, spot instances can be a lot cheaper than pay-as-you-go pricing—sometimes up to 90% cheaper—but they run a risk. The cloud vendor is happy to sell you unutilized resources at 10% of their going rate because nobody else is willing to pay more. But once somebody else is willing to pay more, they don’t want your money and will evict your VM, shutting it down within a very short time frame. It’s nothing against you personally, but now they have a higher-paying customer for those resources and therefore you get the boot. But don’t worry—you can get the VM back by starting it up again, and if there’s unutilized capacity available, you’re back in the game.
You might be asking, what kind of idiot would agree to this? Well, the kind of idiot who knows how to play the game…which makes that person not an idiot. Here’s the trick to winning with spot instances:
- Have scripts to check if your VMs are running. If they aren’t running, have the script occasionally try to start up. There’s a specific error message if you fail to start the VM due to a lack of capacity, so you can watch out for this and try again in a few minutes.
- Have processes which can run at odd hours. Think about machine learning work or batch processing operations. If you can schedule it during non-business hours or less busy days, you can take advantage of people who shut their VMs off at night and on the weekends to save some scratch.
- Have processes which don’t need to be always-on. You might not be able to control this, but if you’re in a situation in which 100% uptime is not a guarantee, you’re in good shape.
- Have processes which can stop, save their work, and start again later. As soon as you get the indicator that the VM is going to shut down, you can write your results to disk and prepare for shutdown. Then, when you get the VM back, pick back up where you left off.
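Here is a minimal sketch of two of the tricks above: retrying a start when there’s no capacity, and checkpointing work so you can pick back up after an eviction. Everything named here is an assumption for illustration: `CapacityError` is a hypothetical stand-in for whatever error your cloud SDK raises, `start_vm` is whatever function you use to start the VM, and `eviction_pending` is whatever check you wire up to the provider’s eviction notice.

```python
import json
import os
import time


class CapacityError(Exception):
    """Hypothetical stand-in for the provider's 'no spot capacity' error."""


def start_with_retry(start_vm, max_attempts=5, wait_seconds=300, sleep=time.sleep):
    """Call start_vm() until it succeeds or we run out of attempts.

    Only CapacityError triggers a retry; any other exception propagates.
    Returns the 1-based attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            start_vm()
            return attempt
        except CapacityError:
            if attempt < max_attempts:
                sleep(wait_seconds)  # try again in a few minutes
    return None


def run_resumable(items, checkpoint_path, process, eviction_pending):
    """Process items in order, saving progress so a later run can resume.

    Returns True if every item was processed, False if we stopped early
    because eviction_pending() reported that the VM is about to go away.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]

    for index in range(done, len(items)):
        if eviction_pending():
            return False  # progress so far is already on disk
        process(items[index])
        # Write to a temp file, then rename, so a kill mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return True
```

The checkpoint here is just a count of finished items, which only works if items are processed in a stable order. A real job would likely save richer state, but the shape is the same: save often, check for the eviction signal between units of work, and resume from the last save.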
A great example of this is Databricks Community Edition. For years, Databricks Community Edition has run on spot instances in AWS. They don’t charge you for use of Databricks Community Edition and its intent is for short-term learning engagements for people. Therefore, Databricks has wanted to keep its costs as low as possible, knowing that most people only use their clusters for an hour at a time or so. This is a perfect case for spot instances. If the current spot price for instances was higher than what they were willing to pay, or if there wasn’t capacity available at that moment, they’d return an error message and let you know that you should try again in a few minutes.
Maximize the Savings
Let’s take a look at a concrete example in Azure. Here’s a Standard D16ds v4 instance with 16 vCPUs and 64 GB of RAM running Ubuntu. On the east coast, this baby will set you back $659.92 a month at pay-as-you-go pricing, which is $0.904 per hour.
You can see that one-year and three-year reservation prices are 53.16 and 34 cents per hour, respectively. But check out the spot price: as low as 23.95 cents per hour, which is even better than a three-year reservation but with none of the commitment!
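To put those numbers side by side, here is a quick bit of arithmetic using the hourly rates quoted above. The 730 hours per month figure is my assumption; it is the value that makes $0.904 per hour come out to $659.92 a month.

```python
# Assumption: a billing month of 730 hours (0.904 * 730 = 659.92).
HOURS_PER_MONTH = 730

# Hourly rates for the D16ds v4 example, taken from the text above.
hourly_rates = {
    "pay-as-you-go": 0.904,
    "1-year reserved": 0.5316,
    "3-year reserved": 0.34,
    "spot (minimum)": 0.2395,
}


def monthly_cost(rate):
    """Cost of running one VM for a full month at the given hourly rate."""
    return rate * HOURS_PER_MONTH


def savings_percent(rate, on_demand=hourly_rates["pay-as-you-go"]):
    """Percentage saved relative to the pay-as-you-go hourly rate."""
    return (1 - rate / on_demand) * 100


for name, rate in hourly_rates.items():
    print(f"{name:16} ${rate:.4f}/hr  ${monthly_cost(rate):7.2f}/mo  "
          f"{savings_percent(rate):5.1f}% off")
```

By this math, the one-year reservation saves roughly 41%, the three-year reservation roughly 62%, and the minimum spot price roughly 74% against pay-as-you-go.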
When you go to run one of these in the Azure portal, you can see the pricing for a pay-as-you-go instance:
But let’s check the spot instance box. Doing this gives us a few extra options.
Let’s look at both of the eviction types separately.
Capacity Only
The capacity-only option is for the more risk averse among us. The good news is that you never pay more than the pay-as-you-go price. The better news is that you do pay the spot market price for this machine, so you aren’t simply paying the max price all of the time. The best news is that some spot instance pirate won’t yank the rug out from under you and take over your allocated compute resources because he’s willing to pay a thousandth of a cent per hour more than you. But if you select this radio button, you’re admitting defeat. You might as well be one of those people paying full price for everything.
In all seriousness, though, capacity-only pricing is a good entry point into the world of spot pricing. You’re still considered at a lower tier than the pay-as-you-go folks, but you’re at the top of the spot instance hierarchy, meaning that, unlike Sully, your VM really will be killed last.
Price or Capacity
Now we’re talking! This is where the real big-money savings come in. You roll the dice and set a cap. If there are any suckers willing to pay more than you are right now, no worries: you’ll just try again later. And again and again if need be, but when it’s your wallet on the line, persistence pays off.
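To get a feel for how a cap plays out, here is a toy simulation. It rests on simplifying assumptions of mine: the price changes once per hour, you are billed the going spot price (not your cap) for every hour you run, and the only evictions are price-based, so capacity evictions are ignored. The price history is made up for illustration.

```python
def simulate_cap(hourly_spot_prices, max_price):
    """Walk an hourly price history and see what a price cap gets you.

    Returns (hours_running, total_cost, evictions). The VM runs in any hour
    whose spot price is at or below max_price, and is evicted whenever the
    price climbs above the cap while it is running.
    """
    hours_running = 0
    total_cost = 0.0
    evictions = 0
    running = False

    for price in hourly_spot_prices:
        if price <= max_price:
            hours_running += 1
            total_cost += price  # billed at the spot price, not the cap
            running = True
        else:
            if running:
                evictions += 1  # someone outbid us; we get the boot
            running = False

    return hours_running, total_cost, evictions


# Made-up hourly prices, purely for illustration.
made_up_prices = [0.24, 0.25, 0.31, 0.33, 0.26, 0.24]
hours, cost, evictions = simulate_cap(made_up_prices, max_price=0.30)
print(hours, round(cost, 2), evictions)
```

With this made-up history and a $0.30 cap, the VM runs four of the six hours, gets evicted once, and comes back when the price dips again.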
You can enter any maximum price you want, but it has to be at least equal to the hardware costs that Microsoft has set up for that class of VM. But want to know another fun trick? Those prices differ region to region. For example, in East US, the minimum spot price is $0.23947. But in West US, it’s $0.2164 even though the pay-as-you-go price is higher in West US than East US! In other words, Azure prices for demand in different regions, and if you don’t care which region your code runs in, you might save even more money: say, in Australia Southeast, where the hardware price is $0.1709 per hour, or Korea South at $0.1564 per hour. Now, keep in mind that you may pay extra storage or networking costs if your data and your compute live in different regions, for example a storage account in West US 2 and compute running out of Korea South. (West US 2 itself has a hardware cost of $0.3513 per hour, so don’t use spot instances to run your D16ds v4 VMs there.) So do your homework beforehand and figure out which regions are going to give you the most savings. After all, if you’re scraping this hard, money means more to you than time. Maximize the Savings!
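If you want to do that homework in code, a small sketch like the following ranks regions by the minimum spot prices quoted above. The `extra_hourly_cost` parameter is a hypothetical knob of mine for folding in your own estimate of cross-region storage or networking charges; the numbers you feed it are yours to work out.

```python
# Minimum spot prices per hour for a D16ds v4, as quoted in the text above.
min_spot_price = {
    "East US": 0.23947,
    "West US": 0.2164,
    "West US 2": 0.3513,
    "Australia Southeast": 0.1709,
    "Korea South": 0.1564,
}


def rank_regions(prices, extra_hourly_cost=None):
    """Return (region, effective hourly cost) pairs, cheapest first.

    extra_hourly_cost is an optional dict of per-region surcharges, such as
    your own estimate of data transfer costs when compute and storage sit
    in different regions.
    """
    extra = extra_hourly_cost or {}
    totals = {region: price + extra.get(region, 0.0)
              for region, price in prices.items()}
    return sorted(totals.items(), key=lambda pair: pair[1])


for region, cost in rank_regions(min_spot_price):
    print(f"{region:20} ${cost:.4f}/hr")
```

On raw hardware price alone, Korea South comes out on top. Add a large enough surcharge for shipping data across the ocean and the ranking changes, which is the whole point of running the numbers first.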
Not Maximizing the Savings
Let’s take a look at another instance, this time using the most recent hardware (as of December 2021): Standard D16ds v5. Right now, the spot price is higher than the three-year reserved instance price:
Yes, spot prices are still better than pay-go pricing, but you aren’t getting the elite savings here. In general, look for prior-generation hardware. There are two good reasons for doing this. First, prior-generation hardware will have lower hardware costs, and so your spot prices can be lower. Second, prior-generation hardware is more likely to have lengthy unused periods, as people move their VMs up to newer generations of hardware to take advantage of the performance gains. We don’t care about performance gains; we care about price savings, and so moving back a generation or more is a better idea.
A big question you might have is, “Kevin, how can I Maximize the Savings on fuel?” My protip here is to use the breath mint after using the rubber hose.
Another big question you might have is, “Kevin, how likely is it that my VMs are going to get the Sully treatment?” Well, I should remind you that this is Matrix’s weak arm. But we can get an idea of just how often eviction occurs by selecting the “View pricing history and compare prices in nearby regions” link. If I do this for my D16ds v4 example, I can get a feeling for price movements over time in geographically similar Azure regions, as well as hardware costs and approximate eviction rates.
Dave Callan has a nice post covering eviction rates as well as doing some Azure Portal intelligence gathering to find the best rates.
Today, we learned one way to Maximize the Savings for virtual machines. If you want to learn more about spot instances across the three major cloud platforms (in the US), Gilad David Maayan has a great article. For Azure specifically, here is a FAQ with more details.
For now, I leave you with the promised land of VM pricing: big GPU boxes for little GPU dollars.