Building Resilient Applications with AWS Spot Instances and EKS

AWS Spot instances are spare EC2 capacity that AWS sells at up to a 90% discount compared to On-Demand prices. As such, they’re a great way to save money on your AWS bill.

However, as with all things in this world, that discount doesn’t come for free. You’ll need to put in effort to maintain service reliability. This post discusses some trade-offs, tips, and limitations of using Spot instances with EKS.

Build your apps to be truly ephemeral and stateless

Spot instances can be reclaimed at any time, so you’ll have to build your apps to be truly ephemeral and stateless. Stateless APIs are a great example of this, as are simple job-queue workers.

State is OK as long as it lives outside the cluster, in RDS, ElastiCache, etc.

Use multiple replicas for every app (at least 2)

An easy decision you can make right off the bat is to configure at least 2 replicas for every app. Since a node can be reclaimed at any time, running a single replica means your service can drop to zero replicas, and you don’t want that.

It’s possible for multiple replicas to be taken away at the same time, and this actually happens more often than you’d think. Having 2 replicas drastically decreases the chances of dropping to zero, and increases your reliability.
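As a minimal sketch, this is all it takes in a Deployment manifest (the name, labels, and image here are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api            # placeholder name
spec:
  replicas: 2             # never run a single replica on Spot capacity
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          image: registry.example.com/my-api:latest  # placeholder image
```

Note that 2 replicas only help if they don’t land on the same node, which is what the spreading techniques below address.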

Invest in developer awareness and knowledge sharing

Teams should be made aware of the fact that their services can be terminated at any time, and should know how to properly account for it. The larger the org, the more effort this takes.

Invest in knowledge sharing sessions that help teams build resilient and reliable services.

Your apps should terminate gracefully

AWS recommends running their Node Termination Handler (NTH) so your cluster can react to Spot interruption notices, EC2 scheduled maintenance events, and AZ rebalance recommendations. Good news: if you use EKS managed node groups, Spot interruptions are handled gracefully out of the box, so you don’t need to install NTH yourself. If you use self-managed nodes, you’ll have to add it in yourself.

A Spot interruption notice is essentially AWS telling you that you have 2 minutes to wrap up what you’re doing before your node gets reclaimed. With NTH in place, the node is cordoned and drained as soon as the notice arrives, and each of your containers receives a SIGTERM that you can listen for to shut the service down gracefully.

Tip for apps that process jobs, or any other queue-based workloads: stop dequeuing jobs as soon as you receive a SIGTERM. I found that this is an easy thing to miss. Let in-flight jobs run to completion, but don’t pull any new ones. As an example: Bull has a Queue#close method that stops the worker from pulling new jobs.
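The pattern can be sketched like this (the queue interface here is hypothetical, just for illustration; with Bull you’d call queue.close() in the handler instead):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    # Flip a flag instead of exiting immediately: the job currently in
    # flight finishes, but the loop below stops pulling new ones.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def run_worker(queue):
    """Process jobs until the queue is empty or SIGTERM arrives.

    `queue` is any object with a non-blocking pop() that returns None
    when empty -- a hypothetical interface for this sketch.
    """
    processed = []
    while not shutting_down:
        job = queue.pop()
        if job is None:
            break
        processed.append(job)  # stand-in for real job processing
    return processed
```

The important detail is that the shutdown check happens between jobs, never in the middle of one, so you get at most 2 minutes of remaining work plus zero new work.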

Use Pod Topology Spread Constraints

Kubernetes provides a way for you to spread your services throughout your cluster’s nodes, allowing you to minimize the chance that your pods get evicted at the same time. This feature is called Pod Topology Spread Constraints (topologySpreadConstraints field).

What does this solve? Well, if you don’t control which nodes your app is scheduled onto, all of its replicas could end up on the same (Spot) node. When that node eventually gets reclaimed, your app will drop to zero replicas and become completely unavailable.

Using Pod Topology Spread Constraints, you can make sure replicas never end up on the same node. You can spread your app’s replicas across different nodes and availability zones. This will drastically increase your app’s availability.

Here are a couple of its fields to give you an idea of the flexibility it provides:

- maxSkew: the maximum allowed difference in the number of matching pods between any two topology domains.
- topologyKey: the node label that defines a topology domain, e.g. topology.kubernetes.io/zone for AZs or kubernetes.io/hostname for individual nodes.
- whenUnsatisfiable: what to do when the constraint can’t be met — DoNotSchedule (hard requirement) or ScheduleAnyway (best effort).
- labelSelector: which pods to count when calculating the spread.

Check out the official docs on how to configure topologySpreadConstraints for your pods.
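To make it concrete, here’s a minimal sketch of a constraint that goes under a pod’s spec (the app label is a placeholder):

```yaml
# Pod spec fragment: keep replicas of app=my-api spread across nodes.
topologySpreadConstraints:
  - maxSkew: 1                           # at most 1 more replica on any node than another
    topologyKey: kubernetes.io/hostname  # spread per node; use topology.kubernetes.io/zone for AZs
    whenUnsatisfiable: DoNotSchedule     # hard requirement; ScheduleAnyway makes it best-effort
    labelSelector:
      matchLabels:
        app: my-api
```

You can list multiple constraints, e.g. one per-zone and one per-node, and the scheduler will satisfy all of them.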

Pod Anti-affinity

There’s also the more basic Pod Anti-affinity feature, which you can use to tell Kubernetes that a pod should not be scheduled onto nodes that already run pods matching a given label selector.

As you can see, it’s not as powerful as Pod Topology Spread Constraints. It only lets you control where the pod should not be provisioned, whereas Pod Topology Spread Constraints give you more flexibility and control over how the spreading is done.
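For comparison, a minimal anti-affinity sketch under a pod’s spec (again, the app label is a placeholder):

```yaml
# Pod spec fragment: refuse to schedule onto a node that already
# runs a pod labeled app=my-api.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-api
        topologyKey: kubernetes.io/hostname
```

There’s also a preferredDuringSchedulingIgnoredDuringExecution variant if you want a soft preference rather than a hard rule.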

Pod Disruption Budget

You can set a Pod Disruption Budget (PDB) for your service, telling Kubernetes to limit the disruption to your application by using configuration parameters such as:

- minAvailable: the minimum number (or percentage) of pods that must stay up during a voluntary disruption.
- maxUnavailable: the maximum number (or percentage) of pods that may be down at the same time.

Note: when we mention disruptions, we’re talking about voluntary disruptions, i.e. actions initiated by an admin or a controller, such as:

- draining a node for repair, upgrade, or scale-down
- removing a pod from a node to make room for other workloads
- deleting the controller (e.g. Deployment) that manages the pod

So, although this doesn’t help much with Spot interruptions, it still improves your app’s availability by limiting the number of pods that can be taken away at any time.
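A minimal PDB sketch, assuming the same placeholder app label as above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb      # placeholder name
spec:
  minAvailable: 1       # or maxUnavailable; the two are mutually exclusive
  selector:
    matchLabels:
      app: my-api
```

With this in place, a node drain will evict at most one of your two replicas at a time and wait for a replacement to become ready before evicting the next.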

Don’t use Spot instances for long-running jobs

Use them for services like APIs, or stateless workloads that are OK to be terminated at any time. Technically, you could use them for long-running workloads that support checkpointing or saving progress periodically, but that adds complexity.

Some useful resources
