A Guide to EC2 Autoscaling
You should use autoscaling to
… increase the scalability of your system as customers ramp up their traffic over time.
… reduce MTTR when there’s a sudden spike in traffic, by responding more quickly than a human could.
… mitigate occasional application bugs, like a process pegging the CPU at 100%: the instance fails its health checks, so autoscaling marks it unhealthy, terminates it, and launches a replacement. You should fix the root cause, of course, but in the meantime you don’t want users to suffer, and you want to buy yourself time to fix it if you already have too many problems on your plate.
… reduce manual overhead, like being paged when customers increase their traffic.
… reduce cost by running fewer instances at night.
Before you set up autoscaling
Ensure that your instances start up automatically, without manual steps like ssh’ing in to start services after boot.
Ensure your instances are consistent in the services they run. If instance 1 runs A, B and C, instance 2 runs A and C, and instance 3 runs B and C, you can’t autoscale.
Ensure that each instance is configured automatically, such as by using EC2 user data or instance tags, which let you set key/value pairs that the instance can query at startup. You should not ssh in to modify a config file for each instance to tell it its instance ID, for example.
Dependent services should also autoscale. For example, if you’re running your own Redis, you should ensure that each VM includes Redis, rather than having a fixed capacity for Redis, because a fixed capacity won’t autoscale as the number of VMs scales.
Ensure your instances are stateless. If one of your services is stateful, like a presence service in a chat app, refactor it to be stateless. If that’s not possible, run the presence service by itself in a separate VM, outside the autoscaling group, and run everything else, which is stateless, in the autoscaling group.
How you should set up autoscaling
You need to select a metric that drives scaling. If this metric is high, the system is considered overloaded, so it scales up; if it’s low, it scales down. Choose CPU utilisation.
Experiment with different values of CPU utilisation like 80% and 70% to see which one results in the peak latency you want.
There are different scaling policies you can choose from. Choose target tracking. With target tracking, you say that the CPU utilisation should be 70% (for example), and EC2 takes care of the rest. It’s like telling your AC that you want the temperature to be 25 degrees, which is your goal. You don’t want to tell the AC how to achieve your goal, such as “If the temperature > 27, cool the room by 200 BTU, and if the temperature < 23, heat the room by 100 BTU.” You want to leave these details to the AC to figure out.
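As a sketch, a target-tracking policy is the request you would send to the Auto Scaling API (for instance via boto3’s `put_scaling_policy`); the group and policy names here are hypothetical:

```python
# Request payload for a target-tracking scaling policy, as you would pass it
# to boto3's Auto Scaling client: client.put_scaling_policy(**policy).
policy = {
    "AutoScalingGroupName": "web-asg",      # hypothetical group name
    "PolicyName": "cpu-target-tracking",    # hypothetical policy name
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        # Predefined metric: average CPU utilisation across the group.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        # The goal: keep average CPU at 70%. EC2 figures out how to get there.
        "TargetValue": 70.0,
    },
}
```

Note that you state only the goal (70%), not the scaling steps; EC2 derives those, just like the AC in the analogy.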
Enable predictive scaling. For example, if you usually have a peak at 9 AM, you don’t want to wait until you hit the peak, have performance degrade, and then deal with it. Enabling predictive scaling will never cause the number of instances to be lower than what it would have been without predictive scaling, so there’s no downside to enabling it.
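A predictive scaling policy can be sketched the same way (again via boto3’s `put_scaling_policy`, with hypothetical names):

```python
# Request payload for a predictive scaling policy:
# client.put_scaling_policy(**policy). Names are placeholders.
policy = {
    "AutoScalingGroupName": "web-asg",
    "PolicyName": "cpu-predictive",
    "PolicyType": "PredictiveScaling",
    "PredictiveScalingConfiguration": {
        "MetricSpecifications": [{
            # Forecast load from the CPU utilisation metric pair.
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization"
            },
            "TargetValue": 70.0,
        }],
        # "ForecastAndScale" acts on forecasts; "ForecastOnly" just reports
        # them, which lets you evaluate the forecasts before trusting them.
        "Mode": "ForecastAndScale",
    },
}
```

Starting in `ForecastOnly` mode for a couple of weeks is a low-risk way to check the forecasts against your real 9 AM peak before letting them drive scaling.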
Autoscaling lets you set a minimum number of instances to run, independent of load. Set that to 2 so that one failure (either due to your code or AWS) doesn’t result in an outage.
Autoscaling lets you set a desired number of instances to run as a default, to be used when the algorithms that detect high and low load don’t trigger. Set that to the minimum number of instances.
Autoscaling lets you set a maximum number of instances to run. Set that to 10x the number that typically run in peak hours.
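The three limits above map directly onto the group’s own parameters (boto3’s `create_auto_scaling_group`); the typical peak count is an assumed example value:

```python
# Assumed observation: ~8 instances typically run during peak hours.
typical_peak_count = 8

# Sizing parameters for the group, as you would pass them to
# autoscaling_client.create_auto_scaling_group(**group). Name is a placeholder.
group = {
    "AutoScalingGroupName": "web-asg",
    "MinSize": 2,                        # survive one failure without an outage
    "DesiredCapacity": 2,                # default = the minimum
    "MaxSize": 10 * typical_peak_count,  # generous ceiling: 10x typical peak
}
```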
You should enable connection draining. This ensures that when the load decreases and one of the instances needs to be shut down, new requests won’t be sent to it, but autoscaling will wait for the existing requests to complete before terminating it. Without connection draining, users will get errors.
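On ALB/NLB target groups, connection draining is configured as the target group’s deregistration delay. A sketch of the request (boto3’s `modify_target_group_attributes` on the elbv2 client), with a placeholder ARN:

```python
# Request payload for elbv2_client.modify_target_group_attributes(**request).
# The target group ARN is a placeholder.
request = {
    "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/web/abc123",
    "Attributes": [{
        # Wait up to 300 seconds for in-flight requests to complete before
        # an instance being scaled in is actually terminated.
        "Key": "deregistration_delay.timeout_seconds",
        "Value": "300",
    }],
}
```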
Autoscaling can publish group metrics at one-minute granularity to CloudWatch. This comes at no extra cost, but it isn’t on by default, so enable it.
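Enabling it is a one-call change (boto3’s `enable_metrics_collection`, group name hypothetical):

```python
# Request payload for autoscaling_client.enable_metrics_collection(**request).
request = {
    "AutoScalingGroupName": "web-asg",  # placeholder group name
    "Granularity": "1Minute",           # one-minute granularity
}
```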
Autoscaling lets you set a maximum instance lifetime after which an instance will be terminated even if it’s healthy, and a fresh one created. Set this to a reasonable duration, like a month. Otherwise, some VMs will keep running for months without a reboot, which means they won’t pick up updates that apply only at launch, such as a newer AMI, which is bad practice.
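The lifetime is set in seconds on the group itself (boto3’s `update_auto_scaling_group`, group name hypothetical):

```python
# Request payload for autoscaling_client.update_auto_scaling_group(**request).
SECONDS_PER_DAY = 24 * 3600

request = {
    "AutoScalingGroupName": "web-asg",            # placeholder group name
    # Replace any instance older than ~a month, even if it's healthy.
    "MaxInstanceLifetime": 30 * SECONDS_PER_DAY,  # value is in seconds
}
```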
To use autoscaling, you have to define a launch template that tells autoscaling how to spin up a new server when needed.
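A minimal launch template sketch (boto3’s `create_launch_template`); the AMI ID, instance type, service name, and user-data script are all placeholder assumptions:

```python
import base64

# A startup script baked into the template, so instances start their
# services automatically on boot -- no manual ssh step. The service
# name is a placeholder.
user_data = """#!/bin/bash
systemctl start myapp.service
"""

# Request payload for ec2_client.create_launch_template(**request).
request = {
    "LaunchTemplateName": "web-template",
    "LaunchTemplateData": {
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "m5.large",          # placeholder instance type
        # User data must be base64-encoded in the API.
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
}
```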
You can have multiple autoscaling groups of instances that are scaled separately. This makes sense if you have a microservices architecture, among other situations. For example, if you’re building an email service, and you have two microservices, the main microservice and an attachment microservice, each can be in its own group, scaled separately.
When using autoscaling, you attach the autoscaling group to the load balancer, rather than individual instances.
If you have multiple instances of your infrastructure, say in different regions or for different customers, standardise the instance size across all of them. You don’t need two ways to adjust resources based on traffic: more instances and bigger instances. Having both makes it harder to tune things, such as avoiding out-of-memory errors or increasing thread pool sizes to take advantage of bigger instances.
Don’t choose latency as the metric, since the metric you scale on needs to be proportional to the load on the instances.
Latency doesn’t increase in proportion to load.
Latency can also increase for reasons unrelated to app-server load, like a bottleneck at some other layer of your stack, such as your database. You don’t want autoscaling to interpret that as insufficient capacity and scale your app servers 10x.
In any case, target tracking’s predefined metrics are CPU utilisation, request count per target, network bytes out, and network bytes in, in decreasing order of suitability; a metric like latency would have to be wired up as a custom metric specification, which adds complexity without fixing the problems above.
Predictive scaling increases the number of instances if it predicts high demand coming soon, but does not reduce the number of instances if it predicts low demand coming, because that would risk high latency if the prediction is wrong.
With predictive scaling, the number of active instances = max (predicted count based on future traffic, actual count based on current traffic).
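The rule above can be sketched in a few lines; the forecast and actual series here are made-up example numbers:

```python
def active_count(predicted: int, actual: int) -> int:
    """Predictive scaling only ever adds capacity: the group runs the
    larger of the forecast count and the count current traffic demands."""
    return max(predicted, actual)

# Made-up example: the forecast anticipates a 9 AM peak, so capacity rises
# ahead of demand; when actual demand overshoots the forecast, it wins.
predicted = [2, 6, 8, 8, 4]
actual    = [2, 3, 8, 9, 5]
counts = [active_count(p, a) for p, a in zip(predicted, actual)]
# counts == [2, 6, 8, 9, 5]
```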