Canary and Launch Best Practices
Canary and launch processes, done right, can increase reliability and help your team progress faster. Let’s see how to achieve that:
To begin with, you should canary your deployments, which means exposing them to a small fraction of production traffic for a while before rolling them out to 100% of traffic.
Why would you do this?
To catch problems while they’ve affected only a small fraction of users.
To deploy to the canary with a more aggressive strategy than you’d feel comfortable using for production. For example, continuous deployment without requiring a blocking code review.
I suggest the following launch process:
Engineers commit code
→ The CI/CD tool builds and runs tests. Assuming they pass:
→ The commit is merged into master
→ The update is rolled out to canary
→ Engineers review the code asynchronously
→ The update is rolled out to full production after a certain period of time.
If you feel it’s dangerous to deploy code to production without a review, that’s part of the reason for canaries — by reducing the impact of errors, we free ourselves to be more aggressive than would otherwise be prudent.
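Here’s a minimal sketch of what that flow could look like as a script, assuming `make test` runs your test suite; the merge and deploy helpers are placeholders for whatever your CI/CD tool and Git host actually provide:

```python
import subprocess
import time

CANARY_SOAK_SECONDS = 3 * 24 * 60 * 60  # roughly half a week


def merge_into_master(commit: str) -> None:
    print(f"merging {commit} into master")  # placeholder: call the Git hosting API


def deploy(target: str, commit: str) -> None:
    print(f"deploying {commit} to {target}")  # placeholder: call your deploy tooling


def run_pipeline(commit: str) -> None:
    """Build and test, canary, then promote to full production."""
    subprocess.run(["make", "test"], check=True)  # abort the launch if tests fail
    merge_into_master(commit)        # merge once tests pass
    deploy("canary", commit)         # code review happens asynchronously after this
    time.sleep(CANARY_SOAK_SECONDS)  # a real pipeline would schedule this, not sleep
    deploy("production", commit)     # finally, roll out to everyone
```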
Keep the following best practices in mind when canarying:
Routing: You need to decide which traffic to route to the canary. One approach is fixed routing, such as sending some customers’ traffic, or traffic from users in one country, to the canary. But the problem with that approach is that the traffic going to the canary may not exercise all the functionality. For example, users outside India don’t use UPI. Or some customers may not have certain features enabled in their plan, or may not use them even if they’re enabled. As a result, critical bugs can remain undetected in the canary and bite you after you’ve rolled it out to all users, diluting the purpose of the canary. So it’s better to route a random percentage of traffic to the canary.
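As a sketch of the idea (in practice the split usually lives in your load balancer or service mesh, not in application code), random-percentage routing is as simple as:

```python
import random

CANARY_FRACTION = 0.15  # fraction of traffic to send to the canary


def pick_fleet() -> str:
    """Choose a fleet per request, independent of who the user is.

    Because the choice is random rather than tied to a country or a
    customer, the canary sees roughly the same mix of features and
    code paths as production does.
    """
    return "canary" if random.random() < CANARY_FRACTION else "production"
```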
Metrics: You need to see metrics for your canary separately from production, rather than a single graph of both of them averaged out. Production traffic dwarfs canary traffic, so averaging the two together will hide problems with the canary, defeating the purpose of canarying.
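One way to keep them separate is to tag every metric with the fleet that emitted it, so dashboards can plot canary and production as distinct lines. A sketch using the Python prometheus_client library (the metric and label names here are my own):

```python
from prometheus_client import Counter, Histogram

# The "fleet" label distinguishes canary from production, so the two
# are never averaged into a single line on the dashboard.
REQUESTS = Counter("http_requests", "HTTP requests served", ["fleet", "status"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency", ["fleet"])


def record_request(fleet: str, status: int, seconds: float) -> None:
    REQUESTS.labels(fleet=fleet, status=str(status)).inc()
    LATENCY.labels(fleet=fleet).observe(seconds)
```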
Alerting: Having metrics is not enough. You need to pay attention to them. You can and should do this in an automated way, by having alerts that fire if the metrics are bad. It’s also a good practice for your on-call engineer to take a quick look at these every morning. Make it part of the on-call responsibility. It takes only a minute to glance at the graphs for error rate and latency. It need not be a big overhead. Companies sometimes get the tech right but don’t set up the right team processes, so things slip through the cracks.
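The automated check can be as simple as comparing the canary’s recent error rate to production’s and paging when it’s clearly worse. In this sketch, fetch_error_rate and page_oncall are hypothetical stand-ins for your metrics store and paging system:

```python
def fetch_error_rate(fleet: str, minutes: int = 30) -> float:
    """Fraction of failed requests for a fleet over the last N minutes."""
    raise NotImplementedError  # placeholder: query your metrics store


def page_oncall(message: str) -> None:
    raise NotImplementedError  # placeholder: call your paging system


def check_canary_health() -> None:
    canary = fetch_error_rate("canary")
    production = fetch_error_rate("production")
    # Alert when the canary is clearly worse than production, not merely
    # different: small wobbles are expected at low traffic volumes.
    if canary > max(2 * production, 0.01):
        page_oncall(f"Canary error rate {canary:.2%} vs production {production:.2%}")
```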
Duration: How long should you wait before rolling canaried changes out to production? I suggest waiting half a week. Why? Because if the canary does have a problem, users will complain. In a B2B context, they’ll complain to your customer, who’ll probably have a meeting to discuss what to do and then bring it to your notice. Half a week is enough time for this process to play out. If you want to be more conservative, since engineers are allowed to deploy un-reviewed code to the canary, canary for a week. On the other hand, if you want to be more aggressive, canary for 24 hours. Some problems manifest themselves only during peak hours, and most services have a daily peak, so I’d suggest canarying for at least 24 hours.
Configuration: Configure the canary identically to production. For example, don’t use an instance type with less memory to save cost, since it introduces another variable into the mix. Like a scientist, you want to change only one factor — the code — while keeping everything else the same so that you can attribute any changes you see to the code being canaried.
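One way to enforce this is to derive the canary’s configuration from production’s, overriding only the code version. A tool-agnostic sketch (the keys and values are made up):

```python
# Production is the single source of truth for instance type, memory, etc.
PRODUCTION_CONFIG = {
    "instance_type": "m5.large",
    "memory_mb": 8192,
    "image": "myapp:1.42",
}

# The canary differs in exactly one factor: the code being canaried.
CANARY_CONFIG = {**PRODUCTION_CONFIG, "image": "myapp:1.43-rc1"}
```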
Multiple instances: If you’re running multiple instances of your infrastructure, or multiple datacenters, one approach to canarying is to pick the least important one and designate that entire instance as the canary, rather than sub-dividing by percentage. That is, if you have US, Germany, India and Korea datacenters, and Korea is least important to you, use Korea as your canary, without further splitting traffic within Korea by percentage. Deploy your changes to Korea and, after some time, to all your other datacenters.
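A sketch of that approach, with deploy_to as a placeholder for your deployment tooling; the ordering encodes which datacenter you’re willing to break first:

```python
import time

ROLLOUT_ORDER = ["korea", "india", "germany", "us"]  # least important first
SOAK_SECONDS = 24 * 60 * 60


def deploy_to(datacenter: str, version: str) -> None:
    print(f"deploying {version} to {datacenter}")  # placeholder: call your deploy tooling


def rollout(version: str) -> None:
    canary, *rest = ROLLOUT_ORDER
    deploy_to(canary, version)  # the least important datacenter is the canary
    time.sleep(SOAK_SECONDS)    # let it soak; a real tool would schedule this
    for datacenter in rest:
        deploy_to(datacenter, version)
```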
Serverless: Serverless platforms like AWS Lambda let you specify a percentage of traffic to be routed to the canary, and they take care of managing the physical resources to make that happen. If you’re not using serverless, you need to figure out how to manage your physical resources yourself to achieve the desired split. For example, if you’re running VMs with autoscaling, you probably want to define a canary autoscaling group with a minimum and maximum of one instance, and a production autoscaling group with a minimum of zero instances and a high maximum. To update the canary, you remove it from the load balancer, update it, and re-attach it.
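On the serverless side, for example, a Lambda weighted alias does the splitting for you. Here’s a sketch using boto3 (the function name, alias and version numbers are made up, and credentials and region are assumed to be configured):

```python
import boto3

lambda_client = boto3.client("lambda")

# The "live" alias keeps pointing at the stable version 1, while 15% of
# invocations are routed to version 2, the canary.
lambda_client.update_alias(
    FunctionName="my-service",
    Name="live",
    FunctionVersion="1",
    RoutingConfig={"AdditionalVersionWeights": {"2": 0.15}},
)
```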
Canary Percentage: Use a percentage that makes problems stand out clearly. It shouldn’t be subtle, because if a problem is easy to miss, it will be missed. It should be so prominent that even an unobservant engineer will notice it. If you’re not sure about the right percentage, err on the higher side. For example, if you’re not sure whether 5% or 15% is the right percentage, use 15%. If you choose too low a percentage, you’ll miss the problem in the canary and end up affecting 100% of users when you roll out. Better to affect 15% instead.
Singletons: Canarying doesn’t apply to singleton resources. For example, if you have a number of microservices, and multiple instances of some but only one instance of others, you can’t canary changes to the latter. As another example, you typically have only one database, so you can’t canary changes to the database, like adding indices or changing the schema. Canarying is only for something that has multiple instances.
In addition to the preceding best practices for canarying, I suggest the following best practices for deployments:
All code should eventually be updated. Don’t run quarter- or year-old code in some datacenters, as some companies end up doing. You prevent this by automatically promoting all canaried changes to production after a delay. Or you can do it manually, say by having a weekly launch cycle. However you do it, do it. All code should be refreshed sooner or later.
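One way to guarantee this is a scheduled job that promotes any canary older than its soak period. In this sketch, list_canaries and promote are hypothetical stand-ins for your deployment tooling:

```python
from datetime import datetime, timedelta, timezone

MAX_CANARY_AGE = timedelta(days=4)  # nothing should linger on the canary forever


def list_canaries() -> list[dict]:
    """Return active canaries, each with a 'version' and a 'deployed_at' time."""
    raise NotImplementedError  # placeholder: query your deployment system


def promote(version: str) -> None:
    raise NotImplementedError  # placeholder: roll the version out to production


def promote_stale_canaries() -> None:
    """Run on a schedule (say hourly) so no code is left behind."""
    now = datetime.now(timezone.utc)
    for canary in list_canaries():
        if now - canary["deployed_at"] > MAX_CANARY_AGE:
            promote(canary["version"])
```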
Some companies have a manual launch process, rather than automatic continuous deployment. For example, they may launch every Monday. In such cases, every engineer should be empowered to launch sooner by following a documented process. If there’s something important to launch, whether a feature or a bug fix, and it’s been adequately tested, and it’s being rolled out carefully, you shouldn’t have to wait for an arbitrary date on a calendar.
Non-technical people should not be empowered to bypass launch processes: they don’t understand or prioritise engineering norms, and letting them override the process defeats the point of having well thought out engineering processes.
You can have multiple stages in your rollout process, like 10% of users for a day, then 20%, then 50%, then 100%. Or if you have four datacenters, you can deploy to one more each day. But this is overkill unless you know you need it. Having one canary stage and then rolling out to 100% gives you most of the benefit of canarying with less complexity.
You can tweak these percentages over time, anyway, so don’t succumb to analysis paralysis.