How To Use Alarms Effectively in DevOps
If you’re a DevOps engineer responsible for monitoring production, set alarms that automatically notify you when there’s an incident, rather than checking manually every day or waiting for customers to tell you something is broken.
Set alarms based on the value of metrics over an hour, in preference to shorter intervals like a minute, to avoid false alarms. [1] False alarms are the worst kind, because they train you to stop taking real alarms seriously. In this context, I’m counting problems that resolve themselves by the time you take a look as false alarms too. When we begin setting alarms for a system that has none, many of us go too far and generate false positives because we’re worried about missing something. It’s worth recognising this tendency so that we can adjust for it.
Set an alarm based on the mean or median value of a metric, as opposed to a high percentile like the 95th. An alarm based on the 95th-percentile latency will have a lot of false positives (the alarm rings when there’s no problem) and false negatives (the alarm doesn’t ring when there’s a problem), because the underlying data is fundamentally noisy. Garbage in, garbage out. [2]
If you get false alarms, increase the duration for which something needs to go wrong before the alarm triggers, or raise the threshold. Don’t just ignore them.
If you want the peak latency to be 100 ms, set an alarm to fire if the latency is 200 ms. Otherwise, the alarm will fire when the latency is 101 ms, and you probably didn’t intend to be that rigid about it. Typically, when you say “The latency should be < 100 ms”, you mean that the latency should be < 100 ms most of the time, not always.
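Concretely, here’s a minimal sketch of such an alarm defined with boto3 and CloudWatch, assuming an Application Load Balancer; the load balancer name, alarm name, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the *average* latency over a full hour, at double the 100 ms target.
# TargetResponseTime is reported in seconds, so 0.2 means 200 ms.
cloudwatch.put_metric_alarm(
    AlarmName="api-latency-high",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    Statistic="Average",        # the mean, not a high percentile like p95
    Period=3600,                # one hour, not one minute
    EvaluationPeriods=1,
    Threshold=0.2,              # 200 ms: double the 100 ms target
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
)
```

If this still fires falsely, the knobs to turn are Threshold and EvaluationPeriods, as described above.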
As software engineers, we live in a world of certainty: the length of an array is always exact, and 2 + 2 will never be 5. But production is a chaotic environment, with a constant low-level rate of things going wrong. So you shouldn’t alert on everything that goes wrong. As an analogy, someone somewhere is getting into a road accident as you read this, but the Prime Minister can’t ask to be alerted every time that happens. He has to have a higher threshold, and so should DevOps engineers.
There are two types of alerts. The first tracks an externally visible metric, like latency measured at the outermost level, such as the load balancer. This is called black-box monitoring, since you treat the system as a whole as a black box and observe its external behavior. Alternatively, you can alert on internal metrics, like whether the EBS volume backing a database has run out of IOPS. This is called white-box monitoring, since you’re looking inside the system to pinpoint exactly where things are going wrong. You should have both black- and white-box monitoring. Black-, because that captures what’s important to users. White-, because an internal problem may not have led to a user-visible problem yet, or, if it has, because it tells you what the problem is without you having to spend a lot of time narrowing it down under the stress of an outage. If you don’t have any alerts, start with black-box monitoring.
You should monitor at each layer of your stack. If you have load balancers sending traffic to app servers, which in turn talk to a database, identify the metrics you need at each layer and add them.
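To make the black-box/white-box split concrete, here is a sketch of one alarm of each kind at two different layers, again with boto3 and placeholder resource names: a black-box alarm on 5xx errors as seen at the load balancer, and a white-box alarm on the burst balance of the database’s storage (assuming the database is on RDS with gp2 storage).

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:alerts"  # placeholder

# Black-box: 5xx responses as users see them, measured at the load balancer.
cloudwatch.put_metric_alarm(
    AlarmName="lb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=100,              # tune to your traffic volume
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_TOPIC],
)

# White-box: an internal cause, the database storage running out of burst IOPS.
cloudwatch.put_metric_alarm(
    AlarmName="db-burst-balance-low",
    Namespace="AWS/RDS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=10,               # percent of burst balance remaining
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=[ALERT_TOPIC],
)
```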
Assign alarms to one engineer, rather than spamming them onto a Slack channel or email list. If it’s everybody’s problem, it’s nobody’s problem: people will ignore the alarm, hoping someone else will take care of it. Instead, alarms should be assigned to one engineer in PagerDuty. He can always reassign it to someone else, or post on #engineering and ask, “Can someone handle this?” But route alarms to only one person by default. Anyone interested in keeping track of what’s happening in production can check PagerDuty every morning or add themselves to the on-call rotation; just don’t page the whole team.
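If your alarms don’t already reach PagerDuty through a native CloudWatch integration, here is a minimal sketch of triggering an incident via the PagerDuty Events API v2. The routing key belongs to a single PagerDuty service, so the incident lands with whoever is currently on call for that service rather than the whole team; the key and payload values are placeholders.

```python
import requests

ROUTING_KEY = "your-32-character-integration-key"  # placeholder

# Trigger a PagerDuty incident for a single service; PagerDuty assigns it to
# whoever is currently on call for that service's escalation policy.
response = requests.post(
    "https://events.pagerduty.com/v2/enqueue",
    json={
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": "db-burst-balance-low",  # repeat alerts fold into one incident
        "payload": {
            "summary": "EBS burst balance is 0 on my-db",
            "source": "cloudwatch",
            "severity": "critical",
        },
    },
    timeout=10,
)
response.raise_for_status()
```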
Create a playbook describing how to handle common problems in the following format:
Alarm: EBS Burst balance is 0
What it means: The EBS volume backing the database has run out of IOPS, so the database will be slow.
What to do: Increase the database storage size by 20%. This increases the IOPS.
As you can see, it should be short and to the point, not a page. Make the playbook available in a place everyone knows about, like Notion, give all engineers write access, and announce its existence on Slack.
Also document all the information needed to handle an incident, like passwords to log in to servers, in a place everyone knows about.
Before adding an entry to the playbook, see if the problem can be fixed automatically. In the above example, you can configure the database to autoscale its storage.
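For example, if the database runs on RDS, you can enable storage autoscaling so the volume grows before it runs out; a minimal sketch with boto3, assuming a placeholder instance identifier:

```python
import boto3

rds = boto3.client("rds")

# Let RDS grow the volume automatically instead of paging a human to do it.
# Growing gp2 storage also raises its baseline IOPS.
rds.modify_db_instance(
    DBInstanceIdentifier="my-db",   # placeholder
    MaxAllocatedStorage=1000,       # ceiling, in GiB, up to which RDS may autoscale
    ApplyImmediately=True,
)
```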
[1] Using an hour-long window doesn’t mean it will take an hour to respond to incidents. If your normal latency is 100 ms, you set an alarm at 200 ms averaged over an hour, and an incident causes the latency to spike to 10 seconds, the trailing-hour average crosses 200 ms in just one minute, so the alarm fires within a minute.
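A quick sanity check of that arithmetic, assuming 59 minutes at the normal 100 ms followed by one minute at 10 seconds:

```python
# Trailing one-hour average latency, one minute after the spike begins.
normal_ms, spike_ms = 100, 10_000
avg_ms = (59 * normal_ms + 1 * spike_ms) / 60
print(avg_ms)  # 265.0, already past the 200 ms threshold
```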
[2] Tail latency is a hard problem, and most startups have dozens of more important problems to solve.