Metrics You Should Be Monitoring
If you’re doing DevOps, you should be monitoring the following metrics:
Error rate, as observed by users. This should be measured at the load balancer, as 5xx count / request count.
Availability = 100% - error rate, so it’s measuring basically the same thing. 1
Measure availability over at least a month. Outages typically occur infrequently, so if you measure availability over just a week, you may claim excellent availability one week only for the picture to flip 180 degrees the next.
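As a rough sketch of both formulas (assuming you've already exported per-day request and 5xx counts from your load balancer's metrics; the numbers below are made up):

```python
# Sketch: error rate and availability over a month, from hypothetical
# per-day request and 5xx counts exported from the load balancer.
daily_requests = [1_200_000] * 30   # hypothetical request counts per day
daily_5xx = [300] * 29 + [40_000]   # hypothetical: one bad day with an outage

total_requests = sum(daily_requests)
total_errors = sum(daily_5xx)

error_rate = total_errors / total_requests    # 5xx count / request count
availability = 100.0 * (1 - error_rate)       # availability = 100% - error rate

print(f"error rate:   {error_rate:.4%}")
print(f"availability: {availability:.3f}%")
```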
Be careful not to count errors that are hidden by retries inside the system. For example, if an Elastic Load Balancer encounters an error talking to a backend, it retries against a different backend, so the user may never see an error. Don't count these in your error rate.

Peak latency measured at the load balancer, over an hour. When you notice high peak latency, check whether you also had a spike in requests at that time. For example, your latency is typically 100 ms but spikes to 200 ms under high load. Latency is the first metric to increase when a system is unable to scale, so treat this as an early warning: take it seriously and work on it before it becomes a crisis, like 10-second latency.
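One way to spot that correlation is to bucket requests by hour and look at peak latency next to request volume. A minimal sketch, assuming you've already parsed your access logs into (timestamp, latency in ms) pairs:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical parsed access-log entries: (timestamp, latency in ms).
requests = [
    (datetime(2024, 5, 1, 9, 15), 95),
    (datetime(2024, 5, 1, 9, 40), 110),
    (datetime(2024, 5, 1, 10, 5), 210),   # latency spikes...
    (datetime(2024, 5, 1, 10, 6), 205),
    (datetime(2024, 5, 1, 10, 7), 198),   # ...while traffic is also higher
]

by_hour = defaultdict(list)
for ts, latency_ms in requests:
    by_hour[ts.replace(minute=0, second=0, microsecond=0)].append(latency_ms)

for hour, latencies in sorted(by_hour.items()):
    # A latency spike that coincides with a traffic spike is an early sign
    # the system isn't scaling.
    print(f"{hour:%Y-%m-%d %H:00}  peak={max(latencies)} ms  requests={len(latencies)}")
```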
Crash rate, if you have a client app or SDK. This includes unexpected exceptions (like a NullPointerException); some languages, like Swift, can also crash outright in addition to throwing. Then there are crashes caused by consuming too much memory, blocking the UI thread on iOS, and more. Count them all.
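For a Python client or SDK, a minimal sketch of counting unhandled exceptions as crashes is to hook the interpreter's unhandled-exception handler. (A real crash reporter would also need to catch native crashes and out-of-memory kills, which this can't see.)

```python
import sys

crash_count = 0  # in a real SDK you'd report this to a backend, not keep it in memory

_original_hook = sys.excepthook

def _count_crash(exc_type, exc_value, exc_traceback):
    """Count any unhandled exception as a crash, then defer to the default handler."""
    global crash_count
    crash_count += 1
    # Here you'd send the exception type, message, and stack trace to your
    # crash-reporting backend.
    _original_hook(exc_type, exc_value, exc_traceback)

sys.excepthook = _count_crash
```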
High CPU utilisation, like 80%, sustained over some time. Measure this metric at all layers of your stack, like VMs and databases. This goes for many of the other metrics in this list, too.
Free memory.
Free disk space. Unlike CPU utilisation, running out of memory or disk space even momentarily causes a crash, so monitor this over the finest granularity you have data for.
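A small sketch of sampling all three on a host, using the third-party psutil library; in practice you'd run this on an interval and ship the values to your monitoring system:

```python
import psutil  # third-party; pip install psutil

cpu_percent = psutil.cpu_percent(interval=1)   # % CPU over a 1-second sample
mem = psutil.virtual_memory()                  # .available = memory still usable, in bytes
disk = psutil.disk_usage("/")                  # .free = free disk space, in bytes

print(f"cpu: {cpu_percent:.0f}%")
print(f"free memory: {mem.available / 2**30:.1f} GiB")
print(f"free disk:   {disk.free / 2**30:.1f} GiB")

# CPU is a problem only when it stays high (say, above 80% for many samples in
# a row); memory and disk are a problem the moment they hit zero, even briefly.
```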
Instances that are unhealthy for a while. Ill health happens, but it should happen only occasionally, and recover fast, as with humans.
Requests being queued for lack of capacity at any layer of the system 2. Or, worse, dropped.
Any time you have a replicated system, you should monitor replication lag.
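How you read replication lag depends on the database. As one hedged example, on a PostgreSQL streaming replica you can compute it from the timestamp of the last replayed transaction (the connection details below are placeholders):

```python
import psycopg2  # third-party PostgreSQL driver; other databases expose lag differently

# Connect to the *replica*, not the primary (placeholder connection string).
conn = psycopg2.connect("host=replica.example.internal dbname=app user=monitor")
with conn.cursor() as cur:
    # Seconds since the last transaction was replayed on this replica.
    cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
    lag_seconds = cur.fetchone()[0]

print(f"replication lag: {lag_seconds:.1f} s")
```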
Latency of a particular subsystem deep inside your stack. For example, an unloaded EBS volume should have a latency < 1ms.
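As a concrete example of a deep-in-the-stack latency, here's a sketch of deriving an EBS volume's average read latency from its CloudWatch metrics (the region, volume ID, and time window are placeholders):

```python
from datetime import datetime, timedelta, timezone

import boto3  # AWS SDK for Python

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

def metric_sum(name: str) -> float:
    """Sum of an AWS/EBS metric for one volume over the last hour, in 5-minute periods."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=name,
        Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])

total_read_time = metric_sum("VolumeTotalReadTime")  # seconds spent on reads
read_ops = metric_sum("VolumeReadOps")               # number of read operations
if read_ops:
    print(f"average read latency: {1000 * total_read_time / read_ops:.2f} ms")
```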
How much traffic you’re getting.
Billing, so that you don’t get bill shock.
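On AWS, one way to avoid bill shock is a CloudWatch alarm on the estimated monthly charges. A sketch (the alarm name, threshold, and SNS topic are placeholders; billing metrics only appear in us-east-1, and only after you enable billing alerts in the account's billing preferences):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-1000-usd",            # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,                                 # billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=1000.0,                                    # placeholder threshold in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
)
```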
1. If your system is not available at a particular time, it gives an error.
2. Queues can work in two ways:
When the underlying system is unable to keep up, requests are added to a surge queue. For example, Elastic Load Balancers do this when your app servers are overloaded. This indicates two problems: insufficient capacity and high latency. You ideally want the surge queue to be empty.
Alternatively, the queue can be modeled the other way around: all requests made to a lower-level system are enqueued, and dequeued when a response is received. For example, when an RDS database issues I/Os to an underlying EBS volume, it adds them to a queue until a response is received. This doesn't indicate a problem; it's how the system normally works. If anything, an empty queue would indicate a problem, since it means the system is not processing any requests.
So, depending on how the queue is modeled, a non-empty queue can be either a warning sign or business as usual.
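For the first kind of queue, here's a hedged sketch of watching a Classic ELB's surge queue and spillover (the region and load balancer name are placeholders; Application Load Balancers expose rejected requests through different metrics):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

def elb_stat(metric: str, stat: str) -> float:
    """Worst value of a Classic ELB metric over the last hour, in 1-minute periods."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],  # placeholder
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=[stat],
    )
    return max((p[stat] for p in resp["Datapoints"]), default=0.0)

# Ideally both of these stay at zero: a growing surge queue means requests are
# waiting for capacity, and spillover means requests were dropped outright.
print("peak surge queue length:", elb_stat("SurgeQueueLength", "Maximum"))
print("requests dropped (spillover):", elb_stat("SpilloverCount", "Sum"))
```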