Measure metrics over longer periods to reduce noise
Let’s say you want to measure the peak latency of your system. You can measure it with a granularity of 1 minute:
Each data point in this graph represents the average of the latencies of all requests in a minute.
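As a rough sketch of that aggregation (the data layout is hypothetical; it assumes latency samples arrive as timestamped pairs), per-minute averages can be computed by bucketing:

```python
from collections import defaultdict

def average_latency_by_bucket(samples, bucket_seconds=60):
    """Average (unix_timestamp, latency_ms) samples into fixed-width time buckets.

    bucket_seconds=60 yields the 1-minute graph; 3600 would yield the 1-hour one.
    """
    buckets = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[int(timestamp) // bucket_seconds].append(latency_ms)
    return {bucket: sum(vals) / len(vals) for bucket, vals in sorted(buckets.items())}
```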
Alternatively, you can measure with a granularity of 1 hour:
Which granularity should you use?
In the 1-minute graph, the peak latency is 1.2 seconds, but that’s just one data point. The second-highest peak is 400 ms. What does this mean? It means that if the 1.2-second peak had not occurred, by chance, the peak latency would have fallen by two-thirds. Say you change the system’s configuration to try to reduce latency. After making this change, you measure again, and this time the peak latency falls by half, to 600 ms. Is that because of your configuration change? You can’t say. Maybe the random event that caused the spike in the first test simply didn’t recur in the second.[1]
Further, if you set an alarm to alert you when latency crosses a threshold, the alarm will produce many false positives: alerts about problems that don’t exist. These teach you to ignore the alarm, so the next time it fires, perhaps for a genuine, serious problem, you may dismiss it. You can reduce false positives by raising the alarm threshold, but that increases false negatives: you won’t be alerted to real problems.
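To see the false-positive effect concretely, here’s a small simulation (the traffic model and threshold are made up, not taken from the graphs above) that applies the same fixed threshold to 1-minute and 1-hour averages of identical data:

```python
import random

random.seed(42)
# 24 hours of 1-minute average latencies: ~100 ms baseline noise, plus rare
# random spikes that don't reflect a sustained problem.
minute_avgs = [random.gauss(100, 15) + (800 if random.random() < 0.002 else 0)
               for _ in range(24 * 60)]
hour_avgs = [sum(minute_avgs[i:i + 60]) / 60
             for i in range(0, len(minute_avgs), 60)]

THRESHOLD_MS = 150  # made-up alarm threshold
print("1-minute alarms:", sum(v > THRESHOLD_MS for v in minute_avgs))
print("1-hour alarms:  ", sum(v > THRESHOLD_MS for v in hour_avgs))
```

The hourly averages absorb brief spikes that would trip the 1-minute alarm even though the system is healthy.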
By contrast, in the 1-hour graph, the highest data point is 112 ms and the second-highest is 107 ms. So if the highest data point went away, by chance, the peak latency would fall by only 4%, as compared to 67% in the 1-minute case. If you change the configuration and the latency falls by half, to 56 ms, you can be reasonably confident that your change lowered latency.
If you set an alarm, it will have fewer false positives and fewer false negatives.[2]
Don’t work with noisy data. Garbage in, garbage out. To reduce noise in your metrics, average them over a longer period like an hour, not a short period like a minute.
The exception to tracking metrics over an hour is resources like memory, disk space, or database connections: running out causes a serious problem even if it lasts only a minute. For such metrics, set an alarm over the shortest possible time period.
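For saturation metrics like these, a sketch (the metric names and limits below are illustrative, not from a real system) would alarm on the most recent raw sample rather than a long average:

```python
def saturation_alarms(free_db_connections: int, free_disk_mb: int) -> list[str]:
    """Alarm on the latest raw sample: even one minute at zero free connections
    or no disk space is a serious problem, so don't average it away."""
    alarms = []
    if free_db_connections == 0:
        alarms.append("database connection pool exhausted")
    if free_disk_mb < 500:  # made-up safety floor
        alarms.append("disk almost full")
    return alarms
```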
[1] In fact, if the random event that caused the 1.2-second spike had happened during the second test instead of the first, you’d conclude that your configuration change increased latency from 400 ms to 1.2 seconds.
[2] Statistics that are unduly influenced by one data point are suspect. For example, if you take the max of a data set, a single high value determines your conclusion. But if you average the data, one outlier moves the result far less.
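A quick numeric illustration of this (made-up samples): dropping a single outlier collapses the max but moves the average far less.

```python
latencies_ms = [100] * 60 + [1200]  # steady traffic plus one outlier

print(max(latencies_ms), sum(latencies_ms) / len(latencies_ms))  # 1200, ~118
print(max(latencies_ms[:-1]), sum(latencies_ms[:-1]) / 60)       # 100, 100.0
```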