How to Define SLOs, SLIs and SLAs
If you’re selling a B2B service, customers will ask you what your SLA is. How do you define one?
SLI
Service Level Indicators are the parameters you’ve chosen to measure in order to gauge how well your service is doing. I suggest the following SLIs (a rough measurement sketch follows the list):
Availability, which is 100% - 5xx error rate [1]. Measure this over at least a month [2].
Peak latency, as measured over a one-hour interval. This ensures that your system is fast under load.
Crash rate, if you have any client-side code, such as a client app or SDK [3].
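To make the three SLIs above concrete, here’s a minimal sketch of how they might be computed from raw counts. The function names and numbers are illustrative, not taken from any particular monitoring tool:

```python
# Illustrative only: compute the three suggested SLIs from raw counts
# collected over the measurement window (at least a month for availability).

def availability_pct(total_requests: int, errors_5xx: int) -> float:
    """Availability = 100% - 5xx error rate."""
    return 100.0 * (1 - errors_5xx / total_requests)

def peak_latency_s(hourly_latencies_s: list[float]) -> float:
    """Worst latency seen across one-hour windows in the period."""
    return max(hourly_latencies_s)

def crash_rate_pct(crashes: int, sessions: int) -> float:
    """Crashes per user session, for a client app or SDK."""
    return 100.0 * crashes / sessions

print(availability_pct(1_000_000, 2_300))    # 99.77%
print(peak_latency_s([0.8, 1.2, 2.6, 1.1]))  # 2.6 seconds
print(crash_rate_pct(40, 10_000))            # 0.4%
```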
SLO
Once you’ve identified your SLIs, set a target for each; these targets constitute your Service Level Objectives. In other words, an SLO is of the form SLI <= Target or SLI >= Target.
I suggest the following SLOs:
Availability >= 99% for an early-stage startup whose focus is on iterating fast rather than fine-tuning, and >= 99.9% for an evolved startup.
Peak latency < 3 seconds.
Crash rate < 1% for an early-stage startup, and < 0.1% for an evolved startup.
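One way to make these targets operational is to encode each SLO as an “SLI compared against a target” check, as in this sketch. The measured values are made up; the targets are the early-stage suggestions above:

```python
# Illustrative only: each SLO is just an SLI, a comparison, and a target.
SLOS = {
    "availability_pct": (">=", 99.0),  # early-stage target
    "peak_latency_s":   ("<",  3.0),
    "crash_rate_pct":   ("<",  1.0),   # early-stage target
}

def meets_slo(value: float, op: str, target: float) -> bool:
    return value >= target if op == ">=" else value < target

# Example: measured SLIs for some month (made-up numbers).
measured = {"availability_pct": 99.4, "peak_latency_s": 2.1, "crash_rate_pct": 0.7}
for name, (op, target) in SLOS.items():
    status = "met" if meets_slo(measured[name], op, target) else "missed"
    print(f"{name}: {measured[name]} {op} {target} -> {status}")
```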
SLA
Once you have SLIs and SLOs, the final step is to come up with a Service Level Agreement. This is mostly a legal, rather than an engineering, document. When an engineer says “SLA”, they generally mean “SLO”.
Monitor what your actual service levels are and don’t over-promise. For example, don’t offer a four nines (99.99%) SLA — even AWS and Google Cloud don’t. Either your tech is more stable than theirs, or you haven’t thought it through.
There are two approaches you can choose from:
Offer a credit proportional to your error rate, as Cloudflare does. That is, if your error rate in a particular month is 1%, and you charge $100/month, offer them a $1 credit for that month.
Offer a 100% credit if your availability in a particular month is below a certain threshold, as RDS does if the availability is less than 95% [4].
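Here is a sketch of the two schemes above, using a made-up monthly fee and the 95% threshold mentioned for RDS; these are not anyone’s actual contract terms:

```python
# Illustrative only: two ways to turn a missed SLA into a credit.

def proportional_credit(monthly_fee: float, error_rate_pct: float) -> float:
    """Credit scales with the month's error rate (the Cloudflare-style approach)."""
    return monthly_fee * error_rate_pct / 100.0

def threshold_credit(monthly_fee: float, availability_pct: float,
                     threshold_pct: float = 95.0) -> float:
    """Full credit only if availability drops below a threshold (the RDS-style approach)."""
    return monthly_fee if availability_pct < threshold_pct else 0.0

print(proportional_credit(100.0, 1.0))  # $1 credit on a $100/month plan with a 1% error rate
print(threshold_credit(100.0, 94.0))    # $100 credit: availability fell below 95%
print(threshold_credit(100.0, 98.0))    # $0: above the threshold
```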
SLAs should also define supported platforms. For example, you might support only the latest versions of Chrome, Edge, Safari, Android and iOS, with a one-month grace period to upgrade, except Android, which gets a one-year grace period.
Footnotes
[1] With an AWS Classic Load Balancer, a 5xx error can be generated by the backend or by the LB, so you should add them: HTTPCode_Backend_5XX + HTTPCode_ELB_5XX. There’s also a SpilloverCount of requests that were rejected because the backends are all overloaded. I don’t know if that’s included in the HTTPCode_ELB_5XX or should be added to the above formula.
Exclude 4xx errors, since those are beyond your control.
Exclude errors that are solved by retry within your code — the idea is to capture 5xx errors as seen by users. For example, if a load balancer encounters a Backend Connection Error, it retries with a different backend. Users are insulated from this error.
If users interact with your backend only via your own client code, rather than making direct network requests to your server, then you should add to the 5xx error rate any errors introduced by the client code. That is, if the 5xx rate is 1%, but your client SDK throws a NullPointerException 0.5% of the time, then your error rate is actually 1.5%. Conversely, if some server errors are hidden by a client-side retry, then your error rate reduces.
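Putting this footnote together, here’s a rough sketch of how the monthly error rate might be pulled from CloudWatch for a Classic Load Balancer and combined with client-side errors. It assumes boto3, a load balancer named “my-clb” (a placeholder), and a client-error count coming from your crash-reporting tool; retried backend errors and SpilloverCount are not handled:

```python
# Rough sketch: last-30-days error rate from Classic Load Balancer metrics,
# plus errors thrown by our own client SDK. The load balancer name and the
# client-error count are placeholders, and retried backend errors and
# SpilloverCount are ignored.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

def monthly_sum(metric_name: str) -> float:
    """Sum a metric's daily totals over the last 30 days."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-clb"}],
        StartTime=start,
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

server_errors = monthly_sum("HTTPCode_Backend_5XX") + monthly_sum("HTTPCode_ELB_5XX")
requests = monthly_sum("RequestCount")
client_errors = 500  # placeholder: e.g. SDK exceptions reported by your crash tool

error_rate = 100.0 * (server_errors + client_errors) / requests if requests else 0.0
print(f"Error rate: {error_rate:.3f}% -> availability: {100 - error_rate:.3f}%")
```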
[2] Outages occur infrequently, so you don’t want to measure availability over a fortnight, claim you have excellent availability, and then see the number swing the other way the next fortnight.
[3] Crashes can happen for multiple reasons: exceptions, nil dereference in Swift, using too much memory, using too much CPU in the background on iOS, certain security violations, etc. Count all of them.
Don’t count crashes that happen in the background. They’re very common (at least on iOS), and they don’t affect the user experience, because the app transparently restarts the next time the user accesses it. Of course, if a background crash is visible to the user, like navigation or music playback stopping, it should be counted.
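A small sketch of that counting rule, with made-up report fields: count every kind of crash, but skip background crashes unless the user could see the effect.

```python
# Illustrative only: count every kind of crash, but skip background crashes
# unless the user could see the effect. The report fields are made up.
from dataclasses import dataclass

@dataclass
class CrashReport:
    kind: str            # "exception", "nil_dereference", "oom", "background_cpu", ...
    in_background: bool
    user_visible: bool   # e.g. navigation or music playback stopped

def should_count(r: CrashReport) -> bool:
    return (not r.in_background) or r.user_visible

def crash_rate_pct(reports: list[CrashReport], sessions: int) -> float:
    return 100.0 * sum(should_count(r) for r in reports) / sessions

reports = [
    CrashReport("exception", in_background=False, user_visible=True),
    CrashReport("oom", in_background=True, user_visible=False),  # not counted
    CrashReport("oom", in_background=True, user_visible=True),   # counted
]
print(crash_rate_pct(reports, sessions=1_000))  # 0.2%
```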
[4] As a customer, I’d consider 95% the real SLA of RDS. Any promise not backed by a financial penalty should be taken with a grain of salt.