How to Retry Failed Requests
Retrying the right way is critical: I have seen bad retries bring down a service, both at a startup and at Google.
Don’t retry at multiple layers of your stack: if your frontend code tries three times, your load balancer tries three times, and your app server tries twice, a single failing request can put a 3 x 3 x 2 = 18x load on your database, enough to bring it down. And when the database tries to recover, the same load will take it down again. You’ve put yourself in a situation you can’t recover from.
Repeated retries can happen at the backend layer, as in the above example.
Or in the frontend. For example, you use a client SDK, which gives an error, so you retry, but the client SDK has already retried.
Or the frontend retries only once, and the backend retries only once, but that’s two layers of retries when you look at the system as a whole.
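The multiplication above can be checked with a small sketch. The function name and the list of per-layer attempt counts are illustrative assumptions, not part of any real tool; the numbers are the ones from the example.

```python
from math import prod


def worst_case_amplification(attempts_per_layer):
    """Worst-case number of requests that reach the bottom layer for one
    user action, assuming every layer exhausts all of its attempts."""
    return prod(attempts_per_layer)


# Attempt counts from the example: frontend 3, load balancer 3, app server 2.
layered = worst_case_amplification([3, 3, 2])

# The same stack when only the top layer retries.
top_layer_only = worst_case_amplification([3, 1, 1])

print(layered, top_layer_only)  # 18 3
```

Note that even one retry per layer compounds: two layers that each try twice already produce 4x load in the worst case.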
To avoid retrying at multiple layers, retry at the highest layer: If you have multiple client layers, like an SDK X built on top of a lower-level client SDK Y, retry at X.
API calls should be retried on the client side, not the server side. The client needs to retry anyway to deal with transient server errors and network issues.
Retry at the server only if you don’t control the client.
Example: You’re offering a REST API without a client SDK.
Example: The top level document load in a browser (not AJAX requests).
If you’re building an API for third parties to use, don’t ask them to retry. It’s more work for them, and they may not follow the principles in this guide, which could bring down your service.
Document your retry logic in a Google doc shared with everyone so that both backend and frontend engineers are on the same page. Retry is end-to-end.
Back off exponentially:
First error: retry immediately. Don’t add a delay, because transient errors occur even in the best service, and you don’t want to increase latency for them.
Second error: wait a random interval between 1 and 10 minutes, then retry.
Third error onwards: wait a random interval between 10 and 20 minutes, then retry.
Random is important: never write code that waits X minutes and retries, for any value of X. Otherwise, if you have an outage at 9:00, all clients will pound on the server at 9:10, then 9:20, and so on, preventing the server from recovering.
If the HTTP response has a Retry-After header, use the value in the header or the above values, whichever is higher.
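The schedule above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name, its parameters, and the assumption that the Retry-After header has already been parsed into a number of seconds (the header can also carry an HTTP date) are all mine.

```python
import random


def retry_delay_seconds(error_count, retry_after_seconds=None, rng=random):
    """Seconds to wait before the next retry.

    error_count: consecutive failures so far (1 for the first error).
    retry_after_seconds: Retry-After value already converted to seconds,
        or None if the response had no such header (assumed pre-parsed).
    rng: source of randomness; anything with a uniform(a, b) method.
    """
    if error_count < 1:
        raise ValueError("error_count must be at least 1")

    if error_count == 1:
        delay = 0.0                           # first error: retry immediately
    elif error_count == 2:
        delay = rng.uniform(1 * 60, 10 * 60)  # random 1 to 10 minutes
    else:
        delay = rng.uniform(10 * 60, 20 * 60)  # random 10 to 20 minutes

    # Honor Retry-After when it asks for a longer wait than the schedule.
    if retry_after_seconds is not None:
        delay = max(delay, retry_after_seconds)
    return delay
```

Because the delay is drawn from a range rather than fixed, clients that all failed at the same moment spread their retries out instead of arriving together.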
Don’t blindly retry all errors: many indicate problems that won’t be fixed by a retry. Retry only the following HTTP status codes:
408
429 (wait at least 10 minutes)
500
502
503
504
561
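One way to encode the list above is an allowlist, so that any status code not named here is treated as non-retryable by default. The constant and function names below are hypothetical, chosen only for this sketch.

```python
# Status codes from the list above; everything else is not retried.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504, 561})

# 429 responses should be followed by a wait of at least 10 minutes.
MIN_DELAY_429_SECONDS = 10 * 60


def should_retry(status_code):
    """True only for the status codes on the allowlist."""
    return status_code in RETRYABLE_STATUS_CODES


def minimum_delay_seconds(status_code):
    """Lower bound on the wait before retrying a given status code."""
    return MIN_DELAY_429_SECONDS if status_code == 429 else 0
```

For example, `should_retry(404)` is false, since a missing resource will still be missing on the next attempt, while `should_retry(503)` is true.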