How to build effective Async APIs by Terry Crowley

Aug 26, 2022

This is a summary of posts by Terry Crowley, former head of Office, on how to build async APIs that work well:

• Async APIs: Synchronous APIs take a predictable time based on the input size, while asynchronous ones take an unpredictable time, because they rely on the network, databases, disks, etc. People sometimes think that synchronous APIs are fast and asynchronous ones slow, but that’s not exactly right. Sorting a billion numbers takes time, but you can predict how long from the input size. On the other hand, fetching a one-byte file from a server can sometimes take a long time. So, synchronous APIs are predictable (not necessarily fast), and asynchronous ones are unpredictable (not necessarily slow).

A second distinction is that asynchronous APIs can fail, which callers need to handle. Technically, a synchronous API can fail, too, say with an index out of bounds exception, but callers typically don’t handle such errors.

This function:

byte[] fetchFile(String url)

is asynchronous based on the above criteria. It’s written in a synchronous style, but when invoking this function, you must treat it as asynchronous. For example, in an Android app, you’re not supposed to block the main thread to read from storage. If you do, your app will run fine most of the time, but will occasionally freeze. So it’s unpredictable, and you should use asynchronous I/O.

So, you can’t look at a function’s signature and say whether it’s asynchronous. Asynchronicity is a matter of behavior, not interface. This puts a responsibility on engineers building an asynchronous function to carefully document that it’s asynchronous, what exceptions it can throw, etc.
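One way to treat a synchronous-looking function like fetchFile as asynchronous is to run it off the calling thread and handle failure explicitly. Here is a minimal sketch, not anything from the original posts: AsyncFetch, fetchFileAsync and the "fallback" value are made-up names, and the body of fetchFile is a stand-in that does no real networking.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncFetch {
    // Stand-in for the blocking fetchFile above; a real version would hit the network.
    static byte[] fetchFile(String url) throws IOException {
        if (url.isEmpty()) {
            throw new IOException("cannot fetch an empty URL");
        }
        return ("contents of " + url).getBytes(StandardCharsets.UTF_8);
    }

    // Runs the blocking call on the given executor so the caller's thread stays free.
    static CompletableFuture<byte[]> fetchFileAsync(String url, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return fetchFile(url);
            } catch (IOException e) {
                // Lambdas passed to supplyAsync cannot throw checked exceptions.
                throw new UncheckedIOException(e);
            }
        }, executor);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            String result = fetchFileAsync("", io)
                .thenApply(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .exceptionally(error -> "fallback") // the caller decides what failure means
                .join(); // blocking here only to keep the demo short
            System.out.println(result);
        } finally {
            io.shutdown();
        }
    }
}
```

The signature of fetchFileAsync now says what the documentation would otherwise have to: the result arrives later, and it may be a failure.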

• Beyond a certain level of complexity, asynchronous coupling between components is unavoidable. If it appears synchronous, that’s because some internal layer has hidden the asynchrony.

• Timeouts: When you invoke an async API, you need a timeout. Otherwise, if the network connection or the server fails, you might end up waiting forever. You might think you can rely on the TCP connection breaking, but that may not happen for an hour, so apply your own timeout.

Timeouts must apply when no progress is being made, not when slow progress is being made. That is, if you’re downloading a file from a server, and no byte has arrived in a minute, you want to cancel the download. But if you’re on a slow connection and it’s going to take an hour to download the file, you want to keep going. In the first case, you timed out after a minute, and in the second, you didn’t even after an hour.
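The difference between "no progress" and "slow progress" can be sketched as a timer that resets whenever a byte arrives. This is an illustration under assumptions: IdleTimeout and its methods are hypothetical names, and time is passed in explicitly so the example can simulate an hour without waiting. Real code would likely pass System.nanoTime().

```java
import java.time.Duration;

// Measures time since the last progress, not time since the start.
class IdleTimeout {
    private final long limitNanos;
    private long lastProgressNanos;

    IdleTimeout(Duration limit, long nowNanos) {
        this.limitNanos = limit.toNanos();
        this.lastProgressNanos = nowNanos;
    }

    // Call whenever at least one byte arrives.
    void onProgress(long nowNanos) {
        lastProgressNanos = nowNanos;
    }

    // True once the idle period has reached the limit.
    boolean hasExpired(long nowNanos) {
        return nowNanos - lastProgressNanos >= limitNanos;
    }
}

class IdleTimeoutDemo {
    public static void main(String[] args) {
        long second = Duration.ofSeconds(1).toNanos();

        // Slow connection: a chunk arrives every 30 seconds for an hour.
        IdleTimeout slow = new IdleTimeout(Duration.ofMinutes(1), 0);
        boolean slowExpired = false;
        for (long t = 30; t <= 3600; t += 30) {
            slowExpired |= slow.hasExpired(t * second);
            slow.onProgress(t * second);
        }
        System.out.println("slow download timed out: " + slowExpired);

        // Stalled connection: nothing arrives after the start.
        IdleTimeout stalled = new IdleTimeout(Duration.ofMinutes(1), 0);
        System.out.println("stalled download timed out: " + stalled.hasExpired(60 * second));
    }
}
```

In the demo, the hour-long slow download never expires, while the stalled one expires after a minute.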

• Long-running operations: Sometimes you make a call to a server that then does long-running processing. An example is uploading a video to YouTube, which can take hours to process. Don’t tie the processing to the network request that started it. If the connection breaks, or the laptop goes to sleep, you don’t want to cancel the processing. Instead, return a success code along with a URL and let the client disconnect. The client can later make a REST call to that URL to check the status.

One technique is to take the state of the asynchronous operation, which is usually hidden inside callbacks as local variables or parameters, and model it as a first-class object in your system. In the case of YouTube, in addition to the usual objects like videos, channels, comments and playlists, you’d also have pending videos. Making the operation first-class also lets you perform higher-level operations on the pending tasks. For example, you could prioritise shorter videos. Or you could offer a Cancel All button that cancels the processing of all uploaded videos. You can’t perform these kinds of operations while that state stays hidden.
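Here is a rough sketch of what pending videos as first-class objects might look like. Everything in it is an assumption for illustration: the PendingVideos class, the "/pending-videos/" URL scheme and the job IDs are invented, and an in-memory map stands in for whatever store a real service would use.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class PendingVideos {
    enum Status { PENDING, DONE, CANCELED }

    // The state that would otherwise be hidden inside a callback.
    static final class Job {
        final String id;
        final int durationSeconds;
        Status status = Status.PENDING;

        Job(String id, int durationSeconds) {
            this.id = id;
            this.durationSeconds = durationSeconds;
        }
    }

    private final Map<String, Job> jobs = new LinkedHashMap<>();
    private int nextId = 1;

    // Called once the upload is received: records the job and returns a status URL right away.
    String submit(int durationSeconds) {
        String id = "job-" + nextId++;
        jobs.put(id, new Job(id, durationSeconds));
        return "/pending-videos/" + id;
    }

    // What a later REST call to the status URL would look up.
    Optional<Status> statusOf(String statusUrl) {
        String id = statusUrl.substring(statusUrl.lastIndexOf('/') + 1);
        return Optional.ofNullable(jobs.get(id)).map(job -> job.status);
    }

    // Higher-level operation: process the shortest pending video next.
    Optional<Job> nextToProcess() {
        return jobs.values().stream()
            .filter(job -> job.status == Status.PENDING)
            .min(Comparator.comparingInt((Job job) -> job.durationSeconds));
    }

    // Higher-level operation: the Cancel All button. Returns how many jobs were canceled.
    int cancelAll() {
        int canceled = 0;
        for (Job job : jobs.values()) {
            if (job.status == Status.PENDING) {
                job.status = Status.CANCELED;
                canceled++;
            }
        }
        return canceled;
    }

    public static void main(String[] args) {
        PendingVideos pending = new PendingVideos();
        String longVideo = pending.submit(600);
        pending.submit(45);
        System.out.println("next: " + pending.nextToProcess().map(job -> job.id).orElse("none"));
        System.out.println("canceled: " + pending.cancelAll());
        System.out.println("status: " + pending.statusOf(longVideo).orElse(null));
    }
}
```

Neither nextToProcess nor cancelAll could be written if each job lived only in the local variables of the request that started it.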

• Cancelation is not rollback: If you offer cancelation, it means that you should no longer spend resources processing the request, and you no longer want your callback to be invoked when the operation completes. It does not mean undoing what has already happened: there’s no reliable way to tell the remote server to undo what’s already been done. For example, if you’re uploading a file, you have a callback that reports progress, so that you can show a progress bar. If the user presses Cancel, you cancel the upload, which means you stop reading from disk, stop using network bandwidth, and don’t invoke the callback any more. After all, what will the callback do, since there’s no longer any progress bar to update? But there’s a half-uploaded file sitting on the server. Cleaning that up should be the responsibility of a separate cleanup process, not of the cancelation.
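The upload example can be sketched like this. CancelableUpload is a hypothetical class, the StringBuilder is a stand-in for the remote server, and the chunks are plain strings rather than disk reads. The point is only that cancel() stops future work and future callbacks, and leaves what was already sent alone.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

class CancelableUpload {
    private final AtomicBoolean canceled = new AtomicBoolean(false);
    private final StringBuilder server = new StringBuilder(); // stand-in for remote storage

    // Stops future work. Deliberately does nothing about data already on the server.
    void cancel() {
        canceled.set(true);
    }

    // Sends one chunk at a time, checking the flag before each send and each callback.
    void upload(String[] chunks, IntConsumer onProgress) {
        for (int i = 0; i < chunks.length; i++) {
            if (canceled.get()) {
                return; // no more reads, no more network traffic
            }
            server.append(chunks[i]);
            if (!canceled.get()) {
                onProgress.accept(i + 1); // report how many chunks have been sent
            }
        }
    }

    // What a separate cleanup process would later find and remove.
    String dataOnServer() {
        return server.toString();
    }

    public static void main(String[] args) {
        CancelableUpload upload = new CancelableUpload();
        upload.upload(new String[] {"aa", "bb", "cc"}, sent -> {
            System.out.println("progress: " + sent);
            if (sent == 2) {
                upload.cancel(); // the user pressed Cancel
            }
        });
        System.out.println("left on server: " + upload.dataOnServer());
    }
}
```

In the demo, the third chunk is never sent and the callback is not invoked again, but the first two chunks are still on the server.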

Kartick Vaddadi: Tech Advisor to CXOs
