How to build effective Async APIs by Terry Crowley
This is a summary of posts by Terry Crowley, former head of Office, on how to build async APIs that work well:
• Async APIs: Synchronous APIs take a predictable time based on the input size, while asynchronous ones take an unpredictable time, because they rely on the network, databases, disks, etc. People sometimes think that synchronous APIs are fast and asynchronous, slow, but that’s not exactly right. Sorting a billion numbers takes time, but you can predict that based on the input size. On the other hand, fetching a one-byte file from a server can sometimes take a long time. So, synchronous APIs are predictable (not fast), and asynchronous ones are unpredictable (not slow).
A second distinction is that asynchronous APIs can fail, and callers need to handle failure. Technically, a synchronous API can fail, too, say with an index out of bounds exception, but callers typically don’t handle such errors.
This function:
byte[] fetchFile(String url)
is an asynchronous function based on the above criteria. It’s written in a synchronous style, but when invoking this function, you need to treat it as asynchronous. For example, in an Android app, you’re not supposed to block the main thread to read from storage. If you do, your app will run fine most of the time, but will occasionally freeze. So it’s unpredictable, and you should use asynchronous I/O.
So, you can’t look at a function’s signature and say whether it’s asynchronous. Asynchronicity is a matter of behavior, not interface. This puts a responsibility on engineers building an asynchronous function to carefully document that it’s asynchronous, what exceptions it can throw, etc.
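One way to make that responsibility concrete is to surface the asynchronicity in the signature and documentation. The following is a minimal Java sketch, not the author's code: the body of `fetchFile` is a hypothetical stub standing in for a real network fetch, and `fetchFileAsync`, `FileFetcher` and `FetchDemo` are names invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FileFetcher {
    // Hypothetical stand-in for the blocking fetchFile(String url) above.
    // A real version would touch the network and could take arbitrarily long.
    static byte[] fetchFile(String url) {
        if (url.isEmpty()) {
            throw new IllegalArgumentException("empty url");
        }
        return ("contents of " + url).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Asynchronous: completes at an unpredictable time on the given pool.
     * Completes exceptionally if the fetch fails, so callers must handle failure.
     */
    static CompletableFuture<byte[]> fetchFileAsync(String url, ExecutorService ioPool) {
        return CompletableFuture.supplyAsync(() -> fetchFile(url), ioPool);
    }
}

class FetchDemo {
    public static void main(String[] args) {
        ExecutorService ioPool = Executors.newSingleThreadExecutor();
        try {
            String text = FileFetcher.fetchFileAsync("https://example.com/a.txt", ioPool)
                    .thenApply(bytes -> new String(bytes, StandardCharsets.UTF_8))
                    .exceptionally(err -> "fetch failed: " + err) // failure is handled explicitly
                    .join();
            System.out.println(text);
        } finally {
            ioPool.shutdown();
        }
    }
}
```

The return type now tells the caller that the result arrives later and may be a failure, which the bare `byte[]` signature could not express.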
• Beyond a certain level of complexity, asynchronous coupling between components is unavoidable. If it appears synchronous, that’s because some internal layer has hidden the asynchronicity.
• Timeouts: When you invoke an async API, you need a timeout, because otherwise, if the network connection or the server fails, you might end up waiting forever. You might think you can rely on the TCP connection breaking, but that may not happen for an hour. You need to apply a timeout.
Timeouts must apply when no progress is being made, not when slow progress is being made. That is, if you’re downloading a file from a server, and no byte has arrived in a minute, you want to cancel the download. But if you’re on a slow connection and it’s going to take an hour to download the file, you want to keep going.
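The difference between "no progress" and "slow progress" can be sketched as an idle timer that is reset every time data arrives. This is an illustrative Java sketch under assumptions: `IdleTimeout` is a hypothetical helper, and the clock is injected so that the demo can simulate an hour-long download without waiting.

```java
import java.util.function.LongSupplier;

class IdleTimeout {
    private final long idleLimitMillis;
    private final LongSupplier clock; // returns the current time in milliseconds
    private long lastProgressAt;

    IdleTimeout(long idleLimitMillis, LongSupplier clock) {
        this.idleLimitMillis = idleLimitMillis;
        this.clock = clock;
        this.lastProgressAt = clock.getAsLong();
    }

    // Call whenever at least one byte arrives; this restarts the idle window.
    void recordProgress() {
        lastProgressAt = clock.getAsLong();
    }

    // True only if nothing has arrived for longer than the idle limit,
    // regardless of how long the whole operation has been running.
    boolean expired() {
        return clock.getAsLong() - lastProgressAt > idleLimitMillis;
    }
}

class IdleTimeoutDemo {
    public static void main(String[] args) {
        long[] now = {0}; // fake clock, in milliseconds
        IdleTimeout timeout = new IdleTimeout(60_000, () -> now[0]);

        // Slow connection: a little data every 30 seconds for a full hour.
        for (int i = 0; i < 120; i++) {
            now[0] += 30_000;
            timeout.recordProgress();
        }
        System.out.println(timeout.expired()); // prints false: slow, but progressing

        // Stalled connection: nothing arrives for 61 seconds.
        now[0] += 61_000;
        System.out.println(timeout.expired()); // prints true: time to cancel
    }
}
```

For blocking socket reads in Java, `Socket.setSoTimeout` has a similar effect, since it bounds how long a single read may wait for data rather than the length of the whole transfer.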
• Long-running operations: Sometimes you make a call to a server that then does long-running processing. An example is uploading a video to YouTube. Don’t tie the processing to the network request that started the process. If the connection breaks, or the laptop goes to sleep, you don’t want to cancel the processing. Instead, return a success code, let the client disconnect, and model the operation as a first-class object: It should be reachable via the UI, have its own URL if you have a REST API, and have its own status. Making it first-class also lets you perform higher-level operations on the pending tasks. For example, you could prioritise shorter videos. Or you could offer a Cancel All button that cancels the processing of all uploaded videos. You can’t perform these kinds of operations if the state of the asynchronous operation is hidden inside callbacks, as it usually is. So one technique is to extract this hidden state from local variables or closures and model it as a class or struct.
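A rough Java sketch of what "first-class" could look like follows. All names here (`ProcessingTask`, `TaskQueue`, `TaskStatus`, the `/tasks/{id}` URL shape) are hypothetical and chosen only to illustrate the idea; a real service would also persist the tasks and actually run them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

enum TaskStatus { PENDING, RUNNING, DONE, CANCELED }

// The operation's state lives in an object instead of inside a callback.
class ProcessingTask {
    final String id;        // could back a URL such as /tasks/{id} (hypothetical)
    final int videoSeconds; // length of the uploaded video
    private TaskStatus status = TaskStatus.PENDING;

    ProcessingTask(String id, int videoSeconds) {
        this.id = id;
        this.videoSeconds = videoSeconds;
    }

    synchronized TaskStatus status() {
        return status;
    }

    // Canceling only affects tasks that have not finished yet.
    synchronized void cancel() {
        if (status == TaskStatus.PENDING || status == TaskStatus.RUNNING) {
            status = TaskStatus.CANCELED;
        }
    }
}

class TaskQueue {
    private final List<ProcessingTask> tasks = new ArrayList<>();

    // Registers the task and returns immediately; the client may disconnect.
    synchronized ProcessingTask submit(String id, int videoSeconds) {
        ProcessingTask task = new ProcessingTask(id, videoSeconds);
        tasks.add(task);
        return task;
    }

    // Higher-level operation 1: pick shorter videos first.
    synchronized List<ProcessingTask> pendingShortestFirst() {
        List<ProcessingTask> pending = new ArrayList<>();
        for (ProcessingTask t : tasks) {
            if (t.status() == TaskStatus.PENDING) {
                pending.add(t);
            }
        }
        pending.sort(Comparator.comparingInt((ProcessingTask t) -> t.videoSeconds));
        return pending;
    }

    // Higher-level operation 2: the "Cancel All" button.
    synchronized void cancelAll() {
        for (ProcessingTask t : tasks) {
            t.cancel();
        }
    }
}
```

Because every task is an object held in a collection, operations such as reordering or canceling everything become simple loops, which is not possible when the state is captured inside individual callbacks.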
• Cancelation is not rollback: If you offer cancelation, it means that you should no longer spend resources processing the request, and you no longer want your callback to be invoked when the operation completes. There’s no reliable way to tell the remote server to undo what’s already been done. For example, if you’re uploading a file, you have a callback that reports progress, so that you can show a progress bar. If the user presses Cancel, you cancel the upload, which means you stop reading from disk, stop using network bandwidth, and stop invoking the callback. After all, what would the callback do, since there’s no longer any progress bar to update? But there’s a half-uploaded file sitting on the server. Cleaning that up should not be the responsibility of the cancelation, but of a separate cleanup process.
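The upload example can be sketched in Java as follows. This is a single-threaded simulation under assumptions: `Upload` is a hypothetical class, and the `bytesOnServer` counter merely stands in for the partially uploaded file on a remote server.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

class Upload {
    private final AtomicBoolean canceled = new AtomicBoolean(false);
    private int bytesOnServer = 0; // stands in for the remote, possibly half-uploaded file

    // Cancelation only sets a flag; it does not undo anything already sent.
    void cancel() {
        canceled.set(true);
    }

    int bytesOnServer() {
        return bytesOnServer;
    }

    // Sends totalBytes in chunks and reports progress after each chunk.
    // Returns true if the upload completed, false if it was canceled.
    boolean run(int totalBytes, int chunkSize, IntConsumer onProgress) {
        while (bytesOnServer < totalBytes) {
            if (canceled.get()) {
                return false; // stop reading and sending; no rollback is attempted
            }
            bytesOnServer += Math.min(chunkSize, totalBytes - bytesOnServer);
            if (!canceled.get()) {
                onProgress.accept(bytesOnServer); // never invoked once canceled
            }
        }
        return true;
    }
}

class UploadDemo {
    public static void main(String[] args) {
        Upload upload = new Upload();
        // Simulate the user pressing Cancel once half of the file has been sent.
        boolean finished = upload.run(100, 10, sent -> {
            if (sent >= 50) {
                upload.cancel();
            }
        });
        System.out.println(finished);               // prints false
        System.out.println(upload.bytesOnServer()); // prints 50: leftover for a cleanup process
    }
}
```

The 50 bytes left behind are deliberately not touched by `cancel()`; removing them would be the job of a separate cleanup process, as described above.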