What I'd Like From a Serialisation Protocol
Standard data structures: Serialisation formats should model data using standard programming language data structures like arrays, maps, sets and enums. Don't invent new ways to model data. As a counter-example, Google's protocol buffers offer repeated fields, which are fields that can have multiple values. That's a bad feature in a serialisation format. Instead, just model it as an array. As a second example, XML offers elements and attributes. Serialisation formats shouldn't come up with such non-standard ways to model data. Otherwise, you force engineers to figure out how to map the data model used in the serialisation format to the one used in the language. This mapping adds complexity without adding business value.
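Here's a minimal sketch in Java of what I mean, with made-up type and field names: multiple values are just a standard list, which maps directly to a JSON array, with no special "repeated field" concept to learn.

```java
import java.util.List;

public class Order {
    record LineItem(String product, int quantity) {}

    // "The order has many line items" is modelled with a standard List,
    // which serialises naturally as a JSON array.
    record OrderRecord(String orderId, List<LineItem> items) {}

    public static void main(String[] args) {
        OrderRecord order = new OrderRecord("o-1",
            List.of(new LineItem("pen", 2), new LineItem("notebook", 1)));
        System.out.println(order);
    }
}
```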
Self-describing: Data should be self-describing, unlike {Ramesh, true}. What does the true mean? It’s anybody’s guess! {name: Ramesh, is_male: true} is better. Don’t try to save space by omitting the field names. When data lives for years, and the schema evolves independently, the data should track its own schema. You shouldn’t need an external schema to decode the data, the way protobufs do. The schema should travel with the data. Otherwise you may not know which version of the schema to use to interpret the data later on. It’s an unnecessary optimisation in 2022.
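A small sketch of the difference, with made-up data: the positional form is meaningless without the schema that produced it, while the named form carries its own meaning.

```java
import java.util.List;
import java.util.Map;

public class SelfDescribing {
    public static void main(String[] args) {
        // Positional encoding: is `true` is_male? is_manager? is_active?
        // You can't tell without the external schema that produced it.
        List<Object> positional = List.of("Ramesh", true);

        // Self-describing encoding: the field names travel with the values,
        // so a reader years later can still interpret the data.
        Map<String, Object> named = Map.of("name", "Ramesh", "is_male", true);

        System.out.println(positional); // [Ramesh, true]
        System.out.println(named);      // {name=Ramesh, is_male=true} (order may vary)
    }
}
```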
Backward-compatibility: Serialisation formats shouldn’t be rigidly typed in ways that restrict evolution. For example, imagine an e-commerce app that offers discounts. To apply the discount, the client app makes a call to the server with a discount percentage like 9%. Say the initial client and server both assume integer discounts, not 9.5%. Everything is working well. Now the server is enhanced to support fractional discounts like 9.5%. The client app should not break if it’s not updated. Protocol buffers don’t handle this gracefully, since changing the field’s type changes the binary format. This is a mis-feature, and one that JSON doesn’t have. Many serialisation formats are worse than JSON in multiple ways: their designers don’t understand what makes JSON excellent, so the formats end up over-designed and worse. Programming languages don’t have this flaw either. In Java, for example, if you have a function applyDiscount(int discount_percentage) and you change it to applyDiscount(double discount_percentage), the call sites will continue to work. Systems should not be so tightly coupled that changing one part breaks another. Such rigid coupling restricts evolution and causes outages.
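Here's a sketch of that Java point, with illustrative names: widening the parameter type from int to double doesn't break existing call sites that pass whole numbers.

```java
public class Discounts {
    // Originally: static void applyDiscount(int discountPercentage) { ... }
    // Widened to accept fractional discounts:
    static void applyDiscount(double discountPercentage) {
        System.out.println("Applying " + discountPercentage + "% discount");
    }

    public static void main(String[] args) {
        applyDiscount(9);    // old call site passing an integer still works
        applyDiscount(9.5);  // new fractional discounts also work
    }
}
```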
First do no harm: Serialisation formats shouldn’t add new pitfalls over JSON. For example, say you have an Employee protocol buffer schema that has three fields: name, salary and manager. You deserialise a buffer, create a new blank buffer, copy all three fields to the new one, and serialise that. Is that the same as the original? You’d think so, but the original could have a field like bonus that the new one won’t have, since you copied only the name, salary and manager. But if bonus is not part of the schema, how did the original buffer have a bonus? The answer is that the original buffer was created by a separate program that used a different schema, one that did have a bonus. JSON doesn’t have this complexity, since it’s dynamically typed. Just as doctors are trained to first do no harm, when defining a serialisation format, stop and ask whether you’re instead creating new problems compared to JSON. Look at this huge list of do’s and don’ts when updating a protocol buffer!
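Here's a sketch of the pitfall using plain maps rather than a real protobuf API, with made-up field names: copying only the fields you know about silently drops a field added by a newer schema, while copying the dynamic structure wholesale doesn't.

```java
import java.util.HashMap;
import java.util.Map;

public class CopyPitfall {
    public static void main(String[] args) {
        // Data produced by a newer program whose schema also has "bonus".
        Map<String, Object> original = Map.of(
            "name", "Ramesh", "salary", 100000, "manager", "Kartick", "bonus", 5000);

        // Field-by-field copy against the older three-field schema: "bonus" is lost.
        Map<String, Object> fieldByField = new HashMap<>();
        fieldByField.put("name", original.get("name"));
        fieldByField.put("salary", original.get("salary"));
        fieldByField.put("manager", original.get("manager"));

        // Wholesale copy of the dynamic structure: nothing is lost.
        Map<String, Object> wholesale = new HashMap<>(original);

        System.out.println(fieldByField.containsKey("bonus")); // false
        System.out.println(wholesale.containsKey("bonus"));    // true
    }
}
```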
Constraints, not data types: If you want to prevent illegal values, the serialisation format should let you define constraints, like a minimum and maximum value, not data types. For example, if you’re applying a discount while placing an order, the discount percentage should be between 0 and 100. More than 100 implies that you’re paying the customer to place the order, which doesn’t make sense. How do you model this? You can model it as an integer, but that’s not the best way, since you could have a negative discount, which doesn’t make sense. You could use an unsigned integer, but that’s not the best fit either, since a value like 300 is still not valid: a 300% discount means that you’re paying the customer to buy rather than the customer paying you! You could use an unsigned 8-bit integer, but that still permits invalid values up to 255%. The right way to model it is not as a data type but as a constraint (discount >= 0 AND discount <= 100). Once you model it as a constraint, whether it’s integer or floating-point is an irrelevant detail. Constraints focus your attention on the important part, while data types focus your attention on minutiae. As a second example, if you’re tracking how many items are included in an order in an e-commerce store, it’s a positive integer. Not unsigned, since you can’t order zero items. As a third example, if you’re tracking how much petrol was pumped at a fuel pump, it’s an unsigned float, but programming languages don’t have such a data type. The mistake in all these cases is thinking in terms of implementation types when you should be thinking in terms of business logic, leaving it to the serialisation format to figure out how to encode it. So, don’t specify data types; specify constraints.
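Here's a sketch of modelling the rule as a constraint, with illustrative names: the check is the same whether the value arrives as an integer or a float, because the business rule is 0 <= discount <= 100, not "fits in 8 bits".

```java
public class DiscountConstraint {
    // The constraint, expressed once. The underlying numeric type is irrelevant.
    static double validateDiscount(double discount) {
        if (discount < 0 || discount > 100) {
            throw new IllegalArgumentException(
                "Discount must be between 0 and 100, got " + discount);
        }
        return discount;
    }

    public static void main(String[] args) {
        validateDiscount(9);    // ok
        validateDiscount(9.5);  // ok
        try {
            validateDiscount(300); // rejected: no primitive type alone expresses this rule
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```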
Efficient: Suppose you have a 100 MB encoded file on disk containing information about every city in the world, and you want to look up information for a particular city, which is only 200 KB. In this case, the amount of data read from disk should be closer to 200 KB than to 100 MB. This is not the case for, say, JSON: you have to read and parse the entire file from disk.
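Here's a sketch of what I mean, assuming a hypothetical file layout with a small index mapping each city to the byte offset and length of its record: you seek straight to the record and read only its 200 KB, never the other 100 MB.

```java
import java.io.RandomAccessFile;
import java.util.Map;

public class PartialRead {
    record Location(long offset, int length) {}

    // Reads only one city's record, using an index loaded separately
    // (how the index itself is stored is out of scope for this sketch).
    static byte[] readCity(String path, Map<String, Location> index, String city)
            throws Exception {
        Location loc = index.get(city);
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] record = new byte[loc.length()];
            file.seek(loc.offset());  // jump straight to the record
            file.readFully(record);   // read only this city's bytes
            return record;
        }
    }
}
```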
Instead of storing field names, protocol buffers use a number for each field. This introduces more complexity, like remembering not to change a field’s number when renaming it. And it amounts to an ad-hoc, poor compression format. If you want compression, use a standard format like lz4.
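For instance, you can layer a general-purpose compressor over a plain JSON payload. This sketch uses gzip because it ships with the JDK; lz4 would need a third-party library such as lz4-java.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressJson {
    // Compresses the serialised payload as a separate, standard layer,
    // instead of baking ad-hoc space savings into the serialisation format.
    static byte[] gzip(String json) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // On a large payload, the repeated field names compress extremely well.
        String json = "[{\"first_name\": \"Kartick\"}, {\"first_name\": \"Ramesh\"}]";
        byte[] compressed = gzip(json);
        System.out.println(json.length() + " bytes -> " + compressed.length + " bytes");
    }
}
```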
In any case, compression can be done transparently to the programmer. Suppose you have an array like:
[{first_name: Kartick, last_name: Vaddadi}, {first_name: Ramesh, last_name: Naidu}]
With a million elements, you’re storing the key first_name a million times. If that’s a concern, the encoder can transparently convert it to:
[{1: Kartick, 2: Vaddadi}, {1: Ramesh, 2: Naidu}]
key: {first_name: 1, last_name: 2}
Here, the encoder has assigned a mapping from strings to integers, and stored the integer 1 a million times instead of first_name. The mapping is carried with the data, so you can never end up with data you don’t know how to interpret.
Importantly, this is an implementation detail, like encryption in HTTPS, and not exposed in the API, which still represents fields as strings.
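Here's a minimal sketch of such an encoder (not a real library): it replaces repeated string keys with small integers and carries the key table alongside the data, so the result is still self-describing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyInterning {
    // Returns a structure with two parts: "key" is the string-to-integer table,
    // "rows" is the data with string keys replaced by those integers.
    static Map<String, Object> encode(List<Map<String, Object>> rows) {
        Map<String, Integer> keyTable = new LinkedHashMap<>();
        List<Map<Integer, Object>> encodedRows = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<Integer, Object> encodedRow = new LinkedHashMap<>();
            for (Map.Entry<String, Object> field : row.entrySet()) {
                // Assign the next integer id the first time a key is seen.
                int id = keyTable.computeIfAbsent(field.getKey(), k -> keyTable.size() + 1);
                encodedRow.put(id, field.getValue());
            }
            encodedRows.add(encodedRow);
        }
        return Map.of("key", keyTable, "rows", encodedRows);
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
            Map.of("first_name", "Kartick", "last_name", "Vaddadi"),
            Map.of("first_name", "Ramesh", "last_name", "Naidu"));
        System.out.println(encode(rows));
        // e.g. {key={first_name=1, last_name=2},
        //       rows=[{1=Kartick, 2=Vaddadi}, {1=Ramesh, 2=Naidu}]}
        // (id assignment order may vary)
    }
}
```

A decoder would read the key table first and restore the string keys, so the integer ids never leak into the API, exactly as described above.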