Reimagining Databases

Oct 11, 2021

Databases were developed decades ago. Postgres, for example, in the 80s. One advantage of such a legacy is that databases are tried and tested. You’re not going to run into Postgres bugs, only your bugs. They’re also flexible, having evolved to meet a variety of needs. The risk of you regretting your decision to use Postgres because your needs fall outside the supported use cases is low.

But it’s worth periodically re-imagining things from scratch. If databases didn’t already exist, how would we build them? This thought exercise overcomes the status quo bias, where we think that the way things already are are right, and new ways are wrong. So, as you read this post, resist the urge to reject these new ideas, and consider them for a while.

Let’s get started:

Databases could come with a HTTP server builtin, and accept requests directly from client devices. Many HTTP endpoint implementations just pass through to the database. For example, in a notes app, a GET /notes/5389538595 request just translates to SELECT * FROM notes where id = 5389538595. Such translation layers that don’t have business logic add no value, and can be skipped. If you’re concerned about security, there can be multiple layers of controls as to which data gets exposed via HTTP: only some tables, and only some rows. And only some columns. For example, if a user table maintains a trust_score that determines whether the user is judged to be malicious, this column can be marked as not served via HTTP to clients. Another layer of access control can be allowing only reads but not writes. Databases can have an external port, which is exposed to the public Internet and subject to this access control, and an internal port, which is exposed only within the datacenter and not subject to this access control.
Databases should come with client SDKs for web, Android and iOS. If you want to fetch a row, it’s more convenient to use a client SDK than making raw HTTP requests. And you get benefits like caching, offline and retries. If you think “That’s not the database’s job” there’s no reason it shouldn’t be. Solve the problem end to end, simplifying the development experience. And developers who don’t want such client SDKs don’t need to use them, so it doesn’t take away anyone’s flexibility.
Databases should come with server SDKs for common platforms like Go or Java. You shouldn’t have to use a framework like Hibernate for basic usability. The server SDK should translate JSON to objects so that you can access order.discount_percentage as opposed to having to parse a raw SQL response string as with today’s databases. Don’t make developers cobble together multiple tools for common use cases. It’s just more work for no extra business value. If a developer does have a preferred tool, he should be able to use it, but databases should come with batteries included, as the Python folks put it.
Databases should be cloud-native. For example, if a database is consistently overloaded, it should make a call to the cloud to upgrade to the next bigger instance. A database can’t detect every case, but it should be able to detect obvious cases like 90% memory or CPU utilisation sustained over a period of time (like an hour) regularly. If the database is already using the biggest instance class, it should automatically set up a read replica of itself, and the server SDK (see previous point) should automatically route read requests there. If a read replica can’t be set up automatically, you should still be able to do it declaratively by clicking a Read Replica checkbox in the admin panel and have the database coordinate with the server SDK. You shouldn’t have to make code changes to route read requests to a different database URL and deploy the modified code. If the write traffic is beyond the ability of a single server, the database should scale horizontally, either automatically. Or manually, but the manual work should again consist just of checking a checkbox. The cognitive load should be far less than it is with today’s databases, which provide building blocks in a bottom-up manner, rather than looking at the user’s problem top-down. This is an area in which engineers can learn from UX designers. You should also be able to configure geographic redundancy declaratively — say what kind of redundancy you want: another availability zone within the region, with synchronous replication. Or asynchronous. Or another region.
Databases should represent data in a hierarchical form like JSON, not a tabular form. The tabular form doesn’t correspond to how UIs work, nor to how data is exchanged between the client and the server (JSON), nor to how data is exchanged between servers in the datacenter (JSON or protocol buffers). It’s awesome to work with JSON as compared to (say) XML, because JSON maps to programming language abstractions like arrays and objects, so you don’t have the cognitive overhead or boilerplate code that converts between formats. Such conversions don’t add business value, and you as an engineer should always focus on creating business value. Here’s how a table could be represented:
```
Table orders {
  order_value >= 0
  num_items_ordered >= 1, integer
  discount_percentage (>= 0 and <= 100%) OR == null
  delivery_address {
    city != "", string
    pin_code >= 0 AND <= 999999
  }
}
```
Compare this with the typical SQL statement you’d write:
```
CREATE TABLE orders {
  order_value INT NOT NULL,
  num_items_ordered INT NOT NULL,
  discount_percentage FLOAT,
  delivery_address STRING NOT NULL
}
```
Since we’re proposing the use of JSON, we can have nested fields, like delivery_address, where nesting is a natural and logical way to represent the components of an address. Second, notice the emphasis on values rather than types. I don’t really care whether the discount_percentage is an int or float. If it’s an int, what type, like SMALLINT, MEDIUMINT or INT. These are implementation details that are less important when modeling data than the logical constraints. For example, if the discount percentage is 150%, that’s a problem, since it means we’re paying the customer to place an order. That’s more important than whether it’s an integer (150%) or float (150.5%). Types can redirect your focus away from what’s more important. And the database can infer the implementation type based on the constraints. In this example, the pin code is less than 999999, so the database can figure out that a 3-byte MEDIUMINT is the optimal storage type for it. Constraints also provide a richer modeling method than types. For example, num_items_ordered is at least 1. If you were to think in terms of types, you’d say it’s unsigned, but that means that a customer can order zero items, which doesn’t make logical sense. As a second example, a petrol station can have a field fuel_filled which is a float with value >= 0. If you think in terms of implementation types, there’s no such thing as an unsigned float. For all these reasons, you should think in terms of your domain model, at a high level, not in terms of data types, which is low level.
Databases should come with a web UI that lets non-technical stakeholders like support reps access data visually rather than writing SQL. Engineers think SQL is easy, but it’s intimidating to non-technical users. Something like Metabase should be built-in. The web UI should also show inferred data like the minimum and maximum value of each column. After all, good UX surfaces relevant information without you having to ask for it. If a query is taking too long to run, you should be able to click an Email me when it’s done check box and close the tab. Unlike today’s databases, where my laptop goes to sleep, closing the connection, preventing me from running a long-running query. When this happens after I’ve waited an hour, it’s a massive waste of time and an irritation. Provide common-sense solutions to such problems.
Databases should handle short-lived connections, as is becoming common with frameworks like FaaS. You shouldn’t have to put in hacks like the RDS proxy.
Each row can have default columns created and updated, instead of developers manually having to maintain this information. If performance is a concern, there can be an opt-out. But by default, reduce work for software engineers. Having the right defaults is important, as any UX designer will tell you. And UX applies to developers as well (developer experience) since they’re human, too.

In summary, if we rethink databases from scratch, we might end up with databases that are far simpler for engineers to use.

Kartick Vaddadi: Tech Advisor to CXOs

Discussion about this post