Choosing Between RDS Multi-AZ and Read Replica
Amazon RDS offers a choice between two types of replicated databases: Multi-AZ and read replicas. Which should you use?
First, depending on your needs, the right answer could be neither. It’s not as if a plain database is unreliable. It’s backed by EBS, which is internally replicated, so it’s less likely to lose data. On top of that, RDS can be configured to back up your database every day and retain backups for up to 35 days. And since RDS is a managed service, Amazon takes on some of the responsibility for managing your database, compared to running one yourself on an EC2 instance.
Even if the underlying EBS block storage doesn’t get corrupted, what if the machine running the database crashes? This is where the built-in reliability features of databases come in, such as committing writes durably before acknowledging them, and atomic transactions that are never left half done even if the hardware fails midway through. So, if you’re an early-stage startup without specific requirements, you don’t necessarily need either Multi-AZ or read replicas. Think about it rather than automatically assuming you need one.
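As a rough illustration of the daily-backup setting mentioned above, here is a minimal sketch built around boto3's `modify_db_instance` call. The instance identifier `my-db` and the helper names are hypothetical; the 35-day ceiling is the figure quoted in the text.

```python
MAX_RETENTION_DAYS = 35  # upper limit on automated backup retention, per the text above


def backup_settings(instance_id: str, retention_days: int) -> dict:
    """Build keyword arguments for the RDS ModifyDBInstance call."""
    if not 0 <= retention_days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention must be 0..{MAX_RETENTION_DAYS} days")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,  # 0 disables automated backups
        "ApplyImmediately": True,
    }


def apply_backup_settings(instance_id: str, retention_days: int) -> None:
    """Send the change to AWS. Needs boto3 and credentials, so it is not run here."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    rds = boto3.client("rds")
    rds.modify_db_instance(**backup_settings(instance_id, retention_days))


if __name__ == "__main__":
    # "my-db" is a made-up identifier used only for illustration.
    print(backup_settings("my-db", 35))
```

Keeping the argument-building separate from the AWS call makes it easy to review the settings before anything is sent.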
If you do need one, choose Multi-AZ when you need resilience to an AZ failure: it fails over automatically, without data loss [1].
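Turning an existing instance into a Multi-AZ deployment is a single setting. The sketch below assumes a boto3 RDS client is passed in and uses a hypothetical instance name; `MultiAZ` and `ApplyImmediately` are parameters of `modify_db_instance`.

```python
def multi_az_settings(instance_id: str, enabled: bool = True) -> dict:
    """Keyword arguments that switch Multi-AZ on (or off) for one instance."""
    return {
        "DBInstanceIdentifier": instance_id,
        "MultiAZ": enabled,
        "ApplyImmediately": True,  # otherwise the change waits for the maintenance window
    }


def enable_multi_az(rds_client, instance_id: str):
    """rds_client is expected to be boto3.client("rds"); it is injected so the
    function can be exercised with a stand-in object."""
    return rds_client.modify_db_instance(**multi_az_settings(instance_id))
```

With real credentials, usage would look like `enable_multi_az(boto3.client("rds"), "my-db")`, where `my-db` is a placeholder.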
Read replicas are more flexible:
You can have one in the same AZ, different AZ in the same region, or a different region.
Read replicas can be used for read-scaling: if you’re already on the biggest instance type, you can route reads to one or more read replicas. On the other hand, the standby in Multi-AZ isn’t available for you to use as long as the primary is working correctly [2].
You can have multiple read replicas for a given primary, while you can have only one standby in Multi-AZ.
You can have a chain of read replicas — a read replica of a read replica. Again, you can’t do this with Multi-AZ.
You can have a read replica that has a different configuration from the primary, or that runs a different version of MySQL. For example, you can upgrade MySQL on the replica first, which reduces risk: if something goes wrong and the read replica is corrupted, you can just set up another one. You can’t be so casual with the primary.
You can use a read replica to support bulk exports of data, such as a Download All My Data button — let these full table scans, which typically slam the database, slow down the read replica while the export operation is in progress. It will recover once the export ends. You can’t do this on the primary, since it will slow down your service for all users.
Read replicas are flexible in all these ways.
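Read-scaling means the application has to decide which endpoint each query goes to. Here is a minimal sketch of that routing, assuming hypothetical endpoint hostnames and a deliberately naive rule that treats any statement starting with SELECT as a read.

```python
import itertools
from typing import Sequence


class EndpointRouter:
    """Send writes to the primary and spread reads across replicas round-robin."""

    def __init__(self, primary: str, replicas: Sequence[str]):
        self.primary = primary
        # With no replicas configured, every query falls back to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql: str) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    # Hostnames are placeholders, not real RDS endpoints.
    router = EndpointRouter("primary.example", ["replica-1.example", "replica-2.example"])
    print(router.endpoint_for("SELECT * FROM users"))
    print(router.endpoint_for("UPDATE users SET name = 'x'"))
```

Because replicas can lag behind the primary, a read that must see the latest write should still be sent to the primary.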
But if you don’t need the flexibility, they’re just more complex: each read replica is treated as a separate instance that you have to set up, configure and monitor. One company I work with has to deal with a read replica that’s hours out of sync with the primary, and keeps tweaking replication settings to fix it. Multi-AZ, by contrast, is modeled as a single database in RDS, and you don’t have to worry about the replication. From a DevOps point of view, Multi-AZ is configured and monitored as one database, while a primary with a read replica is two, resulting in double the work.
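Part of that monitoring work is watching how far a replica has fallen behind. RDS publishes a `ReplicaLag` metric (in seconds) to CloudWatch under the `AWS/RDS` namespace. The sketch below assumes a boto3 CloudWatch client is passed in; the 300-second threshold and the function names are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def replica_lag_query(replica_id: str, minutes: int = 10) -> dict:
    """Arguments for CloudWatch get_metric_statistics covering the last few minutes."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Maximum"],
    }


def worst_lag_seconds(datapoints):
    """Largest lag in the window, or None if CloudWatch returned no datapoints."""
    return max((p["Maximum"] for p in datapoints), default=None)


def is_lagging(datapoints, threshold_seconds: float = 300) -> bool:
    worst = worst_lag_seconds(datapoints)
    return worst is not None and worst > threshold_seconds


def check_replica(cloudwatch_client, replica_id: str, threshold_seconds: float = 300) -> bool:
    """cloudwatch_client is expected to be boto3.client("cloudwatch")."""
    resp = cloudwatch_client.get_metric_statistics(**replica_lag_query(replica_id))
    return is_lagging(resp["Datapoints"], threshold_seconds)
```

In practice a CloudWatch alarm on the same metric would probably replace the polling, but the check is the same.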
If your primary goes down, Multi-AZ automatically fails over, while read replicas don’t. You have to fail over manually, update the DNS endpoint in your app servers’ config files, push the file to all servers, and restart them for the change to take effect. With Multi-AZ, the DNS endpoint remains the same; RDS internally remaps it to the standby, which is now the primary. You don’t even have to wake up if this happens in the middle of the night. Manual processes, by contrast, always carry the risk of a mistake that makes a bad situation worse.
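The manual steps for a read replica can be sketched roughly as follows. This assumes a boto3 RDS client is passed in, a hypothetical replica name, and a made-up `DB_HOST=...` config file format. Pushing the file to servers and restarting them is left out, since it depends entirely on your deployment tooling.

```python
def promote_replica(rds_client, replica_id: str) -> str:
    """Promote a read replica to a standalone instance and return its endpoint address.

    rds_client is expected to be boto3.client("rds").
    """
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Block until RDS reports the instance as available.
    rds_client.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
    desc = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    return desc["DBInstances"][0]["Endpoint"]["Address"]


def rewrite_db_host(config_text: str, new_host: str, key: str = "DB_HOST") -> str:
    """Point the app config at the promoted instance (hypothetical KEY=value format)."""
    lines = []
    for line in config_text.splitlines():
        if line.startswith(key + "="):
            line = f"{key}={new_host}"
        lines.append(line)
    return "\n".join(lines)
```

One caveat: the waiter can return early if the instance status has not yet changed after the promote call, so a real runbook would verify the promotion before repointing traffic.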
Failure Domains
A single AZ has a certain level of reliability, which is enough for most companies: their own code and setup are less reliable than the AZ.
But if you want a high level of reliability, you should go Multi-AZ. This is a best practice. RDS, in fact, does not even offer an SLA for Single-AZ databases. I’ve attended meetings where the company I’m representing is being evaluated for its DevOps sophistication by another company (either a VC or a company considering them as a vendor), and this question has come up.
But if you want even more reliability, to an extreme, you should use a multi-region setup.
At this point, we should take a step back and understand what AZs and regions are. An AZ consists of one or more datacenters with their own power, cooling and networking, separated from other AZs by kilometers. That separation protects one AZ from failures in another caused by power loss, fire, flood, etc. But in extreme conditions, it may not be enough, as when Hurricane Sandy hit NYC and took down multiple supposedly independent datacenters. Leaving aside such natural disasters, if there’s a bug in Amazon’s code, it can take down multiple AZs, since they depend on each other at the software layer. So, considering both natural disasters and software bugs, there’s a tiny but present chance of one AZ failure affecting another.
If you’re worried about that, and need an extreme level of reliability (most companies don’t), then you should have a multi-region architecture. This is possible only with read replicas.
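A cross-region read replica is created from the destination region and refers to the source instance by its ARN. Here is a minimal sketch of that request using boto3's `create_db_instance_read_replica`. The regions, instance names and the account ID `123456789012` are placeholders.

```python
def source_arn(region: str, account_id: str, instance_id: str) -> str:
    """ARN of the source instance; cross-region replicas must reference it this way."""
    return f"arn:aws:rds:{region}:{account_id}:db:{instance_id}"


def cross_region_replica_request(source_region: str, account_id: str,
                                 source_id: str, replica_id: str) -> dict:
    """Keyword arguments for create_db_instance_read_replica."""
    return {
        "DBInstanceIdentifier": replica_id,
        "SourceDBInstanceIdentifier": source_arn(source_region, account_id, source_id),
        "SourceRegion": source_region,
    }


def create_cross_region_replica(dest_region: str, request: dict):
    """Needs boto3 and credentials, so it is not run here."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    rds = boto3.client("rds", region_name=dest_region)  # client lives in the destination region
    return rds.create_db_instance_read_replica(**request)


if __name__ == "__main__":
    print(cross_region_replica_request("us-east-1", "123456789012", "my-db", "my-db-west"))
```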
Summary
In summary, not every startup needs a replicated database, be it Multi-AZ or read replicas. If you do need one, and you can fit within Multi-AZ’s limitations, use Multi-AZ, since it’s less hassle.
[1] Multi-AZ uses synchronous replication, so no data is lost when failing over. Read replicas replicate asynchronously, risking data loss in an outage.
[2] The idle standby can be a problem for a cash-strapped startup: you’re paying for it but you can’t use it 99% of the time. But does a cash-strapped startup need any kind of replicated database, Multi-AZ or read replicas, in the first place?
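The difference between synchronous and asynchronous replication noted above can be made concrete with a toy model. This is purely illustrative: the class and its methods are invented for the example and say nothing about how RDS implements replication internally.

```python
class ToyReplicatedDB:
    """Two in-memory copies of a write log, replicated either way."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self.pending = []  # acknowledged writes not yet shipped (async only)

    def write(self, value) -> str:
        self.primary.append(value)
        if self.synchronous:
            self.replica.append(value)  # the copy lands before the write is acknowledged
        else:
            self.pending.append(value)  # shipped later, after the acknowledgement
        return "ack"

    def ship_pending(self) -> None:
        """Asynchronous catch-up: apply queued writes to the replica."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def fail_over(self) -> list:
        """The primary is lost; return what the surviving copy holds."""
        return list(self.replica)


if __name__ == "__main__":
    sync_db = ToyReplicatedDB(synchronous=True)
    sync_db.write("a"); sync_db.write("b")
    print("sync survivor:", sync_db.fail_over())

    async_db = ToyReplicatedDB(synchronous=False)
    async_db.write("a"); async_db.ship_pending(); async_db.write("b")
    print("async survivor:", async_db.fail_over())
```

In the asynchronous case, the write acknowledged after the last catch-up is missing from the survivor, which is the data-loss risk the note describes.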