Whether and How To Load Test
Why to load test
When you load test at a higher traffic level than you currently handle, you get visibility into what’s coming down the road for your company. You can identify whether your system scales to twice its current traffic, or 10x.
Load testing gives you confidence to say yes to a big prospect who wants to send more traffic than you’ve handled, because you’re one step ahead.
Load testing helps you identify specific issues to work on before they affect users. For example, if you do a load test and see a high error rate, you can debug it to find out why. Say it’s because of a primary key collision caused by using the timestamp as the primary key. You can fix that by increasing the precision of the timestamp to nanoseconds. You can identify and fix such issues before they become a crisis, stress out your team and upset users.
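As a sketch of that example: if message IDs are derived directly from the clock, second-level precision collides as soon as two messages arrive in the same second, while nanosecond precision makes collisions far less likely. The function names here are hypothetical:

```python
import time

def message_id_seconds() -> int:
    # Second-precision timestamp: two messages created in the same
    # second get the same primary key, causing a collision.
    return int(time.time())

def message_id_nanoseconds() -> int:
    # Nanosecond-precision timestamp: collisions become far rarer,
    # though a truly collision-proof scheme needs more than a clock.
    return time.time_ns()

print(message_id_seconds() == message_id_seconds())          # usually True: collision
print(message_id_nanoseconds() == message_id_nanoseconds())  # False in practice
```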
You can take load testing to the extreme to find out where the limits of scalability are. Say you’re using MySQL with RDS. MySQL doesn’t scale horizontally [1]. If your company scales, you may have to re-architect your system. To determine when, you could load test with the biggest RDS instance, which has a terabyte of memory. Oversize your app servers and other layers of the stack to ensure they’re not the bottleneck. Increase traffic till the latency or error rate spikes. This will tell you how much your system can scale. Since rearchitecting takes time, it has to be planned in advance. It’s not a knob you can turn when you need to. And load testing gives you that visibility ahead of time.
Load tests are sometimes used in acquisitions or other high-stakes conversations where a company wants to evaluate a startup’s technology.
These are all the benefits of load testing.
Why not to load test
First, load testing takes time to do, time that could be spent elsewhere.
Second, a load test never gives you the confidence that handling the same amount of real production traffic does. Load tests don’t exercise all features or APIs. Or even every situation supported by the APIs you’re testing. For example, if you’re load-testing a messaging app, you might load test 1:1 messages. But what about group messages? Group messages have much more overhead: when you send a message to a group with 500 members, it has to be sent to all 500 people. This is 500x the load on the system [2] compared to a 1:1 message. You could extend the load test to group messages, but that only brings up more questions: what about sending messages with attachments? The size in bytes is orders of magnitude higher. How does that impact the system? You have to stop somewhere — you can’t load test every possible scenario, because then the load test becomes a major project in itself. But this, in turn, means that running a load test with a given level of traffic is always less reliable than actually handling that level of real production traffic.
Third, load testing can incur significant cost, like tens of thousands of dollars if you’re testing with huge instances. This may not be affordable in a startup.
Before you load test
Before you load test, consider these alternatives:
Is your latency less than 100ms in peak hours [3]? Unless you’re doing something sophisticated like deep learning or transcoding videos, it should be: a plain CRUD app can hit that. If it isn’t, optimise it.
Keeping latency low even in peak hours means that if traffic increases, it will be a while before it reaches the point where users object. This gives you breathing room to scale your system in an orderly way, as opposed to making emergency fixes, which can cause double work.
You don’t need to load test and try to predict the future if you give yourself a latency buffer that lets you handle traffic growth when it happens.
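A minimal sketch of the check described in footnote [3], assuming you can export (hour, latency) pairs from your metrics system; the input data format here is a placeholder:

```python
from collections import defaultdict

# Placeholder input: (hour_of_day, latency_ms) samples exported from
# your metrics system; the real export format will vary.
samples = [(0, 42.0), (0, 55.0), (9, 130.0), (9, 110.0), (14, 80.0)]

sums, counts = defaultdict(float), defaultdict(int)
for hour, latency_ms in samples:
    sums[hour] += latency_ms
    counts[hour] += 1

# One data point per hour of the day; all should be under 100ms.
for hour in sorted(sums):
    avg = sums[hour] / counts[hour]
    print(f"{hour:02d}:00  {avg:6.1f}ms" + ("" if avg < 100 else "  <- over budget"))
```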
You can also examine limits. Many apps have artificial limits. An incident management service might limit the number of incidents that can be created in a minute. A messaging app might limit the number of users in a group. These limits are sometimes justified as ensuring that the system works properly, or that some users don’t use too many resources, but that’s a negative way of looking at it. They’re really a cover-up for limited scalability in the system.
The right mental model is that users should be able to use whatever they want, and you should charge them for it. For example, Google Drive offers a 30 terabyte plan! If you order food from a restaurant every day, the restaurant doesn’t blame you for eating too much of their food. They just charge you for it. Similarly, a messaging app might charge for each message received. Then, a group with more members will receive more messages and pay more to offset the increased use of system resources.
You should remove artificial limits like these [4]. When you do, you end up increasing the capacity of the system, in this example, by increasing the number of messages that can be delivered per second. Scalability applies not just to users, but also to usage. When you increase the capacity of the system this way, it helps existing users, and prospects who ask for more members in a group.
These are higher ROI activities than load testing, and I recommend them in preference to load testing, unless you have a specific need that can be handled only by load testing.
How to load test effectively
If, after considering all the above aspects, you’ve concluded that you should load test, how should you go about it?
Set up a separate cluster. Don’t pollute your production DB with test data, because such data is hard to remove later. It affects multiple things, like your analytics. In the example of the messaging app, your statistics for how many messages were sent this month will be screwed up. Not to mention that running a load test on production can affect real users. And real traffic will pollute the results of the load test.
Make sure no other team member is sending traffic at the same time, since it will affect the load test results.
Keep your load testing cluster as similar to production as possible. For example, use the same versions of all software, configured the same way.
If you’re testing multiple APIs, ensure that the ratio of API calls is the same as you see in production. For example, if a messaging app has API calls to send and receive messages, and you find 2x as many send calls as receive in production, maintain that ratio in your load test, to keep it similar to production.
Test your system end to end by sending traffic to your load balancer. Don’t bypass it and send traffic directly to your app servers. You want to test your entire system, not part of it.
The only thing that need not be similar to production is resources, like instance sizes. It’s okay to test with higher resources to see what load they can support.
Overprovision the systems on which the load test tool itself is running (unless you use a serverless platform). You don’t want that to be the bottleneck.
Ramp up traffic slowly over time, like increasing by 1% every minute, till the latency or error rate becomes too high. If you don’t know what too high is, stop your test when latency exceeds 3s or the error rate exceeds 0.1%.
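Here’s a single-threaded sketch of that ramp, with the 2:1 send-to-receive mix from the earlier example baked in. The URL and endpoints are placeholders, and a real test would use a concurrent driver or a dedicated tool like k6 or Locust; this only shows the ramp-and-stop logic:

```python
import random
import time
import requests  # third-party: pip install requests

BASE_URL = "https://loadtest.example.com"   # the load balancer, not an app server
ENDPOINTS = ["/send", "/send", "/receive"]  # 2:1 send:receive, matching production
MAX_LATENCY_S = 3.0                         # stop thresholds from the text
MAX_ERROR_RATE = 0.001

def run_minute(rps: int) -> tuple[float, float]:
    """Send roughly `rps` requests/second for a minute; return (max latency, error rate)."""
    latencies, errors = [], 0
    for _ in range(rps * 60):
        start = time.monotonic()
        try:
            resp = requests.post(BASE_URL + random.choice(ENDPOINTS), timeout=10)
            if resp.status_code >= 400:
                errors += 1
        except requests.RequestException:
            errors += 1
        elapsed = time.monotonic() - start
        latencies.append(elapsed)
        time.sleep(max(0.0, 1.0 / rps - elapsed))  # crude pacing
    return max(latencies), errors / len(latencies)

rps = 100  # starting traffic level
while True:
    max_latency, error_rate = run_minute(rps)
    print(f"{rps} rps: max latency {max_latency:.2f}s, error rate {error_rate:.2%}")
    if max_latency > MAX_LATENCY_S or error_rate > MAX_ERROR_RATE:
        break
    rps = max(rps + 1, int(rps * 1.01))  # ramp up by ~1% every minute
```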
After the test is over:
Validate the error rate by checking the number of rows created in the database to catch bugs where the backend returns 200 despite a failure. If you sent 100 messages and got a 1% error rate, but only 97 rows were inserted into your database, then your error rate is actually 3%.
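A sketch of that check against a SQLite database; the `messages` table and `test_run_id` column are hypothetical, and the idea carries over to any database:

```python
import sqlite3

def true_error_rate(db_path: str, test_run_id: str, sent: int) -> float:
    """Compare rows actually inserted against the number of messages sent."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: each load-test message is tagged with its run ID.
    (inserted,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE test_run_id = ?", (test_run_id,)
    ).fetchone()
    conn.close()
    return (sent - inserted) / sent

# 100 sent but only 97 rows inserted -> 0.03, i.e. a 3% true error rate,
# even if the API reported only 1%.
```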
Write a report showing what the capacity of the system is, such as “We can support 1000 requests/second with 0.5% error rate and 3s latency”. Also give a graph showing how varying traffic levels impact error rate and latency. Measure latency at different percentiles: average or median, 90th, 99th, 99.9th and 100th. While we generally don’t pay attention to the 100th percentile in DevOps, it’s good information to know in the context of a load test. If the value is 10 seconds, you know that not even a single request timed out, and that your system is performing excellently.
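A sketch of computing those percentiles from the latencies recorded during the run, using a simple nearest-rank method; the sample data is a stand-in:

```python
def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile; p=100 returns the single slowest request."""
    idx = min(len(sorted_vals) - 1, int(len(sorted_vals) * p / 100))
    return sorted_vals[idx]

# Stand-in data: latencies in seconds collected during the test.
latencies = sorted([0.12, 0.15, 0.20, 0.35, 0.50, 1.10, 2.90])

print(f"average: {sum(latencies) / len(latencies):.3f}s")
for p in (50, 90, 99, 99.9, 100):
    print(f"p{p}: {percentile(latencies, p):.3f}s")
```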
Document your assumptions and methodology in your report. There isn’t one right way to load test, so you can make reasonable assumptions, but you should document them so that readers know what’s being measured and how.
[1] Leaving aside slaves and read replicas.
[2] As a first approximation.
[3] Calculate the average latency over one hour, and draw a graph with 24 data points over the day. All 24 points should be less than 100ms. You can also check error rate, but when traffic increases, latency spikes before error rate, so latency is the leading indicator of a scalability bottleneck.
[4] Or implement twice the limit that users ask for.