What You Need to Know to Succeed as a DevOps Engineer
A lot of DevOps is about knowing specifics: how to work with databases, VMs, block storage and autoscaling, what metrics to monitor, what time period to measure them over, and how to use alarms.
But a step back from the specifics, there are general principles you need to internalise, a philosophy you should have, in order to succeed as a DevOps engineer:
Use high-level frameworks if possible, for greater development velocity and less DevOps overhead: first choice, a BaaS like Firebase; if that doesn’t fit, a FaaS; if that doesn’t fit either, a serverless platform like Google App Engine. Raw VMs should be last on your list.
Even things that you think are safe, like upgrading to MySQL 8, can cause an outage. So, spend more time to minimise risk. As an analogy, pilots repeat instructions they received from air traffic control back to ATC to ensure they weren’t misheard. You shouldn’t look at this as wasting time, but as spending time to reduce risk. A software engineer can write whatever code he wants and then see if it works, before deploying to production. If it doesn’t work, there’s no harm. As a DevOps engineer, you can’t do that — you’re working directly in production, so you need to adopt a more conservative approach than software engineers. Measure twice, cut once.
Be calm. If you have a non-technical CEO breathing down your neck, you need to push back: “No, it won’t be done today or tomorrow; it will take a week — and that is if you stop giving me other work.” You need to make space for yourself to do DevOps properly. If you stress yourself, or you let your manager stress you, you’ll make mistakes, and cause an incident.
Good DevOps requires documentation: document various aspects of your architecture on your team’s Notion page. Have an architecture diagram. Document common procedures like upgrading a server. Don’t improvise.
You should allocate some percentage of your time to learning the technologies you use. For example, I recently read the 1900-page RDS guide. Make time for it. If you keep saying “not today”, it won’t make time for itself. Today is as good a time as any to read it.
Share knowledge within the team, for example by giving talks.
Track necessary DevOps improvements as bugs in Jira. A good bug report describes the state before and after the change, so it doubles as documentation.
Standardise everything, like the versions of various tools used. Don’t use a different version of MySQL or the JVM in a different datacenter or for a different customer. Use the same instance types everywhere. And so on.
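Standardisation is easier to keep up if you can check for drift mechanically. Here is a minimal sketch in Python, assuming you can export an inventory of which tool versions run where. The inventory, the environment names and the version numbers below are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: environment name -> {tool: version}.
# In practice this would come from your own configuration management data.
INVENTORY = {
    "dc-east": {"mysql": "8.0.36", "jvm": "17.0.10"},
    "dc-west": {"mysql": "8.0.36", "jvm": "17.0.10"},
    "customer-x": {"mysql": "5.7.44", "jvm": "17.0.10"},
}


def find_version_drift(inventory):
    """Return {tool: {version: [environments]}} for tools running more than one version."""
    versions = defaultdict(lambda: defaultdict(list))
    for env, tools in inventory.items():
        for tool, version in tools.items():
            versions[tool][version].append(env)
    # Keep only the tools that are not standardised on a single version.
    return {
        tool: {version: sorted(envs) for version, envs in by_version.items()}
        for tool, by_version in versions.items()
        if len(by_version) > 1
    }


if __name__ == "__main__":
    for tool, by_version in find_version_drift(INVENTORY).items():
        print(f"{tool} is not standardised: {by_version}")
```

A check like this could run on a schedule and file a ticket whenever it finds drift.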
Before you make a change in production, discuss it with a colleague, because he may point out a mistake you’re about to make, or you yourself might identify it in the course of the discussion.
Coordinate with team members so that two of you are not making changes in production at the same time, stepping on each other’s toes.
Make changes in staging before doing so in production.
Canary changes in production. If you have five production databases to upgrade, upgrade the least important one first, such as the one with the least traffic, or the one for the least important customer. Or upgrade a read replica first.
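The canary ordering can be written down rather than decided on the spot. The sketch below, in Python, sorts read replicas ahead of primaries and then sorts by ascending traffic. The Database fields and the use of requests per second as a stand-in for importance are assumptions for illustration; you may rank by customer importance instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    # Hypothetical description of a production database.
    name: str
    is_read_replica: bool
    requests_per_second: float


def canary_order(databases):
    """Order databases so the least risky upgrade comes first.

    Read replicas sort before primaries; within each group,
    lower-traffic databases sort before higher-traffic ones.
    """
    return sorted(
        databases,
        key=lambda db: (not db.is_read_replica, db.requests_per_second),
    )


if __name__ == "__main__":
    fleet = [
        Database("orders-primary", False, 900.0),
        Database("internal-tools", False, 5.0),
        Database("orders-replica", True, 300.0),
    ]
    for position, db in enumerate(canary_order(fleet), start=1):
        print(position, db.name)
```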
Once you make a change, wait half a week before rolling it out broadly. Don’t make multiple changes at once, like upgrading your app servers and database. It makes it hard to figure out which one caused an incident.
Diagnose before changing. For example, if your server is not able to keep up, don’t blindly upgrade to an instance with more CPU and memory. Identify whether CPU or memory is the bottleneck. If CPU is the bottleneck and you have lots of memory, upgrade to a compute-optimised instance that has more CPU but the same memory. Doctors don’t prescribe medicines before diagnosing the problem, and neither should you.
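The diagnosis step can be sketched as a small decision function in Python. The 80% threshold and the wording of the recommendations are assumptions, not recommended values, and the inputs are meant to be sustained utilisation figures measured over a sensible period, not a single reading.

```python
# Hypothetical mapping from diagnosis to next step.
RECOMMENDATION = {
    "cpu": "move to a compute-optimised instance with more CPU and the same memory",
    "memory": "move to a memory-optimised instance with more memory and the same CPU",
    "both": "move to a larger instance with more CPU and more memory",
    "neither": "do not resize; look for another bottleneck such as disk, network or locks",
}


def diagnose_bottleneck(cpu_percent, memory_percent, threshold=80.0):
    """Classify which resource is saturated, given sustained utilisation in percent."""
    cpu_hot = cpu_percent >= threshold
    memory_hot = memory_percent >= threshold
    if cpu_hot and memory_hot:
        return "both"
    if cpu_hot:
        return "cpu"
    if memory_hot:
        return "memory"
    return "neither"


if __name__ == "__main__":
    diagnosis = diagnose_bottleneck(cpu_percent=95.0, memory_percent=40.0)
    print(diagnosis, "->", RECOMMENDATION[diagnosis])
```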
Make changes gradually. For example, if you’re using a server with 64 GB memory, and you concluded that 16 GB is enough, instead of downgrading to 16 GB at once, downgrade to 32, wait a week, and then downgrade to 16. Another example of a gradual change is if you’re upgrading MySQL, which requires you to upgrade Hibernate, first upgrade Hibernate, roll it out, make sure it works stably for multiple days, and then upgrade MySQL.
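The memory example can be sketched as a small planner in Python. Halving at each step is an assumption taken from the 64 to 32 to 16 example; the function only produces the plan, and the waiting period between steps is left to you.

```python
def downsizing_steps(current_gb, target_gb):
    """Return the intermediate memory sizes to step through, ending at target_gb.

    Each step halves the current size, but never goes below the target.
    """
    if target_gb <= 0 or target_gb > current_gb:
        raise ValueError("target must be positive and no larger than the current size")
    steps = []
    size = current_gb
    while size > target_gb:
        size = max(size // 2, target_gb)
        steps.append(size)
    return steps


if __name__ == "__main__":
    for size in downsizing_steps(64, 16):
        print(f"downgrade to {size} GB, then wait and watch before the next step")
```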
When there’s an outage, it’s okay to put in a solution that’s not the right long-term solution. You should first stop the bleeding. But you should track and fix such technical debt.
Emergencies aside, don’t create more technical debt. For example, Spring Boot comes with Hibernate. Say you’re using an older version of Spring, which comes with an older version of Hibernate, but you need to upgrade Hibernate for some reason. One way is to add your own copy of Hibernate, which Spring will use in preference to its own. But having two copies of Hibernate in your binary is a hack. So upgrade Spring instead, even if it takes a week instead of a day.
There are a lot of automated tools nowadays that suggest solutions, like AWS Compute Optimizer, which can tell you that you’ve overprovisioned memory and underprovisioned CPU. Based on this, it suggests which instance type you should use, the difference in cost, and an evaluation of how risky the change would be. Use such automated tools.
Configure maintenance windows at night.
In summary, to succeed as a DevOps engineer, you should understand and internalise these principles.