*Note: The CFP for 2019 All Day DevOps is open. Submit a talk to speak here.*
First, what is chaos engineering? Think of it as a vaccine: we inject something harmful into a system in order to build immunity and make it stronger.
Why do chaos engineering?
Because our systems and infrastructures are growing in scale at a rapid pace, they also become more vulnerable to failure.
With chaos engineering we can:
- Do outage reproduction
- Do on-call training
- Strengthen new products
- Battle-test new infrastructure
By reproducing an outage, you can analyze how your organization reacts to it, and you can train your organization to react better.
On-call training means having people triage a reproduced incident without having to be afraid of breaking real production systems.
At Gremlin, chaos engineer Ana Medina strengthens new products by getting together with her team and running all sorts of experiments on them. She told a story about how they applied chaos engineering to two competing container technologies by running experiments on both. This way, they could find out which one was the most resilient, and therefore which one they would use.
Monitoring and observability are important. If you have no view into what your systems are doing, you don’t have chaos engineering. You only have chaos.
Next, you’ll need on-call and incident management, so you can handle alerts in case certain experiments go wrong. You want to be able to roll back experiments.
Finally, you want to know your cost of downtime per hour. This gives you a concrete argument for why chaos engineering is needed: if downtime costs millions of dollars per hour, you want to know whether the system is resilient.
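To make that concrete, here is a minimal sketch of the arithmetic; all figures below are hypothetical, not from the talk.

```python
# Rough downtime-cost estimate. All figures here are hypothetical.
def downtime_cost(revenue_per_hour: float, outage_hours: float) -> float:
    """Return the revenue lost during an outage of the given length."""
    return revenue_per_hour * outage_hours

# A service earning $2M per hour that is down for 90 minutes:
print(f"${downtime_cost(2_000_000, 1.5):,.0f}")  # → $3,000,000
```

Numbers like these turn “we should do chaos engineering” into a budget conversation.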
What chaos engineering is not
Chaos is not about creating outages or breaking production. It’s not about unexpected or unmonitored experiments. All stakeholders, such as teams depending on your application, should know that an experiment will take place and what it involves.
Where chaos engineering belongs
Chaos engineering is possible at every layer of your stack: application, API, caching, database, hardware, and infrastructure, just to name a few.
At Gremlin, Medina found some great places to inject chaos: Cassandra, Kubernetes, Kafka, and Elasticsearch. These are all components that businesses depend on for everyday operation.
Contrary to common approaches in other domains, you don’t want to start with low-hanging fruit. Start with your top five critical systems. Choose one and whiteboard it. This lets you see what dependencies it has and where you can inject chaos. And that applies even when your critical system is an external vendor’s: Medina rightly says that customers don’t care if you’re down because an external vendor was down.
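A whiteboarded system can be captured as a simple dependency map. The sketch below uses made-up service names to show how every transitive dependency, including an external vendor, becomes a candidate injection point.

```python
# Hypothetical whiteboard of one critical system, as an adjacency dict.
dependencies = {
    "checkout": ["payments", "inventory", "session-cache"],
    "payments": ["external-payment-vendor"],
    "inventory": ["inventory-db"],
    "session-cache": [],
}

def all_dependencies(service, deps):
    """Walk the map to collect everything a service transitively depends on."""
    seen = set()
    stack = list(deps.get(service, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(deps.get(d, []))
    return seen

# Every name printed below is a place where chaos could be injected.
print(sorted(all_dependencies("checkout", dependencies)))
```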
Once you’ve chosen your system, determine what experiment you want to run. Do you want to target the network? Or the memory? Think about big failures that have happened in our industry and replicate them as experiments.
For example, there was a large DNS outage a few years ago. Companies with only one DNS provider were down, while those with multiple providers weren’t. You can also run an experiment in which Amazon S3 goes down. A final example concerns companies that target developing countries; for them, network packet loss is a very valuable experiment.
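The DNS lesson can be sketched in a few lines: query resolvers in order so one provider’s outage doesn’t take you down. The provider functions below are simulated stand-ins, not a real DNS client.

```python
# Multi-provider lookup: fall through to the next resolver on failure.
def resolve(name, providers):
    """Return the first successful answer from an ordered list of resolvers."""
    for provider in providers:
        try:
            return provider(name)
        except Exception:
            continue  # this provider is down; try the next one
    raise RuntimeError(f"all DNS providers failed for {name}")

# Simulated providers: the first is having an outage, the second answers.
def primary(name):
    raise TimeoutError("provider outage")

def secondary(name):
    return "203.0.113.7"  # documentation-range IP, not a real host

print(resolve("example.com", [primary, secondary]))  # → 203.0.113.7
```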
As a last step, you should determine your blast radius. For example, when you’re experimenting with network packet loss, don’t drop all your packets. Start small, because you don’t want to cause a complete outage of your systems and applications.
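Starting small could look like the sketch below: pick a small fraction of hosts and a low loss rate. The host names and network interface are hypothetical; the command string follows standard Linux tc/netem syntax, and nothing is executed here.

```python
import random

def plan_packet_loss(hosts, fraction=0.1, loss_pct=1, seed=0):
    """Pick a small subset of hosts and build the tc command for each one."""
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    count = max(1, int(len(hosts) * fraction))
    targets = rng.sample(hosts, count)
    command = f"tc qdisc add dev eth0 root netem loss {loss_pct}%"
    return {host: command for host in targets}

hosts = [f"web-{i:02d}" for i in range(20)]
plan = plan_packet_loss(hosts)  # 10% of hosts, 1% packet loss
print(len(plan))  # → 2
```

Only after the small experiment succeeds would you raise the fraction or the loss rate.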
Chaos engineering is actually quite scientific:
You start with a hypothesis. Then you run an experiment. But be sure to define abort conditions: you don’t want to actually break everything! When your system fails, fix the issues. When your system stays up, scale up (increase the blast radius) and repeat.
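That loop can be sketched as code. Everything below is a simulation: the experiment runner and the abort condition are stand-ins for real tooling and real monitoring.

```python
# Hypothesis → experiment → abort-or-scale-up loop, as a simulation.
def run_loop(run_experiment, abort_condition, start_radius=1, max_radius=8):
    """Run experiments, aborting on failure and doubling the radius on success."""
    radius = start_radius
    while radius <= max_radius:
        result = run_experiment(radius)
        if abort_condition(result):
            return ("fix the issues", radius)  # hypothesis falsified: stop here
        radius *= 2  # system stayed up: widen the blast radius and repeat
    return ("system held up", max_radius)

# Simulated experiment: the error rate grows with the blast radius.
outcome = run_loop(
    run_experiment=lambda r: {"error_rate": r * 0.02},
    abort_condition=lambda res: res["error_rate"] > 0.05,
)
print(outcome)  # → ('fix the issues', 4)
```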
In time your system will become more and more resilient. And other companies are taking note of the benefits of this practice. Netflix, Twilio, Amazon, Expedia, and many other companies are all doing chaos engineering.
“Chaos days” are a great way to get started. Instead of focusing on new features, teams focus on what could go wrong in their systems.
On chaos days, you can run several experiments:
- Reproduce outage conditions (check your old tickets or post-mortems)
- Unpredictable circumstances (what happens when your data center fails?)
- Large traffic spikes
- Race conditions
- Datacenter failure
There are open-source tools that can help you out. The most well-known is Chaos Monkey from Netflix, which you can use on AWS to shut down your instances. Netflix later scaled it up and released the Simian Army. Other tools include:
- Kube Monkey (Chaos Monkey for Kubernetes)
- Gremlin (offers 11 different attacks on your infrastructure)
Don’t be scared to get started - the chaos engineering community is quite large. There’s a Slack channel (bit.ly/choas-eng-slack) with many chaos engineers from around the world that you can get involved with if you’d like to learn more.
Missed Ana Medina’s session, or want to see some other great presentations from October 17? Head over to https://www.alldaydevops.com/live and make sure you’re registered. Then, catch up on what you missed (or re-watch your favorites)!
About the author, Peter Morlion
Peter is an experienced software developer across a range of languages, specializing in getting legacy code back up to modern standards. Based in Belgium, he’s fluent in TDD, CQRS, and other modern software development practices. Connect with Peter at @petermorlion.