Engineered chaos: is that an oxymoron? Not really. By deliberately creating chaos in your software development environments, you help build more stable and secure systems. Why is this valuable, and how can you do it?
Aaron Rinehart (@aaronrinehart) dove into what chaos engineering is, why you need it, and how you can implement it in your organization.
As co-founder of the Chaos Engineering Meetup in Washington, D.C. and Chief Security Architect for UnitedHealth Group, one of the largest companies in the U.S., he spoke about chaos engineering at last year's All Day DevOps conference.
Aaron doesn’t work for some crazy little startup that can afford to experiment with something called “chaos.” UnitedHealth Group has 28,000 developers and 8,000 applications and, being a health insurance company, is highly regulated. Its teams use DevOps, waterfall, Agile, and other methodologies.
First, the why. According to a recent study, 48% of security breaches are due to a malicious or criminal attack. The other 52% are caused by human error or system glitches. We can, to some degree, control that 52%. Aaron suggests that if we focus our security efforts on the 52%, the 48% won’t be possible.
With security breaches, we know very little in advance about what is going to happen. Where? Why? Who? How? What? We generally find out only after a security incident has happened. Too little, too late. Part of our problem, Aaron contends, is that we spend too much time reacting to outages instead of building more resilient systems.
Aaron suggests, “We can use chaos engineering to drive objectivity in a subjective world.”
Write that down. What does it mean? It means that much of security planning is subjective because you don’t know the when, where, how, etc. You are guessing where the unknown vulnerabilities are in your live system, or how the system will react if one container goes down, if you are hit by a denial-of-service attack, or if a setting on a server changes, to name a few of the nearly endless scenarios.
Digging in more, Aaron defines chaos engineering as, “The discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions.” Notice the word “experimenting.” This differs from testing in that testing is verification of something you already know; experimentation is finding something you didn’t know. It lets you set up a predicted failure, see how the system responds, and then, if necessary, fix the flaw. Then you try again with a different failure point or a different kind of engineered chaos.
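To make the experiment loop described above a little more concrete, here is a minimal sketch in Python. It is a toy simulation, not anything shown in Aaron’s talk: the `Service`, `Replica`, `kill_one_replica`, and `run_experiment` names are all hypothetical, and a real experiment would inject failures into actual infrastructure rather than in-memory objects. The idea is only to show the shape of the loop: measure a steady state, inject one failure, and observe whether the system still behaves as you hypothesized.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One instance of a service (hypothetical stand-in for a container or server)."""
    name: str
    healthy: bool = True


@dataclass
class Service:
    """Toy model of a distributed service backed by several replicas."""
    replicas: list

    def handle_request(self) -> bool:
        # A request succeeds if at least one healthy replica can serve it.
        return any(r.healthy for r in self.replicas)


def kill_one_replica(service: Service) -> None:
    """Failure injection: mark the first healthy replica as down."""
    for replica in service.replicas:
        if replica.healthy:
            replica.healthy = False
            return


def success_rate(service: Service, requests: int) -> float:
    """Fraction of simulated requests that succeed."""
    return sum(service.handle_request() for _ in range(requests)) / requests


def run_experiment(service: Service, inject_failure, requests: int = 100):
    """Run one experiment and return (baseline, observed) success rates."""
    baseline = success_rate(service, requests)   # 1. measure the steady state
    inject_failure(service)                      # 2. introduce the predicted failure
    observed = success_rate(service, requests)   # 3. observe how the system responds
    return baseline, observed


if __name__ == "__main__":
    # Hypothesis: losing one of three replicas does not affect availability.
    redundant = Service([Replica("a"), Replica("b"), Replica("c")])
    print(run_experiment(redundant, kill_one_replica))

    # The same experiment on a single-replica service exposes a flaw to fix.
    fragile = Service([Replica("only")])
    print(run_experiment(fragile, kill_one_replica))
```

If the observed rate drops below the baseline, the hypothesis was wrong and you have found something you didn’t know, which is the point of experimenting rather than testing.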
The bottom line is that we need to build confidence in what actually works, and humans and systems need failure to grow. Chaos engineering provides an objective way to understand your systems and find faults before they become problems. Aaron walked through a case study at Netflix, how to adopt a learning culture for chaos engineering, and UnitedHealth Group’s tool, ChaoSlingr, which is available on GitHub. You can watch all of Aaron’s talk here.