
Managing Risk with Service Level Objectives and Chaos Engineering

Oct 28, 2021 3:06:46 PM By Dan Whiting

Chaos engineering. Kind of sounds like planning to be spontaneous. But that is the point - you introduce “chaos” (planned failures) into your system. You know when the failures are happening, you are monitoring them, and you are ready to fix them - and you aren’t doing it at 3 AM, because you want to keep your team and coworkers happy.

Liz Fong-Jones (@lizthegrey) knows a thing or two about chaos engineering. In fact, she wrote a book about it. Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Liz was a keynote speaker at today’s All Day DevOps conference, speaking on Managing Risk with Service Level Objectives and Chaos Engineering. But it isn’t her first ADDO - check out last year’s talk.

Liz works in the field of observability. She notes it is a “rapidly evolving space that has a lot of complexity.” She strives for people to achieve five outcomes, from ensuring their code is written correctly the first time, to being able to run and manage it in production, to closing the feedback loop and managing technical debt. Overall, she aims to ensure that people have reliability, product philosophy, and scalability.

More practically, during her talk, Liz discusses service level objectives (SLOs) and using chaos engineering to both meet SLOs and ensure your systems are resilient. 

SLOs are a common language for developers, managers, product owners, and customers. They define expectations and reflect user value. Consider Honeycomb’s SLOs: the home page loads quickly (99.9%), user-run queries are fast (99%), and customer data gets ingested quickly (99.99%).
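To make those percentages concrete, each SLO target implies an “error budget” - the amount of unreliability you are allowed before breaking the objective. Here is a minimal sketch of that arithmetic in Python; the 30-day window is an assumption for illustration, and the targets mirror the Honeycomb examples above.

```python
# Illustrative only: converting an SLO target into a monthly error budget.
# The 30-day window is an assumption; targets mirror the examples above.

WINDOW_MINUTES = 30 * 24 * 60  # minutes in a 30-day rolling window

def error_budget_minutes(slo_target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of allowed unavailability for a given SLO target over the window."""
    return window_minutes * (1 - slo_target)

for name, target in [("home page loads", 0.999),
                     ("user-run queries", 0.99),
                     ("data ingest", 0.9999)]:
    print(f"{name}: {error_budget_minutes(target):.1f} minutes/month")
# home page loads: 43.2 minutes/month
# user-run queries: 432.0 minutes/month
# data ingest: 4.3 minutes/month
```

The tighter the target, the smaller the budget - a 99.99% ingest SLO leaves only about four minutes a month to spend on failures, planned or otherwise.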

Even with 99.99% SLOs, there is room to introduce failures for testing and still stay within the SLOs. Here is what they do at Honeycomb:

  • Instrument as we code
  • Functional and visual testing
  • Design for feature flag deployment
  • Automated integration and human review
  • Green button merge
  • Auto-updates, rollbacks, and pins
  • Observe behavior in production
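One item above, designing for feature flag deployment, is what lets risky changes ship dark and be turned on (or off) without a redeploy. A minimal sketch of the idea follows - this is an illustrative example, not Honeycomb’s implementation, and the flag store and names are assumptions.

```python
# Minimal feature-flag sketch (illustrative; not Honeycomb's implementation).
# Risky code paths ship dark and are enabled, or rolled back, at runtime.

FLAGS = {"new_query_engine": False}  # hypothetical in-memory flag store

def run_query(sql: str) -> str:
    """Route a query through the new engine only when its flag is on."""
    if FLAGS.get("new_query_engine", False):
        return f"new engine: {sql}"   # risky new path, gated behind the flag
    return f"legacy engine: {sql}"    # safe default path

print(run_query("SELECT 1"))          # legacy engine until the flag flips
FLAGS["new_query_engine"] = True      # instant rollout, or rollback, no deploy
print(run_query("SELECT 1"))          # new engine
```

In production, the flag store would be a real service with per-customer targeting and gradual rollout percentages, but the control point is the same: a runtime decision instead of a deploy.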

Enter chaos engineering - planning failures to see how the system reacts. Did it work as expected? What went wrong? How can we fix it? There is a lot to this and to plan for. One key area is data persistence and integrity. Liz dives into this aspect specifically and outlines their process with Kafka instances and AWS.
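The core loop of a chaos experiment - inject a known failure, observe, compare against the objective - can be sketched in a few lines. This is an illustrative simulation, not a tool from the talk; the request counts and failure rates are assumptions.

```python
# Illustrative chaos-experiment loop (not from the talk): inject failures into a
# dependency at a known rate, then check whether the observed success rate
# stays above the SLO target. Rates and counts are assumptions.

import random

def flaky_dependency(fail_rate: float) -> bool:
    """Simulated dependency call; fails with probability fail_rate (the injected chaos)."""
    return random.random() >= fail_rate

def run_experiment(requests: int, fail_rate: float) -> float:
    """Return the observed success ratio under injected failures."""
    ok = sum(flaky_dependency(fail_rate) for _ in range(requests))
    return ok / requests

random.seed(42)  # deterministic for the example
observed = run_experiment(requests=10_000, fail_rate=0.0005)
slo_target = 0.999
print(f"observed success rate: {observed:.4f}, within SLO: {observed >= slo_target}")
```

Real fault injection happens at the infrastructure layer (killed instances, degraded networks) rather than in a simulator, but the question is the same: did the system absorb the planned failure without spending the error budget?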

Liz covers three case studies of failure and what they learned. After all, a failure can be a positive if you learn and change for the better. 

One of the case studies is on Kafka. Liz notes that Kafka decouples components, which provides safety. But it also introduces new dependencies and things that can go wrong. Additionally, it has a complexity cost. 

In this case study, they changed multiple variables at once to right-size storage and CPU usage:

  • Move to tiered storage
  • Move from i3en to c6gn instances
  • Newest revision of AWS Nitro

Instead, they blew out the network connection and the IOPS dimensions, and tickled hypervisor bugs and EBS bugs. While they didn’t miss any visible SLOs, they did violate a hidden SLO - to paraphrase, you can’t burn out staff with excess off-hours emergency calls. For a more extensive look at this case study, read her colleague’s blog post.

Liz leaves us with a few takeaways: 

  • Design for reliability through full lifecycle
  • Feature flags can keep us within SLO, most of the time
  • But even when they can’t, find other ways to mitigate risk
  • Discovering and spreading out risk improves customer experiences
  • Black swans happen; SLOs are a guideline, not a rule
  • Acknowledge hidden risks
  • Make experimentation routine
  • We are part of sociotechnical systems: customers, engineers, stakeholders
  • Outages and failed experiments are unscheduled learning opportunities
  • Nothing happens without discussions between different people and teams
  • DevOps is just talking to each other! Figuring out how to put customers first.

I just barely touched on all that Liz covered in her hour-long keynote at All Day DevOps, the world's largest DevOps conference. The event is taking place live today, October 28. There is still time to register online to participate in the 24-hour, global event and to view the session on-demand later.

Founded in 2016, the virtual event gathers more than 25,000 DevOps professionals for free, hands-on education. All Day DevOps is a global community of more than 75,000 DevOps practitioners and thought leaders offering free learning, peer-to-peer insights, and networking with professionals worldwide. The community hosts an annual conference, live forums, and ongoing educational experiences online. The 2021 event features a lineup of 180+ speakers across six tracks: CI/CD Continuous Everything, Cultural Transformation, DevSecOps, Government, Modern Infrastructure, and Site Reliability Engineering. All sessions will be available on-demand following the conference.