Keynote Session: Managing Risk with Service Level Objectives and Chaos Engineering
What do you do when you've had a few too many incidents and blown your error budget? Or had a pile of near-misses that burned the team out even though the user-facing SLO wasn't violated? What if the incident trigger was the infrastructure refactoring meant to improve, not harm, reliability and maintainability?
In this talk, you'll hear about the context for two sets of outages that caused our medium-sized startup to pause and re-evaluate our infrastructure plans. In the first, we experienced a significant event before an immovable external deadline, and found a creative way to push the launch-related risk onto a separate shard of our infrastructure and de-risk the rest of our SLOs. In the second, we scaled back the ambitions of a refactor of our Kafka cluster to give the team a break from incident fatigue, even though our SLOs had only partially burned.
We'll also talk about the longer-running chaos engineering and continuous verification practices we use when we have plenty of error budget and appetite for technical risk.
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. She lives in Vancouver, BC with her wife Elly and a Samoyed/Golden Retriever mix, and in San Francisco and Seattle with her other partners. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.