Session Name: Generic Mitigations: A philosophy of duct-tape outage resolution

A mitigation is any action you might take to reduce the impact of a breakage, such as SSHing into an instance and clearing the cache, or switching off machines to close down a vulnerability. Generic mitigations are actions that first responders can take even before the root cause is fully understood, which makes them useful for addressing a wide variety of outages. Every service should employ at least one or two generic mitigations to minimize outage impact. In this presentation, Jennifer Mace, a site reliability engineer at Google, shows you how to distinguish between specific and generic mitigations and how to identify which generic mitigations your service might need.
Jennifer Mace (Macey) is a Senior Site Reliability Engineer at Google Seattle, where she works to make Google's cloud a more hospitable place for the company's corporate infrastructure. Previously the tech lead of Google Kubernetes Engine SRE, she contributed to The Site Reliability Workbook on topics from incident management to the interplay between load balancing and autoscaling systems. Ask her about multi-single tenancy, and why that phrase should give you nightmares.