David has over thirty years of experience in the systems administration/DevOps/SRE field in large multiplatform environments. He is the curator/editor of the O’Reilly book Seeking SRE: Conversations on Running Production Systems at Scale and author of the O’Reilly Otter Book (Automating Systems Administration with Perl). He is a co-founder of the wildly popular SREcon conferences hosted globally by USENIX. David currently works for Microsoft as a senior cloud advocate focusing on site reliability engineering.
Session: How to Fail with SRE
Lots of people want to talk about how to succeed with Site Reliability Engineering, but in this talk, I’d like to discuss the ways I’ve seen SRE efforts fail.
We will explore three kinds of failures: technical, organizational, and cultural. All of them have successfully undermined perfectly well-meaning efforts to implement SRE in their respective organizations. And all of them were preventable in ways we will discuss.
I can’t promise that a good knowledge of anti-patterns and cautionary tales will ensure your attempts succeed, but success is much more likely if you know some of the pitfalls and speed bumps others have encountered. Let’s dive into some of the failure modes for SRE together.