Session Name: Rethinking Reliability: What We Can (And Can't) Learn From Incident Metrics
This talk presents research drawn from the VOID, an open database of public incident reports. Containing nearly 10,000 reports from over 600 organizations, the database allows for more structured review and research on software-related incident reporting. Key results from our research challenge standard industry practices for incident response and analysis, like tracking Mean Time To Resolve (MTTR) and using Root Cause Analysis (RCA) methodology. In particular, we demonstrate how unreliable MTTR can be, and how RCA can lead to environments where people are less likely to admit mistakes and speak up about things that could lead to future incidents. We propose alternate metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents and make their systems safer and more resilient.
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.