
Session Name: Modeling Alert Quality

The only thing worse than getting a page at 3am on a Sunday, waking up, and realizing it was a false alarm is not getting a page at 3am on a Sunday because the alert did not fire, and then being woken up by an irate manager at 4am. How do you decide what to alert on? How do you balance not waking people up against reducing downtime? How do you even measure downtime? You cannot improve what you cannot measure, but just because you can measure something does not mean you can improve it. If you made a change to your alerting configuration, would you be able to tell, even retroactively, whether it was a good idea? The first step in improving alert quality is deciding which metrics to focus on and how to balance them. Time to recovery, frequency of incidents, false alarms, missed alarms, and misrouted alarms all need to be taken into consideration. How do you make sure your alerting and paging OKRs are aligned with your team's values?
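As a rough illustration of what measuring these quantities could look like, here is a minimal Python sketch. It is not taken from the session; the `Incident` and `Page` records, their fields, and the `alert_quality` function are hypothetical names chosen for this example, and it assumes that someone has already labeled each page as real or false and each incident as paged or missed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """A real outage (hypothetical record layout)."""
    started: datetime
    resolved: datetime
    paged: bool  # did any alert fire for this incident?


@dataclass(frozen=True)
class Page:
    """One page sent to a human (hypothetical record layout)."""
    fired: datetime
    real: bool  # did it correspond to a real incident?
    routed_correctly: bool  # did it reach the right team first?


def _ratio(count, total):
    # Avoid dividing by zero when there is no data for the period.
    return count / total if total else 0.0


def alert_quality(incidents, pages):
    """Summarize the metrics named in the abstract for one period."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    mttr = downtime / len(incidents) if incidents else timedelta()
    return {
        "incident_count": len(incidents),
        "total_downtime": downtime,
        "mean_time_to_recovery": mttr,
        # Pages that woke someone up for nothing.
        "false_alarm_rate": _ratio(sum(not p.real for p in pages), len(pages)),
        # Incidents for which no alert fired at all.
        "missed_alarm_rate": _ratio(sum(not i.paged for i in incidents), len(incidents)),
        # Pages that went to the wrong place.
        "misrouted_rate": _ratio(sum(not p.routed_correctly for p in pages), len(pages)),
    }
```

Computing the same summary for the periods before and after a configuration change gives one way to judge the change retroactively, provided the labeling is done consistently in both periods.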

Speaker Bio:

Moshe has been using open source software since 1995, and Python has been his main development language since 1998. He has contributed to CPython and is a founding member of both the Python Software Foundation and the Twisted project.