
Another Side of DevOps: Site Reliability Engineering

Oct 4, 2021 10:14:58 AM By All Day DevOps

All Day DevOps (ADDO), the world's largest DevOps conference, will be streaming live for 24 hours starting at 3 a.m. ET on October 28, 2021. The virtual event gathers more than 25,000 DevOps professionals for free, hands-on education from 180+ speakers, along with peer-to-peer insights and networking with professionals worldwide. 

Salim Virji is a Site Reliability Engineer at Google, where he develops reliable engineering practices and processes for Google’s SRE organization, and previously developed distributed consensus and storage systems. His interests include distributed systems and machine learning. Virji received an AB in Classics from the University of Chicago and is a New York City Master Composter. He's also contributed to the SRE Book, SRE Workbook, Implementing SLOs, and 97 Things Every SRE Should Know, all published by O'Reilly Media. It's probably fair to say that he knows his stuff. 

Virji is the organizer of the Site Reliability Engineering track at this year's ADDO, which will focus on the fundamental practices of Site Reliability Engineering. The talks will also tackle some more advanced and philosophical topics within the discipline. They should provide useful takeaways for everyone, from those just getting started with SRE to experts in the field.

What is Site Reliability Engineering (SRE)?

Closely related to DevOps, SRE is the practice of using software engineering to automate IT operations tasks such as production systems management, incident response, and even emergency response. Outside of a Site Reliability Engineering framework, systems administrators would perform these tasks manually.

The principles behind DevOps aim to reduce organizational silos while promoting the use of tooling and automation. SRE is entirely in line with those principles: it automates and streamlines operations with the same tooling that developers use to develop and improve software.

SRE's mantra is that using software code to automate the management and oversight of large software systems scales better than manual intervention and is therefore the more sustainable approach.

Ben Treynor Sloss, who established the SRE practice at Google, says, "Site Reliability Engineering is what you get when you treat operations as if it's a software problem." SRE combines principles of systems and software engineering to develop production systems that provide measurable reliability.

What to expect

This track will include multiple speakers, and Virji recommends all of them because he believes each one provides value. He may have a point: a look through the lineup shows that all of the fundamental concepts related to SRE will be discussed, and more.

Topics include SRE’s blameless post-mortem culture and end-to-end reliability, which aims to apply SRE principles to the entire pipeline. Also on the agenda are IT asset tracking (the continuous identification and use of IT resources) and 'any change' management (SRE's focus on all the factors that can disrupt the IT environment). And that is just a teaser. This year's SRE track covers the basic concepts, more advanced topics, and beyond, because Site Reliability Engineering keeps expanding.

As Virji states, "I've been part of the SRE organization at Google since its early days, and the practice of reliable engineering continues to evolve, so there's plenty to learn. New ideas appear, and, in the spirit of science, we find evidence to support them or invalidate them. Even as SRE explores topics such as capacity planning and service delivery, we continue to deepen our understanding of these areas and how to bring measurable reliability to our products."

Clearly, there won't be a shortage of valuable insights in the Site Reliability Engineering track. Have a look at the different speakers, and you're sure to find at least one talk, if not many, to pique your interest.

Register for All Day DevOps (ADDO)

All Day DevOps is a global community of over 75,000 DevOps practitioners and thought leaders offering free learning and information exchanges. Founded in 2016, the community hosts an annual conference, live forums, and ongoing educational experiences online. The 2021 event will feature a great lineup with many more engaging speakers! Register online to participate in the 24-hour live, global event on October 28 featuring six tracks, including CI/CD Continuous Everything, Cultural Transformation, DevSecOps, Government, Modern Infrastructure, and Site Reliability Engineering.