Over the last three years, Google has ramped up a global training program for SREs and done a retrospective on the subject. Here's what they've learned.
History of SRE Education at Google
Google's SRE practice has actually been around since 2003. The founder of SRE at Google, Ben Treynor Sloss, described SRE originally as, "fundamentally, it's what happens when you ask a software engineer to design an operations function."
For the next decade, Google's SRE training program could be described as "grokking SRE the hard way." There was no consistency in this training, and it was a grassroots, fragmented effort.
But in 2014, Google formed the awesomely titled "secret alliance" to systematically improve SRE at Google. Eventually, they formed a small, dedicated SRE EDU team in 2015. This team's mission is "consistent and reliable education for all Google SREs."
This newly minted team then launched a production orientation, and began to build experience and talk about scaling SRE education programs. As they matured, they developed an SRE on-call program and then later launched a book about training SREs. More recently, they've started to offer a series of courses called SRE EDU week, which they fed back into a more refined version of their SRE EDU orientation program.
All of this is fodder and fuel for the current talk, titled "Strapping Jetpacks to Unicorns" (along with, of course, a lot of built-in knowledge and learning).
Why Learning Matters
So why do all of this? Why does SRE training and education matter in the first place?
Building trust is important, and that's no small feat for people learning to be SREs. This is not a skill set that you can learn in school anywhere, so this training serves a critically underserved area of engineering. Developing expert-level knowledge here is both critical and challenging.
Setting SREs Up for Success
Setting SREs up for success is thus not an easy task. Here are some tips learned from hard-won experience with SRE EDU.
"SRE" Your SRE Training
Interestingly enough, you can apply a lot of the principles of site reliability engineering to the training of the same. To help you unpack that mentally, let's take a look at some examples of how this works.
First, recognize that knowledge about distributed systems is, not surprisingly, distributed. Making the SRE training an isolated concern will thus not work, and it's important to have a small, core training team, and to scale that team's impact using volunteers throughout the organization.
You also have to define service level objectives (SLOs). What's important to your team? The SRE trainers focus on coverage response time to students and to volunteers.
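To make the idea of an SLO for a training team a bit more concrete, here is a minimal sketch in Python. The 24-hour threshold, the 95% target, and the function names are all assumptions made up for illustration; the talk does not say what numbers or tooling the SRE EDU team actually uses.

```python
from datetime import timedelta

# Hypothetical SLO: 95% of student/volunteer questions answered within 24 hours.
# Both numbers are illustrative assumptions, not figures from the talk.
SLO_TARGET = 0.95
SLO_THRESHOLD = timedelta(hours=24)


def slo_compliance(response_times, threshold=SLO_THRESHOLD):
    """Return the fraction of responses delivered at or under the threshold."""
    if not response_times:
        return 1.0  # no requests means nothing was missed
    within = sum(1 for t in response_times if t <= threshold)
    return within / len(response_times)


def meets_slo(response_times, target=SLO_TARGET, threshold=SLO_THRESHOLD):
    """Return True when the measured compliance reaches the target."""
    return slo_compliance(response_times, threshold) >= target


# Example data: hours taken to answer four questions.
sample = [timedelta(hours=h) for h in (2, 5, 30, 8)]
```

With this sample, three of the four responses land inside the threshold, so the team would fall short of the assumed target and know where to focus next.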
The more you scale, the more important operational concerns become. So, focus on smooth operations. Establish training cadence, have canonical materials, increase your "bus-factor" for critical tasks and automate toil where possible.
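One rough way to keep an eye on the "bus-factor" is to count how many different people can perform each critical task. The sketch below is only an illustration: the task names, the people, and the minimum of two are hypothetical, not details from the SRE EDU program.

```python
def bus_factor(task_owners):
    """Map each task to the number of distinct people able to perform it."""
    return {task: len(set(people)) for task, people in task_owners.items()}


def at_risk(task_owners, minimum=2):
    """Return the tasks (sorted by name) that fewer than `minimum` people can do."""
    factors = bus_factor(task_owners)
    return sorted(task for task, count in factors.items() if count < minimum)


# Hypothetical training tasks and the people who know how to do them.
tasks = {
    "schedule_orientation": ["alice"],
    "update_slides": ["alice", "bob"],
    "run_hands_on_lab": ["bob", "carol", "bob"],  # duplicates count once
}
```

Here `at_risk(tasks)` flags only `schedule_orientation`, which would be the first candidate for cross-training or automation.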
Remember that "it's about the culture and not the tools." Imprint culture early, and have volunteers model the behavior that you want the SREs to exhibit.
Heroism tends to mask problems in the short term and create bigger ones in the long term. So don't be a hero. Take the emotion out of saying no to things that create toil, and push back.
You also need to ensure observability and to implement monitoring. You can't improve what you can't measure, so launch and iterate based on observations, and practice on-call responsibility to stay sharp.
A Case Study About Making Things More Hands On
By way of a case study, consider an orientation for SREs for their nuclear production. The team was happy with their initial scores in the response surveys, but they also thought that they could do better. So they took a deep dive into comments.
The most common request/feedback? Make it more hands on.
The loud and clear message was that they didn't want a lecture series, but rather learning-by-doing. So the SRE EDU team completely overhauled their materials, adding just enough lecture material to set hands-on training up for success.
As a result, the next generation of SREs going through the training saw significant increases in self-reported confidence and likelihood of recommendation of the program. People were much happier following hands-on training.
The Key Takeaways from Strapping Jetpacks to Unicorns
Here are a few key takeaways from the SRE EDU experience:
In the end, it may seem counter-intuitive that you could apply the principles required for Google to operate at scale to a training program, but it really makes sense in a very elegant way. Just as proactively monitoring complex, interconnected systems is a challenge that requires constant learning, growth, and incorporation of feedback, so does educating and training complex organizations of humans.
***
Missed Jennifer Petoff’s session, or want to see some other great presentations from October 17? Head over to https://www.alldaydevops.com/live and make sure you’re registered. Then, catch up on what you missed (or re-watch your favorites)!
About the author, Erik Dietrich
Erik Dietrich is a veteran of the software world and has occupied just about every position in it: developer, architect, manager, CIO, and, eventually, independent management and strategy consultant. He’s the cofounder of Hit Subscribe and writes at daedtech.com. Connect with him @daedtech.