Over the last three years, Google has ramped up a global training program for SREs and done a retrospective on the subject. Here's what they've learned.
History of SRE Education at Google
Google's SRE practice has actually been around since 2003. The founder of SRE at Google, Ben Treynor Sloss, described SRE originally as, "fundamentally, it's what happens when you ask a software engineer to design an operations function."
For the next decade, Google's SRE training program could be described as "grokking SRE the hard way." There was no consistency in this training, and it was a grassroots, fragmented effort.
But in 2014, Google formed the awesomely titled "secret alliance" to systematically improve SRE at Google. Eventually, they formed a small, dedicated SRE EDU team in 2015. This team's mission is "consistent and reliable education for all Google SREs."
This newly minted team then launched a production orientation, and began to build experience and talk about scaling SRE education programs. As they matured, they developed an SRE on-call program and then later launched a book about training SREs. More recently, they've started to offer a series of courses called SRE EDU week, which they fed back into a more refined version of their SRE EDU orientation program.
All of this, along with a lot of built-in knowledge and learning, is fodder and fuel for the current talk, titled "Strapping Jetpacks to Unicorns."
Why Learning Matters
So why do all of this? Why does SRE training and education matter in the first place?
Building trust is important, and that's no small feat for people learning to be SREs. This is not a skill set that you can learn in school anywhere, so this training serves a critically underserved area of engineering. Developing expert-level knowledge here is both critical and challenging.
Setting SREs Up for Success
Setting SREs up for success is thus not an easy task. Here are some tips learned from hard-won experience with SRE EDU.
- It's important to have concrete, sequential learning experiences so that students understand what they should be able to do and what learning prerequisites there are. At scale, it's a challenge to keep this material and these dependencies up to date.
- Creating excellent reverse engineers is also critically important. This demands mentors, a starter project, and a queue of interesting bugs for them to learn on. As your organization and practice grow, you'll need to pay attention to training and controlled scenarios.
- To encourage newbies, review and celebrate postmortems with them. It's important to instill a culture of blamelessness and do regular review in production meetings. For scaling this up, encourage SREs to attend ops reviews and to have "show and tells" in classes.
- Group role playing of disaster scenarios is enormously helpful. Tee up guided scenarios and have SRE trainees review what to do with a mentor. It's important that the trainee be universally viewed as a volunteer here, and not a victim.
- Beyond role playing, you also need to break and fix real things. The best path to doing this is to shadow people that are on call, but to scale it significantly, it's important to have sandbox scenarios where breaking and fixing is safe.
- Ride shotgun as much as possible when it comes to trainees. Mentoring and shadowing are the essence of this practice. To scale this up, Google encourages mutual team visitations and rotation programs.
- The most philosophical and important tip is that the best education is continuous education. Learning does not stop after a few weeks on the job. So make sure to give talks, invest in documentation, and consider learning approaches at scale, like a drip email campaign or cross-functional "learning weeks" to break down organizational silos.
"SRE" Your SRE Training
Interestingly enough, you can apply a lot of the principles of site reliability engineering to the training of the same. To help you unpack that mentally, let's take a look at some examples of how this works.
First, recognize that knowledge about distributed systems is, not surprisingly, distributed. Making the SRE training an isolated concern will thus not work, and it's important to have a small, core training team, and to scale that team's impact using volunteers throughout the organization.
You also have to define service level objectives (SLOs). What's important to your team? The SRE trainers focus on coverage response time to students and to volunteers.
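One way to make an objective like this concrete is to write it down as a measurable target. The sketch below assumes, purely for illustration, that the SLO takes the form "a given fraction of student questions get a reply within a given number of hours." The class name, function names, thresholds, and sample data are all hypothetical and do not come from Google's SRE EDU team.

```python
from dataclasses import dataclass


@dataclass
class ResponseTimeSlo:
    """Hypothetical SLO: a target fraction of requests answered within a limit."""
    max_hours: float        # assumed time limit for a reply
    target_fraction: float  # assumed fraction of requests that must meet it


def fraction_within(response_hours, max_hours):
    """Return the fraction of responses that arrived within max_hours."""
    if not response_hours:
        return 1.0  # no requests, so nothing violated the objective
    met = sum(1 for hours in response_hours if hours <= max_hours)
    return met / len(response_hours)


def slo_met(response_hours, slo):
    """Return True when the observed fraction reaches the SLO target."""
    return fraction_within(response_hours, slo.max_hours) >= slo.target_fraction


# Illustrative numbers only: reply within 24 hours for 90% of questions.
student_slo = ResponseTimeSlo(max_hours=24.0, target_fraction=0.9)
sample_hours = [2.0, 5.5, 30.0, 12.0, 1.0]  # made-up response times in hours

if __name__ == "__main__":
    print("fraction within limit:", fraction_within(sample_hours, student_slo.max_hours))
    print("SLO met:", slo_met(sample_hours, student_slo))
```

Writing the target down this way turns "are we responsive enough?" into a yes-or-no check that a small training team can track over time, and the same shape would work for a separate volunteer-facing target.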
The more you scale, the more important operational concerns become. So, focus on smooth operations. Establish a training cadence, have canonical materials, increase your "bus-factor" for critical tasks, and automate toil where possible.
Remember that "it's about the culture and not the tools." Imprint culture early, and have volunteers model the behavior that you want the SREs to exhibit.
Heroism tends to mask problems in the short term and create bigger ones in the long term. So don't be a hero. Take the emotion out of saying no to things that create toil, and push back.
You also need to ensure observability and to implement monitoring. You can't improve what you can't measure, so launch and iterate based on observations, and practice on-call responsibility to stay sharp.
A Case Study About Making Things More Hands On
By way of a case study, consider the team's production orientation for SREs. The team was happy with their initial scores in the response surveys, but they also thought that they could do better. So they took a deep dive into the comments.
The most common request/feedback? Make it more hands on.
The loud and clear message was that they didn't want a lecture series, but rather learning-by-doing. So the SRE EDU team completely overhauled their materials, adding just enough lecture material to set hands-on training up for success.
As a result, the next generation of SREs going through the training saw significant increases in self-reported confidence and likelihood of recommendation of the program. People were much happier following hands-on training.
The Key Takeaways from Strapping Jetpacks to Unicorns
Here are a few key takeaways from the SRE EDU experience:
- Training is important, so don't leave it to chance.
- Apply SRE principles to your SRE training to make it more globally reliable and consistent.
- Monitor, observe, and improve continuously.
In the end, it may seem counter-intuitive that you could apply the principles required for Google to operate at scale to a training program, but it really makes sense in a very elegant way. Just as proactively monitoring complex, interconnected systems is a challenge that requires constant learning, growth, and incorporation of feedback, so is educating and training complex organizations of humans.
Missed Jennifer Petoff’s session, or want to see some other great presentations from October 17? Head over to https://www.alldaydevops.com/live and make sure you’re registered. Then, catch up on what you missed (or re-watch your favorites)!
About the author, Erik Dietrich
Erik Dietrich is a veteran of the software world and has occupied just about every position in it: developer, architect, manager, CIO, and, eventually, independent management and strategy consultant. He’s the cofounder of Hit Subscribe and writes at daedtech.com. Connect with him @daedtech.