Building Secure & Reliable Systems: A Conversation with the Authors of Google's SRE Book

Sep 7, 2020 3:59:00 PM By Sylvia Fronczak

The panel session “Building Secure & Reliable Systems: A Conversation with the Authors” covers the latest Google SRE book from O’Reilly: Building Secure & Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. This new book follows Google’s two earlier books about SRE: Site Reliability Engineering and The Site Reliability Workbook.

Our panelists include Betsy Beyer, Dave Rensin, Paul Blankinship, and Piotr Lewandowski. Below, we’ve compiled the questions asked during this panel and a summary of each panelist’s response.

Why This Book Matters

What was the need for this latest book?

Betsy Beyer: As a lot of you know, we’ve put out two other SRE-focused books, and we knew there was an appetite for more information on SRE and on how Google implements security. People want to know how security and reliability go together. Based on what we’ve learned, security and reliability go hand in hand to create the building blocks of a reliable system.

Why Security and Reliability Together?

What is so special about security and reliability that made you want to write a book about it? There are other books on security, and there are other books on reliability. What was the impetus to pair them together in this case?

Piotr Lewandowski: I’ve been watching this space for the better part of a decade. Usually, areas that fall between organizational boundaries get overlooked and don’t get enough attention. Those weakest spots reveal themselves, which presents opportunities to improve security.

One of the main things the authors want to share is that you need both security and reliability, and you can’t have one without the other. Everyone is responsible for both, but SREs often see themselves as the last line of defense for both security and reliability. Why is that? SREs need to keep up to date with infrastructure changes, and infrastructure changes constantly: a component that passed a security audit two years ago might not pass one today. SREs are in the best position to identify those gaps.

Why So Many Animals in the Zoo?

There are more than 100 authors in this book. Why so many people? Why so many animals in the zoo?

Paul Blankinship: At Google, we have a lot of good people, and it’s easy to ask people to contribute to these topics. We’re looking at things from a design perspective. So, in order to bring in the best people for each chapter, we brought in the people that we thought could speak the most comprehensively to each topic.

Main Takeaways for Readers

Pick a small number of things you want readers to take away when they read the book.

Betsy Beyer: The basic premise is that we really want people to start shifting their approach to security and reliability to the left. Both need to be considered as early as possible in the life cycle. If you tack them on at the end, they won’t be as effective, and they will also be more expensive.

Piotr Lewandowski: In addition to shifting security and reliability left, people need to think about the organization and how to build reliability and security into it. You need to see the organization as a system to understand how to integrate security and reliability, and you need to make sure your culture reflects that the work is everyone’s responsibility in order to protect your customers and the data they have trusted you with.

Paul Blankinship: It’s sometimes hard for small startups to spend money on security and reliability. In part, this is because you can’t always clearly see the benefits of improved security. You can’t demonstrate what didn’t happen, or wouldn’t have happened without the security you put into place. There are lots of arguments in this book for adding in security early. The book shows that security and reliability are important to the architecture of systems, and not easily added once your system is built out.

Dave Rensin: There’s a point a lot of people misunderstand. There’s a holy trinity that you need to consider: scalability, reliability, and multi-tenancy. 

Are There Things You Wished They’d Included But Didn’t?

What do you wish you had been able to add to the book?

Betsy Beyer: There were a lot of case studies that we wanted to add, but we couldn’t publish a 1,200-page book. As is the case with the last couple of books, we still have extra content. We’ll find ways to publish that content. It’s not the last you’ve heard from Google on these topics.

Paul Blankinship: We wanted this book to be helpful for companies of all sizes. There were a lot of stories that we could tell that were specific to Google, but that might not be generally helpful. So although these stories and anecdotes were interesting, they didn’t make it into the book.

Betsy Beyer: We should have added an addendum of interesting but useless anecdotes?

Piotr Lewandowski: The problem was that we wanted to create a book that applied to a broad range of companies—from startups to large companies. So we had to axe a few pieces that didn’t apply to small companies.

“Well, I Heard at Google That…”

Everyone’s mentioned it to one degree or another: Whenever I talk to people about reliability, and particularly about Google SRE, two times out of 10, three times out of 10, someone starts the question with “Well, I heard at Google that…,” and it’s something weird like “I heard you sacrifice small animals for the safety of your data center” or something like that. I’m curious: Are there misconceptions of the way you do security or reliability at scale at Google that you want to address head-on here?

Paul Blankinship: Looking at Google from the outside, it seems like a big monolithic company with lots of very smart people doing very smart things. In reality, Google employees are people, just like the employees of other companies. People make mistakes. The information in the book isn’t valuable because we have the brightest people in the world, but because we’ve tried a lot of things. One thing that’s really impressive about Google is the culture of blameless postmortems. It’s a great way to learn and grow from mistakes.

Dave Rensin: We do things wrong to figure out how to do things right.

Piotr Lewandowski: One misconception is that we write about blameless postmortems but don’t really follow that rule. But we really do follow it. Sometimes there’s a culture mismatch when people join from different backgrounds, so blameless postmortems are one of the first concepts we introduce. We want to learn from our mistakes in a constructive and sustainable way, and therefore we write a lot of postmortems, even for smaller, internal incidents and not just the large ones that hit the press.

Dave Rensin: Our finance team even writes postmortems, looking at what went well and what could be improved. They use the SRE postmortem template and even file bugs!

Betsy Beyer: Google is not a monolith. Not all teams approach reliability in the same way. Tools and methodologies vary between systems and teams. There are a lot of different effective ways to accomplish reliability goals. 

Dave Rensin: Many organizations have separate dev and separate security groups.

Conclusion

A lot of work and decades of experience in security and reliability went into this book. A wealth of knowledge from industry experts can be found in just a single chapter. For recommendations from each panelist on which chapter they suggest you read, watch the panel.

This post was written by Sylvia Fronczak. Sylvia is a software developer who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.

Photo by Rajeshwar Bachu