We pick up the second part of David Rensin's 2018 All Day DevOps keynote by diving into how DevOps and SRE fit into the enterprise. If you missed Part One, you can find it here.
Enterprises understand concepts like total cost of ownership (TCO) and return on investment (ROI). So anything that lets the software operate more efficiently, reducing TCO and realizing ROI, will make the enterprise happy.
When there is a reliability problem in the enterprise, there's a fundamental source of friction that we can call "intuition fatigue." How much risk do you accept to fix something quickly?
If you're talking in terms of error budget, however, these decisions depend far less on intuition. If you're within the error budget to deploy a fix, go ahead. If you've spent your error budget, then you have a freeze and you have to wait. Having a system for decision-making like this reduces conflict and lets executives get out of corners they might otherwise be painted into.
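To make the deploy-or-freeze rule concrete, here is a minimal sketch in Python. The `ErrorBudget` class, its field names, and the `release_decision` function are hypothetical names chosen for illustration, not something from the keynote. It assumes the budget is counted in failed requests over a single window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for one service over one window (hypothetical model)."""

    allowed_failures: int   # failures the objective tolerates in the window
    observed_failures: int  # failures seen so far in the window

    @property
    def remaining(self) -> int:
        # Budget left to spend; zero or negative means the budget is gone.
        return self.allowed_failures - self.observed_failures


def release_decision(budget: ErrorBudget) -> str:
    """Return "deploy" while budget remains, otherwise "freeze"."""
    return "deploy" if budget.remaining > 0 else "freeze"


if __name__ == "__main__":
    print(release_decision(ErrorBudget(allowed_failures=100, observed_failures=40)))
    print(release_decision(ErrorBudget(allowed_failures=100, observed_failures=100)))
```

The point of the sketch is that the answer comes from two numbers everyone can see, rather than from whoever argues most forcefully in the room.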
SRE creates decision-making frameworks that span organizations and make things clear. The closer to zero you drive the blast radius of a mistake, the closer to infinity you can make your tolerance of risk.
Enterprises appreciate cost savings, of course, in addition to help with the decision-making process. And these savings aren't just dollars: they're also about agility and opportunity cost.
SRE principles strongly encourage you to reduce complexity, make systems easier to reason about, and allow for more innovation. And with simplicity and more room for innovation, you can take advantage of more market opportunities and do more with fewer human beings needing to oversee your operation.
SRE also helps to quantify and mitigate risk.
A lot of industries have regulatory concerns. This includes financial services, healthcare, and more. Quantifying reliability and risk is hugely important here.
For instance, you can look at common problems and reason about the amount of error budget that they'll consume. If you know that, and how likely things are to occur, you can start to create a prioritized list of things to address, like this:
You can use this tool to generate your own list.
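One way to build such a list by hand is sketched below in Python. This is not the tool mentioned above. It assumes each risk can be summarized by two rough estimates (how often it happens per year and how many minutes of error budget each occurrence consumes) and that multiplying them gives a reasonable ranking score. The `Risk` type, the function names, and the example figures are all made up for illustration.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    name: str
    incidents_per_year: float        # assumed estimate of likelihood
    bad_minutes_per_incident: float  # assumed estimate of budget consumed


def expected_bad_minutes(risk: Risk) -> float:
    """Expected error budget consumed per year by this risk."""
    return risk.incidents_per_year * risk.bad_minutes_per_incident


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Sort risks so the most expensive one (in expected minutes) comes first."""
    return sorted(risks, key=expected_bad_minutes, reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real system.
    risks = [
        Risk("bad config push", incidents_per_year=4, bad_minutes_per_incident=30),
        Risk("expired certificate", incidents_per_year=1, bad_minutes_per_incident=90),
        Risk("database failover", incidents_per_year=2, bad_minutes_per_incident=75),
    ]
    for risk in prioritize(risks):
        print(f"{risk.name}: {expected_bad_minutes(risk):.0f} expected bad minutes/year")
```

Comparing each risk's expected cost with the total budget for the year shows which items are worth engineering time and which you can simply accept.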
SRE can be an easier life than DevOps
One of the great things about DevOps is how flexible and broad it is. But that breadth brings a challenge: there's a lot of variance in how different companies practice DevOps.
So, because SRE is a concrete, opinionated set of practices, it can actually be easier to adopt. You have fewer decisions to make and a clear path. You can thus go faster and reduce the complexity of deciding what to do next.
How to Start with SRE
With all that in mind, let's take a look at how you can get started with SRE in your organization.
Thing 0: Willingness is the thing
Before you go any further, you need to understand that willingness to adopt SRE is an absolutely essential prerequisite. Everyone can do it, but you have to be willing to do it.
Anything less than full willingness will torpedo the effort. But if you’re willing to do it, you can realize success. And you don't need to look anything like Facebook, Google, or LinkedIn to do it.
And that willingness has to happen from the ground up. Top-down executive sponsorship alone will not create success. The people who are responsible for the work, and on whom it depends, must buy in.
1. Do one application first
Don't boil the ocean, and don't try to implement all the things and all of the changes at once. Start small and start discrete.
Don't get overly hung up on the specific definition of "application," per se. Pick something discrete, with a discrete failure domain. Start with that and build on success.
2. Start with the error budget
Once you've got your buy-in and have decided on an application, create an error budget and then stick to it. Lay out things that your customer truly cares about, and then set up your first set of objectives—your guard rails. With that in place, set your error budget policies, which define what you do when you're outside of your budget.
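As a rough illustration of turning an objective into a budget, the Python sketch below converts an availability target into allowed "bad" minutes for a window and reports how much of that budget has been used. The function names, the 30-day default window, and the time-based accounting are assumptions made for the example. Your own objectives may count requests, latency, or something else your customer cares about.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unreliability the objective allows over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_consumed(bad_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the budget already spent; 1.0 or more means it is exhausted."""
    return bad_minutes / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)  # roughly 43.2 minutes over 30 days
    print(f"budget: {budget:.1f} minutes")
    print(f"consumed: {budget_consumed(21.6, 0.999):.0%}")
```

An error budget policy then says what happens as the consumed fraction approaches or passes 100%, for example pausing feature releases until the window rolls over.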
Pick a policy and stick to it.
3. Alerting, monitoring & ops load
Before you start to make changes, audit your monitoring and alerting. More logging and measurement are probably better, but more alerting is probably not. You want a good signal-to-noise ratio.
If you create more efficiency in your current process, you have more room for automation and project implementation. You can do this by focusing on your error budget and creating alerts only when you're burning it.
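A common way to alert only when you're burning budget is to compute a burn rate: the observed error rate divided by the error rate the objective allows. A burn rate of 1 means you would use up exactly the whole budget by the end of the window. The Python sketch below assumes request-based counting. The function names and the default threshold of 1.0 are hypothetical choices for illustration. Real setups usually tune the threshold and the measurement window to their own traffic.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the objective allows."""
    if total_events == 0:
        return 0.0  # no traffic, so nothing is being burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_alert(bad_events: int, total_events: int, slo_target: float,
                 threshold: float = 1.0) -> bool:
    """Alert only when the budget is burning faster than the chosen threshold."""
    return burn_rate(bad_events, total_events, slo_target) > threshold


if __name__ == "__main__":
    # 5 failures in 10,000 requests against a 99.9% objective: under budget pace.
    print(should_alert(5, 10_000, 0.999))
    # 50 failures in 10,000 requests: burning about five times too fast.
    print(should_alert(50, 10_000, 0.999))
```

Tying alerts to burn rate rather than to individual symptoms tends to cut noise, because a page only fires when the customer-facing objective is actually at risk.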
4. Blameless culture
The human is never at fault. It's the system's fault for allowing the human to make a mistake.
This is the philosophy and heuristic that you should use to have a successful SRE culture. You don't blame the people; you always blame the system. That leads you to invest in discovering what happened rather than in figuring out whom to blame, which turns each error into an opportunity to improve your system.
To have a successful journey with SRE, the first thing to bear in mind is that you don't want to try to do everything at once. Start small and build, following the steps in sequence toward incremental progress.
The next thing to understand is that you can do this! If you do it incrementally, celebrate progress, leave yourself on- and off-ramps, and realize that it's okay to stop and come back when you're ready, you will have success.
Missed Dave Rensin’s session, or want to see some other great presentations from October 17? Head over to https://www.alldaydevops.com/live and make sure you’re registered. Then, catch up on what you missed (or re-watch your favorites)!
About the author, Erik Dietrich
Erik Dietrich is a veteran of the software world and has occupied just about every position in it: developer, architect, manager, CIO, and, eventually, independent management and strategy consultant. He’s the cofounder of Hit Subscribe and writes at daedtech.com. Connect with him @daedtech.