
What Is SRE And Why Every DevOps Professional Should Care - Part 1

Oct 17, 2018 4:25:37 PM By Erik Dietrich

A lot of people have heard the term site reliability engineering (SRE). And they often wonder how this relates to the wider DevOps field. Are SREs and people practicing DevOps at odds? Do we have to choose between DevOps and SRE?

Not at all.

Consider the five principles of DevOps.

  • Reduce organizational silos.
  • Accept failure as normal.
  • Implement gradual changes.
  • Leverage tooling and automation.
  • Measure everything.

Let's now take a look at a couple of core principles of SRE.

1. Reliability is the most important feature

There is no such thing as a high-value system that has no users. As a result, reliability is the most important feature of a system. It has to fulfill its promise to users.

Users measure reliability for themselves.

If your system looks fine to you, that's great. But it will be cold comfort to a user having a problem or experiencing an outage. Users judge reliability by their own experience, and they react negatively when that experience is poor.

So the only perspective that truly matters is that of the user.

2. 100% is a bad goal

Philosophically speaking, you can't name any system humans have built that has been 100% reliable over a non-trivial amount of time. It's an impossible goal.

In fact, you can't name a 100% reliable system in nature. Whether it's DNA replication or anything else that drives the natural world, things go wrong.

But it also turns out not to matter. Users don't need 100% reliability, and they never have. There are a lot of things that stand between users and your system, and it’s a certainty that those things are not perfectly reliable.

The result? Your users will not notice if you are more reliable than the systems between you and them. This marginal reliability is wasted because users never perceive it.
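One rough way to see this is to treat the user's path and your service as a chain in which every link has to work. The sketch below assumes the links fail independently, which is a simplification, and every figure in it is hypothetical.

```python
from math import prod


def end_to_end_availability(availabilities):
    """Availability of a serial chain, assuming independent failures."""
    return prod(availabilities)


# Hypothetical figure: everything between the user and you is up 99.9% of the time.
path = 0.999

# Compare what the user perceives as the service itself gets more reliable.
for service in (0.999, 0.9999, 0.99999):
    perceived = end_to_end_availability([path, service])
    print(f"service {service:.5f} -> user perceives {perceived:.5f}")
```

Under these assumptions the perceived figure can never exceed the availability of the path, however many nines the service adds.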

And, finally, consider a perverse incentive. Because 100% is impossible, if you task a team with delivering it, they will eventually dissemble and fudge the truth. They have to, because their charter is impossible for them to achieve.

You have room to offer very high availability without getting to 100%. 99.999% availability gives you about five minutes of downtime per year to play with, and most systems don't need anywhere near that level of availability.
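The five-minute figure comes from simple arithmetic. Here is a minimal sketch, assuming a 365-day year and availability measured purely as uptime; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Each extra "nine" cuts the allowed downtime by a factor of ten.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

For 99.999% this works out to roughly 5.3 minutes per year, and for 99.9% to roughly 8.8 hours.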

There's a term for this wiggle room. It's called an error budget. You don't want to overspend your error budget and violate an SLA. But, interestingly enough, you don't want to underspend it either, because that means you're over-investing resources with diminishing returns.

Teams that become aware of this can set guard rails against failing to live up to expectations, but they can also set guard rails against being wasteful.
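Guard rails on both sides might look something like the sketch below. It assumes reliability is measured as the share of successful requests over one window. The function name, the labels, and the 25% underspend threshold are all hypothetical, not a standard.

```python
def budget_status(slo_percent, total_requests, failed_requests,
                  underspend_threshold=0.25):
    """Classify error-budget usage for one measurement window.

    underspend_threshold is a hypothetical guard rail: spending less than
    this fraction of the budget hints at over-investment in reliability.
    """
    # Number of failed requests the target allows in this window.
    budget = total_requests * (1 - slo_percent / 100)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")

    spent = failed_requests / budget  # fraction of the budget consumed
    if spent > 1:
        return "overspent"
    if spent < underspend_threshold:
        return "underspent"
    return "on track"


# A 99.9% target over one million requests allows about 1,000 failures.
print(budget_status(99.9, 1_000_000, 1_500))
print(budget_status(99.9, 1_000_000, 600))
print(budget_status(99.9, 1_000_000, 100))
```

What a team does with each label is a policy decision. One option is to slow down releases when overspent and to ship faster or take more risk when underspent.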

The practices of SRE

Having established the two core principles, we can now understand some important SRE practices with context.

  1. Metrics and monitoring. Are we measuring and alerting on the right things?
  2. Capacity planning. Forecast and measure performance.
  3. Change management. It's important to have a consistent policy for change management so that you can automate it.
  4. Emergency response. SREs carry pagers and go on call not because it's necessary to guard the system and firefight but rather to learn what needs to be automated.
  5. Culture. Specifically, manage "toil" (work you could automate but have chosen not to) and ensure blamelessness.

So why not both? SRE implements DevOps

With these SRE practices in mind, consider that SRE is an implementation of the DevOps principles. If the DevOps principles were a class, SRE would be an instance of that class.

  • Reduce organizational silos by sharing ownership with error budgets.
  • Accept failure as normal by allocating error budgets and holding blameless postmortems.
  • Implement gradual changes by reducing the cost of failure.
  • Leverage tooling and automation to automate common use cases.
  • Measure everything, including reliability and the amount of toil you're incurring.
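Taken literally, the programming analogy could be sketched as follows. DevOps is modeled as an abstract set of principles and SRE as one concrete way of fulfilling them. The method names are invented for illustration, and the return values simply restate the list above.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps principles: what to do, not how to do it."""

    @abstractmethod
    def reduce_silos(self): ...

    @abstractmethod
    def accept_failure(self): ...

    @abstractmethod
    def implement_gradual_changes(self): ...

    @abstractmethod
    def leverage_automation(self): ...

    @abstractmethod
    def measure_everything(self): ...


class SRE(DevOps):
    """One concrete way of putting the principles into practice."""

    def reduce_silos(self):
        return "share ownership through error budgets"

    def accept_failure(self):
        return "allocate error budgets and hold blameless postmortems"

    def implement_gradual_changes(self):
        return "reduce the cost of failure"

    def leverage_automation(self):
        return "automate common use cases"

    def measure_everything(self):
        return "track reliability and toil"


team = SRE()
print(team.reduce_silos())
```

The abstract class cannot be instantiated on its own, which mirrors the idea that the principles only become practice through an implementation such as SRE.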

Think of SRE and DevOps as "best friends."

So the question then becomes, "How do I adopt these within my organization?" Well, let's start with the hardest case first. If you can adopt them within the enterprise, in a conservative context, then you can do it anywhere.

See part 2 for how we do this in the enterprise...

***

Missed Dave Rensin’s session, or want to see some other great presentations from October 17? Head over to https://www.alldaydevops.com/live and make sure you’re registered. Then, catch up on what you missed (or re-watch your favorites)!

About the author, Erik Dietrich

Erik Dietrich is a veteran of the software world and has occupied just about every position in it: developer, architect, manager, CIO, and, eventually, independent management and strategy consultant. He’s the cofounder of Hit Subscribe and writes at daedtech.com. Connect with him @daedtech.