
The Unmonitored Failure Domain: Mental Health

Nov 12, 2020 12:03:22 PM By Sylvia Fronczak

Jaime Woo has a passion for communications and dumplings. Even though the dumplings aren’t related to the talk, you can’t mess with someone’s passion. But going back to the topic, mental health is important. And it applies to site reliability engineering (SRE).

In SRE, we accept that we cannot guarantee 100% reliability. Instead, we ask what amount of reliability we actually need and how expensive it is. Likewise, we cannot expect 100% reliability from our people. That’s unrealistic. But we can support them.

As you may have noticed, we’re living in stressful times. Casey Newton recently wrote that “We are not working from home. We are all living at work.” That’s how our context has changed.

All of a sudden, all parts of our lives are crashing together. And while this talk was originally about people working in an office, it matters just as much for distributed teams working from home.

Incidents Have Impacts

We tend to forget the people in our sociotechnical systems. We forget that systems don’t work well without the people. And we forget that those people aren’t interchangeable and that changing people on a team can make a big impact. 

Thinking about how your people are doing is just as important as how your application is doing.

Woo shared some statistics on how people say they are affected by incidents.

This all affects how we live our lives. Incidents have an impact on people. They affect people’s work the next day, and they affect their personal lives. So, we have to start looking at this part of the system.

The Four Ingredients to Stress

Woo has been researching this space for three years. He has a background in biomedical engineering and was a molecular biologist. He’s now an advocate for mindfulness and mental health. But he’s not a doctor. Not even on TV.

So, let’s start by talking about wellness. How do we think about work-related topics like mood, satisfaction, stress, and burnout? But keep in mind that the term “work-related” now has a new definition with our work-from-home life.


The important thing is that work-related incidents aren’t impactful only in the immediate aftermath. In fact, they affect an individual’s performance into the future.

So, here’s something you can use. There’s a recipe for stress. And this completely changed Woo’s understanding of stressful moments.

There are four ingredients to stress, and they aren’t the same for everyone. But a stressful situation will always include at least one of them.

  • Novelty: something you haven’t experienced before
  • Unpredictability: something that we can’t predict shocks us, as it doesn’t fit our mental model
  • Threat to ego: when your competence as a person is called into question
  • Sense of control: when external circumstances prevent you from managing everything, pushing you outside of your comfort zone

This is Woo’s first time presenting at All Day DevOps. So, there’s novelty, there may be unpredictability with technical issues, and there’s a potential for a threat to the ego. 

When stress comes up, consider which of these four is affecting you. Naming it can help you think the situation through and learn to work around it.

Mitigate Risk Factors Rather Than Increase Resilience

How do we normally think about destressing? Woo lists many ideas, like vacation, mindfulness, meditation, and therapy. There’s more, but they all have something in common that we’ll touch on. 

Oh, and there are two additional de-stressors:

  • Beyonce. Woo, though not a doctor, prescribes 10 to 15 minutes of Beyonce a day. 
  • Woo’s dog Taco. He’s a very cute boy.

So, what do all these have in common? Society sends us the wrong message about how to handle stress: Just be stronger. What if I drink more water, sleep better, just do more? That’s putting everything on you.


Does this resonate? We put it all on the people themselves. How are we going to really help our people?

Burnout research shows that situational and organizational factors play a bigger role in the workplace than individual ones, yet we keep focusing on the individual as both the cause and the one responsible. The assumption is that it’s easier and cheaper to change people rather than organizations.

And that’s not right. In fact, it’s easier to change the organization.

Organizational Factors

So, what organizational factors contribute? Imbalanced job design, occupational uncertainty, and lack of value and respect in the workplace all contribute. 

Burnout symptoms include exhaustion, cynicism, ineffectiveness, and chronic negative responses. Are these things that you can relate to or that you’ve felt or seen?

Each of these organizational factors feeds burnout. Work overload occurs when job demands exceed human limits. Lack of control and insufficient rewards, stemming from IT being viewed as a cost center, contribute to it as well.

A breakdown of community and a lack of fairness add to it, too.

According to web developer Denise Yu, “Burnout is, in part, caused by unresolved feedback loops.”

During this time, if you’re just trying to survive and you don’t see things getting better or changing, then you start to burn out. Things aren’t resolving. The situation around you isn’t changing.


Now let’s apply these ideas to the SRE track. How often is feedback happening? Hopefully not quarterly or monthly. We should be checking in much more often. 

This doesn’t mean we need formal pulse surveys every day. But we need to check in and resolve those unresolved feedback loops.


For those unfamiliar with toil, it’s the repetitive, predictable, constant stream of tasks. Like washing dishes. The more people and the more dishes, the more toil. Sure, you could share plates, use bread bowls, etc. But, again, the effort scales with the number of people involved. 

And cleaning the dishes only brings us back to square one. We can then improve this through automation, which in this example works like a dishwasher.

Applying Stress Ingredients to SRE

If we apply the original stress ingredients to this, we can see how some of the SRE work stresses us out. We can look at incidents from that viewpoint.

For an incident, first consider if it was novel. How can we make that better? Can we have practice game days where we learn how to prepare for incidents? By doing so, we can reduce the stress that comes from the novel part.

Next, was the incident unpredictable? If so, how do we make it more predictable? Perhaps we can add chaos engineering. 

Then there’s control. Did you feel like you didn’t have control? What can give you a better sense of control?

Lastly, the ego. Let’s make it safe to ask questions and to call others at 3 a.m. to pull them in if we need them.

The Takeaway

This desire to improve mental health goes both ways.

Your wellbeing and mental health matter. We can figure out how to intentionally create positive effects. So, let’s build systems that include humans.

This session was summarized by Sylvia Fronczak. Sylvia is a software developer who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.