Embracing SRE as an incomplete approach to reliability

Model error is a fact of life. We operate in a world of stunning complexity, with tens of millions of people producing software and billions consuming it. We build on top of rapidly evolving hardware; today a top-of-the-line cell phone has more resources than a world-class supercomputer did 20 years ago. To iterate quickly in the face of this complexity, we have no choice but to base our decisions on imperfect summaries, abstractions and models.

When we construct models of reliability for our sociotechnical systems, we tend to play shell games with our complexity and ambiguity, hiding them wherever they seem least offensive. The Shingo Prize-winning Accelerate “asked respondents how long it generally takes to restore service when a service incident occurs” and found that lower mean time to restore (MTTR) predicted (better) organizational performance. But when does a disruption become an “incident”? And when is service “restored” [1]? Complex systems in the real world inevitably run degraded and with latent failures present.

Site reliability engineering—as defined by Benjamin Treynor Sloss, who has evolved the discipline at Google since joining in 2003—has a core tenet of “Pursuing Maximum Change Velocity Without Violating a Service’s SLO” in order to resolve a “structural conflict between pace of innovation and product stability.” There are many interesting threads to pull here. Two that I find fascinating, but which, sadly, are outside the scope of this post, are: (1) other ways in which the conflict between innovation and product stability might be solved (perhaps add stability to product manager performance reviews?); and (2) the wide spectrum of ways in which SRE is implemented in the wild [2].

What I believe is more impactful is that (Sloss’ definition of) SRE dodges substantial complexity by (implicitly?) arguing that a well-selected, well-monitored collection of SLOs that are all green is sufficient for a system to be reliable [3]. This misconception puts us at risk of surrogation: mistaking our map (SLOs) for our territory (reliability). The fundamental insufficiency is that SLOs cannot protect us from dark debt: “unappreciated, subtle interactions between tenuously connected, distant parts of the system”.

We see such dark debt in the “atmospheric” conditions that supported the perfect storm that formed in June 2019, when during GCNET-19009, multiple Google Cloud regions were disconnected from the Internet for over three and a half hours—more than 50 times the monthly downtime budget implied by the service’s 99.99% availability SLA [4]:

Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage: firstly, network control plane jobs and their supporting infrastructure in the impacted regions were configured to be stopped in the face of a maintenance event. Secondly, the multiple instances of cluster management software running the network control plane were marked as eligible for inclusion in a particular, relatively rare maintenance event type. Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.
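For a sense of scale, the downtime arithmetic behind a 99.99% monthly availability target is easy to sketch. This is a rough illustration only; the 30-day month and the 3.5-hour outage duration are approximations, not figures from the incident report:

```python
# Rough sketch: the monthly downtime budget implied by an availability target,
# and how a multi-hour outage compares against it.

def monthly_error_budget_minutes(availability: float, days_in_month: int = 30) -> float:
    """Minutes of allowed downtime per month at a given availability target."""
    return days_in_month * 24 * 60 * (1 - availability)

budget = monthly_error_budget_minutes(0.9999)  # 99.99% -> ~4.3 minutes/month
outage_minutes = 3.5 * 60                      # "over three and a half hours"
print(f"budget: {budget:.2f} min/month, outage: ~{outage_minutes / budget:.0f}x the budget")
```

A roughly 4.3-minute monthly budget set against a 3.5-hour outage works out to about 50 times the allowance, consistent with the “more than 50 times” framing above.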

In my view, then, SLOs are a tool we can use to align our organizations to avoid some especially painful ways we might make our users unhappy, like being down for too long or having a too-high tail latency of important endpoints. And indeed, many organizations have adopted SLOs and found that managing the reliability of their systems has become easier (example). Unfortunately, healthy SLOs, no matter their quality, are not enough to certify that the functionality we provide to our users will be safe or reliable in the future. As Cook says, catastrophe is always just around the corner.

Unsurprisingly, there’s no silver bullet here (although you can always consider improving your postmortems). However, a clearer view of the nature of reliability supports more informed decisions on how to balance tradeoffs, resources, and prioritization as we seek to innovate quickly and reliably. I hope to learn more about how approaches to robustness (defense against known failure modes), such as SLOs, and resilience (defense against unknown failure modes) can compose to improve overall reliability later this week at REdeploy!

Thanks to Rein Henrichs for feedback on this post.

[1] John Allspaw has a pair of interesting blog posts on these topics: Moving Past Shallow Incident Data and Incidents As We Imagine Them Versus How They Actually Happen

[2] Seeking SRE is a great source on the SRE multitudes. Google also has a blog post on the topic.

[3] To be precise, these must be achieved with low toil, but toil is orthogonal to our discussion.

[4] This SLA, based on which Google issues refunds to customers, is almost certainly stricter than the internal SLO used by SREs. 

Quick thoughts on short papers: A typology of organisational cultures

In this six-page paper from 2004 (based on work dating back to 1988), R. Westrum proposes that organizational cultures approach information flow in one of three ways:

The first is a preoccupation with personal power, needs, and glory. The second is a preoccupation with rules, positions, and departmental turf. The third is a concentration on the mission itself, as opposed to a concentration on persons or positions. I call these, respectively, pathological, bureaucratic, and generative patterns.

Westrum provides a superb table of examples, which definitely spoke to me the first time I read the paper.

This typology has been widely adopted in the DevOps literature. Accelerate reports that generative culture predicts (better) software delivery performance, job satisfaction, and organizational performance (p209).

Beyond the typology itself, I appreciate the clear discussion of how leadership’s preferences influence culture:

The underlying idea is that leaders, by their preoccupations, shape a unit’s culture. Through their symbolic actions, as well as rewards and punishments, leaders communicate what they feel is important. These preferences then become the preoccupation of the organisation’s workforce, because rewards, punishments, and resources follow the leader’s preferences. Those who align with the preferences will be rewarded, and those who do not will be set aside. Most long time organisation members instinctively know how to read the signs of the times and those who do not soon get expensive lessons.

A decade later, Mikey Dickerson suggests that things remain the same in his essay in Seeking SRE:

So, these processes determine the long-term behavior of your company and every system you manage. What do they reward? Ignore what the company says it rewards; instead, look at the list of who was promoted. Behaviors associated with these people will be emulated. Behaviors associated with those left behind will not. This evolutionary pressure will overwhelm any stated intentions of the company leaders.

When all you have is a hammer, everything is a nail. Nonetheless, I feel like the most painful organizational tensions I’ve experienced in my career all had cultural misalignment as a strongly contributing factor. I find these paragraphs a powerful tool for detecting organizational dysfunction.

Finally, there is discussion of bureaucratic culture being the “default value”. This leads to a line of inquiry too long for a QTSP: what other high-impact defaults exist in (software) companies, and where do they come from? For example, how did the Five Whys make it from the Toyota Production System into seemingly everyone’s postmortem templates?

Thanks to Randall Koutnik for recent discussions on Westrum’s typology!

How Complex Systems Fail Strikes Back

Poster by Olly Moss

When I wrote up the QTSP on How Complex Systems Fail two weeks ago, I forgot to include other interesting reviews of the paper.

The first, unsurprisingly, is from John Allspaw in 2009 — this is before Allspaw coined “blameless postmortem”. Allspaw ~~rejects the paper and embraces strict adherence to the Toyota Production System~~ embraces the paper:

I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure.

It is interesting to see “early Allspaw’s” view on topics like the 5 Whys:

I believe that even a rudimentary process of “5 Whys” has value. (Update: I did when I first wrote this. Now, I do not. ) But at the same time, I also think that there is something in the spirit of this paragraph, which is that there is a danger in standing behind a single underlying cause when there are systemic failures involved.

There are probably many worse ways to spend your time than walking parts of the “Allspaw trail”, even a decade removed.

Six years later, the don of paper blogging, Adrian Colyer of The Morning Paper fame, picks up the mantle:

This is a wonderfully short and easy to read paper looking at how complex systems fail – it’s written by a Doctor (MD) in the context of systems of patient care, but that makes it all the more fun to translate the lessons into complex IT systems, including their human operator components.

I think about Cook’s paper often. Recently I’ve been thinking about #18, failure free operations require experience with failure. This is seemingly a paradox — we want to reduce failure, which requires experience with failure. Where does this experience come from once failure is reduced?

Some interesting answers might be learning-focused postmortems, where we can learn from failure indirectly, and chaos engineering experiments, where we can learn from failure in controlled conditions. The “resilience in software” community’s emphasis on these domains begins to make sense…

Allspaw Pokémon: gotta catch ‘em all, video edition

John Allspaw is the godfather of resilience engineering in software, dating back to his introducing the term and practice of “blameless postmortems” in a 2012 Etsy blog post. Allspaw is a prolific speaker, but there has never been a full timeline for the decades-long “incident” of his public speaking career… until now. I am likely missing some talks; please send me any additions or errors. I’ve skipped videos under fifteen minutes in length and those behind paywalls.

  1. 2009-06-23 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr (w/ Paul Hammond) from Velocity 2009
  2. 2011-06-28 Building Resilience in Web Development and Operations from USI 2011
  3. 2011-11-09 Anticipation: What Could Possibly Go Wrong? from Velocity EU 2011
  4. 2011-11-16 Outages, Post Mortems, and Human Error 101 from Etsy Tech Talk
  5. 2012-04-23 Interview from GOTO Chicago 2013
  6. 2012-06-26 Stronger and Faster (w/ Steve Souders) from Velocity 2012
  7. 2012-09-25 Interview with Jez Humble
  8. 2013-05-14 Owning Attention: Alert Design Considerations from Etsy Tech Talk
  9. 2013-11-13 AMA from Velocity EU 2013
  10. 2013-11-22 Fireside Chat with Andrew Clay Shafer
  11. 2014-01-29 An Evening with John Allspaw on Development and Deployment at Etsy from Data Council
  12. 2014-06-24 Interview from Velocity 2014
  13. 2014-06-24 PostMortem Facilitation: Theory and Practice of “New View” Debriefings Parts One, Two, Three, Four from Velocity 2014
  14. 2015-05-28 Seeing the Invisible: Discovering Operations Expertise from Velocity 2015
  15. 2016-05-25 Common Ground and Coordination in Joint Activity from Papers We Love
  16. 2017-11-15 How Your Systems Keep Running Day After Day: Resilience Engineering as DevOps from DOES 2017
  17. 2018-03-20 Poised To Adapt: Continuous Delivery’s Relationship To Resilience Engineering from PipelineConf 2018
  18. 2018-04-24 Taking Human Performance Seriously in Software from DevOpsDays Seattle 2018
  19. 2018-08-16 In the Center of the Cyclone: Finding Sources of Resilience from Redeploy 2018
  20. 2018-09-12 Interview from PagerDuty Summit 2018
  21. 2018-09-12 Incidents as we Imagine Them Versus How They Actually Are from PagerDuty Summit 2018
  22. 2018-10-15 Problem Detection from Papers We Love
  23. 2019-02-11 Video AMA from PagerDuty 2019
  24. 2019-06-03 Taking Human Performance Seriously In Software from Monitorama PDX 2019
  25. 2019-07-08 Resilience Engineering: The What and How from DevOpsDays DC 2019

Bonus: podcasts

  1. 2016-02-13 PAPod 57 – System Reliability – John Allspaw from PreAccident Investigation Podcast
  2. 2017-03-07 John Allspaw on System Failures: Preventing, Responding, and Learning From Failure from SE-Radio
  3. 2018-09-05 096: Resilience Engineering with John Allspaw from Greater Than Code

Practical postmortem performance, personal prescription

Recently I had beers with a friend and former coworker. As part of our catchup, he heard a two-beer version of my months-long random walk through complex systems and resilience. I ~~ranted about~~ explained the importance of using postmortems to learn from failure in this setting, and was pleasantly surprised when he pinged me the next day to ask how he might improve his organization’s postmortems.

A slightly edited version of my email in response follows. It is essentially a sloppier, opinionated, concrete subset of the resilience-for-software README.

Improving postmortems to increase learning from failure

If you haven’t read it, I would recommend the Etsy guide as a starting point if you’re redesigning postmortems. You don’t/shouldn’t cargo cult all of it, but you (ed: the friend above) will note a strong contrast with your current process. https://how.complexsystems.fail is a good thing to keep in mind.

Some quick hits:

  • Root causes do not exist, only contributing factors
  • No broken part (from Drift Into Failure by Dekker)
  • Nonlinearity and control vs influence (also from Drift Into Failure)
  • Human error is a symptom, not a cause (from The Field Guide to Understanding ‘Human Error’, also by Dekker)
  • Use incidents to learn about the gap between “work as imagined” vs “work as done”
  • Be aware of the “dashboard trap” described in the Etsy guide

There are tons of resources here, and many further nodes to explore.

Good luck and let me know if I can answer any followup questions!

Quick thoughts on short papers: How Complex Systems Fail

How Complex Systems Fail by Richard Cook is one of my favorite papers. The fact that it was written in 1998, without a particular focus on software, makes it that much better. If you have experience in “operations” (aka “production”), this paper will make you feel seen.

This paper makes eighteen observations about complex systems and includes commentary on each. My favorites, grouped by theme (but seriously, this paper is short and amazing, you should read it!):

1. Complex systems are intrinsically hazardous systems.

2. Complex systems are heavily and successfully defended against failure.

3. Catastrophe requires multiple failures – single point failures are not enough.

4. Complex systems contain changing mixtures of failures latent within them.

5. Complex systems run in degraded mode.

6. Catastrophe is always just around the corner.

“Abandon all hope ye who enter here” — or, at least, the hope of a linear, understandable system which you can be confident will avoid catastrophe if only enough best practices are applied.

7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.

8. Hindsight biases post-accident assessments of human performance.

We want to believe that — and it is so much easier for us if — failure corresponds to a “broken part” somewhere in our system and we can be “done” once we fix it. But that’s flat out not the case. There is no root cause. Similarly, since failures involve humans — either their action or inaction — it would be easy for us if “human error” were a cause, to be dealt with, perhaps, by mandating further training or imposing punishment. But it is, instead, a symptom of deeper problems in our complex socio-technical systems.

Sidney Dekker has written great books on these topics if you are interested in learning more.

10. All practitioner actions are gambles.

11. Actions at the sharp end resolve all ambiguity.

I wrote recently about the difference between work as imagined and work as done. We see here that work as done is the ground truth and is made up of operators making a sequence of choices, each of which they perceive — given their knowledge and constraints at the time — to be the best option available. In other words, a sequence of gambles!

16. Safety is a characteristic of systems and not of their components.

Every component may be working correctly and the system can still be broken :-(.

Lorin is blogging

In the last week, Lorin Hochstein has posted five new posts on his blog. What makes this particularly exciting is that, as far as I know, Lorin is one of the few full-time resilience practitioners working in software. He is a member of Netflix’s Cloud Operations and Reliability Engineering (CORE) team. Surprisingly, not much has been written about CORE. Perhaps the most insightful text is their Senior Resilience Engineering Advocate job post:

The goal of the education and outreach function of the SRE team is to improve reliability and resilience of Netflix services by focusing on the people within the company, since it’s the normal, everyday work of Netflix employees that creates our availability. To cultivate operational excellence, we reveal risks, identify opportunities, and facilitate the transfer of skills and expertise of our staff by sharing experiences.

Following an operational surprise, we seek to understand what the world looked like from the perspective of the people involved. We facilitate interviews, analyze joint activity, and produce artifacts like written narrative documents. Relationship building is a huge part of this role. Someone advocating for resilience engineering within Netflix will help stakeholders realize when this type of work is most effective.


We Think About

Netflix as a socio-technical system is formed from the interaction of people and software. This system has many components and is constantly undergoing change. Unforeseen interactions are common and operational surprises arise from perfect storms of events.

Surprises over incidents and recovery more than prevention. We encourage highlighting good catches, the things that help make us better, and the capacity we develop to successfully minimize the consequences of encountering inevitable failure. A holistic view of our work involves paying attention to how we are confronted with surprises every day and the actions we take to cope with them.

This is not the sort of language you typically see in the job description for an SRE! Lorin’s colleague Ryan Kitchens also gave a talk at SRECon19 that touches on CORE’s approach.

Back to those blog posts

Many of the topics Lorin has written on in this flurry of posts have been covered before by folks like Allspaw, Cook, Dekker, or Woods. Lorin’s writing is a great counterpoint to these voices and I think well suited for those new to resilience. He writes plainly and smoothly, and with the perspective of someone who has been in the trenches doing the work for some time.

Lorin’s recent blog posts cover many of these topics.

I look forward to more blogs from Lorin, even if he ends up slowing down a bit. Lorin is also not the only resilience practitioner blogging nor do they all work at Netflix. I hope to highlight others’ great work in future blog posts.