Embracing SRE as an incomplete approach to reliability

Model error is a fact of life. We operate in a world of stunning complexity, with tens of millions of people producing software and billions consuming it. We build on top of rapidly evolving hardware; today a top-of-the-line cell phone has more resources than a world-class supercomputer did 20 years ago. To iterate quickly in the face of this complexity, we have no choice but to base our decisions on imperfect summaries, abstractions and models.

When we construct models of reliability for our sociotechnical systems, we tend to play shell games with our complexity and ambiguity, hiding them wherever they seem least offensive. The Shingo Prize-winning Accelerate “asked respondents how long it generally takes to restore service when a service incident occurs” and found that lower mean time to restore (MTTR) predicted better organizational performance. But when does a disruption become an “incident”? And when is service “restored” [1]? Complex systems in the real world inevitably run degraded, with latent failures present.

Site reliability engineering—as defined by Benjamin Treynor Sloss, who evolved the discipline at Google beginning when he joined in 2003—has a core tenet of “Pursuing Maximum Change Velocity Without Violating a Service’s SLO” in order to resolve a “structural conflict between pace of innovation and product stability.” There are many interesting threads to pull here. Two that I find fascinating but that, sadly, are outside the scope of this post: (1) other ways in which the conflict between innovation and product stability might be resolved (perhaps add stability to product manager performance reviews?); and (2) the wide spectrum of ways in which SRE is implemented in the wild [2].

What I believe is more impactful is that (Sloss’ definition of) SRE dodges substantial complexity by (implicitly?) arguing that a well-selected, well-monitored collection of SLOs that are all green is sufficient for a system to be reliable [3]. This is a misconception that puts us at risk of surrogation: of mistaking our map (SLOs) for our territory (reliability). The fundamental insufficiency here is that SLOs cannot protect us from dark debt: “unappreciated, subtle interactions between tenuously connected, distant parts of the system”.

We see such dark debt in the “atmospheric” conditions that supported the perfect storm that formed in June 2019, when during GCNET-19009, multiple Google Cloud regions were disconnected from the Internet for over three and a half hours—more than 50 times the monthly downtime permitted by the service’s 99.99% availability SLA [4]:

Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage: firstly, network control plane jobs and their supporting infrastructure in the impacted regions were configured to be stopped in the face of a maintenance event. Secondly, the multiple instances of cluster management software running the network control plane were marked as eligible for inclusion in a particular, relatively rare maintenance event type. Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.
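To make the scale of that outage concrete, here is a rough sketch of the error-budget arithmetic (assuming a 30-day month and an outage of roughly three and a half hours; the exact duration was slightly longer):

```python
# Back-of-the-envelope error-budget math for a 99.99% monthly availability SLA.
MONTH_MINUTES = 30 * 24 * 60                 # 43,200 minutes in a 30-day month
SLA = 0.9999                                 # 99.99% availability target
budget_minutes = MONTH_MINUTES * (1 - SLA)   # ~4.32 minutes of permitted downtime

outage_minutes = 3.5 * 60                    # the ~3.5-hour disconnection
multiple = outage_minutes / budget_minutes   # how many budgets the outage burned

print(f"Monthly error budget: {budget_minutes:.2f} minutes")
print(f"Outage consumed roughly {multiple:.0f}x the monthly budget")
```

A single event consumed dozens of months’ worth of error budget, which is what makes “50 times the SLA” such a striking figure.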

In my view, then, SLOs are a tool we can use to align our organizations to avoid some especially painful ways we might make our users unhappy, like being down for too long or having too-high tail latency on important endpoints. And indeed, many organizations have adopted SLOs and found that managing the reliability of their systems has become easier (example). Unfortunately, healthy SLOs, no matter their quality, are not enough to certify that the functionality we provide to our users will be safe or reliable in the future. As Cook says, catastrophe is always just around the corner.
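A tail-latency SLO of the kind mentioned above can be checked with very little machinery. The sketch below is purely illustrative—the threshold, the sample data, and the nearest-rank percentile helper are all hypothetical, not taken from any particular SLO implementation:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ranked = sorted(samples)
    idx = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[idx]

# Made-up request latencies (ms) for an important endpoint; note the one
# slow outlier that dominates the tail.
latencies_ms = [12, 15, 14, 18, 22, 480, 16, 13, 19, 17]
slo_p99_ms = 300  # hypothetical SLO target for p99 latency

p99 = percentile(latencies_ms, 99)
print(f"p99 = {p99} ms; SLO {'met' if p99 <= slo_p99_ms else 'violated'}")
```

The check itself is trivial; the hard parts—choosing which endpoints and thresholds matter, and recognizing what a green check cannot tell you—are exactly where the dark debt hides.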

Unsurprisingly, there’s no silver bullet here (although you can always consider improving your postmortems). However, a clearer view of the nature of reliability supports more informed decisions on how to balance tradeoffs, resources, and prioritization as we seek to innovate quickly and reliably. I hope to learn more about how approaches to robustness (defense against known failure modes), such as SLOs, and to resilience (defense against unknown failure modes) can compose to improve overall reliability later this week at REdeploy!

Thanks to Rein Henrichs for feedback on this post.

[1] John Allspaw has a pair of interesting blog posts on these topics: Moving Past Shallow Incident Data and Incidents As We Imagine Them Versus How They Actually Happen

[2] Seeking SRE is a great source on the SRE multitudes. Google also has a blog post on the topic.

[3] To be precise, these must be achieved with low toil, but toil is orthogonal to our discussion.

[4] This SLA, based on which Google issues refunds to customers, is almost certainly stricter than the internal SLO used by SREs. 

Quick thoughts on short papers: “Failure to adapt or adaptations that fail: contrasting models on procedures and safety”

I’ve been consuming a lot of content in the last few months — books, papers, blog posts, talks, podcasts, you name it. I’ve had exciting aha! moments and spent hours trying to wrap my head around seemingly simple but, in fact, maddeningly complex concepts. In this setting, the right sort of short paper is like a refreshing cold drink on a hot, humid day. These papers argue a core idea of Goldilocks size and serve as landmarks and reference points as I sail through seas full of dragons.

The most recent paper of this type I have run into is Failure to adapt or adaptations that fail: contrasting models on procedures and safety by Sidney Dekker, from 2001 (hat tip to John Allspaw for passing it along). This paper has become my go-to for the idea that work as imagined and work as done are not the same. That is,

People at work must interpret procedures with respect to a collection of actions and circumstances that the procedures themselves can never fully specify (e.g. Suchman, 1987). In other words, procedures are not the work itself. Work, especially that in complex, dynamic workplaces, often requires subtle, local judgments with regard to timing of subtasks, relevance, importance, prioritization and so forth.

And:

 Procedures are resources for action. Procedures do not specify all circumstances to which they apply. Procedures cannot dictate their own application. Procedures can, in themselves, not guarantee safety. 

Applying procedures successfully across situations can be a substantive and skillful cognitive activity. 

Safety results from people being skillful at judging when (and when not) and how to adapt procedures to local circumstances. 

And, finally:

There is always a tension between centralized guidance and local practice. Sticking to procedures can lead to ineffective, unproductive or unsafe local actions, whereas adapting local practice in the face of pragmatic demands can miss global system goals and other constraints or vulnerabilities that operate on the situation in question. 

To apply this to software engineering, simply replace “procedures” or “guidance” with “runbooks” or “documentation,” think about the last time someone in your organization went “offroad,” and recall what the response was. The beauty of short papers is that they don’t need much more analysis than that. Just read it yourself — and let me know what you think!

What’s up funemployment

I left Lyft in June 2019. In the leadup to my departure, I decided to take a period of “structured funemployment,” rather than pipelining a job search and quickly getting back to W-2 work. A break like this — where I set the agenda and have time to reflect on the past and future — is something I’ve been promising myself I would do the next time I had a gap between jobs. Achievement unlocked!

I’m privileged to have the financial/logistical space to take a break like this and especially so to structure this experience around my membership in the South Park Commons community, which I joined in July (h/t Matt for the referral). A group that “brings together talented people to share ideas, explore directions and realize the opportunities that’ll get you there” is exactly what I was looking for. The Commons is an inspiring place, full of brilliant, kind, hard-working people with varying backgrounds and goals.  Spending my day at the Commons is energizing in a way that sitting at home is not.

In this setting, I’ve chosen to focus my attention so far on understanding how complex systems fail and the application of resilience engineering to software. This decision is motivated in large part by my experience with operations, failure, and the incident lifecycle during my time at Lyft. It turns out there are decades of research on why these problems are hard (e.g., why we can’t have nice things!).

I find this domain fascinating and am framing my pursuit as independent research, rather than an intent to launch a startup “in this space” (as they say). It’s… complex, but I don’t think this work aligns well with the sort of mechanics (e.g., metrics, growth) associated with success as a venture-backed company (but reach out if you think otherwise!). One data point here is that John Allspaw, the godfather of “resilience for tech” and former CTO of Etsy, is operating a consultancy, not a startup selling a product. My working hypothesis is that I’ll leave funemployment to work in this area inside a maturing hyper-growth company — think decacorns like Airbnb and Stripe. I still have at least a few months left of funemployment and research, but I’m happy to chat with folks at organizations looking to invest/hire in this area — drop me a note!  

So far, in practice, independent research on resilience engineering means consumption and conversation. Looking at the ever-increasing list of papers, books, and talks I have read/watched and have outstanding, it is definitely the best and worst of unbounded queue times. It is also great to have connected with individuals and communities — primarily on Twitter and in various Slacks — where there’s a healthy and ongoing discourse on resilience and related topics. Plenty of excellent chats over coffee as well — and funemployment means a flexible schedule, so if you’re interested in talking shop, just let me know.

While I’d give myself top marks for participation on Slack and Twitter (@jhscott), I’ve been slower to produce longer-form analysis and writing. I’ve started drafting talk proposals for related conferences, the first being REdeploy, which I’ll be attending in San Francisco in October. I’m also hoping to use this blog to distill what I’ve learned (duh). Look for a writeup of what SREs even do in a world of production ownership where developers hold the pagers in the next few days.

Let’s chat

Curiosity and free time is a dangerous and excellent combination. If you’d like to grab coffee or lunch in the SFBA, or chat over Hangouts/Skype/etc, let me know. The best ways to reach me are email — first.last @ gmail.com — or Twitter — @jhscott.