How Complex Systems Fail by Richard Cook is one of my favorite papers. The fact that it was written in 1998, without a particular focus on software, makes it that much better. If you have experience in “operations” (aka “production”), this paper will make you feel seen.
This paper makes eighteen observations about complex systems and includes commentary on each. My favorites, grouped by theme (but seriously, this paper is short and amazing, you should read it!):
1. Complex systems are intrinsically hazardous systems.
2. Complex systems are heavily and successfully defended against failure.
3. Catastrophe requires multiple failures – single point failures are not enough.
4. Complex systems contain changing mixtures of failures latent within them.
5. Complex systems run in degraded mode.
6. Catastrophe is always just around the corner.
“Abandon all hope ye who enter here” — or, at least, the hope of a linear, understandable system which you can be confident will avoid catastrophe if only enough best practices are applied.
7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.
8. Hindsight biases post-accident assessments of human performance.
We want to believe that failure corresponds to a “broken part” somewhere in our system, and that we can be “done” once we fix it. It would be so much easier for us if that were true. But that’s flat out not the case. There is no root cause. Similarly, since failures involve humans, through either their action or inaction, it would be easy for us if “human error” were a cause, to be dealt with, perhaps, by mandating further training or imposing punishment. But it is, instead, a symptom of deeper problems in our complex socio-technical systems.
Sidney Dekker has written great books on these topics if you are interested in learning more.
10. All practitioner actions are gambles.
11. Actions at the sharp end resolve all ambiguity.
I wrote recently about the difference between work as imagined and work as done. We see here that work as done is the ground truth and is made up of operators making a sequence of choices, each of which they perceive — given their knowledge and constraints at the time — to be the best option available. In other words, a sequence of gambles!
16. Safety is a characteristic of systems and not of their components.
Every component may be working correctly and the system can still be broken :-(. Or, as someone who had not yet read this paper might wonder,