Simon Frankau's blog

Teleological incident postmortems

I made a Tweet I'm quite proud of, that got double-digit likes! ;) It goes something like this:

Teleological incident postmortem: "The root cause of this incident was the need for this system to be more reliable in the future."

280 characters doesn't do nuance, so I thought I'd write a little more. Even if, to be frank, there's not a lot of nuance.

Despite my advancing years, I only just learnt about teleology. My knowledge of ancient Greek thought is very limited, but I've started working my way through Bertrand Russell's History of Western Philosophy. It's a bit of a doorstop, but surprisingly readable with plenty of snark. I'm sure I'll do a review in due course. Anyway, the idea of teleology is (roughly) to answer the question "Why?" in terms of what it causes in the future. It's a bit like cause and effect, except with the event being the cause, not the effect.

I find it a startling way to think. I'm so used to "Why?" being in terms of cause and effect, that this framing, with its implicit hint of fate, is surprisingly alien. Why on earth would I apply it to incident response, beyond "I have a new hammer, let's hammer everything!"?

Firstly, I think it's an empowering framing. While I don't think this is how it worked for the ancient Greeks, it's good to be able to say "When we look back, what do we want to say this event led to? Let's make it that way!". It brings the right mindset to reviewing an incident, pushing it in the direction of making things better, rather than dwelling too much on the past.

Secondly, thinking about this slightly alien, human-story-centred approach to "Why?" can help remind us of the assumptions baked into our normal "cause and effect" approach to "Why?". Cause-and-effect covers a multitude of attitudes. It can be making up a story as to why a thing happened, or scientifically-grounded causation. The latter works best in situations with repeatable experiments, where the alternatives can be tested. Incident analysis, where we don't have access to the universe where the outage didn't happen, tends towards the rather human story-telling end of things, and we should think about that.

In some contexts, "Why?" makes no sense. It's just trying to impose human values of meaning onto systems that aren't like that. Some events are random, some things are just the way they are. Not everything is about causation, and "Why not?" is just as (in)valid as "Why?". For incidents with human-built systems, this is less of a concern - if an outage happened due to bad luck or coincidences, you can still ask "Why didn't we have another layer to defend against this?". We should still proceed with care.

The lesson to take away, I think, is that there are different kinds of "Why?" for different purposes. It's a fun route back to the same old story. "Who did this?" is a bad "Why?" for improving the system. "What was the trigger?" is a bad "Why?". "What was the (singular) root cause?" is a bad "Why?". "What are all the key things that, had they been different, would have prevented or ameliorated this outage?" is perhaps a better "Why?".

In the end, our "Why?"s need to be driven by our ultimate goal for a postmortem, which is to improve the reliability of the system. Perhaps, counter to all my intuition, the teleological "Why?" is the actual best "Why?" in this situation!

Posted 2021-07-14.
