Thot leadership: A tale of two^Wthree outages

I'm useless at thought leadership. I am fundamentally incapable of getting my hot takes out before everyone else. Indeed, I only tend to get my thoughts down once everyone else has got bored and wandered off. That's what I'm doing here. I'm not unduly bothered by this; hopefully there's room to move on to more durable truths that stay relevant...

The particular outages that have been sticking in my mind are the Salesforce DNS outage and the Fastly config push outage. They're interesting to compare and contrast...

Salesforce DNS outage

Rather intriguingly, the outage report has been rewritten to remove any reference to a single engineer receiving consequences for the actions they took, but the initial draft... had that. To be clear, the first iteration I saw did not place that as the sole issue - it was mentioned alongside things like poor automation - but it certainly struck a chord.

A typical Twitter view was that anything like that is beyond the pale and a sign of an irrevocably broken culture. And I just can't quite get on board with that. I view all outages as systemic issues, but that's not incompatible with someone being negligent or even malicious. That is a systemic issue - why did you hire this person? Why did you not deal with it earlier? Why aren't your systems resilient against this? But it's also a people issue - you don't just go "Yeah, sure, you're negligent, but that's not your fault, we shouldn't have hired you. It's on us."

To be clear, almost certainly we do have a screwed-up culture here, not One Bad Apple. I'm being pedantic because I think it's useful to avoid lazy thinking. How would you know the difference between "one bad engineer" and widespread issues? Well, the main thing is whether their pattern of behaviour is normal or not. This emergency change required approval, and was apparently inappropriately approved. Did that happen often? Are there consequences for the approvers? Canarying and staged rollouts were supposed to happen, but didn't - was that unusual?

I guess what I don't like about "blaming the engineer is bad" is that the opposite is not the real fix. You can avoid blaming the engineer and still make no progress. You need to understand the context, the norms, the reasons why alleged processes weren't followed and checks didn't work. The thing that really made me feel bad about the Salesforce outage write-up is that it had no answers to those questions, and moreover didn't even ask them. Take out the engineer-blaming, and it's still a concerning write-up.

As well as culture issues, this was, at heart, a "should have had better automation, with canarying and incremental rollouts" outage, which I think makes a nice comparison with the Fastly outage. So, let's move on.

Fastly config push outage

The Fastly write-up is very short, but I liked it. For one, it didn't focus on the trigger, but on what happened. Buggy software was rolled out, and then a customer config change triggered it. It sounds simple, but so many incident write-ups start with the immediate trigger (preferably "human error"), and are instantly wrong-footed in terms of managing the systemic issues. As it is, Fastly's response looks systemic.

So far, so boring. What was interesting to me was a Twitter conversation I had while the issue was still poorly understood. The assumption was that this was another operational config push gone wrong, not a customer config push. My correspondent asserted: 1) They wanted a write-up explaining how whoever was doing that operational work screwed up, and 2) They wanted proper multi-stage rollouts, with a human in the loop between steps, to prevent global failure in the future. I found this interesting, because there were little echoes of Salesforce.

(My correspondent may disagree with how I'm portraying their views. Oh well. Pretend I'm making up a straw person to argue with, then! :)

Leading with "what the human operator did wrong" is, as I said above, dangerous to the systemic thinking that will actually fix the system. Indeed, it's bad enough that people assumed a config-push outage was an operational error, rather than a customer control-path failure. Assuming it's human operator error is as bad as saying "It's always DNS", except it's not even a (misinformed, tedious) joke. Ok, in Salesforce's case it was DNS. *sigh*

The other point, about human inspection of multi-stage rollouts, is... a lot more subtle. If you don't trust your rollouts, having a human in the loop and watching sounds good. In practice, it doesn't scale, and it sucks. Humans make mistakes; good automation is much more reliable. So, any decent system will have automated checks anyway. What's the human doing? Watching the machine to make sure its checks are working correctly. That's the approach behind Tesla's Autopilot, and people have died because of it.

Get good at trusting your automation. We don't put human checks on the business path - no human is canarying the rollouts of customer-driven configuration. To say that the operational automation needs a human watching it means we're just not holding it to the same standard of software engineering as the application, and that is how you get outages. You don't want to stick humans in the loop by choice - they are just so darned unreliable. Look at all those outages caused by operator error! So, adding human vigilance is a sign of some underlying cognitive dissonance.
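
To make "automated checks" a bit more concrete, here's a minimal sketch of the kind of thing I mean: a staged rollout where the machine, not a person watching a dashboard, decides whether to widen each stage. Everything in it is hypothetical - the stage sizes, the threshold, the placeholder functions you'd wire up to your own config-push and monitoring systems - so treat it as an illustration of the shape, not anyone's actual pipeline.

    # Hypothetical sketch: a staged rollout gated by automated health checks.
    # None of these names correspond to a real system; the placeholders below
    # would be wired up to your own config-push and monitoring infrastructure.

    import time

    STAGES = [0.01, 0.05, 0.25, 1.00]  # canary first, then progressively wider rings
    ERROR_RATE_THRESHOLD = 0.001       # abort if a stage's error rate exceeds this
    SOAK_SECONDS = 600                 # let each stage bake before widening

    def apply_config(fraction, config):
        """Placeholder: push the config to this fraction of the fleet."""
        raise NotImplementedError

    def observed_error_rate(fraction):
        """Placeholder: query monitoring for the error rate in this slice."""
        raise NotImplementedError

    def rollback(config):
        """Placeholder: revert everywhere to the last known-good config."""
        raise NotImplementedError

    def staged_rollout(config):
        """Push a config one stage at a time; the abort decision is automated."""
        for fraction in STAGES:
            apply_config(fraction, config)
            time.sleep(SOAK_SECONDS)
            if observed_error_rate(fraction) > ERROR_RATE_THRESHOLD:
                # The machine aborts and rolls back; nobody has to be staring
                # at a graph to catch this.
                rollback(config)
                return False
        return True

The point isn't this particular shape; it's that the abort decision lives in code that can be reviewed and tested, and held to the same standard as the rest of the software.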

Where does this concern about automation come from? Well... *sad face*... there are valid concerns, which brings me on to the third outage - one I never intended to cover when I started writing this, but it's so informative:

Google's June 2019 outage

This outage is a case of automation taking down a large service in a very visible way. Automation makes it possible to do so exceedingly efficiently (and surprisingly!). Is full automation bad? Do we need humans in the loop?

One of the things about automation is it enables scaling. It enables complexity. If every production change were done by hand, we'd have a huge error rate and any large distributed change would be overwhelmed by the overhead of coordinating people. Replace that with robots, and you can keep scaling. You can build a really complicated system, and it'll function (most of the time). If you read the write-up, this is clearly a mega-complex automation system, with many layers and... it got too complex.

People might take away the lesson that automation is dangerous. They'll be tempted to put the much less efficient, much less safe humans back in the loop, and stop treating the problem like software engineering. Don't do that. Trust automation, but build automation that you can trust. Code lets complexity quietly pile up in a way that would be glaringly obvious in a human bureaucracy. It's a horrible temptation. Keep the automation simple and clear. Build it so that you can trust it.

Posted 2021-06-14.