Read just about any article on DevOps and you’ll be sure to find liberal sprinklings of terms like “culture” and “collaboration” and “establishing strong feedback loops.” Go deeper and you’ll probably come across some head-scratching stuff like “anti-fragile,” “systems thinking,” and my own personal favorite: “learning from failure.”
There’s nothing especially earth-shattering about learning from failure, is there? After all, it’s generally accepted that many teams gain more knowledge from the dark recesses of failure than from basking in the warm afterglow of success.
While this common-sense wisdom is incontrovertible, actually applying it is darned hard to do—especially in IT. But why is this? After all, with heaps of white-elephant projects, development calamities and terminal Operations practices, shouldn’t we all be failure gurus, leveraging our collective knowledge for the greater good of the business?
Well, actually, no.
Ironically, the problem for IT professionals today is one of survivorship bias: our tendency to focus solely on the survivors (people, technology or processes) that have made it through the complex software development life cycle, while ignoring every point of failure along the way. Not such a bad thing, you might say—after all, “winners are grinners,” as my old soccer coach used to say.
But this survivalist approach is problematic and somewhat counter to DevOps thinking. Rather than concentrating on successes (after a collective sigh of relief when code makes it to production), we should be looking across our delivery chains for failures and defects—the non-survivors, if you like. Because it’s here, at the ugly edge, that you’ll really capture the information needed to drive continuous improvement.
I’ve witnessed so many painful examples of IT organizations falling for survivorship bias. I won’t mention names, but they all share a common characteristic: Not only did they fail to understand that what they missed or ignored (the non-survivors) held valuable information, they often failed to recognize that there was any missing information at all.
Here are some real-world examples of survivorship bias that are applicable in an IT context.
The real casualties of war: During World War II, a team of statisticians was tasked with determining where more armor should be applied to planes making it back to base after bombing runs. Examining the bullet holes, the team’s obvious conclusion was to add more armor to the areas of the aircraft that received the most damage. This, of course, was wrong: The armor needed to go where there was the least damage, because that’s probably where the aircraft that didn’t make it back had been hit.
This is a lesson we can apply in IT, especially when it comes to people. Over many years I’ve repeatedly seen organizations turn to the “heroes” of IT, the “go-to guys” who’ve been through every re-org and upheaval, survivalists with their own sets of specialized tools. The problem, however, is that we’re missing key information from our casualties: What forced other great developers and operations engineers to leave? Was it culture, stress, management or working conditions?
Nothing really ages gracefully: Some years ago I was considering buying a cheap but rundown old cottage. I knew it was going to require a hefty outlay of cash to make it habitable, but hey, it was going to be worth it. The house had withstood the test of time; it was solid, right?
Well, nope. It had been a piece of crap when it was built, and it was one of only a few houses from the original design still left standing. Luckily an architect friend of mine pointed out my survivorship bias (by calling me an idiot).
In IT, we’ve inherited a fair amount of design crap, haven’t we? Why is it, then, that we celebrate every painstaking step involved in keeping this heritage up and running, when we should be identifying points of failure in our architecture, processes and tools? So the next time your management team does a round of high-fives after a production deployment, ask them to sit in on a typical release weekend and witness the pain you’ve just been through, manually stitching everything together.
The tall poppy syndrome: For every successful new restaurant, there are 10 failures. For every funky startup, countless non-starters. It’ll be the same with DevOps: For every Netflix, Amazon, Uber or LinkedIn, there’ll be a raft of failures. So stop swooning over impressive-sounding deployment rates, chaos monkeys and live experiments, and start looking at the failures, technical debt and casualties that exist within and across your own teams.
For example, identify why and how defects are leaking into your codebase, what constraints prevent you from split-testing assumptions, and whether performance monitoring in pre-production can be used to spot the “casualties” before they go live. Always remember to do what’s really painful (but critical), such as exposing failures through metrics, because without that visibility the natural tendency to revert to over-celebrating infrequent successes will persist.
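To make that concrete, here is a minimal Python sketch of the kind of failure metric worth surfacing: a change failure rate and a defect escape rate computed from release history. The Release fields and the sample data are assumptions for illustration, not a standard API; in practice you would pull these numbers from your deployment and defect-tracking tools.

```python
# Minimal sketch (assumption: each release is summarized by the fields below,
# fed from your own deployment and defect-tracking systems).
from dataclasses import dataclass

@dataclass
class Release:
    version: str
    deployed_ok: bool               # survived deployment without rollback or hotfix?
    defects_found_in_preprod: int   # caught before go-live
    defects_escaped_to_prod: int    # the "non-survivors" that leaked through

def failure_metrics(releases: list[Release]) -> dict:
    """Summarize the non-survivors: failed deployments and escaped defects."""
    total = len(releases)
    failed = sum(1 for r in releases if not r.deployed_ok)
    found = sum(r.defects_found_in_preprod for r in releases)
    escaped = sum(r.defects_escaped_to_prod for r in releases)
    return {
        "change_failure_rate": failed / total if total else 0.0,
        "defect_escape_rate": escaped / (found + escaped) if (found + escaped) else 0.0,
    }

# Made-up example data: two of four releases needed a rollback or hotfix,
# and nearly a quarter of all defects slipped past pre-production checks.
history = [
    Release("1.0", True, 8, 1),
    Release("1.1", False, 5, 3),
    Release("1.2", True, 10, 2),
    Release("1.3", False, 4, 2),
]
print(failure_metrics(history))
# e.g. {'change_failure_rate': 0.5, 'defect_escape_rate': 0.228...}
```

Even a crude report like this shifts the conversation from “we shipped” to “what didn’t survive the trip,” which is the whole point of looking at the non-survivors.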
Learning from failure is easier said than done. Because we’ve had so much failure to deal with, we naturally gravitate toward our successes. But that isn’t where the learning is, so start clinically examining why things fail, from the inception of an idea to full production status. Perhaps then you can survive anything.