When a major production incident hits IT, the business usually suffers. Developing, delivering, and managing software systems and applications is complicated. People make mistakes, wrong commands get entered, bugs get introduced, and systems fail. When something like this happens, a cascade of unintended consequences often follows. Once the dust clears, there is usually an investigation or post-mortem to determine the cause of the problem. It might seem easier to assign blame to an individual or team, but in complex IT systems it’s rarely that simple: there are usually many contributing factors. Ultimately, the goal is to find the best way to improve service to the business. Your customers don’t care about who or what is responsible; they care about better service.
When blame is assigned, it tends to be counterproductive to the main objective: reducing the risk of future errors. Using blame or shame as a form of punishment creates a culture of distrust and discourages communication. By removing blame, you create an environment where failure can be discussed openly and treated as a learning opportunity. Your teams actually become more accountable and your systems more resilient. Besides, blame is usually misplaced in complex systems, where outages are frequently caused by a number of interrelated events. Catastrophic system failures are almost never caused by isolated errors committed by individuals. Instead, most outages result from multiple smaller errors in environments with serious underlying system flaws.
For example, consider a recent outage of Amazon’s AWS cloud service. The post-mortem revealed that someone mistyped a command intended to take down a small set of servers while debugging a problem. The command brought down a much larger set of servers than intended and initiated the sequence of events that resulted in several hours of system outage. The servers that were brought down had not been restarted for years, and to compound the problem, the system had experienced massive growth over the last several years, so restarting these servers and running the required system validation checks took hours. Who is really at fault? The person entering the initial command? The process for running the validation checks? The systems team for not periodically restarting the servers? Taking a systems approach provides a framework for analyzing errors and improving system resiliency. Here are some techniques to help with software post-mortems:
- Agree to expect problems – It’s valuable for teams and leaders to agree that problems are not a reflection of personal failure but the expected cost of innovation. When this agreement is made publicly, people start speaking up about current issues and potential problems. These discussions need to be encouraged and amplified.
- Distinguish people errors from system errors – People tend to be at the sharp end of the problem. You also need to understand the blunt end: the latent errors that arise in complex systems from the many process, organizational, and system layers. In the Amazon example, the sharp end was the person issuing the command, but latent errors in the system had created an accident waiting to happen.
- Pre-mortems – Don’t wait for the problem to happen. Prospectively identify error-prone situations or failure modes by mapping current processes and identifying the ways each step can go wrong (see the first sketch after this list).
- Learn from your successes – Conduct post-mortems when you succeed, too. By studying success, you will understand not just why you fail but why and how you succeed.
- Share post-mortem results – Ensure that you have a post-mortem write-up with a timeline, root cause, and corrective actions (the second sketch after this list shows one possible structure). If this was a production incident, follow Amazon’s example and publish a public blog post summarizing the post-mortem write-up.
- Remediate and measure – Make sure that the corrective actions are implemented and that telemetry is in place to measure improvement based on those actions; the mean-time-to-recovery metric in the second sketch below is one example. Your systems and teams should be stronger as a result of the error. You need to measure, manage, and share your progress.
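To make the pre-mortem idea concrete, here is a minimal sketch in Python, assuming a hypothetical deployment process: each step is mapped to the ways it can go wrong and scored by likelihood and impact so that the riskiest failure modes get discussed first. The step names and scores are illustrative, not prescriptive.

```python
# Pre-mortem sketch: map each step of a hypothetical process to its failure
# modes, scored by likelihood and impact, and review the riskiest first.
from dataclasses import dataclass

@dataclass
class FailureMode:
    step: str         # the process step being examined
    description: str  # one way this step can go wrong
    likelihood: int   # 1 (rare) .. 5 (frequent)
    impact: int       # 1 (minor) .. 5 (catastrophic)

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact

failure_modes = [
    FailureMode("enter removal command", "operator mistypes the server set", 3, 5),
    FailureMode("run validation checks", "checks skipped under time pressure", 2, 4),
    FailureMode("restart servers", "cold restart takes hours at current scale", 2, 5),
]

# Discuss the highest-risk failure modes first in the pre-mortem meeting.
for fm in sorted(failure_modes, key=lambda f: f.risk, reverse=True):
    print(f"[risk {fm.risk:2d}] {fm.step}: {fm.description}")
```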
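And here is a minimal sketch, again with hypothetical data, of a structured post-mortem record with a timeline, root cause, and corrective actions, plus one simple improvement metric, mean time to recovery (MTTR), that you could track across incidents to see whether remediation is working.

```python
# Post-mortem record sketch: timeline, root cause, corrective actions, and a
# simple MTTR metric to track improvement. All incident data is hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class PostMortem:
    title: str
    timeline: list[tuple[datetime, str]]  # (timestamp, event) pairs
    root_cause: str
    corrective_actions: list[str] = field(default_factory=list)

    @property
    def time_to_recovery(self) -> timedelta:
        # Assumes the first timeline entry is detection and the last is recovery.
        return self.timeline[-1][0] - self.timeline[0][0]

def mean_time_to_recovery(incidents: list[PostMortem]) -> timedelta:
    return sum((i.time_to_recovery for i in incidents), timedelta()) / len(incidents)

outage = PostMortem(
    title="Storage service outage",
    timeline=[
        (datetime(2017, 2, 28, 9, 37), "operator command removes too many servers"),
        (datetime(2017, 2, 28, 13, 54), "full service restored"),
    ],
    root_cause="removal tool accepted an overly broad server set",
    corrective_actions=[
        "add safety limits to the removal tool",
        "partition the service so restarts are faster",
    ],
)

print(mean_time_to_recovery([outage]))  # trend this metric across incidents
```

If MTTR trends down over successive incidents, the corrective actions are paying off; if not, the post-mortems are producing write-ups but not resilience.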
In complex systems, human error is inevitable, and using punishment as a deterrent is a losing proposition. Create a culture of learning in which reporting errors is encouraged, analyzing errors to uncover latent system flaws is standard practice, and people are not punished for making mistakes. That culture is essential to finding and fixing defects in complex systems.