More DevOps teams should be applying root cause analysis (RCA) to defects. The advantages are clear and indisputable. RCA metrics on defects can be leveraged to improve software quality by fixing the ineffective areas of the software development process such as requirements, design, code verification, unit testing, test planning, and QA testing. The result is drastic improvements in the overall quality of the software, and that means happy customers and lower development costs. Still not convinced? Let’s dig a little deeper.

What is RCA about?
The methodology of RCA grew out of a need to identify the underlying factors that contributed to a system failure or an adverse event of some kind. Using the analogy of a plane crash, the RCA would pull in the black box data along with the plane’s designers, mechanics and pilots with experience flying that particular model. The aim would be to find the cause or causes of the crash and make the necessary changes to prevent it from happening again.

While RCA has traditionally been employed in hardware engineering, it can also work very well with software engineering. For software developers, RCA is about pulling together the businesspeople, engineers, and QA department to figure out why a defect was introduced. This means going back to the original requirements; checking the design, code implementation, test plans and test execution cycles; and identifying the root cause of the defect. The process requires careful analysis and classification of the root cause if it is to work well, but the benefits far outweigh the time spent categorizing root causes and acting on them.

Defects cost the U.S. $60 billion
Examining the root causes of a defect can help to establish breakdowns in communication between team members, weak links in processes that can be corrected, or training needs for individuals or groups. It has been more than a decade since the National Institute of Standards and Technology estimated that software defects were costing the U.S. economy US$59.5 billion annually. It found that up to 80% of development costs were being spent on identifying and fixing defects, and yet the end products were still shipping with unidentified defects.

If RCA metrics are employed efficiently, then defects can be traced to the source and processes can be tweaked or improved in order to eliminate them before they float downstream to QA (or worse, to the customer). By investigating and discovering the underlying causes, you can prevent those fires from starting in the first place, instead of focusing on extinguishing them.

It is inevitable that you will encounter some resistance when you track RCA metrics. It is vital to establish that the process is not about playing the blame game. Focus on the idea that the process is at fault, rather than the individual, and everyone needs to pull together in order to improve the process.

Relatively detailed data is a prerequisite, and it is vital that you appoint a judge (usually the project manager) who can rule on the investigation and prevent things from descending into a shouting match. A good level of documentation is important. As the group works backward from the discovery of the defect, whether it was identified by the end customer or the QA tester, or perhaps even earlier by the developer, it’s vital to document all defects and their root cause throughout all phases of the software development process (e.g. requirements review, design review, code review, testing).
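One way to keep that documentation consistent is to record every defect in the same fixed structure. The sketch below is purely illustrative: the field names, the list of phases, and the sample values are assumptions made for this example, not a standard taxonomy or any particular tracker's schema.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Hypothetical phases, listed in the order work flows downstream."""
    REQUIREMENTS = "requirements"
    DESIGN = "design"
    CODE = "code"
    TESTING = "testing"
    PRODUCTION = "production"


@dataclass(frozen=True)
class DefectRecord:
    defect_id: str
    found_in: Phase       # phase where the defect was discovered
    introduced_in: Phase  # phase the RCA group traced it back to
    root_cause: str       # short classification agreed by the group
    process_fix: str      # documented improvement to the process

    def phases_escaped(self) -> int:
        """Number of phases the defect passed through before being caught."""
        order = list(Phase)  # Enum members iterate in definition order
        return order.index(self.found_in) - order.index(self.introduced_in)


# Illustrative record: a requirements problem that surfaced in QA testing.
record = DefectRecord(
    defect_id="D-101",
    found_in=Phase.TESTING,
    introduced_in=Phase.REQUIREMENTS,
    root_cause="unclear requirement",
    process_fix="add a checklist to the requirements review",
)
print(record.phases_escaped())
```

Recording both where a defect was found and where it was introduced gives the group a simple measure of how far it floated downstream, which keeps the discussion on the process rather than on individuals.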

It could be that the original requirements were not clear, or maybe the developer applied their own interpretation, or perhaps the QA team missed testing for that potential scenario. Whatever the case, when the cause has been identified, move the focus swiftly to how it could be prevented. The aim is to document an improvement to the process that ensures a similar defect doesn’t arise in the future. For example, a possible fix may be to implement a checklist for the requirements document, or stage a more formal review of the design that includes BA and QA members. Perhaps you need to formally check during implementation that the developers have conducted unit testing or peer code reviews. It may be necessary for QA to write more detailed test cases, or to take part in the code-verification process.

Look beyond the post mortem
Many companies will only engage in RCA for production defects that make it through to customers, or for very serious defects discovered in production, but the process can be usefully applied throughout all phases of software development. The greater your volume of data, the better your chance of identifying patterns that signal a fundamental problem with a process that can potentially be solved.

Several months of data can give you insights that you’ll never get from a few production-critical defects only. It can also be used to tighten processes and improve efficiency for every project going forward, not just the software you happen to be working on right now.
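As a rough illustration of how accumulated data can expose a pattern, the sketch below tallies root causes and ranks them by frequency, Pareto-style. The function name `rank_root_causes`, the category labels, and the sample defects are all made up for this example; real data would come from whatever defect tracker the team uses.

```python
from collections import Counter


def rank_root_causes(defects):
    """Return (root_cause, count, cumulative_share) tuples, most frequent first.

    `defects` is an iterable of (defect_id, root_cause) pairs.
    """
    counts = Counter(cause for _, cause in defects)
    total = sum(counts.values())
    ranked = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        ranked.append((cause, count, running / total))
    return ranked


# Hypothetical sample: six defects gathered over several months.
sample_defects = [
    ("D-1", "unclear requirement"),
    ("D-2", "missing test case"),
    ("D-3", "unclear requirement"),
    ("D-4", "coding error"),
    ("D-5", "unclear requirement"),
    ("D-6", "missing test case"),
]

ranking = rank_root_causes(sample_defects)
for cause, count, cumulative in ranking:
    print(f"{cause}: {count} defects, {cumulative:.0%} cumulative")
```

In this made-up sample, half of the defects trace back to unclear requirements, which would point the team at the requirements review before anything else.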

It is the QA department’s job to champion this process. They should lead the investigation and force the necessary changes, backed up by the certainty of statistical analysis. This can lead to serious improvements to major processes, can help prevent software from being released prematurely, and can enable more-effective use of all resources. These actions also work perfectly with the QA department’s ultimate aim, which is to identify all defects by any means necessary and ensure that the customer doesn’t find them.

Lack of time is false economy
The adoption of RCA methodology has been widespread, and its ability to save lives when applied to the healthcare industry or accident investigation is well documented, but there’s no reason it should be confined to these fields. The pressure to deliver high-quality software quickly and roll out new features as soon as possible is a real problem for the software development industry. The old adage about more haste, less speed holds true.

Kaushal Amin is CTO of KMS Technology, a software development and IT services firm based in Atlanta and Ho Chi Minh City, Vietnam.