The complexity of today’s distributed microservices applications makes it tough to track down the root cause when a problem occurs. The time-proven method of drilling down on monitoring dashboards and then digging into logs simply takes too long. Hunting through huge volumes of logs is tedious and interpreting them is difficult. It also requires an enormous amount of skill and experience to understand what the logs mean and to identify the significant factors that relate to the root cause. Worse, this kind of approach ties up the most critical engineering and DevOps resources, preventing them from doing something that could be more valuable to the business.
It’s no wonder machine learning (ML) applied to logs is gaining momentum. It turns out that when an application problem occurs, the patterns in the logs change in a noticeable way. Using the right approach, ML can find these anomalous patterns and distill them into a small sequence of log lines that explains the root cause. Imagine the time savings of reviewing only 20 log lines curated by the ML, instead of hunting through the millions of log lines generated while the problem took place. Using ML on logs completely revolutionizes the troubleshooting process – speeding up incident resolution time and freeing up key engineers to work on new features instead of fighting fires.
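To make the idea concrete, here is a minimal sketch in Python of one way this kind of analysis can work. It is not any particular vendor’s algorithm: it simply masks the variable parts of each log line to form a “template,” compares how often each template appears during the incident window against a healthy baseline, and returns one representative line for each of the most unusual templates.

```python
# A minimal sketch, not a production algorithm: group log lines by template,
# then surface the templates that are new or over-represented during an incident.
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse numbers and hex/ID-like tokens so similar lines group together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\b[0-9a-fA-F]{8,}\b", "<ID>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def anomalous_lines(baseline: list[str], incident: list[str], top_k: int = 20) -> list[str]:
    """Return up to top_k incident lines whose templates are rare or absent in the baseline."""
    base_counts = Counter(template(l) for l in baseline)
    inc_counts = Counter(template(l) for l in incident)

    scored = []
    for line in incident:
        t = template(line)
        # Score: how much more frequent this template is now versus the baseline.
        score = inc_counts[t] / (base_counts.get(t, 0) + 1)
        scored.append((score, line))
    scored.sort(key=lambda s: s[0], reverse=True)

    seen, picked = set(), []
    for _, line in scored:
        t = template(line)
        if t not in seen:          # keep one representative line per template
            seen.add(t)
            picked.append(line)
        if len(picked) == top_k:
            break
    return picked
```

Real systems use far more sophisticated template mining and statistical scoring, but the principle is the same: surface the handful of lines whose patterns changed when the problem occurred.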
While ML transforms the process of hunting through logs, it does not fully solve the challenge for all users. Even with the best machine learning techniques, there is a last-mile problem: a skilled human with the right knowledge of the part of the application or infrastructure that has failed is normally required to interpret the log lines. Think of the possibilities if that reliance on key engineering resources could be eliminated by using AI to interpret those same log lines.
That’s where a natural language model such as OpenAI’s GPT-3 comes in. The log lines, together with an appropriate prompt, are fed to GPT-3, which returns a simple, plain language sentence that summarizes the problem. Engineers at my company have been experimenting with GPT-3 for the past six months, and, although not perfect, the results are nothing short of amazing. Here are a few examples:
- “The memory cgroup was out of memory, so the kernel killed process **** and its child ****.”
- “The file system was corrupted.”
- “The cluster was under heavy load, and the scheduler was unable to schedule the pod.”
In each case, the right engineer could have come to the same conclusion by analyzing the root cause reports for a few minutes. But what the above shows is that we no longer need to rely on having the “right engineer” available as the front line for incident response and resolution. Now, even the most junior member of a team can quickly get a sense of the problem, triage the situation and assign the incident to a suitable engineer to remediate it.
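For illustration, here is a hedged sketch of that summarization step in Python. It assumes the classic OpenAI completions client (openai < 1.0) and a GPT-3 model name such as text-davinci-003; swap in whatever model and client library you actually use.

```python
# A sketch of the summarization step: send the curated log lines plus a prompt
# to a GPT-3 model and get back a one-sentence, plain language summary.
# Assumes the OPENAI_API_KEY environment variable is set.
import openai

def summarize_root_cause(log_lines: list[str]) -> str:
    """Ask the model for a one-sentence, plain language summary of the curated log lines."""
    prompt = (
        "The following log lines were identified as explaining the root cause "
        "of a production incident. Summarize the problem in one plain English sentence.\n\n"
        + "\n".join(log_lines)
        + "\n\nSummary:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",   # assumed GPT-3 model; substitute as needed
        prompt=prompt,
        max_tokens=60,
        temperature=0,              # keep the summary deterministic
    )
    return response.choices[0].text.strip()
```

Keeping the temperature at zero and constraining the prompt to the curated log lines helps keep the summary grounded in what actually appears in the logs rather than what the model imagines.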
There’s also another impactful use case that plain language summarization opens up to all users – proactive incident detection. The same machine learning technique that can uncover the set of log lines that explain root cause when a problem has occurred can also be used to proactively detect the presence of any kind of problem, even if symptoms have not yet manifested themselves. This approach can often uncover subtle bugs and other conditions early, allowing engineers to improve product quality and proactively fix the issues before they cause more widespread production problems.
For this to work, the ML needs to constantly scan incoming log streams for the presence of anomalous patterns that indicate potential problems. This allows it to catch almost any kind of problem, even new or rare ones that are otherwise hard to detect. However, not all incidents it detects will relate to problems that you care about. For example, upgrading a microservice can cause significant changes in log patterns that the machine learning will highlight, even though this is not an actual problem. To determine whether a proactively detected problem is important, someone needs to review the small set of anomalous log lines. Generally, this extra effort is well worthwhile if it prevents a problem from escalating into a critical incident.
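As a rough illustration of what continuous scanning might look like (an assumed approach, not a description of any specific product), the same templating idea can be applied to a live stream: keep a running count of templates and surface lines whose template is new or still rare, then let a person decide whether the flagged change matters.

```python
# A rough sketch of continuous scanning: flag lines whose (number-masked)
# template has rarely been seen before, leaving the judgment call to a human.
import re
from collections import Counter
from typing import Iterable, Iterator

def scan_stream(lines: Iterable[str], rare_threshold: int = 3) -> Iterator[str]:
    """Yield incoming lines whose template is still rare."""
    seen: Counter = Counter()
    for line in lines:
        t = re.sub(r"\d+", "<NUM>", line)   # crude templating, as in the earlier sketch
        seen[t] += 1
        if seen[t] <= rare_threshold:
            yield line

# Example usage (hypothetical log path):
# for line in scan_stream(open("/var/log/app.log")):
#     print("worth a look:", line.rstrip())
```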
Once again, plain language summarization can be a tremendous help. Instead of the proactive review falling to a senior engineering team with a deep understanding of the product and its logs, it can be carried out by someone of almost any skill level who simply glances at the short English language summaries that are produced. Very quickly, the important “proactive” incidents can be surfaced and dealt with swiftly.
The foundation and usefulness of plain language summarization come from the machine learning technique used to analyze the logs in the first place. If this is not done well, summarization will not work at all, since the underlying data will be flawed. However, if ML can find the right log lines, then making them available in simple summary form significantly increases the usefulness and audience of ML-based log analysis.
Together, unsupervised ML-based root cause analysis and AI-based plain language summarization provide a more complete approach to automated troubleshooting. They unburden development teams of the painful task of hunting through log files when an incident occurs, allowing them to work on far more interesting and important problems.