Every company is going digital today, and user experience is everything. However, the deployment of dynamic, hybrid cloud infrastructure and the explosion of connected devices create many challenges for monitoring the performance of digital services. As a result, organizations still struggle to build end-to-end pipelines that help ensure their applications and the business remain available, reliable and resilient.

“Our customers are somewhere in the journey between on-prem and cloud, so they have a lot of distributed, multi-cloud applications. For example, if you have a retail application, the systems of engagement could be running in the cloud while the system of record, where the actual data is stored, could be running on-prem deep within the data center,” said Sudip Datta, general manager and head of AIOps at Broadcom. “When you’re dealing with such complex distributed applications, managing and monitoring those applications becomes problematic. So, the more automation you have, the better.”

What Is AIOps?

AIOps (artificial intelligence for IT operations) operationalizes AI in IT. In the era of digital transformation, it is an important link in the overall BizOps chain because it connects business outcomes to the software delivery chain, which is governed by DevOps.

“Companies have to stay on top of their digital services to make sure that 100% of their customers are satisfied 100% of the time. At the same time, they have to deal with the complexity of cloud and on-prem, and with a continuously evolving infrastructure and network. Especially when you consider ephemeral assets like containers, it’s not possible to keep pace with a rule-based approach,” said Datta. “That’s why we have AIOps.”

Essentially, AIOps helps companies automatically find and fix application issues before customers notice them or, at the very least, shorten mean time to resolution (MTTR) when a noticeable problem does occur.
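MTTR is the average time from detecting an incident to resolving it. The sketch below is a minimal illustration of that arithmetic, assuming hypothetical incident records that store a detection and a resolution timestamp; it is not tied to any particular AIOps product.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2021, 9, 1, 9, 0), datetime(2021, 9, 1, 13, 30)),   # 4.5 hours
    (datetime(2021, 9, 3, 22, 15), datetime(2021, 9, 4, 0, 15)),  # 2 hours
    (datetime(2021, 9, 7, 14, 0), datetime(2021, 9, 7, 14, 45)),  # 45 minutes
]

def mean_time_to_resolution(records):
    """Return the average detection-to-resolution interval as a timedelta."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in records), timedelta())
    return total / len(records)

mttr = mean_time_to_resolution(incidents)
print(f"MTTR: {mttr.total_seconds() / 3600:.2f} hours")
```

Shortening any single incident, for example by cutting hours of triage, lowers the average directly.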

Observability Is Important

Achieving five-nines (99.999%) availability requires observability: the ability to observe a system’s outputs and gain insights from them. This capability is crucial for developers working with cloud-native, containerized architectures. Simply monitoring the environment to keep the lights on isn’t enough, because basic monitoring offers limited intelligence: it only reports whether a component, network or server is up or down.
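To see why five-nines is so demanding, it helps to convert an availability target into a downtime allowance. This minimal sketch assumes a 365-day year and ignores leap years and planned maintenance windows.

```python
# Convert an availability target into permitted downtime over a period.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years

def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted by an availability target such as 0.99999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_minutes

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%}: {allowed_downtime_minutes(target):.2f} min/year")
```

At 99.999%, the budget works out to about 5.26 minutes per year, which leaves little room for slow, manual diagnosis.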

“It’s not about collecting data, it’s about connecting data to glean insights out of it,” said Datta. “When you are dealing with a lot of components in a distributed, multi-cloud world, you need to connect topology data, metric data, unstructured data such as logs, and traces to glean insights about what is really happening. With AIOps and the observability it provides, you can ideally predict problems before they happen, and in case they do, determine the root cause of the problem and automate the remediation.”

Why SREs Are Critical

Site Reliability Engineers (SREs) are administrators with full-stack competencies who keep digital services running at peak performance. Today, most digitally progressive enterprises employ SREs for their mission-critical services.

“If you’re a bank or a retailer offering a bunch of consumer-facing services, who is responsible for the upkeep of the services?” said Datta. “You need a very specialized skillset with deep understanding of the architecture, because slow is the new ‘down.’ They have to be full-stack engineers. And they have to be equipped with the right tools that can track Service Level Objectives (SLOs) and the underlying Service Level Indicators (SLIs).”
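An SLI measures what users experience, such as the fraction of successful requests. An SLO sets the target for that measurement, and the gap between 100% and the SLO is the error budget. The sketch below assumes a simple request-success SLI; the function and field names are hypothetical and do not come from any Broadcom tool.

```python
from dataclasses import dataclass

@dataclass
class SloReport:
    sli: float                     # measured fraction of good events
    slo: float                     # target fraction, e.g. 0.999
    error_budget_remaining: float  # fraction of budget left (negative if overspent)

def evaluate_slo(good_events: int, total_events: int, slo: float) -> SloReport:
    """Compute a request-success SLI and the share of error budget still left."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good_events / total_events
    allowed_bad = (1.0 - slo) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, slo=slo, error_budget_remaining=remaining)

# Hypothetical numbers: 1,000,000 requests, 600 failed, against a 99.9% SLO.
report = evaluate_slo(good_events=999_400, total_events=1_000_000, slo=0.999)
print(f"SLI={report.sli:.4%}  budget left={report.error_budget_remaining:.0%}")
```

With these numbers the SLI is 99.94% and 40% of the error budget remains. Tracking the remaining budget, rather than only the raw SLI, shows how close a service is to breaching its objective.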

AIOps Speeds Issue Resolution

AIOps helps reduce the noise associated with issue resolution. As applications become more distributed and complex, the number of tools used to manage applications, networks and infrastructure grows, so it may not be clear whether a service outage was caused by the network or the application. Datta said the average enterprise’s tech stack generates 5,000 to 10,000 or more alarms per day. AIOps uses natural language processing (NLP) and clustering techniques to reduce that alarm noise by as much as 90%, giving developers and IT teams more time to deliver actual value.
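To get a feel for how grouping cuts alarm volume, consider the following sketch. It deliberately swaps in a much simpler technique than the NLP and clustering described above: it masks variable tokens such as IP addresses and numbers, then groups alarms that share the resulting template. The patterns and sample alarms are hypothetical.

```python
import re
from collections import defaultdict

# Tokens that vary between otherwise-identical alarms, applied in order.
_VARIABLE_TOKENS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),          # hex identifiers
    (re.compile(r"\b\d+\b"), "<NUM>"),  # numbers (also masks suffixes like db-01)
]

def template_of(message: str) -> str:
    """Reduce an alarm message to a lowercase template with variable tokens masked."""
    for pattern, placeholder in _VARIABLE_TOKENS:
        message = pattern.sub(placeholder, message)
    return message.lower()

def cluster_alarms(messages):
    """Group alarm messages that share the same template."""
    clusters = defaultdict(list)
    for msg in messages:
        clusters[template_of(msg)].append(msg)
    return dict(clusters)

alarms = [
    "Disk usage 91% on host 10.0.0.12",
    "Disk usage 95% on host 10.0.0.13",
    "Disk usage 97% on host 10.0.0.14",
    "Connection timeout to db-01 after 3000 ms",
    "Connection timeout to db-01 after 3100 ms",
]
clusters = cluster_alarms(alarms)
reduction = 1 - len(clusters) / len(alarms)
print(f"{len(alarms)} alarms -> {len(clusters)} groups ({reduction:.0%} fewer)")
```

At the volumes Datta cites, a 90% reduction would turn 5,000 to 10,000 daily alarms into roughly 500 to 1,000 items to review. Production systems use richer text similarity than exact template matching, so this sketch only illustrates the idea.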

“Customers joke about having a ‘mean time to innocence’: the time it takes to prove that it’s not my problem. Those responsibility debates are costing them four to five hours,” said Datta. “With AI and ML, we can determine the root cause or the probable root cause of a problem and fix it faster. The whole thing is about accelerating remediation and predicting problems before they happen.”

Developers and IT teams should understand which technology assets make up a business service and which business services have the highest priority, so they can focus their efforts accordingly. In addition, it’s important to know the relationships among individual technology assets, such as which application connects to which database and which database connects to which network.
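One way to capture those relationships is a dependency map that records which assets each asset relies on. The sketch below assumes a small, hypothetical topology and walks it breadth-first to list every asset behind a business service. Real topology data would come from discovery tooling rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical topology: each asset maps to the assets it depends on.
DEPENDS_ON = {
    "checkout-service": ["web-app"],
    "web-app": ["orders-db", "payments-api"],
    "payments-api": ["payments-db"],
    "orders-db": ["dc-network"],
    "payments-db": ["dc-network"],
    "dc-network": [],
}

def assets_behind(service, graph):
    """Return every asset a service transitively depends on, in breadth-first order."""
    seen, queue, order = {service}, deque([service]), []
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

print(assets_behind("checkout-service", DEPENDS_ON))
```

Walking the same map in reverse answers the complementary question: which business services are affected when a given database or network segment fails.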

“It’s all about the data, and the ability to deal with the volume, velocity, variety and veracity,” said Datta. “What’s also critical for AIOps is making sure your solution is open so it can connect with peer disciplines such as DevOps and BizOps. AIOps isn’t a nice-to-have; it’s a must-have, especially in the modern digital era.”

Learn more at Broadcom’s Sept. 28 AIOps event.

Content provided by Broadcom.