Software continues to grow as the driver of today’s global economy, and how a company’s applications perform is critical to retaining customer loyalty and business. People now demand instant gratification and will not tolerate latency — not even a little bit.
As a result, application performance monitoring is perhaps more important than ever to companies looking to remain competitive in this digital economy. But today’s APM doesn’t look much like the APM of a decade ago. Performance monitoring then was more about the application itself, and very specific to the data tied to that application. Back then, applications ran in on-premises datacenters and were written as monoliths, largely in Java, tied to a single database. With that simple n-tier architecture, organizations were able to easily collect all the data they needed, which was then displayed in Network Operations Centers to systems administrators. The hard work came from launching monitoring tools from the command line (which required systems administration experts), from sifting through log files to see what was real and what was a false alarm, and from reaching the right people to remediate the problem.
In today’s world, doing APM efficiently is a much greater challenge. Applications are cobbled together, not written as monoliths. Some of those components might be running on-premises while others are likely to be cloud services, written as microservices and running in containers. Data is coming from the application, from containers, Kubernetes, service meshes, mobile and edge devices, APIs and more. The complexities of modern software architectures broaden the definition of what it means to do performance monitoring.
“APM solutions have adapted and adjusted greatly over the last 10 years. You wouldn’t recognize them at all from what they were when this market was first defined,” said Charley Rich, a research director at Gartner and lead author of the APM Magic Quadrant, as well as the lead author on Gartner’s AIOps market guide.
So, although APM is a mature practice, organizations are having to look beyond the application — to multiple clouds and data sources, to the network, to the IT infrastructure — to get the big picture of what’s going on with their applications. And we’re hearing talk of automation, machine learning and being proactive about problem remediation, rather than being reactive.
“APM, a few years ago, started expanding broadly both downstream and upstream to incorporate infrastructure monitoring into the products,” Rich said. “Many times, there’s a problem on a server, or a VM, or a container, and that’s the root cause of the problem. If you don’t have that infrastructure data, you can only infer.”
Rekha Singhal, the Software-Computing Systems Research Area head at Tata Consultancy Services, sees two major monitoring challenges that modern software architectures present.
First, she said, is multi-layered distributed deployment using Big Data technologies, such as Kafka, Hadoop and HDFS. The second is that modern software, also called Software 2.0, is a mix of traditional task-driven programs and data-driven machine learning models. “The distributed deployment brings additional performance monitoring challenges due to cascaded failures, staggered processes and global clock synchronization for co-relating events across the cluster,” she explained. “Further, a Software 2.0 architecture may need a tight integrated pipeline from development to production to ensure good accuracy for data-driven models. Performance definition for Software 2.0 architectures are extended to both system performance and model performance.”
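To illustrate why clock synchronization matters when correlating events across a cluster, here is a minimal Python sketch. It assumes each node's clock offset from a reference clock has already been estimated by some other means; the node names, timestamps, offsets and the `to_global_timeline` helper are all hypothetical and not drawn from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    node: str        # node that logged the event
    local_ts: float  # seconds, as reported by that node's own clock
    message: str


def to_global_timeline(events, clock_offsets):
    """Order events by estimated reference time.

    clock_offsets maps node -> (node clock - reference clock), in seconds.
    Nodes without a known offset are assumed to be in sync (offset 0).
    """
    def global_ts(event):
        return event.local_ts - clock_offsets.get(event.node, 0.0)

    return sorted(events, key=global_ts)


# Hypothetical events from two nodes in the same cluster.
events = [
    Event("broker-1", 100.40, "consumer lag rising"),
    Event("worker-7", 100.10, "disk write stalled"),
]

# Assumed offsets: worker-7's clock runs half a second behind the reference.
offsets = {"broker-1": 0.0, "worker-7": -0.50}

timeline = to_global_timeline(events, offsets)
for event in timeline:
    print(event.node, event.message)
```

Sorted by raw local timestamps, the worker-7 event would appear to come first. After correcting for the assumed offset it lands after the broker-1 event, which changes which one looks like the cause.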
Moreover, she added, modern applications are largely deployed on heterogeneous architectures, including CPU, GPU, FPGA and ASICs. “We still do not have mechanisms to monitor performance of these hardware accelerators and the applications executing on them,” she noted.
The new culture of APM
Despite these mechanisms for total monitoring not being available, companies today need to compete to be more responsive to customer needs. And to do so, they have to be proactive. Joe Butson, co-founder of consulting company Big Deal Digital, said, “We’re moving [from] a culture of responding ‘our hair’s on fire,’ to being proactive. We have a lot more data … and we have to get that information into some sort of a visualization tool. And, we have to prioritize what we’re watching. What this has done is change the culture of the people looking at this information and trying to monitor and trying to move from a reactive to proactive mode.”
In earlier days of APM, when things in an application slowed or broke, people would get paged. Butson said, “It’s fine if it happens from 9 to 5, you have lots of people in the office, but then, some poor person’s got the pager that night, and that just didn’t work because what it meant in the MTTR (mean time to recovery), depending upon when the event occurred, it took a long time to recover. In a very digitized world, if you’re down, it makes it into the press, so you have a lot of risk, from an organizational perspective, and there’s reputation risk.”
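As a rough illustration of the MTTR point, the following Python sketch averages recovery time over a set of incidents. The incident timestamps are invented for the example, and the `mean_time_to_recovery` helper is hypothetical rather than part of any APM tool.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (recovered - detected) over (detected, recovered) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: one daytime outage, one overnight outage.
incidents = [
    (datetime(2020, 3, 2, 14, 0), datetime(2020, 3, 2, 14, 30)),  # 30 minutes
    (datetime(2020, 3, 5, 2, 0), datetime(2020, 3, 5, 5, 0)),     # 3 hours
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:45:00
```

In this made-up data, a single slow overnight recovery pulls the average from 30 minutes up to an hour and 45 minutes, which is the effect Butson describes.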
High-performing companies are looking at data and anticipating what could happen. And that’s a really big change, Butson said. “Organizations that do this well are winning in the marketplace.”
Whose job is it, anyway?
With all of this data being generated and collected, more people in more parts of the enterprise need access to this information. “I think the big thing is, 10-15 years ago, there were a lot of app support teams doing monitoring, I&O teams, who were very relegated to this task,” said Stephen Elliot, program vice president for I&O at research firm IDC. “You know, ‘identify the problem, go solve it.’ Then the war rooms were created. Now, with agile and DevOps, we have [site reliability engineers], we have DevOps engineers, there are a lot broader set of people that might own the responsibility, or have to be part of the broader process discussion.”
And that’s a cultural change. “In the NOCs, we would have had operations engineers and sys admins looking at things,” Butson said. “We’re moving across the silos and have the development people and their managers looking at refined views, because they can’t consume it all.”
It’s up to each segment of the organization looking at data to prioritize what they’re looking at. “The dev world comes at it a little differently than the operations people,” Butson continued. “Operations people are looking for stability. The development people really care about speed. And now that you’re bringing security people into it, they look at their own things in their own way. When you’re talking about operations and engineering and the business people getting together, that’s not a natural thing, but it’s far better to have the end-to-end shared vision than to have silos. You want to have a shared understanding. You want people working together in a cross-functional way.”
Enterprises are thinking through the question of who owns responsibility for performance and availability of a service. According to IDC’s Elliot, there is a modern approach to performance and availability. He said at modern companies, the thinking is, “ ‘we’ve got a DevOps team, and when they write the service, they own the service, they have full end-to-end responsibilities, including security, performance and availability.’ That’s a modern, advanced way to think.”
In the vast majority of companies, ownership for performance and availability lies with particular groups having different responsibilities. This can be based on the enterprise’s organizational structure, and the skills and maturity level that each team has. For instance, an infrastructure and operations group might own performance tuning. Elliot said, “We’ve talked to clients who have a cloud COE that actually have responsibility for that particular cloud. While they may be using utilities from a cloud provider, like AWS CloudWatch or CloudTrail, they also have the idea that they have to not only trust their data but then they have to validate it. They might have an additional observability tool to help validate the performance they’re expecting from that public cloud provider.”
In those modern organizations, site reliability engineers (SREs) often have that responsibility. But again, Elliot here stressed skill sets. “When we talk to customers about an SRE, it’s really dependent on, where did these folks come from?” he said. “Were they reallocated internally? Are they a combination of skills from ops and dev and business? Typically, these folks reside more along the lines of IT operations teams, and generally they have operating history with performance management, change management, monitoring. They also start thinking about are these the right tasks for these folks to own? Do they have the skills to execute it properly?”
Organizations also have to balance that out with the notion of applying development practices to traditional I&O principles, and bringing a software engineering mindset to systems admin disciplines. And, according to Elliot, “It’s a hard transition.”
Compound all that with the growing complexity of applications, running in the cloud as containerized microservices, managed by Kubernetes using, say, an Istio service mesh in a multicloud environment.
TCS’ Singhal explained that containers are not permanent, and microservices deployments have shorter execution times. Therefore, any instrumentation in these types of deployments could itself affect the guarantee of application performance, she said. As for functions as a service, which are stateless, application state needs to be maintained explicitly for performance analysis, she continued.
It is these changes in software architectures and infrastructure that are forcing organizations to rethink how they approach performance monitoring from a culture standpoint and from a tooling standpoint.
APM vendors are adding the capability to do infrastructure monitoring, which encompasses server monitoring, some amount of log file analysis, and some amount of network performance monitoring, Gartner’s Rich said. Others are adding or have added capabilities to map out business processes and relate the milestones in a business process to what the APM solution is monitoring. “All the data’s there,” Rich said. “It’s in the payloads, it’s accessible through APIs.” He said this ability to bring out and visualize data can show you, for instance, why Boston users have been abandoning their carts at a rate 20% higher than New York users over the last three days, and point to something in the application that explains that.
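To make the Boston and New York comparison concrete, here is a small Python sketch of the arithmetic behind a "20% higher abandonment" finding. The cart counts are invented for illustration, and the function names are hypothetical; a real APM product would pull these figures from its own payload or API data.

```python
def abandonment_rate(carts_created, carts_purchased):
    """Fraction of carts that were created but never purchased."""
    if carts_created <= 0:
        raise ValueError("carts_created must be positive")
    return (carts_created - carts_purchased) / carts_created


def relative_difference(rate_a, rate_b):
    """How much higher rate_a is than rate_b, as a fraction of rate_b."""
    return (rate_a - rate_b) / rate_b


# Hypothetical three-day totals per city.
boston = abandonment_rate(carts_created=1000, carts_purchased=280)    # 72% abandoned
new_york = abandonment_rate(carts_created=1000, carts_purchased=400)  # 60% abandoned

diff = relative_difference(boston, new_york)
print(f"Boston abandonment is {diff:.0%} higher than New York")
```

The difference here is relative, not absolute: 72% is 12 percentage points above 60%, which is 20% higher in relative terms. A dashboard should state which of the two it is reporting.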