Application performance monitoring is more important than ever, due to the rising complexity of software applications, architectures and the infrastructure that runs them.
When monitoring tools were first developed, the systems they watched were fairly simple: a monolithic application, running in a corporate-owned data center, on one network. The idea was to watch the telemetry (Why were response times so slow? Why wasn’t the application available?), analyze the signals that came in, and find the right person to resolve the issue. And, in a world where ‘instant gratification’ wasn’t yet a thing, users wouldn’t howl if it took some time to resolve the issue. Applications weren’t a driver of business then; they were seen as supporting business.
Today, with the explosion of microservices, containers, cloud infrastructures and devices on which to access applications, the old APM tools aren’t up to the complexity. And users certainly won’t tolerate slow responses or failing shopping carts.
This guide will look at two monitoring software providers who have created solutions coming at the problem from different perspectives, and what they see as necessary to effectively monitor today’s application performance.
Catchpoint CEO Mehdi Daoudi has flipped how the industry should look at monitoring on its head, from two angles. First, legacy APM tools have been obsessed with what’s going on internally — where the bad code is, or what part of the network is slow. Today, organizations need to understand the user experience, and then infer from that where the problem is. Digital experience monitoring, which is what Catchpoint offers, takes an outside-in view of application performance, where others look at internals to try to understand what the customer is experiencing.
Second, Daoudi believes the idea of buying monitoring solutions before understanding what problem the enterprise is trying to solve is backwards. He told SD Times that businesses should first identify the problems that exist in their systems, and then apply tooling to that.
Lightstep CTO and co-founder Daniel “Spoons” Spoonhower said the pain of finding and resolving problems in application performance hasn’t changed in … well, forever. Technologies have changed, organizations have changed, and monitoring tools need to change. He said the promise of APM is to use data to explain what’s happening, so data collection becomes critical. Today’s monitoring tools need to present engineers with context, he said, and should emphasize tracing as a way to get that context and begin to understand the causal relationships and dependencies at the root of system problems and failures.
Lightstep takes a decidedly inside-out view of monitoring, but enables integrations with other types of monitoring tools to round out the offering, including the user experience.
Software and systems complexities
Technology has become more complex, as noted above. But just as individual development teams are working on smaller pieces of the overall application puzzle, it’s the setup of those teams — working autonomously on their project, not necessarily concerned with the other parts — that makes it more difficult to get to the root cause of problems.
Daoudi agreed that fragmentation is a major impediment to understanding what’s going on in software performance. He used the image of six blindfolded people and an elephant to describe it. One person grabs its tail and thinks he has a rope. One holds a tusk and thinks it’s a spear of some kind. One touches its massive side and thinks it’s a wall. None of them, though, can grasp that what they’re touching are parts of something much larger. They can’t see that.
“When you think about it, let’s say you and I run this company and we have an e-commerce platform. We’re running it on Google Cloud. Our infrastructure is Google Cloud, we’ve built our services, the shopping cart, inventory, we hook up to UPS to ship T-shirts to people. You have to have an understanding of the environment this is working on, then you have the components of Google Cloud that are not available to you. But when you think about delivering that web page to a user in Portland so they can buy a T-shirt, look how much they have to go through. They have to go through T-Mobile in Seattle, through the internet, and we’re probably using NS-1 for our network, and on our sites we’re tracking some ads and doing A/B testing. The challenge with monitoring is, and why it’s still so hard to capture the full picture of the elephant, is that it’s freaking complex. I can’t make this up. It’s just very complex. There is no other thing.”
Observability is a good start
The goal of monitoring, Daoudi said, is to be able to have an understanding of what’s broken, why it’s broken, and where it’s broken. That’s where observability comes in. Catchpoint defines observability as “a measure of how well internal states of a system can be inferred from knowledge of its external output.” Catchpoint has created observability.com to address this, and, as Daoudi noted, observability is a way of doing things — not a tool.
Spoonhower described observability as giving organizations a way to quickly navigate from effect back to the cause. “Your users are complaining your service is slow, you just got paged because it’s down, you need to be able to quickly — as a developer, as an operator — move from the effect back to what the root cause is, even if there could be tens of thousands or even millions of different potential root causes,” he said. “You need to be able to do that in a handful of mouse clicks.”
And that is why the use of artificial intelligence and machine learning is growing in importance. Today, with the massive amounts of data being collected, it’s unreasonable to believe humans can digest it all and make correct decisions from all the noise coming in. “I think anything that has AI in it is going to be hyped to some extent,” Spoonhower said. “For me, what’s really critical here, and what I think has fundamentally changed in terms of the way APM tools work, is that we don’t expect humans to draw all of the conclusions. There are too many signals, there’s too much data, for a human to sit down and look at a dashboard and use their intuition to try to understand what’s happening in the software. We have to apply some kind of ML or AI or other algorithms to help sift through all the signals and find the ones that are relevant.”
Daoudi said observability is focused on collecting the telemetry and putting it in one place where it can be correlated. “AIOps is a fancy word for what you and I probably remember as event correlation back in the day, right? It’s a set of rules. You need to define the dependencies … this app runs on this server, or this container … whatever. If you don’t understand, then all of this is just signals, more alerts, more people getting tired of responding at 2 o’clock in the morning to alarms, or not seeing the problem at all.”
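The rules-and-dependencies idea Daoudi describes can be sketched in a few lines of Python. This is only an illustration of dependency-based event correlation, not how Catchpoint or any AIOps product implements it; the component names, the `DEPENDS_ON` map, and the `correlate` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dependency rules: each component maps to the thing it runs on.
DEPENDS_ON = {
    "shopping-cart": "container-7",
    "inventory": "container-7",
    "container-7": "server-a",
    "checkout": "server-b",
}

def root_of(component, alerting):
    """Follow the dependency chain and return the deepest component that is also alerting."""
    root = component
    current = component
    seen = {component}
    while current in DEPENDS_ON:
        current = DEPENDS_ON[current]
        if current in seen:  # guard against a cyclic rule set
            break
        seen.add(current)
        if current in alerting:
            root = current
    return root

def correlate(alerts):
    """Group raw alerts into incidents keyed by their most likely root component."""
    alerting = set(alerts)
    groups = defaultdict(list)
    for component in alerts:
        groups[root_of(component, alerting)].append(component)
    return dict(groups)

# Four raw alerts arrive at 2 o'clock in the morning.
alerts = ["shopping-cart", "inventory", "server-a", "checkout"]
incidents = correlate(alerts)
print(incidents)
```

With these made-up rules, the four alerts collapse into two incidents: one rooted at `server-a` (which also explains the shopping cart and inventory alerts) and one for `checkout` on its own. Without the dependency map, the same input is just four unrelated pages.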
Adding to the technical complexity is the fact that teams are changing and being reorganized, and that services aren’t static. Spoonhower said, “Establishing and maintaining service ownership, and understanding what that is, I think, is sort of a double-edged problem, both from a leadership point of view where you’re trying to understand, wait, I know this service here is part of the problem but who do I talk to about that? On the other side, from the teams, what I’ve seen is teams often will get a few services dumped on them that were left over from a reorg or somebody left, and that’s a really stressful position to be in because at some level, they are in control but they don’t have the knowledge to do that, so that’s the other place when an observability tool can come into play, as part of both holding teams accountable and providing them with information that doesn’t necessarily need to live on through tribal knowledge. There should be a way when I get paged to quickly get a view of how that service is behaving and how it’s interacting with other services, even if I’m not an expert in the code.”
Collecting data, and putting it in one place to be able to ‘connect the dots’ and see the bigger picture, is what modern monitoring tools are bringing to the table.
“The biggest problem I see with monitoring is not too many alerts; it’s actually missing the whole thing,” Daoudi said. By looking at individual metrics without having a global view of the applications and system, you might detect a tremor someplace but miss a larger earthquake. Or you see a plane engine starting to fail and work to resolve that problem, but miss the fact that external components the engine is dependent upon also failed and resulted in a crash.
Tools are only part of the solution
Both Spoonhower and Daoudi were quick to point out that tools are important for monitoring, but they are just tools.
At the heart of monitoring is the need for organizations to quickly understand why releases are failing or why performance has gone down. Spoonhower said: “I think the pain is that the costs of achieving that are quite high, either in terms of the raw dollars if you’re paying a vendor, if you’re paying for infrastructure to run your own solution; or just the amount of time that it takes an engineer to… they did a deployment, and now they’re going to sit and stare at a dashboard for 20 or 30 minutes. That’s a lot of time when they could be doing something else.”
He lamented the fact that the legacy APM approach is tools-centric. “Even the names, like logs, is not a solution to a problem; it’s a tool in your tool belt,” Spoonhower said. “Metrics … it’s a kind of data, and I think the way we think of it and I think the right way to think of it is, what problems are people trying to solve? They’re trying to understand what the root cause of this outage is, so they can roll it back and go back to sleep. And so, by focusing a little bit more on the workflows, we’ll figure out as a solution what the right data to help you solve the problem is. It shouldn’t be up to you to say, ‘Ahh, this is a metrics problem; I should be using my metrics tool. Or this is a logging problem; use the log tool.’ No. It’s a deployment problem, it’s an incident problem, it’s an outage problem.”
Catchpoint’s Daoudi said people have the unreasonable expectation that they can simply license one tool that can cover every aspect of monitoring. “There is no single tool that does the whole thing,” he said. “The biggest mistake people make is they get the tool first and then they ask questions later. You should ask, ‘What is it that I want my monitoring tools to help me answer?’ and then you start implementing a monitoring tool. What is the question, then you collect data to answer the question. You don’t collect data to ask more questions. It’s an infinite loop.
“I tell customers, before you go and invest gazillions of dollars in a very expensive set of tools, why don’t you just start by understanding what your customers are feeling right now,” Daoudi continued. “That’s where we play a big role, in the sense of ‘let me tell you first how big the problem is. Oh, you have 27% availability. That’s a big problem.’ Then you can go invest in the tools that can show you why you have 27% availability. Buying tools for the sake of buying tools doesn’t help.”
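As a rough sketch of what "how big the problem is" can look like in practice, the Python below computes availability from outside-in checks, broken down by where the check ran. The probe data, the locations, and the `availability_by_location` helper are invented for illustration and do not come from Catchpoint.

```python
from collections import defaultdict

# Hypothetical outside-in probe results: (probe location, did the page load?)
PROBES = [
    ("Portland", True), ("Portland", False), ("Portland", False), ("Portland", False),
    ("Seattle", True), ("Seattle", True), ("Seattle", True), ("Seattle", False),
]

def availability_by_location(probes):
    """Return the percentage of successful probes for each location."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for location, ok in probes:
        totals[location] += 1
        if ok:
            successes[location] += 1
    return {loc: 100.0 * successes[loc] / totals[loc] for loc in totals}

report = availability_by_location(PROBES)
for location, pct in report.items():
    print(f"{location}: {pct:.0f}% available")
```

A number like this says nothing about why the page failed to load, which is the point of the quote: it sizes the problem from the customer's side first, and the choice of deeper tooling follows from there.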
All about the customer
The technology world is playing a bigger role in driving business outcomes, so the systems that are created and monitored must place the customers’ interests above all else. Retail customers, for example, increasingly get their first impression of a brand not by walking into a store but from its website or mobile app, and that is especially true during the novel coronavirus pandemic.
“A lot of people are talking about customer centricity. IT teams becoming more customer centric,” Daoudi explained. “Observability. SRE. But let’s take a step back. Why are we doing all of this? It’s to delight our customers, our employees, to not waste their time. If you want to go and buy something on Amazon, the reason you keep going back to Amazon is that they don’t waste our time. It works, their website is fast, you click add, you click checkout, and off you go.
“And that’s why it’s important to monitor from where your customers are, always,” he continued. “Then you can infer what’s broken from a customer’s perspective. And then, you tie it to all the internals. For example, if I had a pain in my arm right now, and went to a microneurosurgeon, he’d ask, ‘Why are you coming to me? I don’t know what you have. You should go to your regular doctor. Are you ready to have a surgery on your arm? I can take off your finger, you’ll feel better.’ But first, I have a pain, take an X-ray, see what’s wrong, and find the right doctor to take care of it.”