The modern world of application monitoring

Published: July 1st, 2020

Application performance monitoring is more important than ever, due to the rising complexity of software applications, architectures and the infrastructure that runs them.

When monitoring tools were first developed, the systems they were looking at were fairly simple: a monolithic application, running in a corporate-owned data center, on one network. The idea was to watch the telemetry (Why were response times so slow? Why wasn’t the application available?), analyze the signals that came in, and find the right person to resolve the issue. And, in a world where ‘instant gratification’ wasn’t yet a thing, users wouldn’t howl if it took some time to resolve the issue. Applications weren’t a driver of business then; they were seen as supporting business.

Today, with the explosion of microservices, containers, cloud infrastructures and devices on which to access applications, the old APM tools aren’t up to the complexity. And users certainly won’t tolerate slow responses or failing shopping carts.

This guide will look at two monitoring software providers who have created solutions coming at the problem from different perspectives, and what they see as necessary to effectively monitor today’s application performance.

Catchpoint CEO Mehdi Daoudi argues that the industry should turn its view of monitoring on its head, in two ways. First, legacy APM tools have been obsessed with what’s going on internally: where the bad code is, or what part of the network is slow. Today, organizations need to understand the user experience, and then infer from that where the problem is. Digital experience monitoring, which is what Catchpoint offers, takes an outside-in view of application performance, where others look at internals to try to understand what the customer is experiencing.

Second, Daoudi believes the idea of buying monitoring solutions before understanding what problem the enterprise is trying to solve is backwards. He told SD Times that businesses should first identify the problems that exist in their systems, and then apply tooling to that.

Lightstep CTO and co-founder Daniel “Spoons” Spoonhower said the pain of finding and resolving problems in application performance hasn’t changed in … well, forever. Technologies have changed, organizations have changed, and monitoring tools need to change. He said the promise of APM is to use data to explain what’s happening, so data collection becomes critical. Today’s monitoring tools need to present engineers with context, he said, and should emphasize tracing as a way to get that context and begin to understand the causal relationships and dependencies that are at the root of system problems and failures.

Lightstep takes a decidedly inside-out view of monitoring, but enables integrations with other types of monitoring tools to round out the offering, including the user experience.

Software and systems complexities
Technology has become more complex, as noted above. But just as individual development teams are working on smaller pieces of the overall application puzzle, it’s the setup of those teams — working autonomously on their project, not necessarily concerned with the other parts — that makes it more difficult to get to the root cause of problems.

“If I’m just sitting by myself in my garage running hundreds of microservices, [monitoring] is probably not that much worse,” Spoonhower said. “I think the thing that happened is that microservices allowed these teams to work independently, so now you’re not just doing one release a week; your organization is doing 20 or 30 releases a day. … I think it’s more about the layers of distinct ownership where you as an individual services owner can only control your one service. That’s the only thing you can really roll back. But you’re dependent on all these other things and all of these other changes that are happening at the same time — changes in terms of users, changes in terms of the infrastructure, other services, third-party providers — and the gap where tools are really falling down has more to do with the organizational change than it has to do with the fact that we’re running in Docker containers.”

Daoudi agreed that fragmentation is a major impediment to understanding what’s going on in software performance. He used the image of six blindfolded people and an elephant to describe it. One person grabs its tail and thinks he has a rope. One holds a tusk and thinks it’s a spear of some kind. One touches its massive side and thinks it’s a wall. None of them, though, can grasp that what they’re touching are parts of something much larger. They can’t see that.

“When you think about it, let’s say you and I run this company and we have an e-commerce platform. We’re running it on Google Cloud. Our infrastructure is Google Cloud, we’ve built our services, the shopping cart, inventory, we hook up to UPS to ship T-shirts to people. You have to have an understanding of the environment this is working on, then you have the components of Google Cloud that are not available to you. But when you think about delivering that web page to a user in Portland so they can buy a T-shirt, look how much they have to go through. They have to go through T-Mobile in Seattle, through the internet, and we’re probably using NS-1 for our network, and on our sites we’re tracking some ads and doing A/B testing. The challenge with monitoring is, and why it’s still so hard to capture the full picture of the elephant, is that it’s freaking complex. I can’t make this up. It’s just very complex. There is no other thing.”

Observability is a good start
The goal of monitoring, Daoudi said, is to be able to have an understanding of what’s broken, why it’s broken, and where it’s broken. That’s where observability comes in. Catchpoint defines observability as “a measure of how well internal states of a system can be inferred from knowledge of its external output.” Catchpoint has created observability.com to address this, and, as Daoudi noted, observability is a way of doing things — not a tool.

Spoonhower described observability as giving organizations a way to quickly navigate from effect back to the cause. “Your users are complaining your service is slow, you just got paged because it’s down, you need to be able to quickly — as a developer, as an operator — move from the effect back to what the root cause is, even if there could be tens of thousands or even millions of different potential root causes,” he said. “You need to be able to do that in a handful of mouse clicks.”
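To make the idea of moving from effect back to cause concrete, here is a minimal sketch in Python. It is not how Lightstep or any other vendor analyzes traces; the `Span` fields, the service names and the durations are all hypothetical, and it assumes child calls run one after another rather than in parallel. Given the spans of one slow request, it finds the span that spent the most time in its own work rather than waiting on the services it called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed operation in a trace (hypothetical, simplified shape)."""
    span_id: str
    parent_id: Optional[str]  # None for the root span of the request
    service: str
    duration_ms: float


def self_times(spans):
    """Return {span_id: time spent in the span itself, excluding its children}."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def likely_culprit(spans):
    """Return the span with the largest self time, a first guess at the cause."""
    times = self_times(spans)
    by_id = {s.span_id: s for s in spans}
    return by_id[max(times, key=times.get)]


# Illustrative trace: the frontend is slow only because it waits on the cart
# service, which in turn waits on the inventory service.
TRACE = [
    Span("a", None, "frontend", 950.0),
    Span("b", "a", "cart", 900.0),
    Span("c", "b", "inventory", 820.0),
    Span("d", "b", "pricing", 30.0),
]

if __name__ == "__main__":
    print(likely_culprit(TRACE).service)
```

The user sees a slow frontend, but the sketch points at the inventory service, two hops away from the symptom. A real system would have to do this across many traces and weigh far more signals than duration alone.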

And that is why the use of artificial intelligence and machine learning is growing in importance. Today, with the massive amounts of data being collected, it’s unreasonable to believe humans can digest it all and make correct decisions from all the noise coming in. “I think anything that has AI in it is going to be hyped to some extent,” Spoonhower said. “For me, what’s really critical here, and what I think has fundamentally changed in terms of the way APM tools work, is that we don’t expect humans to draw all of the conclusions. There are too many signals, there’s too much data, for a human to sit down and look at a dashboard and use their intuition to try to understand what’s happening in the software. We have to apply some kind of ML or AI or other algorithms to help sift through all the signals and find the ones that are relevant.”

Daoudi said observability is focused on collecting the telemetry and putting it in one place where it can be correlated. “AIOps is a fancy word for what you and I probably remember as event correlation back in the day, right? It’s a set of rules. You need to define the dependencies … this app runs on this server, or this container … whatever. If you don’t understand, then all of this is just signals, more alerts, more people getting tired of responding at 2 o’clock in the morning to alarms, or not seeing the problem at all.”
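The rule-based event correlation Daoudi describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Catchpoint's implementation or any AIOps product's: the component names, the `DEPENDS_ON` map and the alert list are invented for the example. Each alert is filed under the deepest alerting component it depends on, so that a failing server produces one grouped incident rather than three separate pages.

```python
def alerting_roots(component, depends_on, alerting, path=()):
    """Return the deepest alerting components reachable from `component`.

    `depends_on` maps a component to the components it runs on (hypothetical
    dependency rules). `path` guards against cycles in those rules.
    """
    if component in path:
        return set()
    roots = set()
    for dep in depends_on.get(component, []):
        roots |= alerting_roots(dep, depends_on, alerting, path + (component,))
    # If nothing underneath is alerting, this component is its own root cause.
    if not roots and component in alerting:
        roots.add(component)
    return roots


def correlate(alerts, depends_on):
    """Group alerting components under their probable root cause."""
    alerting = set(alerts)
    groups = {}
    for component in alerts:
        for root in alerting_roots(component, depends_on, alerting):
            groups.setdefault(root, []).append(component)
    return groups


# Illustrative rules: "this app runs on this container, on this server".
DEPENDS_ON = {
    "shop-app": ["container-7"],
    "container-7": ["server-3"],
    "search-app": ["server-9"],
}
ALERTS = ["shop-app", "container-7", "server-3", "search-app"]

if __name__ == "__main__":
    for root, symptoms in correlate(ALERTS, DEPENDS_ON).items():
        print(root, "<-", symptoms)
```

Without the dependency map, the four alerts are just four signals. With it, three of them collapse into one incident on `server-3`, and `search-app` stands out as a separate problem because nothing it depends on is alerting.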

Adding to the technical complexity is the fact that teams are changing and being reorganized, and that services aren’t static. Spoonhower said, “Establishing and maintaining service ownership, and understanding what that is, I think, is sort of a double-edged problem, both from a leadership point of view where you’re trying to understand, wait, I know this service here is part of the problem but who do I talk to about that? On the other side, from the teams, what I’ve seen is teams often will get a few services dumped on them that were left over from a reorg or somebody left, and that’s a really stressful position to be in because at some level, they are in control but they don’t have the knowledge to do that. So that’s the other place where an observability tool can come into play, as part of both holding teams accountable and providing them with information that doesn’t necessarily need to live on through tribal knowledge. There should be a way when I get paged to quickly get a view of how that service is behaving and how it’s interacting with other services, even if I’m not an expert in the code.”

Collecting data, and putting it in one place to be able to ‘connect the dots’ and see the bigger picture, is what modern monitoring tools are bringing to the table.

“The biggest problem I see with monitoring is not too many alerts; it’s actually missing the whole thing,” Daoudi said. By looking at individual metrics without having a global view of the applications and system, you might detect a tremor someplace but miss a larger earthquake. Or you see a plane engine starting to fail and work to resolve that problem, but miss the fact that external components the engine is dependent upon also failed and resulted in a crash.

Tools are only part of the solution
Both Spoonhower and Daoudi were quick to point out that tools are important for monitoring, but they are just tools.

At the heart of monitoring is the need for organizations to quickly understand why releases are failing or why performance has gone down. Spoonhower said: “I think the pain is that the costs of achieving that are quite high, either in terms of the raw dollars if you’re paying a vendor, if you’re paying for infrastructure to run your own solution; or just the amount of time that it takes an engineer to… they did a deployment, and now they’re going to sit and stare at a dashboard for 20 or 30 minutes. That’s a lot of time when they could be doing something else.”

He lamented the fact that the legacy APM approach is tools-centric. “Even the names, like logs, is not a solution to a problem; it’s a tool in your tool belt,” Spoonhower said. “Metrics … it’s a kind of data, and I think the way we think of it and I think the right way to think of it is, what problems are people trying to solve? They’re trying to understand what the root cause of this outage is, so they can roll it back and go back to sleep. And so, by focusing a little bit more on the workflows, we’ll figure out as a solution what the right data to help you solve the problem is. It shouldn’t be up to you to say, ‘Ahh, this is a metrics problem; I should be using my metrics tool. Or this is a logging problem; use the log tool.’ No. It’s a deployment problem, it’s an incident problem, it’s an outage problem.”

Catchpoint’s Daoudi said people have the unreasonable expectation that they can simply license one tool that can cover every aspect of monitoring. “There is no single tool that does the whole thing,” he said. “The biggest mistake people make is they get the tool first and then they ask questions later. You should ask, ‘What is it that I want my monitoring tools to help me answer?’ and then you start implementing a monitoring tool. What is the question, then you collect data to answer the question. You don’t collect data to ask more questions. It’s an infinite loop.

“I tell customers, before you go and invest gazillions of dollars in a very expensive set of tools, why don’t you just start by understanding what your customers are feeling right now,” Daoudi continued. “That’s where we play a big role, in the sense of ‘let me tell you first how big the problem is. Oh, you have 27% availability. That’s a big problem.’ Then you can go invest in the tools that can show you why you have 27% availability. Buying tools for the sake of buying tools doesn’t help.”
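An availability figure like the one Daoudi mentions is, at its simplest, the share of outside-in checks that succeeded. The Python sketch below shows that arithmetic, assuming a hypothetical `ProbeResult` record per check and made-up locations and results; it does not reflect how Catchpoint actually collects or scores measurements.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one check run from where users are (hypothetical shape)."""
    location: str
    ok: bool


def availability(results):
    """Percentage of checks that succeeded."""
    if not results:
        raise ValueError("no probe results to score")
    return 100.0 * sum(r.ok for r in results) / len(results)


def availability_by_location(results):
    """Availability per probe location, to show where users are hurting."""
    grouped = {}
    for r in results:
        grouped.setdefault(r.location, []).append(r)
    return {loc: availability(rs) for loc, rs in grouped.items()}


# Illustrative data: the site mostly works from Portland but not from Seattle.
RESULTS = (
    [ProbeResult("Portland", True)] * 3
    + [ProbeResult("Portland", False)]
    + [ProbeResult("Seattle", True)]
    + [ProbeResult("Seattle", False)] * 3
)

if __name__ == "__main__":
    print(availability(RESULTS))
    print(availability_by_location(RESULTS))
```

The overall number says how big the problem is; the per-location breakdown is the first hint about where to look before investing in deeper tooling.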

All about the customer
The technology world is playing a bigger role in driving business outcomes, so the systems that are created and monitored must place the customers’ interests above all else. Retail customers, for example, increasingly get their first impression of a brand not by walking into a store but from its website or mobile app, and that is especially true during the current novel coronavirus pandemic.

“A lot of people are talking about customer centricity. IT teams becoming more customer centric,” Daoudi explained. “Observability. SRE. But let’s take a step back. Why are we doing all of this? It’s to delight our customers, our employees, to not waste their time. If you want to go and buy something on Amazon, the reason you keep going back to Amazon is that they don’t waste our time. It works, their website is fast, you click add, you click checkout, and off you go.

“And that’s why it’s important to monitor from where your customers are, always,” he continued. “Then you can infer what’s broken from a customer’s perspective. And then, you tie it to all the internals. For example, if I had a pain in my arm right now, and went to a microneurosurgeon, he’d ask, ‘Why are you coming to me? I don’t know what you have. You should go to your regular doctor. Are you ready to have a surgery on your arm? I can take off your finger, you’ll feel better.’ But first, I have a pain, take an X-ray, see what’s wrong, and find the right doctor to take care of it.”

Article Tags

APM, Catchpoint, LightStep, observability, Understandability

About David Rubinstein

David Rubinstein is editor-in-chief of SD Times.


