If you do a Google search for the phrase “observability tools,” it’ll return about 3.3 million results. As observability is the hot thing right now, every vendor is trying to get aboard the observability train. But observability is not as simple as buying a tool; it’s more of a process change — a way of collecting data and using that data to provide better customer experiences.
“Right now there’s a lot of buzz around observability, observability tools, but it’s not just the tool,” said Mehdi Daoudi, CEO of digital experience monitoring platform Catchpoint. “That’s the key message. It’s really about how can we combine all of these data streams to try to paint a picture.”
If you go back to where observability came from (the term originated in control theory before it was adopted in software operations), its original definition was about measuring “how well internal states of a system can be inferred from knowledge of its external outputs,” said Daoudi.
Daoudi shared an example of observability in action: one of Catchpoint’s customers noticed that its own users complained a lot on Mondays and Tuesdays, but not on Sundays. The server load was the same, but the services were slower. Through observability, the company was able to determine that backup processes that run only on weekdays were the culprit and were impacting performance.
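The kind of comparison behind that finding can be sketched in a few lines of Python. This is only an illustration, not Catchpoint’s method: the sample rows, the field layout, and the `mean_by` helper are all made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (weekday, server_load_percent, response_ms, backup_running)
SAMPLES = [
    ("Sun", 40, 210, False),
    ("Sun", 42, 205, False),
    ("Mon", 41, 480, True),
    ("Mon", 40, 510, True),
    ("Tue", 43, 495, True),
    ("Tue", 41, 470, True),
]

def mean_by(samples, key_index, value_index):
    """Average the value column for each distinct value of the key column."""
    groups = defaultdict(list)
    for row in samples:
        groups[row[key_index]].append(row[value_index])
    return {key: mean(values) for key, values in groups.items()}

# Server load looks flat across days, so load alone does not explain the complaints.
load_by_day = mean_by(SAMPLES, 0, 1)
# Response time is clearly worse on the days users complain about.
latency_by_day = mean_by(SAMPLES, 0, 2)
# Grouping by a second data stream (is a backup running?) points at the cause.
latency_by_backup = mean_by(SAMPLES, 3, 2)

print(load_by_day)
print(latency_by_day)
print(latency_by_backup)
```

The point of the sketch is that the answer only appears once a separate data stream, here the assumed backup schedule, is combined with the latency data.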
“Observability is about triangulation,” said Daoudi. “It’s about being able to answer a very, very complex question, very, very quickly. There is a problem – where is the problem? The reason why this is important is because things have gotten a lot more complex. You’re not dealing with one server anymore, you’re dealing with hundreds of thousands of servers, cloud, CDNs, a lot of moving parts where each one of them can break. And so not having observability into the state of those systems, that makes your triangulation efforts a lot harder, and therefore longer, and therefore has an impact on the end users and your brand and revenue, etc.”
This is why Daoudi firmly believes that observability isn’t just a set of tools. He sees it as a way of working as a company, being aligned, and being able to have a common way to collect data that is needed to answer questions.
The industry has standardized on OpenTelemetry as the common way of collecting telemetry data. OpenTelemetry is an open-source tool used for gathering metrics, logs, and traces, often referred to as the three pillars of observability.
The three pillars are often referenced in the industry when talking about observability, but Ben Sigelman, CEO and co-founder of monitoring company Lightstep, believes that observability needs to go beyond metrics, logs, and traces. He compared the three pillars to Steve Jobs announcing the first iPhone back in 2007. Jobs started off the presentation by announcing a widescreen iPod with touch controls, a “revolutionary” mobile phone, and a breakthrough internet communications device, making it seem as though they were three separate devices.
“These are not three separate devices,” Jobs went on to clarify. “This is one device, and we are calling it iPhone.” Sigelman said the same is true of telemetry. Metrics, logs, and traces shouldn’t be known as the three pillars because you get all three at once and it’s one thing: telemetry.
Michael Fisher, group product manager at AIOps company OpsRamp, broke observability data down further into two signals: symptomatic signals and causal signals. Symptomatic signals are what an end user is experiencing, such as page latency or a 500 Internal Server Error on a website. Causal signals are what cause those symptomatic signals. Examples include CPU, network, and storage metrics, and “things that may be an issue, but you’re not sure because they’re not being tied to any symptom that an end user might be facing.”
Monitoring tools tend to focus mostly on the causal signals, Fisher explained, but he recommends starting with symptomatic signals and working towards causal signals, with the end state being a union of the two.
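One rough way to picture that symptom-first approach is the Python sketch below. It is not OpsRamp’s implementation; the `Signal` type, the `link_causes` function, the signal names, and the 60-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signal:
    kind: str       # "symptomatic" (what the user sees) or "causal" (infrastructure)
    service: str
    name: str
    timestamp: int  # seconds, on any shared clock

def link_causes(signals, window_s=60):
    """Start from each symptom and collect causal signals on the same
    service that fired up to window_s seconds before it."""
    symptoms = [s for s in signals if s.kind == "symptomatic"]
    causes = [s for s in signals if s.kind == "causal"]
    linked = {}
    for symptom in symptoms:
        linked[symptom.name] = [
            c.name
            for c in causes
            if c.service == symptom.service
            and 0 <= symptom.timestamp - c.timestamp <= window_s
        ]
    return linked

# Hypothetical signals for three made-up services.
signals = [
    Signal("symptomatic", "checkout", "page_latency_high", 1000),
    Signal("causal", "checkout", "cpu_saturated", 970),
    Signal("causal", "search", "disk_io_high", 990),
    Signal("symptomatic", "login", "http_500", 2000),
]

linked = link_causes(signals)
print(linked)
```

In this toy data, the latency symptom is tied to a CPU signal, the `disk_io_high` signal is not tied to any symptom, and the `http_500` symptom has no matching causal signal at all, which is the case where users complain even though the infrastructure looks healthy.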
“When something is going wrong [the developer] can search that log, they can search that trace and they can tie it back to the piece of code that’s having an issue,” said Fisher. “The operations team, they may just see the causal symptoms, or maybe there is no causal symptom. Maybe the application is running fine but users are still complaining. Tying those two together is kind of a key part of this shift towards observability. And that’s why I talk about observability as a development principle because I think starting with the symptomatic signals with the people who actually know is a huge paradigm shift for me because I think some of the people you talk to or ITOps teams you talk to is that monitoring is their wheelhouse, whereas many modern shops, OpsRamp included, much more monitoring actually happens on the development team side now.”
Providing a good end-user experience is the ultimate goal of observability. With monitoring, you might focus only on causal signals and miss important symptomatic signals, such as an end user experiencing service degradation or having trouble accessing your application.
“When I talk about using observability to drive end-user outcomes, I’m really talking about focusing on observing the things that would impact end users and taking action on them before they do because traditionally this focus on monitoring has been at a much lower level, layer 3, I care about my network, I care about my switches,” said Fisher. “I’ve talked to customers where that’s all they care about, which is fine but you start to realize those things really matter less once you move up the stack and you have a webpage or you have a SaaS application. The end user will never tell you that their CPU is high, but they will tell you that your webpage is taking 10 seconds to load and they couldn’t use your tool. If an end user can’t use your tool who gives a damn about anything else?”
It’s important that observability not just stay in the hands of developers. In fact, Bernd Greifeneder, CTO of monitoring company Dynatrace, believes that if developers just do observability on their own, then it’s nothing more than a debugging tool. “The reason then for DevOps and SREs needs to come into play is to help with a more consistent approach because these days multiple teams create different microservices that are interconnected and have to interplay. This is sort of a complexity challenge and also a scale challenge that needs to be solved. This is where an SRE and Ops team have to help with standing up proper observability tooling or monitoring if you will, but making sure that all the observability data comes together in a holistic view,” he said.
SRE and Ops teams can help make sure that the observability data that the developers are collecting has the proper analytics on top of it. This will enable them to gain insights from observability data and use those insights to drive automation and further investments into observability. “IT automation means higher availability, it means automatic remediation when services fail, and ultimately means better experiences for customers,” Greifeneder said.
When looking into the tools to put on top of your observability data to do those analytics, Tyler McMullen, CTO of edge cloud platform Fastly, recommends constantly experimenting to see what works for your team. He explained that observability vendors often charge a lot of money, and teams might fall into the trap of buying a solution, putting too much observability data into it, and then being shocked by the bill.
“Are the pieces of information that we’re plugging into our observability, are they actually working for us? If they’re not working for us, we definitely shouldn’t have them in there,” said McMullen. “On the other hand, you only really find out whether or not something is useful after it becomes useful. Figuring out what you need in advance is I think, one of the biggest problems with this thing. You don’t want to put too much in. On the other hand, if you put too little in you don’t know whether or not it is useful.” As a result, your team will need to do lots of experimenting to discover the right process and the right balance.
Daoudi added that it’s also important to answer the question of why you’re doing observability before looking into products. “Like every new thing that when a company goes and decides to implement something, you start with why? Why do you need to implement observability? Why do you need to implement SREs? Why do you need to implement an HR system? If you don’t define the ‘why’ then what typically happens is first it’s a huge distraction to your company and also a lot of resources being wasted and then the end result might not be what you’re looking for,” he said.
And of course, it’s important to remember that observability is more of a process, so looking for a tool that will do observability for you won’t work. The tooling is really about analytics on the observability data you’ve gathered.
“I really don’t think observability is a tool,” said Daoudi. “If there was such a thing as go to Best Buy, aisle 5, or Target, or Walmart and buy an observability tool for like $5 million, it ain’t going to work because if your company is not functioning and aligned, and your processes and everything isn’t aligned around what observability is supposed to do, then you’re just going to have shelfware in your company.”