When people hear ‘microservices’ they often think of Kubernetes, a declarative container orchestrator. Because of that declarative nature, Kubernetes manages pods and deployments as its primary entities rather than the services they make up, which presents some challenges when it comes to troubleshooting. Let’s take a look at why troubleshooting microservices in a Kubernetes environment can be challenging, and at some best practices for getting it right.
To understand why troubleshooting microservices can be challenging, let’s look at an example. If you have an application in Kubernetes, you can deploy it as a pod and leverage Kubernetes to scale it. The entity is a pod that you can monitor. With microservices, you shouldn’t monitor pods; instead, you should monitor services. So you can have a monolithic workload (a single container deployed as a pod) and monitor it, but if you have a service made up of several different pods, you need to understand the interactions between those pods to understand how the service is behaving. If you don’t do that, what you think is an event might not really be an event (i.e. might not be material to the functioning of the service).
When it comes to monitoring microservices, you need to monitor at the service level, not the pod level. If you try to monitor at the pod level, you’ll be fighting with the orchestrator and might get it wrong. I recognize that “You should not be monitoring pods” is a bold statement, but I believe that if you’re doing that, you won’t get it right the majority of the time.
Common sources of issues when troubleshooting microservices
Network, infrastructure, and application issues are all commonly seen when troubleshooting microservices.
Issues at the network level are the hardest ones to debug. If the problem is in the network, you need to look at socket-layer stats. The underlying network has sockets connecting point A to point B, so you need to look at round-trip time at the network level, check whether packets are being dropped or retransmitted, whether there’s a routing issue, and so on.
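As a rough illustration, socket-layer stats like round-trip time and retransmits are what tools such as `ss -ti` print per connection. The sketch below parses one such detail line; note the sample line is made up for illustration, and the exact field layout varies by iproute2 version.

```python
import re

# Sample detail line shaped like `ss -ti` output (illustrative, not
# captured from a real host; exact fields vary by iproute2 version).
sample = "cubic wscale:7,7 rto:204 rtt:1.5/0.75 ato:40 retrans:0/3 rcv_space:14600"

def parse_socket_stats(line):
    """Pull round-trip time (ms) and total retransmits out of one stats line."""
    stats = {}
    rtt = re.search(r"rtt:([\d.]+)/[\d.]+", line)
    if rtt:
        stats["rtt_ms"] = float(rtt.group(1))
    retrans = re.search(r"retrans:\d+/(\d+)", line)
    if retrans:
        stats["retransmits"] = int(retrans.group(1))
    return stats

print(parse_socket_stats(sample))  # → {'rtt_ms': 1.5, 'retransmits': 3}
```

Rising round-trip times or a climbing retransmit counter on the sockets backing a service are the kind of signals that point at the network rather than the application.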
One way infrastructure issues can manifest is as pod restarts (crash looping in Kubernetes). This can happen for many reasons. For example, if you have a pod in your service that can’t reach the Kubernetes data store, Kubernetes will restart it. You need to track the status of the pods that are backing the service. If you see several or frequent pod restarts, it becomes an issue.
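Tracking restarts across the pods backing a service might look like the sketch below. The pod snapshot is hardcoded and simplified; in practice you would pull restart counts from the container statuses that `kubectl get pods -o json` reports, and the field names and limit here are assumptions for illustration.

```python
# Simplified snapshot of the pods backing one service (hypothetical names;
# restart counts would come from kubectl / the Kubernetes API in practice).
pods = [
    {"name": "checkout-7d4f-abc12", "restart_count": 1},
    {"name": "checkout-7d4f-def34", "restart_count": 9},
    {"name": "checkout-7d4f-ghi56", "restart_count": 0},
]

def service_restart_report(pods, per_pod_limit=5):
    """Aggregate restarts across a service's pods and flag the pods
    restarting often enough to be an issue rather than noise."""
    total = sum(p["restart_count"] for p in pods)
    noisy = [p["name"] for p in pods if p["restart_count"] > per_pod_limit]
    return {"total_restarts": total, "noisy_pods": noisy}

print(service_restart_report(pods))
# → {'total_restarts': 10, 'noisy_pods': ['checkout-7d4f-def34']}
```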
Another common infrastructure issue is the Kubernetes API server being overloaded and taking a long time to respond. Every time something needs to happen, pods need to talk to the API server—so if it’s overloaded, it becomes an issue.
A third infrastructure issue is related to the Domain Name System (DNS). In Kubernetes, your services are identified by names, which get resolved with a DNS server. If those resolutions are slow, you start to see issues.
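A minimal way to check for slow resolutions is to time a lookup directly. The sketch below uses `localhost` so it runs anywhere; inside a cluster you would substitute a service name (for example, the `<service>.<namespace>.svc.cluster.local` form Kubernetes DNS serves).

```python
import socket
import time

def resolve_latency_ms(hostname):
    """Time a single name resolution in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000.0

# "localhost" resolves without leaving the machine, keeping the sketch
# runnable anywhere; swap in a Kubernetes service name in-cluster.
latency = resolve_latency_ms("localhost")
print(f"resolved in {latency:.2f} ms")
```

Sampling this periodically and watching for the latency creeping up is a cheap way to catch DNS-related slowness before it surfaces as application errors.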
There are several common application issues that can lead to restarts and errors. For example, if your service load balancing isn’t happening, say because a URL changed or the load balancer is misconfigured, you could be overloading a single pod and causing it to restart.
If your URLs are not constructed properly, you’ll get a 404 Not Found response; if the server is overloaded, you’ll get a 500-level error. These are application issues that manifest as infrastructure issues.
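A simple way to tell these two failure modes apart is to bucket response codes: 4xx responses usually point at the caller (bad URL construction), while 5xx responses point at the server (overload). The codes below are made-up sample data for illustration.

```python
from collections import Counter

# Illustrative response codes, e.g. scraped from a service's access logs.
codes = [200, 200, 404, 500, 200, 404, 503]

def classify_errors(codes):
    """Split errors into client-side (4xx, often bad URL construction)
    and server-side (5xx, often overload) buckets."""
    buckets = Counter()
    for code in codes:
        if 400 <= code < 500:
            buckets["client_error"] += 1
        elif code >= 500:
            buckets["server_error"] += 1
    return dict(buckets)

print(classify_errors(codes))  # → {'client_error': 2, 'server_error': 2}
```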
Best practices for troubleshooting microservices
Here are two best practices for effectively identifying and troubleshooting microservice issues.
1. Aggregate data at the service level
You need to use a tool that provides data (i.e. a log) aggregated at the service level, so you can see how many pod restarts, error codes, etc. occurred. This is different from the approach most DevOps engineers use today, where every pod restart is a separate alert, burying engineers in alerts that may just reflect normal operations or Kubernetes correcting itself.
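The rollup itself is straightforward; the sketch below turns a stream of per-pod events (hypothetical service and pod names) into one summary row per service, which is the shape a single service-level alert can be driven from.

```python
from collections import defaultdict

# Per-pod events as most alerting setups see them: one line per event.
# Service and pod names here are made up for illustration.
events = [
    {"service": "checkout", "pod": "checkout-abc", "type": "restart"},
    {"service": "checkout", "pod": "checkout-def", "type": "restart"},
    {"service": "checkout", "pod": "checkout-abc", "type": "5xx"},
    {"service": "cart",     "pod": "cart-xyz",     "type": "restart"},
]

def aggregate_by_service(events):
    """Roll per-pod events up to one summary per service, so a single
    alert can describe the service instead of one alert per pod."""
    summary = defaultdict(lambda: defaultdict(int))
    for e in events:
        summary[e["service"]][e["type"]] += 1
    return {svc: dict(counts) for svc, counts in summary.items()}

print(aggregate_by_service(events))
# → {'checkout': {'restart': 2, '5xx': 1}, 'cart': {'restart': 1}}
```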
Some DevOps engineers might wonder if a service mesh can be used to aggregate data in this way. While service meshes have observability tooling baked in, you need to be careful: many of them sample because of the large amount of data involved, and they hand you raw data plus labels so you can aggregate it yourself. What you really need is a tool that gives you just the data you need for the service, along with service-level reporting.
2. Use machine learning
When trying to identify and troubleshoot microservice issues, you need to monitor how each pod belonging to your service is behaving. This means monitoring metrics like latency, number of process restarts, and network connection errors. There are two ways to do this:
Set a threshold — For example, if there are more than 20 errors, create an alert. This is a bit of a naive approach in a dynamic system like Kubernetes, particularly with microservices.
Baselining — Use machine learning to study how a metric behaves over time, and build a machine learning model to predict how that metric will behave in the future. If the metric deviates from its baseline, you will receive an alert specifying which parameters led the machine learning algorithm to believe there was an issue.
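The two approaches above can be contrasted in a toy sketch. A real baselining system would learn a model from historical data; here a simple three-standard-deviation rule over a short (synthetic) history stands in for the learned baseline, purely to show why a static threshold and a baseline can disagree.

```python
from statistics import mean, stdev

# Hourly error counts for a service; synthetic numbers for illustration.
history = [4, 6, 5, 7, 5, 6, 4, 5]
latest = 14

# Approach 1: static threshold — fires the same way regardless of what
# "normal" looks like for this service.
THRESHOLD = 20
threshold_alert = latest > THRESHOLD

# Approach 2: baselining — a toy stand-in for a learned model: flag the
# point if it sits more than three standard deviations above history.
baseline, spread = mean(history), stdev(history)
baseline_alert = latest > baseline + 3 * spread

print(threshold_alert, baseline_alert)  # → False True
```

Fourteen errors is well under the static threshold of 20, so that approach stays silent; against a baseline of roughly five errors per hour, the same value is a clear anomaly. That gap is exactly why a fixed threshold tends to be either too noisy or too quiet in a dynamic system like Kubernetes.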
I advise against trying to set a threshold—you’ll be flooded with alerts and this will cause alert fatigue. Instead, use machine learning. Over time, a machine learning algorithm can start to alert you before an issue arises.