Optimize continuous delivery with continuous reliability

Published: August 10th, 2022

The 2021 State of DevOps report indicates that greater than 74% of organizations surveyed have Change Failure Rate (CFR) greater than 16% (the report provides a range from 16% to 30%). Of these, a significant proportion (> 35%) likely have CFRs exceeding 23%.

This means that while organizations seek to increase software change velocity (as measured by the other DORA metrics in the report), a significant number of deployments result in degraded service (or service outage) in production and subsequently require remediation (including hotfix, rollback, fix forward, patch etc.). The frequent failures potentially impair revenue and customer experience, as well as incur significant costs to remediate.

Most customers whom we speak to are unable to proactively predict the risk of a change going into production. In fact, the 2021 State of Testing in DevOps report also indicates that greater than 70% of organizations surveyed are not confident about the quality of their releases. A smaller, but still significant, proportion (15%) “Release and Pray” that their changes won’t degrade production.

Reliability is a key product/service/system quality metric. CFR is one of many reliability metrics. Other metrics include availability, latency, thruput, performance, scalability, mean time between failures, among others. While reliability engineering in software has been an established discipline, we clearly have a problem ensuring reliability.

In order to ensure reliability for software systems, we need to establish practices that plan for, specify, engineer, measure and analyze reliability continuously along the DevOps life cycle. We call this “Continuous Reliability” (CR).

Key Practices for Continuous Reliability

Continuous Reliability derives from the principle of “Continuous Everything” in DevOps. The emergence (and adoption) of Site Reliability Engineering (SRE) principles has led to CR evolving to be a key practice in DevOps and Continuous Delivery. In CR, the focus is to take a continuous proactive approach at every step of the DevOps lifecycle to ensure that reliability goals will be met in production.

This implies that we are able to understand and control the risks of changes (and deployments) before they make it to production.

The key pillars of CR are shown in the figure below:

CR is not, however, the purview of site reliability engineers (SREs) alone. Like other DevOps practices, CR requires active collaboration among multiple personas such as SREs, product managers/owners, architects, developers, testers, release/deployment engineers and operations engineers.

Some of the key practices for supporting CR (that are overlaid on top of the core SRE principles) are described below.

1) Continuous Testing for Reliability

Continuous Testing (CT) is an established practice in Continuous Delivery. However, the use of CT for continuous reliability validation is less common. Specifically for validation of the key reliability metrics (such as availability, latency, throughput, performance, scalability), many organizations still use waterfall-style performance testing, where most of the testing is done in long duration tests before release. This not only slows down the deployment, but does an incomplete job of validation.

Our recommended approach is to validate these reliability metrics progressively at every step of the CI/CD lifecycle. This is described in detail in my prior blog on Continuous Performance Testing.

2) Continuous Observability

Observability is also an established practice in DevOps. However, most observability solutions (such as Business Services Reliability) focus on production data and events.

What is needed for CR is to “shift-left” observability into all stages of the CI/CD lifecycle, so that reliability insights can be gleaned from pre-production data (in conjunction with production data). For example, it is possible to glean reliability insights from patterns of code changes (in source code management systems), test results and coverage, as well as performance monitoring by correlating such data with past failure/reliability history in production.

Pre-production environments are more data rich than production environments (in terms of variety); however, most of the data is not correlated and mined for insights. Such observability requires us to set up “systems of intelligence” (SOI, see figure below) where we continuously collect and analyze pre-production data along the CI/CD lifecycle to generate a variety of reliability predictions as and when applications change (see next section).

3) Continuous Failure, Risk Insights and Prediction

An observability system in pre-production allows us to continuously assess and monitor failure risk along the CI/CD lifecycle. This allows us to proactively assess (and even predict) the failure risk associated with changes.

For example, we set up a simple SOI for an application (using Google Analytics) where I collected code change data (from the source code management system) as well as history of escaped defects (from past deployments to production). By correlating such data (gradient boosted tree algorithm), I was able to establish an understanding of what code change patterns resulted in higher levels of escaped defects. In this case, I found a significant correlation between code churn and defects leaked (see figure below).

We were then able to use the same analytics to predict how escaped defects would change based on code churn in my current deployment (see inset in the figure above).

While this is a very simple example of reliability prediction using a limited data set, we can do continuous failure risk prediction by exploiting a broader set of data from pre-production, including testing and deployment data.

For example, in my previous article on Continuous Performance Testing, I discussed various approaches for performance testing of component-based applications. Such testing generates a huge amount of data that is extremely difficult to process manually. An observability system can then be used to collect the data to establish baselines of component reliability and performance, and in turn used to generate insights in terms of how system reliability may be impacted by changes in individual application components (or other system components).

4) Continuous Feedback

One of the key benefits of an observability system is to be able to provide quick and continuous feedback to the development/test/release/SRE teams on the risk associated with changes and provide helpful insights on how to address them. This would allow development teams to proactively address these risks before the changes are deployed to production. For example, development teams can be alerted as soon as performing a commit (or a pull request) of the failure risks associated with the changes they have made. Testers can get feedback on the tests that are the most important to run. Similarly, SREs can get early planning insights into the level of error budgets they need to plan for the next release cycle.

Next up: Continuous Quality

Reliability, however, is just one dimension of application/system quality. It does not, for example, fully address how we maximize customer experience that is influenced by other factors such as value to users, ease of use, and more. In order to get true value from DevOps and Continuous Delivery initiatives, we need to establish practices for predictively attaining quality – we call this “Continuous Quality.” I will discuss this in my next blog.

Article Tags

Broadcom, continuous delivery

About Shamim Ahmed

Shamim Ahmed is Broadcom CTO for DevOps Solution Engineering

View all posts by Shamim Ahmed

Cookie	Duration	Description
cf_use_ob	past	Cloudflare sets this cookie to improve page load times and to disallow any security restrictions based on the visitor's IP address.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
JSESSIONID	session	The JSESSIONID cookie is used by New Relic to store a session identifier so that New Relic can monitor session counts for an application.
PHPSESSID	session	This cookie is native to PHP applications. The cookie is used to store and identify a users' unique session ID for the purpose of managing user session on the website. The cookie is a session cookies and is deleted when all the browser windows are closed.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__atuvc	1 year 1 month	AddThis sets this cookie to ensure that the updated count is seen when one shares a page and returns to it, before the share count cache is updated.
__atuvs	30 minutes	AddThis sets this cookie to ensure that the updated count is seen when one shares a page and returns to it, before the share count cache is updated.
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.

Cookie	Duration	Description
__gads	1 year 24 days	The __gads cookie, set by Google, is stored under DoubleClick domain and tracks the number of times users see an advert, measures the success of the campaign and calculates its revenue. This cookie can only be read from the domain they are set on and will not track any data while browsing through other sites.
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_S6PB8V57DG	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_846073_1	1 minute	Set by Google to distinguish users.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
_jsuid	1 year	This cookie contains random number which is generated when a visitor visits the website for the first time. This cookie is used to identify the new visitors to the website.
at-rand	never	AddThis sets this cookie to track page visits, sources of traffic and share counts.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
iutk	5 months 27 days	This cookie is used by Issuu analytic system to gather information regarding visitor activity on Issuu products.
uvc	1 year 1 month	Set by addthis.com to determine the usage of addthis.com service.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.
WMF-Last-Access	1 month 14 hours 26 minutes	This cookie is used to calculate unique devices accessing the website.

Cookie	Duration	Description
__Host-GAPS	2 years	This cookie allows the website to identify a user and provide enhanced functionality and personalisation.
_pxhd	session	Used by Zoominfo to enhance customer data.
IDE	1 year 24 days	Google DoubleClick IDE cookies are used to store information about how the user uses the website to present them with relevant ads and according to the user profile.
loc	1 year 1 month	AddThis sets this geolocation cookie to help understand the location of users who share the information.
mc	1 year 1 month	Quantserve sets the mc cookie to anonymously track user behaviour on the website.
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
__gpi	1 year 24 days	No description
__Secure-YEC	1 year 1 month	No description
_heatmaps_g2g_100754890	10 minutes	No description
_techvalidate_session	session	No description
cf_7166_id	20 years	No description
cf_7166_person_last_update	session	No description
f5avraaaaaaaaaaaaaaaa_session_	session	No description available.
GoogleAdServingTest	session	No description
Gyazo_cfwoker	7 years 2 months 17 days 7 hours	No description
incap_ses_451_2783402	session	No description
incap_ses_769_2783402	session	No description
loglevel	never	No description available.
m	2 years	No description available.
nlbi_2783402	session	No description
prism_252377639	1 month	No description
TS011605d9	session	No description
ustream-guest	session	No description available.
visid_incap_2783402	1 year	No description
xtc	1 year 1 month	No description

AI

AI and Software Development

Observability

Guide to Observability

CI/CD

A guide to CI/CD

Cloud Native

Cloud Native Content

Data

A Guide to Data

Test

Security Testing

Mobile

Mobile Testing

API

Sponsored by Parasoft

Performance

Load & Performance Testing

DevSecOps

A Guide to DevSecOps

Enterprise Security

A Guide to Security

Supply Chain Security

Supply Chain Security

Dev Manager

Dev Managers Content

Agile

A Guide To Agile

Value Stream

A Guide To Value Stream

Productivity

A Guide To Productivity

DevOps

DevOps Content

API

Gravitee.io

AI

AI and Software Development

Value Stream Management

A Guide To Value Stream

Optimize continuous delivery with continuous reliability

Article Tags

Subscribe to SDTimes

About Shamim Ahmed

Related Articles

How AI further empowers value stream management

The state of strategic portfolio management

Broadcom adds on-premises version of its enterprise agility platform Rally

ValueOps Insights provides unified view of analytics for software value planning and delivery