To build resilient systems, embrace the chaos

Published: April 6th, 2020

It shouldn’t be news to you to hear that software needs to be tested rigorously before being pushed to production. Over the years, countless testing methodologies have popped up, each promising to be the best one. From automated testing to continuous testing to test-driven development, there is no shortage of ways to test your software.

While there may be variations in these testing methods, they all still rely on some form of human intervention. Humans need to script the tests, which means they need to know what they’re testing for. This presents a challenge in complex environments when a number of factors could combine to produce an unintended result — one for which testers wouldn’t have thought to test.

This is where chaos engineering comes in, Michael Fisher, product manager at OpsRamp, explained. Chaos engineering allows you to test for those “unknown unknowns,” he said.

According to Shannon Weyrick, vice president of architecture at NS1, chaos engineering is “the practice of intentionally introducing failures in systems to proactively identify points of weakness.” Weyrick explained that aside from identifying weaknesses in a system, chaos engineering allows teams to predict and proactively mitigate problems before they grow into issues that could impact the business.

Matthew Fornaciari, CTO and co-founder of Gremlin, added that “traditional methods of testing are much more about testing how the underlying sections of the code functions. Chaos engineering focuses on discovering and validating how the system functions as a whole, especially under duress.”

Chaos engineering is considered to be part of the testing phase, but Hitesh Patel, senior director of product management at F5, believes that the core of chaos engineering goes back to the development phase. It is all about “designing software and systems in an environment that is mimicking what is really happening in the real world,” he said. This means that as a developer is writing code, they’re thinking about how failures will be injected into it down the line and as a result, they’re building more resilient systems.

“Right now, chaos engineering is more about setting that expectation when you’re building the software or when you’re building the system that failures are going to happen and that you need to design for resiliency and bake that in at the beginning of a product or software life cycle rather than trying to add that on later,” said Patel.

The history of chaos engineering
The software development industry tends to latch onto practices and methodologies developed and successfully used at large tech companies. This happened with SRE, which originated at Google, and it’s also the case with chaos engineering.

The practice originated at Netflix almost 10 years ago when the company built a tool called Chaos Monkey that would randomly disable production instances. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice,” Netflix wrote in a blog post.

Since then, they have created an entire “Simian Army” of tools that they say keep their cloud “safe, secure, and highly available.” Examples of tools in this Simian Army include Conformity Monkey, which finds and removes instances that don’t adhere to best practices; Latency Monkey, which introduces artificial delays to see how services respond to service degradation; and Chaos Gorilla, which simulates an outage of an entire AWS availability zone.

“With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we’ll encounter in production and to minimize or eliminate their impact to our subscribers,” Netflix said.
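The two fault types described above, disabling a random instance and adding artificial delay, can be illustrated with a small in-memory sketch. This is not Netflix's code and does not touch any real cloud API; the `Instance` and `Pool` classes and both helper functions are hypothetical names invented for this illustration, and the pool is a stand-in for a real fleet.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A stand-in for a production instance (hypothetical, in-memory only)."""
    name: str
    healthy: bool = True


@dataclass
class Pool:
    """A toy service pool that routes requests to the first healthy instance."""
    instances: list = field(default_factory=list)

    def healthy_instances(self):
        return [i for i in self.instances if i.healthy]

    def handle_request(self):
        candidates = self.healthy_instances()
        if not candidates:
            raise RuntimeError("no healthy instances left")
        return f"served by {candidates[0].name}"


def disable_random_instance(pool, rng):
    """Chaos Monkey-style fault: mark one randomly chosen healthy instance as down."""
    candidates = pool.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def with_injected_latency(func, delay_seconds):
    """Latency Monkey-style fault: delay a call before letting it run."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapper


pool = Pool([Instance("a"), Instance("b"), Instance("c")])
rng = random.Random(42)  # seeded so the experiment is repeatable
victim = disable_random_instance(pool, rng)
slow_handle = with_injected_latency(pool.handle_request, 0.05)
response = slow_handle()  # the pool should still answer, only more slowly
```

The point of the sketch is the shape of the experiment: inject a fault while the system is being watched, then check that requests are still served by the surviving instances.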

Since then, several companies have adopted chaos engineering as part of their testing process, and it has even spawned companies like Gremlin, which provides chaos-engineering-as-a-service.

Smaller companies can benefit
While chaos engineering originated at Netflix, a large company with a complex infrastructure and environment, Patel believes that in a lot of ways, smaller companies will find it easier to implement chaos engineering. Larger companies are going to have more complex compliance, auditing, and reporting requirements. “All of those things factor in when you’re trying to do what I would call a revolutionary change in how you operate things,” said Patel. Overall, there is less red tape to cut through at small and medium-sized companies.

“There’s fewer people involved and I think it’s easier for a two-person team to get into a room and say ‘right, this is the right thing for the business, this is the right thing for our customers, and we can get started faster’,” said Patel.

Weyrick doesn’t entirely agree with the idea that smaller means easier. Today, even small and medium-sized applications can be complex, increasing the surface area for those unpredictable weaknesses, he explained. He believes that microservice architectures in particular are inherently complex because they involve a number of disparate, interconnected parts and are often deployed in complex and widely distributed architectures.

Fornaciari recalled being on the availability team at Amazon in 2010 as they were doing a massive migration from a monolithic to a microservices architecture. The point of the move was to decouple systems and allow teams to own their respective functions and iterate independently, and in that sense, the migration was a success.

But the migration also led the team to learn the hard way that introducing the network as a dependency between teams introduced a new class of errors. “Days quickly turned into a never-ending deluge of fire fighting, as we attempted to triage the onslaught of new issues,” said Fornaciari. “It was then that we realized the only way we were ever going to get ahead of these novel failures was to invest heavily in proactive testing via Chaos Engineering.”

Fornaciari believes that as companies start to go through what Amazon went through ten years ago, chaos engineering will be “the salve that allows those companies to get ahead of these failures, as their systems change and evolve.”

According to Weyrick, if possible, teams should try to implement chaos engineering early on in an application’s life so that they can build confidence as they scale the application.

“The depth of the chaos experiments involved may start simple in smaller companies, and grow over time,” said Weyrick.

Patel also recommends starting small. He recommends starting with a non-critical application, one that isn’t going to get your company into the news or get you dragged up to your boss’ boss if things go awry. Once an application is selected, teams should apply chaos engineering to that application end to end.

He emphasized that the most important part of this process early on is “building the muscle,” which he said is all about the people, not the technology. “Technology is great, but at the end of the day, it’s people who are using these things and putting them together,” said Patel. “And what you need to do is build the muscle in the people that are doing this. Build that subject matter expertise and do that in a safe environment. Do that in a way that they can mess up a little bit. Because nothing works right the first time when you’re doing this stuff…People can build the muscle and learn how to do these things, learn the subject matter expertise, gain confidence, and then start applying that in a broader manner. And that’s where I think a tie in with leadership comes in.”

According to Patel, having support from the top of the business will be crucial in helping companies prioritize where to apply chaos engineering. “[They’re] not just giving you aircover, but also saying we’re going to apply this in a way that makes sense to our business and to our user experience and matches where we want to go from a strategic standpoint,” said Patel. “So you’re not just applying the technology in areas that no one is going to notice. You’re applying it where you can derive the biggest customer benefit.”

Fornaciari added: “As companies grow their applications and the supporting infrastructure, they’ll undoubtedly introduce more failure modes into their system. It’s unavoidable. That’s why we call chaos engineering a practice — it’s something that must continually grow and evolve with the underlying systems.”

Embracing risk
Fisher also added that organizations will need to shift their mindsets from one of “avoiding risks at all costs” to “embracing risk to generate a greater outcome to their users.” This can be a massive cultural shift, especially for those larger, more risk-averse companies, or companies who haven’t already adopted some form of DevOps.

“The team needs to evolve from the legacy belief that production is a golden environment that should be touched as little as possible and handled with kid gloves, lest outages occur,” said Weyrick. “Chaos engineering adopts a very different mindset: that in today’s world, this legacy belief actually creates fragile systems that fail at the first unexpected and unavoidable real world problem. Instead, we can build systems that consistently prove to us that they can survive unexpected problems, and rest easier in that confidence.”

The idea of purposefully trying to break things can be especially difficult for more traditional IT managers who are used to the idea of gatekeeping changes to the production environment, explained Kendra Little, DevOps advocate at Redgate Software. “Your inclination is, well we have to find a way to be able to test this before it gets to production,” she said. “So it’s kind of this reactionary viewpoint of as soon as I find something, I need to be able to write a test to be able to make sure that never happens again… I mean I used to very much have that perspective as an IT person, and then at a certain point, I and the higher ups in my organization as well began to realize, we can’t just be reactionary anymore. Failure is inevitable. Our system is complex enough and we need to be able to change it rapidly. We can’t just gate keep things out of there. We have to be able to change the system quickly. And there are just so many moving parts in the system and so many external factors that can impact us.”

Best practices for chaos engineering
According to Shannon Weyrick, vice president of architecture at NS1, there are three main best practices that should be followed when using chaos engineering.

Get buy-in to the chaos mindset across the team: Purposefully injecting failures into a system will require a shift in mindset. He recommends teams investigate the practice, understand the ramifications, and introduce it in small ways for legacy projects and directly for new projects. “Ensure your team knows how to run successful experiments, and minimize the blast radius to reduce or remove potential impact to customers when failures occur,” said Weyrick.
Make the experiments real: The goal of chaos engineering is to increase reliability by exploring unpredictable behavior through experiments. To get the most out of chaos engineering, teams should conduct their experiments using the most realistic data and environments possible. He also noted that it’s important to conduct experiments on the production system because it will always contain unique and hard-to-reproduce variables.
Be sure people are part of your system: It’s important to remember that infrastructure and software are not the only parts of a system. “Before conducting chaos experiments, remember that the operators who maintain the system should be considered a part of that system, and therefore be a part of the experiments,” said Weyrick.
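As a rough illustration of what “minimize the blast radius” can mean in practice, the sketch below injects a fault into only a small fraction of hosts, checks a steady-state condition after each injection, and always rolls the fault back. Everything here is an assumption made for illustration: the function names, the 20% limit, and the in-memory `down` set standing in for real infrastructure are not taken from Weyrick or from any particular tool.

```python
import random


def pick_targets(hosts, blast_radius, rng):
    """Select a fraction `blast_radius` (between 0 and 1) of hosts, at least one."""
    count = max(1, int(len(hosts) * blast_radius))
    return rng.sample(hosts, count)


def run_experiment(hosts, blast_radius, inject, rollback, steady_state_ok, rng):
    """Inject a fault into a small subset of hosts; stop early if steady state breaks."""
    targets = pick_targets(hosts, blast_radius, rng)
    injected = []
    try:
        for host in targets:
            inject(host)
            injected.append(host)
            if not steady_state_ok():
                return {"targets": targets, "hypothesis_held": False}
        return {"targets": targets, "hypothesis_held": True}
    finally:
        # Always undo the fault, whether or not the hypothesis held.
        for host in injected:
            rollback(host)


# Hypothetical usage: ten hosts, and a "fault" that only records a host as down.
hosts = [f"host-{n}" for n in range(10)]
down = set()
result = run_experiment(
    hosts,
    blast_radius=0.2,
    inject=down.add,
    rollback=down.discard,
    steady_state_ok=lambda: len(down) / len(hosts) <= 0.2,
    rng=random.Random(7),
)
```

In a real experiment the steady-state check would be a customer-facing signal such as error rate or latency, and the operators watching it would be part of the experiment as well.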

Do chaos engineering on your databases too
Kendra Little, DevOps advocate at Redgate Software, brought up the point that chaos engineering is not just for software applications. It is a practice that can be applied to databases too.

Little believes that the approach to testing databases with chaos engineering remains the same as the approach one would take when testing a regular software application. A big difference, however, is that people tend to be more scared of it when it’s a database instead of an application.

“When we think about testing in production with databases it’s very terrifying because if something happens to your data, your whole company is at risk,” she said. But chaos engineering is really a form of controlled testing. She explained that with this process you’re not just dropping tables or releasing things that could put your company out of business.

It’s also important to note that we’ve reached a point in database and infrastructure complexity where it’s not possible to replicate your production environment accurately, Little explained. “If we don’t have a way to learn about how to manage our databases and to learn how our code behaves in databases and production, then in many cases we’re not gonna have anywhere we can learn it. So it is, I think, just as relevant in databases.”

Article Tags

chaos engineering, testing

About Jenna Barron

Jenna Barron is News Editor of SD Times.


