It shouldn’t be news to you to hear that software needs to be tested rigorously before being pushed to production. Over the years, countless testing methodologies have popped up, each promising to be the best one. From automated testing to continuous testing to test-driven development, there is no shortage of ways to test your software.
While there may be variations in these testing methods, they all still rely on some form of human intervention. Humans need to script the tests, which means they need to know what they’re testing for. This presents a challenge in complex environments when a number of factors could combine to produce an unintended result — one for which testers wouldn’t have thought to test.
This is where chaos engineering comes in, Michael Fisher, product manager at OpsRamp explained. Chaos engineering allows you to test for those “unknown unknowns,” he said.
Chaos Engineering: Finding Outages Before They Become Failures
The first 5 chaos experiments to run on Kubernetes
According to Shannon Weyrick, vice president of architecture at NS1, chaos engineering is “the practice of intentionally introducing failures in systems to proactively identify points of weakness. Weyrick explained that aside from identifying weaknesses in a system, chaos engineering allows teams to predict and proactively mitigate problems before they turn into problems that could impact the business.
Matthew Fornaciari, CTO and co-founder of Gremlin, added that “traditional methods of testing are much more about testing how the underlying sections of the code functions. Chaos engineering focuses on discovering and validating how the system functions as a whole, especially under duress.”
Chaos engineering is considered to be part of the testing phase, but Hitesh Patel, senior director of product management at F5, believes that the core of chaos engineering goes back to the development phase. It is all about “designing software and systems in an environment that is mimicking what is really happening in the real world,” he said. This means that as a developer is writing code, they’re thinking about how failures will be injected into it down the line and as a result, they’re building more resilient systems.
“Right now, chaos engineering is more about setting that expectation when you’re building the software or when you’re building the system that failures are going to happen and that you need to design for resiliency and bake that in at the beginning of a product or software life cycle rather than trying to add that on later,” said Patel.
The history of chaos engineering
The software development industry tends to latch onto practices and methodologies developed and successfully used at large tech companies. This happened with SRE, which originated at Google, and it’s also the case with chaos engineering.
The practice first originated at Netflix almost 10 years ago when they built a tool called Chaos Money that would randomly disable production instances. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice,” Netflix wrote in a blog post.
Since then, they have created an entire “Simian Army” of tools that they say keep their cloud “safe, secure, and highly available.” Examples of tools in this Simian Army include Conformity Monkey, which finds and removes instances that don’t adhere to best practices; Latency Monkey, which introduces artificial delays to see how services respond to service degradation; and Chaos Gorilla, which simulates an outage of an entire AWS availability zone.
“With the ever-growing Netflix Simian Army by our side, constantly testing our resilience to all sorts of failures, we feel much more confident about our ability to deal with the inevitable failures that we’ll encounter in production and to minimize or eliminate their impact to our subscribers,” Netflix said.
Since then, several companies have adopted chaos engineering as part of their testing process, and it has even spawned companies like Gremlin, which provides chaos-engineering-as-a-service.
Smaller companies can benefit
While chaos engineering originated at Netflix, a large company with a complex infrastructure and environment, Patel believes that in a lot of ways, smaller companies will find it easier to implement chaos engineering. Larger companies are going to have more complex compliance, auditing, and reporting requirements. “All of those things factor in when you’re trying to do what I would call a revolutionary change in how you operate things,” said Patel. Overall, there is less red tape to cut through at smaller and medium-sized companies.
“There’s fewer people involved and I think it’s easier for a two-person team to get into a room and say ‘right, this is the right thing for the business, this is the right thing for our customers, and we can get started faster’,” said Patel.
Weyrick doesn’t entirely agree with the idea that smaller means easier. Today, even small and medium-sized applications can be complex, increasing the surface area for those unpredictable weaknesses, he explained. He believes that microservice architectures in particular are inherently complex because they involve a number of disparate, interconnected parts and are often deployed in complex and widely distributed architectures.
Fornaciari recalled being on the availability team at Amazon in 2010 as they were doing a massive migration from a monolithic to a microservices architecture. The point of the move was to decouple systems and allow teams to own their respective functions and iterate independently, and in that sense, the migration was a success.
But the migration also led the team to learn the hard way that introducing the network as a dependency between teams introduced a new class of errors. “Days quickly turned into a never-ending deluge of fire fighting, as we attempted to triage the onslaught of new issues,” said Fornaciari. “It was then that we realized the only way we were ever going to get ahead of these novel failures was to invest heavily in proactive testing via Chaos Engineering.”
Fornaciari believes that as companies start to go through what Amazon went through ten years ago, chaos engineering will be “the salve that allows those companies to get ahead of these failures, as their systems change and evolve.”
According to Weyrick, if possible, teams should try to implement chaos engineering early on in an application’s life so that they can build confidence as they scale the application.
“The depth of the chaos experiments involved may start simple in smaller companies, and grow over time,” said Weyrick.
Patel also recommends starting small. He recommends starting with a non-critical application, one that isn’t going to get your company into the news or get you dragged up to your boss’ boss if things go awry. Once an application is selected, teams should apply chaos engineering to that application end to end.
He emphasized that the most important part of this process early on is “building the muscle,” which he said is all about the people, not the technology. “Technology is great, but at the end of the day, it’s people who are using these things and putting them together,” said Patel. “And what you need to do is build the muscle in the people that are doing this. Build that subject matter expertise and do that in a safe environment. Do that in a way that they can mess up a little bit. Because nothing works right the first time when you’re doing this stuff…People can build the muscle and learn how to do these things, learn the subject matter expertise, gain confidence, and then start applying that in a broader manner. And that’s where I think a tie in with leadership comes in.”
According to Patel, having support from the top of the business will be crucial in helping companies prioritize where to apply chaos engineering. “[They’re] not just giving you aircover, but also saying we’re going to apply this in a way that makes sense to our business and to our user experience and matches where we want to go from a strategic standpoint,” said Patel. “So you’re not just applying the technology in areas that no one is going to notice. You’re applying it where you can derive the biggest customer benefit.”
Fornaciari added: “As companies grow their applications and the supporting infrastructure, they’ll undoubtedly introduce more failure modes into their system. It’s unavoidable. That’s why we call chaos engineering a practice — it’s something that must continually grow and evolve with the underlying systems.”
Fisher also added that organizations will need to shift their mindsets from one of “avoiding risks at all costs” to “embracing risk to generate a greater outcome to their users.” This can be a massive cultural shift, especially for those larger, more risk-averse companies, or companies who haven’t already adopted some form of DevOps.
“The team needs to evolve from the legacy belief that production is a golden environment that should be touched as little as possible and handled with kid gloves, lest outages occur,” said Weyrick. “Chaos engineering adopts a very different mindset: that in today’s world, this legacy belief actually creates fragile systems that fail at the first unexpected and unavoidable real world problem. Instead, we can build systems that consistently prove to us that they can survive unexpected problems, and rest easier in that confidence.”
The idea of purposefully trying to break things can be especially difficult for more traditional IT managers who are used to the idea of gatekeeping changes to the production environment,” explained Kendra Little, DevOps advocate at Redgate Software. “Your inclination is, well we have to find a way to be able to test this before it gets to production,” she said. “So it’s kind of this reactionary viewpoint of as soon as I find something, I need to be able to write a test to be able to make sure that never happens again… I mean I used to very much have that perspective as an IT person, and then at a certain point, I and the higher ups in my organization as well began to realize, we can’t just be reactionary anymore. Failure is inevitable. Our system is complex enough and we need to be able to change it rapidly. We can’t just gate keep things out of there. We have to be able to change the system quickly. And there are just so many moving parts in the system and so many external factors that can impact us.”
Best practices for chaos engineering
According to Shannon Weyrick, vice president of architecture at NS1, there are three main best practices that should be followed when using chaos engineering.
- Get buy-in to the chaos mindset across the team: Purposefully injecting failures into a system will require a shift in mindset. He recommends teams investigate the practice, understand the ramifications, and introduce it in small ways for legacy projects and directly for new projects. “Ensure your team knows how to run successful experiments, and minimize the blast radius to reduce or remove potential impact to customers when failures occur,” said Weyrick.
- Make the experiments real: The goal of chaos engineering is to increase reliability by exploring unpredictables through experiments. To get the most out of chaos engineering, teams should conduct their experiment using the most realistic data and environments possible. He also noted that it’s important to conduct experiments on the production system because it will always contain unique and hard-to-reproduce variables.
- Be sure people are part of your system: It’s important to remember that infrastructure and software are not the only parts of a system. “Before conducting chaos experiments, remember that the operators who maintain the system should be considered a part of that system, and therefore be a part of the experiments,” said Weyrick.
Do Chaos engineering on your databases too
Kendra Little, DevOps advocate at Redgate Software, brought up the point that chaos engineering is not just for software applications. It is a practice that can be applied to databases too.
Little believes that the approach to testing databases with chaos engineering remains the same as the approach one would take when testing a regular software application. A big difference, however, is that people tend to be more scared of it when it’s a database instead of an application.
“When we think about testing in production with databases it’s very terrifying because if something happens to your data, your whole company is at risk,” she said. But with chaos engineering, what you’re really doing is doing controlled testing. She explained that with this process you’re not just dropping tables or releasing things that could put your company out of business.
It’s also important to note that we’ve reached a point in database and infrastructure complexity where it’s not possible to replicate your production environment accurately, Little explained. “If we don’t have a way to learn about how to manage our databases and to learn how our code behaves in databases and production, then in many cases we’re not gonna have anywhere we can learn it. So it is, I think, just as relevant in databases.”