Chaos engineering has gained traction over the last few years as it has spread from its origins at Netflix to companies across the industry. Many development teams use it to prevent downtime: they break their systems on purpose to expose weaknesses, then fix those weaknesses before they cause problems down the line.
Given the resilient nature of serverless computing, which rests on the cloud providers’ uptime and availability agreements, it might seem that chaos engineering is one testing method that wouldn’t be practical in serverless. But Emrah Samdan, vice president of product for Thundra, believes that serverless computing and chaos engineering actually go well together.
Because the cloud vendor guarantees availability and scalability, when doing chaos engineering in serverless environments, the goal is not necessarily to bring down the system, but to find application-level failures, such as those caused by lack of memory or time. “The purpose of chaos experiments is not to take the whole software down but to learn from failures by injecting small, controllable failures,” Samdan said.
Some of the most common examples of chaos engineering in serverless that Samdan sees are injecting latency into serverless functions to check that timeouts work properly, and injecting failures into third-party connections.
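As a rough illustration of those two experiments, the sketch below wraps a function handler with injected latency and a third-party call with injected failures. It is a minimal example under stated assumptions, not Thundra's tooling or any vendor's API: the names `inject_latency`, `inject_failure`, `InjectedFailure`, `fetch_exchange_rate` and `handler` are all hypothetical, and the handler simply follows the common `(event, context)` convention for function handlers.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Hypothetical error raised in place of a real third-party failure."""


def inject_latency(delay_seconds, probability=1.0, sleep=time.sleep, rng=random.random):
    """Delay the wrapped function by delay_seconds on a fraction of calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def inject_failure(probability, rng=random.random):
    """Raise InjectedFailure instead of calling the function on a fraction of calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(probability=0.1)
def fetch_exchange_rate(currency):
    # Stand-in for a real third-party call (assumption for illustration).
    return {"currency": currency, "rate": 1.0}


@inject_latency(delay_seconds=0.2, probability=0.5)
def handler(event, context):
    try:
        rate = fetch_exchange_rate(event["currency"])
    except InjectedFailure:
        # The behavior under test: degrade gracefully instead of crashing.
        return {"status": "degraded", "rate": None}
    return {"status": "ok", "rate": rate["rate"]}
```

Setting the injected delay just above the function's configured timeout is one way to check that the timeout actually fires and that callers cope with it. The `sleep` and `rng` parameters exist only so the injection can be made deterministic in tests.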
Samdan noted that defining the steady state is an important first step in chaos engineering, but one that is often overlooked. “People just want to break things, but the first step is actually to understand how they actually work, what are the ups and downs of the system, what are the limits, how resilient is your system already,” he said.
He believes that determining this baseline is even more important in serverless environments. This is because what is considered normal for serverless can be very different from what is considered normal in other systems. For example, in serverless, both latency and the number of executions are very important, which isn’t as true in other systems.
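One way to make such a baseline concrete is to summarize recent invocations into a handful of numbers and compare an experiment run against them. The sketch below is an assumption-laden example rather than a method Samdan described: the `Invocation` and `SteadyState` types, the `summarize` and `within_baseline` helpers, and the tolerance values are all hypothetical, and real data would come from whatever observability tooling a team already uses.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Invocation:
    duration_ms: float
    error: bool


@dataclass
class SteadyState:
    invocations: int
    p50_ms: float
    p95_ms: float
    error_rate: float


def summarize(records):
    """Reduce a window of invocations (at least two) to steady-state numbers."""
    durations = [r.duration_ms for r in records]
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(durations, n=100, method="inclusive")
    errors = sum(1 for r in records if r.error)
    return SteadyState(
        invocations=len(records),
        p50_ms=cuts[49],
        p95_ms=cuts[94],
        error_rate=errors / len(records),
    )


def within_baseline(baseline, observed, latency_tolerance=1.2,
                    invocation_tolerance=1.5, max_error_rate_increase=0.01):
    """Return True if the observed window stays close to the baseline.

    The tolerance defaults are illustrative assumptions, not recommendations.
    """
    return (
        observed.p95_ms <= baseline.p95_ms * latency_tolerance
        and observed.invocations <= baseline.invocations * invocation_tolerance
        and observed.error_rate <= baseline.error_rate + max_error_rate_increase
    )
```

Tracking the invocation count alongside latency reflects the point above that both matter in serverless: a retry storm, for example, can leave latency looking normal while executions climb.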
Because of this, it is important that an engineering team have proper observability in place. “Chaos engineering experiments are all about asking questions to understand what actually happened during the experiment. You cannot achieve this by keeping an eye on metric charts, as they are designed to answer known questions. In order to ask questions about the unknowns of the distributed system, you need to have all three pillars of observability — logs, metrics, and traces — together and integrated. I see the adoption of correct observability still continues and we see more and more companies using modern tools for this purpose. I frankly believe that we’ll see more and more companies stepping into chaos engineering as modern observability becomes more widespread,” Samdan said.
For those looking to get started with chaos experiments in serverless environments, Samdan recommends starting small and starting in the staging environment. Rather than throttling all serverless functions, he advises throttling or injecting latency into one or two downstream services. “It’s not only about testing failures on your system, it’s also about testing how your team will react to these failures. So starting small is actually very encouraging to persevere for more comprehensive experiments,” Samdan said.
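The "start small, start in staging" advice can be encoded as guardrails on the experiment itself. The sketch below is one possible shape, with every name an assumption made for illustration: `LatencyExperiment`, the `payments-api` and `search-api` service names, and the `"staging"` stage label are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyExperiment:
    targets: frozenset          # downstream services to slow down
    delay_seconds: float
    allowed_stages: frozenset = frozenset({"staging"})
    max_targets: int = 2        # keep the blast radius to one or two services

    def __post_init__(self):
        if len(self.targets) > self.max_targets:
            raise ValueError(
                f"experiment targets {len(self.targets)} services; "
                f"limit is {self.max_targets}"
            )

    def delay_for(self, service, stage):
        """Return the delay to inject for a call to service in the given stage."""
        if stage not in self.allowed_stages:
            return 0.0  # never inject outside the allowed environments
        if service not in self.targets:
            return 0.0  # untargeted services are left alone
        return self.delay_seconds


experiment = LatencyExperiment(targets=frozenset({"payments-api"}), delay_seconds=0.5)
staging_delay = experiment.delay_for("payments-api", "staging")
production_delay = experiment.delay_for("payments-api", "production")
```

Rejecting an over-broad target list at construction time means a misconfigured experiment fails before it runs, rather than partway through a game day.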
As with adopting any new methodology, changing culture is the biggest challenge. Chaos engineering needs to be initiated and sponsored by higher-level people in the company, Samdan believes. “Teams should be able to work in harmony by planning, running and evaluating the game days. We should always keep in mind that chaos experiments are not for criticizing colleagues for the weaknesses in their modules. It’s more about fixing those weaknesses before customers get impacted and letting those colleagues grow as a result of the experiments,” said Samdan.
Samdan also advised developers to remember that chaos engineering isn’t a silver bullet for finding each and every failure. It works best when used to complement other testing methodologies like unit tests and integration tests. “However, chaos engineering taps into a very different point than other tests. It tests the resiliency of other parts of your system when one part is having some problems due to latency or any type of failures. Considering the distributed systems serverless paradigm implies, running chaos experiments becomes a no-brainer to reveal the hidden traps before customers reveal them on production,” he said.