With the release of Status Checks, Gremlin wants to make it safer to experiment in production. The new capability automatically verifies that systems are healthy and ready for chaos engineering before an attack runs.

“More and more, companies want to do Chaos Engineering. And not only do it, but automate it. But they are concerned if they have attacks triggering automatically, it may perform a chaos attack at a bad time (say when a system is already experiencing an outage!). This is a huge concern,” Matt Schillerstrom, product manager at Gremlin, told SD Times in an email. “This is a huge safety improvement, in that it drastically mitigates the chances you break your own systems and impact customers while doing chaos engineering.”

Previously, companies tried to address safety concerns by running experiments in staging environments, then applying those findings to production. However, Gremlin explained that this approach is limited and doesn’t accurately mirror what can happen in production. “Without status checks, it’s very difficult to automate chaos engineering in production. Because then you are unleashing chaos without knowing if the infrastructure is ready — or you have to check manually if it’s ready,” Schillerstrom wrote.

With Status Checks, chaos engineering can be built right into CI/CD pipelines. The feature integrates with third-party tools including PagerDuty, Datadog and New Relic. If a monitoring tool reports an active incident, Status Checks will prevent the chaos attack from running, according to the company.
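The gating behavior the company describes can be pictured with a short sketch. This is not Gremlin's API or implementation; the names `CheckResult` and `run_gated_attack` are hypothetical, and each check is assumed to be a plain function that asks a monitoring tool whether an incident is active.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CheckResult:
    source: str    # hypothetical label for the monitoring tool that was queried
    healthy: bool  # False means the tool reports an active incident


def run_gated_attack(
    checks: Iterable[Callable[[], CheckResult]],
    attack: Callable[[], None],
) -> bool:
    """Run `attack` only if every check reports healthy.

    Returns True if the attack ran, False if a check blocked it.
    """
    for check in checks:
        result = check()
        if not result.healthy:
            # Stop at the first unhealthy signal; the attack never starts.
            print(f"Skipping attack: {result.source} reports an active incident")
            return False
    attack()
    return True
```

In a pipeline step, the checks would wrap calls to whichever monitoring tools a team already uses, and a `False` return would mark the chaos stage as skipped rather than failed.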

“It’s very important to note that Gremlin doesn’t advocate for ‘chaos’ — the term chaos engineering can be a little misleading. We advocate for hypothesis-driven testing, in order to tame chaos. To better understand our systems in order to prevent chaos. It does no one any good to be attacking infrastructure that’s already under stress,” wrote Schillerstrom.