Chaos engineering is not a new concept, but it has become more important as systems have become more complex. With applications and deployments broken down into smaller pieces, and networks sitting between all of them, there are more ways than ever for things to go wrong.
The conversation around chaos engineering has now evolved from what and why to how. To reflect that shift, Kolton Andrus and his team at Gremlin are again hosting their Chaos Conf, to be held this year Sept. 25-27 at the Regency Ballroom in San Francisco. SD Times is a media sponsor of the event.
Andrus is one of the early practitioners of chaos engineering, which evolved from his work at Amazon, where he and his team were responsible for making sure the retail giant’s website was up and running, and performing well. “Chaos engineering at its heart is really thoughtful, planned experiments to understand how the system behaves when things go wrong,” Andrus explained.
He used the analogy of vaccination, in which a little bit of harm is injected into a system to build immunity and make it stronger. “And it’s not that by injecting that we will instantly become immune and there will never be problems again,” he said. “What we’re doing is training that system to respond to that foreign agent, or that harm, so that when it reoccurs, it can more quickly be quenched and the system can go back to steady state.”
At Amazon, there was no tolerance for outages, he said. “When Amazon.com goes down, it’s TechCrunch, it’s Twitter, it’s the world’s on fire, it’s all hell breaks loose. We did everything we could find to harden our systems. It’s not one approach. We did every single thing that we could, and that wasn’t enough. Just doing everything reactive, having better incident management, getting on calls faster, even training those teams, that didn’t get us far enough. That got us to 3, 3 ½ 9’s, and we were still having hours of outages a year. We had to find a better way.”
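To put the "hours of outages a year" figure in context, the arithmetic behind the nines can be sketched in a few lines of Python. This is only an illustration: it assumes "3 ½ 9's" means 99.95% availability (a common reading, not something Andrus specified) and a 365-day year, and the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a service can be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Assumption: "three and a half nines" is read here as 99.95%.
levels = [
    ("three nines", 99.9),
    ("three and a half nines", 99.95),
    ("four nines", 99.99),
]

for label, percent in levels:
    hours = downtime_hours_per_year(percent)
    print(f"{label} ({percent}%): about {hours:.2f} hours of downtime per year")
```

Under those assumptions, three nines allows roughly 8.76 hours of downtime a year and three and a half nines roughly 4.38 hours, which is consistent with the outage hours Andrus describes.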
Things were still slipping through, and the Amazon team was still being caught by surprise by certain failures. One part of correcting that is failure testing, he noted. Another is doing that testing with all the teams that have a stake in the system. An outage can affect many different systems, and if you’ve only tested the top one, or the top three, the organization can’t be prepared. So, Andrus said, the need to get the whole company behind failure testing and investing in it is the driving factor behind Gremlin.
Teams that are new to chaos engineering typically want to set it up manually, using graphs, dashboards and logs to build confidence not just in the tooling or the process, but in the systems as well. Once they have that confidence, they can look to automate certain things. “OK, now we understand how our autoscaling rules work, so we’re going to add some automation to verify that on a regular basis, perhaps as part of a deployment. Now we understand that our hosts are stateless, and that we can lose a host at any time, so now we’ll add some automation to verify that.”
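As a rough illustration of what automating the "we can lose a host at any time" check might look like, here is a minimal Python sketch. It is a toy simulation, not Gremlin's API or any real tool: `HostPool`, `success_rate` and `run_host_loss_experiment` are hypothetical names, and the in-memory pool stands in for a real fleet behind a load balancer. The shape is the one the article describes: confirm steady state, inject one failure, check the system again, then undo the harm.

```python
import random


class HostPool:
    """Toy stand-in for a fleet of stateless hosts behind a load balancer."""

    def __init__(self, names):
        self.healthy = set(names)

    def terminate(self, name):
        self.healthy.discard(name)

    def restore(self, name):
        self.healthy.add(name)

    def handle_request(self):
        # Hosts hold no state, so any healthy host can serve any request.
        return len(self.healthy) > 0


def success_rate(pool, requests=100):
    """Fraction of simulated requests that the pool serves successfully."""
    served = sum(1 for _ in range(requests) if pool.handle_request())
    return served / requests


def run_host_loss_experiment(pool, threshold=0.99, rng=random):
    """Remove one host and check that the success rate stays above threshold."""
    baseline = success_rate(pool)
    if baseline < threshold:
        # Do not inject failure into a system that is already unhealthy.
        return {"status": "aborted", "baseline": baseline}

    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    try:
        during = success_rate(pool)
    finally:
        pool.restore(victim)  # always undo the injected failure

    status = "passed" if during >= threshold else "failed"
    return {"status": status, "victim": victim,
            "baseline": baseline, "during": during}


if __name__ == "__main__":
    print(run_host_loss_experiment(HostPool(["host-a", "host-b", "host-c"])))
```

A check like this could run on a schedule or as a deployment step, failing the pipeline when the result is anything other than "passed". In a real system the simulated pool would be replaced by actual fault injection and by measurements from production monitoring.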
One of Andrus’ “soap-box topics” in chaos engineering is using some form of automation as a social forcing function as well. “People think, ah, we’re going to randomly break things and see what happens. That’s not really what the intent is. The intent is to understand the system and to harden it. But for teams that don’t have time, that aren’t investing the effort, one approach is, we are going to do this to you, just like will happen in production. If you don’t prepare, you’ll feel the pain. I don’t consider that automation as much as social influence.”
While monitoring and alerting are effective tools for finding issues and incidents, those tools are themselves part of the system: they have to be configured, and they too can go wrong. Andrus said Gremlin customers are finding value in being able to validate that their alerting and monitoring are working as they should. He recalled “the number of outages I’ve been on that were 20 minutes longer than they needed to be because somebody didn’t get paged when they should have. The number of times we started diving into, you know, the West region, because somebody had their dashboard pinned wrong, and really the problem was in the East region.”
“What I’ve discovered is that chaos engineering is like runtime validation for your system,” he added. “We’ve done a lot of build-time testing, but really, we have to operate that system, and thread pools, timeouts, autoscaling groups, the alerting, the monitoring, the people, and whether the people know what to do when they get paged, those are all important pieces of making a system run smoothly. In today’s world, they’re underserved. That’s how I’ll state it. And people need to invest the time in making sure all those pieces work well, if they expect to have a system that runs smoothly.”
The upcoming Chaos Conf will feature only speakers who are practitioners, who have been paged in the middle of the night or who have worked to keep systems reliable. “One of my frustrations in the industry as a whole is that sometimes, people have a very academic approach to things like this, and they’re not on call, and they haven’t actually been paged, but they have strong opinions about how to operate systems in that world. So we care about having practitioners and the focus has been, let’s tell the story. What has worked for you? How have you been able to have success? What’s interesting is, some of that is the technology side, and some of it is different tooling or projects, and how do we leverage that. But a big part of it is, once we’ve overcome the hurdle of solving the technology and figuring out how we’re going to run the experiments, it’s how do you get everyone to do it. And if you have 100 teams, how do you incentivize those teams to do that work? And I’ll tell you, that’s one of the questions. It’s still out there. Getting machines to do what we want? It’s hard but it’s solvable. Getting people to do what you want? Much more difficult.”