Chaos engineering is not just for single servers anymore. These days, Netflix kicks entire regions of its servers offline, just to align its developers’ priorities around resilience. Casey Rosenthal, engineering manager for the traffic team and the chaos team at Netflix, explained the philosophies behind the company’s development and testing practices in the second keynote at the STARWEST testing conference.
One of the biggest problems Netflix faces is the fact that its network and application architecture now includes hundreds of services and more than 80 development teams. Rosenthal said Netflix does not have a chief architect because it would be impossible for a single human to understand all of the microservices and how they interact.
As a result, Rosenthal and his teams are always looking for ways to accelerate the development of new features. “At Netflix, we’ve tried to optimize for feature velocity as well. When we make architectural decisions, we consider, ‘Is this technology decision going to enable us to increase or decrease the velocity at which we can add new features?’ Having loosely coupled teams, and allowing them to make smaller changes to smaller surfaces, allows us to innovate quicker,” he said.
Rosenthal said that complexity is increasing for developers. “Putting software into production today is more complicated than it was a few years ago,” he said.
“The rate of that increasing complexity is increasing, so much so that what a single person could accomplish in terms of running and deploying is less practical today. That goes across the spectrum, from deployment considerations, to the database, to distributed systems. Services at scale are getting more complicated to build, which is an opportunity for us as test engineers.”
Chaos Monkey evolves
Netflix’s testing practices, however, are rather different from the traditional world of manual testing with pre-delineated results. Within the Netflix tangle of microservices, testers don’t have a single result they can hope to find that will mean the system works properly.
Instead, Netflix relies on chaos engineering, and Rosenthal described the reasoning behind its creation. The practice began about five years ago, when the company introduced the open-source tool Chaos Monkey, which works its way through a distributed network of applications and pseudo-randomly turns servers off.
“We took the pain of servers disappearing and we brought that forward and automated it. We make sure they frequently disappear during business hours,” instead of randomly at 3 a.m., said Rosenthal. “This has been great. We don’t have a chief architect or a vice president of engineering who can dictate to all our engineers, ‘You shall all build your services to be resilient.’ Chaos Monkey created great alignment for all those engineering teams to figure out how to make their services resilient without us having to tell them how to do it. They just figured it out.”
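The idea Rosenthal describes, pseudo-randomly terminating a server but only during business hours, can be sketched in a few lines of Python. This is a minimal illustration, not Chaos Monkey's actual code: the function names, the 9-to-5 weekday window and the instance names are all assumptions made for the example.

```python
import random
from datetime import datetime

# Hypothetical names throughout; this is not Chaos Monkey's actual code.
BUSINESS_HOURS = range(9, 17)  # 09:00-16:59 local time, an assumed window


def in_business_hours(now: datetime) -> bool:
    """Return True on weekdays inside the assumed business-hours window."""
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS


def pick_victim(instances, now, rng):
    """Pseudo-randomly choose one instance to terminate.

    Returns None when there is nothing to terminate or when engineers
    would not be at their desks to respond.
    """
    if not instances or not in_business_hours(now):
        return None
    # Sort first so a seeded generator gives repeatable choices.
    return rng.choice(sorted(instances))


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that runs are repeatable
    fleet = {"api-1", "api-2", "api-3"}  # made-up instance names
    victim = pick_victim(fleet, datetime(2016, 10, 5, 11, 0), rng)
    print("terminate:", victim)
```

Restricting the selection to working hours is the point of the design: the failure still arrives unannounced, but it arrives when the owning team is available to notice and fix the weakness it exposes.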
That push toward chaos engineering intensified after Christmas in 2012, when Netflix had a major outage. “We didn’t have active/active [replication] setup, so we couldn’t move those users to the West Coast,” said Rosenthal. Because of this crisis, he said, Netflix “built Chaos Kong to save us from regional outages. It turns off entire regions. That provided the alignment for all our services to build themselves in such a way that when one goes down, we’re able to route traffic with no interruption for our subscribers.”
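The routing behavior Rosenthal describes, shifting users to the surviving regions when one goes dark, might look something like the following Python sketch. It is an illustration under stated assumptions rather than Netflix's traffic system: the function name, the region names and the proportional redistribution rule are all hypothetical.

```python
def route_traffic(weights, healthy):
    """Redistribute traffic shares across the regions that are still healthy.

    `weights` maps region name to its normal share of traffic; `healthy`
    is the set of regions currently up. Surviving regions absorb the lost
    traffic in proportion to their normal share (an assumed policy).
    """
    live = {region: w for region, w in weights.items() if region in healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region left to serve traffic")
    return {region: w / total for region, w in live.items()}


if __name__ == "__main__":
    # Made-up regions and shares, for illustration only.
    normal = {"us-east": 50, "us-west": 30, "eu-west": 20}
    # Simulate a regional outage by dropping one region from the healthy set.
    print(route_traffic(normal, healthy={"us-west", "eu-west"}))
```

A failover like this only works in an active/active setup, where every region can already serve any user, which is the capability Rosenthal said was missing in 2012.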
Though these practices and applications may sound dangerous, the result has been that Netflix’s teams are aligned on the requirements for their microservices, even without a manager pushing down demands from above. Thus far, there has been only one hiccup with Chaos Monkey in production.
“We do not have outages caused by servers disappearing anymore,” said Rosenthal. “In fact, the only blip we had from Chaos Monkey is when it terminated itself and caused some interesting logging issues.”
He went on to describe the principles of chaos engineering, which include tips and practices that can ensure chaos engineers are doing good with their inherently destructive work.
Another way the Netflix teams keep track of the enormous number of microservices and their interactions is through the open-source application Vizceral, which presents a high-level visual representation of what’s going on within Netflix’s Amazon deployments.
Vizceral ignores the content of requests coming to the server, and instead just shows all the traffic as dots going between circular nodes in almost real time. Developers, testers and administrators can easily see what a normal state looks like in this visualization, and when errors start increasing, they can proxy traffic to other servers and relieve the load.
Rosenthal advocated for the use of such abstracted visualizations to give stakeholders better views of what’s going on inside complex systems. These visualizations deliberately discard large amounts of data, which would otherwise overload the user, and represent only the basic routing information and the overall flow of data.
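The kind of abstraction described here, ignoring request content and keeping only who talks to whom and how often it fails, can be sketched as a small aggregation step in Python. This is not Vizceral's data format or code: the record fields (`source`, `target`, `status`, `body`) and the rule that a status of 500 or above counts as an error are assumptions made for the example.

```python
from collections import defaultdict


def summarize(requests):
    """Collapse individual request records into per-edge traffic summaries.

    Everything except source, target and status is ignored, so payloads
    never reach the visualization layer.
    """
    edges = defaultdict(lambda: {"total": 0, "errors": 0})
    for req in requests:
        edge = edges[(req["source"], req["target"])]
        edge["total"] += 1
        if req["status"] >= 500:  # assumed definition of an error
            edge["errors"] += 1
    # Add an error rate to each edge so unusual states stand out.
    return {
        key: dict(counts, error_rate=counts["errors"] / counts["total"])
        for key, counts in edges.items()
    }


if __name__ == "__main__":
    # Made-up service names and records, for illustration only.
    sample = [
        {"source": "edge", "target": "api", "status": 200, "body": "..."},
        {"source": "edge", "target": "api", "status": 503, "body": "..."},
        {"source": "api", "target": "ratings", "status": 200, "body": "..."},
    ]
    for (src, dst), stats in summarize(sample).items():
        print(src, "->", dst, stats)
```

Each edge in the result corresponds to one line of dots between two nodes; a rising error rate on an edge is the signal that would prompt operators to move traffic elsewhere.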
For this same reason, he advocated for automation in testing and chaos engineering, as the needs of a major microservices environment cannot be met by hand.