For most business websites and applications, reliability is handled partly by the developers, partly by QA, and partly by operations. At Google, however, reliability is a way of life, at least for the company’s site reliability engineers (SREs).
SREs are a unique bunch, and Google released a good deal of new information and interviews with them this past week. The company also gave us access to Todd Underwood, an SRE director at Google, so we could discuss some of the more holistic ideals presented in the new book “Site Reliability Engineering: How Google Runs Production Systems.”
Underwood said that SREs see reliability as a first-level priority, but they also have the goal of automating enough stability work to keep the number of SREs at Google low.
Underwood had some advice for teams looking to enact some of the Google processes in their organizations: “The initial step is taking seriously that reliability and manageability are important. People I talk to are spending a lot of time thinking about features and velocity, but they don’t spend time thinking about reliability as a feature.”
He said that one successful practice at Google is to allow applications to fail gracefully. While Gmail is quite pretty now, if things go wrong on the Gmail servers, the old HTML interface is available as an emergency backup for users. While this might degrade usability, it keeps the reliability and stability of Gmail firm.
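The fallback idea Underwood describes can be illustrated with a minimal sketch. This is not Google's implementation: the function names `render_with_fallback`, `rich_inbox`, and `basic_inbox` are hypothetical, and the outage is simulated. It only shows the general pattern of trying the full-featured path first and serving a simpler one if that fails.

```python
from typing import Callable


def render_with_fallback(render_rich: Callable[[], str],
                         render_basic: Callable[[], str]) -> str:
    """Try the full-featured view; fall back to the basic one on failure."""
    try:
        return render_rich()
    except Exception:
        # Degraded but still available: serve the simpler page instead of an error.
        return render_basic()


def rich_inbox() -> str:
    # Hypothetical rich interface; the failure here is simulated for illustration.
    raise RuntimeError("rich interface backend unavailable")


def basic_inbox() -> str:
    # Hypothetical plain-HTML interface with fewer dependencies.
    return "<html><body>Inbox (basic HTML)</body></html>"


page = render_with_fallback(rich_inbox, basic_inbox)
print(page)
```

In a real service the fallback would typically also be logged and counted, so that operators can see how often users are getting the degraded view.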
“Availability is a feature, and the most important feature,” said Underwood. “If you’re not available, you don’t have users to evaluate your other characteristics and features. Your organization needs to choose to prioritize reliability. That can even mean they choose someone to make things more manageable and more reliable, and empowering them to get those things done.”
Underwood said that many of the things Google does internally aren’t applicable to other organizations, but there are a few tips and practices that can be generally applied. “For distributed applications, we recommend running some kind of Paxos-consistent system,” he said, referring to the distributed consensus algorithm. “We have a whole [paper] on distributed consensus. It seems like a computer science nerdy thing, but really if you want to have distributed processes and know which ones are where, it’s not possible without Paxos in place.”
“Your software has to have unit tests. You have to have a system for doing integration and load tests. If you’re not doing unit tests and load tests, it’s unlikely you’ll be successful,” said Underwood. He also advocated canarying: “Canarying we take super seriously. You pick a strategy by which you can run a live copy of your new application for new traffic. If you have 10 Web servers, you can run one with the new Websites and see if you have higher error rates, higher clickthroughs, or a lower or higher rate of successful transactions.”
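The 10-server example can be sketched as a simple comparison of error rates between the servers on the old version and the one canary. Everything here is an assumption made for illustration: the `ServerStats` structure, the `canary_is_healthy` function, the sample counts, and the thresholds are hypothetical and are not values Underwood or Google recommend.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Request counters for one group of servers (hypothetical structure)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(baseline: ServerStats, canary: ServerStats,
                      max_ratio: float = 1.5, min_requests: int = 100) -> bool:
    """Return True if the canary's error rate is acceptable relative to baseline.

    max_ratio and min_requests are illustrative thresholds only.
    """
    if canary.requests < min_requests:
        raise ValueError("not enough canary traffic to judge")
    # A small absolute floor keeps a zero-error baseline from failing every canary.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return canary.error_rate <= allowed


# Nine servers on the old version, one on the new (made-up numbers).
baseline = ServerStats(requests=9000, errors=45)
good_canary = ServerStats(requests=1000, errors=6)
bad_canary = ServerStats(requests=1000, errors=30)

print(canary_is_healthy(baseline, good_canary))
print(canary_is_healthy(baseline, bad_canary))
```

The same comparison could be applied to the other signals mentioned in the quote, such as clickthroughs or the rate of successful transactions, with a separate threshold for each.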
Finally, Underwood said, “It comes down to monitoring. You need a couple different kinds of scales of monitoring. You need a long scale and you need short-scale monitoring down to the hours, to minutes to seconds.”
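One way to see why both scales matter is to track the same error rate over a short and a long window. The sketch below is a toy under stated assumptions: the `WindowedErrorRate` class, the 60-second and one-hour windows, and the synthetic traffic are all hypothetical, and a production system would use a dedicated monitoring stack instead of in-memory lists.

```python
from collections import deque


class WindowedErrorRate:
    """Error rate over the most recent window_seconds of recorded requests."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, timestamp: float, is_error: bool) -> None:
        self.events.append((timestamp, is_error))

    def rate(self, now: float) -> float:
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if not self.events:
            return 0.0
        return sum(1 for _, is_error in self.events if is_error) / len(self.events)


short_window = WindowedErrorRate(window_seconds=60)
long_window = WindowedErrorRate(window_seconds=3600)

# Synthetic traffic: one request per second for an hour,
# with every request failing in the final 30 seconds.
for t in range(3600):
    failed = t >= 3570
    short_window.record(t, failed)
    long_window.record(t, failed)

now = 3599
short_rate = short_window.rate(now)
long_rate = long_window.rate(now)
print(f"short window: {short_rate:.3f}, long window: {long_rate:.4f}")
```

With this synthetic data the short window reacts sharply to the sudden failure, while the hour-long window barely moves. The long window is better suited to spotting slow trends that a short window would miss.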