How Google runs production systems

For most business websites and applications, reliability is handled a bit by the developers, a bit by QA, and a bit by operations. At Google, however, reliability is a way of life, at least for the company's site reliability engineers (SREs). SREs are a unique bunch, and Google released a good deal