Business today is software. Every company, whether it realizes it or not, is a software company. For online retailers, banks, investment firms, news organizations, insurance companies—just about every business, in fact—applications are their outward face to their clients.
But applications today are complex. No longer monolithic in nature, apps today involve API calls to various web services outside the protection of your own servers and are built in parts and placed in containers to scale. A company’s entire web traffic might pass through an Akamai server to speed up performance.
To deal with the speed at which today’s world works, the notion of DevOps was born: break down the silos between development and operations, and have the teams work together to provision servers, test applications, and deploy and monitor them for quick fixes based on user feedback.
Yet when organizations look to hire, say, a DevOps engineer, they don’t know what they’re getting. It’s a vague term. Are you a developer with little experience monitoring apps in production? Are you a sysadmin who scripts a bit but wouldn’t do a deep dive into code? Enter the Site Reliability Engineer.
As Dritan Suljoti, cofounder and chief product officer at Catchpoint Systems, explained it to me, think of DevOps as the broader term, like “finance.” The SRE roles, then, could be “banker,” or “accountant,” or “trader.” These are roles that are clearly defined and understood. “When you say CIO or CFO, it’s clear what they do,” he said. “Just like that, the SRE role is very well defined.”
The SRE role grew out of Google some 10 years ago, long before DevOps was even a gleam in the eye of IT. At first, DevOps was a role, but it has morphed as the industry embraced the buzzword as a way to capture the need for developers and operations teams to work more closely together to gain efficiencies and keep applications up and running longer. “It turned from a role to a culture movement,” Suljoti said. “It has started to lose what is tangible. Comparing DevOps and SRE is like having a belief versus a tangible, rules-driven thing. It’s like people can read the same chapter of the Bible and get different meanings from it.”
So, why has DevOps captured the industry’s imagination more so than SRE? “DevOps is now a symbol for change,” Suljoti said. “SRE is more tactical. LinkedIn, Microsoft, Google…all have SREs. It has grown a lot.”
And what is the SRE role? “It’s an engineer who develops features and fixes bugs, who can bring reliability back to the system. The overall concept is the same: Ensure the system is reliable.”
Ah, but how do you define “reliable”? How much downtime is acceptable? Is any? What’s the acceptable response time for a page load? These are decisions organizations must make for themselves, Suljoti said, but the role itself remains the same: Under whatever standards the organization sets, the SRE must make sure his or her applications perform to those standards.
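To make the downtime question concrete, here is a minimal Python sketch of the arithmetic behind an availability target. The function names, the 30-day period, and the 99.9 percent figure are assumptions chosen for illustration; they do not come from Suljoti or from any particular organization's standards.

```python
# Hypothetical helpers for illustration only: turn an availability
# target into a downtime allowance, then check observed downtime against it.

def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted over the period at the given target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def meets_target(observed_downtime_minutes: float,
                 availability_pct: float,
                 period_days: int = 30) -> bool:
    """True if observed downtime stays within the allowance for the target."""
    return observed_downtime_minutes <= allowed_downtime_minutes(
        availability_pct, period_days
    )


if __name__ == "__main__":
    # An assumed 99.9 percent target over 30 days allows roughly 43 minutes.
    print(round(allowed_downtime_minutes(99.9), 1))
    print(meets_target(30, 99.9))
```

Under these assumed numbers, a half-hour outage in a month stays inside the allowance and an hour-long one does not. The same comparison could be applied to page-load times or any other standard the organization chooses.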
“There is a correlation between reliability and an organization’s bottom line,” Suljoti said. “The site reliability people are protecting the brand and the revenue.”
Site reliability. Isn’t that the job of everyone in IT, from the developers who write code to the sysadmins who need to keep the application up and running well? Aren’t all developers site reliability engineers?
Then I went looking around. Job listings seeking software or site reliability engineers abound on the Internet, but look very much like advertisements for software developers, with the emphasis on reliability. The difference, it appears to me, is that these reliability engineers don’t so much write code as they do troubleshoot problems in deployed software.
In a software life-cycle scenario, your developers create the applications that are deployed, and the reliability engineers take over once the application is live. When problems are found, the reliability engineers jump in to solve them, leaving the developers to continue to innovate and create new business value.
When used in this way, SREs seem most valuable. They become a critical cog in the DevOps wheel, keeping things running smoothly while the rest of the team is freed up to continue to drive the business forward.