Interest in site reliability engineering (SRE) has exploded over the last 18 months, much of it from companies that want to scale their DevOps or DevSecOps initiatives to address the reliability concerns of their customers.

Vendors are recognizing this, and many global systems integrators (GSIs) and managed service providers (MSPs) now offer some form of SRE-as-a-service, according to Brent Ellis, senior analyst at Forrester.

The role emerged at Google in 2003 to build reliable, high-quality services while reducing costs, and it has evolved since then, according to Narayanan Raghavan, senior director of site reliability engineering at Red Hat.

“I think the core SRE function, in many ways, becomes a foundation and then you build on top of it. So as the teams that focus on SRE capabilities start to mature, you get into ‘how do I get into robust CI/CD practices?’” Raghavan said. “How do I build capabilities for my development teams to onboard quickly and easily because it then makes my life easier as an SRE, it makes the developers’ lives easier because they don’t have to worry about things like observability, logging, metrics, alerting. They don’t need to think about disaster recovery, incident management, or incident rehearsals.”

For SRE to work in an organization, other teams also need to be receptive to the input that SREs offer, and that receptiveness differs based on the maturity of the organization. The level of engagement can be divided into three buckets, according to Raghavan.

The first is that toil for SREs should become tech debt for development almost immediately, so as to avoid a separate prioritization process.

The second is that when developers start to architect a completely new component, they need to pull in the SREs and engage with them up front, according to Raghavan, so that the SREs can participate and think about how to scale that component. In mature organizations, this becomes an important bucket in which developers engage of their own volition instead of being told that they have to.

The third bucket is that as the SRE practice matures and creates the building blocks that matter to all teams (observability, logging, metrics, and alerting), it also engages development teams up front.

“That becomes important because it’s the development teams that are then adopting those self-service capabilities that SREs are putting out,” Raghavan said.

SREs can also lead blameless post-mortems, in which they work to get to the bottom of what caused a problem. They won’t blame any person, but will look at the processes or the technology that allowed the problem to occur, according to Daniel Betts, senior director analyst at Gartner.

“If you want to get full value from your SRE, try not to use them as a developer resource,” Betts said. “They should be more of like a reliability focused engineer who’s looking at the overall picture of what’s going on across the product or service that you have.” 

SREs often come in at the beginning of the product life cycle and help the product team or the platform engineering teams build a product that is reliable and robust and that meets customers’ needs, he added. From there, they can perform tasks across the whole development life cycle.

“They can be involved throughout the life cycle to the point where the actual product is highly automated and incredibly reliable. It’s now running that product quite maturely and it has very effective automation, monitoring, and observability in place,” Betts said. “The SRE may actually just be keeping an eye on or looking after that product from a standpoint of the dashboards or monitoring tools or observability tools to see if it’s doing what we expect it to do. It doesn’t need that much attention anymore. They can now focus on other solutions to help with the automation and improvement of those.”

Unleash the SRE from within

With potential hiring freezes and budget cuts looming, organizations often look for future SREs already within their company.

“The perfect SRE is a myth. That perfect SRE would get bored a month, two months down the road, they’d say ‘been there, done that, give me something else, give me something new, I want to learn something different.’ So I am generally looking for people with potential,” Red Hat’s Raghavan said. “And when I say potential, these are people that are, in some cases, traditional software engineers.” 

These software engineers would already have a systems mindset, which lets them think about systems at scale and approach problems that way. Systems engineers who understand software engineering principles are another good pool of potential SREs.

“So I am from a hiring practice perspective looking for people that fall in that bucket specifically, because then I know that I can invest in them. And as I invest in them, and as they learn the space, they invest back into the company and back in the team,” Raghavan said. “So I am not looking for a perfect fit. I’m in fact, looking for people who are, in many ways eager to learn, can understand technology and understand how to pick up different spaces quickly.”

It’s also important to assign new SREs to a production process early on and to have a mentor guide them.

Gartner’s Betts sees some organizations that want to start an SRE practice simply rebrand an existing IT operations team, or a person in that role, which is the wrong approach.

“An SRE is giving value not just by focusing on things like incident problems, operational improvements, monitoring, and being able to have better insights,” Betts said. “It’s also looking at how we can take some of that software engineering or engineering mindsets to the world of infrastructure operations and look at how we can have reusable modules, efficient infrastructure delivery, efficient response to incidents, and being able to scale capacity.” 

In their day-to-day work, SREs are often embedded in a product team, such as a development team, where they act as reliability consultants. They inform the team of the organization’s expectations around reliability, help identify toil, and look to automate some of those practices as part of that team’s backlog, according to Betts.

“In the early maturity stages, having a completely decentralized model makes a lot of sense, because you’re a lot more nimble and agile. But as the product matures, having a more central function to think about reliability at scale becomes important,” Red Hat’s Raghavan continued.

SRE…the social butterfly?

One skill set that often goes overlooked for this role is soft skills, which should instead be called ‘critical skills’, according to Gartner’s Betts.

SREs need to be great communicators because part of the job is to communicate effectively about the data they see, including service level objectives (SLOs), budgets, and other measures. They also need to show that they can empathize with customers and talk about the specific things that affect customers’ experience. SREs are often the ones interacting with customers, partners, development teams, product managers, and more.

“So if you’re talking to maybe a product owner or a strategy person, you take it to a higher level, you’re talking to someone that’s in the team, as an engineer or a developer, you need to get maybe down into the depths and talk a little bit more detail with them,” Betts said. 

Red Hat’s Raghavan added that these soft skills are even more important for an SRE than the technical skills. This is because technical skills are trainable, but it’s often much harder to find people with both soft skills and technical skills. 

“That mindset and the ability to articulate that is absolutely vital for a reliability engineering function, because then we start to look at if something really matters to the customer, you should probably be looking at the specific causes that matter and therefore the symptoms that show up to the customer and what it is that we need to get alerted on,” Raghavan said. 
