I recently moderated a webinar in which Sam Guckenheimer, a Microsoft product owner and group product planner, discussed the company’s development journey to DevOps. It was a “lessons learned” presentation that any company looking to benefit from DevOps can use.
To sum it up: The first decade of doing agile development uncovered three key practices. First, keep your development team autonomous, yet aligned with the business in terms of goals and quality. Second, manage technical debt, as debt interferes with quality and poses a potential risk. Third, monitor the flow of customer value; that is, listen to the customer to minimize reworking code, and make sure the stakeholder is satisfied and has input, from idea to production.
These are largely viewed today as essential to agile development. But Guckenheimer said the Microsoft developer division found four more “essentials” to moving from agile development into DevOps.
He started with the backlog, saying it is merely a set of hypotheses and beliefs rather than a form of requirements. The hypotheses need to be turned into experiments, which provide the data that can either substantiate or refute the hypotheses. “Learning leads to the next set of hypotheses. It’s not a random walk; it’s not a set of guesswork; it’s a set of rigorous tests of beliefs using data to match against the beliefs,” he said.
Guckenheimer shared this story: “We had a signup experience for Visual Studio Team Services. We asked for your identity, we asked you to set up a project, and then we had all these other tiles that informed you about what else you could do. And we were not at all happy with the conversion rate through this screen. So we set up an experiment where we diverted some of the traffic to a new experience. We had a Web version and we had an experience where you sign up through the Visual Studio IDE. In the IDE, we had an instant 7x bump on the new experience for signups from the IDE. On the Web, we had about a 50% bump. But we discovered by the comparison to the IDE that…we were pumping a bunch more traffic to the Web, essentially routing a large volume of unqualified leads in. We wanted to turn off that extra traffic, and when we did, we got to a 3x improvement on the Web. And of course we went to the new experience based on those results.”
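The comparison in that story comes down to computing a conversion rate per channel for the old and new experience and taking the ratio. Here is a minimal sketch of that arithmetic in Python. The `Arm` class, the `lift` function, and every count below are hypothetical, invented so the ratios echo the 7x and 50% figures in the quote; they are not Microsoft’s data or tooling.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One side of an experiment: how many people saw it and how many signed up."""
    visitors: int
    signups: int

    @property
    def rate(self) -> float:
        # Conversion rate; defined as 0.0 when nobody saw this arm.
        return self.signups / self.visitors if self.visitors else 0.0


def lift(control: Arm, variant: Arm) -> float:
    """Ratio of variant to control conversion rate (2.0 means a 2x bump)."""
    if control.rate == 0:
        raise ValueError("control has no conversions; lift is undefined")
    return variant.rate / control.rate


# Hypothetical counts per channel: (old experience, new experience).
results = {
    "ide": (Arm(visitors=1000, signups=20), Arm(visitors=1000, signups=140)),
    "web": (Arm(visitors=10000, signups=100), Arm(visitors=10000, signups=150)),
}

# Comparing lift across channels is what exposes a mismatch like the one
# described above, where one channel was diluted by unqualified traffic.
lifts = {channel: lift(old, new) for channel, (old, new) in results.items()}

for channel, value in lifts.items():
    print(f"{channel}: {value:.1f}x")
```

Segmenting by channel matters here: a single blended number would have hidden the gap between the IDE and the Web that prompted the team to look at traffic quality.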
Microsoft could do that, he said, because it gathers evidence in production. That’s the next essential habit. He said the Microsoft team gathered business-related data (conversions), but also gathered data for troubleshooting, as well as performance metrics, live-site issues that affect end users, and the number of times needed to communicate and mitigate issues. Precise alerting is key to fast issue detection on live sites, he noted, but it’s something people struggle with. You need a production-first mindset.
“We had these Tier 1 people in our service delivery team who were triaging lots of alerts,” said Guckenheimer. “They were looking at 40 or 50 alerts and trying to find which one is the real source of the problem. Based on that, they would contact the responsible individual, the developer on the feature crew who was responsible. He needs to be on the bridge in five minutes if it’s work hours, and 15 minutes if it’s outside of work hours.
“We replaced that with a health model to find…the root cause via machinery. We set up an auto-dialer that would contact the developer responsible or the DRI from the feature crew on call that week. The result is we had a 40x improvement in the month we rolled it out, and reduced escalations by 50%, all by shifting from manual triaging of alerts to putting in place an automation. This comes from a mindset where you say I’m going to make everyone’s life better, even though it’s a you-build-it, you-run-it world.”
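Guckenheimer did not describe how the health model works internally, so the following is only one simple way to picture the idea, sketched in Python. It assumes a known dependency map between components and treats an alerting component as a likely root cause when none of its direct dependencies are also alerting. The component names, the on-call roster, and the `page_dris` function are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-frontend": ["auth", "build-service"],
    "build-service": ["storage"],
    "auth": ["storage"],
    "storage": [],
}

# Hypothetical roster: the DRI on call this week for each component.
ON_CALL: Dict[str, str] = {
    "web-frontend": "dev-a",
    "build-service": "dev-b",
    "auth": "dev-c",
    "storage": "dev-d",
}


def root_causes(alerting: Set[str]) -> Set[str]:
    """Alerting components whose direct dependencies are all quiet.

    This is a simplification: it only looks one level down the map.
    """
    return {
        component
        for component in alerting
        if not any(dep in alerting for dep in DEPENDS_ON.get(component, []))
    }


def page_dris(alerting: Set[str]) -> List[str]:
    """Stand-in for an auto-dialer: returns who would be called, and why."""
    return sorted(
        f"dial {ON_CALL.get(c, 'unassigned')} about {c}" for c in root_causes(alerting)
    )


# Four alerts fire at once, but only one component has no alerting dependency.
print(page_dris({"web-frontend", "auth", "build-service", "storage"}))
```

The point of the sketch is the shift he described: the machinery narrows many alerts to a likely source and contacts the responsible developer directly, instead of a person reading through every alert by hand.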
The seventh habit is managing infrastructure as a flexible resource. “We needed to shift the backlog from doing feature work first to doing live site work first. We needed a canary instance,” said Guckenheimer, describing an instance where the team works with its closest customers before toggling the automation that lets a release move forward to the next rings of deployment.
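A ring-by-ring rollout of that kind can be sketched as a loop that deploys to the canary first and only moves on while each ring stays healthy. This Python sketch is an illustration under assumptions, not Microsoft’s pipeline: the ring names, the `roll_out` function, and the `is_healthy` callback are hypothetical, and the deployment step itself is reduced to recording the ring name.

```python
from typing import Callable, List

# Hypothetical ring names; the canary is the instance shared with the closest customers.
RINGS: List[str] = ["canary", "ring-1", "ring-2", "ring-3"]


def roll_out(
    build: str,
    rings: List[str],
    is_healthy: Callable[[str, str], bool],
) -> List[str]:
    """Deploy ring by ring and stop at the first ring whose health check fails.

    Returns the rings that received the build, including the failing one.
    """
    deployed: List[str] = []
    for ring in rings:
        deployed.append(ring)  # stand-in for the real deployment step
        if not is_healthy(build, ring):
            break  # hold the rollout; later rings never receive this build
    return deployed


# A build that misbehaves in ring-1 never reaches ring-2 or ring-3.
print(roll_out("build-42", RINGS, lambda build, ring: ring != "ring-1"))
```

The health check is passed in as a function so the gate can be driven by whatever production evidence the team already collects.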
Microsoft differentiates the idea of deployment from exposure with Feature Flags. “All code is deployed, but features not ready for wide usage will have the flag down on them so no users see it, or only select people can see it,” he said. “We have runtime control over how it gets exposed, so we can do experimentation. Customers can lift the Feature Flags on their accounts so they can participate in that experimentation. We can work with dark launches in production and build up experimentation and refinement, but not hide off in a branch and build up debt that we then have to remediate.”
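The split between deployment and exposure can be illustrated with a small runtime check. The Python sketch below assumes a flag that is off by default, can be raised for individual accounts that opt in, and can later be raised for everyone. The `FeatureFlag` class, the `render_signup` function, and the account names are hypothetical and do not reflect the actual Visual Studio Team Services implementation.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """Runtime switch for code that is already deployed."""
    name: str
    enabled_for_all: bool = False  # flag is down by default
    opted_in_accounts: Set[str] = field(default_factory=set)

    def is_on(self, account: str) -> bool:
        # Exposed if the flag is up globally or this account lifted it.
        return self.enabled_for_all or account in self.opted_in_accounts


def render_signup(flag: FeatureFlag, account: str) -> str:
    """Both code paths ship in the same build; the flag picks one at runtime."""
    if flag.is_on(account):
        return "new signup experience"
    return "current signup experience"


new_signup = FeatureFlag("new-signup")
new_signup.opted_in_accounts.add("contoso")  # this account opts in to the experiment

print(render_signup(new_signup, "contoso"))
print(render_signup(new_signup, "fabrikam"))
```

Because the decision is made per request, exposure can be widened or pulled back without a new deployment, which is what makes dark launches and experiments in production practical.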