Admit it. You think of operations and that dated stereotype of a bearded guy in a dusty data center still pops up. And even though “development” isn’t that much of an improvement over its predecessor, “programming,” it beats the utilitarian sound of “operations,” right? Perhaps you have even begun to believe the hype: The cloud will eliminate most of your deployment concerns.
“I can’t tell you how ripped I get when people say things like this: ‘Cloud computing means getting rid of ops,’ ” said John Allspaw back in 2009. “If by ‘ops’ you mean people in data centers racking servers, installing OSes, running cables, replacing broken hardware, etc., then sure, cloud computing aims to relieve you of those burdens.”
Now senior vice president of technical operations at Etsy, Allspaw is the author of “The Art of Capacity Planning: Scaling Web Resources,” a title that belies his gift for making the concepts in the book a zillion times more interesting than you might think. A seminal figure in the DevOps movement, he is perhaps best known for his 2009 O’Reilly Velocity conference presentation describing continuous delivery (10 deploys a day) at Flickr.
From a developer-centric standpoint, DevOps is all about operations matching the breathtaking productivity of agile coders who have achieved symbiotic nirvana with the business and/or customers. Really, the only people slowing developers down are those obtuse IT admins. Worst of all, companies are trying to staff DevOps teams and hire DevOps engineers, titles that continuous delivery guru Jez Humble and others say miss the point.
“Why does the community hate the term ‘DevOps team’? The objection fundamentally is that DevOps teams could be another silo,” said Eric Minick, technical evangelist for UrbanCode. The company makes the AnthillPro build, deploy, test and release-automation framework, as well as the uDeploy application-deployment automation tools, so it’s fair to say UrbanCode sees DevOps through developer-colored glasses.
“Politically, especially where agile has gone well, development has established a pretty good relationship with the business. So ops has a bit of an existential threat,” Minick said.
Election 2012: Gameday prep
While this existential debate was raging on Twitter and in conferences around the world, a tweet on Nov. 12, 2012, by Scott VanDenPlas put everything in perspective:
“4Gb/s, 10k req/s, 2k nodes, 3 datacenters, 180TB and 8.5 billion req. Design, deploy, dismantle in 583 days to elect the President. #madops”
Agree with his title or not, VanDenPlas spent May 2011 through November 2012 making history as “DevOps Director” at Obama for America. Was President Obama’s reelection campaign technology operation as big an achievement as, say, Amazon’s massive, multi-year service-oriented architecture makeover? Technically, no, but in every other respect (visibility, time and money constraints, star-studded staffing, and the ability to affect the lives of 300 million Americans) it was a game-changer.
“Obama 2012 didn’t have the magic of hope and change. What it did have was a relentless focus on operational excellence and massive scale,” according to the Engage Research report “Inside the Cave: Obama’s Digital Campaign.” Reports in The Atlantic, Ars Technica, InformationWeek and more gushed over the work of the elite team of 50 Silicon Valley-and-beyond experts under the Obama campaign’s CTO Harper Reed. Head-to-head comparisons between Obama’s and Romney’s technology showed a clear victory: While candidate Mitt Romney’s outsourced effort failed or degraded on Election Day due to poorly planned PDF mailings, Obama for America had undergone ruthless testing on Oct. 21 and ran smoothly through Election Day.
“ ‘Game Days’ were disaster-preparedness exercises where DevOps simulated nightmare scenarios, such as a catastrophic database failure or Amazon’s East Coast data center going offline,” said Reed in published reports. “It’s not enough to have it in a manual. The lesson of DevOps is that you actually have to practice and practice disaster recovery scenarios until you have them down cold.
“We knew what to do. We had a runbook that said if this happens, you do this, this and this.”
And it wasn’t just disaster they were averting: Traffic surges were fast and furious. An hour of downtime could cost millions in lost donations, with the campaign’s donation peak being US$3 million collected in one hour.
It’s an overstatement to say DevOps was the key determinant of victory in the president’s tripartite strategy combining digital (content), technology (apps and infrastructure) and data analytics. But companies are increasingly discovering that VanDenPlas had a point when he tweeted, post election: “Operational efficiency is an enormous strategic advantage.”
Streaming past agile
“When I start seeing banks, enterprises, financial institutions—brand names my grandmother had heard of—saying they’re doing continuous delivery, that’s a big change,” said Steve Brodie, who recently left application life-cycle management vendor Serena to head up Electric Cloud, a maker of build and deployment tools. “It’s no longer just these websites; it’s an aspirational model for the enterprise, looking at how they can adopt this and move up the maturity curve.”
But the view from Electric Cloud and UrbanCode is still through dev goggles. What’s interesting is that as agile has become mainstream (not necessarily perfectly implemented everywhere, but well understood and documented), there’s not much new to say about it, especially when all the emphasis is on driving speed downstream or enabling continuous delivery.
Enter Christopher Brown, CTO of Opscode. He’s a man who casually deploys “gestalt” and “phenotype” in conversation; his excitement crackles over the phone as he describes his Seattle-based company’s impending second annual user conference. Facebook production engineer Phil Dibowitz will reveal how the world’s biggest social network uses Opscode Private Chef to manage Web-tier infrastructure configuration. Disney, General Electric, Nordstrom and Riot Games are among the more than 30 companies that will present on their use of Chef in IT operations, cloud architecture and application development.
When asked if DevOps is emerging solely as a response to rapid application development, Brown scoffed. “The notion that software engineering is pushing downwards? It’s the opposite: IT pushes upwards. Operations is driving big IT,” he said.
While some DevOps tool vendors have a history in development, he said, “We came at it from another direction: the notion that the folks in operations are becoming more and more important in business because of the rate of change in business. You see orders-of-magnitude growth, and suddenly operations is at the forefront of business.”
Infrastructure as code
Operations may be in the limelight, but is it a job for people, or just tools? When Patrick Debois coined the term DevOps around 2009, he was trying to elevate the status of multidisciplinary sys-admin coders who regularly release software to customers. Do these admin coders exist?
“When we go into potential deals, we try to identify a champion. That champion is almost always a coder on the infrastructure side of the house,” said Brown. “I know the phenotype exists because the Velocity conference was designed for this person.”
What’s more, Brown was eager to dispel the stereotype of the old-school, low-skill admin. With the redefinition of infrastructure as code that’s underway in tools like Chef and Puppet, “We tell these ops guys, ‘You’re writing code too. This notion that you’re not a software guy is wrong. You version and deploy in exactly the same fashion the developers do.’ ”
Brown pointed to luminaries such as Theo Schlossnagle, CEO of OmniTI. “Theo has a Ph.D. in computer science,” he said. “You ask him what he does and he says, ‘I’m a sys admin.’ He identifies with that pack.”
Another thinker is Mark Burgess, CTO and principal author of the open-source configuration management system CFEngine, which celebrates its 20th anniversary in 2013. As professor of network and system administration at Oslo University College until 2011, he focused on automation and policy-based management. In the 1990s, he described idempotent, autonomous desired state management (“convergence”), and in the 2000s he explored cooperative systems (“promise theory”). In his USENIX series of articles from 2006 to 2007, he explored “Configuration Management, Models and Myths” and, like Allspaw, shined a light on the nuance and skill that operations entail.
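To make “convergence” concrete, here is a minimal Python sketch of idempotent desired-state management. It is an illustration only, not how CFEngine, Chef or Puppet is actually implemented: the `converge` function, the setting names and the in-memory `host` dictionary are hypothetical stand-ins for a real machine’s configuration.

```python
def converge(actual: dict, desired: dict) -> list:
    """Move `actual` toward `desired`; return the changes that were needed."""
    changes = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            # Only touch settings that differ from the declared state.
            changes.append((key, have, want))
            actual[key] = want
    return changes


# Hypothetical policy and a host that has drifted from it.
desired_state = {"ntp.server": "pool.ntp.org", "sshd.permit_root_login": "no"}
host = {"ntp.server": "time.example.com"}

first_run = converge(host, desired_state)   # repairs two settings
second_run = converge(host, desired_state)  # nothing left to do
print(len(first_run), len(second_run))      # prints "2 0"
```

Running the same policy a second time changes nothing, and that is the property that lets such a tool run unattended on a schedule.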
“There’s lots of math, elements of distributed systems theory, scaling out concurrent Web operations, thread affinity and cache coherency on the processor… The skillset required to leverage all this is actually staggering,” said Brown—with humility to match any developer.
Who, me? DevOps?
Humble gets it, of course, and has collaborated with Allspaw, Patrick Debois and others in the DevOps space. But even he railed in an October 2012 blog post against labeling or hiring so-called DevOps teams, arguing that it would increase handoffs to functional silos. He moderated his advice, however, with the following caveat: “I lied when I said there’s no such thing as a DevOps team. For developers to take responsibility for the systems they create, they need support from operations to understand how to build reliable software that can be continuously deployed to an unreliable platform that scales horizontally… Somebody needs to support the developers in this, and if you want to call the people who do that the ‘DevOps team,’ then I’m OK with that.”
In Humble’s estimation, developers today should learn how to do packaging, deployment and post-deployment support. But the expectation, as he described, sounds a bit like the old joke about a lonely man’s classified ad: “Looking for a good woman who’s kind-hearted, cooks well and has a speedboat. Send picture of the boat.”
No longer a wallflower, operations is experiencing an explosion of possibility. “I think of it as an expansion in the job roles going from basic systems administration to really what we call infrastructure engineering,” said Jesse Robbins, founder of Opscode, in a May 2012 O’Reilly Radar interview. “The road ahead for everybody who builds and maintains infrastructure or applications in software looks like building a very powerful software platform.”
And the DevOps job titles? They are popping up on the job boards. PagerDuty is a San Francisco-based startup whose advertisement for a senior DevOps engineer reads in part, “You’ve pulled back the covers and know how this Internet thing works end to end. Networks, servers, protocols, operating systems, services, databases, query optimization, disks: to you nothing is a ‘black box.’ If necessary, you can debug performance problems through the whole stack.”
The ad continues: “You subscribe to the theory that operations engineering is about writing software to manage machines, and the thought of SSHing directly to a machine to adjust a conf file strikes you as… distasteful.” Among the critical concepts PagerDuty wants its DevOps engineers embracing are continuous integration servers, push-button deploys, time-series data stores, metrics dashboards and centralized logging.
“That’s a pretty bad smell, someone advertising for a DevOps position. If I were responding to that ad, I’d have some pointed questions about how they see that role,” said UrbanCode’s Minick, who also noted that skillsets like those PagerDuty is hiring for are difficult to find.
“Am I going to be the DevOps person on the team, or facilitating with others? Companies are struggling to hire those people. What I propose is, you’re not going to hire those people. You’re going to need to form a crack team of them internally, and then help the rest of your organization get on board.”
Respect for resource patterns
Part of getting on board should be sampling the air at a DevOps conference like Velocity, or perusing Burgess’ writings, where you’ll find fascinating rambles about Chomsky’s hierarchy and language theory, and Alvin Toffler’s adhocracy. In the aforementioned USENIX article series, Burgess lamented that we still think of IT management as the creation of “golden master servers that are to be worshipped by hundreds, perhaps thousands of clones. Adhocracy is not the default doctrine in computer administration.”
Allspaw waxed poetic about capacity planning, turning it into almost a Zen discipline: “Capacity planning is a term that to me means paying attention. Web applications can fail in all sorts of dramatic ways, and you’re not going to foresee all of them. What you can do, however, is make use of what you do know about what happens in your real world on a regular basis. Things like, ‘My database can do X queries per second before it keels over.’ Or ‘My cache can only keep Y minutes worth of changing objects,’ ” he said in an interview on the blog High Scalability.
Observing behavior meticulously leads to educated forecasts, according to Allspaw. System-level statistics for CPU, memory, network, storage and application-level metrics for users, posts, photos, videos, page views, sales and the like are the clues that feed operations’ omniscience.
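As a rough illustration of turning observations into a forecast, the Python sketch below fits a straight line to weekly peak query rates and estimates how long until the trend reaches a known ceiling. The numbers, the linear-growth assumption and the function names are all hypothetical; Allspaw’s book covers far more careful methods.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def weeks_until_ceiling(ys, ceiling):
    """Weeks after the last observation until the trend line hits `ceiling`."""
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None  # flat or shrinking load: no crossing to forecast
    crossing = (ceiling - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical data: the "X queries per second" ceiling is 1,000 here.
weekly_peak_qps = [400, 450, 500, 550, 600]
print(weeks_until_ceiling(weekly_peak_qps, 1000))  # prints 8.0
```

The ceiling itself has to come from measurement in production, which is Allspaw’s point about paying attention.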
According to Amazon CTO Werner Vogels in a 2006 ACM Queue article, “Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.”
And service is definitely the operative word: Visiting the Amazon.com gateway triggers calls to more than 100 services to collect data and dynamically construct the page. In a horizontally scaled, unreliable new world, there’s truth to this idea that the customer is no longer just the business stakeholder or the user story; it is in fact that cloudy beast comprising billions of connections and requests.
Some have wondered if the concurrency concerns of massively scaled applications, similar to those observed by C++ expert Herb Sutter eight years ago, could portend a resurgence in Erlang, the language developed by Ericsson for telecom applications.
“If you look at the back end of Chef—the activity of starting a connection, transferring a payload, receiving acknowledgement and shutting down the connection—it looks a lot like telephony,” said Opscode’s Brown. “It’s built with Erlang, which is a beautifully fault-tolerant concurrency architecture.”
Aiming for continuous maturity
Regardless of who’s driving the business, IT or development, DevOps makes sense as the ultimate extension of agile. But if continuous delivery is the aspiration for enterprises inspired by the Amazons, Facebooks, Flickrs and Etsies out there, it’s still a ways off. Helpfully, vendors have defined maturity models to motivate those contemplating push-button deployments.
“We developed a deployment-maturity model and identified four steps: Manual, script, automation, and continuous application delivery,” said Doron Gerstel, CEO of Noliosoft, which makes application release-automation software. “Most of the solutions that we know of can get you up to level-three automation, which is basically to automate the way you are doing things now on a single environment. Level four—continuous application delivery—is like a conveyor belt that connects all the stages in the release pipeline, modeling the application in such a way that it can be applied to any environment. It needs to be all the time in motion, like the car industry.”
Humble’s company, ThoughtWorks, offered its own model: focus on practices, not tools. A Forrester white paper that ThoughtWorks commissioned lays out five stages that dovetail with the Carnegie Mellon University Capability Maturity Model. Level-two automation involves an adaptive delivery process with relatively short iterations, some automated testing and scripted builds, and time-boxed releases.
At level three, teams follow one of Humble’s key practices: “trunk-based development with continuous integration of all changes.” At level four (quantitative management), a deployment pipeline is in place, rejecting bad changes, and “delivery teams prioritize keeping code trunk deployable over doing new work.” By level five (optimization), teams fine-tune cycle time in order to learn from customers, and “continuous deployment capability enables business innovation/experimentation.”
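What a pipeline that rejects bad changes looks like can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor’s product: the stage names, the fields on each change and the `run_pipeline` function are invented for illustration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[dict], bool]]


def run_pipeline(change: dict, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order; stop at the first failure and reject the change."""
    passed = []
    for name, check in stages:
        if not check(change):
            return False, passed  # rejected: later stages never run
        passed.append(name)
    return True, passed


# Hypothetical stages; a real pipeline would call build and test tools here.
stages = [
    ("build", lambda c: c["compiles"]),
    ("unit tests", lambda c: c["failed_tests"] == 0),
    ("deploy to staging", lambda c: c["staging_healthy"]),
]

good = {"compiles": True, "failed_tests": 0, "staging_healthy": True}
bad = {"compiles": True, "failed_tests": 3, "staging_healthy": True}

good_result = run_pipeline(good, stages)
bad_result = run_pipeline(bad, stages)
print(good_result)  # (True, ['build', 'unit tests', 'deploy to staging'])
print(bad_result)   # (False, ['build'])
```

Because a failing change stops at the first broken stage, whatever reaches the end of the line is deployable by construction.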
And there it is, full circle: the business learning from operations. Will it pan out for non-Web-facing enterprises? Do the lessons of Amazon and Obama for America apply to every enterprise? These are still early days, but the energy is right. Watch those ops guys: They’re moving fast.
Not quite in the cloud
Would DevOps exist without cloud computing? In a Forrester survey on continuous delivery commissioned by ThoughtWorks and released in March, 44% of IT leaders reported automating software deployment, and 41% provided developers with access to self-provisioned resources, like a public or private cloud. Only 17% of respondents deployed daily to cloud services, be they private or public infrastructure or platform providers; 24% did so weekly. Nearly 40% rarely or never deployed to the cloud.
“The bulk of our customers are deploying to their own private environment,” said Electric Cloud’s Brodie, who pointed out that more enterprise customers are deploying for testing. “In some percentage it’s a private cloud, and in a small percentage of customers it’s the public cloud. These tend to be websites, e-commerce or retailers, or ISVs that have SaaS-based applications.
“Many customers, when they move to DevOps, are trying to reduce time to market by automating the entire process from check-in through build and release. FamilySearch.org used to take three months to get updates out. They moved to continuous delivery, and now they can deploy changes from check-in to live in under 10 minutes.”
Electric Cloud contended that there’s a cost to being slow. Its return-on-investment calculator walks through some averages: Initial deployment failure may happen 10% to 20% of the time. There are typically four deployments a year. It may take four to eight person-hours to deploy a Java update, but there are also manual updates that take hundreds of steps. Then there are time-to-market, customer retention and uptime advantages. By defining thresholds for deployment failures and speeding the process, the company claimed, a deployment tool can deliver staggering savings.
Brodie, who recently left Serena, characterized Electric Cloud as having a unique advantage as an end-to-end tool. He laid out the landscape as starting with developer application life-cycle management: Companies such as Serena, Atlassian or Rally. Then there are infrastructure tools such as VMware and AWS. Puppet, Chef and CFEngine take care of configuration management. Finally there’s IT management, such as ServiceNow. “We tie in all these elements,” he said.
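For a sense of scale, the averages quoted above can be run through a back-of-the-envelope calculation in Python. This is not Electric Cloud’s calculator: the formula, the function name and the assumption that each failed first attempt is simply repeated once are all hypothetical, and the four-a-year figure is read here as four releases.

```python
def annual_deploy_hours(deploys_per_year, hours_per_deploy, failure_rate):
    """Expected person-hours per year spent deploying one application."""
    # Assumption: every failed first attempt is redone exactly once.
    attempts = deploys_per_year * (1 + failure_rate)
    return attempts * hours_per_deploy


low = annual_deploy_hours(4, 4, 0.10)   # best case from the quoted ranges
high = annual_deploy_hours(4, 8, 0.20)  # worst case from the quoted ranges
print(round(low, 1), round(high, 1))    # prints "17.6 38.4"
```

Those hours are per application, and they leave out the manual updates with hundreds of steps, so the total for a large portfolio would be considerably higher.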
While UrbanCode’s uDeploy gets 90% of its customers from the operations side, the company also attracts developers familiar with AnthillPro. UrbanCode’s technical evangelist Eric Minick explained: “The production deployment is considerably more complex and fundamentally different than what development does. You don’t get to drop all the data and start from a clean database, because data is our business. Where a lot of teams slip up with a continuous-delivery effort that originated with development is, they build it all on dev-style deployment, and the gap is just too far.” A better approach is to start with the complex and work back toward the simple: start production-style, then automate QA, then incorporate development.
According to Nolio’s Gerstel, Electric Cloud and UrbanCode both “come from build solution and want to expand it. They have a large installed base in build integration that they want to upsell to continuous or release automation. Nolio was designed foremost as a deployment-automation tool for large enterprise.”
Nolio’s prestigious financial customers are surprisingly agile: Lombard Odier, a Swiss bank founded in 1796, deploys applications 20 to 30 times a week, with complex apps taking two hours each. Using Nolio, the company hopes to reduce that time to 20 minutes.
Once an application is live, however, the story doesn’t end. “Just because your application is up doesn’t mean it actually works,” said Matt Watson, CEO and founder of Stackify, a year-old startup aiming to put application monitoring in the hands of developers, not operations.
“The biggest problem we saw is developers don’t have access to the servers. Most of the time, they get no access or admin access. Lack of access makes it very difficult to troubleshoot problems,” Watson said, citing tasks such as checking that queues aren’t piled up or verifying whether a file was actually exported and delivered.
A SaaS-based tool that runs on Windows Azure, Stackify troubleshoots remotely where other tools are too painful to use. “I think there are too many tools, and that’s part of the problem,” said Watson. “You have multiple teams, and each uses five different tools: Cacti, Gomez, Nagios, Splunk or Loggly, Pingdom, Puppet… The problem is, they don’t know how to troubleshoot them. Our end goal is to encompass a lot of this functionality in this one system. We see ourselves as that last mile.”
As for the cloud? Watson is blunt: “DevOps to me is the concept of developers and operations working together. And that has nothing to do with the cloud.”