It was only a few short months ago that a vulnerability in the Xen Hypervisor resulted in problems for Amazon’s EC2. We called it the Amazonian apocalypse then, and the time has come for its sequel.
Last time, as you may recall, servers were chunked into sections and given windows during which they needed to be rebooted. This time, the same thing is happening. Amazon says that, as was the case last time, it's only a problem for 10% of EC2 instances. We do know that it’s another Xen bug. Most likely, it’s some combination of the five bugs that were listed on the Xen security advisory site yesterday. Details of those vulnerabilities won’t be made public until March 10, however.
That’s also the deadline to reboot your EC2 instance, if yours is, in fact, among the unlucky fewer-than-10% of EC2 instances that are affected. Amazon contacted us to say that not everyone affected will even need to reboot their servers by hand. They indicated that this was the case last time as well.
(UPDATE: Amazon contacted us Monday morning to let us know that they’ve figured out a workaround that allows them to take care of this reboot for 99.9% of affected customers. That means you probably will not have to reboot now.)
It’s a very difficult thing for us to gauge from the outside, but I can confirm that last time, despite it being only an under-10% problem, I personally knew of admins who were up overnight redeploying servers. This was, in almost all cases, entirely their own fault: They’d designed their systems in a way that made things break after patching, or that required a lot of manual labor for rebooting their EC2 instances.
Additionally, the under-10% figure does not necessarily hold for any single customer. Commenters online are already chiming in with their own server numbers, and one Hacker News commenter cast doubt on how accurate that 10% number really is:
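For admins who would rather not discover reboots by hand, the scheduled events EC2 attaches to instance status can be filtered in a few lines. What follows is only a sketch: it assumes a response shaped like the output of EC2's DescribeInstanceStatus call (an `InstanceStatuses` list whose entries carry an `InstanceId` and an `Events` list with `Code` and `NotBefore`), and the helper name `pending_reboots`, the instance IDs, and the dates are all made up for illustration.

```python
from datetime import datetime, timezone

# Event codes treated as "a reboot is coming" (an assumption for this sketch).
REBOOT_CODES = {"instance-reboot", "system-reboot"}


def pending_reboots(response):
    """Return (instance_id, code, not_before) for each scheduled reboot event."""
    found = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Code") in REBOOT_CODES:
                found.append(
                    (status["InstanceId"], event["Code"], event.get("NotBefore"))
                )
    # Earliest window first; events without a start time sort last.
    return sorted(found, key=lambda item: (item[2] is None, item[2]))


# Hypothetical sample data, not real instances or real maintenance windows.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-aaaa1111",
            "Events": [
                {
                    "Code": "system-reboot",
                    "NotBefore": datetime(2015, 3, 9, 2, 0, tzinfo=timezone.utc),
                }
            ],
        },
        {"InstanceId": "i-bbbb2222", "Events": []},
        {
            "InstanceId": "i-cccc3333",
            "Events": [
                {
                    "Code": "instance-reboot",
                    "NotBefore": datetime(2015, 3, 7, 2, 0, tzinfo=timezone.utc),
                }
            ],
        },
    ]
}

for instance_id, code, when in pending_reboots(sample):
    print(instance_id, code, when.isoformat())
```

In practice the dictionary would presumably come from something like boto3's `describe_instance_status` call rather than hand-built data; the filtering itself would stay the same.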
Been there, done that. AWS re:Boot in September 2014 showed us how good it was to invest in Ansible roles for all parts of our infrastructure. Still, a lot of hassle for Ops Team, especially that it was done during DevOps Days Warsaw 😉 AWS also said “10%” then, but for us it was 81 out of ~300 instances.
What is sad is that we learned about it from Hacker News and not from AWS, even when we have premium support and our own account manager. :/
Let’s see how many of us did their homework after previous “Xen update,” and how much “10%” is now 😉
That’s by no means a scientific sampling of the disgruntled, but there are further comments on that same thread on Hacker News that indicate the pain was not entirely a myth for users during September’s reboot.
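The commenter's own numbers make the gap easy to quantify. Here is a back-of-the-envelope check, treating "~300" as exactly 300 (an assumption, since the comment gives only an approximate fleet size); the function name is just for illustration.

```python
def affected_share(affected, fleet_size):
    """Fraction of a fleet that was hit by the reboot."""
    return affected / fleet_size


share = affected_share(81, 300)   # figures from the Hacker News comment
ratio = share / 0.10              # compared with Amazon's quoted 10%

print(f"{share:.0%} of instances affected, {ratio:.1f}x the quoted figure")
```

For that one customer, in other words, the share works out to roughly 27%, close to three times the fleet-wide number.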
All of this scrambling to patch and reboot is a phenomenon that’s rather interesting for the Internet. It’s not something we’ve experienced before as a society: large portions of the world having to do something at the same time, digitally. When the whole of Amazon’s cloud infrastructure required patching, for example, the world’s population of cloud users was up in shifts for weeks, all working on their own projects, and yet, somehow, all collaborating on a larger project to keep everyone safe.
There was a similar phenomenon on the Internet last night: the dress. There have been fast-spreading memes on the Internet before, but the dress was unlike anything I’d seen. It went viral, spread to everyone, completely took over Twitter and Facebook for about 12 hours, and now it’s gone.
These are the first real glimpses we’ve gotten as a society of what our world will look like in a few decades. The Internet has the ability to push ideas and information to everyone on the planet within a few seconds. We’ve never actually exploited this capability as a whole, but we’re starting to do so accidentally. The dress is one facet of this.
Can you imagine what will happen when a Xen vulnerability actually gets out into the wild before Amazon can patch? I envision a herculean, worldwide reboot effort undertaken in dire emergency, and reaching something like 80% completion in 24 hours. This is the sort of thing the Internet can enable humanity to accomplish.
And we’re getting there, one meme, one vulnerability at a time.