The outage of Amazon Web Services’ Simple Storage Service (S3) yesterday impacted websites and applications across the eastern United States, but according to AWS’ health dashboard, Amazon S3 has fully recovered and is operating normally.
The AWS breakdown was a four-hour outage caused by high error rates with S3 in US-EAST-1, which took down websites hosted entirely on those servers, according to Shawn Moore, CTO of web experience platform provider Solodev. He estimated that 20% of the Internet was impacted, yet many businesses hosted on Amazon did not experience issues.
“The difference is that the ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared,” said Moore. “This is a wakeup call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up and emphasizes the need for redundancy—a capability that AWS offers, but it’s now being revealed how few were actually using.”
For those in the industry, the AWS outage is proof that all technology can fail eventually, even the ones that are “too big to fail,” like Amazon, said Moore.
Other industry leaders are suggesting that the latest Amazon outage might foreshadow a widespread systemic failure of the Internet, said Phil Tee, CEO of Moogsoft, who added that this would of course pose “massive risks to the health of modern businesses and have negative impacts on the global economy.”
“The only hope for modern CEOs lies in the widespread application of data science to ‘algorithmisize’ and safeguard the operations underpinning their digital services, as human beings alone are not able to understand and cope with the ever-changing complexity inherent in operating cloud-resident businesses,” said Tee.
While Twitter users called the outage a “digital snow day for grownups,” as many as tens of thousands of businesses were impacted, according to Doron Pinhas, CTO of Continuity Software.
At a certain scale, businesses cannot afford to rely on a single provider, especially when it comes to utilities, he said. A few solutions make it possible to distribute risk among multiple service providers, including a mix of private and public cloud.
“For mission-critical applications, it would be highly advised to consider distributing workloads between multiple providers,” said Pinhas. “At the very least, and especially if you are tied in too deep into a single vendor ecosystem, you must consider distributing workloads and especially data across sufficiently disparate geolocations, [like] regions and availability zones.”
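As a rough illustration of what distributing data across regions can look like from the application side, the sketch below tries to read an object from a primary region and falls back to a replica in another region if the first read fails. It is a minimal sketch, not a description of how any company named here responded: the region names, the `fetch_with_failover` helper and the simulated fetch functions are all hypothetical, and a real deployment would call a storage client and would also need the data replicated to the second region ahead of time.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a function that fetches one object by key from one region.
Fetcher = Callable[[str], bytes]


def fetch_with_failover(
    key: str, fetchers: Dict[str, Fetcher], order: List[str]
) -> Tuple[str, bytes]:
    """Try each region in `order`; return (region, data) from the first that succeeds."""
    errors = {}
    for region in order:
        try:
            return region, fetchers[region](key)
        except Exception as exc:  # in real code, catch the client's specific errors
            errors[region] = exc
    raise RuntimeError(f"all regions failed for {key!r}: {errors}")


# Simulated regions (assumptions for the demo): the primary is failing,
# the replica in a second region is healthy.
def failing_region(key: str) -> bytes:
    raise ConnectionError("high error rate")


def healthy_region(key: str) -> bytes:
    return b"contents of " + key.encode()


fetchers = {"us-east-1": failing_region, "us-west-2": healthy_region}
region, data = fetch_with_failover("logo.png", fetchers, ["us-east-1", "us-west-2"])
print(region, data)
```

The ordering of regions is passed in explicitly so that the primary is always tried first and the replica is only used when the primary fails.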
Additional stats from web monitoring site Apica showed just how significant this issue was for companies that rely on websites and apps for traffic and sales. For instance, it found that 54 out of the top 100 Internet retailers were affected with a decrease of 20% or greater in performance metrics. Three websites went down completely: Express, Lululemon and One Kings Lane. Apica also found that the average website load time went from a few seconds to more than 30 seconds.
Load times for some of the top websites increased by triple- and even quadruple-digit percentages: the Disney Store was 1,165% slower, and Target was 991% slower. Amazon/Zappos, Apple, Best Buy, Costco and Walmart were not affected by the outage. Some of the affected companies were able to spot-check and evaluate their site’s performance while it was down, which allowed them to devise a workaround or quickly switch to serving images from a local server.