Amazon Web Services

    What the AWS Outage Can Teach You About IT Survival Strategies

    October 6, 2015

    By Wyndham Sellers

    An extensive outage on the AWS platform ignited outrage among consumers in September, many of whom experienced lengthy downtimes on their favorite websites and apps. Users of these services made their suffering known on social media, but the outage’s impact on the companies that rely on AWS has yet to be fully determined. This raises the question: how can your company keep an outage like this from taking you down too?

    What Happened?

    According to TechRepublic, the outage reportedly began around 6 a.m. EST on Sunday when “some of the internet’s biggest sites and apps were intermittently unavailable after more than 20 services on the AWS platform began failing.”

    The disruption of service originated from data centers in Northern Virginia, referred to as the US-EAST-1 region. AWS concluded that the problem began when “metadata service responses exceeded the retrieval and transmission time allowed by storage servers.”

    The issues were significant enough to affect more than a dozen of the web’s most popular sites and apps, including Netflix, Product Hunt, Buffer, Reddit, IMDb, Airbnb, Amazon Instant Video, and Tinder. As you might expect, many bewildered and annoyed consumers turned to Twitter and other social platforms that remained intact to voice their frustration with AWS.

    After six to eight hours of downtime, many of the affected sites and apps began recovering just after 12 p.m. EST, while some remained down until after 2 p.m. EST.

    What Was the Impact?

    With an outage this extensive, affecting so many popular companies and services, the obvious question is: what kind of damage can these companies expect?

    Sunday’s outage was not an isolated event for AWS. There have been numerous reported incidents in which AWS service suddenly stopped, most notably in 2013, when the cloud provider experienced an outage that took out Instagram, Airbnb and Vine.

    Buzzfeed estimated that Amazon lost about $1,110 per second as a result of the downtime. Even using that estimate (which is surely lower than today’s figure), an outage lasting 6-8 hours would cost a single company at that rate roughly $24 million to $32 million, and the combined cost across every affected service could be far higher. What’s more, that estimate doesn’t take into account the lost brand loyalty that each affected service suffered during the outage.
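
    As a rough check on the per-second figure above, the arithmetic can be sketched in a few lines of Python. The function name and the assumption of a constant loss rate over the whole outage are ours, for illustration only; real losses vary by hour, by service, and by company.

```python
# Per-second loss figure quoted above (Buzzfeed's estimate for Amazon).
LOSS_PER_SECOND_USD = 1_110


def outage_cost(hours, loss_per_second=LOSS_PER_SECOND_USD):
    """Estimated loss in USD for an outage of `hours`, assuming a constant rate."""
    seconds = hours * 3600
    return seconds * loss_per_second


for hours in (6, 8):
    print(f"{hours} hours: ${outage_cost(hours):,}")
# 6 hours: $23,976,000
# 8 hours: $31,968,000
```

    At that rate, a single company would lose roughly $24 million to $32 million over 6-8 hours, before counting any longer-term damage to customer loyalty.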

    What’s Your Survival Strategy?

    So the question now becomes: how does your company survive, or at least offset the effects of, an outage like this? Netflix came out ahead during the outage, mostly due to a practice it calls “chaos engineering,” which uses “software that deliberately attempts to wreak havoc on its systems” so that its Tech Ops teams can prepare for failure.
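
    To make the idea concrete, here is a minimal sketch of deliberate fault injection in Python. It is not Netflix's tooling; every name in it (with_chaos, fetch_recommendations, the 30% failure rate) is hypothetical and chosen only for illustration. The point is simply that if you fail a dependency on purpose, you find out whether your fallback path actually works before a real outage tests it for you.

```python
import random


class InjectedFailure(Exception):
    """Raised when the experiment deliberately fails a call."""


def with_chaos(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical primary dependency, e.g. a service hosted in one region.
    return [f"personalized-title-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id):
    """Serve a degraded but usable response when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-title"]  # static fallback that needs no dependency


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_fetch = with_chaos(fetch_recommendations, failure_rate=0.3, rng=rng)

results = [recommendations_with_fallback(flaky_fetch, "u1") for _ in range(1000)]
degraded = sum(r == ["popular-title"] for r in results)
print(f"{degraded} of {len(results)} requests served the fallback; none failed outright")
```

    In a real system the injected fault would be something closer to a terminated instance or a blocked network path, but the question being tested is the same: does the service degrade gracefully, or does it go down with its dependency?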

    While Sunday’s AWS outage debilitated many services, it is also an important lesson in the value of designing and planning for failure in your IT infrastructure.

