Spoiler Alert: The Cloud Can Be Brought Down
A service outage in Amazon’s popular AWS platform was a sobering reminder that even the biggest and best cloud providers can suffer a service-disrupting disaster.
On February 28, 2017, Amazon’s S3 storage service suffered an outage that lasted several hours.
The outage impacted websites like Giphy, Business Insider, Trello, and Quora, while having a narrower effect on particular services of other websites, like Slack’s file-sharing function, as well as various publishers’ image hosting services. It even affected IoT-enabled light bulbs, thermostats, and other “smart” hardware.
Beyond the loss of service, the outage also cost AWS customers visibility into their own infrastructure. They lost access to Amazon’s dashboard, which shows the status and availability of the servers they depend on.
What Caused the Outage
Amazon worked for several hours to restore full function to S3 customers, starting with its own ability to accurately display service status on its dashboard. It wasn’t until three hours later that the company announced all services and functions had been restored to users and that the S3 service was “operating normally.”
But what caused this outage that seemed to impact such a large swath of the internet?
Shortly after declaring the issue resolved, Amazon released a statement detailing the error that caused the outage. It turns out that a simple typo was at the heart of the issue.
“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” the statement read. “Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended.”
In order to avoid similarly disastrous errors in the future, the company said it was “making several changes as a result of this operational event…and auditing our other operational tools to ensure we have similar safety checks.”
The changes included modifying the tool the team used to remove capacity, so that servers are taken out of service more slowly and with more safeguards in place.
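To make the idea of "more slowly and with more safeguards" concrete, here is a minimal Python sketch of a capacity-removal check. It is an illustration only, not Amazon's tool: the function `plan_removal`, the exception `UnsafeRemovalError`, and the parameters `min_required` and `max_batch` are hypothetical names chosen for this example. The sketch refuses any request that would leave a subsystem below a minimum server count, and it splits accepted requests into small batches.

```python
class UnsafeRemovalError(Exception):
    """Raised when a removal request would breach a safety limit."""


def plan_removal(active, requested, min_required, max_batch=2):
    """Return a list of removal batches, or raise if the request is unsafe.

    active       -- names of servers currently in service (hypothetical input)
    requested    -- names of servers the operator wants to remove
    min_required -- fewest servers the subsystem needs to keep running
    max_batch    -- most servers removed in a single step
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")

    active_set = set(active)

    # Safeguard 1: reject names that are not in service (catches typos).
    unknown = [name for name in requested if name not in active_set]
    if unknown:
        raise UnsafeRemovalError(f"not in service: {unknown}")

    # Drop duplicate names while keeping the operator's order.
    targets = list(dict.fromkeys(requested))

    # Safeguard 2: never go below the minimum required capacity.
    remaining = len(active_set) - len(targets)
    if remaining < min_required:
        raise UnsafeRemovalError(
            f"removal would leave {remaining} servers; minimum is {min_required}"
        )

    # Remove slowly: small batches leave room for a health check between steps.
    return [targets[i:i + max_batch] for i in range(0, len(targets), max_batch)]
```

With ten servers in service and a minimum of six, a request to remove three servers comes back as two small batches, while a mistyped request to remove eight is rejected before anything is touched.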
Lesson Learned: Be Prepared
AWS is one of the largest public cloud providers in the world, and this most recent outage is only further evidence that, regardless of a provider’s size or pedigree, disaster can strike at any given time.
Given all the manual work involved in patching, debugging, and troubleshooting, the chance of human error is always present. As such, companies should always have a backup plan for data recovery, even if they think their network is reliable.
Companies can lose millions of dollars a year as a result of unexpected downtime and outages, whether they are using an on-premises network environment or hosting all of their services in the public cloud.
The first step to keeping your network online, while maintaining the visibility required to make proactive changes, is implementing infrastructure management tools.
Software like the Vityl Suite from TeamQuest can help mitigate the risk of losing business-critical services and reroute backup resources as efficiently as possible to keep your services online.
Vityl Adviser, in particular, provides powerful analytics so that IT teams can get an at-a-glance view of the health of their infrastructure and make proactive decisions to keep it online.