Amazon Web Services suffered an extensive outage on Tuesday in its S3 storage service. As AWS explained in its post-mortem, the outage was caused by human error during debugging of the S3 billing system. Following an established SOP, an AWS engineer was tasked with manually taking down a small portion of the S3 storage fleet. As the cloud gods would have it, he entered the command incorrectly, removing a much larger set of servers than intended. Two other S3 subsystems, indexing and placement, went down as a result.
This required a full restart of those subsystems, and while the restart was in progress, S3 could not service requests.
This AWS outage was felt throughout the internet. Some of the most popular and frequently accessed websites and apps went down and could no longer be reached. Because AWS is the biggest cloud provider and hosts many of the internet's top services, the effects were all the more noticeable.
image from http://blog.catchpoint.com/2017/03/01/aws-s3-outage-impact/
According to Business Insider, the outage hurt 54 of the top 100 internet retailers, and S&P 500 companies lost $150 million. Cyence estimated that US financial services companies lost a further $160 million.
examples of outages from http://blog.catchpoint.com/2017/03/01/aws-s3-outage-impact/
It should come as no surprise that AWS is still subject to something as prevalent as human error. Seen through the lens of business continuity, the recent outage has exposed the degree of trust that most DevOps professionals put in AWS by confining their resources to a single DC at a single cloud provider. At the very least, enterprises for whom uptime and business continuity are of the utmost importance need to consider disaster recovery solutions that are not dependent on the cloud provider itself.
AWS status updates from http://blog.catchpoint.com/2017/03/01/aws-s3-outage-impact/
In the aftermath of this outage, DevOps teams need to go back to the drawing board and rethink their cloud strategy. The importance of data replication and redundancy can no longer be overstated. Services that need to maximize uptime need to spread their resources more widely.
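As a thought experiment, that replication point can be sketched in a few lines of Python. Everything here is illustrative: the region names are arbitrary, and in-memory dicts stand in for regional S3 buckets. A real deployment would use S3 cross-region replication and retry queues rather than hand-rolled fan-out.

```python
class RegionStore:
    """In-memory stand-in for an object store in one region."""
    def __init__(self, region):
        self.region = region
        self.objects = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError(f"{self.region} is down")
        self.objects[key] = value

    def get(self, key):
        if not self.available:
            raise ConnectionError(f"{self.region} is down")
        return self.objects[key]


class ReplicatedStore:
    """Writes to every region; reads from the first healthy one."""
    def __init__(self, stores):
        self.stores = stores

    def put(self, key, value):
        for store in self.stores:
            try:
                store.put(key, value)
            except ConnectionError:
                pass  # a real system would queue the write and retry

    def get(self, key):
        for store in self.stores:
            try:
                return store.get(key)
            except ConnectionError:
                continue  # fall through to the next region
        raise ConnectionError("all regions down")


east = RegionStore("us-east-1")
west = RegionStore("us-west-2")
store = ReplicatedStore([east, west])

store.put("index.html", "<html>...</html>")
east.available = False          # simulate the regional outage
print(store.get("index.html"))  # still served, from us-west-2
```

The point of the sketch: a service with a copy of its data outside the affected region keeps serving through exactly the kind of outage described above.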
Localized outages are a fact of life for every cloud provider. They might not have the reach of this recent outage, but services unlucky enough to be caught in one without resources outside the affected zone have to bear the full brunt.
How you could have avoided downtime
AWS has its own tool for this: Route 53. Route 53 is no different from several other DNS-based failover and load-balancing tools on the market today, and like all DNS failover solutions, it has a TTL problem. DNS query responses are usually cached for a time before the resolver goes back to the authoritative server for an update. This can lead to situations where a cached IP address that was up and running when it was first queried goes down and cannot be accessed the second time round. The situation can only be remedied once the TTL expires and the resolver can go back to the authoritative server and ask for another IP address.
DNS failover can take anywhere from 30 seconds to 30 minutes to re-route traffic during an outage.
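The caching behavior behind that delay can be sketched in a few lines of Python. The classes, hostname, and IP addresses below are illustrative stand-ins, not Route 53's actual implementation:

```python
import time

class AuthoritativeServer:
    """Returns whichever IP is currently 'live' for a name."""
    def __init__(self, ip, ttl):
        self.ip = ip
        self.ttl = ttl

    def resolve(self, name):
        return self.ip, self.ttl


class CachingResolver:
    """Serves cached answers until the record's TTL expires."""
    def __init__(self, upstream, clock=time.time):
        self.upstream = upstream
        self.clock = clock
        self.cache = {}  # name -> (ip, expires_at)

    def resolve(self, name):
        now = self.clock()
        if name in self.cache:
            ip, expires_at = self.cache[name]
            if now < expires_at:
                return ip  # possibly stale answer until the TTL expires
        ip, ttl = self.upstream.resolve(name)
        self.cache[name] = (ip, now + ttl)
        return ip


# Simulated clock so the demo is deterministic.
t = [0.0]
auth = AuthoritativeServer("192.0.2.1", ttl=300)
resolver = CachingResolver(auth, clock=lambda: t[0])

resolver.resolve("example.com")         # caches 192.0.2.1 for 300 s
auth.ip = "192.0.2.2"                   # failover flips the record...
print(resolver.resolve("example.com"))  # ...but clients still get 192.0.2.1
t[0] = 301.0
print(resolver.resolve("example.com"))  # TTL expired: now 192.0.2.2
```

This is why lowering TTLs is the usual mitigation for DNS failover, and also why it is only a partial one: very low TTLs increase query load, and some resolvers ignore them anyway.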
Anycast works differently from DNS-based failover solutions. DNS failover relies on unicast routing, in which each server advertises its own unique IP address. Anycast, by contrast, lets multiple servers advertise the same IP address. This allows much more flexibility in manipulating traffic during outages or downtime: internet traffic meant for one server can be re-routed to another in less than 10 seconds.
Anycast failover for AWS involves advertising the same IP address from multiple AWS regions. During an outage like the one S3 suffered, the BGP route announcements for the affected region are withdrawn, while routes continue to be announced from the healthy region. Internet traffic is thus directed to the live region in a matter of seconds.