Learning from the Failures of Amazon Cloud
Due to a serious failure in an Amazon data centre in Northern Virginia in the United States, Amazon Web Services (AWS) suffered a major outage in the last week of April, affecting thousands of websites that rely on AWS. Many start-ups and web companies, including some in India, that hosted their services in the US East region were affected by this outage. Major sites that went down included Foursquare and Quora. This is an embarrassing situation for Amazon, the undisputed leader in cloud computing today. With cloud computing getting increasing attention, start-ups and enterprise customers are increasingly deploying their services on AWS for their Web-scale computing infrastructure. The outage has raised many questions in the minds of current and potential cloud users about the reliability of the cloud. Naturally, every CIO is concerned. What lessons can we learn from such disasters? Is the cloud something we can bet our business on? The short answer is a resounding yes. Interestingly, other marquee customers such as Netflix and SimpleGeo also run their services on AWS, yet their services were not affected.
Businesses should be fully aware of the limitations of the cloud and should have a design that can sustain such failures. Such failures bring the importance of architecture and design into sharp focus. Moving an application to the cloud does not merely mean relocating the server to the data centre of the cloud service provider. That approach is all right for applications that are not “mission critical” and whose services can be down for several hours without any major impact. But a large number of services are moving to the cloud to address three core issues: high availability; the ability to scale, and scale to Web scale (orders-of-magnitude increases in usage and/or users); and consistency of performance. Though the problem of Availability Zone interdependencies lies partly with the AWS infrastructure, many companies failed to recognise that the other part of the problem lies with their own developers. Every cloud expert will acknowledge that the mantra for a successful architecture in the cloud is ‘Design for Failure’. Yet many companies did not adhere to that golden rule, as the recent outage showed. The reasons range from a lack of the technical awareness needed to configure high availability, to the cost of operating a complex, globally highly available setup on AWS.
There are a couple of approaches to architecting highly available systems on the AWS cloud. One approach is for businesses to run their applications across multiple Availability Zones. In this setup, the application is distributed across multiple zones within a region (like US East 1a and US East 1b). A failure in one zone redirects traffic to a different zone that is stable.
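To make this concrete, here is a minimal sketch of the multi-zone setup using the Python boto library and classic Elastic Load Balancing; the load balancer name, the instance IDs and the /health path are placeholder assumptions:

```python
import boto.ec2.elb
from boto.ec2.elb import HealthCheck

# Connect to the US East region (credentials are read from the
# environment or the boto config file).
conn = boto.ec2.elb.connect_to_region('us-east-1')

# Probe /health every 20 seconds; 5 consecutive failures mark an
# instance unhealthy and take it out of rotation.
hc = HealthCheck(interval=20,
                 healthy_threshold=3,
                 unhealthy_threshold=5,
                 target='HTTP:80/health')

# A load balancer spanning two Availability Zones: if one zone
# fails, the other keeps serving traffic.
lb = conn.create_load_balancer('web-lb',
                               zones=['us-east-1a', 'us-east-1b'],
                               listeners=[(80, 80, 'http')])
lb.configure_health_check(hc)

# Register hypothetical instances, one running in each zone.
lb.register_instances(['i-0aaaaaaa', 'i-0bbbbbbb'])
print('Balancing across zones:', lb.availability_zones)
```

The essential point is that the instances behind the load balancer live in different zones, and the health check, not the application, decides where traffic flows.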
This is a cost-effective solution (compared with the second approach of distributing applications across multiple Availability Regions). However, it may not be sufficient when a failure cascades across the zones of a region, as happened in the recent outage. The second approach is to run the application across multiple regions. In this setup, the application is hosted in multiple AWS regions (like US East, US West, Europe and Singapore).
This setup makes geo-distributed traffic and high availability possible, even across continents. It is recommended for companies with stringent scalability, load-balancing and global user-access requirements. In the event of a failure in one region, traffic can be redirected to other, stable regions. This approach would have withstood the recent AWS outage; indeed, the companies that were unaffected in April had used this design principle. The two approaches mentioned here offer a glimpse of the possibilities with regard to designing for failure. Of course, the design can be extended further (to involve multiple clouds or hybrid solutions), taking business needs into consideration. A cloud architecture should factor in some key points: avoid any single point of failure; do not compromise on scalability, and keep the application available at all times; and align the infrastructure setup closely with the load requirements.
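Because an Elastic Load Balancer operates within a single region, cross-region failover is usually driven from outside the region, typically at the DNS layer. The sketch below shows the heart of such a scheme: probe each region's health endpoint in priority order and serve traffic from the first one that responds (the example.com endpoints are hypothetical):

```python
import urllib.request

# Hypothetical health-check endpoints, one per AWS region,
# listed in order of preference.
REGION_ENDPOINTS = [
    ('us-east-1', 'http://us-east.example.com/health'),
    ('us-west-1', 'http://us-west.example.com/health'),
    ('eu-west-1', 'http://eu.example.com/health'),
]

def first_healthy_region(endpoints, timeout=3):
    """Return the first region whose health endpoint answers 200 OK."""
    for region, url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return region
        except OSError:
            # Connection refused, DNS failure or timeout:
            # treat the region as down and try the next one.
            continue
    return None  # every region is down; time to page operations

if __name__ == '__main__':
    print('Serving traffic from:', first_healthy_region(REGION_ENDPOINTS))
```

In production this logic would live in a DNS service or a global traffic manager rather than a standalone script, but the design principle is the same: no single region is a single point of failure.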
Any failover setup results in additional cost. However, it is a trade-off between the higher cost of infrastructure and the eventual, higher price of unhappy and/or lost customers! In short, the April outage of AWS services will bring the focus of cloud computing research and deployment to the importance of architecture and design. We should remember that we can design systems taking multiple possibilities of failure into consideration; and, if well designed, nothing will really fail completely. With this approach, moving to the cloud should be a much more comfortable journey, and one with significant business benefits. It may be worthwhile to keep in mind what one of the key designers said: “failure is a first-class citizen in our design. We discuss failure all the time, including over our lunch hour, and build systems with failure in mind”. That is the real lesson we can learn from the April outage of AWS.
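As a postscript on the cost trade-off mentioned above, a back-of-envelope comparison can help frame the decision. All figures here are hypothetical; substitute your own business numbers:

```python
# Hypothetical inputs: replace with your own business figures.
standby_cost_per_month = 2000.0       # warm standby region, per month
outage_probability_per_month = 0.01   # chance of a region-wide outage
outage_duration_hours = 8.0           # assumed length of an outage
revenue_lost_per_hour = 50000.0       # revenue at risk while down

# Expected cost of doing nothing: probability x duration x hourly loss.
expected_outage_cost = (outage_probability_per_month
                        * outage_duration_hours
                        * revenue_lost_per_hour)

print('Standby cost per month        : $%.0f' % standby_cost_per_month)
print('Expected outage cost per month: $%.0f' % expected_outage_cost)
# With these numbers, $4,000 of expected loss outweighs $2,000 of standby.
```

Even before reputational damage is counted, the arithmetic often favours paying for the failover setup.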