Friday, April 22, 2011

Amazon EC2 Collapse and designing for cloud computing

As I'm sure most tech geeky folks now know, Amazon EC2 had a massive outage yesterday. This affected numerous online applications and web sites totally unrelated to My favorite new word is currently cloudpocalypse.

Many folks have decided that "cloud computing" is the next golden hammer that will solve any and all computing problems. I hate to rain on their parade (pun intended), but at least from my perspective, "cloud computing" primarily provides the ability to acquire cheap and fast computing infrastructure, have it online quickly, and scale it massively. EC2 is GREAT for that, as a matter of fact, there probably isn't anything nearly as powerful and complete on the market right now.

Note that I didn't include the word "reliable" anywhere in my value proposition. Don't trust a cloud provider to be reliable, to quote Amazon's own CTO "Everything Fails all the time". The important thing to consider when designing software for the cloud is how you deal with failure.

Many folks with traditional datacenters try to deal with failure by flogging sysadmins, developers, and vendors every time something goes wrong and generally spending a lot of time pointing fingers. This is pretty unhealthy and actually creates more problems than it solves, but when using amazon as a platform it's just not going to be an option.

What does this mean?

It means when designing for a cloud provider, you might need a plan "B" or a plan "C" when something goes wrong. Many traditional shops have a "disaster recovery" site which is physically separated from the main site. How do you do this in the cloud? Likely it means having alternate providers or real life physical servers that you have control over as a disaster recovery option.

Moreover it means that your applications should be designed in such a way that they can still function when they suddenly are running on a different machine in a different location or with a different set of components. If you rely on physical files (outside your app) being present for your application to function... you'll fail. If you rely on a database having certain up to date information... you'll fail, if you rely on "you own" IP address (within your application code) you'll fail.

In short, designing for cloud computing means designing for failure. Arguable ALL software design should be done this way, but I think cloud computing requires elevating the importance of this quality.


Anonymous said...

Designing for failure is fine - but who is responsible for fixing things after a failure? If it's you, then you control your own destiny. If it's Amazon - or any external provider - they control it. The whole thing comes down to the cost of controlling ones own fate. Episodes like this just re-enforce the belief that we are a long, long way from the crossover point: it doesn't cost that much more to control your own fate!

me said...

That's a good point, I've been in shops where they didn't have contingency plans because "the vendor is contractually obligated" to an SLA. Folks were just happy to have someone to blame and didn't really want to think about the real problem.

us vpn said...

Thank you for sharing this article on Amazon EC2.