Yesterday Amazon’s new S3 service served up nothing but service unavailable messages for nearly 7 hours.
I give Amazon full credit for hopping on their user forums last night, letting us know they were working on it, and posting again when it was fixed. At the same time, I’m a little frustrated that such an outage occurred so early in the history of the service. The whole point of S3 is to treat storage like a utility, metered in gigabyte-hours stored and gigabytes of data transferred, much like you would treat your water or electricity service.
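To make the utility metaphor concrete, here’s a back-of-the-envelope sketch of how that metering works out. The rates below are illustrative assumptions (roughly the launch pricing as I recall it), so check Amazon’s rate card for the real numbers:

```python
# Back-of-the-envelope S3 bill, treating storage like a metered utility.
# Rates are assumptions for illustration only -- consult Amazon's pricing page.
STORAGE_RATE_PER_GB_MONTH = 0.15   # assumed: USD per GB stored per month
TRANSFER_RATE_PER_GB = 0.20        # assumed: USD per GB transferred

def monthly_bill(gb_stored: float, gb_transferred: float) -> float:
    """Estimate a month's S3 charges the way you'd read a utility meter."""
    return (gb_stored * STORAGE_RATE_PER_GB_MONTH
            + gb_transferred * TRANSFER_RATE_PER_GB)

# e.g. 50 GB of backups plus 10 GB of downloads ~= $9.50 for the month
print(monthly_bill(50, 10))
```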
How mad would you be if the power company turned off your power for several hours without warning, or if you woke up in the morning to find that you couldn’t take a shower? Pretty mad, I imagine. I was just a little bit annoyed last night because my flickr backup wasn’t working. I couldn’t have retrieved anything from S3 if I had wanted to, but thankfully I didn’t need (or want) to.
What if I were building out a Carson-style startup using S3 for storage? That would have been 7 hours of downtime for my app too. Hopefully the beta testers weren’t too pissed off. Hopefully I wasn’t showing a demo of it to anyone.
Now might be a good time to read the Amazon Web Services Licensing Agreement and specifically the section on Amazon S3. You’ll note that there aren’t any guarantees about availability or uptime. You can’t count the nines in their SLA.
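With no SLA to lean on, anything built on S3 has to assume that requests will sometimes fail. Here’s a minimal sketch, not anything Amazon provides, of fetching an object over S3’s plain HTTP interface with retries and backoff; the bucket, key, and backoff numbers are all hypothetical, and it assumes a public, path-style URL:

```python
import time
import urllib.error
import urllib.request

def fetch_from_s3(bucket: str, key: str, retries: int = 5) -> bytes:
    """Fetch a public S3 object over HTTP, backing off when the service is down.

    Defensive sketch only: a real app would also authenticate, cache a local
    copy, and alert someone instead of silently retrying forever.
    """
    url = f"https://s3.amazonaws.com/{bucket}/{key}"  # assumed path-style URL
    delay = 1.0
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in (500, 503):   # only retry server-side failures
                raise
        except urllib.error.URLError:
            pass                             # network trouble: also worth a retry
        time.sleep(delay)                    # exponential backoff between attempts
        delay *= 2
    raise RuntimeError(f"S3 unavailable after {retries} attempts: {url}")
```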
I know that Amazon strives to keep S3 and their other web services up as much as possible, and over time they have done an excellent job of it. S3 is still very young, and I’m sure that they’re tweaking and improving the service on the fly all the time.
This incident is by no means an indication of long-term stability. Just remember that there are no guarantees.
Update: Amazon continues to keep communication channels open and is taking strides to make sure that this doesn’t happen again. David Barth writes:
A short note to let you know that we are taking the outage this weekend very seriously, and that once things calm down here we will post something to this thread letting you know what steps we will be taking in the future to ensure this doesn’t happen again.
Update: David Barth gives us a more detailed update:
We were taking the low-load Saturday as an opportunity to perform some maintenance on the storage system, specifically on some very large (>100 million objects) buckets in order to obtain better load-balancing characteristics. Normally this procedure is entirely transparent to users and bucket owners. In this case, the re-balancing caused an internal transit link to become flooded, this cascaded into other network problems, and the system was made unavailable.
Read the full post for more on what Amazon is doing to prevent further outages.