Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages
Haryadi S. Gunawi, Mingzhe Hao, and Riza O. Suminto
University of Chicago

Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar
Surya University

Abstract

We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed 1247 headline news and public post-mortem reports that detail 597 unplanned outages that occurred within a 7-year span from 2009 to 2015. We analyzed outage duration, root causes, impacts, and fix procedures. This study reveals the broader availability landscape of modern cloud services and provides answers to why outages still take place even with pervasive redundancies.

Categories and Subject Descriptors C.4 [Computer Systems Organization]: Performance of Systems: Reliability, Availability, Serviceability

Not only do outages hurt customers, they also cause financial and reputational damage. Minutes of service downtime can cost hundreds of thousands, if not millions, of dollars in lost revenue [29, 36, 89]. A company's stock can plummet after an outage [111]. Sometimes, refunds must be given to customers as a form of apology [118]. As rivals always seek to capitalize on an outage [2], millions of users can switch to a competitor, a company's worst nightmare [62].

There is a large body of work that analyzes the anatomy of large-scale failures (e.g., root causes, impacts, time to recovery). Some work focuses on specific component failures such as server machines [158], virtual machines [126], network components [134, 157], storage subsystems [132, 141], software bugs [137], and job failures [129, 133, 146].