What Bugs Live in the Cloud? a Study of 3000+ Issues in Cloud Systems

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems HaryadiS.Gunawi,MingzheHao, JeffryAdityatama,KurniaJ. Eliazar, Tanakorn Leesatapornwongsa, Thanh Do∗∗ Agung Laksono, Jeffrey F. Lukman, and Tiratat Patana-anake Vincentius Martin, and Anang D. Satria University of University of Chicago∗ Wisconsin–Madison Surya University Abstract tectures have emerged every year in the last decade. Unlike single-server systems, cloud systems are considerably more We conduct a comprehensive study of development and de- complex as they must deal with a wide range of distributed ployment issues of six popular and important cloud systems components, hardware failures, users, and deployment sce- (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper narios. It is not a surprise that cloud systems periodically ex- and Flume). From the bug repositories, we review in total perience downtimes [7]. There is much room for improving 21,399 submitted issues within a three-year period (2011- cloud systems dependability. 2014). Among these issues, we perform a deep analysis of This paper was started with a simple question: why are 3655 “vital” issues (i.e., real issues affecting deployments) cloud systems not 100% dependable? In our attempt to pro- with a set of detailed classifications. We name the product of vide an intelligent answer, the question has led us to many our one-year study Cloud Bug Study database (CBSDB) [9], more intricate questions. Why is it hard to develop a fully with which we derive numerous interesting insights unique to reliable cloud systems? What bugs “live” in cloud systems? cloud systems. To the best of our knowledge, our work is the How should we properly classify bugs in cloud systems? Are largest bug study for cloud systems to date. there new classes of bugs unique to cloud systems? What types of bugs can only be found in deployment?Why existing 1 Introduction tools (unit tests, model checkers, etc.) cannot capture those bugs prior to deployment? And finally, how should cloud de- 1.1 Motivation pendability tools evolve in the near future? To address all the important questions above, cloud out- As the cloud computing era becomes more mature, vari- age articles from headline news are far from sufficient ous scalable distributed systems such as scale-out computing in providing the necessary details. Fortunately, as open frameworks [18, 40], distributed key-value stores [14, 19], source movements grow immensely, many cloud systems scalable file systems [23, 41], synchronization services [13], are open sourced, and most importantly they come with and cluster management services [31, 48] have become publicly-accessible issue repositories that contain bug re- a dominant part of software infrastructure running behind ports, patches, and deep discussions among the developers. cloud data centers. These “cloud systems” are in constant This provides an “oasis” of insights that helps us address our and rapid developments and many new cloud system archi- questions above. ∗All authors are listed in institution and alphabetical order. ∗∗Thanh Do is now with Microsoft Jim Gray Systems Lab. 1.2 Cloud Bug Study Permission to make digital or hard copies of all or part of this work for per- sonal or classroom use is granted without fee provided that copies are not This paper presents our one-year study of development made or distributed for profit or commercial advantage and that copies bear and deployment issues of six popular and important cloud this notice and the full citation on the first page. Copyrights for components systems: Hadoop MapReduce [3], Hadoop File System of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to (HDFS) [6], HBase [4], Cassandra [1], ZooKeeper [5], and redistribute to lists, requires prior specific permission and/or a fee. Request Flume [2]. Collectively, our target systems represent a di- permissions from [email protected]. verse range of cloud architectures. SOCC ’14, November 03 - 05 2014, Seattle, WA, USA We select issues submitted over a period of three years Copyright is held by the owner/author(s). Publication rights licensed to ACM. (1/1/2011-1/1/2014) for a total of 21,399 issues across our ACM 978-1-4503-3252-1/14/11...$15.00 target systems. We review each issue to filter “vital” issues http://dx.doi.org/10.1145/2670979.2670986 1 Appears in the Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC ’14) from “miscellaneous” ones. The former represents real issues preached extensively within the cloud community, but we affecting deployed systems while the latter involves non-vital still find 13% of the issues are caused by hardware failures. issues related to maintenance, code refactoring, unit tests, The complexity comes from various causes: hardware can documentation, etc.. In this paper, we only present findings fail in different ways (stop, corrupt, or “limp”), hardware can from vital issues. In total, there are 3655 vital issues that we fail at any time (e.g., during a complex operation), and re- carefully study. covery itself can see another failure. Dealing with hardware For each vital issue, we analyze the patches and all the de- failures at the software level remains a challenge (§5). veloper responses for the issue. We then categorize vital issues with several classifications. First, we categorize themby • Vexing software bugs: Cloud systems face a variety of aspect such as reliability, performance, availability, security, software bugs: logic-specific (29%), error handling (18%), data consistency, scalability, topology and QoS. Second, we optimization (15%), configuration (14%), data race (12%), introduce hardware type and failure mode labels for issues hang (4%), space (4%) and load (4%) issues. Exacerbating that involve hardware failures. All types of hardware, disks, the problem is the fact that each bug type can lead to al- network, memory and processors, can fail and they can fail in most all kinds of implication such as failed operations, per- different ways (stop, corrupt, or “limp”). Next, we dissect vi- formance degradation, component downtimes, data loss, stal- tal issues by a wide range of software bug types such as error eness, and corruption (§6). handling, optimization, configuration, data race, hang, space, load and logic bugs. Fourth, we also study the issues by im- • Availability first, correctness second: From our study, plication such as failed operations, performance problems, we make a conclusion that cloud systems favor availability component downtimes, data loss, staleness, and corruption. over correctness (e.g., reliability, data consistency). For ex- In addition to all of these, we also add bug scope labels to ample, we find cases where data inconsistencies, corruptions, measure bug impacts (a single machine, multiple machines, or low-level failures are detected, but ignored so that the sys- or the whole cluster). tem can continue running. Although this could be dangerous because the future ramifications are unknown, perhaps run- The product of our classifications is Cloud Bug Study ning with incorrectness “looks better” than downtimes; users DB (CBSDB), a set of classification text files, data mining often review systems based on clear performance and avail- scripts, and graph utilities [9]. In total, we have added 25,309 ability metrics (e.g., throughput, 99.9% availability) but not annotations for vital issues in CBSDB. The combination of on undefined metrics (e.g., hard-to-quantify correctness). cloud bug repositories and CBSDB enables us (and future CBSDB users) to perform in-depth quantitative and qualita- • The need for multi-dimensional dependability tools: As tive analysis of cloud issues. each kind of bugs can lead to many implications and vice versa (§5, §6), bug-finding tools should not be one dimen- 1.3 Findings sional. For example, a tool that captures error-handling mis- takes based on failed operations is an incomplete tool; error- From our extensive study, we derive the following important handling bugs can cause all kinds of implications (§6.1). findings that provide interesting insights into cloud systems Likewise, if a system attempts to ensure full dependability development and deployment. on just one axis (e.g., no data loss), the system must deploy all bug-finding tools that can catch all hardware and soft- • “New bugs on the block”: When brokendown by issue as- ware problems (§7). Furthermore, we find that many bugs are pects (§3), classical aspects such as reliability (45%), perfor- caused because of not only one but multiple types of prob- mance (22%), and availability (16%) are the dominant cat- lems. A prime example is distributed data races in conjunc- egories. In addition to this, we find “new” classes of bugs tion with failures (§6.3); the consequence is that data-race unique to cloud systems: data consistency (5%), scalability detectors must include fault injections. (2%), and topology (1%) bugs. Cloud dependability tools should evolve to capture these new problems. In summary, we perform a large-scale bug study of cloud systems and uncover interesting findings that we will present • “Killer bugs”: From studying large-scale issues, we find throughout the paper. We make CBSDB publicly avail- what we call as “killer bugs” (i.e., bugs that simultaneously able [9] to benefit the cloud community. Finally, to the best affect multiple nodes or the entire cluster). Their existence of our knowledge, we perform the largest bug study for cloud (139 issues in our study) implies that cascades of failures systems to date. happen in subtle ways and the no single point of failure prin- In the following sections, we first describe our methodol- ciple is not always upheld (§4). ogy (§2), then present our findings based on aspects (§3), bug scopes (§4), hardware problems (§5), software bugs (§6) and • Hardware can fail, but handling is not easy: “Hardware implications (§7).

What Bugs Live in the Cloud? a Study of 3000+ Issues in Cloud Systems

Improving Efficiency of Map Reduce Paradigm with ANFIS for Big Data (IJSTE/ Volume 1 / Issue 12 / 015)

Character-Word LSTM Language Models

Study and Analysis of Different Cloud Storage Platform

Large-Scale Youtube-8M Video Understanding with Deep Neural Networks

Sector and Sphere: the Design and Implementation of a High Performance Data Cloud Yunhong Gu University of Illinois at Chicago

Cloud Computing Synopsis and Recommendations

Mapreduce: Simplified Data Processing On

Big Data Challenges and Hadoop As One of the Solution of Big Data with Its Modules Tapan P

The Pagerank Algorithm and Application on Searching of Academic Papers

Mapreduce: a Flexible Data Processing Tool

Mapreduce: Simplified Data Processing on Large Clusters Jeffry Dean and Sanjay Ghemawat ODSI 2004 Presented by Anders Miltner Problem

Full Reading List