High Availability in Clouds: Systematic Review and Research Challenges Patricia T

Endo et al. Journal of Cloud Computing: Advances, Systems Journal of Cloud Computing: and Applications (2016) 5:16 DOI 10.1186/s13677-016-0066-8 Advances, Systems and Applications REVIEW Open Access High availability in clouds: systematic review and research challenges Patricia T. Endo1,2*, Moisés Rodrigues2, Glauco E. Gonçalves2,3, Judith Kelner2, Djamel H. Sadok2 and Calin Curescu4 Abstract Cloud Computing has been used by different types of clients because it has many advantages, including the minimization of infrastructure resources costs, and its elasticity property, which allows services to be scaled up or down according to the current demand. From the Cloud provider point-of-view, there are many challenges to be overcome in order to deliver Cloud services that meet all requirements defined in Service Level Agreements (SLAs). High availability has been one of the biggest challenges for providers, and many services can be used to improve the availability of a service, such as checkpointing, load balancing, and redundancy. Beyond services, we can also find infrastructure and middleware solutions. This systematic review has as its main goal to present and discuss high available (HA) solutions for Cloud Computing, and to introduce some research challenges in this area. We hope this work can be used as a starting point to understanding and coping with HA problems in Cloud. Keywords: Cloud computing, High availability, Systematic review, Research challenges Introduction in $336,000 less revenue per hour. Paypal, the online pay- Cloud Computing emerged as a novel technology at the ment system, experiences in a revenue loss of $225,000 end of the last decade, and it has been a trending topic ever per hour. To mitigate the outages, Cloud providers have since. The Cloud can be seen as a conceptual layer on the been focusing on ways to enhance their infrastructure Internet, which makes all available software and hardware and management strategies to achieve high available (HA) resources transparent, rendering them accessible through services. a well-defined interface. Concepts like on-demand self- According to [5] availability is calculated as the percent- service, broad network access, resource pooling [1] and age of time an application and its services are available, other trademarks of Cloud Computing services are the given a specific time interval. One achieves high avail- key components of its current popularity. Cloud Com- ability (HA) when the service in question is unavailable puting attracts users by minimizing infrastructure invest- less than 5.25 minutes per year, meaning at least 99.999 % ments and resource management costs while presenting a availability ("five nines"). In [5], authors define that HA flexible and elastic service. Managing such infrastructure systems are fault tolerant systems with no single point of remains a great challenge, considering clients’ require- failure; in other words, when a system component fails, it ments for zero outage [2, 3]. does not necessarily cause the termination of the service Service downtime not only negatively effects in user provided by that component. experience but directly translates into revenue loss. A Delivering a higher level of availability has been one report [4] from the International Working Group on of the biggest challenges for Cloud providers. The pri- Cloud Computing Resiliency (IWGCR)1 gathers informa- mary goal of this work is to present a systematic review tion regarding services downtime and associated revenue and discuss the state-of-the-art HA solutions for Cloud losses. It points out that Cloud Foundry2 downtime results Computing. The authors hope that the observation of such solutions could be used as a good starting point to *Correspondence: [email protected] addressing with some of the problems present in the HA 1University of Pernambuco (UPE), BR 104 S/N, Caruaru, Brazil Cloud Computing area. Full list of author information is available at the end of the article © 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Endo et al. Journal of Cloud Computing: Advances, Systems and Applications (2016) 5:16 Page 2 of 15 This work is structured as follows: “Cloud outages” Consequently, they decided to add validation checks for section describes some Cloud outages that occurred configurations, improve detection, and diagnose service in 2014 and 2015, and how administrators overcame failure. these problems; “Systematic review” section presents the methodology used to guide our systematic review; Google Apps “Overview of high availability in Clouds” section The Google Apps Team schedules maintenance on data presents an overview regarding HA Cloud solutions; center systems regularly and some procedures involve “Results description” section describes works about upgrading groups of servers and redirecting the traffic HA services based on our systematic review result; to other available servers. Typically, these maintenance “Discussions” section discusses some research challenges procedures occur in the background with no impact on in this area; and “Final considerations” section delineates users. However, due to a miscalculation of memory usage, final considerations. on March 17th, 2014 the new set of backend servers lacked of sufficient capacity to process the redirected Cloud outages traffic. These backend servers could not process the vol- Cloud Computing has become increasingly essential to ume of incoming requests and returned errors for about the live services offered and maintained by many com- three hours. panies. Its infrastructure should attend to unpredictable The Google Engineering team said that they will “con- demand and should always be available (as long as possi- tinue work in progress to improve the resilience of Hangouts ble) to end-clients. However, assuring high availability has service during high load conditions”. been a major challenge for Cloud providers. To illustrate this issue, we describe four (certainly among many) exam- Verizon Cloud ples of Cloud services outages that occurred in 2014 and Verizon Cloud4 is a Cloud provider that offers backup 2015: and synchronization data to its clients. On January 10th, 2015 Verizon provider suffered a long outage Dropbox of approximately 40 hours over a weekend. The out- Dropbox’s Head of Infrastructure, Akhil Gupta, explained age occurred due to a system maintenance procedure that their databases have one master and two replica which, ironically, had been planned to prevent future machines for redundancy, and full and incremen- outages. tal data backups are performed regularly. However, So, as we can see, Cloud outages can occur from dif- on January 10th, 20143, during a planned mainte- ferent causes and can be fixed using different strate- nance scheduled intended to upgrade the Operat- gies. However, in most cases, in addition to the loss ing System on some machines, a bug in the script of revenue, such service disruptions pushed Cloud caused the command to reinstall a small number of providers to rethink their management strategies and active machines. Unfortunately, some master-replica pairs sometimes to re-design their Cloud infrastructure design were impacted which resulted in the service going altogether. down. Financial losses due to Cloud outages foment studies To restore it, they performed the recovery from backups about HA solutions, in order to minimize outages for within three hours, but the large size of some databases Cloud providers. In the next Section, we describe the delayed the recovery. The lesson learned from this episode systematic review approach that we used to undertake was the need to add a layer to perform distributed state research about HA solutions. verification and speed up data recovery. Systematic review Google services In this work, we adapted the systematic review proposed Some Google services, such as Gmail, Google Calendar, by [6], in order to find strategies that address HA Clouds. Google Docs, and Google+, were unavailable on Jan- Next, we describe each activity (see Fig. 1) in detail and uary 24th, 2014, for about 1 hour. According to Google describe how we address it. Engineer, Ben Treynor, “an internal system that gener- ates configurations - essentially, information that tells Activity 1: identify the need for the review other systems how to behave - encountered a software As stated previously, high availability in Clouds remains bug and generated an incorrect configuration. The incor- a big challenge for providers since Cloud infrastructure rect configuration was sent to live services over the systems are very complex and must address different ser- next 15 minutes, caused users’ requests for their data vices with different requirements. In order to reach a to be ignored, and those services, in turn, generated certain level of high availability, a Cloud provider should errors”. monitor its resources and deployed services continuously. Endo et al. Journal of Cloud Computing: Advances, Systems and Applications (2016) 5:16 Page 3 of 15 Fig. 1 Systematic review process With information about resources and service behaviors Activity 4: define sources of research available, a Cloud provider could make good management For this work, we chose the following databases: IEEE decisions in order to avoid outages or failures. Xplore5,ScienceDirect6, and ACM Digital Library7. Activity 2: define research questions Activity 5: define criteria for inclusion and exclusion In this activity, we need to define which questions we want In order to limit the scope of this analysis, we considered to answer. The main goal of this work is to answer the only journals and conferences articles published between following research questions (RQ): 2010 and 2015. The keywords “cloud computing” and “middleware” or “framework” were required to be in the • RQ.1: What is the current state-of-the-art in HA article.

High Availability in Clouds: Systematic Review and Research Challenges Patricia T

Openicra : Vers Un Modèle Générique De Déploiement Automatisé Des Applications Dans Le Nuage Informatique

Kubernetes As an Availability Manager for Microservice Based Applications Leila Abdollahi Vayghan

System Design for Telecommunication Gateways

Protection Group (PG): This Is a Dynamic Entity That Informally Represents the Groups of Components to Which a CSI Has Been Assigned

Kubernetes As an Availability Manager for Microservice Applications

Introduction Minimizing Planned Downtime of SAP Systems with the Virtualization Technologies & Systematic Review and Resear

Opensaf and Vmware from the Perspective of High Availability

Combining Opensaf with Vmware Architecture, Measurement

Third Party Software Notices And/Or Additional Terms and Conditions

IP Multimedia for Municipalities

Opensaf Release 4 Overview “The Architecture Release”

Administració Avançada Del Sistema Operatiu GNU/Linux