IAGREE: Infrastructure-agnostic Resilience Benchmark Tool for Cloud Native Platforms

Paulo Souza, Wagner Marques, Rômulo Reis and Tiago Ferreto
Polytechnic School, Pontifical Catholic University of Rio Grande do Sul, Ipiranga Avenue, 6681 - Building 32, Porto Alegre, Brazil

Keywords: Cloud Computing, Cloud Native Platform, Resilience, Benchmark.

Abstract: The cloud native approach is becoming more and more popular with the proposal of decomposing applications into small components called microservices, which are designed to minimize the costs of upgrades and maintenance and to increase resource usage efficiency. However, the microservices architecture brings some challenges, such as preserving the manageability of the platforms, since the greater the number of microservices an application has, the greater the complexity of ensuring that everything is working as expected. In this context, one of the concerns is evaluating the resilience of platforms. Current resilience benchmark tools are designed to run on specific infrastructures. Thus, in this paper we present IAGREE, a benchmark tool designed for measuring multiple resilience metrics in cloud native platforms based on Cloud Foundry and running on any infrastructure.

1 INTRODUCTION

Cloud computing is becoming more and more popular among organizations of all sizes by providing benefits such as the convenience of running large-scale applications with no need to care about issues involving local resources. Moreover, the cloud also allows better resource utilization through the pay-per-use pricing model, wherein cloud resources can be allocated and deallocated on demand, so users only pay for the resources they actually use (Nicoletti, 2016). By adopting such a strategy, customers avoid issues related to underprovisioning, in cases when an application gets more popular than expected, resulting in revenue losses due to not having enough computing resources to meet the demand, and overprovisioning, where applications do not meet the expectations and the reserved resources end up not being used (Armbrust et al., 2010).

However, not every application architecture exploits the maximum potential of cloud environments. For example, the monolithic architecture, in which the application logic is contained in a single deployable unit, has shown to be a suitable solution for small systems. However, as the size of the application increases, tasks involving scaling or performing maintenance become more difficult due to challenges related to code readability and over-commitment to a specific technology stack. In this context, the Cloud Native approach arises with the proposal of taking advantage of the microservices architecture to allow on-premises applications to fully exploit the benefits of cloud computing by decomposing applications into microservices that can be deployed and scaled independently and do not need to be built with the same technology stack (Balalaie et al., 2015).

Cloud native platforms offer an additional abstraction layer over the infrastructure through the adoption of the PaaS model. These platforms aim to simplify the build, deployment, and management of cloud native applications, which are designed to exploit the advantages of the cloud computing delivery model.

Despite the benefits brought by cloud native, there are still concerns regarding topics like resilience, which is the ability to deliver and maintain a certain level of service despite the failure of one or several system components (Abbadi and Martin, 2011). Resilience issues in cloud services can lead to several negative consequences: for example, system instability, bottlenecks, or downtime caused by unexpected workload can lead to business revenue losses (CA Technologies, Inc, 2018).




Moreover, applications without proper resilience schemes can demand considerable costs with repair and replacement of cloud components or data loss (Vishwanath and Nagappan, 2010). In cases of applications that do not employ efficient resilience schemes, these kinds of failures can cause SLA violations, which can negatively affect all the levels of cloud consumers: final customers will have problems trying to access the cloud services, cloud service providers will suffer customer attrition, and cloud platform and cloud infrastructure providers will incur penalties for non-compliance with the agreed SLAs.

One of the possible steps to improve aspects such as the resilience of a cloud native platform is monitoring its behavior under different circumstances. In this context, benchmarks play a significant role, since they can be used to measure and compare different software or hardware configurations. Besides, these tools can also be used to implement improvements in the platform based on the collected metrics. Several benchmark tools have been proposed; however, there is a lack of benchmark tools designed to measure the resilience of cloud native platforms.

In this sense, this paper presents a benchmark tool designed for measuring the resilience of cloud native platforms. Based on this purpose, we present the following contributions:

• A review of the benefits and challenges brought by the cloud native approach.

• A discussion around the relevance of measuring resilience in cloud native platforms and the metrics that can be analyzed to achieve this purpose.

• An infrastructure-agnostic resilience benchmark tool to measure the resilience of cloud native platforms based on Cloud Foundry.

The remainder of this article is organized as follows: In Section 2 we present the theoretical background on current topics in cloud computing, such as the need for strategies, addressed by approaches like cloud native, to improve the resource usage efficiency of applications running on the cloud. In Section 3 we present the proposed benchmark tool, which is validated in Section 4, where we present a Proof-of-Concept that includes the use of our proposal to perform experiments on a real-world cloud native platform. Then, in Section 5 we present other benchmark tools and compare them with our proposal, and in Section 6 we present final considerations and research topics to be addressed in future work.

2 BACKGROUND

Cloud computing consists of a model that provides ubiquitous, configurable, on-demand access to computing resources, which include networks, servers, storage, applications, and services, through the Internet (Mell et al., 2011). It also provides cost reduction, since customers pay just for the resources that are consumed (Nicoletti, 2016). Due to these features, many organizations employ cloud services to meet the peak demand of their applications through services in the cloud (Gajbhiye and Shrivastva, 2014).

When dealing with on-premises software, customers must manage all levels of abstraction to get their applications up and running, dealing with matters of networking, storage, servers, runtimes, and so on. On the other hand, cloud computing provides service models that focus on delivering different levels of abstraction to meet the needs of customers more efficiently.

There are still many challenges in Cloud Computing. For instance, the interoperability among different cloud services is still a big challenge. Also, Cloud Computing is powered by virtualization, which provides several independent instances of virtual execution platforms, often called virtual machines (VMs). However, applications have lower performance on VMs than on physical machines. The overhead generated by the virtualization layer (hypervisor) is one of the leading concerns in the context of virtualized environments. Therefore, identifying and reducing such overhead has been the subject of several investigations (Li et al., 2017; SanWariya et al., 2016; Chen et al., 2015).

In this context, the Cloud Native approach emerges with the proposal of decomposing applications into small components designed to operate independently, called microservices. Each microservice is executed in a container, so it is possible to upgrade or replace a microservice without having to modify the entire application. Such modularity allows minimizing the costs of upgrades and maintenance and increasing resource usage efficiency (Amogh et al., 2017). On the other hand, running microservices in separate containers brings some challenges, such as preserving the manageability of the platforms, since the higher the number of containers an application has, the greater the complexity of ensuring that everything is working as expected. Due to these challenges, there is a concern with increasing customers' trust in the services provided.

Service Level Agreements (SLAs), which are contractual agreements between service providers and their customers, play a significant role in this context, since they specify the quality of service (QoS) guaranteed by the providers, so customers can verify whether the delivered QoS matches the one that was agreed upon (Patel et al., 2009).


One element considered in SLAs is resilience, which is the ability to deliver and maintain a certain level of service despite the failure of one or several system components (Abbadi and Martin, 2011).

There are many elements that can affect the operation of cloud services and must be considered when building resilience mechanisms. One of the current leading causes of failures in cloud computing services is human error, which is characterized by failures caused by human actions (e.g., a cloud operator accidentally erases a database registry required by a vital system component). Other causes of system failures are disasters (like earthquakes and tornadoes) or software failures (bugs or errors caused by malicious software), which can also lead to critical failure scenarios (Colman-Meixner et al., 2016).

Resilience issues in cloud services can lead to several negative consequences: for example, system instability, bottlenecks, or downtime caused by unexpected workload can lead to business revenue losses. Moreover, applications without proper resilience schemes can demand considerable costs with repair and replacement of cloud components or data loss (Vishwanath and Nagappan, 2010). In cases of applications that do not employ efficient resilience schemes, these kinds of failures can cause SLA violations, which can negatively affect all the levels of cloud consumers: final customers will have problems trying to access the cloud services; cloud service providers will suffer customer attrition; and cloud platform and cloud infrastructure providers will incur penalties for non-compliance with the agreed SLAs.

Cloud Computing improves on the traditional IT model by providing dynamic resource provisioning instead of delivering computational power according to the peak demand (which leads to unnecessary costs with computing resources). However, failures are an undesired but inherent aspect of scaling applications in the cloud (Zhang et al., 2010). Moreover, infrastructure matters such as the adopted virtualization strategies can impact the resilience of applications.

Given that, one of the key aspects of achieving resilience in cloud environments is building a resilient infrastructure that implements failure prevention techniques, like the isolation of applications' layers and components, so that even if some components get compromised the rest of the system will be able to operate normally (Salapura et al., 2013). In this context, the relevance of measuring resilience stands out, since it is important to verify the efficiency of the implemented resilience strategies. Accordingly, as shown in Figure 1, some resilience metrics should be taken into account to measure a system or a platform (the sketch after this list illustrates how they relate to each other):

• Mean Time To Failure (MTTF): often known as uptime, MTTF refers to the average amount of time that a system is available. MTTF is widely used in SLA contracts to indicate how reliable a system or a platform is, based on the amount of time it has spent without failing.

• Mean Time To Detect (MTTD): refers to the average amount of time it takes to detect a failure. This metric is useful to measure the efficiency of monitoring and incident management tools.

• Mean Time To Repair (MTTR): this metric denotes the average time spent to repair a failed system. MTTR is a valuable metric to find ways to mitigate downtime.

• Mean Time Between Failures (MTBF): refers to the average time that elapses between failures. It is a useful metric to predict the reliability and availability of a platform accurately.
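To make the relationship between these measures concrete, the following minimal sketch (not part of IAGREE; the incident timestamps are illustrative assumptions) computes the four metrics from the instants at which failures occurred, were detected, and were repaired during a run.

from statistics import mean

# Each incident records the instants (in seconds since the run started)
# at which the failure happened, was detected, and was repaired.
# The values below are illustrative only.
incidents = [
    {"failed": 30.0, "detected": 30.4, "repaired": 42.0},
    {"failed": 75.0, "detected": 75.9, "repaired": 95.0},
    {"failed": 130.0, "detected": 130.5, "repaired": 141.0},
]

# MTTD: average time between a failure and its detection.
mttd = mean(i["detected"] - i["failed"] for i in incidents)

# MTTR: average time from a failure until operations resume
# (it therefore includes the detection time).
mttr = mean(i["repaired"] - i["failed"] for i in incidents)

# MTTF: average uptime, i.e., from one repair to the next failure.
mttf = mean(nxt["failed"] - cur["repaired"]
            for cur, nxt in zip(incidents, incidents[1:]))

# MTBF: average time between consecutive failures,
# spanning one full failure-repair-uptime cycle.
mtbf = mean(nxt["failed"] - cur["failed"]
            for cur, nxt in zip(incidents, incidents[1:]))

print(f"MTTD={mttd:.1f}s MTTR={mttr:.1f}s MTTF={mttf:.1f}s MTBF={mtbf:.1f}s")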
3 IAGREE ARCHITECTURE DESIGN

IAGREE is an infrastructure-agnostic resilience benchmark tool for cloud native platforms. It allows the user to measure the resilience of Cloud-Foundry based platforms by simulating failure scenarios in a sample REST application while it receives user requests (Figure 2). Our tool uses commands provided by the Cloud Foundry Command Line Interface (cf CLI), which is used by Cloud Foundry and the platforms based on it and allows the deployment, management, and monitoring of applications. This makes IAGREE compatible with any platform based on CF. Besides that, the benchmark tool also inherits Cloud Foundry's infrastructure-agnostic characteristic.

IAGREE uses features provided by a Cloud Foundry component called Loggregator, which is responsible for collecting and streaming logs from the applications and the platform itself. Our proposal takes advantage of Loggregator's architecture to analyze the resilience provided by the platform while applications are receiving requests from multiple users and failure events occur. To simulate real user accesses, IAGREE generates multiple concurrent requests using Siege, an open source tool that allows simulating HTTP requests from multiple users.
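To give a concrete picture of the kind of sample REST application and crash behavior described above, the sketch below shows a minimal web application with a route that terminates its own instance when requested. This is a hypothetical stand-in written with Flask; the route names are assumptions, not the actual sample application shipped with IAGREE.

import os
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Regular route hit by the simulated users (e.g., by Siege).
    return "OK", 200

@app.route("/crash")
def crash():
    # Kill the whole process immediately, simulating an instance failure;
    # the platform's health management is then expected to restart it.
    os._exit(1)

if __name__ == "__main__":
    # Cloud Foundry style platforms inject the listening port via PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))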


Figure 1: Different measures covered by resilience metrics. (A timeline running from one system failure, through the beginning of the repair and the resumption of operations, to the next system failure, over which MTTD, MTTR, MTTF, and MTBF are measured.)

while time_current <= time_total do
    send N simultaneous requests
    foreach failure instant f_i ∈ F do
        send a request that will crash an app instance
    end
end

Algorithm 1: Strategy adopted by the proposed benchmark to measure the resilience of cloud native platforms.
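A rough Python rendering of Algorithm 1 is sketched below, under the assumption that the load is generated by invoking Siege from the command line (its -c and -t options set the number of concurrent users and the test duration) while the crash requests are sent with the requests library. The application URL, the crash route, and the parameter values are illustrative assumptions.

import subprocess
import time
import requests

APP_URL = "https://sample-app.example.com"   # assumed route of the deployed sample app
CRASH_URL = APP_URL + "/crash"               # assumed route configured to crash an instance

total_time_s = 10 * 60     # total benchmark duration
concurrent_users = 100     # N simultaneous simulated users
failures_per_minute = 2    # crash requests to send per minute

# Start Siege in the background to generate the simulated user load
# for the whole duration of the test (e.g., "10M" means 10 minutes).
siege = subprocess.Popen(
    ["siege", "-c", str(concurrent_users), "-t", f"{total_time_s // 60}M", APP_URL]
)

start = time.time()
interval = 60.0 / failures_per_minute
while time.time() - start <= total_time_s:
    # At each failure instant, send a request that crashes one app instance;
    # since requests go through the load balancer, any instance may stop.
    try:
        requests.get(CRASH_URL, timeout=5)
    except requests.RequestException:
        pass  # the crashed instance may not answer; that is expected
    time.sleep(interval)

siege.wait()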

Figure 2: IAGREE benchmark tool design. (The benchmark client sends requests to the Cloud Foundry load balancer, which distributes them across the app instances, and collects the application logs from the Loggregator.)

As illustrated in Figure 3, upon running IAGREE, the first step is to ask for the API endpoint and credentials (email and password) in order to establish a connection to the platform. As soon as the benchmark confirms that the cf CLI was able to authenticate the user into a Cloud Foundry environment, it deploys a sample application to receive the requests. Once the application deployment is finished, the benchmark starts a log stream from the Loggregator and asks the user to define three parameters: i) the amount of time the application will be tested; ii) how many simulated users will try to access the application simultaneously; and iii) how many failures per minute the application will experience. Once those parameters are informed, the benchmark passes the desired configuration to Siege and the tests start. In the meantime, IAGREE sends requests to a route of the application which is configured to crash the whole instance; since there is a load balancer component, any instance can stop. This process is described in Algorithm 1. As the benchmark finishes running the tests, it parses the data from the log stream and presents the results denoted by the following metrics: MTTF, MTBF, MTTD, MTTR, mean response time per request, transactions per second, and throughput.

4 PROOF-OF-CONCEPT EXPERIMENTS

We conducted a proof-of-concept running our proposal on Pivotal Web Services (Pivotal Web Services, 2018), which is a Cloud-Foundry based platform that uses EC2 servers located in Northern Virginia (United States). We chose PWS since it automatically handles distributing the multiple instances of an application across multiple availability zones. We performed tests in 8 different configurations, gathering in each of them MTTF, MTBF, MTTD, MTTR, mean response time per request (s), transactions per second, and throughput (MB/s). Table 1 summarizes the set of configurations explored by our experiments.

Table 1: Tested specifications.

App Instances   Total Time   Failures per Minute   Concurrent Users
2               10 minutes   2                     50
2               10 minutes   2                     100
2               10 minutes   4                     50
2               10 minutes   4                     100
4               10 minutes   2                     50
4               10 minutes   2                     100
4               10 minutes   4                     50
4               10 minutes   4                     100
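These metrics are derived from the timestamps of events observed in the streamed application logs. The sketch below outlines one way such a stream could be consumed, by tailing cf logs in a subprocess and recording when assumed failure and recovery markers appear; the marker strings are placeholders and would have to be adapted to the platform's actual Loggregator message format.

import subprocess
import time

APP_NAME = "iagree-sample-app"  # assumed application name

# Placeholder markers; the real Loggregator messages differ and would
# need to be adapted to the platform's actual log format.
FAILURE_MARKER = "app instance exited"
RECOVERY_MARKER = "Container became healthy"

failure_instants, recovery_instants = [], []

# "cf logs <app>" tails the application's log stream until interrupted;
# in a real run this loop would be stopped once the test duration elapses.
with subprocess.Popen(["cf", "logs", APP_NAME],
                      stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        now = time.time()
        if FAILURE_MARKER in line:
            failure_instants.append(now)
        elif RECOVERY_MARKER in line:
            recovery_instants.append(now)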


Figure 3: IAGREE process flow (log into the infrastructure; if the login is successful, deploy the benchmark application; if the deployment is successful, create the log stream and configure the benchmark parameters; once the parameters are valid, execute the resilience benchmark tests; finally, parse and present the benchmark results).

4.1 Mean Time To Failure

The results presented in Figure 4 show that the impact on an application's availability caused by the number of failures per minute varies according to the number of instances the application has, especially as the number of users accessing the application increases. In other words, the higher the number of users accessing the application, the higher the relevance of scaling mechanisms, since if an application crashes due to insufficient resources, its availability could be severely impacted. Moreover, in the scenarios with 2 failures per minute, the application with 2 instances suffered a 44% availability loss as the number of users accessing it increased from 50 to 100. On the other hand, the application with 4 instances only had an 8% availability loss. These results indicate that the impact caused by a given number of failures can be significantly higher in applications with few instances.

4.2 Mean Time To Detect

The results in Figure 5, regarding the average time that the platform took to detect failures, show that in scenarios where applications are suffering from several constant crashes, the higher the number of users accessing the application, the greater the time the platform takes to detect the failures. Especially in scenarios where failures do not crash the entire instance (e.g., the web server stops working, but the container where it is running continues to work) and are therefore undetectable by instance healing tools, monitoring mechanisms must parse more massive amounts of application logs to detect failures as the number of requests increases, which could lead to delayed detection of application failures. Besides, detecting the failures in the 4-instance application took 12% longer, since the higher the number of instances the app has, the higher the complexity of ensuring that everything is working as expected.

4.3 Mean Time Between Failures

The results presented in Figure 6 highlight the correlation between the number of application instances and the frequency of failure occurrences. In general terms, the mean time between failures is highly dependent on the number of failures per minute specified to the benchmark. However, as we can see by comparing the results of 2 and 4 instances, the bigger the number of instances an application has, the smaller the interval between failures. This happens because load balancers have a buffer that stores the incoming requests to be distributed across the application instances, so in scenarios with multiple simultaneous requests, increasing the number of application instances reduces the delay for a request to be processed. As the proposed benchmark simulates an application failure by sending a request to a specific application URL configured to crash the web server, increasing the number of instances of the sample application resulted in less waiting time for those requests to be processed, which in turn reduced the interval between failures.

4.4 Mean Time To Repair

According to the results presented in Figure 7, in the experiments with the 2-instance application, increasing the number of failures per minute from 2 to 4 caused a delay in the platform's repair of the application failures. However, such a phenomenon does not appear in the experiments with the 4-instance application. Besides, considering that MTTR covers the time to detect failures (see Figure 1), the results also highlight the negative impact of having too frequent failures in applications with few instances, since the 2-instance application took longer to be repaired despite demanding less time to detect failures than the 4-instance application.


Figure 4: Mean Time To Failure (s) for the 2-instance and 4-instance applications, by failures per minute (2 and 4) and concurrent users (50 and 100).

Figure 5: Mean Time To Detect (s) for the 2-instance and 4-instance applications, by failures per minute (2 and 4) and concurrent users (50 and 100).

Figure 6: Mean Time Between Failures (s) for the 2-instance and 4-instance applications, by failures per minute (2 and 4) and concurrent users (50 and 100).

Figure 7: Mean Time To Repair (s) for the 2-instance and 4-instance applications, by failures per minute (2 and 4) and concurrent users (50 and 100).

5 RELATED WORK

Rally (Rally, 2018) provides a framework for evaluating the performance of OpenStack components as well as full production OpenStack cloud deployments. Rally automates and unifies the deployment of multiple OpenStack nodes, cloud verification, testing, and profile creation. Rally does this generically, making it possible to check whether OpenStack will meet the business requirements.

CloudCMP (Li et al., 2010) is a benchmark tool developed to provide a systematic comparison of the performance and cost of cloud providers.


CloudCMP measures the elasticity, persistent storage, and networking services offered by a cloud, along with metrics that directly reflect the performance of customer applications. This tool was tested on various cloud providers, including Amazon Web Services, Microsoft Azure, Google App Engine, and Rackspace CloudServers (Li et al., 2010).

HiBench (HiBench, 2018) is an open source benchmark suite for Hadoop. This tool consists of synthetic micro-benchmarks and real-world applications. The suite currently has several workloads classified into four categories: Microbenchmarks, Web Search, Machine Learning, and Data Compression. The microbenchmarks provide several algorithms provided by Hadoop, such as Sort, WordCount, Sleep, enhanced DFSIO, and TeraSort. These benchmarks are used to represent a subset of real-world MapReduce jobs. Web Search uses the PageRank and Nutch indexing benchmark tools, since large-scale search indexing is one of the most significant uses of MapReduce. The machine learning workloads provide Bayesian Classification and K-means Clustering implementations. It is also worth noting that HiBench provides several other machine learning benchmark alternatives, such as Linear Regression and Gradient Boosting Trees.

CloudSuite (CloudSuite, 2018) is an open source benchmark tool for cloud computing focused on providing scalability and performance evaluation on real-world setups. CloudSuite has a suite of benchmarks that represent massive data manipulation with tight latency constraints, such as in-memory data analytics. Among the tasks included in this set of benchmarks, there are MapReduce, media streaming, SAT solving, web hosting, and web search tasks. CloudSuite is compatible with private and public cloud platforms, and the authors emphasize that it is integrated into Google's PerfKit Benchmarker1, which helps to automate the benchmarking process.

Yahoo! Cloud Serving Benchmark (Cooper et al., 2010) can be used to perform elasticity and scalability evaluation on different cloud platforms such as Amazon AWS, Microsoft Azure, and Google App Engine. The authors defined two benchmark tiers for evaluating cloud serving systems: the first tier focuses on the latency of requests when the database is under load, aiming to characterize this trade-off for each database system by measuring performance as the user requests increase. The second tier focuses on examining the impact on performance as more machines are added to the system. In this tier, there are two metrics: Scaleup and Elastic Speedup.

Chaos Monkey (Chaos Monkey, 2018) is a benchmark tool created by Netflix that randomly terminates instances of virtual machines and containers inside the production environment to simulate unexpected failure events. Chaos Monkey follows the principles of chaos engineering. It was designed to work with any backend supported by Spinnaker, which is an open source, multi-cloud continuous delivery platform for releasing software changes with velocity and confidence.

To the best of our knowledge, our proposal (IAGREE) is the only infrastructure-agnostic benchmark tool that allows measuring the resilience of any cloud native platform based on Cloud Foundry. Table 2 summarizes the differences between IAGREE and the previous work.

6 CONCLUSIONS

Companies have adopted cloud computing because of the functionalities it provides in a simple way and also because its cost is lower than that of an on-premise solution. The cloud native approach was driven by cloud computing and container technology, though there is still resistance to migrating to cloud native platform solutions. One of the reasons for this resistance is the lack of a reliable way to evaluate whether a platform provides the resilience required by the applications.

Thus, in this paper, a resilience benchmark tool for cloud native platforms called IAGREE was proposed. This tool supports platforms based on Cloud Foundry, like the Pivotal Application Service. Besides, it inherits the ability of Cloud Foundry to be agnostic to the infrastructure provider, running both on the industry's top cloud service providers (AWS, GCP, Microsoft Azure) and on-premise.

A set of experiments was also performed to validate the proposed tool, using Pivotal Web Services as a platform provider. The results obtained by IAGREE were analyzed based on the following resilience metrics: Mean Time To Repair, Mean Time Between Failures, Mean Time To Detect, and Mean Time To Failure. As future work, we intend to perform experiments on a larger scale and with different configurations. We also intend to execute them using different infrastructure providers in order to compare their performance.

1PerfKit Benchmarker. Available at: .


Table 2: Comparison between IAGREE and other cloud benchmark tools.

Benchmark                        Evaluated Metrics             Output         Supported Infrastructures
Rally                            Elasticity                    CLI and HTML   Infrastructure Agnostic
Cloud CMP                        Elasticity                    JSON           Amazon Web Services, Microsoft Azure, Amazon Elastic Compute Cloud
HiBench                          Scalability and Performance   CLI            Microsoft Azure
Cloud Suite                      Scalability and Performance   CLI            Infrastructure Agnostic
Yahoo! Cloud Serving Benchmark   Elasticity and Scalability    CLI and HTML   Amazon Web Services, Microsoft Azure, Google Cloud Platform
Chaos Monkey                     Resilience                    HTML           Amazon Web Services, Microsoft Azure, Google Cloud Platform, Spinnaker-based Platforms
IAGREE                           Resilience                    CLI and JSON   Infrastructure Agnostic

ACKNOWLEDGEMENTS

This work was supported by the PDTI Program, funded by Dell Computadores do Brasil Ltda (Law 8.248/91).

REFERENCES

Abbadi, I. M. and Martin, A. (2011). Trust in the cloud. Information Security Technical Report, 16(3-4):108–114.

Amogh, P., Veeramachaneni, G., Rangisetti, A. K., Tamma, B. R., and Franklin, A. A. (2017). A cloud native solution for dynamic auto scaling of MME in LTE. In Personal, Indoor, and Mobile Radio Communications (PIMRC), 2017 IEEE 28th Annual International Symposium on, pages 1–7. IEEE.

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., et al. (2010). A view of cloud computing. Communications of the ACM, 53(4):50–58.

Balalaie, A., Heydarnoori, A., and Jamshidi, P. (2015). Migrating to cloud-native architectures using microservices: an experience report. In European Conference on Service-Oriented and Cloud Computing, pages 201–215. Springer.

CA Technologies, Inc (2018). Available at: https://bit.ly/2BVZhsx. Accessed: 2018-12-24.

Chaos Monkey (2018). Chaos Monkey. Available at: https://github.com/Netflix/chaosmonkey. Accessed: 2018-12-24.

Chen, L., Patel, S., Shen, H., and Zhou, Z. (2015). Profiling and understanding virtualization overhead in cloud. In Parallel Processing (ICPP), 2015 44th International Conference on, pages 31–40. IEEE.

CloudSuite (2018). CloudSuite. Available at: https://github.com/parsa-epfl/cloudsuite. Accessed: 2018-12-24.

Colman-Meixner, C., Develder, C., Tornatore, M., and Mukherjee, B. (2016). A survey on resiliency techniques in cloud computing infrastructures and applications. IEEE Communications Surveys & Tutorials, 18(3):2244–2281.

Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. (2010). Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143–154. ACM.

Gajbhiye, A. and Shrivastva, K. M. P. (2014). Cloud computing: Need, enabling technology, architecture, advantages and challenges. In Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference, pages 1–7. IEEE.

HiBench (2018). HiBench suite. Available at: https://github.com/intel-hadoop/HiBench. Accessed: 2018-12-24.

Li, A., Yang, X., Kandula, S., and Zhang, M. (2010). CloudCmp: comparing public cloud providers. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pages 1–14. ACM.

Li, Z., Kihl, M., Lu, Q., and Andersson, J. A. (2017). Performance overhead comparison between hypervisor and container based virtualization. In Advanced Information Networking and Applications (AINA), 2017 IEEE 31st International Conference on, pages 955–962. IEEE.

Mell, P., Grance, T., et al. (2011). The NIST definition of cloud computing.

Nicoletti, B. (2016). Cloud computing and procurement. In Proceedings of the International Conference on Internet of Things and Cloud Computing, page 56. ACM.

Patel, P., Ranabahu, A. H., and Sheth, A. P. (2009). Service level agreement in cloud computing.

Pivotal Web Services (2018). Available at: https://pivotal.io/platform/pivotal-application-service. Accessed: 2018-12-24.

Rally (2018). Rally benchmark. Available at: https://github.com//rally. Accessed: 2018-12-24.

Salapura, V., Harper, R., and Viswanathan, M. (2013). Resilient cloud computing. IBM Journal of Research and Development, 57(5):10–1.

SanWariya, A., Nair, R., and Shiwani, S. (2016). Analyzing processing overhead of type-0 hypervisor for cloud gaming. In Advances in Computing, Communication, & Automation (ICACCA) (Spring), International Conference on, pages 1–5. IEEE.

Vishwanath, K. V. and Nagappan, N. (2010). Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 193–204. ACM.

Zhang, Q., Cheng, L., and Boutaba, R. (2010). Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 1(1):7–18.
