KTH Royal Institute of Technology

Master Thesis

Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud

Author: Nikolaos Stanogias
Supervisor: Ignacio Mulas Viela

A thesis submitted in fulfilment of the requirements for the degree of Software Engineering of Distributed Systems

July 2015

TRITA-ICT-EX-2015:154

Declaration of Authorship

I, Nikolaos Stanogias, declare that this thesis titled, 'Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud' and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

“Thanks to my solid academic training, today I can write hundreds of words on virtually any topic without possessing a shred of information, which is how I got a good job in journalism.”

Dave Barry

KTH ROYAL INSTITUTE OF TECHNOLOGY

Abstract

Faculty Name Software Engineering of Distributed Systems

Master's Degree

Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud

by Nikolaos Stanogias

Analytics frameworks were initially created to run on bare-metal hardware, so they contain scheduling mechanisms to optimise the distribution of CPU load and data allocation. Generally, the scheduler is part of the analytics framework's resource manager. Different resource managers are used in the market and the open-source community, serving different analytics frameworks. For example, Spark was initially built on Mesos, Hadoop now uses YARN, and Spark is also available as a YARN application. On the other hand, cloud environments (like OpenStack) contain their own mechanisms for distributing resources between users and services. While analytics applications are increasingly being migrated to the cloud, the scheduling decisions for running an analytics job are still made in isolation between the different scheduler layers (Cloud/Infrastructure vs. analytics resource manager). This can seriously impact the performance of analytics or other services running jointly on the same infrastructure, as well as limit load-balancing and autoscaling capabilities. This master thesis identifies which scheduling decisions should be taken at the different layers (Infrastructure, Platform and Software), as well as the metrics required from the environment when multiple schedulers are used, in order to get the best performance and maximise resource utilisation.

Acknowledgements

First, I would like to thank my main supervisor Ignacio Mulas Viela for his constant support, motivation and the passion that he transmitted to me during the process of this work. Many thanks to my other supervisors Nicola Seyvet and Tony Larsson for the advice, comments, help and valuable insights they gave me during the development of this thesis. I also thank my examiner, associate professor Jim Dowling, for all the valuable knowledge, inspiration and encouragement that he gave me throughout the last year and his willingness to provide help when needed for this work.

Finally, I would like to thank my family for their financial and psychological support over the last two years and my friends who helped me in difficult moments and gave me the opportunity to share and celebrate my successes with them.

Contents

Declaration of Authorship i

Abstract iii

Acknowledgements iv

Contents v

List of Figures vi

List of Tables vii

1 Introduction 1
  1.1 Motivation 1
  1.2 Problem Statement 2
  1.3 Methodology 4
    1.3.1 Method and Research Approach 4
    1.3.2 Data Collection and Analysis 5
  1.4 Contribution 5
  1.5 Document structure 5

2 Background 6
  2.1 Cloud Computing 6
  2.2 Cloud Service Models 7
    2.2.1 Infrastructure as a Service (IaaS) 8
    2.2.2 Platform as a Service (PaaS) 8
    2.2.3 Software as a Service (SaaS) 9
  2.3 IaaS Cloud Deployment Models 9
    2.3.1 Public Cloud 9
    2.3.2 Private Cloud 10
    2.3.3 Hybrid Cloud 10
  2.4 Hadoop 2.0 - YARN 10
  2.5 Virtualizing Hadoop 12
  2.6 Related work 13
    2.6.1 AWS CloudWatch 13
    2.6.2 Amazon EMR 15


3 Design Overview 16
  3.1 Introduction to OpenStack 16
    3.1.1 OpenStack Architecture 16
  3.2 Autoscaling model 17
    3.2.1 Autoscaling algorithm 18
    3.2.2 Scaling the cluster out 20
    3.2.3 Scaling the cluster in 21

4 Implementation 23
  4.1 Environment and Cloud setup 23
  4.2 Openstack4j 24
  4.3 Ceilometer 24
  4.4 Puppet 25
    4.4.1 What is Puppet 26
    4.4.2 How does Puppet work 26
  4.5 Workload generation 28
  4.6 Choosing instance type 31

5 Performance Evaluation 33
  5.1 Performance analysis 33
  5.2 Autoscaling effect 36

6 Conclusions 38
  6.1 Conclusion 38
  6.2 Future work 39

Bibliography 40

List of Figures

1.1 Gigaom Research Data Warehousing Survey 2
1.2 Average traffic distribution 3

2.1 Cloud service models 7
2.2 The new Architecture of YARN 11

3.1 OpenStack Architecture 16
3.2 Scale out 21
3.3 Scale in 22

4.1 Interaction between Puppet agents and Puppet master 27
4.2 Node join 28
4.3 Pi application 30
4.4 DFSIO application 30
4.5 Terasort application 30
4.6 I/O intensive applications run faster on high-I/O instances 31

5.1 Experimental performance with 50 Pi applications 34
5.2 Experimental performance with DFSIO-Terasort and Spark Application 35
5.3 Experimental performance with 50 Spark applications 36
5.4 Autoscaling effect from three to six VMs 37

List of Tables

4.1 System services running on Nodes 24
4.2 Variety of meters that can be measured with Ceilometer 25
4.3 Different instance flavors 31

6.1 Impact on performance of the physical VM location 39

Chapter 1

Introduction

We begin this chapter with the motivation, which analyses the reasons for choosing this area of research and diving deeper into it. The problem statement defines the problem that we aim to solve with this thesis. We outline the major goals of this thesis, explain their importance in the context of maximised resource utilisation and resource provisioning, and discuss the methodology that we followed when implementing this work. Finally, we discuss the most notable contributions of this thesis and the structure of this document.

1.1 Motivation

In recent years, a significant explosion of data stored worldwide has been observed, growing continuously at an exponential rate. Individual companies and organizations often have petabytes or more of data, including business information which is crucial to continued growth and success. However, the amount of data is often too large to store and process using traditional relational database systems, the data is in unstructured forms inappropriate for structured schemas, or the hardware needed for the analysis of this huge dump of data is too costly. The need to process this avalanche of Big Data gave rise to Apache Hadoop [1], an open source software framework that pioneered new ways of storing and processing data. Instead of relying on expensive hardware, Hadoop can be installed on a cluster of commodity machines so that they communicate and work together, storing and processing huge amounts of data in parallel. More than that, it can scale out to hundreds or thousands of nodes as the data and processing demands grow, and can automatically recover from partial failure of servers.


Hadoop clusters were designed for storing and quickly analyzing huge amounts of unstructured data in a distributed computing environment. However, it costs organizations considerably to build the infrastructure, as well as the manpower needed to maintain it. Thus, IT enterprises are increasingly looking to cloud computing as the best structure to support their big data projects, as they can exploit the cloud's pay-per-use model in order to save costs. Moving big data into the cloud is also a considerable benefit to smaller businesses, as they gain the ability to effectively leverage processing power they simply would not have access to otherwise.

In September 2014, Gigaom Research [2] surveyed more than 300 senior management leaders in the U.S. at medium (500+ employees) to large enterprises (2,000+ employees) across IT and business roles. The survey showed that 53 percent of these businesses are either already leveraging cloud resources for big data analytics needs (28 percent) or are planning to do so (25 percent). Among the respondents, only 13 percent reported that they would use only private data centers for their analytics processes.

Figure 1.1: Gigaom Research Data Warehousing Survey [3]

1.2 Problem Statement

It has been noticed that applications running on the cloud often present a fluctuation in resource demand during the day, month, or even year. It is thus quite important for big enterprises to optimise the infrastructure to respond to this varying workload demand. Figure 1.2 shows the daily traffic distribution of a typical internet application in the USA.

Figure 1.2: Average traffic distribution [4]

Data analytics applications running on a Hadoop cluster often show the same behaviour: workload demand varies at different times. As data and computation grow, so should the resources. Very often, the size of a Hadoop cluster is fixed with a dedicated number of nodes. This can be problematic when all the resources of the cluster are exhausted, as it results in high response times for submitted applications. It is imperative, thus, to explore ways for more nodes to join the cluster in case of a sudden workload demand, in a relatively short period of time.

Auto Scaling is a technique that provides features to manage a running pool of machines, with the capability to replace failed instances and automatically grow and shrink the size of the pool. It can be used as a solution to the aforementioned problem.

Furthermore, an interesting property of the cloud paradigm is that the cost of using 1000 machines for 1 hour is the same as using 1 machine for 1000 hours. This observation leads us to the inference that a Hadoop job's performance can potentially be improved with autoscaling at the same cost, since Hadoop is built to exploit parallelism.
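The cost equivalence above follows directly from pay-per-use billing, where cost depends only on total machine-hours. A minimal sketch (the hourly rate is an assumed, illustrative figure, not any provider's actual price):

```python
# Under pure pay-per-use billing, cost = machines x hours x rate,
# so spreading the same machine-hours across more machines keeps
# the cost constant while shrinking wall-clock time.
PRICE_PER_MACHINE_HOUR = 0.10  # hypothetical rate in USD

def job_cost(machines, hours):
    """Total cost of running `machines` instances for `hours` each."""
    return machines * hours * PRICE_PER_MACHINE_HOUR

# 1 machine for 1000 hours costs exactly as much as
# 1000 machines for 1 hour -- but finishes 1000x slower.
assert job_cost(1, 1000) == job_cost(1000, 1)
```

In practice the equivalence is only approximate, since parallel jobs carry coordination overhead and providers bill in whole-hour increments, but it motivates scaling out rather than waiting.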

In most auto scaling systems, the metrics that are taken into account for an autoscaling decision come only from the IaaS layer. This can seriously limit autoscaling capabilities, as there are several metrics from the analytics resource manager that are invisible to the infrastructure. As a result, the autoscaling decision does not bring the maximum benefit to the running applications, especially when they are resource intensive and consume all the resources given to them. The basis of this thesis is to identify which metrics should be considered critical from both layers (Cloud/Infrastructure vs. analytics resource manager) in order to take an optimal autoscaling decision, and to build a communication layer between these two layers.
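The idea of combining the two layers can be sketched as a decision function that consults both infrastructure metrics and resource manager metrics before scaling. This is an illustrative sketch only: the metric names and thresholds are assumptions for exposition, not the thesis's actual algorithm (described in Chapter 3).

```python
# Hypothetical scaling decision combining IaaS-layer metrics
# (e.g. average VM CPU utilisation) with YARN ResourceManager
# metrics (pending applications, unallocated memory).
def scaling_decision(cpu_util, pending_apps, available_mb,
                     cpu_high=0.80, cpu_low=0.20, min_free_mb=2048):
    """Return 'scale_out', 'scale_in' or 'hold'.

    cpu_util     -- average VM CPU utilisation from the IaaS layer (0..1)
    pending_apps -- applications waiting in the YARN scheduler queue
    available_mb -- memory the ResourceManager can still allocate
    """
    # Scale out only when both layers agree the cluster is saturated:
    # high CPU alone may be a transient spike, and a full YARN queue
    # alone may be a scheduling artefact.
    if cpu_util > cpu_high and (pending_apps > 0 or available_mb < min_free_mb):
        return "scale_out"
    # Scale in only when both layers report slack.
    if cpu_util < cpu_low and pending_apps == 0:
        return "scale_in"
    return "hold"
```

For example, `scaling_decision(0.9, 0, 8192)` returns `"hold"`: the IaaS layer looks busy, but YARN has no pending work, so booting another VM would bring no benefit. An IaaS-only policy would have scaled out here.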

1.3 Methodology

In this section we describe the methodology that we used for this work. We present the research approach that we followed, as well as the research strategy used to produce the results. Finally, we discuss how the data collection and analysis of the results was done.

1.3.1 Method and Research Approach

In this thesis, we used the quantitative research method. We built a component which we embedded in the Resource Manager of YARN. The goal is to show that, leveraging this component, we can optimize resource utilization in a distributed cloud where data analytics applications are running, and that auto-scaling decisions can be improved by taking into account metrics from the data analytics platform (YARN). Companies with lots of data analytics processing can benefit from the approach proposed in this thesis by utilising their underlying infrastructure better. Due to time limitations, we conducted only a limited number of experiments, and most of the benchmarks were created from the example applications included with the Hadoop and Spark distributions. However, all the experiments show consistency and prove the effectiveness of our implementation.

The main hypothesis in this work is that by monitoring metrics not only from the cloud infrastructure layer but also from the data analytics layer, we can enhance the cloud's dynamic scalability benefits, save the cost of running unnecessary machines and make the entire underlying resource capacity more adaptive to the application's real-time workload. The experiments verify this hypothesis, as they show the performance gain we get when we stress the Hadoop cluster with a heavy workload and the elastic controller component is enabled.

The experiments that we performed imply the use of the experimental research method. Finally, in this thesis we used the deductive research approach, because the verification of the initial hypothesis is based on experiments.

1.3.2 Data Collection and Analysis

In order to collect the data and evaluate the performance, we ran a series of experiments. We tested our implementation with a variety of application workloads. More specifically, we tested the behaviour of the system with computational, high-I/O and mixed applications. The data collected comes from measurements of IaaS metrics as well as YARN metrics. Furthermore, we increased the cluster size when performing the experiments in order to observe how this affects our implementation. The results presented in Chapter 5, in the form of graphs, indicate the effectiveness of our approach.

1.4 Contribution

The major contribution of this thesis is the design and implementation of an elastic resource controller that offers on-demand resource provisioning when needed. This controller allows the analytics framework's Resource Manager to ask for more resources from the IaaS, as well as to free resources back to the multi-tenant cloud so that they become available for other users. We combine metrics from both the IaaS level and the Resource Manager level in order to take an autoscaling decision. We evaluate the effectiveness of our approach by conducting a series of experiments running different kinds of applications.

1.5 Document structure

The thesis is documented in the following order. Chapter 1 introduces the reader to the general idea of the problem and states the exact problem to be investigated and solved. Chapter 2 touches upon the relevant background information on cloud computing, service models, Hadoop YARN and virtualization technology; in that chapter we also review some related work. In Chapter 3, the proposed architecture is explained in detail, including models for autoscaling strategies. Chapter 4 discusses the experimental setup in detail and provides the relevant implementation details. Chapter 5 presents the experimental results as well as the useful conclusions extracted from them. Finally, Chapter 6 makes the final remarks of this thesis and proposes some ideas for further work.

Chapter 2

Background

In this chapter we begin by explaining background information about cloud computing and its service models. We then give a brief overview of YARN and its characteristics, and explain the benefits of leveraging virtualization technology in Hadoop. Finally, we discuss the most relevant work related to cloud monitoring and autoscaling platforms in cloud environments.

2.1 Cloud Computing

Cloud computing is the paradigm of delivering resources such as computing power, storage, network and software over the Internet in a remotely accessible fashion. Users access cloud applications using web browsers or mobile devices, while all the data and software are stored on servers at a remote location. According to the National Institute of Standards and Technology (NIST), U.S. Department of Commerce, there are five essential characteristics of a cloud deployment model [5].

• On-demand self-service: The capability of a consumer to provision computing resources as needed without having to interact with the service provider.

• Broad network access: Network access to resources through standard mechanisms that allow heterogeneous clients to make use of them.

• Resource pooling: Providers serve their resources to multiple consumers in a multi-tenant model, according to their demands. The consumer is unaware of details such as datacenter location, but may be able to specify location at a higher level of abstraction.


• Rapid elasticity: Customers have the ability to elastically provision and release resources scaling outward and inward according to their demands over time. They get the illusion of unlimited compute resources.

• Measured service: Cloud resource usage can be monitored, controlled and reported, providing transparency for both the provider and the consumer of the service.

Cloud computing is a technology that is becoming increasingly popular due to the benefits that it offers for organizations and individuals. Among them is the ability to deploy elastic applications, simplifying the process of acquiring and releasing resources for a running application, while paying only for the resources actually allocated (the pay-per-use or pay-as-you-go model).

2.2 Cloud Service Models

There are three basic kinds of cloud service models, classified based on the services delivered to the cloud subscriber. Each shares similarities with the others but has its own distinct differences as well. Figure 2.1 illustrates the cloud computing stack and the three service models [6].

Figure 2.1: Cloud service models [7]

2.2.1 Infrastructure as a Service (IaaS)

Infrastructure-as-a-Service is the lowest layer and foundation of cloud computing. It provides computing, storage or network resources, delivered over the Internet in a pay-as-you-go model. Customers can request IaaS resources without having to pay in advance, and get access to them within a matter of minutes. Leading IaaS providers include Amazon Web Services (AWS), Windows Azure, Rackspace Open Cloud, and IBM SmartCloud Enterprise. IaaS cloud providers may offer resources dedicated to only one client or shared between many. Furthermore, resources can be physical or virtual. AWS's Elastic Compute Cloud, for instance, offers virtualized resources, abstracting the physical and virtualization layers. The cloud subscribers get complete access to the virtual machines, from the choice of operating system to the application software installation.

Although IaaS cloud provider pricing is slightly higher than private infrastructure pricing calculated over a sufficiently long period of time, the distinct advantage of the IaaS cloud model is the elimination of the initial capital expense and a significant reduction in operating expenses. Also, because of the pay-as-you-go pricing model and the efficient and fast infrastructure scaling offered by the cloud providers, subscribers may choose to dynamically adapt the infrastructure to their fluctuating needs. This is especially important for small and medium companies that are unable to afford big initial investment costs, and it also enables better cloud service utilisation, using the resources only when they are needed [8][9].

2.2.2 Platform as a Service (PaaS)

The Platform-as-a-Service model designates programming environments and tools, hosted and supported by cloud providers, that consumers can use to build and deploy applications onto the cloud infrastructure. PaaS further simplifies the job of cloud subscribers by hiding the complexities of hardware and software resource management, application deployment and the dynamic scaling of infrastructure to cater to growing application needs. Developers may use PaaS services to build applications that are hosted by the PaaS provider and offered as a service to the end users, usually over the Internet. The PaaS model enables enterprises to focus only on the software development cycle involved in building the application, as other aspects such as infrastructure management and dynamic scaling mechanisms are made transparent by the PaaS provider. Some examples of PaaS include Google App Engine, Azure Services, and the Force.com platform [10].

2.2.3 Software as a Service (SaaS)

This is the final layer of the cloud services model. It allows businesses and organizations to run programs in the cloud, where all portions are managed by the cloud vendor. In this model, the subscriber has limited control over the physical hardware, software stack, application execution environment and other factors, unlike in IaaS or PaaS clouds. As consumers, we interact with Software-as-a-Service based applications every day without even realizing it; examples are online banking and email services such as Gmail, Google Docs and Hotmail. SaaS represents a paradigm shift in the software service model, especially in the end user sector, as it reduces the need for client-side software installation and imposes lower system requirements than the traditional software service model [11][12].

As we have seen, while the three cloud service models share some similarities, there are significant differences between them as well. It is up to the consumer to choose which model is best for their company in order to use this invaluable service to its fullest potential.

2.3 IaaS Cloud Deployment Models

IaaS clouds may be categorized into three deployment models, distinguished primarily by ownership, size, and access.

2.3.1 Public Cloud

In a public cloud, the infrastructure can be accessed openly and used by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them, and it exists on the premises of the cloud provider [5]. Most public clouds provide a REST-based API for automatic allocation and deallocation of resources. Subscribers are thereby free to dynamically adapt their resources according to their workload demands, and can afford to make short-term plans for decisions on resource utilisation and optimisation. The most prominent public cloud providers are Amazon Web Services, Rackspace, GoGrid, IBM etc. The key aspects of this model are the promotion of virtualization, where virtual resources can be spawned in a matter of minutes, the pay-as-you-go pricing model, and cloud provider APIs for automatic control of the infrastructure. The downside of the public cloud deployment model is that subscribers are generally not in control of the underlying hardware and software stacks involved in infrastructure management, the location of data centers, the security implications of multi-tenancy, and country-specific laws that may impose restrictions on sensitive information crossing geographical borders [12][13].

2.3.2 Private Cloud

In a private cloud, the infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them [5]. Conceptually, within a private cloud, the same organization can technically be considered to have the role of both cloud provider and cloud consumer. The most distinct advantage of a private cloud is that the organization has total control over the infrastructure, unlike in a public cloud. The private cloud infrastructure capacity is fixed and dedicated, and hence cannot supply resources to workloads beyond a certain limit, while public clouds provide an illusion of unlimited resources. Although private clouds do not bring a lot of value to an organization apart from providing the means for efficient infrastructure management, private cloud adoption eases a future migration to the public cloud, or the use of a hybrid architecture utilizing the private infrastructure as well as services offered by a public cloud provider. Eucalyptus, OpenNebula and OpenStack are the three most visible players in the private cloud space.

2.3.3 Hybrid Cloud

The hybrid cloud deployment model allows companies to secure sensitive data and applications, such as medical records, telecommunications operator data, government data etc., on a private cloud, while still enjoying the cost benefits of the public cloud by storing shared data and applications there. This model is also used to handle sudden workload bursts, when the existing private cloud infrastructure does not suffice to handle load spikes and an auxiliary option is required to support the load. Hence, the cloud moves workloads between private and public hosting in a flexible way, transparently to the client. Windows Azure and Force.com are two examples of this model [14].

2.4 Hadoop 2.0 - YARN

YARN is the next generation Hadoop MapReduce architecture. It is designed to be more flexible, improve scalability and achieve a higher resource utilisation rate, among other things. Furthermore, this new architecture not only supports the old MapReduce programming model, but also opens up the possibility of new data processing models. Chapter 2. Background 11

Although YARN remains similar to the old Hadoop MapReduce architecture, the two are different enough that most components have been rewritten and the same terminology can no longer be used for both. Some of the ways in which YARN amplifies the power of the Hadoop framework are the following [15].

• Multi-tenancy: YARN allows multiple access engines to use Hadoop as the com- mon standard for batch, interactive and real-time engines that can simultaneously access the same data set.

• Cluster utilization: YARN allocates its resources dynamically in the form of containers, and not just map-reduce slots, thus improving the overall utilisation of the cluster.

• Scalability: YARN's ResourceManager focuses exclusively on scheduling and keeps pace as clusters expand to thousands of nodes managing petabytes of data.

• Compatibility: Existing MapReduce applications developed for Hadoop 1 can run on top of YARN without modification, so applications that already work do not break.

The following figure illustrates the architecture of YARN.

Figure 2.2: The new Architecture of YARN [16]

In short, the Hadoop YARN architecture (also called MR2) splits the two major functions of the MR1 JobTracker into separate components: a global ResourceManager, which lies at the root of the YARN hierarchy, and an ApplicationMaster per application. The ResourceManager rules the entire cluster and distributes resources among the running applications in the system according to their requirements and constraints. It is the single process that contains the information required to make scheduling decisions in a shared, secure, and multi-tenant manner. In addition, it works closely with the ApplicationMasters, allocating resources to them upon request.

The ApplicationMaster keeps track of the lifecycle of each application that runs within YARN. Its main responsibility is to negotiate resources from the ResourceManager and, through the NodeManager, to monitor the execution and resource consumption of containers. A container is the encapsulation of a given set of resources in YARN, and its size depends upon the amount of resources it contains, such as memory and CPU. In the future, new types of resources are expected to be supported, such as disk/network I/O, GPUs etc.

The NodeManager is a per-node daemon that is responsible for launching applications' containers, monitoring their resource availability and execution, and reporting these results, as well as any faults that may happen, to the ResourceManager. This approach differs from MRv1, which managed the execution of map and reduce tasks via slots. YARN continues to use the HDFS layer, with its master NameNode for metadata services and DataNodes for replicated storage services across a cluster [16].
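The allocation flow described above can be illustrated with a toy model: a node advertises its capacity, and containers are granted against it until memory or vcores run out. The class and field names below are illustrative assumptions, not YARN's actual API.

```python
# Toy model of YARN-style container allocation on a single node:
# the NodeManager reports (memory, vcores) capacity, and container
# grants subtract from it until the node is exhausted.
class Node:
    def __init__(self, memory_mb, vcores):
        self.free_mb = memory_mb
        self.free_vcores = vcores
        self.containers = []

    def allocate(self, memory_mb, vcores):
        """Grant a container if the node has capacity, else return None."""
        if memory_mb <= self.free_mb and vcores <= self.free_vcores:
            self.free_mb -= memory_mb
            self.free_vcores -= vcores
            container = (memory_mb, vcores)
            self.containers.append(container)
            return container
        return None

node = Node(memory_mb=8192, vcores=4)
assert node.allocate(2048, 1) is not None   # e.g. an ApplicationMaster container
assert node.allocate(4096, 2) is not None   # e.g. a task container
assert node.allocate(4096, 2) is None       # only 2048 MB / 1 vcore left
```

This also shows why containers are more flexible than MRv1 slots: requests of different sizes pack onto the same node instead of being forced into fixed map or reduce slots.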

2.5 Virtualizing Hadoop

Virtualization is an old software technology that has become widely used in recent years and is well on the way to transforming the practice of IT, changing the manner in which computing hardware (particularly for enterprise systems) is used. Virtualization enables multiple virtual machines (VMs) to run on a single physical host, sharing the resources of a single hardware unit across multiple virtualized devices. Different virtual machines can run different operating systems and several applications on the same physical machine. Both the guest operating system and the application software running on the virtual server are unaware of the virtualization process, meaning that these virtualized IT resources are installed and executed as if they were running on a separate physical computer. Usually a given server is not highly utilised at all times, so virtualization brings an increase in the effective utilisation of the hardware. Organizations can in that way run many VMs on a smaller set of physical machines, which in turn brings savings in costs for power, real estate etc. [17].

Apache Hadoop has emerged as one of the leading applications in the big data space and is used by many enterprises for Big Data analytics. The first project where virtualization technology met Hadoop was initiated by VMware [18]. Some of the benefits brought by the virtualization of Hadoop were the following [19]:

• Scheduling: With VMs, unused hardware capacity can be used in order to schedule batch jobs that require lots of computing resources.

• Resource utilisation: Different kinds of VMs can be hosted on the same physical machine. Hadoop VMs running data analytics jobs can be used along with other VMs for other tasks. This allows better overall utilisation by consolidating applications that use different kinds of resources.

• Datacenter efficiency: Virtualization of Hadoop can optimise datacenter efficiency by increasing the variety of workloads that can be run on a virtualized infrastructure.

• Deployment: It is much faster to deploy new nodes in a Hadoop cluster by leveraging virtualization. Hadoop configuration on a machine can be completed quickly by cloning an already configured VM.

There are, however, some concerns when running Hadoop in a virtualized environment. First of all, CPU performance is lower when virtualized. More than that, Hadoop takes advantage of high data "locality": much of Hadoop's value relies on having the data locally on the nodes where it runs (HDFS), but with virtualization this is no longer ensured. The blocks used by a VM can be sitting remotely on another node, which can seriously impact the performance of Hadoop. Finally, in the case of a physical host failure, many Hadoop virtual nodes can shut down at the same time. Careful attention must be given by cluster admins, and particular configurations and setups must be used in order to solve these problems.

2.6 Related work

Over the last few years, several Cloud monitoring and autoscaling platforms have been developed. The most prominent ones are reviewed in the following sections.

2.6.1 AWS CloudWatch

Amazon CloudWatch [20] is a service for monitoring the web services provided by Amazon, such as EC2 (Elastic Compute Cloud) [21]. The CloudWatch service collects the values of different kinds of metrics, such as CPU utilisation, disk read bytes and network usage, and stores them for a predefined period of time. On these data users can build plots, statistics and thresholds, and set up alarms to watch a particular metric. These alarms are triggered when the value of a metric exceeds a threshold over a number of periods for a specific interval. In response to these alarms, an autoscaling policy can be invoked which, based on some conditions of the metrics, will boot more instances or terminate some of the running instances.

2.6.2 Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) [22] is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Using the elastic infrastructure of Amazon EC2 and Amazon S3 [23], Amazon EMR provides a managed Hadoop framework that distributes the computation of your data over multiple Amazon EC2 instances. Amazon EMR monitors the jobs running in a cluster and, when they are completed, it shuts down the cluster so that users stop paying. It provides elasticity, as the user can easily expand or shrink the cluster to process more or less data.

Chapter 3

Design Overview

3.1 Introduction to Openstack

Openstack is open source cloud computing software that provides Infrastructure-as-a-Service cloud deployments for public and private clouds. Being open source, Openstack offers a significant alternative for organizations that do not wish to adopt a commercially provided cloud. Openstack was first introduced in June 2010, born with its initial code from NASA's Nebula platform and Rackspace's Cloud Files platform, with the mission to enable any organization, regardless of size, to operate and offer cloud computing services running on standardized hardware. It is written in Python and all the code for Openstack is freely available under the Apache 2.0 license.

3.1.1 Openstack Architecture

Figure 3.1: Openstack Architecture [24]


Openstack architecture is built using three main modules [25]:

• Openstack Compute: Also known as Nova, it is a management platform that controls the infrastructure of IaaS clouds. Nova allows managing large networks of virtual machines and redundant and scalable architectures.

• Openstack Storage: Openstack supports both object and block storage. Object storage is ideal for cost-effective, scale-out storage. It provides a fully distributed, API-accessible storage platform that can be used for backup, archiving and data retention. Block storage allows block devices to connect with compute instances for expanded storage, better performance and integration with enterprise storage platforms.

• Openstack Networking: Openstack Networking (Neutron) provides the networking capability for Openstack and it is a system for managing networks and IP addresses easily, quickly and efficiently.

3.2 Autoscaling model

As we mentioned in the previous chapter, a key characteristic of cloud computing is elasticity. This can be a two-edged sword. While it enables applications to acquire and release resources dynamically, adjusting to shifting demands, deciding the right amount of resources is not an easy task. In the best case, we would like to have a system that automatically adjusts the resources to the current workload of applications, with the least possible human intervention, or even better, without it at all. We refer to this system as an auto-scaling system [26].

Resource scaling can be either horizontal or vertical. In horizontal scaling (or scaling out) new nodes are added to or released from the system as needed. In vertical scaling (or scaling up) new resources are added to a single node in the system, for example more CPU power or memory. The majority of common operating systems do not allow on-the-fly (runtime) resizing of the machine on which they run, and the same applies to VMs. Most cloud providers thus prefer the horizontal scaling approach, which is also the one we take in our autoscaling algorithm.

The type of applications we consider in this thesis are data analytics applications running on top of the YARN Resource Manager. Most autoscaling algorithms mainly take into account metrics obtained from the physical infrastructure hosting the VMs, from the VMs themselves and from the hypervisor managing the VMs. While this is an effective approach adopted by most cloud providers, there are some limitations, as some metrics from YARN's Resource Manager are not taken into account and as a result the autoscaling decision is not optimal.

Finally, the autoscaler component itself is integrated into the Capacity Scheduler of YARN. Let us mention at this point that we initially developed the elastic controller as part of the FIFO scheduler but we extended it to work with the Capacity Scheduler.

3.2.1 Autoscaling algorithm

The autoscaling process consists of the following steps [26]:

• Monitor: A monitoring system is required during the autoscaling process in order to provide measured metrics about user demands and system status. Openstack provides such information through Ceilometer. A variety of metrics can be used as drivers for scaling decisions. Most common are related to hardware usage such as CPU utilization per VM, disk access, network interface access and memory usage.

• Analyze: During this phase the metrics that are gathered from the monitoring system are analyzed in order to determine the current state of the system, the running applications and the pending applications in the queue.

• Plan: Once the analysis phase finishes, the autoscaler is ready to make a satisfactory planning decision. Some examples are removing a VM or adding a VM with a specific flavour. Decisions are made based on the data taken from the analysis phase and the target SLO, if any, as well as the penalty for a VM to launch and join the YARN cluster.

• Execute: The last phase consists of executing the scaling actions decided in the previous step. It is during this phase that the Resource Manager of YARN interacts with the scheduler of Openstack to ask for more resources. This interaction is implemented through the Openstack REST API with the help of a fluent Openstack client SDK for Java, which we describe in a later chapter.
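The four phases above form a control loop. The following sketch illustrates one round of the analyze/plan logic; the metric names, thresholds and bounds are illustrative assumptions, not the actual implementation.

```python
# Sketch of one monitor-analyze-plan iteration of an autoscaling loop.
# Metric names, thresholds and cluster bounds are illustrative only.

def autoscale_step(metrics, cluster_size, min_nodes=1, max_nodes=10):
    """Decide a scaling action from one round of monitoring data."""
    # Analyze: summarize system state from the gathered metrics.
    avg_cpu = sum(metrics["cpu_util"]) / len(metrics["cpu_util"])
    pending = metrics["pending_applications"]

    # Plan: pick an action based on the analyzed state.
    if pending > 0 and avg_cpu > 80 and cluster_size < max_nodes:
        return "add_vm"
    if pending == 0 and avg_cpu < 20 and cluster_size > min_nodes:
        return "remove_vm"
    return "no_action"
```

The execute phase would then translate the returned action into a call against the cloud API.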

There is always a time lag from the moment an autoscaling action is executed (adding a server, for example) until it is effective. It takes, for example, almost a couple of minutes to assign a physical server, launch a new VM with a specific image, boot the operating system and applications, and have the server fully operational. Apart from that, a script has to run in the newly created VM which installs Puppet [27] in order to connect with the host where the ResourceManager and NameNode are running. We refer to this total delay as the penalty of the autoscaling process.

In addition, clouds usually offer a variety of instance types, such as high-CPU and high-I/O instances. We verified that choosing the appropriate instance type based on the application workload can further improve performance and save users money when it comes to public clouds. For example, an I/O-intensive application can run faster on high-disk machines than on high-CPU machines.

In order to decide the autoscaling action in the planning phase we decided to combine characteristics of threshold-based rules and reinforcement learning [28]. Threshold-based autoscaling policies are very popular among cloud providers like Amazon EC2 and third-party tools like RightScale [29]. The main reason is the simplicity of these policies, which makes them attractive to cloud customers. A threshold-based rule consists of a series of conditions that, when met, trigger some actions over the underlying cloud infrastructure. The condition itself uses one or more performance metrics (extracted in our case from both the VMs and the ResourceManager) x1, x2, ..., such as CPU utilisation, pending applications or memory utilisation. We assign upper and lower thresholds to each performance metric. Whenever the observed performance metric is above or below a certain threshold, a predefined number of instances will be added to or removed from the cluster. One example of a trigger could be "Add 2 instances when CPU usage exceeds 60% for the last 5 minutes". This automation enhances the dynamic scalability benefits of the cloud by transparently adding more resources to handle increasing workload and by shutting down unnecessary machines. In this way, users do not have to worry about capacity planning. The underlying resource infrastructure can adapt to the application's real-time workload. Reinforcement learning focuses on learning through direct interaction between an agent (the elastic controller in our case) and the environment this agent acts on (the Hadoop cluster). We use this technique in order to make the controller learn from experience the best scaling action to take and be able to predict how much time it will take for all the current applications to finish. In more detail, the elastic controller holds in a map the time it took for a specific type of application to finish, as well as the number of nodes that existed in the cluster at that moment.
In that way, the controller will be able to predict how much time similar applications will take to complete. This will enable users to submit jobs with deadline constraints, as the necessary number of VMs will be requested from the infrastructure in order to meet these deadlines. This feature is not implemented yet, but we are planning it as further work.
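The experience map described above can be sketched as follows. The class structure, key layout and averaging rule are illustrative assumptions about the idea, not the thesis's actual data structure.

```python
# Sketch of the experience map the elastic controller keeps: for each
# application type it records past completion times together with the
# cluster size at that moment, and predicts the runtime of a similar
# application. Names and structure are illustrative assumptions.

class RuntimeHistory:
    def __init__(self):
        # (app_type, cluster_size) -> list of observed runtimes (seconds)
        self.history = {}

    def record(self, app_type, cluster_size, seconds):
        self.history.setdefault((app_type, cluster_size), []).append(seconds)

    def predict(self, app_type, cluster_size):
        runs = self.history.get((app_type, cluster_size))
        if not runs:
            return None  # no experience yet for this configuration
        return sum(runs) / len(runs)
```

A deadline-aware planner could query `predict` for increasing cluster sizes until the estimated runtime fits the deadline.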

3.2.2 Scaling the cluster out

We have implemented an elastic controller which has been integrated into the scheduler component of YARN's Resource Manager. In contrast with other approaches that have been developed for providing elasticity in the cloud, our elastic controller monitors not only the metrics from the VMs, but also takes into account performance metrics of the ResourceManager itself. Furthermore, depending on the application workload, it chooses the appropriate instance type in order to improve performance.

We define two sets of rules: one for scaling out and one for scaling in. Rules are structured like these examples:

if x1 > Upthr1 and/or x2 > Upthr2 and/or ... for the last tUp seconds then
    N = N + S
    and do nothing for inUp seconds
end if

if x1 < Lthr1 and/or x2 < Lthr2 and/or ... for the last tLow seconds then
    N = N - S
    and do nothing for inL seconds
end if

Each rule is composed of a series of conditions. The condition itself is composed of a series of performance metrics that are available in the Resource Manager, such as pending applications, lost nodes, or CPU load. Each performance metric has an upper (Upthr) and a lower (Lthr) threshold. If the condition is met for a predefined time (tUp or tLow), then the corresponding action will be triggered and executed. Since we are dealing with horizontal scaling, we define a fixed amount S of VMs to be acquired or released. The default amount is one for both in-scaling (the reason is explained in the following section) and out-scaling. When it comes to out-scaling, the elastic controller checks the number of pending applications in the queue of the Capacity Scheduler and, depending on that number, decides the amount of new VMs to be spawned. This decision is very important, as we want to avoid a state where the gain in throughput is limited but the cost of coordinating Hadoop instances continues to grow. After adding one or more instances to the cluster, the throughput speedup eventually levels off and reaches what is known in the literature as the steady state. Provisioning more machines after the steady state has been reached is known as over-provisioning. Such a situation is clearly undesirable as it results in a waste of resources [30]. At this point, two more checks are carried out. First, the controller checks whether the requested VMs can be provided by the infrastructure. Second, it checks the last time an autoscaling action was executed and, if the cooling period (defined by inUp or inL) has passed, it proceeds with the next step of the process. At the next step, the controller identifies what kinds of applications have been running in the Resource Manager over the last period of time. We explain in the next chapter how this functionality is implemented. The scale-out action is illustrated in Figure 3.2 with a sequence diagram.

Figure 3.2: Scale out
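The threshold rule with its cooling period can be sketched as below. The threshold values and fixed step S are illustrative, and the persistence check over tUp/tLow seconds is omitted for brevity; this is not the thesis's actual configuration.

```python
# Sketch of a threshold-based scaling rule with a cooling period,
# following the rule structure above. Values are illustrative only.
import time

class ThresholdRule:
    def __init__(self, up_thr, low_thr, step=1, cooldown=300):
        self.up_thr, self.low_thr = up_thr, low_thr
        self.step = step          # fixed amount S of VMs per action
        self.cooldown = cooldown  # inUp / inL seconds of inactivity
        self.last_action = 0.0

    def evaluate(self, metric, cluster_size, now=None):
        """Return the new target cluster size for one observed metric."""
        now = time.time() if now is None else now
        if now - self.last_action < self.cooldown:
            return cluster_size  # still in the cooling period
        if metric > self.up_thr:
            self.last_action = now
            return cluster_size + self.step          # scale out
        if metric < self.low_thr:
            self.last_action = now
            return max(1, cluster_size - self.step)  # scale in
        return cluster_size
```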

3.2.3 Scaling the cluster in

We had to be more careful when scaling the cluster down, because HDFS could be left with inconsistent data. Each node holds blocks of data for which additional replicas exist; in our cluster we have reduced the replication factor to two. Thus, upon an in-scaling decision, we release one VM at a time, as otherwise all the replicas of a specific block could be erased. Another aspect we take into account is to choose nodes on which no ApplicationMasters are running. In that way, the tasks that were running in containers of that node will be restarted on other live nodes in the cluster, but the client will not have to resubmit the application from the beginning. Naturally, the node we take down is a slave node (NodeManager and DataNode) and not a master node (NameNode, ResourceManager). When we remove a node from a Hadoop cluster we have to do a graceful decommission of the Hadoop services that are running on that node. We use again a sequence diagram to show the scale-in process.
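The node-selection policy for scale-in can be sketched as below: release one slave VM at a time, preferring a node that hosts no ApplicationMaster. The node representation is an illustrative assumption, not the actual implementation.

```python
# Sketch of the scale-in node selection described above. The node
# representation (dicts) is an illustrative assumption.

def pick_node_to_remove(nodes):
    """nodes: list of dicts like {"name": str, "is_master": bool,
    "app_masters": int}. Returns the name of the node to decommission,
    or None if no safe candidate exists."""
    candidates = [n for n in nodes
                  if not n["is_master"] and n["app_masters"] == 0]
    if not candidates:
        return None
    # Remove only one node per scale-in action to keep HDFS consistent.
    return candidates[0]["name"]
```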

Figure 3.3: Scale in

Chapter 4

Implementation

This chapter covers the implementation of our work. We first discuss the environment and the Hadoop cluster setup. Then, we describe the Ceilometer service provided by Openstack in order to get metrics from the running instances. We continue with a description of how we leverage a configuration management tool called Puppet, so that newly added VMs can pull and apply the appropriate Hadoop configuration files. After that, we talk about workload modeling and generation, and we close the chapter by discussing the importance of choosing the right instance type.

4.1 Environment and Cloud setup

In order to implement and test the hypothesis about efficient autoscaling combining the data analytics framework scheduler and the cloud scheduler, a private cloud was set up for conducting simulations. We used the private cluster at Ericsson Research, which runs Openstack on bare-metal machines. The underlying physical infrastructure consists of 32 hosts, which are interconnected by a Gigabit Ethernet LAN. The operating system of the host machines as well as the virtual machines is CentOS. Each physical node contains six CPU cores with hyper-threading enabled, 64 GB of RAM and 1 TB of disk space.

Virtual instances are hosted on Openstack compute nodes and can be of different flavors. The scheduler creates virtual instances on a compute node until either the number of virtual CPU cores or the virtual memory exceeds the limit on that compute node. Openstack supports overcommitting of CPU and memory resources on compute nodes, which is a technique of allocating more virtualized CPUs and/or memory than there are physical resources. The default overcommit ratio is 16 virtual cores to 1 physical core for CPU and 1.5 to 1 for virtual to physical memory. The disks associated with VMs are ephemeral,

meaning that (from the user's perspective) they effectively disappear when a virtual machine is terminated. This is done because of the data-locality problem we mentioned in the previous chapter: with ephemeral disks the data resides locally on the nodes where the computation is running.
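The default overcommit ratios translate into a simple per-node capacity bound, sketched below. Whether the CPU ratio applies to physical cores or hyper-threads depends on the nova configuration; we assume physical cores here purely as an illustration.

```python
# Back-of-the-envelope capacity check for one compute node under the
# default Openstack overcommit ratios. The choice of counting physical
# cores (not hyper-threads) is an assumption for illustration.

CPU_OVERCOMMIT = 16.0   # virtual cores per physical core (default)
RAM_OVERCOMMIT = 1.5    # virtual memory per physical memory (default)

def node_capacity(physical_cores, ram_gb):
    return {
        "max_vcpus": int(physical_cores * CPU_OVERCOMMIT),
        "max_vram_gb": ram_gb * RAM_OVERCOMMIT,
    }
```

For one of our 6-core, 64 GB hosts this bound is 96 virtual cores and 96 GB of virtual memory.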

We are running Hadoop 2.6.0 with Java version 1.6, and the default (Capacity) scheduler is used. Our cluster consists of one master node; the rest are slave nodes. We outline the services running on these nodes in Table 4.1.

Master node          Slave node
NameNode             DataNode
ResourceManager      NodeManager
JobHistoryServer     Puppet Agent
SecondaryNameNode

Table 4.1: System services running on Nodes

4.2 Openstack4j

We used OpenStack4j [31] for the communication with the IaaS; it is an open source library for managing an OpenStack deployment, written entirely in Java. It provides an API which gives full control over the various OpenStack services.

4.3 Ceilometer

The main objective of the Openstack project was to introduce an open source cloud computing platform that would meet the needs of public and private clouds by being simple to implement and massively scalable. Since Openstack offers Infrastructure as a Service (IaaS) to end clients, it is necessary for administrators to be able to meter its performance and utilization for billing, chargeback, or monitoring purposes [32].

There are several projects for metering Openstack infrastructure, such as Zabbix, Synaps, Healthmon and others. The most promising and actively developed is Ceilometer. Some of the characteristics that make Ceilometer the preferred cloud metering component are the following [33]:

• It provides efficient collection of metering data, e.g. CPU and network costs.

• Deployers are allowed to integrate with the metering system directly or by replacing components.

• Data may be collected by notifications sent by the system or by polling the infrastructure.

• Deployers can configure the type of data collected to meet their operating requirements.

• It provides a REST API through which users can view the data collected by the metering system.

Some of the meters that we can measure with Ceilometer can be seen in the following table.

Name                      Unit  Meaning
cpu_util                  %     The average CPU utilization
disk.read.bytes           B     The total number of bytes read
disk.write.bytes          B     The total number of bytes written
memory                    MB    The memory allocated by the hypervisor to the instance
network.incoming.bytes    B     The total number of bytes incoming on a network interface
network.outgoing.bytes    B     The total number of bytes outgoing on a network interface

Table 4.2: Variety of meters that can be measured with Ceilometer

In our work, we use Ceilometer in order to identify the current type of workload in the cluster and choose the appropriate type of instance to spawn upon an autoscaling decision. For example, high-CPU instances are well suited for compute-intensive applications, like image processing. Instances with more memory and more disk are more suitable for I/O-intensive applications, like database systems and memory-caching applications.
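A minimal sketch of how Ceilometer samples could drive the flavor choice upon an autoscaling decision is shown below. The flavor names mirror Table 4.3, but the thresholds and the classification rule itself are illustrative assumptions, not the thesis's actual logic.

```python
# Sketch: classify the current workload from two Ceilometer meters and
# pick a flavor for the next instance. Thresholds are illustrative.

def choose_flavor(cpu_util_pct, disk_write_bps):
    HIGH_CPU = 80.0   # percent; illustrative threshold
    HIGH_IO = 50e6    # bytes/s; illustrative threshold
    if disk_write_bps > HIGH_IO:
        return "m1.medium.moredisk"  # I/O-intensive workload
    if cpu_util_pct > HIGH_CPU:
        return "m1.medium.morecpu"   # compute-intensive workload
    return "m1.medium"               # default flavor
```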

4.4 Puppet

When new VMs are created during the autoscaling process they do not have the required information in order to find and connect with the Resource Manager. It would become laborious to log in to each machine, copy our configurations to them and then apply them. Instead, it is better to keep all of our configurations in a central location and make it possible for these machines to pull this information from the repository and apply it. Moreover, if we want to change some properties of the Hadoop .xml files, we would like to make these changes on one machine and enable the rest of the cluster machines to pull the updates from there. To enable this client-server behaviour, we used Puppet [27].

4.4.1 What is Puppet

Puppet is an open source configuration management tool developed by Puppet Labs and written in Ruby. It helps in the automation, deployment and scaling of applications in the cloud. The main objective of Puppet is to provide a powerful and expressive language and backend library that allows users to write their server automation applications in just a few lines of code. The architecture of Puppet is client/server, meaning that clients talk to one or more central servers at regular intervals, e.g. every half hour, in order to download and synchronize with the latest configuration.

4.4.2 How does Puppet work

The central server that keeps the entire configuration for the different hosts is the Puppet Master. On all the client servers where configurations are required, a daemon called the Puppet Agent is installed and running. The job of this daemon is to query and get the configuration from the puppet master server at a specific time interval.

The communication between the puppet agent and master is built in a secure encrypted channel with the support of SSL.

The following diagram shows a puppet master server that contains all the configuration options available for Host1, Host2 and Host3.

Figure 4.1: Interaction between puppet agents and puppet master [34]

In our work, we use Puppet so that newly added slave nodes can fetch data from the master node. The master node keeps a script which installs the puppet agent. That script is transferred to the new instance and executed. When the puppet agent starts running, it pulls all the Hadoop configurations from the puppet master, which is running on the master node. We have configured the interval to be every 10 minutes. The fetched information also contains the IP address of the host where the ResourceManager and NameNode operate. The newly added slave node sends a heartbeat to that address and becomes a member of the Hadoop cluster, increasing in that way its total resources. This process is demonstrated in Figure 4.2.

Figure 4.2: Node join

4.5 Workload generation

We performed our experiments with the following applications that are included in the Hadoop distribution.

• Pi: Pi is a purely computational application that employs a Monte Carlo method to calculate the value of pi. It is highly parallel, as all the map tasks are independent of each other and there is only a single reduce task that gathers very little data from the map tasks. Network traffic and storage I/O are small.

• TestDFSIO: TestDFSIO is a storage throughput test that consists of two phases. It first writes a predefined number of 100-byte rows of data to HDFS and then reads them back in.

• TeraSort: TeraSort sorts a large number of 100-byte records. It is often considered to be representative of real Hadoop workloads, as it does considerable computation, networking and storage I/O. A benchmark run consists of three steps:

1. Generate the input data via TeraGen
2. Sort the data with TeraSort
3. Validate the sorted output via TeraValidate

• Spark application: We run a Spark [35] application that takes as input the logs of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida over two months [36]. The logs are an ASCII file with one line per request, with the following columns:

1. Host making the request: a hostname when possible, otherwise the IP address if the name could not be looked up.
2. Timestamp in the format "DAY MON DD YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, and YYYY is the year.
3. Request, given in quotes.
4. HTTP reply code.
5. Bytes in the reply.

We process the logs and, after doing the necessary transformations to the data, output the average number of daily requests per host.

• Mixed Spark applications: We used a mix of three applications included in the Spark binary distribution.

1. Pi.py: The Pi application in Spark.
2. LinearRegression.scala: An example app for linear regression running on a sample synthetic dataset.
3. MovieLensALS.scala: An example app for ALS (Alternating Least Squares) matrix factorization on MovieLens data [37]. The dataset consists of two files, one with sample movie titles and one with sample user ratings. The algorithm used is based on Collaborative Filtering [38] in order to find similarities in taste between users.
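The core of the NASA-log Spark application described above can be sketched in plain Python for brevity: count requests per (host, day), then average the daily counts per host. The log-line layout follows the column description above and is an assumption about the exact format; the real application performs the equivalent transformations on Spark RDDs.

```python
# Plain-Python sketch of the NASA-log analysis: average daily requests
# per host. The line layout ("host DAY MON DD YYYY ...") is an assumed
# simplification of the actual log format.
from collections import defaultdict

def avg_daily_requests(lines):
    per_host_day = defaultdict(int)
    for line in lines:
        parts = line.split()
        # host column, then "DAY MON DD" identifies the calendar day
        host, day = parts[0], " ".join(parts[1:4])
        per_host_day[(host, day)] += 1
    totals = defaultdict(list)
    for (host, _), count in per_host_day.items():
        totals[host].append(count)
    return {h: sum(c) / len(c) for h, c in totals.items()}
```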

In order to see the autoscaling process in action we stress the Hadoop cluster by sub- mitting many applications of each type at a time.

Below is a set of graphs that verifies the usage of CPU and I/O by the different kinds of applications. The y-axis is the meter we measure with Ceilometer and the x-axis shows the time it took for the specific benchmark to finish execution.

(a) CPU usage (b) Disk usage (c) Network traffic

Figure 4.3: Pi application

(a) CPU usage (b) Disk usage (c) Network traffic

Figure 4.4: DFSIO application

(a) CPU usage (b) Disk usage (c) Network traffic

Figure 4.5: Terasort application

We submitted 40 Pi applications, one DFSIO application writing and reading 30 GB of data, and the TeraSort benchmark with 30 GB of records to sort, in a cluster of 3 instances. We then measured CPU utilization, disk write bytes, and network outgoing bytes with Ceilometer. As we can see from the graphs, all the applications do considerable computation, as they reach almost 100% CPU usage. However, disk usage is only high for DFSIO, as expected, low for Pi and moderate for the TeraSort application. Finally, network utilization is only high for TeraSort and DFSIO, due to the HDFS replication factor.

4.6 Choosing instance type

We simulated four types of virtual machines. They are Large, Medium, High-CPU and High-I/O machines. We summarize their simulation parameters in Table 4.3.

Name                 VCPUs  Ephemeral Disk  RAM
m1.small             1      20 GB           2048 MB
m1.medium            2      50 GB           5120 MB
m1.medium.moredisk   2      100 GB          5120 MB
m1.medium.morecpu    4      50 GB           5120 MB
m1.large             4      80 GB           8192 MB
m1.xlarge            8      160 GB          16384 MB

Table 4.3: Different instance flavors

As we mentioned in the previous chapter, we believe that cloud scaling actions can be improved by considering different types of instances rather than just varying the number of instances. This can be verified from the following figure.

Figure 4.6: I/O intensive applications run faster on high-I/O instances

We ran again 40 Pi applications and DFSIO with 30 GB of data to write and read. We performed this experiment using the m1.medium (hadoop.small) flavor and the m1.medium.moredisk (hadoop.small.moredisk) flavor on different cluster sizes, one at a time. The x-axis shows the number of VMs that the cluster contained in each run and the y-axis shows the time in seconds it took for the benchmark to finish. One can notice from the figure that I/O-intensive applications run faster on high-I/O instances, but this does not also apply to computational applications. For example, in a cluster of four VMs we see that DFSIO takes almost one minute less when the disk space of the instance increases from 50 GB to 100 GB. This can be explained by the reduced seek latency when looking for a free sector to write data on a bigger disk.

Chapter 5

Performance Evaluation

This chapter analyses and evaluates the correctness of our approach. Section 5.1 discusses the performance of the system with all the types of applications we tested, and Section 5.2 shows the load-balancing effect when the cluster scales out.

5.1 Performance analysis

Having presented how the system works, we conducted a series of experiments in which we tried to stress the Hadoop cluster so that autoscaling actions would be triggered, in order to evaluate and compare the performance of our elastic controller component.

In our evaluation, we simulated three types of jobs: mixed, compute-intensive and I/O-intensive. We assume mixed jobs contain both computation and I/O.

We performed our experiments with MapReduce and Spark jobs running on YARN. This was important, as we wanted to show that the effectiveness of our approach applies not only to MapReduce jobs but to any framework that can launch applications on top of YARN, such as Spark, Storm, etc.

The first experiment that we did was to submit 50 Pi applications in parallel to the Resource Manager, with and without the elastic controller enabled. The result is presented in Figure 5.1. The x-axis shows the number of the submitted applications that have already finished execution. The y-axis shows the time in seconds at which an application completed. For example, application number 1 completed in 50 seconds, application number 10 completed in 180 seconds, and so on. At any point in time, the number of applications that can run concurrently depends on the number of available containers. Thus, after 400 seconds, when the number of instances increased (increasing also the available containers), more applications started running in parallel.

Figure 5.1: Experimental performance with 50 Pi applications

From the graph above, we make two observations. First, as we can notice, when the autoscaler component is enabled it takes almost 4 minutes less for all the applications to finish. Second, the gain in performance starts being visible at time 400, when the autoscaling action is taken. Up to that moment almost half of the submitted applications have already finished. We remind here that there is a delay from the time when an autoscaling action is executed until it is effective. This delay is almost 100 seconds.

The second experiment that we performed was to test the performance when running the TestDFSIO and TeraSort benchmarks. TestDFSIO writes and reads 30 GB of data to and from the disk, while TeraSort generates and sorts 30 GB of records. Results are depicted in Figure 5.2 (red bars). In the first run, where autoscaling was disabled, it took 11.4 minutes for the experiment to finish. The cluster consisted of three slave nodes and one master node. When we enabled the elastic controller and reran the experiment, although the cluster capacity increased from three to four slave nodes, we did not observe a big impact on the performance gain. The reason is that TestDFSIO is a mostly I/O-intensive application and does not do considerable computation. TeraSort, on the other hand, contains computation, but the workload of generating and sorting 30

GB of data could be handled by three nodes, so we observed that after the autoscaling there were some containers that were not assigned to any application for executing tasks. Increasing the input size would result in better performance.

In the same figure (5.2) we present the results of another example. We ran a Spark application which takes as input the logs of all HTTP requests to the NASA Kennedy Space Center. Logs for two months were available from the NASA archive, but we generated fake data for two more months in order to extend the execution time of the application. We ran the program in yarn-cluster mode twice; the second time (when the elastic controller was enabled) it took two minutes less to finish, as the figure indicates.

Figure 5.2: Experimental performance with DFSIO-Terasort and Spark Application

In the last experiment we submitted a mix of 50 Spark applications, again with and without autoscaling. The workload consisted of 30 Pi applications, 10 LinearRegression jobs, and 10 MovieLensALS jobs. The input for the last two jobs was the sample datasets included in the Spark distribution. It is important to mention here that all of these jobs contain a lot of computation. The result is presented in Figure 5.3. Once again, the x-axis shows the number of the submitted applications that have already finished execution and the y-axis shows the time in seconds at which an application completed. As is shown in the graph, completion times from the 20th application onwards drop, resulting in the benchmark finishing 4 minutes faster than in the first run. We observe that our approach is consistent and effective not only for MapReduce jobs but for Spark applications as well. That proves the effectiveness of the implemented component with any kind of application that uses YARN as the cluster Resource Manager.

Figure 5.3: Experimental performance with 50 Spark applications

5.2 Autoscaling effect

In this section we show the load-balancing effect of an autoscaling decision, resulting from the addition of extra instances to the hadoop cluster. Figure 5.4 depicts the CPU utilization of the VMs during the autoscaling action. This figure is related to Figure 5.1, where we submitted 50 Pi applications to the Resource Manager. The x-axis shows the time in seconds and the y-axis the percentage of CPU utilization. Initially, the cluster consisted of 3 instances, each one consuming more than 80% of its CPU. Due to the large number of applications pending in the queue, an autoscaling decision was taken to boot up three more instances and join them to the cluster. As we can see, this happened 360 seconds after the first jobs started executing. From time 360 until 600 we see a slight decrease in the CPU utilization of the first three VMs, while the newly added machines contribute equally to the total workload, using more than 60 but less than 80 percent of their CPU. As we mentioned in Chapter 2, resources in a hadoop cluster are distributed in the form of containers. Depending on its capacity, a node splits its resources into containers. If all the containers are kept busy executing tasks, then that node will indicate high CPU and memory utilization. This is denoted in Figure 5.4. After the autoscaling action, the first three VMs released some containers but kept utilizing most of them, while the rest of the machines used slightly more than half of their total resources for running tasks.
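The container accounting described above can be sketched numerically. The node and container sizes below are illustrative assumptions, not the configuration used in our experiments; the point is only how a fixed workload spreads more thinly once extra nodes join the cluster.

```python
# Sketch of how a node's capacity maps to containers and how adding
# nodes redistributes a fixed workload. All sizes are illustrative.

def containers_per_node(node_mem_mb, node_vcores,
                        container_mem_mb=1024, container_vcores=1):
    """A node splits its resources into containers; the smaller of the
    memory-bound and vcore-bound counts is what it can actually host."""
    return min(node_mem_mb // container_mem_mb,
               node_vcores // container_vcores)

def cluster_utilization(running_tasks, nodes,
                        node_mem_mb=8192, node_vcores=4):
    """Fraction of the cluster's containers kept busy by running tasks."""
    total = nodes * containers_per_node(node_mem_mb, node_vcores)
    return min(running_tasks, total) / total

# Before scaling: 3 nodes fully busy; after scaling to 6 nodes the same
# workload occupies only half of the capacity, mirroring the drop in
# per-VM CPU utilization seen in Figure 5.4.
print(cluster_utilization(running_tasks=12, nodes=3))  # 1.0
print(cluster_utilization(running_tasks=12, nodes=6))  # 0.5
```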

Figure 5.4: Autoscaling effect from three to six VMs

Chapter 6

Conclusions

In this final chapter we present the conclusions of this thesis. Section 6.1 reviews the work done in implementing the presented elastic controller component, while in Section 6.2 we discuss possible extensions to this work.

6.1 Conclusion

This report discussed the limitations of current cloud monitoring and autoscaling platforms, which result in poor scheduling decisions when running data analytics jobs. The main reason is that the Cloud/Infrastructure scheduler is not aware of the metrics from the analytics framework layer, which we consider to be important. Several such metrics can now be monitored, including the queue size of pending applications, lost nodes, and lost containers. The implementation in this work focuses on developing a model where metrics from both sides are combined in order to exploit the cloud autoscaling capabilities, achieve the best performance, and maximise hardware and resource utilisation in a cloud environment where data analytics jobs are running. To this end, an autoscaling model was proposed in Chapter 3, which resulted in the integration of a new component into the YARN Resource Manager. The implementation details of the solution, along with the technologies used in this work, were discussed in Chapter 4.
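As an illustration of the combined model, the following sketch shows a scale-out decision that consults metrics from both layers. The metric names, thresholds, and combination logic are hypothetical, chosen only to illustrate the idea, and are not the exact rules implemented in the elastic controller.

```python
# Hedged sketch: a scale-out decision combining analytics-layer metrics
# (from the Resource Manager) with infrastructure-layer metrics (from
# the cloud monitor). Names and thresholds are illustrative only.

def should_scale_out(yarn_metrics, infra_metrics,
                     queue_threshold=10, cpu_threshold=0.8):
    pending = yarn_metrics["pending_apps"]  # queue size of pending applications
    lost = yarn_metrics["lost_nodes"]       # nodes the framework considers lost
    avg_cpu = infra_metrics["avg_cpu"]      # mean CPU utilization of cluster VMs

    # Scale out when the analytics layer reports a backlog or lost capacity
    # AND the infrastructure confirms the existing VMs are saturated.
    return (pending > queue_threshold or lost > 0) and avg_cpu > cpu_threshold

print(should_scale_out({"pending_apps": 25, "lost_nodes": 0},
                       {"avg_cpu": 0.9}))  # True
```

Requiring agreement between the two layers is the key design point: a long queue alone might be absorbed by idle containers, and high CPU alone might be transient, but together they signal genuine demand for more instances.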

The results indicate that with our elastic controller we can exploit the on-demand scaling property of the cloud and expand a hadoop cluster easily and in a short period of time when data and computation grow, thereby reducing the average response time of submitted applications. In addition, we can shrink a hadoop cluster by shutting down unnecessary instances when the workload is light, allowing other users of a shared cluster to make use of these resources. Finally, there are different classes of jobs, such as compute-intensive and I/O-intensive jobs. As we saw from the obtained results, a job class has different processing times on different instance types. It is therefore important not to choose a fixed instance type when scaling up, but the one that is most appropriate for the specific type of workload at that moment.

6.2 Future work

In this work, we use a reactive strategy for triggering auto-scaling operations. That is, if at time t a monitored metric is above or below a certain threshold, an auto-scaling operation is triggered. One extension we plan for future work is to add a proactive feature to the autoscaling algorithm so that it can scale in advance. In the proactive strategy, we estimate the utilization level of the system at a future time and take auto-scaling decisions accordingly.
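A minimal sketch of the reactive rule described above, alongside one possible proactive variant. The thresholds and the linear-extrapolation predictor are illustrative assumptions; a real proactive controller would likely use a more robust forecasting model.

```python
# Reactive rule: decide on the current metric value against thresholds.
def reactive_decision(current, high=0.8, low=0.3):
    if current > high:
        return "scale_out"
    if current < low:
        return "scale_in"
    return "hold"

# Proactive variant (sketch): extrapolate utilization `horizon` seconds
# ahead from the last two (time, utilization) samples and decide on the
# predicted value instead of the current one.
def proactive_decision(history, horizon=60, high=0.8, low=0.3):
    (t0, u0), (t1, u1) = history[-2], history[-1]
    slope = (u1 - u0) / (t1 - t0)
    predicted = u1 + slope * horizon
    return reactive_decision(predicted, high, low)

# A reactive controller holds at 0.7, while a rising trend lets the
# proactive variant scale out one minute early.
print(reactive_decision(0.7))                                  # hold
print(proactive_decision([(0, 0.5), (60, 0.7)], horizon=60))   # scale_out
```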

When we ran our experiments with up to seven instances, they resided on different physical hosts. The physical location of the VMs may be a factor affecting performance, if we consider that two VMs may need to transfer a lot of data between themselves to accomplish a task. In OpenStack, the impact of physical VM location on performance can be summarized as follows [39]:

VM Location                                  Performance Impact
Same physical core                           Fastest
Same blade server, different blades          Faster
Different blade servers, same subnet         Slower
Different blade servers, different subnets   Slowest

Table 6.1: Impact on performance of the physical VM location

Thus, an interesting test would be to launch all the VMs that form a hadoop cluster on the same physical host, perform the same series of experiments, and then compare the results to see the potential performance gain.

Finally, as we mentioned in the previous chapter, we used the Capacity Scheduler of YARN. This scheduler was designed to allow many organizations to share a large cluster while ensuring capacity guarantees. In our experiments, there is only one user that submits jobs. Therefore, all the pending applications are held in the same queue, and an autoscaling action is taken depending on the size of that queue. However, if many users shared the same hadoop cluster, resources from queues running below their capacity could be allocated to any queue beyond its capacity. In that case, our elastic controller should take autoscaling actions only if there is demand for resources from every queue in the system.

Bibliography

[1] Apache Software Foundation. What is apache hadoop. 2015. URL https://hadoop.apache.org/.

[2] Gigaom Research. Gigaom research. 2015. URL https://gigaom.com/.

[3] Andrew Brust. Big data analytics in the cloud: The enterprise wants it now. November 2014. URL http://research.gigaom.com/2014/11/big-data-analytics-in-the-cloud-the-enterprise-wants-it-now/.

[4] Charlie Oppenheimer and Matrix Partners. Which is less expensive: Amazon or self-hosted? February 2012. URL https://gigaom.com/2012/02/11/which-is-less-expensive-amazon-or-self-hosted/.

[5] Peter Mell and Timothy Grance. The nist definition of cloud computing. National Institute of Standards and Technology, pages 1–20, September 2011. URL http://link.aip.org/link/?RSI/62/1/1.

[6] Alexander Lenk, Markus Klems, Jens Nimis, Stefan Tai, and Thomas Sandholm. What’s inside the cloud? an architectural map of the cloud landscape. pages 23–31, November 2009. URL http://wweb.uta.edu/faculty/sharmac/courses/cse6331/current-offering/CC/papers/lenk2009-cc.pdf.

[7] Ephraim Baron. Aren’t virtualization and cloud the same thing? November 2011. URL https://blog.equinix.com/2011/11/arent-virtualization-and-cloud-the-same-thing.

[8] Rolf Harms and Michael Yamartino. The economics of the cloud. November 2010. URL http://news.microsoft.com/download/archived/presskits/cloud/docs/the-economics-of-the-cloud.pdf.

[9] Zach Hill and Marty Humphrey. A quantitative analysis of high performance computing with amazon’s ec2 infrastructure: The death of the local cluster? October 2009. URL http://www.cs.virginia.edu/~humphrey/papers/QuantitativeAnalysisEC2.pdf.

[10] Kevin L. Jackson and Cary Landis. Platform as a service (paas). January 2012. URL http://www.fedplatform.org/wp-content/uploads/2012/01/NJVC-Virtual-Global-PaaS-White-Paper.pdf.

[11] appcore. 3 types of cloud service models. 2015. URL http://www.appcore.com/3-types-cloud-service-models/.

[12] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the clouds: A berkeley view of cloud computing. February 2009. URL https://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf.

[13] Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. Cloudcmp: Comparing public cloud providers. November 2010. URL https://www.cs.duke.edu/~angl/papers/imc10-cloudcmp.pdf.

[14] CloudTweaks. The 4 primary cloud deployment models. July 2012. URL http://cloudtweaks.com/2012/07/4-primary-cloud-deployment-models/.

[15] Hadoop yarn. 2015. URL http://hortonworks.com/hadoop/yarn/.

[16] M. Tim Jones and Micah Nelson. Moving ahead with hadoop yarn. July 2013. URL http://www.ibm.com/developerworks/library/bd-hadoopyarn/.

[17] Aparna Raj, Kamaldeep Kaur, Uddipan Dutta, V Venkat Sandeep, and Shrisha Rao. Enhancement of hadoop clusters with virtualization using the capacity scheduler. December 2012.

[18] VMware. A benchmarking case study of virtualized hadoop performance on vmware vsphere 5. October 2011. URL http://www.mellanox.com/pdf/case_studies/VMW-Hadoop-Performance-vSphere5.pdf.

[19] M Kontagora and H Gonzalez-Velez. Benchmarking a mapreduce environment on a full virtualization platform. February 2010.

[20] Amazon. Amazon cloudwatch. 2015. URL http://aws.amazon.com/cloudwatch/.

[21] Amazon. Amazon ec2. 2015. URL http://aws.amazon.com/ec2/.

[22] Amazon. Amazon emr. 2015. URL http://aws.amazon.com/elasticmapreduce/.

[23] Amazon. Amazon s3. 2015. URL http://aws.amazon.com/s3/.

[24] Chin-Fah Heoh. A cloud economy emerges ... somewhat. November 2011. URL http://storagegaga.com/a-cloud-economy-emerges-somewhat/.

[25] Openstack. Openstack: The open source cloud operating system. 2015. URL https://www.openstack.org/software/.

[26] Tania Lorido-Botran, Jose Miguel-Alonso, and Jose A. Lozano. A review of auto-scaling techniques for elastic applications in cloud environments. October 2014.

[27] PuppetLabs. What is puppet? 2015. URL https://puppetlabs.com/puppet/what-is-puppet.

[28] Richard S. Sutton and Andrew G. Barto. Introduction to reinforcement learning. 1998. URL http://neuro.bstu.by/ai/RL-3.pdf.

[29] RightScale. Rightscale cloud management. 2015. URL http://www.rightscale.com.

[30] Shicong Meng, Ling Liu, and Vijayaraghavan Soundararajan. Tide: Achieving self-scaling in virtualized datacenter management middleware. 2010. URL http://www.cc.gatech.edu/~lingliu/papers/2010/tide-middleware2010.pdf.

[31] Jeremy Unruh. Openstack4j. 2015. URL http://www.openstack4j.com/.

[32] Openstackindia. Openstack ceilometer. October 2012. URL http://www.slideshare.net/openstackindia/openstack-ceilometer.

[33] Ruslan Kiyanchuk. Openstack metering using ceilometer. July 2013. URL https://www.mirantis.com/blog/openstack-metering-using-ceilometer/.

[34] Sudhi Seshachala. Automation, provisioning and configuration management with puppet. April 2015. URL http://devops.com/2015/04/16/automation-provisioning-configuration-management-puppet/.

[35] Apache. Spark. 2015. URL https://spark.apache.org/.

[36] NASA. Host requests logs. 1995. URL http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

[37] MovieLens. Movielens. 2015. URL https://movielens.org/.

[38] Wikipedia. Collaborative filtering. 2015. URL https://en.wikipedia.org/wiki/Collaborative_filtering.

[39] admin. Openstack simply explained. November 2013. URL http://gonorthforge.com/november-14-2013-openstack-simply-explained/.