KTH Royal Institute of Technology

Master Thesis

Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud

Author: Nikolaos Stanogias
Supervisor: Ignacio Mulas Viela

A thesis submitted in fulfilment of the requirements for the degree of Software Engineering of Distributed Systems

July 2015

TRITA-ICT-EX-2015:154


Declaration of Authorship

I, Nikolaos Stanogias, declare that this thesis titled, 'Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud' and the work presented in it are my own. I confirm that:

- This work was done wholly or mainly while in candidature for a research degree at this University.
- Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
- Where I have consulted the published work of others, this is always clearly attributed.
- Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
- I have acknowledged all main sources of help.
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


"Thanks to my solid academic training, today I can write hundreds of words on virtually any topic without possessing a shred of information, which is how I got a good job in journalism."

Dave Barry


KTH ROYAL INSTITUTE OF TECHNOLOGY

Abstract

Faculty Name
Software Engineering of Distributed Systems
Masters Degree

Combining analytics framework and Cloud schedulers in order to optimise resource utilisation in a distributed Cloud

by Nikolaos Stanogias

Analytics frameworks were initially created to run on bare-metal hardware, so they contain scheduling mechanisms to optimise the distribution of CPU load and data allocation. Generally, the scheduler is part of the analytics framework's resource manager. Different resource managers are used in the market and the open-source community, serving different analytics frameworks: for example, Spark was initially built on Mesos, Hadoop now uses YARN, and Spark is also available as a YARN application. On the other hand, cloud environments (like OpenStack) contain their own mechanisms for distributing resources between users and services. While analytics applications are increasingly being migrated to the cloud, the scheduling decisions for running an analytics job are still made in isolation between the different scheduler layers (Cloud/Infrastructure vs. analytics resource manager). This can seriously impact the performance of analytics or other services running jointly on the same infrastructure, as well as limit load-balancing and autoscaling capabilities. This master thesis identifies the scheduling decisions that should be taken at the different layers (Infrastructure, Platform and Software), as well as the metrics required from the environment when multiple schedulers are used, in order to get the best performance and maximise resource utilisation.


Acknowledgements

First, I would like to thank my main supervisor Ignacio Mulas Viela for his constant support, motivation and the passion that he transmitted to me during the process of this work.
Many thanks to my other supervisors Nicola Seyvet and Tony Larsson for the advice, comments, help and valuable insights they gave me during the development of this thesis. I also thank my examiner, associate professor Jim Dowling, for all the valuable knowledge, inspiration and encouragement he gave me throughout the last year and for his willingness to provide help when needed for this work. Finally, I would like to thank my family for their financial and psychological support over the last two years, and my friends who helped me in difficult moments and gave me the opportunity to share and celebrate my successes with them.


Contents

Declaration of Authorship
Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Methodology
    1.3.1 Method and Research Approach
    1.3.2 Data Collection and Analysis
  1.4 Contribution
  1.5 Document structure

2 Background
  2.1 Cloud Computing
  2.2 Cloud Service Models
    2.2.1 Infrastructure as a Service (IaaS)
    2.2.2 Platform as a Service (PaaS)
    2.2.3 Software as a Service (SaaS)
  2.3 IaaS Cloud Deployment Models
    2.3.1 Public Cloud
    2.3.2 Private Cloud
    2.3.3 Hybrid Cloud
  2.4 Hadoop 2.0 - YARN
  2.5 Virtualizing Hadoop
  2.6 Related work
    2.6.1 AWS CloudWatch
    2.6.2 Amazon EMR

3 Design Overview
  3.1 Introduction to Openstack
    3.1.1 Openstack Architecture
  3.2 Autoscaling model
    3.2.1 Autoscaling algorithm
    3.2.2 Scaling the cluster out
    3.2.3 Scaling the cluster in

4 Implementation
  4.1 Environment and Cloud setup
  4.2 Openstack4j
  4.3 Ceilometer
  4.4 Puppet
    4.4.1 What is Puppet
    4.4.2 How does Puppet work
  4.5 Workload generation
  4.6 Choosing instance type

5 Performance Evaluation
  5.1 Performance analysis
  5.2 Autoscaling effect

6 Conclusions
  6.1 Conclusion
  6.2 Future work

Bibliography


List of Figures

1.1 Gigaom Research Data Warehousing Survey
1.2 Average traffic distribution
2.1 Cloud service models
2.2 The new architecture of YARN
3.1 Openstack Architecture
3.2 Scale out
3.3 Scale in
4.1 Interaction between puppet agents and puppet master
4.2 Node join
4.3 Pi application
4.4 DFSIO application
4.5 Terasort application
4.6 I/O intensive applications run faster on high-I/O instances
5.1 Experimental performance with 50 Pi applications
5.2 Experimental performance with DFSIO-Terasort and Spark application
5.3 Experimental performance with 50 Spark applications
5.4 Autoscaling effect from three to six VMs


List of Tables

4.1 System services running on Nodes
4.2 Variety of meters that can be measured with Ceilometer
4.3 Different instance flavors
6.1 Impact on performance of the physical VM location


Chapter 1

Introduction

We begin this chapter by looking at the motivation, which explains the reasons for choosing this area of research and diving deeper into it. The problem statement defines the problem that we aim to solve with this thesis. We outline the major goals of the thesis, explain their importance in the context of maximised resource utilisation and resource provisioning, and discuss the methodology that we followed when implementing this work. Finally, we discuss the most notable contributions of this thesis and the structure of this document.

1.1 Motivation

In recent years there has been a significant explosion of data stored worldwide, increasing continuously at an exponential rate. Individual companies and organizations often have petabytes or more of data, including business information that is crucial to continued growth and success. However, the amount of data is often too large to store and process using traditional relational database systems, or the data is in unstructured forms inappropriate for structured schemas, or the hardware needed to analyse this huge volume of data is too costly. The need to process this avalanche of Big Data gave rise to Apache Hadoop [1], an open source software framework that pioneered new ways of storing and processing data. Instead of relying on expensive hardware, Hadoop can be installed on a cluster of commodity machines that communicate and work together to store and process huge amounts of data in parallel. More than that, it allows scaling out to hundreds or thousands of nodes as the data and processing demands grow, and it can automatically recover from partial failure of servers.

Hadoop clusters were designed for storing and analyzing