Fachhochschule Wiesbaden
Fachbereich Design Informatik Medien
Studiengang Informatik

Master's thesis submitted in fulfilment of the requirements for the academic degree Master of Science (M.Sc.)

Design and Evaluation of Self-Management Approaches for Virtualization-Based Environments

submitted by Dan Marinescu on 29 February 2008

First examiner (Referent): Prof. Dr. Reinhold Kröger
Second examiner (Korreferent): Prof. Dr. Steffen Reith

Declaration pursuant to BBPO, section 6.4.2

I declare that I have written this Master's thesis independently and have used no aids other than those indicated.

Wiesbaden, 29.02.2008 Dan Marinescu

I hereby declare my consent to the following forms of distribution of this Master's thesis (yes/no):

Placing the thesis in the library of the FHW: √
Publication of the title of the thesis on the Internet: √
Publication of the thesis on the Internet: √

Wiesbaden, 29.02.2008 Dan Marinescu

Contents

1 Introduction 1

2 Background 5
2.1 Virtualization ...... 5
2.1.1 Taxonomy ...... 5
2.1.2 Case Study: Xen ...... 10
2.1.3 Live Migration ...... 13
2.1.4 Hardware-Assisted Virtualization ...... 16
2.1.5 Management of Virtual Machine Environments ...... 17
2.2 Service Level Management ...... 20
2.3 Autonomic Computing ...... 21
2.3.1 Motivation ...... 21
2.3.2 Taxonomy ...... 22
2.3.3 Architectural Considerations ...... 23
2.3.4 Examples of Autonomic Computing Systems ...... 24
2.4 Complexity Theory and Optimization ...... 25
2.4.1 Introduction to Complexity Theory ...... 25
2.4.2 Complexity Classes ...... 25
2.4.3 Optimization Problems ...... 26
2.4.4 The Knapsack Family of Problems ...... 27
2.4.5 Approximation Algorithms ...... 29
2.4.6 Heuristics ...... 32

3 Analysis 37
3.1 State of the Art ...... 37
3.1.1 Related Academic Work ...... 38
3.1.2 Limitations of the Previous Work ...... 42
3.1.3 Commercial Solutions ...... 43
3.2 Architectural Considerations ...... 44
3.2.1 A Formal Model of the Managed Infrastructure ...... 44
3.2.2 Framework Requirements ...... 46
3.3 The Resource Allocation Problem ...... 47
3.3.1 Controller Requirements ...... 47
3.3.2 Approaches ...... 48
3.3.3 Heuristics ...... 49
3.3.4 Approximation Algorithms ...... 51
3.4 A Bottleneck Determination Mechanism ...... 52
3.4.1 The Expert Systems Approach ...... 53
3.4.2 The Time Series Analysis Approach ...... 55
3.5 The Overprovisioning Problem ...... 56

4 Design 59
4.1 Architecture of the Framework ...... 59
4.1.1 The Basic Concept ...... 59
4.1.2 The Physical Manager ...... 61
4.1.3 The VM Manager ...... 63
4.1.4 The Cluster Manager ...... 64
4.1.5 The Complete Architecture ...... 66
4.1.6 The Intra-Manager Communication ...... 66
4.1.7 The Inter-Manager Communication ...... 67
4.2 The Bottleneck Determination Mechanism ...... 69
4.3 The Controller ...... 70
4.3.1 Algorithm A ...... 71
4.3.2 Algorithm B ...... 73

5 Implementation 79
5.1 Software Configuration of the Cluster ...... 79
5.2 The Framework ...... 80
5.2.1 The Self-Manager Backbone ...... 80
5.2.2 The Physical Manager ...... 83
5.2.3 The VM Manager ...... 93
5.2.4 The Cluster Manager ...... 98
5.2.5 The Inter-Manager Communication ...... 99
5.3 The Bottleneck Determination Mechanism ...... 106
5.4 The Cluster Controller ...... 111
5.4.1 Algorithm A ...... 112
5.4.2 Algorithm B ...... 113
5.5 Code Statistics ...... 118

6 Evaluation 119
6.1 The Testbed ...... 119
6.2 Evaluation of the Management Framework ...... 120
6.3 Evaluation of the Cluster Controllers ...... 122

7 Summary 125

8 Bibliography 129

List of Figures

2.1 Hosted Architecture ...... 7
2.2 OS-Layer Virtualization ...... 8
2.3 Hardware-Layer Virtualization ...... 9
2.4 The structure of a machine running Xen [BDF+03] ...... 13
2.5 Xen Live Migration [CFH+05] ...... 15
2.6 An exemplary lifecycle of a hardware-assisted VMM [Int05] ...... 16
2.7 The Xen Management Architecture [Mel07] ...... 18
2.8 Autonomic Element [KC03] ...... 24
2.9 Complexity Classes ...... 26
2.10 Search Space Example ...... 34
2.11 The GA approach ...... 36

3.1 Example of local search through the search space ...... 49
3.2 Crossover example ...... 51
3.3 Operation mode of an expert system ...... 54
3.4 Time series example ...... 56

4.1 The basic concept ...... 61
4.2 The Physical Manager ...... 62
4.3 The VM Manager ...... 63
4.4 The Cluster Manager ...... 64
4.5 The complete architecture of the framework ...... 67
4.6 Modular Self-Manager architecture ...... 68
4.7 Exemplary inter-manager and intra-manager communication ...... 69
4.8 The internal architecture of the VM Logic module ...... 70
4.9 The internal architecture of the Cluster Logic module ...... 70

4.10 Exemplary limitation of Algorithm A ...... 73
4.11 Algorithm B visual example ...... 77

5.1 The complete architecture of the framework ...... 81
5.2 The architecture of the Self-Manager ...... 81
5.3 The Implementation of the Physical Manager ...... 84
5.4 Using the Xen API to access the Xen ...... 90
5.5 The Implementation of the VM Manager ...... 94
5.6 The Implementation of the Cluster Manager ...... 98
5.7 VMProblemSolver UML class diagram ...... 100
5.8 The VM Manager-to-Cluster Manager communication ...... 101
5.9 ClusterMonitorUpdater UML class diagram ...... 103
5.10 The Physical Manager-to-Cluster Manager communication ...... 104
5.11 ClusterRequestProcessor UML class diagram ...... 105
5.12 The Cluster Manager-to-Physical Manager communication ...... 106

6.1 The testbed ...... 120
6.2 Empirical observations ...... 121
6.3 Influence of available memory on SLO violations ...... 122
6.4 A scenario comprising two physical machines ...... 123
6.5 A scenario comprising three physical machines ...... 124

List of Tables

4.1 Exemplary VM-to-PM encoding of two parent-child states ...... 74

5.1 Code statistics ...... 118

List of Listings

5.1 Configuration file example ...... 82
5.2 The Module interface ...... 82
5.3 The properties of the Physical Manager Monitor module ...... 85
5.4 Extract from PMMonitor.java, showing how the module properties are loaded ...... 85
5.5 Extract from PMMonitor.java, showing how the parameters to be monitored are sent to the Xen module ...... 86
5.6 Extract from PMMonitor.java, showing how the messages received from the Xen module are processed ...... 87
5.7 The properties of the Physical Manager Actuator module ...... 88
5.8 The setMemoryVM() method from PMActuator.java ...... 88
5.9 The properties of the Physical Manager Xen module ...... 89
5.10 Extract from the XenManager Python class ...... 91
5.11 Extract from the XenHelper Java class ...... 91
5.12 Extract from XenModule.java, showing how the received messages are processed ...... 92
5.13 The properties of the SL Monitor module ...... 94
5.14 The subscribeServiceParameters() method of the SL Monitor ...... 95
5.15 Extract from VMActuatorModule.java, showing how a received message is processed ...... 97
5.16 Extract from ClusterMonitor.java, showing how a message is sent to the Cluster Logic ...... 99
5.17 The giveMemory() method ...... 102
5.18 Instantiating and registering a remote object ...... 103
5.19 Updating the global resource view ...... 104
5.20 Delegating a task to a Physical Manager ...... 106
5.21 Parameters loaded by the VM Logic module ...... 107

5.22 Updating a TomcatParameter bean ...... 107
5.23 Calling Jess from inside the VM Logic module ...... 108
5.24 The Jess rule ...... 109
5.25 The configuration parameters for the Cluster Logic ...... 111
5.26 Algorithm A ...... 112
5.27 Extract from ClusterState.java ...... 114
5.28 The startKbfsAlgorithm() method ...... 116

Chapter 1

Introduction

The first computing systems were large and expensive to operate. Due to an increasing demand for usage, these computers evolved into time-sharing systems (around 1965), which permitted multiple applications to run simultaneously. Yet this also allowed a faulty application to crash the entire system. The need to increase the reliability of such systems led to the appearance of the virtual machine monitor (VMM) in the late 1960s. Also known as a hypervisor, a VMM is a software indirection layer between the hardware on one side and the operating system on the other side. Back then, the VMM was primarily used to provide isolation between applications. Probably the best-known system of that time using a VMM was VM/370, developed by IBM. The 1970s were a flourishing period for virtualization technology, but the 1980s and 1990s brought a drop in hardware prices, which meant that it was now affordable to host one application per computing system. While in the 1970s computer architectures were virtualization-aware [PG74], the replacement of mainframes with minicomputers and PCs gave birth to new computer architectures that did not offer the necessary hardware support for VMMs [RG05]. As a result, by the late 1980s, VMMs had become history. In the 1990s, VMMs were brought back to light by researchers at Stanford University, in a research project that led to the birth of VMware Inc. Nowadays, virtualization is mainly used to reduce costs by consolidating several low-utilized server boxes onto a single server system, with almost every big name in the IT industry being involved in a virtualization project. With VMware leading the market, vendors like SWsoft and XenSource1 offer competitive alternatives. Also, long-established vendors like Sun2 and Oracle have recently joined the market with their own Xen-based virtualization

1 recently acquired by Citrix
2 note that Sun already provided a virtualization solution known as Solaris Zones or Containers

solution. Not only are virtualization products being developed for the IA-32 architecture, but the architecture itself is being extended by both Intel (VT) and AMD (SVM) to support virtualization.

Besides reducing costs, it is the way maintenance and system management are simplified that contributes to the fast adoption of virtualization in the data center. These factors have grown in importance over the past years, with system administrators avoiding products that fail to provide rich management capabilities. Yet it slowly became clear to the research community that the ever increasing complexity of computing systems will eventually lead to systems that are impossible for human beings to manage [KC03]. Even today, managing computing systems is expensive and prone to errors. This indicates that a change in the paradigm used to design computing systems needs to take place.

The solution to this problem is to develop systems that are capable of managing themselves, called autonomic computing systems or self-* systems. This solution is inspired by biological systems existing in nature, particularly the human autonomic nervous system (ANS). The ANS is responsible for many low-level, yet vital body functions like heart rate or body temperature, without conscious control. Similarly, given high-level goals, computing systems should be able to manage themselves, with little or no human interaction.

Virtualization significantly increases the overall complexity of computing systems. Applications are now hosted inside virtual machines, which are themselves hosted on physical machines. This increased complexity and the dynamic nature of virtualization-based data centers justify the need for the introduction of self-management approaches in these environments. Although it increases the overall complexity, virtualization also offers a great basis for self-management. Features like the allocation of resources (memory, CPU shares and others) at run-time and live migration of virtual machines are powerful control knobs that can be used by autonomic systems. Advanced monitoring and debugging are also possible due to the introduction of an indirection layer between the hardware and the operating system. What is missing is an intelligent controller, capable of taking decisions based on data obtained through monitoring, and then transforming these decisions into tasks executed by means of the previously mentioned control knobs. The goal of this thesis is to design and evaluate various approaches for such an intelligent controller. The remainder of this thesis is organized as follows:


In Chapter 2, background information with respect to virtualization, service level management and autonomic computing is provided, as well as a short introduction to complexity theory and optimization. This background information is necessary to understand the chapters that follow.

Next, in Chapter 3, a state-of-the-art analysis with respect to the autonomic management of virtualization-based environments is presented. Based on this analysis, a set of requirements is defined for this thesis. Finally, an analysis of approaches that can be used to fulfil these requirements is presented at the end of the chapter.

Following this analysis, the design of a self-management framework is presented in Chapter 4. This framework is based on the principles of the autonomic computing paradigm and uses a service level management approach. Besides the management framework, the design of three "intelligent" components to be used in the framework is described. Both the framework and the "intelligent" components are based on the analysis performed in the previous chapter and fulfil the requirements defined there.

In Chapter 5, the implementation of this design is presented. While the design is technology-independent, the implementation is Java-based and uses Xen as the virtualization technology.

The concept and the implementation are subsequently evaluated in Chapter 6. For this purpose, a testbed was set up in which the "intelligent" components and the framework as a whole were tested.

The thesis is concluded in Chapter 7, where a summary is presented and indications are provided as to how the work of this thesis can be extended in the future.


Chapter 2

Background

2.1 Virtualization

This section provides background information regarding virtualization, starting with a taxonomy and followed by an in-depth analysis of Xen, an open-source virtualization solution.

2.1.1 Taxonomy

Smith and Nair proposed a taxonomy in which a distinction is made between process-level and system-level virtualization [SN05]. While with process virtualization a virtual platform is created that executes a single process, system virtualization provides a complete environment that supports an operating system together with multiple user processes. For the purpose of this thesis, the focus will be on system-level virtualization. From now on, unless otherwise specified, by virtualization we understand system-level virtualization. In this context, virtualization is commonly defined as a technology that introduces a software abstraction layer between the hardware on one side, and the operating system and the applications running on top of it on the other side. This abstraction layer is called virtual machine monitor (VMM) or hypervisor and basically hides the physical resources of the computing system from the operating system (OS). Since the hardware resources are directly controlled by the VMM and not by the OS, it is possible to run multiple (possibly different) OSs in parallel on the same hardware. As a result, the hardware platform is partitioned into one or more logical units called virtual machines (VMs). The following requirements have been adapted from the original requirements for a VMM defined by Popek and Goldberg [PG74]:


• Equivalence: Running an application inside a virtual machine must be equivalent to running the same application on the underlying hardware.

• Control: The VMM must control and synchronize the access of VMs to hardware resources.

• Isolation: VMs must be isolated from each other with the purpose of ensuring stability (the crash of a VM should not affect other VMs), security (a possibly compromised VM shouldn’t grant access to other VMs) and data consistency.

• Performance: The performance overhead caused by virtualization should be mini- mal, close to "bare metal" performance.

• Encapsulation: VMs must exist in a form (i.e. a file or a partition) which allows easy migration or cloning of the VM.

There are different approaches to virtualization, and there are different ways to classify them. From an architectural point of view, however, the various virtualization approaches can be categorized into:

• Hosted Virtual Machine Architecture

• OS-Layer Virtualization

• Hardware-Layer Virtualization

Note that for the purpose of this thesis, the focus will be entirely on the Intel IA-32 (x86) architecture.

Hosted Architecture

In this approach, the VMM runs on top of a host operating system, commonly as an application in user space. As a result, the applications and the guest OS of a VM run on top of virtual hardware provided by the VMM. This architecture can be observed in Figure 2.1 and is commonly used in products targeting the desktop market. VMware Workstation1, Parallels2 and Virtual PC3 are commercial products which use this technique.

1 http://www.vmware.com/products/ws/
2 http://www.parallels.com/
3 http://www.microsoft.com/windows/products/winfamily/virtualpc/default.mspx


Figure 2.1: Hosted Architecture

The main advantage of this approach is that it is very easy to use. A common user can install a software product like VMware Workstation just like any other software product on his or her operating system4. Inside VMware Workstation, a guest OS5 can be installed and used just as if it were running directly on hardware. The main disadvantage of this approach is its poor performance, which can be up to 30% lower than when running directly on hardware. Since usability is more important than performance in the desktop market, it is easy to understand why products using this approach have been so successful in the desktop sector. However, when it comes to servers, performance plays a key role, with ease of use and installation being less relevant. This is why the server market requires a different virtualization solution.

OS-Layer Virtualization

Also known as Single Kernel Image (SKI) or container-based virtualization, this concept implements virtualization by running several instances of the same OS in parallel. This means that it is not the hardware but the host OS that is being virtualized. The resulting VMs all use the same virtualized OS image. This architecture is presented in Figure 2.2. Here, the virtualized OS image is called the virtualization layer. Among products that use this approach are Virtuozzo6 and its open source variant

4 currently, VMware Workstation supports Microsoft Windows and several Linux distributions as host operating systems
5 Microsoft Windows, Linux, Solaris x86, Netware and FreeBSD are supported
6 http://www.swsoft.com/en/products/virtuozzo/



Figure 2.2: OS-Layer Virtualization

OpenVZ7, Solaris Zones (Containers)8, BSD Jails9 and Linux VServer10. All these products are commonly used in Web hosting, high performance computing (HPC) clusters and grid computing. This thin architecture eases the administration of the system, but has a big drawback: since the VMs use the same kernel as the host OS, the guest OS must be the same as the host OS. As such, it is not possible to run e.g. Windows on top of Linux.

Hardware-Layer Virtualization

This approach is commonly used in the server market due to its high virtual machine isolation and performance. Here, the VMM runs directly on the hardware, controlling and synchronizing the access of the guest OSs to the hardware resources. Figure 2.3 depicts this architecture. VMware ESX Server11 and Xen12 are the two main competing VMMs that use this approach. Since the x86 architecture was not developed with virtualization in mind and is thus not what Popek and Goldberg would refer to as a virtualizable architecture [PG74], new techniques had to be developed to implement CPU virtualization.

7 http://openvz.org/
8 http://www.sun.com/bigadmin/content/zones/
9 http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/jails.html
10 http://linux-vserver.org/Welcome_to_Linux-VServer.org
11 http://www.vmware.com/products/vi/esx/
12 http://xensource.com/


Figure 2.3: Hardware-Layer Virtualization

Paravirtualization is the technique used by Xen. It provides a virtual machine interface representing a slightly modified copy of the underlying hardware, meaning that the nonvirtualizable portions of the original x86 instruction set are replaced with easily virtualizable equivalents. The result is that either the guest operating systems running on top of this interface must be ported to use the slightly modified instruction set [RG05], or, more recently, the hardware architecture must support virtualization (Intel VT, AMD SVM). The applications running on top of the guest OS, on the other hand, do not require any modifications. Although the cost of porting an OS to Xen is relatively low [BDF+03], this is still a big disadvantage of the paravirtualization approach. The approach used by VMware ESX avoids this problem by performing on-the-fly binary translation of privileged (kernel) code, which replaces the nonvirtualizable instructions with code that can be run directly on the CPU [AA06]. A caching system is then used to increase performance.

Process-level Virtualization Techniques

The previously mentioned classification covers system-level virtualization. While these techniques are used in both server and desktop virtualization, there are two process-level virtualization techniques that are widely used in software development and should be mentioned: emulation and high-level-language virtualization.

The emulation of one or more hardware devices, be it a printer, a CPU or even a complete computing system, represents the complete imitation of that device or set of devices. This technique is also referred to as interpretation, since it involves an interpreter program that fetches, decodes and emulates the executed guest instructions. While this clearly offers

poor performance compared with other virtualization techniques [SN05], it is the only technique that makes it possible to run code written for one architecture (e.g. ARM) on top of a different architecture (e.g. Intel IA-32). Examples of emulation software are Bochs13 and PearPC14, but these are not the only ones. High-level-language (HLL) virtualization has been used to provide cross-platform portability. As such, a HLL VM doesn't emulate any specific hardware architecture but specifies its own virtual instruction set architecture (ISA). Source code written in such a language is compiled against the virtual ISA, resulting in binary code that can run on top of the HLL VM. This code is portable since it can run on all host platforms that implement the HLL VM. Examples of such VMs are the Java Virtual Machine (JVM)15 and the Microsoft Common Language Runtime (CLR), which is the backbone of the .NET framework16.

2.1.2 Case Study: Xen

Xen started as an open-source project at the University of Cambridge (UK) and, while its code base is still open source, it has grown into a commercial product marketed by XenSource. At the end of October 2007, Citrix acquired XenSource. Besides Citrix, vendors like Sun, Oracle and, more recently, Microsoft also provide Xen-based virtualization solutions. This subsection provides a description of how Xen approaches virtualization.

The Virtual Machine Interface

As previously mentioned, Xen uses a technique called paravirtualization. The difference between paravirtualization and full system virtualization is that, while full system virtualization provides an identical copy of the underlying hardware, paravirtualization provides a slightly different copy of the hardware. This modified copy of the hardware is referred to as the virtual machine interface (VMI). Since the VMI is different from the hardware interface for which common operating systems like Linux or Windows were designed, running such an OS on top of the VMI requires applying some modifications to the respective OS. A modified guest OS communicates with the hypervisor via so-called

13 http://bochs.sourceforge.net/
14 http://pearpc.sourceforge.net/
15 http://java.sun.com/
16 http://msdn.microsoft.com/netframework/

hypercalls. In an operating system, a system call is a software trap from the application to the kernel. Similarly, a hypercall is a software trap from the guest OS to the hypervisor.

The VMI consists of three main subsystems: memory management, CPU and device I/O. The following paragraphs present how Xen virtualizes these subsystems, as described in [BDF+03]. However, the focus will be mostly on how Xen virtualizes memory and CPU.

Virtualizing Memory is considered to be the most difficult part of paravirtualizing the x86 architecture [BDF+03], since it requires both the development of complex mechanisms in the hypervisor as well as modifications to the guest OS. The approach used by Xen is to provide the guest OS with direct read-only access to the hardware page tables, while page table update requests are passed to Xen via a hypercall. It is then the job of the hypervisor to validate these updates to ensure safety. Since entering the hypervisor through hypercalls is an obvious performance overhead compared to direct updates, a guest OS may queue updates and apply an entire batch with a single hypercall. To avoid a TLB flush when entering and leaving the hypervisor, the top 64 MB of each address space are reserved for Xen and cannot be accessed by any guest OS. However, this is no problem for the applications running on the guest OS, since none of the common x86 application binary interfaces (ABIs) uses this address region.
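To make the batching idea concrete, the following minimal Python sketch (purely illustrative, not Xen code) shows a wrapper that queues updates and hands them to a single "apply" callback once a threshold is reached; this is the same amortization pattern a guest OS uses when it collects page-table updates and issues them with one hypercall.

# Conceptual sketch only: batching expensive "hypervisor entries".
# apply_batch() stands in for a single trap that validates and applies
# a whole list of queued updates at once.

class UpdateBatcher:
    def __init__(self, apply_batch, max_batch=128):
        self.apply_batch = apply_batch   # callback standing in for one hypercall
        self.max_batch = max_batch
        self.pending = []

    def queue(self, update):
        # Queue an update instead of trapping immediately.
        self.pending.append(update)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        # One "hypervisor entry" applies the whole batch.
        if self.pending:
            self.apply_batch(self.pending)
            self.pending = []

if __name__ == "__main__":
    calls = []
    batcher = UpdateBatcher(lambda batch: calls.append(len(batch)), max_batch=4)
    for i in range(10):
        batcher.queue(("pte", i))
    batcher.flush()
    print(calls)   # e.g. [4, 4, 2]: ten updates, only three traps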

Just like VMware ESX Server, Xen implements a balloon driver [Wal02] for its guests, with the help of which the memory allocation of a domain can be dynamically adjusted. This is done by passing memory pages back and forth between the hypervisor and the page allocator of the guest OS. A threshold can be specified beyond which the memory capacity of a domain can no longer be extended.
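As an illustration of this control knob, the following hedged Python sketch adjusts a domain's memory allocation at run time through the libvirt bindings mentioned later in this chapter; the connection URI and the domain name "vm1" are placeholders, and error handling is omitted.

# Hedged sketch: run-time memory adjustment via the libvirt Python bindings
# (the balloon driver does the actual work inside the guest).
import libvirt

conn = libvirt.open("xen:///")          # connect to the local Xen hypervisor
dom = conn.lookupByName("vm1")          # "vm1" is a hypothetical domain name

current_kib = dom.info()[2]             # info() -> (state, maxMem, memory, nrVirtCpu, cpuTime)
print("current allocation: %d KiB" % current_kib)

dom.setMemory(512 * 1024)               # request 512 MiB; capped by the domain's maximum
conn.close()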

Virtualizing CPU implies inserting a hypervisor below the OS and thus violates the common assumption that the OS is the most privileged part of the system. To guarantee domain isolation, the guest OS must run at a lower privilege level. Since the x86 architecture provides four distinct privilege levels (referred to as rings 0-3), the code of a Xen guest OS, originally running in ring 0, must be ported to run in ring 1, while Xen becomes the most privileged entity of the system by running in ring 0. This is possible because none of the modern OSs uses rings 1 and 2, with applications executing in ring 3. Because the ported guest OS runs in ring 1, privileged instructions cannot run directly on the CPU and need to be paravirtualized and executed by Xen. Since multiple domains can run concurrently on the same CPU, they need to be scheduled. Three scheduling mechanisms have been developed for Xen: Borrowed Virtual Time (BVT) [DC99], Simple Earliest Deadline First (SEDF) and the Credit Scheduler.

According to the Xen wiki17, the first two have shown several limitations and have been replaced by the Credit Scheduler, which is currently the default scheduler in Xen. The Credit Scheduler is a proportional fair-share scheduler. Each virtual machine, called a domain in Xen terminology, is assigned a weight and a cap. The weight of a domain determines its relative share: for example, when a host is contended, a domain with a weight of 4 will get twice as much CPU time as a domain with a weight of 2 (legal weights range from 1 to 65535). The cap of a domain determines the maximum amount of CPU the domain is allowed to consume, independently of the CPU utilization. This parameter is optional and means that, even if the CPU has idle cycles, a domain with a cap value of, say, 50 is allowed to consume at most 50% of the CPU time. On a symmetric multiprocessing (SMP) host, a cap value of, say, 200% means that the respective domain can fully use at most two CPUs. The Credit Scheduler uses the notion of a virtual CPU (VCPU) to represent the CPU of a virtual machine. If not specifically restricted, a VCPU can run on any physical CPU. A physical CPU can host several VCPUs, which are placed in a run queue managed by that CPU. An accounting thread recomputes every 30 ms the amount of credits each VCPU has earned. When a VCPU runs, its credits are consumed. VCPUs in the queue of a CPU are ordered according to their priority. A VCPU can have one of two priorities: over or under. A priority of over signals that the VCPU has a negative amount of credits, while a priority of under signals a positive amount of credits. On an SMP host, VCPUs are load-balanced across physical CPUs. When no VCPU with a priority of under is located in the run queue of a CPU, the respective CPU will look for VCPUs with a priority of under in the run queues of the other physical CPUs of the SMP host. This guarantees fair scheduling across all domains. When no such VCPU is found, the next to run will be a VCPU with a priority of over from the local run queue. When the run queue of a physical CPU is empty, the queues of the other physical CPUs are searched for VCPUs that request CPU time.
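A hedged sketch of how these two parameters can be set from a management program, again using the libvirt Python bindings: the parameter names follow libvirt's convention for the Xen credit scheduler, and the domain name is a placeholder rather than part of the thesis testbed.

# Hedged sketch: setting credit-scheduler weight and cap for a domain via libvirt.
import libvirt

conn = libvirt.open("xen:///")
dom = conn.lookupByName("vm1")

print(dom.schedulerType())                    # scheduler name and number of parameters
dom.setSchedulerParameters({"weight": 512,    # twice the share of a weight-256 domain under contention
                            "cap": 50})       # never more than 50% of one physical CPU
conn.close()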

Virtualizing Device I/O is done in Xen with the help of the privileged control domain, called Domain 0. This domain contains the original, unmodified device drivers, which access the hardware through the hypervisor. The other, unprivileged domains use paravirtualized drivers. This is done by modifying the kernel to use virtual drivers that communicate with the original drivers in Domain 0. For example, when an unprivileged domain wants to write something to its hard disk, the modified kernel will recognize that this is a write operation on a block device and use its virtual block device driver.

17 http://wiki.xensource.com/xenwiki/


This driver is nothing more than a communication interface to the original driver located in Domain 0. The write operation is then performed by the driver in Domain 0, with the hypervisor being responsible for determining the hardware address of the hard disk.

Figure 2.4: The structure of a machine running Xen [BDF+03]

Control and Management

When Xen was developed, one of the goals was to separate mechanism from policy. As a result, the hypervisor itself provides only the basic control operations, exporting them by means of a control interface to the authorized domain, Domain 0. This domain is created at boot time and is the place where the application-level management software can be hosted. This architecture is presented in Figure 2.4. Through the control interface, Domain 0 is able to create and terminate other domains, adjust scheduling parameters and memory allocations, and create virtual network interfaces and block devices.

2.1.3 Live Migration

Migrating the workload from one physical computer to another can be very useful. System administrators can then easily perform load balancing, fault management and hardware maintenance. In [HJ04], the authors point out three approaches to workload migration. Process migration was a popular technique to achieve workload migration in the 1980s [pro00]. However, because of some limitations, process migration hasn't been widely used. The most prominent limitation is that process migration suffers from dependencies
An example use of a hypercall is to request a set of page- to the presence of legacy 16-bit emulation code and the need for table updates, in which Xen validates and applies a list of updates, Domain0 isa so ablemew toha createt diffe andrent terminateboot-loadin otherg me domains,chanism. adjustNote th schedulingat the x86 parameters,- returning control to the calling domain when this is completed. memory allocationspecific c andode createbase in virtualXP is networksubstan interfacestially larg ander th blockan in devices.Linux and Communication from Xen to a domain is provided through an hence a larger porting effort should be expected. asynchronous event mechanism, which replaces the usual delivery mechanisms for device interrupts and allows lightweight notifica- 2.3 Control and Management 2.1.3 Live Migration tion of important events such as domain-termination requests. Akin Throughout the design and implementation of Xen, a goal has to traditional Unix signals, there are only a small number of events, been to separate policy from mechanism wherever possible. Al- each acting to flag a particular type of occurrence. For instance, Migratingth theou workloadgh the hy fromperv oneisor physicalmust be computerinvolved toin anotherdata-pa canth a besp veryects useful.(for Systemevents are used to indicate that new data has been received over the administratorsexam canple, thensche easilyduling performthe CPU loadbetw balancing,een doma faultins, filte managementring netwo andrk hardwarenetwork, or that a virtual disk request has completed. maintenance.pack Inets [HJ04before],tra thensm authorsission point, or en outforc threeing ac approachescess contro tol w workloadhen read- migration.Pending events are stored in a per-domain bitmask which is up- ing data blocks), there is no need for it to be involved in, or even dated by Xen before invoking an event-callback handler specified Process migrationaware of,washig inher thelev 1980’sel issue as popularsuch as techniquehow the CPU to achieveis to b workloade shared, migrationby the guest OS. The callback handler is responsible for resetting [pro00]. However,or which becausekinds of ofpac someket ea limitations,ch domain m processay tran migrationssmit. hasn’t been widelythe set of pending events, and responding to the notifications in an used. The mostThe prominentresulting limitationarchitectu isre thatis o processne in wh migrationich the h suffersyperviso fromr itse dependencieslf appropriate manner. A domain may explicitly defer event handling provides only basic control operations. These are exported through by setting a Xen-readable software flag: this is analogous to dis- an interface accessible from authorized domains; potentially com- abling interrupts on a real processor. plex policy decisions, such as adm13ission control, are best performed by management software running over a guest OS rather than in 3.2 Data Transfer: I/O Rings privileged hypervisor code. The presence of a hypervisor means there is an additional pro- The overall system structure is illustrated in Figure 1. Note that tection domain between guest OSes and I/O devices, so it is crucial a domain is created at boot time which is permitted to use the con- that a data transfer mechanism be provided that allows data to move trol interface. This initial domain, termed Domain0, is responsible vertically through the system with as little overhead as possible. for hosting the application-level management software. 
The con- Two main factors have shaped the design of our I/O-transfer trol interface provides the ability to create and terminate other do- mechanism: resource management and event notification. For re- mains and to control their associated scheduling parameters, phys- source accountability, we attempt to minimize the work required to ical memory allocations and the access they are given to the ma- demultiplex data to a specific domain when an interrupt is received chine’s physical disks and network devices. from a device — the overhead of managing buffers is carried out In addition to processor and memory resources, the control inter- later where computation may be accounted to the appropriate do- face supports the creation and deletion of virtual network interfaces main. Similarly, memory committed to device I/O is provided by (VIFs) and block devices (VBDs). These virtual I/O devices have the relevant domains wherever possible to prevent the crosstalk in- associated access-control information which determines which do- herent in shared buffer pools; I/O buffers are protected during data mains can access them, and with what restrictions (for example, a transfer by pinning the underlying page frames within Xen. 2.1. Virtualization Chapter 2. Background on the source host, which must remain online to handle certain system calls or memory ac- cesses from the migrated process. These dependencies are called residual dependencies. The second approach is object mobility. This only works in complex frameworks like CORBA, J2EE and .NET, yet porting all legacy software to these frameworks is not real- istic since it would be a very time intensive job. With the emergence of x86 virtualization, a third approach became possible: the VM migration. This can be either implemented in the hypervisor or in the OS itself, in the latter case being called self-migration [HJ04]. The following describe the Xen approach to VM migration, as presented in [CFH+05].

The Xen live migration mechanism was designed to enable the migration of a VM between two physical hosts located in the same LAN. The performance of this approach depends on two criteria:

• the downtime, representing the amount of time for which the VM is offline (suspended)

• the total migration time, representing the total amount of time needed for migration

Since it is not possible to minimize both downtime and total migration time, a trade-off must be found between the two. There are three phases that can be part of the migration process:

• the push phase, where pages are copied across the network from the source VM to the destination VM while the source machine keeps running

• the stop-and-copy phase, where the source VM is suspended while pages are copied to the destination VM

• the pull phase, where the destination VM starts and if it accesses a page that has not been copied, it pulls the page from the source VM over the network

OS migration techniques can be differentiated depending on which of these phases are part of the memory migration process. For example, in pure stop-and-copy, the VM is halted, all pages are copied to the destination VM, which is then started. While easy to implement, this approach obviously doesn't fit a scenario where a live service (like, for example, a streaming server) is hosted inside the VM. With pure demand-migration, this problem is solved by copying only the essential kernel data structures to the destination during the stop-and-copy phase and then entering the pull phase. While the downtime is


reduced this way, the total migration time is very long and the performance after migration can turn out to be unacceptable.

Figure 2.5: Xen Live Migration [CFH+05]

The approach used by Xen is called pre-copy migration and combines an iterative push phase with a stop-and-copy phase. The functioning of this approach is shown in Figure 2.5 and can be split into six stages. First, a destination host must be selected (Stage 0) and a VM container must be initialized on that host (Stage 1). Then the push phase is carried out iteratively, copying at first all pages and then only the dirty pages from the source VM to the destination VM (Stage 2). The number of rounds is limited and based on a statistical analysis. In Stage 3, the source VM is suspended and the CPU state and the remaining dirty pages are copied to the destination VM. To ensure protection against hardware failures, Xen uses a transactional model for the migration process. Thus, in Stage 4, the destination physical machine commits that it has received a consistent OS image. The source VM is discarded and then, in Stage 5, the VM on the destination host resumes normal operation.

The two other resources besides memory that need to be migrated are network and storage. A migrated VM offering a live service needs to maintain the same IP address that it had before migration. This is done by broadcasting an unsolicited ARP reply to advertise the new physical location (the MAC address of the destination host). Xen doesn't handle the migration of local storage. Instead, network-attached storage has to be used.
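The trade-off between downtime and total migration time can be illustrated with the following toy Python simulation of the iterative push phase (it is not Xen's implementation): each round copies the pages dirtied during the previous round, and the pages still dirty when the iteration stops are the ones that must be copied while the VM is suspended.

# Toy simulation of iterative pre-copy; page count, dirty rate and round limit
# are made-up parameters for illustration only.
import random

def precopy(total_pages=10000, dirty_rate=0.05, max_rounds=30):
    dirty = set(range(total_pages))          # round 1: every page is "dirty"
    copied = 0
    for rnd in range(1, max_rounds + 1):
        copied += len(dirty)
        # pages written to by the still-running guest during this round
        newly_dirty = {random.randrange(total_pages)
                       for _ in range(int(total_pages * dirty_rate))}
        if len(newly_dirty) >= len(dirty):   # no more progress: stop iterating
            dirty = newly_dirty
            break
        dirty = newly_dirty
    downtime_pages = len(dirty)              # copied while the VM is suspended
    return rnd, copied, downtime_pages

rounds, copied, downtime_pages = precopy()
print("rounds=%d, pages copied while running=%d, pages copied during downtime=%d"
      % (rounds, copied, downtime_pages))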
Thissiste isn donet mem byory broadcastingpages are th anen unsolicitedtransferred. ARPAt th replye end to advertise the new physical locationof (thethis MACstage addressthere is ofa thecon destinationsistent susp host).ended Xencopy doesn’tof handle the 3.3 Design Overview the VM at both A and B. The copy at A is still con- migration of local storage.sidered Instead,to be prim network-attachedary and is resum storageed in ca hasse o tof fa beilu used.re. The logical steps that we execute when migrating an OS are Stage 4: Commitment Host B indicates to A that it has summarized in Figure 1. We take a conservative approach successfully received a15consistent OS image. Host A to the management of migration with regard to safety and acknowledges this message as commitment of the mi- failure handling. Although the consequences of hardware gration transaction: host A may now discard the orig- failures can be severe, our basic principle is that safe mi- inal VM, and host B becomes the primary host. gration should at no time leave a virtual OS more exposed The migrated VM on B is now ac- 1 Stage 5: Activation Note that on most Ethernet controllers, hardware MAC filtering will tivated. Post-migration code runs to reattach device have to be disabled if multiple addresses are in use (though some cards support filtering of multiple addresses in hardware) and so this technique drivers to the new machine and advertise moved IP is only practical for switched networks. addresses. 2.1. Virtualization Chapter 2. Background

2.1.4 Hardware-Assisted Virtualization

As already mentioned in Chapter 1, both Intel (VT) and AMD (SVM) extended the IA-32 architecture to support virtualization. Although the extensions from Intel and AMD are not compatible with each other, they are very similar [AMD05, Int05]. What follows is an overview of the Intel VT-x architecture.

With VT-x, Intel introduced a new form of processor operation mode named VMX. This form of operation can be entered by executing the VMXON instruction and left by execut- ing the VMXOFF instruction. There are two VMX execution modes: VMX root and VMX non-root. Both of these modes support all four (0-3) IA-32 privilege levels, which means that legacy OSs can run without modifications with respect to the privilege for which they were designed. The VMX root mode is similar to the IA-32 without VT-x. It is the VMM that is supposed to run in this mode. The VMX non-root mode is where the guest OS should run. This is because software running in this mode is restricted in certain ways, in INTRODUCTION AND VMX OVERVIEW INTRODUCTION AND VMX OVERVIEW spite the privilege level it is running in.

A VMM is able to retain selective control of processor resources, physical memory, interrupt management, and I/O. • Guest software: Each virtual machine is a guest software environment that supports a stack consisting of operating system (OS) and application software. Each operates Guest 0 Guest 1 independently of other virtual machines and uses on the same interface to processor(s), memory, storage, graphics, and I/O provided by a physical platform. The software stack acts as if it were running on a platform with no VMM. Software executing in a virtual machine must operate with reduced privilege so that the VMM can retain control of platform resources. VM Exit VM Entry VM Exit

1.3 INTRODUCTION TO VMX OPERATION VMXON VM Monitor VMXOFF Processor support for virtualization is provided by a new form of processor operation called VMX operation. There are two kinds of VMX operation: VMX root operation and VMX non- root operation. In general, a VMM will run in VMX root operation and guest software will run Figure 1-1. Interaction of a Virtual-Machine Monitor and Guests in VMX non-root operation. Transitions between VMX root operation and VMX non-root oper- ation are called VMX transitions. There are two kinds of VMX transitions. Transitions into Figure 2.6: An exemplary lifecycle of a hardware-assisted VMM [Int05] VMX non-root operation are called VM entries. Transitions from VMX non-root operation to VMRESUME; it regains control using VM exits. VM exits transfer control to an entry VMX root operation are called VM exits. point specified by the VMM. The VMM can take action appropriate to the cause of the Processor behavior in VMX root operation is very much as it is outside VMX operation. The VM exit and can then return to the virtual machine via a VM entry. principal differences are that a set of new instructions (the VMX instructions) is available and The two VMX• E executionventually, the modesVMM ma cany deci bede to entered shut itselfby dow meansn and leav ofe V twoMX o VMXperation. transitions:It does so that the values that can be loaded into certain control registers are limited (see Section 1.8). by executing the VMXOFF instruction. Processor behavior in VMX non-root operation is restricted and modified to facilitate virtual- ization. Instead of their ordinary operation, certain instructions (including the new VMCALL instruction) and events cause VM exits to the VMM. Because these VM exits replace ordinary • VM Entry1.5 : fromVIRTU VMXAL-M rootACH toINE VMX CONT non-rootROL STR (VMMUCTUR toE guest OS) behavior, the functionality of software in VMX non-root operation is limited. It is this limitation that allows the VMM to retain control of processor resources. VMX non-root operation and VMX transitions are controlled by a data structure called a virtual- • VM Exitmac:h fromine cont VMXrol struct non-rooture (VMCS). to VMX root (guest OS to VMM) There is no software-visible bit whose setting indicates whether a logical processor is in VMX non-root operation. This fact may allow a VMM to prevent guest software from determining that Access to the VMCS is managed through a component of processor state called the VMCS it is running in a virtual machine pointer (one per logical processor). The value of the VMCS pointer is the 64-bit address of the VMCS. The VMCS pointer can be read and written using the instructions VMPTRST and Because VMX operation places these restrictions even on software running with current privi- The exemplaryVMP lifecycleTRLD. The ofVM aMhardware-assisted configures a VMCS using VMM other ins istru presentedctions: VMRE inAD Figure, VMWRI2.6TE, . A virtual lege level (CPL) 0, guest software can run at the privilege level for which it was originally and VMCLEAR. designed. This capability may simplify the development of a VMM. machine control structure (VMCS) is used to manage VMX transitions and VMX non- A VMM could use a different VMCS for each virtual machine that it supports. For a virtual root operations.machin Thise withdata multip structurele logical pro isces dividedsors (virtua inl pr severalocessors), t sections.he VMM cou Oneld use ofa di them,fferent the guest VMCS for each virtual processor. 
1.4 LIFE CYCLE OF VMM SOFTWARE state section,Ch isapte usedr 2 desc toribe saves the st andructur loade of a V guestMCS. C statehapter (the3, Cha virtualpter 4, and CPU Chapte state)r 5 provi atde VM exits Figure 1-1 illustrates the life cycle of a VMM and its guest software by illustrating the interac- details on how the VMCS controls VMX non-root operation, VM entries, and VM exits. tions between them. Chapter 7 provides detailed descriptions for each of the new VMX instructions. • Software enters VMX operation through execution of the VMXON instruction. 16 • The VMM can then enter its guests into virtual machines (one at a time) using VM entries. The VMM effects a VM entry using the VMX instructions VMLAUNCH and

and VM entries, respectively. Similarly, the host state section is used to load host state at VM exits. In VMX non-root mode, many instructions and events can cause VM exits. While some instructions exit unconditionally, other instructions, interrupts and events can be configured to do so. This is achieved by using the VM-execution control fields in the VMCS, which provide support for VMMs to implement flexible virtualization strategies by specifying which instructions, interrupts and events should cause VM exits [UNR+05]. To optimize the performance of a VMM, the frequency of VM exits must be reduced [AA06].

2.1.5 Management of Virtual Machine Environments

With virtualization now making its way into data centers, the management of virtualized environments is becoming an important issue. However, managing virtualization-based environments nowadays generally means that one has to use simple, GUI-based tools. An integration with traditional management platforms like IBM Tivoli18 and HP OpenView19 is still under development. In this context, the management capabilities of Xen and VMware ESX are described. This covers both the management interfaces provided by the two VMMs and existing management solutions that use these interfaces.

2.1.5.1 Management Interfaces

Xen has recently finished version 1.0 of the Xen-API20 and included it in the Xen 3.1 release. This management API aims at integrating systems running Xen with enterprise management solutions. Figure 2.7 shows the Xen management architecture. On top of the Xen hypervisor runs Xend, the Xen daemon, which controls and supervises the execution of the unprivileged domains. One of the requirements of the Xen-API was to allow remote access and control. The result was the Xen-RPC layer, which consists of a set of RPCs that use a wire format based on XML-RPC and runs on top of Xend. Based on Xen-RPC, bindings for various languages have been developed or are in the process of being developed. Bindings for Python and C (libvirt) are already available. Although the Xen-API has already reached version 1.0, the integration with enterprise management solutions still requires the development of a CIM integration interface.
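As a small usage example of such bindings, the following hedged Python sketch enumerates the running domains of a local Xen host via libvirt and prints their basic runtime data; the connection URI is a placeholder and error handling is omitted.

# Hedged sketch: reading basic runtime data for all running Xen domains via libvirt.
import libvirt

conn = libvirt.open("xen:///")
for dom_id in conn.listDomainsID():          # IDs of all running domains
    dom = conn.lookupByID(dom_id)
    state, max_kib, cur_kib, vcpus, cpu_time_ns = dom.info()
    print("%-12s vcpus=%d mem=%d/%d KiB cpu=%.1fs"
          % (dom.name(), vcpus, cur_kib, max_kib, cpu_time_ns / 1e9))
conn.close()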

18 http://www.ibm.com/tivoli
19 http://www.managementsoftware.hp.com/
20 http://wiki.xensource.com/xenwiki/XenApi

Figure 2.7: The Xen Management Architecture [Mel07]

CIM project21 works on an open source implementation of a DMTF22 CIM provider for Xen which uses the Xen-API. The DMTF System Virtualization, Partitioning, and Clustering Working Group (SVPC WG) is a standardization body which is currently working on defining models for virtualization and resource allocation that are not directly bound to any product. VMware has already developed several management APIs. For managing a VMware infrastructure, two APIs are available: the VMware Infrastructure SDK23 provides developers of third-party tools a Web Service interface to the Virtual Center, which controls the ESX Server. The other infrastructure API is the CIM SDK24, which allows managing the ESX Server via a CIM interface. Besides these APIs, a guest SDK25 can be used to monitor guests running inside an ESX Server. The management capabilities of ESX Server clearly surpass those of Xen, at least for the time being.
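As an illustration of how such a management interface can be used, the following sketch queries a local Xen hypervisor through the libvirt Python bindings mentioned above. It assumes that the libvirt bindings are installed and that a Xen hypervisor is reachable under the connection URI xen:///; it is an illustrative example, not part of any of the management solutions discussed here.

```python
# Minimal monitoring sketch using the libvirt Python bindings (assumed installed).
# The connection URI "xen:///" targets a local Xen hypervisor; adjust as needed.
import libvirt

conn = libvirt.open("xen:///")           # connect to the local Xen hypervisor
try:
    for dom_id in conn.listDomainsID():  # IDs of all running domains
        dom = conn.lookupByID(dom_id)
        # info() returns [state, maxMem(KiB), memory(KiB), nrVirtCpu, cpuTime(ns)]
        state, max_mem, mem, vcpus, cpu_time = dom.info()
        print(f"{dom.name()}: {vcpus} vCPUs, {mem // 1024} MiB, cpuTime={cpu_time} ns")
finally:
    conn.close()
```

Since libvirt is hypervisor-agnostic, the same sketch can in principle be pointed at other virtualization back-ends simply by changing the connection URI.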

2.1.5.2 Management Software

In the following paragraphs, several management software products used for the management of virtual machine-based environments are described. Like before, the focus is only on management tools for Xen and VMware ESX. Note that only the products offering a larger set of features have been selected. Enomalism26 is an open source, web-based management tool for Xen virtual machines. 21http://wiki.xensource.com/xenwiki/XenCim 22http://www.dmtf.org/ 23http://www.vmware.com/support/developer/vc-sdk/ 24http://www.vmware.com/support/developer/cim-sdk/ 25http://www.vmware.com/support/developer/guest-sdk/ 26http://www.enomalism.com/


The feature list of Enomalism includes:

• Monitoring interface

• Resource management (e.g. dynamic memory allocation)

• Live migration

• Disk management tools

• SSH client and VNC virtual desktop

• Virtual Appliance Management (package management)

• An open API for integration with third party products

Virt-manager27 is the open source virtual machine manager developed by Red Hat. Although originally designed to manage Xen systems, Virt-manager can also be used to manage Qemu28 and KVM29 guests. This is possible because Virt-manager was built on top of libvirt30, a generic virtual machine management API. The list of features is quite similar to that of Enomalism, with a few notable exceptions like the lack of disk management tools or support for live migration. Virtual Iron31 is a commercial virtualization product that is based on the Xen hypervisor. The Virtual Iron Virtualization Manager includes a web-based GUI for the management of a virtual environment. The main management feature of Virtual Iron is its policy-driven resource and workload management capability. This allows the automatic management of resources and distinguishes the Virtual Iron Virtualization Manager from products like Enomalism and Red Hat's Virt-manager. On the other hand, the fact that it is not open source and that its free version is stripped of some features makes this product less attractive than it would otherwise be. Novell32 has been intensifying its work in the field of virtualization lately. As such, its release of SUSE Linux Enterprise Server 10 SP1 includes broad support for managing and provisioning Xen virtual machines. Also its management product, Novell ZENworks33, has been extended with virtualization management capabilities. The

27http://virt-manager.et.redhat.com/ 28http://fabrice.bellard.free.fr/qemu/ 29http://kvm.qumranet.com 30http://libvirt.org/ 31http://www.virtualiron.com/products/virtualization.cfm 32http://www.novell.com/linux/virtualization/ 33http://www.novell.com/products/zenworks/


Novell ZENworks Orchestrator addresses the automatic life-cycle management of virtual machines. XenEnterprise34 is a commercial product offered by Citrix. It is built on top of Xen and adds features like easy installation, increased performance and a management console. The management capabilities of XenEnterprise are mainly focused on monitoring, life-cycle management and resource management of CPU, memory, disk and network I/O for QoS. VMware's Virtual Center35 was designed to manage a VMware virtual infrastructure. Its feature list includes support for provisioning, monitoring, automation through task scheduling and alerting, automated optimization and live migration. Like before, VMware seems to be one step ahead of its competitors.

2.2 Service Level Management

The existence of an IT department inside a company has been common ever since information technology started to play a role in everyday business. Nowadays, IT departments are present in every big and mid-sized company, and even in some small companies. The job of an IT department is to provide IT services inside the company, and in some cases also to external customers. In recent years, however, a trend of outsourcing IT services can be observed. This is because companies have started depending on a wide variety of IT services. Having an IT department that provides such a variety of services has proven to be very costly. To reduce these costs, companies prefer to outsource certain services to specialized third parties. For example, companies have become dependent on their Web presence. In many cases, downtime translates into losing money. A certain degree of reliability is required, and this is known to be costly. In such cases, many companies choose to outsource their Web services to Web hosting providers, in a move aimed at reducing costs. When a company outsources IT services to a specialized third party, the company needs to have a certain guarantee that these services are working well. However, the company is not interested in how many resources (e.g. memory, CPU, network bandwidth) have been assigned to a certain service, or in how the service has been implemented, but in how well a service performs. As such, a so-called service level management approach is required.

34http://xensource.com/products/xen_enterprise/index.html 35http://www.vmware.com/products/vi/vc/


In the following, the core terms of service level management are defined, based on the taxonomy described in [Deb04].

• The term Quality of Service (QoS) stands for the quality with which a service is provided, and is described by a set of QoS parameters.

• A QoS parameter is a quantifiable and measurable indicator used to evaluate the quality of a service. Various QoS parameters exist. Of relevance for the purpose of this thesis are response time and throughput.

• Response time is a QoS parameter and represents the period of time between the moment in which a user makes a call to a service and the moment in which this call ends, i.e. the user receives an answer corresponding to this call.

• Throughput is a QoS parameter and represents the number of calls successfully performed per time unit.

• A Service Level Objective (SLO) defines a range of values for one or more QoS parameters, subject to an agreement.

• A Service Level Agreement (SLA) represents a contract between a service provider and a service consumer. An SLA defines services to be provided, a set of QoS parameters for these services and the respective SLOs.
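To make these terms more concrete, the following sketch shows one possible way of representing SLOs and SLAs as simple data structures. The field names, units and example values are illustrative assumptions and do not follow any particular standard.

```python
from dataclasses import dataclass, field

@dataclass
class SLO:
    """A Service Level Objective: a value range for one QoS parameter."""
    parameter: str                      # e.g. "response_time" or "throughput"
    max_value: float = float("inf")     # upper bound (e.g. response time in ms)
    min_value: float = 0.0              # lower bound (e.g. calls per second)

    def is_met(self, measured: float) -> bool:
        return self.min_value <= measured <= self.max_value

@dataclass
class SLA:
    """A contract: the services provided plus the SLOs agreed for them."""
    provider: str
    consumer: str
    objectives: dict = field(default_factory=dict)   # service name -> list of SLOs

# Illustrative usage: a web service must answer within 500 ms and
# sustain at least 100 requests per second.
sla = SLA(provider="hosting provider", consumer="customer",
          objectives={"webshop": [SLO("response_time", max_value=500.0),
                                  SLO("throughput", min_value=100.0)]})
print(all(slo.is_met(m) for slo, m in zip(sla.objectives["webshop"], [320.0, 140.0])))
```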

2.3 Autonomic Computing

2.3.1 Motivation

In the last decades, advances in computer hardware and software have led to an ever increasing complexity of computing systems. Nowadays, this complexity seems to be reaching the limits of human capability [KC03]. Managing computing systems is becoming costly and prone to errors. In the future, this complexity will go beyond the capabilities of human experts. In [GC03], Ganek et al. back this assumption with facts: according to Nick Tabellion, the CTO of Fujitsu Softek, for every dollar spent on storage, nine further dollars are spent to manage it. Furthermore, over 40% of computer system outages are caused by operator errors due to the complexity of the system. Knowing this, it is clear that a change in the paradigm used to design computing systems needs to take place.


While computing systems are only now reaching this high complexity, in nature, biological systems of high complexity have existed for millions of years. Researchers looked for inspiration in these biological systems and found the human autonomic nervous system (ANS). The ANS is responsible for many low-level, yet vital functions, such as the heart rate or the body temperature. These functions are managed without conscious control. Similarly, the autonomic computing paradigm [Hor01] states that computing systems should be able to manage themselves, based on a set of high-level goals defined by the system administrator. Thus, tasks like installation, configuration, optimization and maintenance should be performed by the computing system without human interaction.

2.3.2 Taxonomy

In many cases, when a new field of research gets the attention of the research community, a lack of unity can be observed with respect to terms and definitions. What IBM calls autonomic computing is more or less what Intel calls proactive computing [Ten00], whereas with organic computing [MSMW04] the German research community addresses the same problems but focuses more on hardware-related aspects. For the purpose of this thesis, the IBM taxonomy will be used as a backbone and extended with complementary terms and definitions when necessary.

Self-management denotes the process by which computing systems manage themselves. This is considered to be the essence of autonomic computing. There are various aspects of self-management. In [KC03], Kephart and Chess define four such aspects:

• Self-configuration represents the property of a computing system to configure itself automatically, based on a high-level policy. This way, the job of the system administrator is to provide this high-level policy, while the system is responsible for the installation and configuration processes. New components added to the system must also be incorporated by the system in a seamless way.

• Self-optimization represents the property of a computing system to continuously seek to improve its operation. This is achieved by dynamically adjusting its parameters to better adapt to the changes taking place in the system.

• Self-healing represents the property of a computing system to detect, diagnose and repair local problems. These problems can be caused by both software and hardware components.


• Self-protecting represents the property of a computing system to defend itself in the face of a malicious attack or cascading failures. Besides reacting to problems that already took place, the system should also be capable of foreseeing possible problems based on logs and reports.

There are other self-management properties besides these four, which can be found in various publications. Terms like self-monitoring, self-inspection, self-testing, self-repairing and self-organization have all been used in academic research, and are referred to as self-* or self-X properties. A term-frequency statistic for each of these self-X properties is provided in [FZ05]. One frequently appearing term is self-organization. Self-organization is a highly interdisciplinary term. Definitions for self-organization have been provided by biologists, physicists, chemists, philosophers and recently computer scientists. The latter define self-organization as the capability of a computing system to adapt itself to a dynamic environment without outside control. The presence of a central coordination component in such systems is generally excluded. Self-organization is considered to be a special property of a self-managing system [HMG04].

2.3.3 Architectural Considerations

Implementing autonomic computing imposes new architectural considerations. Autonomic systems are seen as interactive collections of autonomic elements. The autonomic element represents the basic component of the system. Such an element is able to manage both its internal behaviour and its relationships with other autonomic elements. Together, they collaborate to meet the goals set by a higher-ranked autonomic element or by a human system administrator. From a structural point of view, an autonomic element is composed of one or more managed elements and exactly one autonomic manager. The autonomic manager controls the managed elements. This structure is shown in Figure 2.8. The autonomic manager monitors the managed element, analyzes the obtained information and, based on its knowledge base, constructs and executes plans that modify the behavior of the managed element. The roadmap towards autonomic computing sees both the development of new managed elements and their corresponding autonomic managers, as well as the development and attachment of autonomic managers to already existing manageable elements. Every autonomic system has a specific behaviour. This behaviour can be seen as a collection of functions, each implemented by a specific autonomic element. However, to facilitate the system-level behaviour, additional autonomic elements are required [?].

Figure 2.8: Autonomic Element [KC03]

These infrastructure elements support the operation of the autonomic system. Examples of such infrastructure elements are:

• registry: used by autonomic components to find each other

• sentinel: provides monitoring services to autonomic elements

• aggregator: combines the functions of several autonomic elements in one service

• broker: facilitates the interaction between autonomic elements

2.3.4 Examples of Autonomic Computing Systems

Although a relatively new field of research, autonomic computing approaches are already being implemented. The SMART DB2 project started by IBM aims to make the DB2 database management system autonomic [LL02]. The project focuses mostly on properties like self-optimization and self-configuration. The OceanStore project at Berkeley works on highly available data storage, dealing with aspects like self-healing, self-optimization, self-configuration and self-protection [KBC+00]. Also at IBM, the Oceano project [IBM] targets the management of computing resources for software farms. In the Distributed Systems Lab of the University of Applied Sciences Wiesbaden, a Self-Manager framework was developed [Sch07, SG06]. Many other examples exist. This shows that when it comes to designing computing systems, autonomic computing is acknowledged as a viable and, in the future, necessary paradigm. We therefore expect to see more and more systems that provide such autonomic computing capabilities.


2.4 Complexity theory and Optimization

2.4.1 Introduction to Complexity Theory

Complexity theory is a branch of theoretical computer science which analyzes the difficulty of providing efficient algorithms for computational problems and the amount of resources required by the execution of an algorithm (e.g. processing time, working memory and others). The introduction to complexity theory provided in this section is mainly based on [ACK+00, Vaz01, KPP04]. Much of complexity theory deals with decision problems. A decision problem is formally defined by the characteristic function f : I → S, where I represents the set of problem instances and S = {0, 1}.

Informally, a decision problem can be described as a question with a “yes” or “no” answer. Thus, based on the input of a problem, one can distinguish between problems that yield a “yes” and problems that yield a “no” answer. This distinction is made because generally the proof for one of the answers is easier to verify than the proof for the other. For example, consider the problem of determining if a natural number is prime. This is a decision problem where a proof for a “no” answer can be verified more easily than a proof for a “yes” answer. A decision problem X where the “yes” and “no” answers of a decision problem Y are reversed is called the complement of Y.

2.4.2 Complexity Classes

Computational problems can be categorized into different classes, depending on their complexity. Thus, a complexity class represents a set of problems with a similar complexity. Decision problems are commonly categorized into two main classes: P and NP.

• P is the class of all decision problems for which a deterministic machine exists that solves the problem in O(n^k), for k ∈ N (polynomial time).

• NP is the class of all decision problems for which a non-deterministic machine exists that solves the problem in O(n^k), for k ∈ N.

As can be observed from the definitions above, P is a subset of NP. Although not proven, it is commonly believed that P ≠ NP and thus P ⊂ NP. The most difficult problems in NP are the ones that are most likely not in P and are called NP-complete. The class of NP-hard problems represents those problems that are at least as hard as the

problems in NP, and to which any problem in NP can be reduced in polynomial time. As shown in Figure 2.9, the NP-hard set is not a subset of NP; the NP-complete problems represent the subset of NP-hard problems that is in NP.


Figure 2.9: Complexity Classes

2.4.3 Optimization Problems

In complexity theory, optimization problems have always been one of the most studied subjects. This is because of the practical relevance of optimization problems in many areas. Examples of such areas are networking, stock cutting, artificial intelligence, resource scheduling and many others. An optimization problem P can be defined as a quadruple P = {I, Sol, m, goal}, where:

• I is the set of instances of P;

• Sol is a function that associates to any x ∈ I the set of feasible solutions to x;

• m represents the measure function that associates a positive integer to each pair (x, y) with x ∈ I and y ∈ Sol(x);

• goal ∈ {MIN, MAX} and specifies whether P is a minimization or a maximization problem.

Informally, an optimization problem means finding the best solution(s) from the set of all valid solutions. For each optimization problem, a corresponding decision problem exists.


If the optimization problem is a minimization problem, the decision problem asks for the existence of a pair (x, y), with x ∈ I, y ∈ Sol(x) and m(x, y) ≤ K, where K is a positive integer. Analogously, for a maximization problem the decision problem asks for the existence of a pair (x, y), with x ∈ I, y ∈ Sol(x) and m(x, y) ≥ K, where K is a positive integer. If the decision problem corresponding to the optimization problem is in NP, then the optimization problem is an NP optimization problem, or in NPO.

2.4.4 The Knapsack Family of Problems

The knapsack problems are among the most studied optimization problems. The original knapsack problem (KP) can be informally formulated as follows: consider a hiker who plans to go on a tour. He has a rucksack (knapsack) with a limited capacity c and a set of items available, each having a weight wi and a benefit pi that it brings to the hiker. The hiker is looking for a subset of these items whose total weight does not exceed the capacity of the rucksack and whose total benefit is maximized. Formally, the 0-1 knapsack problem is formulated as the following maximization problem:

\begin{equation}
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} p_i x_i \\
\text{subject to} \quad & \sum_{i=1}^{n} w_i x_i \le c \\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n
\end{aligned}
\tag{2.1}
\end{equation}

The binary variable xi (hence 0-1 in the name of the problem) equals 1 if item i should be included in the knapsack, and 0 otherwise. Close variants of the 0-1 knapsack problem are the bounded knapsack problem (BKP) and the unbounded knapsack problem (UKP). The BKP defines a bound on the number of identical items that can be packed in the knapsack, while the UKP defines no such limit. Another variant of the KP with high practical relevance is the subset sum problem (SSP).

The SSP is a particular case of the KP, where pi = wi, and has applications in logistics, cutting and packing. In many situations, items to be packed in a knapsack have more than one dimension. This leads to the d-dimensional knapsack problem (d-KP), or the multidimensional knapsack problem. This problem is formally defined by:


\begin{equation}
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{n} p_i x_i \\
\text{subject to} \quad & \sum_{i=1}^{n} w_{ji} x_i \le c_j, \quad j = 1, \ldots, d \\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n
\end{aligned}
\tag{2.2}
\end{equation}

In other situations, several knapsacks are available, in which one-dimensional items must be packed. This problem is called the multiple knapsack problem (MKP) and is formally defined by:

\begin{equation}
\begin{aligned}
\text{maximize} \quad & \sum_{j=1}^{m} \sum_{i=1}^{n} p_i x_{ji} \\
\text{subject to} \quad & \sum_{i=1}^{n} w_i x_{ji} \le c_j, \quad j = 1, \ldots, m \\
& \sum_{j=1}^{m} x_{ji} \le 1, \quad i = 1, \ldots, n \\
& x_{ji} \in \{0, 1\}, \quad j = 1, \ldots, m, \; i = 1, \ldots, n
\end{aligned}
\tag{2.3}
\end{equation}

where m is the number of knapsacks, n the number of items and xji = 1 if the item i is assigned to knapsack j. Otherwise, xji = 0.

A special case of the MKP of high relevance is when all the knapsacks have the same capacity. This problem is called the bin packing problem (BP). Another special case of the MKP arises when a cardinality constraint is added to the list of constraints. Informally, this means that no more than k items can be packed in a knapsack. The resulting variant is called the k-item multiple knapsack problem (kMP), or in the special case of identical knapsacks the k-item bin packing problem (kBP).

When the generalizations of the d-KP and the MKP are applied at the same time, a multidimensional multiple knapsack problem (MMKP) results. This problem has many applications in the field of cargo packing and load scheduling. There are many other variants of the knapsack problem: the multiple subset sum problem, the quadratic knapsack problem, the multiple choice knapsack problem and others (see [KPP04]). They are, however, of less relevance for the purpose of this thesis.


An important property of the knapsack problems is their complexity. All the problems from the knapsack family are known to be NP-hard [Pis95]. Even the simple KP is known to be NP-complete.

2.4.5 Approximation Algorithms

Assuming that P ≠ NP, there is no algorithm that finds an optimal solution to an NP-hard problem in polynomial time [Kan92, ACK+00]. In practice, however, it is not always necessary to find an optimal solution to a problem. Instead, in most cases, it is enough to find an approximate solution in short time. An algorithm that finds such an approximate solution to a problem in polynomial time is called a polynomial time approximation algorithm. Although the paradigm of finding approximate solutions for computing problems has been used ever since the beginning of computer science, it was only in the late 1960s that the research community introduced the notion of a guaranteed performance ratio. The performance ratio is the common quality measure for approximation algorithms and represents the ratio between the value of the solution returned by the approximation algorithm and the value of the optimal solution. For years, the solutions yielded by approximation algorithms had a poor quality of approximation, many having a relative error of about 50% [ACK+00]. Over the past years, results have improved: the approximation quality has increased and problems which previously resisted any approximation strategy have now been “cracked”. In the following, the main techniques used for the design of approximation algorithms are briefly introduced.

2.4.5.1 Greedy

The greedy method is frequently used in the design of algorithms for optimization problems. The main idea of the greedy method is that, at each step, from a multitude of choices, the choice is made which seems best at the moment. A greedy algorithm comprises the following:

• an initial set of items

• a ranking function

• a validation function

• an objective function


The initial set of items represents the input of the problem. The goal is to determine a subset of these items that satisfies the constraints, while maximizing the objective function. First, the ranking function is used to sort the items according to some criterion. At each step, the item with the highest ranking is chosen. The validation function is used to determine if this item can contribute to the solution. This cycle is repeated until a complete solution is found. The quality of the solution depends on the ranking function. Although for each problem a ranking function can be determined that leads to an optimal solution, for an NP-hard problem we do not expect to be able to compute such a ranking function in polynomial time. However, in many cases, simple ranking functions can be found that guarantee good approximate solutions. For example, for the knapsack problem, a ranking function that orders the items by their profit/weight ratio leads to a greedy algorithm with a worst case performance ratio of 2.
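The following sketch illustrates this greedy approach for the 0-1 knapsack problem: the ranking function orders the items by their profit/weight ratio and the validation function checks the remaining capacity. Returning the better of the greedy packing and the most profitable single item is the commonly used variant that achieves the worst case performance ratio of 2.

```python
def greedy_knapsack(profits, weights, capacity):
    """Greedy 2-approximation for the 0-1 knapsack problem.

    Items are ranked by profit/weight ratio (the ranking function); the
    validation function simply checks the remaining capacity.  Returning the
    better of the greedy packing and the best single item yields the
    worst-case performance ratio of 2.
    """
    n = len(profits)
    order = sorted(range(n), key=lambda i: profits[i] / weights[i], reverse=True)

    remaining = capacity
    chosen, value = [], 0
    for i in order:
        if weights[i] <= remaining:          # validation: does the item still fit?
            chosen.append(i)
            value += profits[i]
            remaining -= weights[i]

    # Safeguard: compare with the most profitable single item that fits.
    best_single = max((i for i in range(n) if weights[i] <= capacity),
                      key=lambda i: profits[i], default=None)
    if best_single is not None and profits[best_single] > value:
        return [best_single], profits[best_single]
    return chosen, value

# Example: items with profits [10, 7, 5, 3], weights [6, 4, 3, 2], capacity 10
print(greedy_knapsack([10, 7, 5, 3], [6, 4, 3, 2], 10))   # -> ([1, 0], 17)
```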

2.4.5.2 Linear Programming

Linear programming is a technique used to optimize (minimize or maximize) a linear objective function subject to a finite number of equalities and inequalities. Many practical problems can be formulated as linear programming problems. A problem that can be solved by linear programming has the following form:

\begin{align}
\text{maximize} \quad & F(x_1, \ldots, x_p) = c_1 x_1 + \ldots + c_p x_p \tag{2.4} \\
\text{subject to} \quad & a_{i1} x_1 + \ldots + a_{ip} x_p \le b_i \quad \text{for } i = 1, \ldots, m_1 \tag{2.5} \\
& a_{i1} x_1 + \ldots + a_{ip} x_p \ge b_i \quad \text{for } i = m_1 + 1, \ldots, m_2 \tag{2.6} \\
& a_{i1} x_1 + \ldots + a_{ip} x_p = b_i \quad \text{for } i = m_2 + 1, \ldots, m \tag{2.7} \\
\text{where} \quad & x_j \ge 0 \quad \text{for } j = 1, \ldots, p \tag{2.8}
\end{align}

A vector $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ is called a solution of the problem if it satisfies the constraints (2.5)-(2.7). When x also satisfies (2.8), then x is a valid solution. A valid solution $x^* = (x_1^*, \ldots, x_p^*)$ is called an optimal solution when for each valid solution x, $F(x^*) \ge F(x)$ (for minimization problems $F(x^*) \le F(x)$, respectively).

The first algorithm for solving a linear program was created in 1947 by George Dantzig and is named the Simplex algorithm. This algorithm has proven its efficiency in practice. However, in the worst case, the Simplex algorithm has an exponential running time. In 1979, the Ellipsoid algorithm was introduced by Leonid Khachiyan as the first worst-case

polynomial time algorithm for linear programming. Although faster in the worst case, this algorithm is outperformed by the Simplex algorithm for most practical problems. In 1984, Narendra Karmarkar introduced an algorithm that is nowadays known as Karmarkar's algorithm. This algorithm improves the worst-case performance of the Ellipsoid algorithm, and yields a practical performance similar to that of the Simplex algorithm. For an in-depth description of these algorithms, [Kar91] is a good reference.

For an NP-hard optimization problem P, the fact that any linear program can be solved in polynomial time means that we cannot find a linear programming formulation of P such that for each instance p of P the number of constraints is polynomial in the size of p [ACK+00], unless P = NP. However, most optimization problems can be formulated as integer linear programming problems. By relaxing the integrality constraint, we obtain linear programming problems that can be solved in polynomial time. After obtaining an optimal solution to the linear program, we can derive feasible solutions to the original integer linear program. Obviously, these solutions are only approximate solutions and can be obtained by rounding the values of the variables that violate the integrality constraint.
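As a sketch of this relaxation-and-rounding idea applied to the 0-1 knapsack problem (2.1): the integrality constraint is relaxed to 0 ≤ xi ≤ 1, the resulting LP is solved, and the fractional solution is rounded back to a feasible integer solution. The example assumes that SciPy is available and uses its general-purpose LP solver; it is an illustration under these assumptions, not a statement about any particular algorithm from the literature.

```python
# Sketch: LP relaxation of the 0-1 knapsack problem (2.1) plus naive rounding.
# Assumes SciPy is available; linprog minimizes, so the profits are negated.
import numpy as np
from scipy.optimize import linprog

profits = np.array([10.0, 7.0, 5.0, 3.0])
weights = np.array([6.0, 4.0, 3.0, 2.0])
capacity = 10.0

# Relax x_i in {0,1} to 0 <= x_i <= 1 and solve the resulting LP.
res = linprog(c=-profits,                      # maximize p.x  ==  minimize -p.x
              A_ub=weights.reshape(1, -1),     # w.x <= c
              b_ub=[capacity],
              bounds=[(0.0, 1.0)] * len(profits))

x = np.round(res.x).astype(int)
# If rounding broke feasibility, drop the items with the smallest fractional
# values until the capacity constraint holds again.
for i in np.argsort(res.x):
    if (weights * x).sum() <= capacity:
        break
    x[i] = 0

print("fractional optimum:", -res.fun, "rounded solution:", x, "profit:", profits @ x)
```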

2.4.5.3 Dynamic Programming

Dynamic programming is a technique used to solve problems which can be divided into subproblems, and for which an optimal solution can be calculated from the optimal solutions of the subproblems. This is known as Bellman's principle of optimality and gains its name from Richard Bellman, the mathematician who first introduced dynamic programming. Unlike the divide-and-conquer strategy, dynamic programming can be used when the subproblems overlap. This technique generally comprises three steps:

1. break the problem into subproblems

2. solve the subproblems recursively

3. calculate the solution of the problem from the solutions of the subproblems

Dynamic programming can be used to solve NP-hard optimization problems in pseudo-polynomial time. A polynomial time algorithm has a running time which is polynomial with respect to the length of the input, which is commonly approximately the logarithm of the numerical value of the input. A pseudo-polynomial time algorithm has a running time which is polynomial with respect to the numerical value of the input, and not the length of the input. Thus, since the numerical value of the input is exponential in the length of

the input, such algorithms are technically exponential functions and are not considered polynomial. For example, take the most “naive” way of testing if a number n is prime: check if n divides evenly by any of the numbers between 2 and n − 1. This algorithm is pseudo-polynomial, since it requires n − 2 divisions, so it is linear with respect to the numerical value of n. Obviously, this algorithm is not efficient for big values of n. According to [Vaz01], all pseudo-polynomial time algorithms known for NP-hard problems are based on dynamic programming.
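The classical pseudo-polynomial algorithm for the 0-1 knapsack problem illustrates this: its running time O(n · c) is polynomial in the numerical value of the capacity c, but not in its encoding length. The following sketch is a straightforward implementation of this dynamic program.

```python
def knapsack_dp(profits, weights, capacity):
    """Pseudo-polynomial dynamic program for the 0-1 knapsack problem.

    table[i][c] is the best profit achievable with the first i items and
    capacity c (Bellman's principle: it is computed from the optimal solutions
    of the two subproblems "skip item i" and "take item i").  The running time
    O(n * capacity) is polynomial in the numerical value of the capacity,
    not in its encoding length.
    """
    n = len(profits)
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            table[i][c] = table[i - 1][c]                    # skip item i-1
            if weights[i - 1] <= c:                          # or take it
                take = table[i - 1][c - weights[i - 1]] + profits[i - 1]
                table[i][c] = max(table[i][c], take)

    # Recover the chosen items by walking the table backwards.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if table[i][c] != table[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return table[n][capacity], chosen

print(knapsack_dp([10, 7, 5, 3], [6, 4, 3, 2], 10))   # -> (17, [1, 0])
```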

2.4.5.4 Randomized Algorithms

A randomized algorithm is an algorithm that makes some of its choices in a non-deterministic, random way. This means that for the same input, the algorithm might yield different outputs. For optimization problems, this means that running a randomized algorithm several times on the same instance of a problem might yield different feasible solutions. Randomized algorithms offer a probabilistic performance guarantee with respect to the execution time or the quality of the solution. This means that there is a small probability that the algorithm takes long to execute, or yields a poor solution, respectively. Randomized algorithms that offer a probabilistic performance guarantee with respect to their execution time are called Las Vegas algorithms, while the ones offering a probabilistic performance guarantee with respect to the quality of the solution are called Monte Carlo algorithms. Both types of algorithms are known to be easy to implement and very efficient.
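As a minimal illustration of a Monte Carlo style algorithm for the knapsack problem, the following sketch samples a fixed number of random feasible packings and keeps the best one: the running time is bounded, while the quality of the result is only guaranteed probabilistically. The sampling scheme and the parameters are illustrative assumptions.

```python
import random

def random_knapsack(profits, weights, capacity, samples=1000, seed=None):
    """Monte Carlo sketch: sample random feasible packings, keep the best.

    The running time is fixed (the number of samples), while the quality of
    the returned solution is only guaranteed probabilistically, which is the
    defining trade-off of a Monte Carlo algorithm.
    """
    rng = random.Random(seed)
    n = len(profits)
    best_value, best_items = 0, []
    for _ in range(samples):
        order = list(range(n))
        rng.shuffle(order)                    # random item order
        remaining, value, items = capacity, 0, []
        for i in order:
            if weights[i] <= remaining and rng.random() < 0.5:
                items.append(i)
                value += profits[i]
                remaining -= weights[i]
        if value > best_value:
            best_value, best_items = value, items
    return best_value, best_items

print(random_knapsack([10, 7, 5, 3], [6, 4, 3, 2], 10, seed=42))
```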

2.4.6 Heuristics

In the previous subsection we have seen how approximation algorithms can be used to find solutions to optimization problems with a certain performance guarantee. For some problems, however, finding such algorithms is hard, while for other problems it has been demonstrated that, unless P = NP, no approximation algorithm with a good performance ratio exists. In other cases, the approximation algorithms found are hard to implement or have little practical relevance. In such cases, using heuristics is the only way to find solutions to an optimization problem. Heuristic methods generally have the property of being simple, easy to implement and fast in execution. As with approximation algorithms, the aim is not to obtain an optimal solution but one that is reasonably good. However, when a heuristic is used, no guarantee can be made about the quality of the solution yielded by such a heuristic-based algorithm. Also, no guarantee can be made about the execution

time of the algorithm. Both are major disadvantages compared with approximation algorithms. In the following, several heuristic methods based on local search are described.

2.4.6.1 Local Search, Hill Climbing and Simulated Annealing

One way to solve optimization problems is to use local search. Algorithms that use local search start from an initial solution and then move to a neighbour solution in the search space. This step is repeated until a local optimum is reached. The term local optimum denotes the fact that a global optimum might exist, that is, a solution better than the current one, but one that has not yet been visited. Obviously, visiting each solution in the search space is a bad strategy with respect to the execution time. Thus, to speed up the execution of the local search, heuristics are required. Instead of visiting all the neighbours of a solution, a heuristic is used to first rank the neighbours and then move to the best one. Of course, even when heuristics are used, we do not expect to reach a global optimum in polynomial time (unless P = NP).

Many heuristics, like for example hill climbing, only permit moving to a better solution than the current one. When this is no longer possible, which coincides with finding a local optimum, the algorithm returns this final state (solution) and terminates. However, there are many situations in which this approach can lead to bad solutions. Take for example the search space in Figure 2.10. Here, both a local maximum (the smaller hill) and a global maximum (the bigger hill) exist. A heuristic like hill climbing can easily get stuck on the smaller hill, the local maximum, and thus return what could be a suboptimal solution. A way to prevent such behaviour is to permit moving to a solution that is not necessarily better than the current one.

Simulated annealing is a heuristic method that permits moving from one neighbour solution to another, even if this neighbour solution is worse than the current one. It was inspired by the process of annealing a solid, which is used in metallurgy to increase the crystal size and reduce the defects of a certain material. The annealing of a solid first involves melting the solid by heating it at high temperatures. Then, due to the high energy of the system, the atoms become unstuck from the initial state and move randomly in a search for states with lower energy. After the solid has been brought into the liquid phase, the cooling down begins. During this process, the probability of atoms moving around decreases. To avoid reaching a locally minimal configuration, which corresponds to defective crystals, the cooling process must take place very slowly. This increases the chances of finding a better state.



Figure 2.10: Search Space Example

In computer science, simulated annealing works as follows: from the current state, a neighbour state is randomly chosen. A chosen state that is better than the current state is always accepted. Otherwise, the state is accepted with a probability p, with p < 1. Over time, the value of p slowly decreases, meaning that the probability of moving into a worse state than the current one decreases. An algorithm that uses simulated annealing always remembers the best state that was visited. When p reaches the value 0, the algorithm only permits moving to a neighbour state if this state is better than the current one. When this is no longer possible, the algorithm returns the best state that was visited and terminates. While this state might be a local maximum, it is known that when p decreases slowly enough, a global maximum is reached with a probability close to 1 [RN04].
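The following sketch applies simulated annealing to the 0-1 knapsack problem. A neighbour is obtained by flipping a single item, worse neighbours are accepted with a probability that decreases over time, and the best state ever visited is remembered; the cooling schedule and its parameters are illustrative assumptions.

```python
import math, random

def anneal_knapsack(profits, weights, capacity, steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Simulated annealing sketch for the 0-1 knapsack problem.

    A neighbour is obtained by flipping one randomly chosen item.  Worse
    neighbours are accepted with probability exp(delta / T); the temperature T
    (playing the role of p in the text) decreases geometrically.  The best
    state ever visited is remembered and returned.
    """
    rng = random.Random(seed)
    n = len(profits)
    x = [0] * n                                    # start with the empty knapsack

    def value(state):
        w = sum(weights[i] for i in range(n) if state[i])
        if w > capacity:
            return -1                              # infeasible states are never kept as best
        return sum(profits[i] for i in range(n) if state[i])

    current, best, best_x, t = value(x), value(x), list(x), t0
    for _ in range(steps):
        i = rng.randrange(n)
        x[i] ^= 1                                  # flip one item -> neighbour state
        candidate = value(x)
        delta = candidate - current
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = candidate                    # accept the move
            if current > best:
                best, best_x = current, list(x)
        else:
            x[i] ^= 1                              # reject: undo the flip
        t *= alpha                                 # cool down slowly
    return best, [i for i in range(n) if best_x[i]]

print(anneal_knapsack([10, 7, 5, 3], [6, 4, 3, 2], 10))
```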

2.4.6.2 Tabu Search

Tabu search is a local search technique that, just like simulated annealing, permits the move from one solution to a worse neighbour solution with the hope of obtaining a better local optimum in the end. However, unlike simulated annealing, tabu search uses a deterministic approach when choosing the next move.

Commonly, a memory structure called a tabu list is used to keep moves that are considered tabu, that is, moves that are considered to lead to a bad local optimum and are thus not allowed. When tabu search is used, a move to a neighbour solution means choosing the best non-tabu move. Since moving to a worse solution is allowed with tabu search, the best solution visited so far needs to be remembered. A common use for a tabu list is to save the moves

that have already been performed, thus preventing the search from returning to solutions that have already been visited. Tabu lists and other “tabu” memory structures are at the core of tabu search. However, using tabus can sometimes lead to prohibiting attractive moves that could finally lead to an improved solution. To avoid this, so-called aspiration criteria are used, which permit the execution of a tabu move only if it leads to a solution better than the current best solution.
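A minimal sketch of tabu search for the 0-1 knapsack problem is given below: moves are single-item flips, recently flipped items are kept in a tabu list for a fixed tenure, and the aspiration criterion allows a tabu move when it would improve on the best solution found so far. The tenure and iteration limit are illustrative assumptions.

```python
from collections import deque

def tabu_knapsack(profits, weights, capacity, iterations=200, tenure=5):
    """Tabu search sketch for the 0-1 knapsack problem.

    A move flips one item in or out of the knapsack.  Recently flipped items
    are kept in a tabu list (a bounded queue) and may not be flipped again,
    unless the aspiration criterion holds: the move would beat the best
    solution found so far.
    """
    n = len(profits)

    def value(state):
        w = sum(weights[i] for i in range(n) if state[i])
        return sum(profits[i] for i in range(n) if state[i]) if w <= capacity else -1

    x = [0] * n
    best_x, best = list(x), value(x)
    tabu = deque(maxlen=tenure)                    # items that may not be flipped

    for _ in range(iterations):
        candidates = []
        for i in range(n):                         # evaluate all single-flip moves
            y = list(x)
            y[i] ^= 1
            v = value(y)
            if i not in tabu or v > best:          # non-tabu, or aspiration criterion
                candidates.append((v, i, y))
        if not candidates:
            break
        v, i, y = max(candidates)                  # deterministic: best allowed move
        x = y
        tabu.append(i)
        if v > best:
            best, best_x = v, list(x)
    return best, [i for i in range(n) if best_x[i]]

print(tabu_knapsack([10, 7, 5, 3], [6, 4, 3, 2], 10))
```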

2.4.6.3 Genetic Algorithms

Genetic Algorithms (GAs) are a search technique used to solve optimization problems. The idea behind this technique is to simulate biological evolution: technical systems are optimized based on the principles of biological evolution. Thanks to the advances in modern molecular genetics, we now know how properties of individuals are inherited by the following generations. This inherited material is carried by so-called chromosomes. Chromosomes basically represent a collection of genes. The gene is the place where the value corresponding to each property is retained. For example, a gene can hold the value “blond” of the “hair colour” property. All the genes and their values are referred to as the genotype of an organism. The observable behaviour of this organism is referred to as the organism's phenotype. In other words, one can learn about the genotype by observing the DNA of the organism, while the phenotype can be known by observing the outward appearance of the organism. When two organisms reproduce, a cross between the genetic material of the two parents takes place. This is referred to as crossover and implies recombining the chromosomes of the two parents. In some cases, a mutation can take place, meaning that the value of a resulting gene is not the result of the chromosome recombination, but rather the result of the influence of an external factor. However, the probability of a mutation taking place is very low. While the crossover operator is the actual local search operator, a mutation can be seen as a “jump” to a state located “far away” in the search space, relative to the current state. According to the theory of evolution, the individuals of a population are subject to natural selection. At the core of natural selection is the principle of survival of the fittest. This way, evolution can be seen as an elimination process, where the individuals that are better adapted to the surrounding environment survive. The new generation, obtained as a result of the reproduction of the surviving individuals, will share the genetic properties of their ancestors and will generally be even better adapted to the surrounding environment, due to the crossover and occasional mutation processes.

Genetic algorithms use these principles of biological evolution to solve optimization problems. The genetic algorithm approach comprises two phases, as shown in Figure 2.11.

Figure 2.11: The GA approach

The first phase is called the initialization phase. In this phase, an appropriate representation form is chosen, with respect to the problem that needs to be solved. Note that there is no general representation that suits all problems; rather, each problem requires its own representation that takes the nature of the problem into consideration. After such a representation form is found, the initial population must be generated, thereby ending the initialization phase, which is never to be repeated in the problem solving process.

The second phase is called the evolution cycle. As the name suggests, this phase is repeated until a termination condition is reached. This termination condition generally means finding one or more acceptable solutions to the problem, or reaching a maximum number of iterations. The evolution cycle starts with an evaluation of all the individuals of the current generation. After that, the parent individuals are selected, based on their fitness. The parent individuals are then reproduced by means of crossover. Every now and then, depending on a previously determined probability, a mutation can take place. The last phase of the evolution cycle represents the selection of the individuals belonging to the new generation.
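The following sketch shows a minimal genetic algorithm for the 0-1 knapsack problem along the lines of the two phases described above: bit strings as representation, a randomly generated initial population, and an evolution cycle with fitness evaluation, tournament selection, one-point crossover and bit-flip mutation. All parameters and operator choices are illustrative assumptions.

```python
import random

def ga_knapsack(profits, weights, capacity, pop_size=30, generations=100,
                crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Genetic algorithm sketch for the 0-1 knapsack problem.

    Phase I (initialization): the representation is a bit string per individual
    and a random initial population is generated.  Phase II (evolution cycle):
    evaluate fitness, select parents, apply crossover and mutation, and form
    the next generation, until the iteration limit is reached.
    """
    rng = random.Random(seed)
    n = len(profits)

    def fitness(ind):                               # infeasible individuals score 0
        w = sum(weights[i] for i in range(n) if ind[i])
        return sum(profits[i] for i in range(n) if ind[i]) if w <= capacity else 0

    def tournament(pop):                            # parent selection
        return max(rng.sample(pop, 3), key=fitness)

    # Phase I: random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    # Phase II: the evolution cycle.
    for _ in range(generations):
        next_gen = []
        while len(next_gen) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            if rng.random() < crossover_rate:       # one-point crossover
                cut = rng.randrange(1, n)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            for i in range(n):                      # rare bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] ^= 1
            next_gen.append(child)
        population = next_gen

    best = max(population, key=fitness)
    return fitness(best), [i for i in range(n) if best[i]]

print(ga_knapsack([10, 7, 5, 3], [6, 4, 3, 2], 10))
```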

Chapter 3

Analysis

Virtualization-based environments are nowadays managed via GUI or command-line tools like the ones described in Section 2.1. While virtualization is generally used to simplify maintenance and to consolidate servers, it does not help reduce the complexity of the infrastructure. In fact, a model where a service is hosted inside a virtual machine, which in turn is hosted along with other virtual machines inside a physical machine, is more complex than the previously used model where a physical machine hosts a single service. We have previously argued that precisely this complexity is the main driving factor for the development of self-managing systems. With virtualization making its place in the data center [Inf], a solution for automatically managing such VM-based environments is obviously needed. In this chapter, an analysis of self-management approaches for virtualization-based environments is presented. The rest of this chapter is organized as follows: first, Section 3.1 presents the related academic work as well as related commercial products. Upon observing the limitations of this work, a formal model of the infrastructure and a set of requirements for the management framework are defined in Section 3.2. Designing a framework that fulfills these requirements leads to three problems, which are analyzed in Sections 3.4, 3.3 and 3.5.

3.1 State of the Art

There are no commercial solutions that can manage VM-based environments in a fully autonomic way. The few existing solutions that address this problem use simple approaches to simplify the management task of the system administrator. This section presents a survey of preliminary academic work related to the self-management of VM-based environments, followed by a brief description of currently available commercial solutions.


3.1.1 Related Academic Work

Various publications exist that deal with the autonomic management of virtualization-based environments. Unfortunately, the approaches described in these publications prove to be inconsistent with respect to the taxonomy used to classify each approach. Therefore, we propose a basic taxonomy which can be easily extended in the future. Furthermore, within this taxonomy, the existing academic research will be described. The taxonomy used here is partially based on the work of Kephart and Walsh [pol04]. There, the authors define three types of policies: action policies, goal policies and utility functions. However, the term policy is used in many publications with a more restricted meaning, denoting what Kephart and Walsh refer to as an action policy. In order to avoid confusion, we will use the term policy only when it represents an action policy. Thus, the currently published self-management approaches which are the subject of this section can be categorized as one of the following:

• Policy-based

• Goal-driven

• Utility-based

Policy-based

At the core of a policy-based approach lies a policy that dictates which actions are to be performed, given a specific state. A mechanism is used to match the current state of the system to one of the states given in the policy, and when this match is successful, the action corresponding to that state is performed. Policy-based approaches have been a popular way to manage computing systems [LLS02, LMKB01]. This is mainly because policy-based approaches are easy to implement and perform reasonably well. Virtualization-based computing systems are no exception. In [RRX+06], Ruth et al. developed a system called VIOLIN, based on a so-called nanoHUB1 infrastructure, which uses the Xen hypervisor for virtualization. The system is composed of a multitude of physical machines and one adaption manager. Each physical machine of the system hosts one or more virtual machines and one monitor daemon. The monitor daemon monitors the resource allocation (memory and CPU) of both the physical machine and the virtual machines running on that physical machine. Besides monitoring,

1http://www.nanohub.org

the monitor daemon is also able to allocate resources to local virtual machines, thus serving both as a monitoring and an actuator component. Results from all these monitoring daemons are gathered by the adaption manager, which then computes a global view of available and allocated resources. Furthermore, the adaption manager dictates virtual machine resource allocation through the monitoring daemons. Two mechanisms are used: a local mechanism and live migration. The local mechanism uses memory ballooning and weighted CPU scheduling to reallocate local resources belonging to the physical machine hosting the VMs. When this is not possible, live migration is used to transfer a virtual machine requiring more resources to another physical machine. The resource reallocation policy used by the adaption manager does not attempt to find the optimal resource utilization. Instead, a heuristic is used to improve the overall performance of the system by finding over- and under-utilized VMs. Results obtained by the authors show an increase in performance and resource utilization. The overhead caused by migration is also acceptable. Another policy-based approach has been published by Grit et al. [GIYC]. Here, the authors used and extended Shirako2, a Java-based toolkit for resource leasing services, to explore algorithmic challenges when using policies for adaptive virtual machine hosting on clusters. Shirako uses the concept of leases. These leases are contracts between resource providers and resource consumers. They define the resources allocated to a consumer and the duration of their ownership. Although Xen is used as the virtualization technology, the analysis performed by the authors has a generic character. Thus, the authors propose a two-layer architecture, with brokers responsible for VM provisioning and provider sites responsible for VM placement. The authors explored three challenges: the provisioning problem, the placement problem and the combination of the two. The provisioning problem has proven to be an instance of the knapsack problem, which, as seen in Section 2.4, is NP-hard even in its simplest form. Also, the analysis of the VM placement problem shows that finding a plan that minimizes the number of migrations is NP-hard. Things get even more complicated when provisioning and placement policies must work together. An alternative solution can be the tight integration of the two policies, to avoid creating a tension between them. Historic data was used to emulate the behaviour of a system when a mechanism is used that combines the two types of policies. This mechanism is quite simple: when more resources are required by a VM, if enough local resources are available, the VM is reprovisioned; otherwise, live migration is used to relocate the VM on another host inside the cluster. The authors simplified the complexity of the problem by suspending a VM to disk when neither local nor remote resources are

2http://www.cs.duke.edu/nicl/cereus/shirako.html

directly available. Results are presented that show how different combinations of policies perform with respect to the number of migrations over a period of time.

Goal-driven

Instead of having a policy that defines what to do when a specific state is reached, goal-driven approaches specify a desired state, a goal that must be achieved. The intelligent agent must then perform an action to bring the system into the desired state, and thus to achieve the goal. Different techniques exist to implement such an intelligent agent.

One field of research that has inspired strategies for the dynamic allocation of resources in virtual environments using a goal-driven approach is control theory [HDPT04]. Zhang et al. [ZBG+05] used a control-theoretic model for VM adaption. The basic idea is that the VMs, called friendly virtual machines, are responsible for adjusting their demand for resources. This adjustment must consider both an increase in efficiency as well as fairness. To implement this, the authors modified an instance of User Mode Linux (UML) running on one physical machine, so that each UML VM is able to perform these adjustments. The technique implemented in each VM is called feedback-based adaptive control and comprises three mechanisms. First, a feedback signal mechanism is used to determine the bottlenecked resource of a VM. This mechanism uses the virtual clock time (VCT) as a method to detect overload. The VCT represents the real time between two consecutive clock cycles and is used by the authors to determine which resource is the bottleneck. Second, a control signal mechanism is used to adapt the resource consumption. The authors propose two methods: (1) control the maximum number of processes or threads and (2) control the rate by forcing the VM to periodically sleep. Finally, the brain of the system is the controller, which determines the adaption strategy. The authors use a rule called additive-increase/multiplicative-decrease (AIMD), which linearly increases the demand of a VM and forces the demand to decrease exponentially when a feedback signal indicates an overload. This way, the VM can adapt its demand to match the capacity of the underlying resource. According to the authors, AIMD is an adaption rule known to converge to efficiency and fairness.
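The AIMD rule itself can be written down very compactly. The following sketch shows a generic additive-increase/multiplicative-decrease update, not the implementation used by the cited authors; the increase step, decrease factor and bounds are illustrative assumptions.

```python
def aimd_step(demand, overloaded, increase=1.0, decrease=0.5, maximum=100.0):
    """Generic AIMD update (illustrative parameters, not the cited implementation).

    Without an overload signal the demand grows linearly (additive increase);
    an overload signal cuts it exponentially (multiplicative decrease).
    """
    if overloaded:
        return max(demand * decrease, 1.0)
    return min(demand + increase, maximum)

# Example: the demand ramps up until feedback signals overload, then halves.
demand = 10.0
for overload_signal in [False, False, False, True, False]:
    demand = aimd_step(demand, overload_signal)
    print(round(demand, 1))   # 11.0, 12.0, 13.0, 6.5, 7.5
```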

In [PSZ+07], Padala et al. use classical control theory to address dynamic resource allocation problems in shared infrastructures. To test their approach, the authors use the Xen hypervisor, RUBiS, an online auction site [RUB], and a Java implementation of TPC-W, an e-commerce suite [TPCa]. Both RUBiS and TPC-W are multi-tier applications, consisting of a database tier and a Web tier, and are used for benchmarking purposes. The feedback-control strategy chosen by the authors involves using a sensor, an actuator and

a controller. The sensor measures how many of the allocated CPU cycles are actually consumed by the VM. The actuator uses the CPU scheduler of the hypervisor to assign CPU shares to virtual machines. This way, both the sensor and the actuator make use of the Xen SEDF scheduler, which implements weighted fair sharing of CPU processing time. The controller uses a two-layer architecture. Each VM is assigned a utilization controller which computes the CPU share required by that VM. In doing this, the utilization controller uses two criteria: (1) make sure that the VM has enough resources so that it does not become a bottleneck3 and (2) make sure that the resource utilization of the VM is kept at a high level. The utilization controllers send the results of their measurements to the arbiter controller, which determines if the requested CPU entitlements of all virtual machines can be satisfied. If this is not possible, the arbiter controller uses a QoS differentiation mechanism to determine which CPU entitlements will be satisfied. Results presented by the authors show that using this approach increases the resource utilization while meeting certain QoS goals. Time series analysis has been used for decades to forecast the behavior of a system. In [BKB07], Bobroff et al. describe a mechanism for the dynamic migration of VMs. For this purpose, a forecasting technique based on time series is applied to historical data to determine future demands. The result of the forecasting is then fed to a management algorithm. Bin packing heuristics are used on the forecasted demand to remap virtual machines to physical machines with the purpose of providing probabilistic Service Level Agreement (SLA) guarantees.

Utility-based

Utility-based approaches use utility functions to express the value of a state. The job of the intelligent agent is to find a sequence of actions that leads to the state with the highest utility, or one close to it. Utility functions have been previously used in autonomic systems [WTKD04]. Menascè et al. used a utility-based approach for the dynamic CPU allocation to virtual machines [MB06]. A formal model was developed, which defines different workload classes for VMs. The response times of these workloads are used as performance metrics. Each VM has an associated utility function, which is a function of its performance metrics. The global utility function of the system Ug is a function of the utility functions of each VM. This model abstracts the optimal CPU allocation task to optimizing Ug. The authors use two approaches to CPU allocation: using CPU prioritization and using CPU shares. In both cases, finding an optimal allocation generates a search space that can be extremely large, depending on the number of virtual machines and on the number of priority classes or CPU shares used. This means that a heuristic is required to find a solution close to the optimum. The authors use a combinatorial search technique called beam search, which works as follows: starting from the current allocation, compute Ug for all the neighbours of the allocation; keep the k neighbours with the best values of Ug; repeat the process for all k neighbours, again keeping only the k best ones; continue until a certain depth d is reached. Although no actual virtualization technology was used to implement this approach, the authors obtained promising results by means of simulation.

3Queueing theory shows that the response time increases sharply when the CPU utilization gets close to 100%; the authors used a CPU utilization of 80% as their safety margin
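The beam-search procedure described above follows a simple pattern that can be sketched generically in Java. The sketch below assumes that a neighbourhood function and the global utility function Ug are supplied from outside; it is a textbook-style illustration, not the implementation used by the authors.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Generic beam search: keep only the k best neighbours at every level, up to depth d. */
public class BeamSearch<S> {

    private final Function<S, List<S>> neighbours; // expands a state into its neighbours
    private final Function<S, Double> utility;     // the global utility function Ug

    public BeamSearch(Function<S, List<S>> neighbours, Function<S, Double> utility) {
        this.neighbours = neighbours;
        this.utility = utility;
    }

    public S search(S initial, int beamWidth, int depth) {
        List<S> beam = List.of(initial);
        S best = initial;
        for (int level = 0; level < depth; level++) {
            List<S> candidates = new ArrayList<>();
            for (S state : beam) {
                candidates.addAll(neighbours.apply(state));
            }
            if (candidates.isEmpty()) {
                break;
            }
            candidates.sort(Comparator.comparing(utility).reversed()); // best utility first
            beam = candidates.subList(0, Math.min(beamWidth, candidates.size()));
            if (utility.apply(beam.get(0)) > utility.apply(best)) {
                best = beam.get(0);
            }
        }
        return best;
    }
}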

3.1.2 Limitations of the Previous Work

In the preceding subsection 3.1.1, the existing academic work related to the use of autonomic computing techniques for managing virtual environments has been presented. Yet all the approaches that were analyzed have some limitations. Many of them use historical data and simulation to test the results of their controller strategy. Although correct from a theoretical point of view, they still need to prove their fitness in a real, working environment.

Furthermore, to reduce the complexity of the resource allocation problem, the authors used two different approaches. The most common one was to take only one resource into consideration, for example to deal only with CPU allocation and ignore other resources such as memory, or to perform live migration whenever service level agreement violations occur, without considering the reallocation of locally available resources. This is again not realistic, since in a real environment different resources can be the bottleneck. In a scenario where a VM needs more memory to perform well, and this memory is locally available, neither increasing the CPU share nor migrating the VM are acceptable solutions. The second approach used was to reduce the demand for resources of a VM when these resources can't be provided. This is done either by suspending the VM or by forcing the VM to periodically sleep. However, both of these solutions actually lead to increased response times of the services running in the respective VMs.

An intelligent controller capable of replacing the system administrator should take its decisions based on data gathered through both service-level and VM-level monitoring. The ultimate quality criterion is the behaviour of the services hosted on the virtual machines, and not the resource utilization of these virtual machines. Conclusions regarding the quality of a service cannot always be drawn by simply monitoring the resource utilization of a virtual machine.

3.1.3 Commercial Solutions

A few years ago, when technologies like VMware and Xen started to be used in data centers, it became obvious that the next big concern would be how to manage these new virtualization-based environments. Since then, a multitude of tools have been developed to address this problem, like the ones presented in Section 2.1.5.2. Yet most of them focus on providing a feature-rich and user-friendly management interface. In this subsection, commercial solutions from two vendors are presented, which differ from the others by focusing on automating the management process. These products have not been tested, and their description is based on the information available on the respective Web sites. In some cases, this information does not allow assertions to be made about their mode of operation.

ToutVirtual’s Virtual IQ4 Enterprise Suite consists of four products: VirtualIQ Pro, ProvisionIQ, AssetIQ and ShieldIQ. VirtualIQ Pro provides monitoring and policy-based management through a Web interface and can be installed as a virtual appliance. Such virtual appliances are available for VMware, Xen and . ProvisionIQ, as the name suggests, is used to handle the provisioning and life-cycle management of virtual machines, while AssetIQ offers a comprehensive view of all the environment’s resources. ShieldIQ is probably the most interesting product of the range and is used to monitor and maintain QoS parameters of a virtual appliance. Unfortunately, ShieldIQ is only available as a virtual appliance that runs on top of VMware products. It is also hard to estimate how this product range can scale with respect to different self-management strategies.

Netuitive5 developed its product range based on the concept that managing a virtual environment with script- and rule-based monitoring tools is a difficult task. This is because performance metrics tend to change fast, thus making it very difficult to track possible problems and react properly to avoid them. Instead, Netuitive comes with a set of performance management solutions which they claim to be “self-learning and continuously adaptive”, replacing “human guesswork with automated mathematics and analysis”. The way it works is that the software learns the normal parameters of the environment and is able to forecast problems before they occur. Besides forecasting, the tools from Netuitive are also able to perform fault diagnostics and find the root cause of a problem, thus reducing the mean time to repair. Overall, Netuitive provides an advanced monitoring solution, which is able to forecast problems and perform diagnostics. A big advantage is the self-learning process, which eliminates the need for acquiring the knowledge from a human expert (i.e. the system administrator), a task that has always proved to be difficult. However, the solutions from Netuitive can’t provide what it takes to be a full autonomic manager. While components are available for acquiring knowledge, monitoring and analyzing, there seem to be no components for planning and automatically executing decisions based on the performed analysis. These tasks still need to be accomplished by a human system administrator.

4http://www.toutvirtual.com/solutions/solutions.php
5http://www.netuitive.com/products/

3.2 Architectural considerations

As previously argued, the existing solutions for self-managing virtual machines have considerable limitations. This section presents the formal requirements of this thesis, aimed at designing and evaluating self-management approaches for virtual machine-based environments. First, we start with a formal model of the managed infrastructure. Then, a motivation for a common framework is given, as well as a set of requirements for this framework.

3.2.1 A Formal Model of the Managed Infrastructure

To evaluate different self-management approaches, they all must have one thing in common: the managed infrastructure. In the following, a formal model of the managed infrastructure is provided. Our infrastructure is based on a cluster model. In this cluster, three types of entities exist: virtual machines, physical machines and the cluster itself. Thus, we define the cluster

$C = \{S, P, V, g, s\}$

where

• $S$ - the set of services

• $P$ - the set of physical machines (PMs)

• $V$ - the set of virtual machines (VMs)

• $g \subseteq P \times V$, a relation which states which VMs are provisioned on which PMs

• $s : V \rightarrow S$, a surjective function which says which service runs on a given VM6

Each VM $v_i$ has:

• a vector $\vec{a}_i = (cpu_i, mem_i, net_i, io_i)$, representing the amount of allocated resources

• a vector $\vec{r}_i = (cpu_i, mem_i, net_i, io_i)$, representing the amount of required resources

where:

• $cpu$ represents the CPU share

• $mem$ represents the amount of memory

• $net$ represents the network bandwidth

• $io$ represents the I/O bandwidth

We say that $\vec{p} \geq \vec{q}$ if $\forall p_n \in \vec{p}$ and its corresponding element $q_n \in \vec{q}$: $p_n \geq q_n$. The vector operators $\leq$, $>$, $<$ and $=$ are defined analogously.

Furthermore, each PM $p_j$ has:

• a vector $\vec{t}_j = (cpu_j, mem_j, net_j, io_j)$, representing the total amount of resources on that PM

• a vector $\vec{u}_j = (cpu_j, mem_j, net_j, io_j)$, representing the amount of resources provisioned to VMs

Using this formal model, we can define when a virtual machine has too many or too few resources assigned: if $\vec{r}_i > \vec{a}_i$, then the virtual machine $v_i$ needs more resources; if $\vec{r}_i < \vec{a}_i$, then the virtual machine $v_i$ has more resources than it requires. Similarly, we can define whether a virtual machine can be provisioned on a physical machine, or even determine the resource utilization of a physical machine. Note that the abstraction provided by this model is somewhat simplified, since the dimensions are considered independently. This means that we exclude the possibility of a virtual machine requiring more of resource x while at the same time releasing some of resource y.

6We have defined a model where a single service is hosted on a virtual machine, but the same service can be hosted on more than one virtual machine; we believe that this corresponds to scenarios where virtualization is used in data centers
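A minimal Java rendering of this model is sketched below; the class and field names are chosen for illustration only, and the check for under-provisioning flags a VM as soon as any required component exceeds its allocation, which slightly relaxes the strict vector comparison defined above.

/** Resource vector (cpu, mem, net, io) with the componentwise comparison defined above. */
public class ResourceVector {

    public final double cpu, mem, net, io;

    public ResourceVector(double cpu, double mem, double net, double io) {
        this.cpu = cpu;
        this.mem = mem;
        this.net = net;
        this.io = io;
    }

    /** p >= q holds iff every component of p is >= the corresponding component of q. */
    public boolean greaterOrEqual(ResourceVector other) {
        return cpu >= other.cpu && mem >= other.mem && net >= other.net && io >= other.io;
    }

    public ResourceVector plus(ResourceVector other) {
        return new ResourceVector(cpu + other.cpu, mem + other.mem, net + other.net, io + other.io);
    }
}

/** A VM is under-provisioned when at least one required component exceeds the allocation. */
class VirtualMachine {
    ResourceVector allocated; // a_i
    ResourceVector required;  // r_i

    boolean needsMoreResources() {
        return !allocated.greaterOrEqual(required);
    }
}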

3.2.2 Framework Requirements

While the authors of the publications presented in subsection 3.1.1 all claim to have obtained good results, it is impossible to evaluate these approaches against each other. First of all, this is not possible because different architectures and infrastructures were used. By defining a formal model of the managed infrastructure, we addressed part of the problem. However, to be able to evaluate different controller strategies, a common framework is required into which these strategies can be plugged and replaced. For such a framework, we define the following requirements:

1. Support for service level management

2. Separation of control algorithms from the management framework

3. Support for different virtualization technologies

4. Support for a common evaluation mechanism

5. Scalability

(1) means that the framework should provide support for managing services hosted in virtual machines. As shown in Section 2.2, this involves monitoring various QoS parameters and making sure that these parameters are SLA-conform.

(2) basically assures that the management framework is responsible for dealing with aspects like monitoring and execution of tasks, while a separate intelligent controller defines management tasks based on gathered monitoring data. As such, a framework should support different types of intelligent controllers by providing a generic controller interface.

(3) requires the framework to support different virtualization technologies (e.g. VMware ESX or Xen, see Section 2.1), without affecting the design of the controller. Thus, from a controller’s perspective, the framework is acting as an abstraction layer on top of the virtualization technology. Generic interfaces for monitoring and task execution should simplify plugging in adapters for various virtualization technologies.


(4) means that the framework has to support a consistent mechanism for tracing and comparing management decisions. Only in this way can different self-management approaches be evaluated against each other.

(5) states that the framework should be scalable with respect to the number of physical and virtual machines that it can manage. New physical or virtual machines dynamically added to the cluster must be recognized by the framework.
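Requirement (2) suggests that the framework exposes a generic controller interface into which different strategies can be plugged. The following Java sketch shows one possible shape of such an interface; the type names and the method signature are assumptions for illustration, not part of the framework specification.

import java.util.List;

/** A pluggable controller: turns a resource request plus the global resource view into management tasks. */
interface Controller {
    List<ManagementTask> handleRequest(ResourceRequest request, GlobalResourceView view);
}

/** Placeholder types, kept deliberately minimal for the sketch. */
class ResourceRequest { String vmId; String resource; double additionalAmount; }
class GlobalResourceView { /* VM-to-PM mapping plus free capacities per PM */ }
class ManagementTask { String targetPhysicalMachine; String description; }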

3.3 The Resource Allocation Problem

The framework requirements defined in the previous section lead to a clear separation between the controller and the management framework. The management framework provides an abstraction layer on top of which the controller only deals with resource providers (the physical machines) and resource consumers (the virtual machines). The job of the controller is to satisfy the requirements of the resource consumers, given a global view of the cluster resources supplied by the resource providers.

3.3.1 Controller Requirements

Although separated from the management framework, the controller is still subject to some requirements derived from the nature of the managed infrastructure. Thus, any controller mechanism must consider the following requirements.

1. fast execution times

2. solution at reduced costs

(1) derives from the fact that resource requirements coming from the virtual machines must be satisfied in a short time. This puts an upper bound on the execution time of the resource allocation algorithm.

(2) is derived from the fact that reallocating resources is costly, especially when virtual machines have to be migrated.

Having these requirements means that the goodness of a solution is a function of both the profit offered by this solution and the cost of reaching it.


3.3.2 Approaches

The controller is responsible for satisfying the resource requirements of the virtual machines. An equivalent formulation of our resource allocation problem is the problem of packing virtual machines into physical machines. More abstractly, this means packing items into boxes with the purpose of reaching a certain profit. This problem is similar to the knapsack problems presented in Section 2.4. Thus, a first observation can be made: our resource allocation problem belongs to the knapsack family of problems. This is an important observation since, as described in Section 2.4, all knapsack optimization problems are known to be NP-hard7. This means that the problem that needs to be solved by the controller is an NP-hard optimization problem.

Sometimes, however, satisfying all the requirements of all virtual machines might not be possible. This is similar to not being able to pack all the items in the available knapsacks. In such a case, the approach used is to maximize the profit. For knapsack problems, where each item has a specific profit, this means finding the set of items that can be packed in the available knapsacks and which gives the highest profit. For our problem, if we consider that all virtual machines have the same profit, this means maximizing the number of satisfied virtual machines. Alternatively, if the virtual machines have different profits, then the goal is to maximize this profit while not exceeding the capacity of the physical machines.

We have seen in Section 2.4 that, unless P = NP, there is no algorithm that finds an optimal solution to an NP-complete optimization problem in polynomial time. In practice, small instances of an optimization problem can be solved by simply enumerating all possible solutions. For our problem, however, this approach cannot be used, since the size of our problem corresponds to the size of a virtualized data center. Such a data center could for example have 100 virtual machines provisioned on 10 physical machines, in which case there are already $10^{100}$ possible virtual machine to physical machine mappings, so simple enumeration is not a feasible approach. This leaves two options:

• heuristics

• approximation algorithms

In the following, these two alternatives are analyzed.

7in their most general case


3.3.3 Heuristics

Although there are no execution time guarantees for algorithms using heuristics, in practice algorithms that use heuristics are known to have fast execution times. This makes heuristics a good candidate for our problem, since they fulfill the first controller requirement. The second controller requirement states that the costs of reaching a solution should not be too high. For our problem, a solution translates into a certain resource allocation map. Thus, the second requirement translates into finding a resource allocation map close to the initial resource allocation map. In such cases, heuristics based on local search are a good choice, since these heuristics look for solutions in the neighbourhood of the initial state. This is exemplified in Figure 3.1. Here, the black spots represent states in the search space, the red spot represents the initial state and the circle around the initial state represents the neighbourhood of the initial state.

Figure 3.1: Example of local search through the search space

In Section 2.4, several local search heuristics have been introduced. For the analysis, we considered the following four techniques:

• Hill climbing

• Simulated annealing

• Genetic algorithms


• Tabu search

Hill climbing always selects the best neighbour from all the neighbours of the present state. For most problems, this strategy converges fast to a local maximum. For our problem, such a behaviour has its advantages and also its disadvantages. The advantages are that hill climbing generally executes fast, and that the solution found, corresponding to a local maximum, is usually not far away from the initial state. In terms of resource allocation, this means that a new resource allocation map is found fast, and this map does not differ much from the initial one, thus reducing the number of reallocations necessary. However, there are many cases when the quality of the solution found by hill climbing is rather poor. This is a clear disadvantage of this approach.
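For illustration, a generic hill climbing loop can be written in a few lines of Java; the sketch below assumes that the neighbourhood function and the utility function for resource allocation maps are provided from outside and is not tied to any particular encoding.

import java.util.List;
import java.util.function.Function;

/** Plain hill climbing: always move to the best neighbour until no neighbour improves the utility. */
public class HillClimbing<S> {

    private final Function<S, List<S>> neighbours;
    private final Function<S, Double> utility;

    public HillClimbing(Function<S, List<S>> neighbours, Function<S, Double> utility) {
        this.neighbours = neighbours;
        this.utility = utility;
    }

    public S climb(S initial) {
        S current = initial;
        while (true) {
            S bestNeighbour = null;
            double bestValue = utility.apply(current);
            for (S candidate : neighbours.apply(current)) {
                double value = utility.apply(candidate);
                if (value > bestValue) {
                    bestValue = value;
                    bestNeighbour = candidate;
                }
            }
            if (bestNeighbour == null) {
                return current; // local maximum reached
            }
            current = bestNeighbour;
        }
    }
}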

Simulated annealing reduces the risks of reaching a local maximum of poor quality by allowing random moves through the search space. While the solutions found by simulated annealing are still quite close to the initial state, increasing the quality of the solution generally means increasing the execution time. Using simulated annealing might lead to a violation of the first controller requirement.

Genetic algorithms use an approach based on the concept of natural evolution and the survival of the fittest principle. Although this makes them very different from the other heuristics based on local search, genetic algorithms are still a local search technique. For our problem, using an appropriate encoding, the crossover mechanism of genetic algorithms is similar to exchanging virtual machines between physical machines. The success of a genetic algorithm approach depends mostly on finding such an encoding. Figure 3.2 shows an example of a one-point crossover, using a possible encoding. Here, a mapping of virtual machines to physical machines is encoded in the form of an individual. A virtual machine corresponds to a gene, and sections of the individual represent physical machines. When the crossover takes place, the mappings of PM 1 and PM 2 are exchanged by the two individuals, while the mapping of PM 3 remains the same. As such, in the next generation, an individual appears corresponding to a state where VM 2 and VM 1 are provisioned on PM 1, VM 4 and VM 5 are provisioned on PM 2 and VM 3, VM 6 and VM 7 are provisioned on PM 3. A fitness function must then check the validity and, if valid, the utility of this state.
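The following Java sketch illustrates a one-point crossover on a deliberately simplified encoding, in which position i of an individual stores the index of the physical machine hosting virtual machine i; this differs from the section-based encoding of Figure 3.2 and is meant only to make the mechanism concrete.

import java.util.Arrays;
import java.util.Random;

/**
 * One-point crossover on a simple encoding: individual[i] holds the index of the
 * physical machine on which virtual machine i is provisioned.
 */
public class MappingCrossover {

    private static final Random RANDOM = new Random();

    public static int[][] crossover(int[] parentA, int[] parentB) {
        int point = 1 + RANDOM.nextInt(parentA.length - 1); // crossover point
        int[] childA = new int[parentA.length];
        int[] childB = new int[parentB.length];
        for (int i = 0; i < parentA.length; i++) {
            childA[i] = (i < point) ? parentA[i] : parentB[i];
            childB[i] = (i < point) ? parentB[i] : parentA[i];
        }
        return new int[][] {childA, childB};
    }

    public static void main(String[] args) {
        int[] parentA = {0, 0, 2, 1, 1, 2, 2}; // VM 1..7 mapped to PM indices
        int[] parentB = {0, 0, 1, 1, 2, 2, 2};
        int[][] children = crossover(parentA, parentB);
        System.out.println(Arrays.toString(children[0]));
        System.out.println(Arrays.toString(children[1]));
        // a fitness function must still check that each child respects the PM capacities
    }
}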

Figure 3.2: Crossover example

Compared with other approaches, tabu search introduces the concept of taboo moves. In the search space of our problem, states can be either valid or invalid. An invalid state is one where the total size of the virtual machines packed inside a physical machine exceeds the total capacity of that physical machine. When a solution is reached, corresponding to a final state, the system needs to start from the initial state and go through all the intermediate states visited on the way to the final state. If any of these intermediate states is invalid, then the solution found is invalid even if the final state is valid. None of the previous techniques has a special mechanism that guarantees that such a situation can never take place. However, using tabu search, this can be achieved by defining a move from a valid state to an invalid state as a taboo move.

All these heuristics have their advantages and disadvantages. The theoretical analysis of these approaches alone cannot provide a guarantee with respect to the performance of a controller implementing such a heuristic. However, there are also no strong arguments against using any one of them. It is also possible to combine these local search heuristics into one special heuristic that fits our problem best.

3.3.4 Approximation Algorithms

Nowadays, approximation algorithms exist for many NP-hard optimization problems. However, such algorithms exist for a specific family or class of problems, not necessarily for a specific instance of a problem. Knapsack problems, too, are known to have approximation algorithms. We have previously shown that our problem belongs to the knapsack family of problems. Thus, finding an approximation algorithm for our resource allocation problem can be done by reducing our problem to a specific knapsack problem which has an approximation algorithm.

For our problem, when only one resource is taken into consideration (e.g. memory), and only one physical machine is available in the cluster, then we are dealing with the original 0-1 knapsack problem. Furthermore, when more resources are taken into consideration (e.g. CPU, memory, network and I/O), we are dealing with a multidimensional knapsack problem. When the resources are provided by several identical physical machines, we are dealing with a multidimensional bin packing (or vector packing) problem. Finally, in the most general case, when the physical machines need not be identical, our problem is a multidimensional multiple knapsack problem (MMKP).

Although mostly similar, our problem has some properties that differentiate it from the MMKP. Treating our resource allocation problem like an MMKP would mean that resources are provisioned offline. However, our problem deals with dynamic resource allocation. In knapsack terminology, this means that the items are already packed in knapsacks when one dimension of one of the items changes. In this case, taking the second controller requirement into consideration, only a limited number of items may be moved between knapsacks. An algorithm that performs under such a constraint is called a semi-online algorithm [EGCGJ97].

Various approximation algorithms exist for knapsack problems. In [EGCGJ97], the authors present a survey of approximation algorithms for the one-dimensional bin packing problem. Here, several semi-online algorithms are mentioned that offer a guaranteed performance ratio smaller than 1.5. Approximation algorithms for multidimensional bin packing problems are described in [BCS06] and [CKP01]. Finding an approximation algorithm for our problem thus means choosing, from the existing approximation algorithms for knapsack problems, the one which fits our resource allocation problem best without violating any (or too many) constraints.
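As an illustration of the bin packing view of the problem, the following Java sketch shows a plain first-fit heuristic for the multidimensional (vector) packing variant; it is a generic textbook heuristic, not one of the semi-online algorithms surveyed in [EGCGJ97], and the two resource dimensions used in the example are arbitrary.

import java.util.ArrayList;
import java.util.List;

/**
 * First-fit vector packing: place each item (VM) into the first bin (PM) that still has
 * enough capacity in every resource dimension. A plain offline heuristic for illustration.
 */
public class FirstFitVectorPacking {

    /** Returns for each item the index of the bin it was placed in, or -1 if it did not fit. */
    public static int[] pack(double[][] items, double[][] binCapacities) {
        int dims = binCapacities[0].length;
        double[][] used = new double[binCapacities.length][dims];
        int[] assignment = new int[items.length];

        for (int i = 0; i < items.length; i++) {
            assignment[i] = -1;
            for (int b = 0; b < binCapacities.length && assignment[i] == -1; b++) {
                boolean fits = true;
                for (int d = 0; d < dims; d++) {
                    if (used[b][d] + items[i][d] > binCapacities[b][d]) {
                        fits = false;
                        break;
                    }
                }
                if (fits) {
                    for (int d = 0; d < dims; d++) {
                        used[b][d] += items[i][d];
                    }
                    assignment[i] = b;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] vms = {{0.5, 1024}, {0.3, 512}, {0.4, 2048}};   // (CPU share, memory MB)
        double[][] pms = {{1.0, 2048}, {1.0, 4096}};               // PM capacities
        int[] mapping = pack(vms, pms);
        for (int i = 0; i < mapping.length; i++) {
            System.out.println("VM " + i + " -> PM " + mapping[i]);
        }
    }
}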

3.4 A Bottleneck Determination Mechanism

In our model, a single service is hosted inside a virtual machine. The first framework requirement states that the framework should provide support for managing the services inside virtual machines. However, the controller allocates resources based on requirements coming from the virtual machines through the management framework. It is then the management framework that is responsible for translating the behaviour of the service into resource requirements, when this is the case. To achieve this, the framework needs to provide what we refer to as a bottleneck determination mechanism.

The job of the bottleneck determination mechanism is to monitor both the service and the virtual machine on which the service is provisioned. In case a service shows a bad behaviour, the mechanism must be able to determine which resource is the cause of this behaviour. Note that, as described in Section 3.2.1, we address only single resources. Other causes for such a behaviour, like software errors, are not treated by the bottleneck determination mechanism. When a resource is found to be the bottleneck, the management framework lets the controller know that more of that specific resource is required by a certain virtual machine.

The bottleneck determination mechanism behaves just like a human expert (e.g. a system administrator): it first observes the problem and then uses its knowledge to find the resource causing the bottleneck. Based on this diagnosis, the bottleneck determination mechanism finds or suggests a solution. This is the typical behaviour of expert systems, indicating that an expert system approach could be adequate for such a mechanism.

3.4.1 The Expert Systems Approach

In computer science, the term expert system is used to describe a system capable of performing classifications and diagnostics, as well as making decisions. Thus, an expert system is a program capable of performing the job of a human expert. Like humans, expert systems can be experts for a specific domain. For example, an expert system for medical diagnostics can assist a doctor in the diagnosis of a patient. To be able to perform such actions, an expert system must maintain a knowledge base focused on that specific domain. To create such a knowledge base for an expert system, a process called knowledge acquisition is performed. Acquiring knowledge basically means transferring expertise from a knowledge source (for example a human expert or a book) to a program. Furthermore, this knowledge needs to be encoded in the knowledge base, in a format that is machine readable and can be associated with the way humans store knowledge (from a logical perspective). This is known as knowledge representation. Depending on the approach of the expert system, there are several ways to represent knowledge.

Besides a knowledge base, an expert system must also have a reasoning mechanism. Confronted with a specific problem, the inference engine of the expert system uses the knowledge base to determine a solution. This mode of operation is pictured in Figure 3.3.



Figure 3.3: Operation mode of an expert system

Over the years, many approaches to expert systems have been developed [Jac98]. In the following, two expert systems techniques are described, showing how each of them could be used for the bottleneck determination mechanism. These techniques have been selected, since they seem to be the most adequate for our purposes.

3.4.1.1 Rule-Based Systems

Rule-based systems are a traditional approach to expert systems. A rule-based system consists of:

• a set of rules, also known as the production memory

• a rule interpreter

• a working memory

The knowledge base is represented in the form of a set of rules. Each rule has a set of facts and a set of actions. The working memory represents the central data structure and contains data about the current state. It is the data in the working memory that triggers the rules, with the rule interpreter being responsible for the selection and activation of rules. The rule interpreter tries to match the facts of a rule with the data in the working memory. When all the facts of a rule are matched, the rule interpreter fires the actions associated with these facts. In case more than one rule fires, the rule interpreter must choose one of them to apply.

Rule-based systems are a possible approach to our bottleneck determination mechanism. The main challenge of using this approach is acquiring and encoding the knowledge of a human expert (system administrator) in form of a set of rules.
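To make the idea concrete, the following Java sketch implements a tiny production system in which rules are condition/action pairs over a working memory of monitoring values; the thresholds and fact names are invented for the example, and a real interpreter would additionally perform conflict resolution.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** A tiny production system: the interpreter fires every rule whose condition matches the working memory. */
public class TinyRuleEngine {

    /** The working memory: named monitoring values (facts). */
    private final Map<String, Double> workingMemory = new HashMap<>();
    /** The production memory: a list of rules. */
    private final List<Rule> rules = new ArrayList<>();

    record Rule(String name, Predicate<Map<String, Double>> condition, Consumer<Map<String, Double>> action) {}

    public void assertFact(String name, double value) {
        workingMemory.put(name, value);
    }

    public void addRule(Rule rule) {
        rules.add(rule);
    }

    /** The rule interpreter: match all rules and fire the matching ones (no conflict resolution here). */
    public void run() {
        for (Rule rule : rules) {
            if (rule.condition().test(workingMemory)) {
                rule.action().accept(workingMemory);
            }
        }
    }

    public static void main(String[] args) {
        TinyRuleEngine engine = new TinyRuleEngine();
        // Example rule: flag the CPU as bottleneck when the response time exceeds the SLA
        // threshold and the CPU utilization of the VM is high (values made up for the example).
        engine.addRule(new Rule("cpu-bottleneck",
                wm -> wm.getOrDefault("responseTimeMs", 0.0) > 500
                        && wm.getOrDefault("cpuUtilization", 0.0) > 0.9,
                wm -> System.out.println("Bottleneck detected: CPU")));

        engine.assertFact("responseTimeMs", 750);
        engine.assertFact("cpuUtilization", 0.95);
        engine.run();
    }
}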


3.4.1.2 Case-Based Reasoning

Case-based reasoning is a relatively new approach (early 1980’s) to expert systems. Like other approaches, case-based reasoning tries to imitate the way human experts solve problems. However, unlike the other approaches, case-based reasoning is based on the observation that a human expert solves a problem by recalling a similar problem that he or she solved in the past and applying a similar solution to the current problem. As such, the assumption that lies at the core of case-based reasoning is that similar problems have similar solutions.

An expert system that uses case-based reasoning usually performs the following steps:

• search the knowledge base for a case which documents a solution for a problem similar to the current problem

• adapt this solution to the current problem

• apply the adapted solution to the current problem

• store the current problem and its solution as a new case in the knowledge base

One of the advantages of this approach is that the knowledge representation in the form of cases is easy to implement and has proven to be efficient. Case-based reasoning systems are able to determine a solution relatively fast, which means that they can be used to resolve on-line problems. However, the generalization that similar problems have similar solutions has always been criticized and can lead to erroneous solutions.

Like rule-based systems, case-based reasoning is a possible approach for our problem. However, acquiring knowledge from a human expert and encoding it in form of cases is more difficult than encoding this knowledge in form of rules.
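A minimal sketch of the retrieval step is shown below, under the assumption that cases are stored as vectors of monitoring values and that similarity is measured by Euclidean distance; the adaptation and retention steps of the case-based reasoning cycle are omitted.

import java.util.List;

/** Nearest-neighbour retrieval: find the stored case most similar to the current problem. */
public class CaseRetrieval {

    record Case(double[] features, String diagnosedBottleneck) {}

    public static Case mostSimilar(double[] problem, List<Case> caseBase) {
        Case best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Case stored : caseBase) {
            double distance = 0;
            for (int i = 0; i < problem.length; i++) {
                double diff = problem[i] - stored.features()[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = stored;
            }
        }
        return best; // the solution of this case would then be adapted and reused
    }

    public static void main(String[] args) {
        List<Case> caseBase = List.of(
                new Case(new double[] {0.95, 0.30}, "CPU"),     // (cpu util, mem util)
                new Case(new double[] {0.40, 0.97}, "memory"));
        Case match = mostSimilar(new double[] {0.90, 0.35}, caseBase);
        System.out.println("Suggested bottleneck: " + match.diagnosedBottleneck());
    }
}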

3.4.2 The Time Series Analysis Approach

Time series represent a sequence of observations taken over a time interval. An example of a time series is available in Figure 3.4. In many cases, adjacent observations are dependent. Time series analysis deals with techniques for analyzing these dependencies.

Figure 3.4: Time series example

According to [BJR94], time series analysis has four important areas of application:

1. forecasting future values using current and past values (observations)

2. determining the output of a system given a specific input; this is known as determining the transfer function of a system

3. determining the effects of external events, called interventional events, on the behaviour of a time series

4. designing discrete control systems, which use so-called control schemes to adjust the input of the system in order to obtain a desired output

Time series analysis has its applications in economics, social sciences, geophysics, meteorology, engineering and other fields. For our bottleneck determination problem, time series analysis could be used to forecast future bottlenecks and, together with an expert systems technique, to determine the cause of such a future bottleneck. Alternatively, time series analysis could also be used to design a discrete control system capable of determining the bottleneck on its own, without the use of an expert system.
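As a small illustration of the forecasting use case, the following Java sketch applies simple exponential smoothing to a load time series; this is a generic technique chosen for brevity, not the forecasting model used in the works cited above.

/** Simple exponential smoothing: forecast(t+1) = alpha * observation(t) + (1 - alpha) * forecast(t). */
public class ExponentialSmoothing {

    public static double[] forecast(double[] observations, double alpha) {
        double[] forecasts = new double[observations.length + 1];
        forecasts[0] = observations[0]; // initialize with the first observation
        for (int t = 0; t < observations.length; t++) {
            forecasts[t + 1] = alpha * observations[t] + (1 - alpha) * forecasts[t];
        }
        return forecasts; // the last entry is the one-step-ahead forecast
    }

    public static void main(String[] args) {
        double[] load = {0.20, 0.25, 0.40, 0.55, 0.60, 0.80};
        double[] forecasts = forecast(load, 0.5);
        System.out.printf("Forecast for the next interval: %.2f%n", forecasts[forecasts.length - 1]);
    }
}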

3.5 The Overprovisioning Problem

Up to this point of the analysis, we have only considered the problem of virtual machines requiring additional resources. However, in some cases, virtual machines could be overprovisioned. This means that more resources have been provisioned to a virtual machine than the virtual machine requires. In such cases, a mechanism is required to release these resources.


In the previous section, an analysis of a bottleneck determination mechanism was provided. For the overprovisioning problem, a similar mechanism can be used. Both the service and the virtual machine parameters should be monitored to determine when the virtual machine is overprovisioned, and both expert systems and time series analysis, presented in the previous section, are feasible approaches for implementing a mechanism capable of detecting an overprovisioning.

The controller approach for this problem is trivial. When a virtual machine releases resources, these resources are assigned to the physical machine hosting the respective virtual machine. In some cases, this could lead to overprovisioned physical machines, that is, physical machines with a low resource utilization. However, such situations are compensated as soon as the controller needs to satisfy resource requirements coming from virtual machines, since in such cases the controller aims to optimize the resource utilization of the entire cluster. Such an optimization also means that the controller tries to achieve a configuration that minimizes the energy consumption in the data center, i.e. reduces the number of online physical machines.


Chapter 4

Design

In the previous chapter, the goal of this thesis has been described, a set of requirements has been defined and various approaches to fulfilling these requirements have been proposed and analyzed. In this chapter, the design resulting from this analysis is presented. The first section describes the design of the framework. Section 4.2 describes the design of the bottleneck determination mechanism, while Section 4.3 is dedicated to the design of the various controller approaches.

4.1 Architecture of the Framework

4.1.1 The basic concept

The framework created as part of this thesis is targeting the self-management of virtual machine-based environments. As such, the autonomic computing paradigm must be taken into consideration when designing the framework.

As described in Section 2.3, the autonomic computing paradigm states that autonomic systems represent a collection of autonomic elements. In this context, the autonomic element is the basic component of the system. In section 3.2, a formal model of the managed infrastructure was defined. In this model, the system is composed of four types of entities: the service, the virtual machine, the physical machine and the cluster itself. Since in our model only one service is hosted on a virtual machine, the service and the virtual machine are treated together. As a result, the model of the system comprises three different types of entities.


Applying the autonomic computing paradigm to this model means that these entities must appear in the system as autonomic elements. To achieve this, each of these elements requires an autonomic manager. As such, the following managed element - autonomic manager pairs are defined:

• virtual machine - VM Manager

• physical machine - Physical Manager

• cluster - Cluster Manager

The VM Manager manages both the virtual machine and the service hosted on it. Each virtual machine has exactly one VM Manager. The Physical Manager is responsible for a physical machine hosting virtual machines. While in our model a physical machine can host several virtual machines, each physical machine has exactly one Physical Manager. The cluster entity is a singleton, and thus exactly one Cluster Manager exists in the system, responsible for the cluster itself. Together with their managed elements, the autonomic managers form the autonomic elements of the system.

In Section 3.2, five framework requirements have been defined. At this stage of the design, two of them are addressed. The second framework requirement states that the controller needs to be separated from the management framework. We address this requirement by creating a hierarchical model, which is somewhat similar to the model proposed by Ruth et al. [RRX+06] and described in Section 3.1. This model is shown in Figure 4.1.

Each virtual machine has a VM Manager. Each physical machine has a Physical Manager. The VM Manager monitors a virtual machine and the service running on this virtual machine. When the VM Manager determines that the virtual machine it manages requires more resources, it sends a message to the Cluster Manager, asking for the necessary resources. Each Physical Manager monitors the resource utilization of the physical machine it is assigned to. All Physical Managers forward this information to the Cluster Manager. Based on this data, the Cluster Manager computes a global resource view. This global resource view is then used by the Cluster Manager to satisfy the requirements of the VM Managers. Whenever a request arrives, the Cluster Manager consults the global resource view and determines how the resources can be allocated. The Cluster Manager then defines reallocation tasks, which are forwarded to and executed by the Physical Managers.

Figure 4.1: The basic concept

Through this framework design, we have addressed but not completely fulfilled the second framework requirement. The VM Managers and Physical Managers are responsible for monitoring and executing tasks, while the Cluster Manager only takes resource allocation decisions. Thus, the Cluster Manager wraps the controller and is separated from the rest of the management framework. It is obvious that this design is scalable with respect to the number of physical and virtual machines, within the data center limits described in Section 3.3.2. From an architectural point of view, the framework supports adding any type of these machines at run time. This fulfills the fifth framework requirement.

4.1.2 The Physical Manager

Among the autonomic managers of the framework, the Physical Manager is the simplest one. From an architectural point of view, it comprises only two modules: the MONITOR1 and the ACTUATOR module. This architecture is shown in Figure 4.2. The Physical Manager has two responsibilities. First, it monitors the resource utilization levels of the physical machine it was assigned to. The resources subject to monitoring are memory, CPU, network and I/O bandwidth and respect the formal model defined in Section 3.2.1. The data is then forwarded to the Cluster Manager. The monitoring is performed by the Monitor module.

1module names that were not introduced are written using SMALL CAPITALS

Figure 4.2: The Physical Manager

The second responsibility of the Physical Manager is the execution of tasks coming from the Cluster Manager. The framework has a generic character, and thus there are various tasks which can be assigned to the Physical Manager by the Cluster Manager. For example, the Physical Manager can be commanded to allocate additional memory to a certain virtual machine, increase the CPU share of a virtual machine or even migrate a virtual machine to another physical machine. Inside the Physical Manager, the execution of tasks is performed by the Actuator module.

Both the monitoring and the execution of tasks are performed by the Physical Manager on its associated physical machine through the VMM. This is because, as shown in Section 2.1, in a virtualized physical machine the VMM and not the OS controls the access to resources. It is therefore the VMM where resource utilization levels can be monitored and resource allocation tasks can be executed.

As can be observed in Figure 4.2, the Physical Manager has no logic component, i.e. while a common autonomic manager, as shown in Section 2.3, performs a so-called MAPE (Monitor, Analyze, Plan, Execute) loop, the Physical Manager only performs the Monitor and Execute phases. In our design, the Analyze and Plan phases are simply offloaded to the Cluster Manager. Furthermore, this design provides support for different virtualization technologies. The Monitor and Actuator modules can be implemented to work with a certain VMM (e.g. Xen), and then easily adapted to work with another VMM (e.g. VMware ESX), without changing the way the framework works. This way, our framework is virtualization technology agnostic and thus fulfills the third framework requirement.
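The technology-agnostic split can be expressed as two small interfaces whose implementations wrap a concrete VMM; the interface and method names below are illustrative assumptions, not the actual module API of the framework.

/** Monitoring side of a Physical Manager: reads resource utilization through the VMM. */
interface PhysicalMonitor {
    /** Returns the utilization of the physical machine as (cpu, mem, net, io) fractions. */
    double[] readUtilization();
}

/** Actuating side of a Physical Manager: executes reallocation tasks through the VMM. */
interface PhysicalActuator {
    void setMemory(String vmName, long megabytes);
    void setCpuShare(String vmName, int weight);
    void migrate(String vmName, String targetHost);
}

/** A Xen-flavoured stub; a VMware ESX variant would implement the same interfaces. */
class XenMonitorStub implements PhysicalMonitor {
    @Override
    public double[] readUtilization() {
        // a real adapter would query the Xen management interface here
        return new double[] {0.0, 0.0, 0.0, 0.0};
    }
}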

Figure 4.3: The VM Manager

4.1.3 The VM Manager

From an architectural point of view, as can be observed in Figure 4.3, the VM Manager is the most complex component among the autonomic managers. It comprises four modules: the SL MONITOR, the VM MONITOR, the VM LOGIC and the ACTUATOR module. As the figure indicates, the VM Logic is the central module of the VM Manager. Inside the VM Manager, all modules communicate with and only with the VM Logic.

At this design stage, we address the first framework requirement: support for service level management. Through its SL Monitor module, the VM Manager is capable of monitoring the service hosted inside the virtual machine. By monitoring certain QoS parameters, the VM Manager is capable of determining if the quality of the service complies with a stipulated SLA.

The main responsibility of the VM Manager is to determine if the service inside the virtual machine is working at the required parameters, and if this is not the case to determine what is causing the bottleneck. The SL Monitor module monitors service parameters like response time and throughput. It then forwards these parameters to the VM Logic module. The VM Monitor module monitors the resource utilization of the virtual machine. The parameters monitored by the VM Monitor are memory, CPU, network and I/O bandwidth, and respect the formal model defined in Section 3.2.1. Like the SL Monitor, the VM Monitor forwards these parameters to the VM Logic.

Based on the monitoring data gathered from the two monitoring modules, the VM Logic is capable of determining: (a) if the service is working well and (b) the resource that is causing the bottleneck when the service is not working well. When the service is not providing the required quality, specified by means of an SLA, the VM Manager must, upon determining the resource causing the bottleneck, request more of the bottlenecked resource from the Cluster Manager. This is done by the Actuator module: when a resource causing a bottleneck has been determined, the VM Logic sends the required additional amount of the resource to the Actuator module, which forwards it in the form of a request to the Cluster Manager. Having only one resource as a possible bottleneck respects the formal model defined in Section 3.2.1.

The VM Manager performs a complete MAPE loop. The Monitor phase is performed by the VM Monitor and SL Monitor modules. The VM Logic module performs the Analyze and Plan phases. To be able to perform these phases, an implementation of the VM Logic must maintain a specific knowledge base. Finally, the Execute phase is performed by the Actuator module.

Figure 4.4: The Cluster Manager

4.1.4 The Cluster Manager

In the hierarchy of the framework, the Cluster Manager occupies the highest position. This autonomic manager addresses the self-optimization of the cluster, and communicates with all VM Managers and Physical Managers. It comprises three modules: the MONITOR, the CLUSTER LOGIC and the ACTUATOR module. This architecture is shown in Figure 4.4. The Monitor module receives data from both VM Managers and Physical Managers. This data is preprocessed and forwarded to the Cluster Logic. The Cluster Manager performs two main tasks:

1. computes a global view of the resources in the cluster


2. satisfies the requirements of the VM Managers

These tasks are addressed by the Cluster Logic, based on the data received via the Monitor module. Thus, the Cluster Logic actually represents the intelligent controller we mentioned throughout Chapter 3. Decisions taken by the Cluster Logic are forwarded to the Physical Managers through the Actuator module.

This architectural approach addresses the second and fourth framework requirements. The Cluster Logic represents the controller and is separated from the monitoring and task execution functionalities of the framework. Various controllers can be easily implemented as a Cluster Logic module and plugged into the framework. This fulfills the second framework requirement. Furthermore, it is possible to use the same infrastructure and management framework with various controllers. This represents the foundation of a common evaluation mechanism and thus fulfils the fourth framework requirement.

Computing a global view of the resources in the cluster is done based on data received from the Physical Managers. Each Physical Manager monitors the resource utilization levels of the physical machine to which it is assigned. This local resource view is then forwarded to the Cluster Manager by all Physical Managers. From this multitude of local resource views, the Cluster Manager computes its global resource view. The global resource view is actually defined by the vectors $\vec{a}_i$ and the relation $g$, as in Section 3.2.1. As such, the global resource view contains data like which VMs are provisioned on each physical machine, how much memory is provisioned on each VM, what the CPU share of each VM is, how much memory is still available on each physical machine and what the CPU utilization of each physical machine is.
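One possible way to aggregate the local views is sketched below, assuming that each Physical Manager reports its identifier, its hosted VMs and its remaining capacity per resource; the data structures are illustrative only.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Aggregates the local views reported by the Physical Managers into one global resource view. */
public class GlobalResourceViewBuilder {

    /** One report from a Physical Manager: hosted VMs and remaining capacity per resource. */
    record LocalView(String physicalMachine, List<String> hostedVms, Map<String, Double> freeCapacity) {}

    private final Map<String, LocalView> latestViews = new HashMap<>();

    /** Called whenever a Physical Manager forwards a fresh local view. */
    public void update(LocalView view) {
        latestViews.put(view.physicalMachine(), view);
    }

    /** Finds the physical machine hosting a given VM, or null if it is unknown. */
    public String hostOf(String vmName) {
        for (LocalView view : latestViews.values()) {
            if (view.hostedVms().contains(vmName)) {
                return view.physicalMachine();
            }
        }
        return null;
    }

    /** Checks whether a physical machine can still provide the requested amount of a resource. */
    public boolean canProvide(String physicalMachine, String resource, double amount) {
        LocalView view = latestViews.get(physicalMachine);
        return view != null && view.freeCapacity().getOrDefault(resource, 0.0) >= amount;
    }
}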

Satisfying the requirements of the VM Managers represents the most difficult task of the Cluster Logic. When a requirement from a VM Manager is received, the Cluster Logic consults its global resource view, and based on it takes a resource allocation decision. This decision is then forwarded to the Actuator module.

The job of the Actuator module is to translate a Cluster Logic decision into tasks for the Physical Managers. Simple decisions like “increase the CPU share of VM i by x” are simply forwarded to the appropriate Physical Manager, in this case the Physical Manager of the machine on which VM i is provisioned. However, the Cluster Logic can also take complex decisions like: “migrate VM i from physical machine t to physical machine p, migrate VM j from physical machine q to physical machine t, allocate an additional 100MB of memory to VM j”. When such a decision is taken, the Actuator module must delegate tasks to various Physical Managers. For our example, the Actuator must command the Physical Manager of t to migrate VM i to p, then command the Physical Manager of q to migrate VM j to t, and then again the Physical Manager of t to allocate an additional 100MB of memory to VM j. Thus, the Actuator must coordinate the delegation of tasks among several Physical Managers.

We have previously mentioned that the Analyze and Plan phases of the MAPE loop corresponding to the Physical Manager have been offloaded to the Cluster Manager. The necessity of this design decision is obvious: the Cluster Manager is the only one that disposes of a global resource view, while the Physical Managers each dispose of a local resource view. The framework is not aimed at optimizing single physical machines, but the cluster as a whole. As such, the Cluster Manager performs a complete MAPE loop. The Monitor phase is performed by the Monitor module, the Analyze and Plan phases are performed by the Cluster Logic module and the Execute phase is performed by the Actuator module.

4.1.5 The Complete Architecture

Section 4.1.1 described the managed infrastructure, together with the management backbone of the framework. Afterwards, in Sections 4.1.2, 4.1.3 and 4.1.4, the architecture of each of the autonomic managers was described in detail. Putting all these pieces together, the complete architecture of the framework is obtained. This is shown in Figure 4.5.

Since each entity in the system has an autonomic manager, our system is composed only of autonomic components. Compared with other architectural approaches to autonomic computing systems, our architecture most closely resembles the approach proposed by White et al. in [WHW+04]. Furthermore, the internal architecture of each autonomic element conforms to the IBM vision [KC03] of how autonomic computing systems operate.

This complete architecture fulfils all five framework requirements defined in Section 3.2, as shown in Sections 4.1.1, 4.1.2, 4.1.3 and 4.1.4.

Figure 4.5: The complete architecture of the framework

4.1.6 The Intra-Manager Communication

In the previous subsections, the communication between the modules of each autonomic manager was described. This is called the intra-manager communication.

The intra-manager communication is based on a modular Self-Manager, developed in a project outside of this thesis in the Distributed Systems Lab of the University of Applied Sciences Wiesbaden [Sch07, SG06]. The architecture of the Self-Manager can be observed in Figure 4.6. Internally, the Self-Manager consists of four main components: the central element of the Self-Manager is the management kernel, which provides adapters to connect extension modules. Extension modules are instantiated by the central module manager. The modules implement sensors, effectors, and the internal logic of the self-management controller. A messaging subsystem, which is part of the kernel, is responsible for message handling within the Self-Manager.

In order to use the communication backbone offered by the Self-Manager, each module of the framework must be implemented as a Self-Manager module. This has serious implications on the implementation of the framework. Because the Self-Manager is written in Java and the communication between modules takes place over the Java Messaging Service2 (JMS), the framework must be implemented in Java.

2http://java.sun.com/products/jms/

Figure 4.6: Modular Self-Manager architecture

4.1.7 The Inter-Manager Communication

To understand how all the pieces of the framework work together, in the following paragraphs the intra-manager communication is presented together with the inter-manager communication, on a dynamic resource allocation example. By inter-manager communication we understand the communication between autonomic managers.

Figure 4.7 shows the sequence diagram corresponding to an exemplary dynamic resource allocation. The SL Monitor and VM Monitor modules of the VM Manager monitor the parameters of the service and the virtual machine, respectively. This data is fed to the VM Logic, which then checks if the QoS parameters of the service are SLA-conform, and if not, determines the resource causing the bottleneck. The results of the diagnosis are forwarded to the Actuator module of the VM Manager, which then issues a request for additional resources. This request is received by the Monitor module of the Cluster Manager. This data is then fed to the Cluster Logic. The Cluster Logic decides about the resource reallocation and sends its decision to the Actuator module of the Cluster Manager. The Actuator module of the Cluster Manager then delegates this task to the involved Physical Manager. The Physical Manager receives the command via its Actuator module and performs the reallocation via the VMM.

Note that this sequence diagram is meant only to exemplify the inter- and intra-manager communication. Certain details like updates from Physical Managers and delegation of tasks to several Physical Managers are not shown in Figure 4.7.



Figure 4.7: Exemplary inter-manager and intra-manager communication

4.2 The Bottleneck Determination Mechanism

In Section 3.4 it was argued that, in order to provide the abstraction level required by the controller, the framework needs to comprise a bottleneck determination mechanism. As can be observed from the architecture of the framework presented in the previous section, this functionality is provided by the VM Logic module of a VM Manager.

There are various approaches to implementing the bottleneck determination mechanism. These approaches were described in Section 3.4. Among them, the rule-based system approach appears to be the most attractive one to start with. This is because rule-based systems are known to be a feasible, easy to implement approach in the context of autonomic computing.

In the previous section, we made the design decision of implementing the entire framework in Java. This means that the VM Logic module, like all the modules of the framework, must be written in Java. The VM Logic module hosts the bottleneck determination mechanism. Ideally, the rule engine implementing the bottleneck determination mechanism should support integration into a Java program. Fortunately, such rule engines exist, and the Java integration is defined by JSR 943.

The internal architecture of the VM Logic module using a JSR 94 rule engine is shown in Figure 4.8. The VM Logic has a communication interface, over which the communication with the SL Monitor, VM Monitor and Actuator modules of the VM Manager takes place. The center of the VM Logic component is represented by the rule engine. The rule engine uses the communication interface to receive data from the SL Monitor and VM Monitor. This data is fed to the rule engine via the standardized JSR 94 interface. The SLAs defined for the service hosted inside the virtual machine are translated into rules for the rule engine. The rule engine is used for the bottleneck determination. When a bottleneck is determined, this information is sent to the Actuator module via the communication interface.

3http://jcp.org/en/jsr/detail?id=94

Figure 4.8: The internal architecture of the VM Logic module
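The following fragment sketches how the monitoring facts could be pushed into a JSR 94 compliant rule engine from within the VM Logic. It assumes that a rule execution set derived from the SLAs has already been registered under RULE_SET_URI, that a provider implementation is available under PROVIDER_URI, and it omits error handling; both URIs are placeholders.

import java.util.List;
import javax.rules.RuleRuntime;
import javax.rules.RuleServiceProvider;
import javax.rules.RuleServiceProviderManager;
import javax.rules.StatelessRuleSession;

/** Sketch: feed monitoring facts into a JSR 94 rule session and collect the derived bottlenecks. */
public class BottleneckRules {

    // Both URIs are assumptions for the example; they depend on the rule engine used.
    private static final String PROVIDER_URI = "urn:example:rule-service-provider";
    private static final String RULE_SET_URI = "urn:example:bottleneck-rules";

    @SuppressWarnings("unchecked")
    public List<Object> evaluate(Object slMeasurement, Object vmMeasurement) throws Exception {
        RuleServiceProvider provider =
                RuleServiceProviderManager.getRuleServiceProvider(PROVIDER_URI);
        RuleRuntime runtime = provider.getRuleRuntime();

        StatelessRuleSession session = (StatelessRuleSession) runtime.createRuleSession(
                RULE_SET_URI, null, RuleRuntime.STATELESS_SESSION_TYPE);
        try {
            // The rules derived from the SLAs match on the fact objects and
            // add bottleneck descriptions to the returned result list.
            return session.executeRules(List.of(slMeasurement, vmMeasurement));
        } finally {
            session.release();
        }
    }
}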

4.3 The Controller

As described in Section 4.1, the controller functionality of the framework is implemented by the Cluster Logic module of the Cluster Manager. The internal architecture of the Cluster Logic is shown in Figure 4.9.


Figure 4.9: The internal architecture of the Cluster Logic module


The Communication Interface handles the communication with the Monitor and Actuator modules of the Cluster Manager. When the Monitor module forwards monitoring data coming from the Physical Managers, the Communication Interface updates this data in the Global Resource View. When a VM Manager sends a request for additional resources, again via the Monitor module of the Cluster Manager, the Communication Interface forwards this request to the controller Algorithm. The Algorithm uses the Global Resource View to make a resource allocation decision, and sends this decision via the Communication Interface to the Actuator module of the Cluster Manager, which then delegates tasks to the respective Physical Managers.

Obviously, the central component of the controller is the algorithm. We have designed two such algorithms: Algorithm A and Algorithm B. These algorithms are presented in the following subsections.

4.3.1 Algorithm A

Algorithm A is the simpler of the two and is quite similar to the greedy best-first technique [RN04]. Whenever a VM Manager requests additional resources for its corresponding virtual machine, Algorithm A uses the global resource view to check whether these resources are locally available, i.e. on the physical machine hosting this virtual machine. If this is the case, then the requested resources are locally assigned to the virtual machine.

If the requested resources are not locally available, then Algorithm A checks for physical machines in the cluster which can satisfy the resource requirements of the requesting virtual machine. These requirements represent the sum of the resources currently provisioned to the virtual machine and the additional resources requested by it. The pseudocode of this algorithm is shown in Algorithm 1.

From the list of physical machines that can provision the virtual machine, the physical machine with the highest resource utilization is chosen. Note that this list of remote machines providing sufficient resources can contain both offline and online machines. Choosing the remote machine with the highest resource utilization ensures that offline machines are powered on only if no online machine can provide the required resources. This strategy aims at increasing the energy efficiency of the cluster.

Using a different heuristic when choosing the remote physical machine to migrate to, such as choosing the physical machine with the lowest resource utilization, has one big disadvantage. Consider requests coming from two different virtual machines, both of which require a migration to a remote physical machine, processed sequentially.


Input: A VM that requires more resources
Output: New mapping
1  if resources available locally then
2      assign local resources;
3      return new remapping;
4  end
5  else
6      if remote machines exist with the required resources available then
7          migrate to the machine with the highest resource utilization;
8          return new remapping;
9      end
10     else
11         return no solution found;
12     end
13 end

Algorithm 1: Algorithm A

The first virtual machine requires a relatively small amount of resources and is provisioned on the physical machine with the lowest resource utilization, although physical machines with a higher resource utilization could also provision it. When the requirements of the second virtual machine must be fulfilled, Algorithm A finds that no remote physical machine has enough free resources to provision this second virtual machine. However, on the physical machine where the first virtual machine was provisioned, these resources were available before the first virtual machine was placed there. Thus, it would have been possible to satisfy the requirements of the second virtual machine had the first virtual machine been provisioned on a physical machine with a higher resource utilization. This clearly speaks for the heuristic used in Algorithm A and against an alternative heuristic where lightly utilized physical machines are preferred over highly utilized ones.
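The following Java sketch illustrates the selection heuristic of Algorithm A under simplifying assumptions: only memory is considered (as in the prototype implementation described in Chapter 5), and the PhysicalMachine and VirtualMachine types are hypothetical stand-ins for the global resource view.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class AlgorithmASketch {

    static class VirtualMachine {
        String name;
        long allocatedMemory;   // memory currently provisioned to the VM
        String host;            // name of the hosting physical machine
    }

    static class PhysicalMachine {
        String name;
        long totalMemory;
        long usedMemory;        // 0 for an offline machine
        boolean online;

        long freeMemory() { return totalMemory - usedMemory; }
        double utilization() { return (double) usedMemory / totalMemory; }
    }

    /** Returns the chosen target machine, or empty if no solution exists. */
    Optional<PhysicalMachine> handleRequest(VirtualMachine vm, long requestedMemory,
                                            List<PhysicalMachine> cluster) {
        PhysicalMachine local = cluster.stream()
                .filter(pm -> pm.name.equals(vm.host))
                .findFirst()
                .orElseThrow(IllegalStateException::new);

        // 1. Prefer a purely local reallocation if the host has spare capacity.
        if (local.freeMemory() >= requestedMemory) {
            return Optional.of(local);
        }

        // 2. Otherwise pick the remote machine with the highest utilization that
        //    can host the VM plus the additional request; offline machines have
        //    utilization 0 and are therefore chosen (and powered on) last.
        long required = vm.allocatedMemory + requestedMemory;
        return cluster.stream()
                .filter(pm -> !pm.name.equals(vm.host))
                .filter(pm -> pm.freeMemory() >= required)
                .max(Comparator.comparingDouble(PhysicalMachine::utilization));
    }
}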

However, Algorithm A has two limitations. First, since it is a heuristic-based algorithm, it has no guaranteed performance ratio. For our case this means that, although Algorithm A aims at optimizing the allocation of virtual machines to physical machines with the final purpose of reducing the number of necessary online physical machines, one can never say how far its solution is from the optimal allocation. The second limitation of Algorithm A is that it is not capable of performing complex virtual machine to physical machine reallocations in the cluster. This limitation is exemplified in Figure 4.10. Here, VM 2 was created and needs to be provisioned. However, VM 2 is too big for PM 1, and on PM 2 there are not enough free resources. In such a case, Algorithm A finds no solution although, as can be seen in Figure 4.10, it is possible to migrate VM 1 from PM 2 to PM 1 and then to provision VM 2 on PM 2. Such complex solutions are not found by Algorithm A.



Figure 4.10: Exemplary limitation of Algorithm A

4.3.2 Algorithm B

Algorithm B was designed to address the limitations of Algorithm A. It is based on the K-Best First Search (KBFS) algorithm introduced by Felner et al. [FKK03] in 2003. KBFS is a generalization of best-first search (BFS) and was proven to outperform BFS⁴. Some tabu search techniques, presented in Section 2.4, are also used. Put simply, KBFS performs a search in a tree. To be able to apply KBFS to our problem, an encoding of the VM-to-PM mapping is required. Furthermore, the parent-child relationship needs to be defined, as well as a “goodness” criterion for each state.

4.3.2.1 The Encoding

Algorithm B encodes the mapping of VMs to physical machines in the cluster in the form of states. In addition, the encoding includes the allocated and available resources of each physical machine.

⁴ Felner et al. showed that KBFS outperforms BFS by a factor of 15 on random trees with dead-ends.

This basically reflects the global resource view computed by the Cluster Manager, and thus the formal model defined in Section 3.2.1. The initial state represents the root of the search tree. From the initial state, child states are generated, and this process is repeated recursively. A child state B of a parent state A represents a VM-to-PM mapping which differs from the VM-to-PM mapping of state A in exactly one assignment, i.e. a single migration took place. An exemplary encoding is shown in Table 4.1.

State A:   VM 1 → PM 1    VM 2 → PM 1    VM 3 → PM 2
State B:   VM 1 → PM 1    VM 2 → PM 2    VM 3 → PM 2

Table 4.1: Exemplary VM-to-PM encoding of two parent-child states
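To make the encoding more concrete, the following sketch shows one possible Java representation of such a state, restricted to memory (as in the prototype); the class and field names are illustrative assumptions and not taken from the actual implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MappingState {

    // vmName -> hosting physical machine
    final Map<String, String> mapping;
    // pmName -> total memory capacity; vmName -> requested memory
    final Map<String, Long> pmCapacity;
    final Map<String, Long> vmDemand;

    MappingState(Map<String, String> mapping,
                 Map<String, Long> pmCapacity, Map<String, Long> vmDemand) {
        this.mapping = mapping;
        this.pmCapacity = pmCapacity;
        this.vmDemand = vmDemand;
    }

    /** Generates all children: every single-migration variation of this state. */
    List<MappingState> children() {
        List<MappingState> result = new ArrayList<>();
        for (String vm : mapping.keySet()) {
            for (String pm : pmCapacity.keySet()) {
                if (!pm.equals(mapping.get(vm))) {
                    Map<String, String> child = new HashMap<>(mapping);
                    child.put(vm, pm); // move exactly one VM
                    result.add(new MappingState(child, pmCapacity, vmDemand));
                }
            }
        }
        return result;
    }
}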

4.3.2.2 The Goodness Criteria

To apply KBFS to our problem, one needs to be able to compare different states in order to determine which one is better. Each state has a global profit, which is a function of the sum of the local profits of the physical machines and of the validity of the state. The local profit of a physical machine is a function of its resource utilization: the higher the resource utilization of a machine, the higher the local profit. An offline physical machine has the maximum possible local profit. This ensures that the global profit of a state increases when some machines are kept at a high resource utilization while the others are kept offline. This is expressed by the formula:

local_profit(p) = f(resource_utilization(p)) where:

local_profit : P → Q, resource_utilization : P → Q, and f : Q → Q.


Global profit is calculated by multiplying the sum of the local profits with the validity of the state, which can be either 0 or 1. Thus, the global profit is either the sum of the local profits, if the state is valid, or 0, if the state is not valid. A valid state is a state in which the total capacity of every physical machine in the cluster is greater than or equal to the sum of the requested capacities of the VMs provisioned on that physical machine. In terms of the formal model defined in Section 3.2.1, this means that ∀j ∈ P : t_j ≥ ∑_{i=1}^{n} a_i, where a_1, ..., a_n represent the requested capacities of the virtual machines provisioned on physical machine j. All other states are invalid. As a result, Algorithm B defines a valid global state as a state with a global profit higher than 0. Furthermore, valid states can be ranked by their global profit. This is shown by the formula:

global_profit(state) = validity(state) · ∑_{i=1}^{|P|} local_profit(i), where:

Z = the set of states, global_profit : Z → Q, and validity : Z → {0, 1}.

Besides the global profit of a state, Algorithm B also computes the total cost of moving from the initial state to any given state. The total cost is a function of the number of migrations necessary for the transition from the initial state to that state. The total cost is then used in combination with the global profit to determine the utility of a transition between the initial state and a given state. This is the ultimate quality factor used to select the final state. Thus, using the formal model defined in Section 3.2, the utility of a state can be calculated with the formula:

utility(state) = d(global_profit(state), total_cost(state)) where:

utility : Z → Q, d : Q × Q → Q, total_cost(state) = e(migrations_necessary(initial_state, state)), total_cost : Z → Q, e : N → Q, and migrations_necessary : Z × Z → N.


Note that the functions d, e and f are not determined at this point. Instead, they can vary, depending on the implementation. Tests using different configurations of d, e and f can then determine which configuration best fits certain scenarios.
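Since d, e and f are left open by the design, the following sketch shows one possible, purely illustrative configuration of these functions; the constants used are assumptions and not the values used in the evaluation.

public class GoodnessCriteriaSketch {

    /** f: local profit grows with resource utilization; offline machines get the maximum. */
    static double localProfit(double utilization, boolean online) {
        return online ? utilization : 1.0;
    }

    /** global profit = validity * sum of local profits, as defined in Section 4.3.2.2. */
    static double globalProfit(double[] utilizations, boolean[] online, boolean valid) {
        if (!valid) {
            return 0.0;
        }
        double sum = 0.0;
        for (int i = 0; i < utilizations.length; i++) {
            sum += localProfit(utilizations[i], online[i]);
        }
        return sum;
    }

    /** e: a linear cost per live migration (the factor 0.1 is an assumption). */
    static double totalCost(int migrations) {
        return 0.1 * migrations;
    }

    /** d: utility trades the global profit against the migration cost. */
    static double utility(double globalProfit, double totalCost) {
        return globalProfit - totalCost;
    }
}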

4.3.2.3 Applying KBFS

Using this encoding and goodness criterion, Algorithm B performs its local search. For this search, two lists are used: the open list of states (nodes) that have been evaluated with respect to their utility but not yet expanded, and the closed list, which contains the nodes that have already been expanded. At each iteration, the best K nodes from the open list are expanded, and their children are evaluated and added to the open list. After a limited number of iterations, the algorithm terminates and the node with the highest utility is selected. This represents the final state, and the system moves to that state by means of successive migrations. What Algorithm B performs is thus a mixture of breadth-first and depth-first search. Figure 4.11 shows an example of how Algorithm B proceeds, while the pseudocode of Algorithm B is shown in Algorithm 2.

Input: A VM that requires more resources, K, ITERATIONS
Output: New mapping
1  if resources available locally then
2      assign local resources;
3      return new remapping;
4  end
5  else
6      determine all the children of the current state;
7      add current state to the closed list;
8      add the children of the current state to the open list;
9      while ITERATIONS ≠ 0 do
10         select the K children with the highest utility from the open list;
11         foreach children do
12             move the current state to the closed list;
13             add the children of the current state to the open list;
14         end
15         decrement ITERATIONS by one;
16     end
17     select the state with the highest utility from the closed list;
18     return new remapping;
19 end

Algorithm 2: Algorithm B
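The following sketch outlines the core KBFS loop in Java under the assumption of a generic State type that exposes its children and its utility (the state encoding sketched in Section 4.3.2.1 could play this role); tabu moves and the initial local-resource check of Algorithm 2 are omitted for brevity.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KBestFirstSearchSketch {

    interface State {
        List<State> children();
        double utility();
    }

    static State search(State initial, int k, int iterations) {
        // open: evaluated but not yet expanded; closed: already expanded
        PriorityQueue<State> open =
                new PriorityQueue<>(Comparator.comparingDouble(State::utility).reversed());
        List<State> closed = new ArrayList<>();

        closed.add(initial);
        open.addAll(initial.children());

        while (iterations-- > 0 && !open.isEmpty()) {
            // expand the K best nodes of this iteration
            for (int i = 0; i < k && !open.isEmpty(); i++) {
                State best = open.poll();
                closed.add(best);
                open.addAll(best.children());
            }
        }

        // the final state is the best node expanded so far
        return closed.stream()
                .max(Comparator.comparingDouble(State::utility))
                .orElse(initial);
    }
}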



Figure 4.11: Visual example of Algorithm B (K = 2, ITERATIONS = 3)

Algorithm B is always capable of finding a valid solution if valid solutions exist and the transition cost to at least one of these solutions does not exceed a certain predefined value. To achieve this, one needs to adapt the ITERATIONS parameter to the specific allocation problem, so that the algorithm terminates exactly when the maximum number of migrations permitted is exceeded. Note that, in finding a solution, the algorithm determines a local optimum which is not necessarily the global optimal solution.

When a final state is obtained, the solution is represented as a sequence of transitions (migrations) from the initial state, through intermediate states, to the final state.

While the final state is in most cases a valid state (the exception being the case where the problem has no acceptable solution), one or more of the intermediate states might not be valid. To avoid this problem, Algorithm B uses a tabu search technique: it defines transitions from valid to invalid states as tabu moves. As such, no solutions containing tabu moves are accepted, and the algorithm chooses the best state in the closed list which leads to a solution without tabu moves. Furthermore, to avoid cycles, the parent of a state cannot be a child of the same state; such a migration is also defined as a tabu move.

Algorithm B provides significant advantages. Most importantly, unlike Algorithm A, if an acceptable solution exists, Algorithm B will find it. Furthermore, Algorithm B improves the resource utilization of the cluster and reduces energy consumption by minimizing the number of physical machines necessary. However, just like Algorithm A, Algorithm B has one limitation: it provides no guaranteed performance ratio.

Chapter 5

Implementation

In Section 4.1, the architecture of the management framework was designed. The framework is used to manage a cluster of virtualized physical machines. The software configuration of the cluster is described in Section 5.1. The management framework comprises three types of autonomic managers: the Physical Manager, the VM Manager and the Cluster Manager. The overall implementation of each of these managers is presented in Section 5.2, while the implementation of their core control components, i.e. the bottleneck determination mechanism and the cluster controller, is described in Sections 5.3 and 5.4, respectively. Finally, Section 5.5 presents the effort of this implementation measured in lines of code.

Note that although the framework and the controllers were designed using the mathematical model from Section 3.2.1, where four types of resources are defined (memory, CPU, network bandwidth and I/O bandwidth), the implementation of the framework focuses on only one resource, namely memory. This is because the implementation presented here is meant to be a prototype, a proof of concept with respect to the design presented in the previous chapter. Note also that in the following sections the implementation is described in detail; the purpose of this high level of detail is to provide a solid foundation for future work.

5.1 Software Configuration of the Cluster

The framework developed as part of this thesis is targeted at the autonomic management of virtualization-based data centers. As such, for the purpose of this implementation, a virtualization technology commonly found in data centers must be used to virtualize the physical machines in the cluster.

As shown in Section 2.1, three products fit well in this category: Virtuozzo and its open source variant OpenVZ, Xen (including all the virtualization products that are based on the Xen hypervisor) and VMware ESX. Out of the three, the open source variant of Xen that comes with Fedora Core 7 was chosen for the implementation. The argument for using Xen was mostly of a financial nature, since Xen is open source, unlike Virtuozzo and VMware ESX. Furthermore, OpenVZ has the disadvantage of hosting only VMs that use the same (Linux) kernel as the host OS; this is not the case with Xen, as seen in Section 2.1.

As such, the physical machines in the managed infrastructure were set up to run Fedora Core 7 with Xen 3.1. Fedora Core 7 was also used as the guest OS running in the virtual machines. Inside each virtual machine, a Tomcat 5.5.25 server was installed, and the JMX monitoring of Tomcat was enabled. This permits monitoring the Tomcat parameters as well as some OS parameters of the virtual machine hosting the Tomcat server. Furthermore, for the communication with the Xen hypervisor, the XenAPI daemon was activated on both physical machines.

5.2 The Framework

5.2.1 The Self-Manager Backbone

In Section 4.1, we have seen that the management framework is based on a Self-Manager developed in the Distributed Systems Lab of Wiesbaden University of Applied Sciences. The Self-Manager is the backbone of our modular architecture, providing support for loading modules and for the communication between modules. Recall the architecture of the framework: to facilitate the reading of this thesis, it is shown again in Figure 5.1. Each of the three autonomic managers, i.e. the VM Manager, the Physical Manager and the Cluster Manager, represents an instance of the Self-Manager; the differences between them are determined by the modules loaded by the Self-Manager core. The overall architecture of the Self-Manager is shown in Figure 5.2¹.

In order to load and initialize modules, the Self-Manager kernel must be started with an argument pointing at a configuration file, where the modules to be loaded and their initialization parameters are specified. The configuration file uses the Java .properties format².

¹ Again, this figure is repeated to facilitate reading this thesis.



Figure 5.1: The complete architecture of the framework

Figure 5.2: The architecture of the Self-Manager


As an example, a configuration file is shown in Listing 5.1.

1  #Module Manager settings
2  moduleManager.class = de.fh_wiesbaden.cs.vs.manager.ModuleManager
3  ModuleManager.loadModules = module_1 module_2
4
5
6  ########################################
7  # The modules with their configuration #
8  ########################################
9
10 #################################
11 # Module module_1 configuration #
12 #################################
13 module_1 = de.fh_wiesbaden.cs.vs.example.Module1
14
15 #reference to the module_2 module
16 module_1.module_2_reference = module_2
17 #the initial value
18 module_1.init_value = 233
19
20 #################################
21 # Module module_2 configuration #
22 #################################
23 module_2 = de.fh_wiesbaden.cs.vs.example.Module2
24
25 #the hostname to monitor
26 module_2.hostname = wwwvs

Listing 5.1: Configuration file example

In this example, two modules, module_1 and module_2, are loaded. This is done in line 3. For each module, the main class of the module must be specified: for module_1, this is done in line 13, and for module_2 in line 23. Each of the two modules specifies a set of properties which are loaded when the module initializes: lines 15-18 for module_1 and lines 25-26 for module_2. There are three abstract types of modules that can be loaded by the Self-Manager: Control modules, Action modules and Event modules. The modules are loaded based on the main class of the module specified in the configuration file. Each of the three types of modules extends the Module interface, shown in Listing 5.2.

1  public interface Module {
2
3      public abstract void initialize(
4          String moduleName, ConfigReader configReader, Log tracer,
5          Log logger, Messenger messenger);

² http://java.sun.com/j2se/1.4.2/docs/api/java/util/Properties.html


6
7      public abstract void shutdown();
8
9      public abstract void updateConfig();
10
11     public abstract boolean ping();
12
13     public abstract State getState();
14
15     public abstract ContentElement invoke(BasicMessage requestMessage);
16 }

Listing 5.2: The Module interface

A module implementing one of the three module types must provide an implementation for the initialize() (lines 3-5), shutdown() (line 7) and invoke() (line 15) methods. For the rest of the methods defined in the Module interface, a default implementation is provided by the class BasicModule, which a module can reuse by extending this class. The three types of autonomic managers in the framework represent a composition of concrete modules that implement one of the three module types (Control, Action, Event). In the following subsections, the implementation of each of these autonomic managers is described.
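As an illustration, the following sketch shows what a minimal concrete Event module could look like. It assumes that the Self-Manager classes (BasicModule, ConfigReader, Log, Messenger, BasicMessage, ContentElement) are available on the classpath; the import paths and the way properties are accessed are assumptions, since only the Module interface itself is documented here.

// NOTE: the import paths below are assumptions; only the class names appear
// in Listing 5.2 and in the surrounding text.
import de.fh_wiesbaden.cs.vs.manager.modules.BasicModule;
import de.fh_wiesbaden.cs.vs.manager.util.ConfigReader;
import de.fh_wiesbaden.cs.vs.manager.util.Log;
import de.fh_wiesbaden.cs.vs.manager.communication.Messenger;
import de.fh_wiesbaden.cs.vs.manager.communication.BasicMessage;
import de.fh_wiesbaden.cs.vs.manager.communication.ContentElement;

public class ExampleMonitorModule extends BasicModule {

    private String moduleName;
    private Log logger;
    private Messenger messenger;

    public void initialize(String moduleName, ConfigReader configReader,
            Log tracer, Log logger, Messenger messenger) {
        // the module name prefixes its entries in the configuration file,
        // e.g. "module_2.hostname = wwwvs" in Listing 5.1; how the values are
        // read from the ConfigReader is framework-specific (cf. the
        // moduleProperties object used in Listing 5.4)
        this.moduleName = moduleName;
        this.logger = logger;
        this.messenger = messenger;
    }

    public void shutdown() {
        // release any monitoring resources held by the module
    }

    public ContentElement invoke(BasicMessage requestMessage) {
        // called asynchronously when another module sends a message to this
        // module; nothing to do in this sketch
        return null;
    }
}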

5.2.2 The Physical Manager

The purpose of the Physical Manager is to monitor and execute commands on a virtualized physical machine. Although all the physical machines in the cluster use Xen, the implementation of the framework kept the generic character of the design presented in Section 4.1.2. Thus, from the point of view of the implementation, the internal architecture of the Physical Manager comprises three modules, as shown in Figure 5.3.

The MONITOR and ACTUATOR modules are used to monitor the physical machine and to execute commands on it, respectively, and have a generic character. This means that if, say, VMware ESX is used instead of Xen, the Monitor and Actuator modules require no modification. Instead, it is the XEN module that, as the name suggests, is technology-bound. This module receives messages from the Actuator and Monitor modules and transforms these requests into Xen-specific commands. Thus, it is this Xen module that is used for the actual interaction with the VMM. To adapt the Physical Manager to another virtualization technology, say VMware ESX, all one needs to do is to implement a VMware ESX module capable of handling the same messages as the Xen module and replace the Xen module with it.



Figure 5.3: The Implementation of the Physical Manager

In the following paragraphs, the implementation of each of these modules is described in detail.

5.2.2.1 The Monitor Module

The Monitor module of the Physical Manager is an Event module that is used to monitor the resource utilization of the physical machine. The implementation of the Monitor module is used to monitor:

• The amount of free physical memory

• The amount of used physical memory

• The amount of physical memory allocated to each virtual machine hosted on the physical machine

The resources to be monitored are specified as a module property in the configuration file of the Physical Manager. Listing 5.3 shows an extract of the Physical Manager configuration file where the properties of the Monitor module are set. The hostname of the physical machine is specified in line 2. Line 7 specifies the main class of the module. A reference to the module used to communicate with the hypervisor is set in line 10, while in lines 13-14 the parameters subject to monitoring are specified.


In this case, the number of hosted VMs and the amount of physical memory allocated to each VM, including the remaining free memory on the physical host, are to be monitored.

1  #the hostname of the physical machine
2  hostname = asterix
3
4  ###################################
5  # PM Monitor Module configuration #
6  ###################################
7  pm_monitor = de.fh_wiesbaden.cs.vs.manager.modules.monitor.pm_monitor.PMMonitor
8
9  #reference to the hypervisor module
10 pm_monitor.hypervisor_module = xen_module
11
12 #parameters to monitor
13 pm_monitor.monitored_parameter_1 = number_hosted_vms
14 pm_monitor.monitored_parameter_2 = memory_util_of_each_vm

Listing 5.3: The properties of the Physical Manager Monitor module

Note the generic character of loading the properties of the Monitor module. Assume one wants to monitor the CPU utilization of each VM, in addition to the number of VMs provisioned on the physical machine and the amount of memory allocated to each VM. In such a case, the configuration file would be extended with a line like:

pm_monitor.monitored_parameter_3 = cpu_util_of_each_vm

In the main class of the Monitor module, these properties are read and their values are used to initialize the attributes of the module. This is shown in Listing 5.4.

1  hypervisorModuleReference = moduleProperties.getProperty(
2      this.moduleName + ".hypervisor_module");
3
4  //read the parameters to monitor
5  while (readingParameters){
6      tmpParameterName = moduleProperties.getProperty(
7          this.moduleName + ".monitored_parameter_" + parameterCounter);
8
9      if (tmpParameterName != null){
10         monitoredParameters.add(tmpParameterName);
11         updatedMonitoredParameters.put(tmpParameterName, "-1");
12         parameterCounter++;
13     } else {
14         readingParameters = false;
15     }
16 }

Listing 5.4: Extract from the PMMonitor.java, showing how the module properties are loaded


Here, the reference to the hypervisor module is read in lines 1-2. Then, in lines 5-16, the parameters to be monitored are read until no more parameters are found in the configuration file. Each parameter is added to the monitoredParameters Vector (line 10) and to the updatedMonitoredParameters Hashtable (line 11) with the initial value -1. This way of reading the parameters supports the generic character of the configuration file. After the parameters to be monitored have been read, they are forwarded to the Xen module. This is presented in Listing 5.5 and represents a good example of how the intra-manager communication is implemented.

1  hypervisorModuleQueueSender = messenger.createJMSQueueSender(
2      queueSession.createQueue(hypervisorModuleReference));
3
4  for (int i = 0; i < monitoredParameters.size(); i++) {
5      ComplexContentElement hypervisorModuleMessageElement = new ComplexContentElement(
6          Identifiers.COM_GET_VALUE);
7
8      hypervisorModuleMessageElement.addElement(new SimpleContentElement(
9          "id", monitoredParameters.get(i)));
10
11     hypervisorModuleComplexMessage = new ComplexMessage();
12     hypervisorModuleComplexMessage.setcontent(hypervisorModuleMessageElement);
13
14     hypervisorModuleMessage = queueSession.createObjectMessage(
15         hypervisorModuleComplexMessage);
16     hypervisorModuleMessage.setJMSReplyTo(queueReceiver.getQueue());
17     hypervisorModuleQueueSender.send(hypervisorModuleMessage);
18 }

Listing 5.5: Extract from the PMMonitor.java, showing how the parameters to be monitored are sent to the Xen module

Here, a QueueSender is first initialized (lines 1-2). Next, for each parameter to be monitored, a complex element is created. The complex element contains the command for the Xen module, in this case the “get the value of the following parameter” command (lines 5-6), and a parameter identifier (lines 8-9). This complex element is then packed in a complex message (lines 11-12) and sent to the Xen module (lines 14-17). As already mentioned, the messages between the modules of an autonomic manager are sent via JMS. No further details are provided here with respect to the implementation of this JMS-based communication, since this is a feature that comes with the Self-Manager and was developed outside of this thesis.

When the Monitor module receives a message from the Xen module, the invoke() method is called asynchronously. In this method, the processComplexElement() method is subsequently called. The content of this method is shown in Listing 5.6.


1  private ContentElement processComplexElement(
2      ComplexContentElement receivedMessageBody, ComplexMessage receivedMessage){
3
4      if (receivedMessageBody.getName().equals(
5          Identifiers.COM_GET_NUMBER_HOSTED_VMS)) {
6
7          String tmpValue = MessageUtility.getValueFromMessage(
8              Identifiers.NUMBER_HOSTED_VMS, receivedMessage);
9
10         if (!updatedMonitoredParameters.get(Identifiers.NUMBER_HOSTED_VMS).equals(
11             tmpValue)) {
12
13             updatedMonitoredParameters.put(Identifiers.NUMBER_HOSTED_VMS, tmpValue);
14             sendUpdateToClusterManager();
15         }
16
17     } else if (receivedMessageBody.getName().equals(Identifiers.MEMORY_UTIL_OF_EACH_VM)) {
18         processMemoryUtilOfEachVM(receivedMessageBody, receivedMessage);
19     }
20
21     return null;
22 }

Listing 5.6: Extract from the PMMonitor.java, showing how the messages received from the Xen module are processed

This method has been implemented to deal with the number_hosted_vms (line 8) and memory_util_of_each_vm (line 17) parameters, but can easily be extended to deal with other monitoring parameters. For the number_hosted_vms parameter, the received data is parsed (lines 7-8), and if the value of this parameter has changed (lines 10-11), the value is updated (line 13) and an update is sent to the Cluster Manager (line 14). For the memory_util_of_each_vm parameter, the received data is parsed, updated and sent to the Cluster Manager in the method processMemoryUtilOfEachVM() (line 18).

Updates to the Cluster Manager are sent via Java RMI. This is part of the inter-manager communication and is therefore described in Section 5.2.5.

5.2.2.2 The Actuator Module

The Actuator module of the Physical Manager is an Action module that is used to execute tasks specified by the Cluster Manager on the physical machine monitored by the Physical Manager. How these tasks are received is part of the inter-manager communication and thus described in Section 5.2.5.


1  #the hostname of the physical machine
2  hostname = asterix
3
4  ####################################
5  # PM Actuator Module configuration #
6  ####################################
7  pm_actuator = de.fh_wiesbaden.cs.vs.manager.modules.monitor.pm_actuator.PMActuator
8
9  #reference to the hypervisor module
10 pm_actuator.hypervisor_module = xen_module

Listing 5.7: The properties of the Physical Manager Actuator module

The Actuator module is loaded by the Self-Manager core using the configuration file, just like any other module. The parameters loaded for the Actuator module are shown in Listing 5.7. Line 7 specifies the main class of the module. Since the Actuator module receives its tasks from the Cluster Manager, the only parameters from the configuration file used to initialize the attributes of the module are the hostname parameter (line 2), which specifies the hostname or IP address of the managed physical machine, and the hypervisor_module parameter (line 10), which points to the module used to communicate with the hypervisor. Reading these parameters from the configuration file and initializing the module attributes is performed similarly to the Monitor module and is therefore not described further.

1  public void setMemoryVM(String vmName, String bytes) throws JMSException {
2
3      ComplexContentElement hypervisorModuleMessageElement = new ComplexContentElement(
4          Identifiers.COM_SET_VM_TOTAL_MEMORY);
5      hypervisorModuleMessageElement.addElement(
6          new SimpleContentElement(Identifiers.VM_NAME, vmName));
7      hypervisorModuleMessageElement.addElement(
8          new SimpleContentElement(Identifiers.VM_TOTAL_MEMORY, bytes));
9
10     ComplexMessage cm = new ComplexMessage();
11     cm.setcontent(hypervisorModuleMessageElement);
12
13     ObjectMessage om = queueSession.createObjectMessage(cm);
14     om.setJMSReplyTo(queueReceiver.getQueue());
15
16     hypervisorModuleQueueSender.send(om);
17 }

Listing 5.8: The setMemoryVM() method from PMActuator.java


The implementation of the Actuator module can handle two types of tasks: (1) setting the memory allocation of a virtual machine and (2) migrating a VM from one physical machine to another. This is done by the methods setMemoryVM() and migrateVM(), of which, because they are similar, only the first is shown in Listing 5.8. The setMemoryVM() method is used to set the memory allocation of a specific VM. This is done by sending a message to the Xen module (lines 10-16), in which a complex element is embedded (lines 3-4) that contains the command for the Xen module, in this case “set the memory allocation of the following VM to the following value”. As such, a reference to the virtual machine (lines 5-6) and the total amount of memory (in bytes) to be allocated to the VM (lines 7-8) are embedded in the complex element.

5.2.2.3 The Xen Module

The Xen module of the Physical Manager is a proxy module that implements the functionality of both an Action module and an Event module. It is responsible for the actual interaction with the Xen hypervisor. Through the Xen module, the memory allocation of each virtual machine can be monitored and dynamically changed; for the dynamic memory allocation, the Xen hypervisor uses a balloon driver. The Xen module can also instruct a Xen hypervisor to migrate one of the VMs running on top of it to another physical machine in the cluster³. Both the balloon driver and the migration of virtual machines are techniques that were described in Section 2.1.

1 ###############################
2 # PM Xen Module configuration #
3 ###############################
4 xen_module = de.fh_wiesbaden.cs.vs.manager.modules.hypervisor.xen.XenModule
5
6 #URL, username and password to connect to the hypervisor
7 xen_module.hypervisor_URL = http://172.25.48.226:9363
8 xen_module.username = secret_username
9 xen_module.password = secret_pasword

Listing 5.9: The properties of the Physical Manager Xen module

Just like the other modules, the Xen module loads its parameters from the configuration file. The parameters of the Xen module are shown in Listing 5.9. In line 4, the main class of the module is specified. The Xen hypervisor is remotely managed via the XenAPI (see Section 2.1). The Xen hypervisor can be accessed using the Xen API under a URL

³ Under the constraint that this physical machine also uses Xen and that the CPUs of the two machines are from the same family.


(line 7), which specifies the IP address of the physical machine on which the Xen hypervisor runs and the port number at which XenAPI requests can be sent via XML RPC. Additionally, the Xen hypervisor can be configured to accept commands over the XenAPI only when a username (line 8) and password (line 9) are provided. These parameters are loaded by the Xen module and then used to initialize the attributes of the module, just like in the previously described modules.

The XenAPI comes with bindings for C, Python and C#. This is a problem, since the management framework is written in Java. To overcome this problem, the approach shown in Figure 5.4 was used: the XenAPI Python bindings are used to implement a XenManager Python class, which provides various methods for monitoring and executing commands on a Xen physical machine. This class is wrapped in the XenHelper Java class via Jython.


Figure 5.4: Using the Xen API to access the Xen hypervisor

An extract from the XenManager Python class is shown in Listing 5.10. In lines 6-10 the constructor of the class is defined. The three parameters of the constructor are the URL of the hypervisor, the username and the password. The default value of the url parameter is set to http://asterix:9363. These parameters are used in the constructor to connect to the Xen hypervisor (lines 9-10) via the XenAPI (also see line 1).

90 Chapter 5. Implementation 5.2. The Framework

The only method shown in this extract is the set_vm_total_memory(self, vm_name, mem_in_bytes) method (lines 12-17). This method is used to dynamically change the memory allocation of a certain VM and receives as parameters the name of the VM and the amount of memory to allocate to the VM, in bytes (line 12). In line 17, the XenAPI is used to perform the dynamic memory allocation.

1  import XenAPI as xenapi
2
3  class XenManager:
4      "This class is used to manage a Xen Hypervisor"
5
6      def __init__(self, url='http://asterix:9363', username='', password=''):
7          "Constructor, connect to the hypervisor"
8
9          self.session=xenapi.Session(url)
10         self.session.xenapi.login_with_password(username, password)
11
12     def set_vm_total_memory(self, vm_name, mem_in_bytes):
13         "Dynamically allocate/deallocate memory to a VM"
14
15         vm=self.get_vm_by_name(vm_name)
16         if vm == None: return None
17         self.session.xenapi.VM.set_memory_dynamic_max_live(vm, mem_in_bytes)

Listing 5.10: Extract from the XenManager Python class

To use the Python XenAPI bindings in the Physical Manager, which is written in Java, the XenHelper Java class was implemented. This class uses Jython to wrap the XenManager Python class. An extract from the XenHelper Java class is shown in Listing 5.11. In line 1, the PythonInterpreter class provided by Jython is imported. A PythonInterpreter object is used to call Python code from a Java program. This is used in the class constructor, in lines 10-19, to initialize a XenManager Python object and thus connect to the Xen hypervisor. The setVMTotalMemory() method is also presented in this code extract. This Java method simply wraps the set_vm_total_memory() Python method (line 28) and is used to dynamically change the memory allocation of a certain VM.

1  import org.python.util.PythonInterpreter;
2
3  public class XenHelper {
4      private PythonInterpreter interp;
5
6      public XenHelper(
7          String hypervisorURL, String username, String password)
8          throws HypervisorConnectionException {
9
10         interp=new PythonInterpreter();


11
12         interp.exec("import sys");
13         interp.exec("import xenmanager");
14
15         interp.set("url", hypervisorURL);
16         interp.set("username", username);
17         interp.set("password", password);
18
19         interp.exec("manager=xenmanager.XenManager(url, username, password)");
20     }
21
22     public void setVMTotalMemory(String vmName, String memInBytes)
23         throws VMDoesNotExistException {
24
25         interp.set("vm_name", vmName);
26         interp.set("mem_in_bytes", memInBytes);
27         try {
28             interp.exec("manager.set_vm_total_memory(vm_name, mem_in_bytes)");
29         } catch(NullPointerException e){
30             throw new VMDoesNotExistException(
31                 "VM named "+vmName+" does not exist on this host");
32         }
33     }
34     ...
35 }

Listing 5.11: Extract from the XenHelper Java class
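A hypothetical standalone use of the wrapper could look as follows; the URL and credentials are the placeholders from Listing 5.9, and the VM name is invented for the example.

public class XenHelperTest {

    public static void main(String[] args) throws Exception {
        // connect to the Xen hypervisor through the Jython-based wrapper
        XenHelper helper = new XenHelper(
                "http://172.25.48.226:9363", "secret_username", "secret_password");

        // resize the domain "vm_1" to 512 MB; the value is passed in bytes,
        // as expected by set_vm_total_memory() in Listing 5.10
        helper.setVMTotalMemory("vm_1", Long.toString(512L * 1024 * 1024));
    }
}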

The Xen module receives requests in the form of messages from both the Monitor and the Actuator modules of the Physical Manager. When such a message is received, the processComplexElement() method is called asynchronously. This is shown in Listing 5.12. In line 5, the identifier of the received command is read. In lines 7-13, this identifier is matched against a set of known command identifiers.

1  private ContentElement processComplexElement(
2      ComplexContentElement receivedMessageBody, ComplexMessage receivedMessage){
3
4      ContentElement returnValue = null;
5      String command = receivedMessageBody.getName();
6
7      if (command.equals(Identifiers.COM_GET_VALUE)) {
8          returnValue = getParameterValue(receivedMessage);
9      } else if (command.equals(Identifiers.COM_SET_VM_TOTAL_MEMORY)) {
10         setMemoryVM(receivedMessage);
11     } else if (command.equals(Identifiers.COM_LIVE_MIGRATE)) {
12         liveMigrate(receivedMessage);
13     }
14
15     return returnValue;
16 }
17


18 private void setMemoryVM(ComplexMessage receivedMessage){
19
20     String vmName = MessageUtility.getValueFromMessage(
21         Identifiers.VM_NAME, receivedMessage);
22     String memInBytes = MessageUtility.getValueFromMessage(
23         Identifiers.VM_TOTAL_MEMORY, receivedMessage);
24
25     try {
26         xenHelper.setVMTotalMemory(vmName, memInBytes);
27     } catch (VMDoesNotExistException e){
28         logger.error(this.moduleName, e.getLocalizedMessage());
29     }
30 }

Listing 5.12: Extract from XenModule.java, showing how the received messages are processed

In this module, three types of commands are treated: a command from the Monitor module asking for one of the two parameters subject to monitoring (lines 7-8) and two commands from the Actuator module, the first asking to dynamically allocate memory to a certain VM (lines 9-10) and the second asking to migrate a certain VM (lines 11-12). Upon a successful match, the corresponding method is called.

In lines 18-30, the content of the setMemoryVM() method is shown. This method is called when a message commanding the Xen module to perform a dynamic memory allocation is received. The message is parsed (lines 20-23) and the setVMTotalMemory() method is called on a XenHelper object which was previously initialized in the constructor of the Xen module. A similar process is performed for getParameterValue() (line 8) and liveMigrate() (line 12), with the difference that for the first one a response is also sent to the requesting module, in this case the Monitor module.

5.2.3 The VM Manager

The VM Manager is responsible for the management of both a virtual machine and the service running in this virtual machine. From the point of view of the implementation, the VM Manager comprises five modules, as shown in Figure 5.5.

The SL MONITOR and VM MONITOR modules are used to monitor the service and the virtual machine, respectively. These modules are independent of the monitoring technology and thus have a generic character. For this implementation, JMX was used for monitoring. As such, a JMX module was used for the monitoring, in a similar way as the XEN module was used as a “proxy” module in the Physical Manager.



Figure 5.5: The Implementation of the VM Manager

The VM LOGIC module is the central component of the VM Manager and implements the bottleneck determination mechanism. When a bottleneck has been determined, the ACTUATOR module is used to request an additional amount of the bottlenecked resource from the Cluster Manager. In the following paragraphs, the implementation of each of these modules is described, with the exception of the JMX module, which was implemented outside of this thesis, and the VM Logic module, which is an “intelligent” module and is therefore described separately in Section 5.3.

5.2.3.1 The SL Monitor Module

The SL Monitor module of the VM Manager realizes service level monitoring and represents an Event module that is used to monitor the service running inside a virtual machine. Like all other modules, the SL Monitor loads its parameters from a configuration file. The part of the VM Manager configuration file dedicated to the SL Monitor is shown in Listing 5.13. The main class of the module is specified in line 4. The other two parameters read are a reference to the technology-specific monitoring module (line 7), in this case the JMX module, and a reference to the controller module (line 10), in this case the VM Logic.


1  ######################################
2  # VM SL Monitor Module configuration #
3  ######################################
4  service_monitor = de.fh_wiesbaden.cs.vs.manager.modules.monitor.service_monitor.VMServiceMonitor
5
6  #Reference to the jmx module used to gather the required service parameters
7  service_monitor.monitoring_module_reference = jmx_module
8
9  #Reference to the controller module where the service parameters are to be forwarded
10 service_monitor.controller_module_reference = controller_module

Listing 5.13: The properties of the SL Monitor module

These parameters are initialized when the module is loaded by the Self-Manager core. Unlike the Monitor module of the Physical Manager, the SL Monitor receives the service parameters it needs to monitor from the central component, the VM Logic. A message containing the parameters that need to be monitored is received as soon as the module finishes its initialization process. When such a message is received, the method subscribeServiceParameters() is called asynchronously. The content of this method is shown in Listing 5.14. In line 4, a QueueSender object is created, which is later used to send a message containing the monitoring parameters to the JMX module (lines 22-26). In lines 13-15, the message received from the VM Logic is parsed, and for each monitoring parameter received, a complex element is created (line 11) to wrap the respective monitoring parameter (lines 17-20). Each complex element is packed in a separate message (lines 22-23) and sent to the JMX module. By sending these messages, the SL Monitor module subscribes to the requested parameters, with updates being sent every 20 seconds⁴.

1  private void subscribeServiceParameters(ContentElement receivedMessageBody)
2      throws JMSException {
3
4      jmxModuleQueueSender = messenger.createJMSQueueSender(
5          queueSession.createQueue(jmxModuleReference));
6      ComplexContentElement complexBodyElement = (ComplexContentElement) receivedMessageBody;
7      ComplexContentElement jmxModuleMessageElement;
8      String[] tmpRequiredParams;
9
10     for (int i = 0; i < complexBodyElement.getSize(); i++) {
11         jmxModuleMessageElement = new ComplexContentElement("Subscribe_value");
12
13         tmpRequiredParams = complexBodyElement.getElement(i).getName().split("\\|");

⁴ Of course, this 20-second interval can be modified in the configuration file of the JMX module.


14         requiredServiceParameters.add(tmpRequiredParams[0]);
15         requiredServiceParameters.add(tmpRequiredParams[1]);
16
17         jmxModuleMessageElement.addElement(
18             new SimpleContentElement("command",JmxModule.COM_SUBSCRIBE_VALUE));
19         jmxModuleMessageElement.addElement(
20             new SimpleContentElement("id",complexBodyElement.getElement(i).getName()));
21
22         jmxModuleCm = new ComplexMessage();
23         jmxModuleCm.setcontent(jmxModuleMessageElement);
24         jmxModuleMessage = queueSession.createObjectMessage(jmxModuleCm);
25         jmxModuleMessage.setJMSReplyTo(queueReceiver.getQueue());
26         jmxModuleQueueSender.send(jmxModuleMessage);
27     }
28 }

Listing 5.14: The subscribeServiceParameters() method of the SL Monitor

When a message from the JMX module is received, the SL Monitor module simply forwards this message to the VM Logic.

5.2.3.2 The VM Monitor Module

The VM Monitor module of the VM Manager is an Event module which is responsible for monitoring the resource utilization of a virtual machine. The parameters indicating what exactly the VM Monitor is monitoring are received from the VM Logic. For this implementation, the virtual machine parameters monitored are free memory, total swap space and free swap space. Since the implementation of the VM Monitor is very similar to the one of the SL Monitor, no further details are presented.

5.2.3.3 The Actuator Module

The Actuator module of the VM Manager is an Action module responsible for requesting additional resources from the Cluster Manager. When the VM Logic determines a resource that is causing a bottleneck, a message is sent to the Actuator module. When the Actuator module receives the message, the invoke() method is called asynchronously. The content of this method is shown in Listing 5.15. In this method, the nature of the message is first determined (line 5) and then the corresponding method is called, the only available option in this case being the callClusterManager() method (line 8). It is possible that the VM Logic sends a second message, signaling that a bottleneck has occurred, before the Cluster Manager is capable of allocating the required resource (e.g. when a migration is necessary).


To avoid sending redundant requests to the Cluster Manager, the problemSubmitted boolean variable, originally set to false, is set to true. The Actuator module sends no further requests to the Cluster Manager until this variable is set back to false.

1  public ContentElement invoke(BasicMessage requestMessage){
2
3      ContentElement requestMessageBody = requestMessage.getBody();
4
5      if (requestMessageBody.getName().equals(Identifiers.BOTTLENECK_IDENTIFIER)) {
6          if(!problemSubmitted){
7              problemSubmitted=true;
8              callClusterManager((ComplexMessage) requestMessage);
9          } else {
10             //Do nothing, give the cluster manager some time to solve the problem
11         }
12     }
13     return null;
14 }
15
16 private synchronized void callClusterManager(ComplexMessage requestMessage){
17     String problemCause = MessageUtility.getValueFromMessage(
18         PROBLEM_IDENTIFIER, requestMessage);
19
20     if(problemCause.equals(Identifiers.MEMORY_IDENTIFIER)) {
21         try {
22             vmProblemSolver.giveMemory(vmName);
23             //the method returns only after the dynamic memory allocation was performed
24
25             problemSubmitted=false;
26         } catch (RemoteException e){
27             logger.error(this.moduleName + e.getLocalizedMessage());
28         }
29     }
30 }

Listing 5.15: Extract from VMActuatorModule.java, showing how a received message is processed

Inside the callClusterManager() method, the resource causing the bottleneck is first read from the message received from the VM Logic (lines 17-18), and then the corresponding method is called. As mentioned at the beginning of this chapter, the only type of resource treated by the implementation of the framework is memory. Thus, in lines 20-29, the situation where memory is the resource causing the bottleneck is treated. In line 22, a method is called to send a synchronous request to the Cluster Manager. This mechanism is part of the inter-manager communication and is thus not described here, but in Section 5.2.5. When a confirmation is received from the Cluster Manager, the value of the problemSubmitted variable is set to false and thus the Actuator stops discarding the messages received from the VM Logic module.


5.2.4 The Cluster Manager

The Cluster Manager is responsible for the self-optimization of the cluster as a whole. From the point of view of the implementation, the Cluster Manager comprises three modules, shown in Figure 5.6.


Figure 5.6: The Implementation of the Cluster Manager

The MONITOR module is used to receive updates from the Physical Managers and requests from the VM Managers. These requests are then forwarded to the CLUSTER LOGIC module, which determines the allocation of resources in the cluster. Upon making a decision, the Cluster Logic module forwards this decision to the ACTUATOR module. The Actuator module then transforms this decision into tasks which are delegated to the corresponding Physical Managers. In the following paragraphs, the Monitor and the Actuator modules are described. The Cluster Logic module represents the most important controller of the framework and is thus described separately in Section 5.4.


5.2.4.1 The Monitor Module

The Monitor module of the Cluster Manager represents an Event module and is used to receive requests from the VM Managers and updates from the Physical Managers. The way these requests and updates are received is part of the inter-manager communication and is thus described in Section 5.2.5. When a request is received from a VM Manager, the request is wrapped in a complex element which is then passed as a parameter to the sendMessageToController() method, shown in Listing 5.16. Here, the complex element is wrapped in a message (lines 3-4) which is then sent to the Cluster Logic (lines 6-8). An update from a Physical Manager is forwarded to the Cluster Logic in a similar way.

1  public static void sendMessageToController(ComplexContentElement controllerModuleMessageElement) throws JMSException{
2
3      ComplexMessage cm = new ComplexMessage();
4      cm.setcontent(controllerModuleMessageElement);
5
6      ObjectMessage om = queueSession.createObjectMessage(cm);
7      om.setJMSReplyTo(queueReceiver.getQueue());
8      controllerModuleQueueSender.send(om);
9  }

Listing 5.16: Extract from ClusterMonitor.java, showing how a message is sent to the Cluster Logic

5.2.4.2 The Actuator Module

The Actuator module of the Cluster Manager is an Action module which is used to delegate tasks coming from the Cluster Logic to the Physical Manager. This delegation of tasks is part of the inter-manager communication and thus described in Section 5.2.5.

5.2.5 The Inter-Manager Communication

The autonomic managers can be located on different physical machines in the cluster. As such, the inter-manager communication must take place over the network. Java RMI provides support for creating Java-based distributed applications by allowing objects running in one Java Virtual Machine to invoke methods on objects running in another Java Virtual Machine [Jav]. Since our framework is written in Java, using the high-level approach provided by Java RMI for the inter-manager communication is a natural choice.

The following paragraphs describe how Java RMI was used to implement the inter-manager communication.

5.2.5.1 The VM Manager-to-Cluster Manager Communication

When a VM Manager has determined a resource causing a bottleneck, the VM Manager sends a request to the Cluster Manager, asking for more of that resource. This communication is referred to as the VM Manager-to-Cluster Manager communication. For this communication, the Cluster Manager plays the role of the server while the VM Manager represents the client.

When the Cluster Manager is started, several remote objects are created. One of them is used for the VM Manager-to-Cluster Manager communication and is an instance of the VMProblemSolverImpl class, which extends the UnicastRemoteObject class and implements the VMProblemSolver interface. The corresponding UML class diagram is shown in Figure 5.7. Because the VMProblemSolver interface extends the Remote interface, it is here that the methods that can be remotely called on this remote object are defined.


Figure 5.7: VMProblemSolver UML class diagram
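Based on the UML class diagram in Figure 5.7 and on Listing 5.17, the remote interface and the skeleton of its implementation can be sketched as follows; the constructor and all package-level details are assumptions.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Remote interface as suggested by Figure 5.7: the only remotely callable
// method is giveMemory(), returning 0 on success and -1 on error.
interface VMProblemSolver extends Remote {
    int giveMemory(String vmIdentifier) throws RemoteException;
}

// Skeleton of the server-side implementation; the actual body forwards the
// request to the Cluster Logic, as shown in Listing 5.17.
class VMProblemSolverImpl extends UnicastRemoteObject implements VMProblemSolver {

    protected VMProblemSolverImpl() throws RemoteException {
        super();
    }

    public int giveMemory(String vmIdentifier) throws RemoteException {
        // wrap the request in a complex element and forward it to the
        // Cluster Logic module (cf. Listing 5.17)
        return 0;
    }
}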

The sequence diagram in Figure 5.8 shows how the VM Manager-to-Cluster Manager communication takes place.

First, the Monitor module of the Cluster Manager creates and starts a server thread. When this thread is created, an object instance of the VMProblemSolverImpl class is created and registered in the RMI Registry. There is only one RMI Registry in the cluster, and this is the central place where all the remote objects are registered. When the Actuator module of the VM Manager wants to request more memory, it does so by first obtaining a remote reference to the remote object from the RMI Registry.


Figure 5.8: The VM Manager-to-Cluster Manager communication

The VMProblemSolverImpl class is not contained in the VM Manager, but dynamically downloaded at run time. After the class definition has been downloaded, the giveMemory() method is called on the remote object (see Listing 5.15, line 22). The implementation of the giveMemory() method is shown in Listing 5.17. The method receives as its single parameter an identifier for the virtual machine that requires additional memory (line 1). Next, this request is wrapped in a complex element (lines 3-6) and then forwarded to the Cluster Logic module (line 9). If an error occurs, the method returns -1; otherwise it returns 0. This return value is received by the VM Actuator module, which can then determine whether the request was successful or not.


1  public int giveMemory(String vmIdentifier) throws RemoteException {
2
3      ComplexContentElement controllerModuleMessageElement = new ComplexContentElement(
4          Identifiers.COM_GET_MORE_MEMORY);
5      controllerModuleMessageElement.addElement(
6          new SimpleContentElement(Identifiers.VM_NAME, vmIdentifier));
7
8      try {
9          ClusterMonitor.sendMessageToController(controllerModuleMessageElement);
10     } catch (JMSException e){
11         logger.error(this.moduleName, e.getLocalizedMessage());
12         return -1;
13     }
14
15     return 0;
16 }

Listing 5.17: The giveMemory() method

5.2.5.2 The Physical Manager-to-Cluster Manager Communication

One of the responsibilities of the Physical Manager is to keep the Cluster Manager up to date with the resource utilization of the physical machine assigned to it. For this implementation, the Physical Manager sends the Cluster Manager data showing the amount of memory assigned to each virtual machine and the amount of free memory remaining. This communication is called the Physical Manager-to-Cluster Manager communication.

One of the remote objects created when the Cluster Manager starts is an instance of the ClusterMonitorUpdaterImpl class. This remote object is then used by the Physical Manager for the Physical Manager-to-Cluster Manager communication. The ClusterMonitorUpdaterImpl class extends the UnicastRemoteObject class and implements the ClusterMonitorUpdater interface. The corresponding UML diagram is shown in Figure 5.9. The ClusterMonitorUpdater interface extends the Remote interface and thus specifies the methods that can be remotely called on the remote object.

The sequence diagram in Figure 5.10 shows how the Physical Manager-to-Cluster Manager communication takes place. The first part is similar to the VM Manager-to-Cluster Manager communication: a remote object is instantiated and registered in the RMI Registry. This is also shown in Listing 5.18. The object is first created (line 6) and then registered in the RMI Registry (line 11).


[UML class diagram: ClusterMonitorUpdaterImpl extends UnicastRemoteObject and implements the ClusterMonitorUpdater interface, which extends java.rmi.Remote and declares updateResourceView().]

Figure 5.9: ClusterMonitorUpdater UML class diagram
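
The ClusterMonitorUpdater interface itself is not listed in this section. A minimal sketch that is consistent with Figure 5.9 and with the call in Listing 5.19 could look as follows; the exact parameter and return types are assumptions.

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.Hashtable;

// Hypothetical declaration of the ClusterMonitorUpdater remote interface.
// The Physical Manager calls updateResourceView() to push its local
// resource view to the Cluster Manager.
public interface ClusterMonitorUpdater extends Remote {

    String updateResourceView(String hostname, Hashtable localResourceView)
            throws RemoteException;
}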

1  public ClusterMonitorUpdaterServerThread() throws RemoteException {
2      if (System.getSecurityManager() == null) {
3          System.setSecurityManager(new SecurityManager());
4      }
5
6      clusterMonitorUpdater = new ClusterMonitorUpdaterImpl();
7  }
8
9  public void run() {
10     try {
11         Naming.rebind("rmi://localhost:1099/ClusterMonitorUpdater", clusterMonitorUpdater);
12     } catch (Exception e) {
13         // handle exception
14     }
15 }

Listing 5.18: Instantiating and registering a remote object

Listing 5.19 shows how the Monitor module of the Physical Manager gets a remote reference to the remote object (lines 7-8) and then calls the updateResourceView() method on this object (line 16). This method receives as parameters the hostname of the physical machine and the local resource view, encoded in a hashtable-based data structure. This local resource view is then used in the Cluster Logic to update the global resource view.


[Sequence diagram: the ClusterActuator starts the ClusterMonitorUpdaterServerThread, which instantiates ClusterMonitorUpdaterImpl and binds it to the RMI Registry; the PMMonitor obtains a remote reference from the registry and calls updateResourceView(), which triggers updateHostResourceView() in the Cluster Manager.]

Figure 5.10: The Physical Manager-to-Cluster Manager communication

1  private void sendUpdateToClusterManager() {
2      try {
3          localRersourceView.put(Identifiers.RESOURCE_VIEW_MEMORY, localMemoryView);
4
5          if (cmu == null) {
6              try {
7                  cmu = (ClusterMonitorUpdater) Naming.lookup(
8                      "rmi://localhost:1099/ClusterMonitorUpdater");
9              } catch (Exception e1) {
10                 this.logger.error(
11                     this.moduleName, e1.getLocalizedMessage()
12                     + " please start the cluster manager");
13             }
14         }
15
16         cmu.updateResourceView(hostname, localRersourceView);
17     } catch (Exception e) {
18         e.printStackTrace();
19     }
20 }

Listing 5.19: Updating the global resource view

5.2.5.3 The Cluster Manager-to-Physical Manager Communication

After the Cluster Logic has taken a resource reallocation decision, this decision must be transformed into tasks, which are then delegated to the corresponding Physical Managers. This delegation of tasks is called the Cluster Manager-to-Physical Manager communication. When a Physical Manager starts, it creates and registers a remote object in the RMI


Registry. This object is an instance of the ClusterRequestProcessorImpl class and is used by the Cluster Manager to delegate tasks to the Physical Manager. The ClusterRequestProcessorImpl class extends the UnicastRemoteObject class and implements the ClusterRequestProcessor interface. The corresponding UML diagram is shown in Figure 5.11. The ClusterRequestProcessor interface extends the Remote interface and thus specifies which methods can be remotely called on the remote object.

[UML class diagram: ClusterRequestProcessorImpl extends UnicastRemoteObject and implements the ClusterRequestProcessor interface, which extends java.rmi.Remote and declares setMemoryVM() and migrateVM().]

Figure 5.11: ClusterRequestProcessor UML class diagram
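
Analogously, the ClusterRequestProcessor interface declares the operations that the Cluster Manager may invoke on a Physical Manager. A minimal sketch consistent with Figure 5.11 and Listing 5.20 is shown below; the parameter lists, in particular those of migrateVM(), are assumptions.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical declaration of the ClusterRequestProcessor remote interface.
public interface ClusterRequestProcessor extends Remote {

    // Sets the memory allocation of a virtual machine hosted on the managed machine.
    String setMemoryVM(String vmName, String vmMemorySize) throws RemoteException;

    // Migrates a virtual machine from the managed machine to the given destination host.
    String migrateVM(String vmName, String destinationHost) throws RemoteException;
}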

The Cluster Manager-to-Physical Manager communication is shown in Figure 5.12. In this example, the method setMemoryVM() is called on the remote object. The corresponding source code is shown in Listing 5.20. In this code extract from the Actuator module of the Cluster Manager, the name of the virtual machine subject to memory reallocation (lines 3-4), the name of the physical machine hosting this virtual machine (lines 5-6) and the new amount of memory allocated to this machine (lines 7-8) are read from the message received from the Cluster Logic. The name of the physical machine is then used to look up the remote object instantiated by the Physical Manager that manages this machine (lines 11-12). In line 13, the setMemoryVM() method is called on the remote object; it receives as parameters the name of the virtual machine and the new amount of memory allocated to this virtual machine. The migrateVM() method is handled similarly.


[Sequence diagram: the PMActuator starts the ClusterRequestProcessorServerThread, which instantiates ClusterRequestProcessorImpl and binds it to the RMI Registry; the ClusterActuator obtains a remote reference from the registry and calls setMemoryVM(), which is forwarded to the Physical Manager.]

Figure 5.12: The Cluster Manager-to-Physical Manager communication

1  private void setVmMemory(ComplexMessage receivedMessage) {
2
3      String vmName = MessageUtility.getValueFromMessage(
4          Identifiers.VM_NAME, receivedMessage);
5      String physicalHostname = MessageUtility.getValueFromMessage(
6          Identifiers.LOCAL_PHYSICAL_MACHINE_NAME, receivedMessage);
7      String vmMemorySize = MessageUtility.getValueFromMessage(
8          Identifiers.VM_TOTAL_MEMORY, receivedMessage);
9
10     try {
11         ClusterRequestProcessor crp = (ClusterRequestProcessor) Naming.lookup(
12             "rmi://localhost:1099/ClusterRequestProcessor" + physicalHostname);
13         crp.setMemoryVM(vmName, vmMemorySize);
14     } catch (Exception e) {
15         logger.error(this.moduleName, e.getLocalizedMessage());
16     }
17
18 }

Listing 5.20: Delegating a task to a Physical Manager

5.3 The Bottleneck Determination Mechanism

The bottleneck determination mechanism is implemented in the VM Logic module of the VM Manager. In Section 4.2, it was argued that a rule engine that supports JSR 94 is a convenient choice for the bottleneck determination mechanism. For the implementation, the Jess rule engine was used [San]. Although a commercial product, Jess is available free of charge for academic purposes.


The VM Logic determines the parameters that are to be monitored by the SL Monitor and the VM Monitor modules of the VM Manager. These parameters are loaded from the configuration file while the VM Logic module is being initialized. This is shown in Listing 5.21. The parameters for the service are specified in lines 19-20, while the parameters of the operating system to be monitored are specified in lines 23-28. These parameters are loaded and forwarded to the SL Monitor and VM Monitor respectively.

1  ######################################
2  # VM Controller Module configuration #
3  ######################################
4
5  controller_module = de.fh_wiesbaden.cs.vs.manager.modules.controller.VMControllerModule
6
7  #Reference to the rule base
8  controller_module.rule = conf/rulebase/bottleneckdetermination.cls
9
10 #The rule service provider
11 controller_module.rule_service_provider = jess.jsr94
12
13 #Reference to the actuator module
14 controller_module.actuator_module_reference = actuator_module
15 controller_module.service_monior_module_reference = service_monitor
16 controller_module.vm_os_monior_module_reference = vm_os_monitor
17
18 #Tomcat mbeans and attributes
19 controller_module.tomcat_mbean_name_1=Catalina:type=GlobalRequestProcessor,name=http-8080
20 controller_module.tomcat_mbean_attribute_1=maxTime
21
22 #VM OS mbeans and attributes
23 controller_module.vm_os_mbean_name_1=java.lang:type=OperatingSystem
24 controller_module.vm_os_mbean_attribute_1=FreePhysicalMemorySize
25 controller_module.vm_os_mbean_name_2=java.lang:type=OperatingSystem
26 controller_module.vm_os_mbean_attribute_2=FreeSwapSpaceSize
27 controller_module.vm_os_mbean_name_3=java.lang:type=OperatingSystem
28 controller_module.vm_os_mbean_attribute_3=TotalSwapSpaceSize

Listing 5.21: Parameters loaded by the VM Logic module

In line 8 of Listing 5.21, the file containing the rule set for the bottleneck determination is specified, while line 11 specifies Jess as the rule service provider. Two Java beans are used to wrap the monitoring data received from the two monitoring modules: TomcatParameter for data received from the SL Monitor and OSParameter for data received from the VM Monitor. When a message is received from one of the two monitoring modules, the corresponding bean is updated. For the TomcatParameter bean, this is shown in Listing 5.22.


1 if (smthChanged) {
2     tomcatParameter = new TomcatParameter();
3     tomcatParameter.setMaxTime(tomcatParameterValuePair.get(Identifiers.TOMCAT_MAX_TIME));
4 }

Listing 5.22: Updating a TomcatParameter bean
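
TomcatParameter and OSParameter are plain Java beans; Jess later maps their properties to template slots via the from-class declarations in Listing 5.24. A minimal sketch of the TomcatParameter bean is shown below; the property type is an assumption.

// Hypothetical sketch of the TomcatParameter bean used as a Jess fact.
// Jess derives a template slot named maxTime from the getter/setter pair.
public class TomcatParameter {

    private long maxTime;

    public long getMaxTime() {
        return maxTime;
    }

    public void setMaxTime(long maxTime) {
        this.maxTime = maxTime;
    }
}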

After an update has taken place, the rule engine (Jess) is called in order to determine whether the service parameters are still SLA-conform and, if not, which resource is causing the bottleneck. Listing 5.23 shows how Jess is called from inside the VM Logic module. If the rule engine is called for the first time, a session needs to be initialized (lines 1-6). In this process, a new bean, an instance of the Bottleneck class, is created. This bean will be used later on to process the response of the rule engine. A List is then created, containing three beans: tomcatParameter and osParameter for the input and bottleneck for the output. In line 13, this list is passed to the rule engine and the rules are executed.

1  if (!sessionInitialized) {
2      session = (StatelessRuleSession) runtime.createRuleSession(
3          ruleURI, new HashMap(), RuleRuntime.STATELESS_SESSION_TYPE);
4      bottleneck = new Bottleneck();
5      sessionInitialized = true;
6  }
7
8  List inputList = new ArrayList();
9  inputList.add(tomcatParameter);
10 inputList.add(osParameter);
11 inputList.add(bottleneck);
12
13 List results = session.executeRules(inputList);

Listing 5.23: Calling Jess from inside the VM Logic module

When Jess is called for the first time, the data from the rule file specified in the configuration file is loaded into the working memory of Jess. The content of the rule file is shown in Listing 5.24. This rule set is used to determine whether memory is the cause of a bottleneck. Memory is determined to be the cause of a bottleneck when an SLA violation occurs while, at the same time, little free memory is available and a lot of swapping takes place.

In line 10, the counter global variable is initialized to the value 1. Line 4 ensures that the counter variable retains its value the next time Jess is called in this session, instead of being reset to 1. An SLO is defined in line 15: the value of


the maxTime Tomcat parameter must be under 2000 ms. This is the only SLO contained in the SLA.

In lines 25-27, 31-33 and 37-39, templates are declared that hold the values of the TomcatParameter, OSParameter and Bottleneck beans, respectively. Next, the rule in lines 43-49 states that if the maxTime Tomcat parameter is under the limit defined in line 15, then everything is SLA-conform and thus no bottleneck exists. The rule in lines 54-59 increases the value of the counter if the value of the maxTime Tomcat parameter is over the limit defined in line 15 but the counter variable is not bigger than 3. This basically means that three SLO violations are tolerated. If the value of the counter variable is bigger than 3, the rule in lines 65-74 checks whether there is little free memory available (threshold set in line 20) and a lot of swapping takes place (threshold set in line 21); if so, memory is causing the bottleneck. Otherwise, if the maxTime limit is still exceeded but there is sufficient free memory and little swapping, then there is a bottleneck, but memory is not its cause. This is specified by the rule in lines 80-89.

1  /*
2    Makes sure that the global variables are not reset
3  */
4  (set-reset-globals nil)
5  /*
6    This is a counter
7    This counter is used to specify that the maxTime QoS parameter can be violated 3 times
8    before a problem is raised
9  */
10 (defglobal ?*counter* = 1)
11 /*
12   This is the QoS parameter that specifies the maximum time (in ms)
13   allowed for processing a request
14 */
15 (defglobal ?*qosmaxtimelimit* = 2000)
16 /*
17   These two parameters are the results of an empiric analysis of the
18   Tomcat maxTime parameter behavior under different resource availability
19 */
20 (defglobal ?*freeMemLowerLimit* = 10000000)
21 (defglobal ?*usedSwapUpperLimit* = 30000000)
22 /*
23   Points to the java class
24 */
25 (deftemplate TomcatParameter
26   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.TomcatParameter))
27 )
28 /*
29   Points to the java class
30 */
31 (deftemplate OSParameter
32   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.OSParameter))
33 )
34 /*
35   Points to the java class
36 */
37 (deftemplate Bottleneck
38   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.Bottleneck))
39 )
40 /*
41   This rule basically states that if the maxTime is under the limit, then we have no problem
42 */
43 (defrule noproblem
44   (TomcatParameter {maxTime <= ?*qosmaxtimelimit*})
45   ?botfact <- (Bottleneck {havingProblem == TRUE})
46   =>
47   (bind ?*counter* 1)
48   (modify ?botfact (havingProblem FALSE))
49 )
50 /*
51   This rule basically states that if the maxTime is over the limit,
52   but the counter has not passed 3, then increase the counter and reset maxTime
53 */
54 (defrule increasecounter
55   ?fact <- (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
56   (test (<= ?*counter* 3))
57   =>
58   (bind ?*counter* (+ ?*counter* 1))
59 )
60 /*
61   This rule basically states that if the maxTime is over the limit,
62   and if there is little free physical memory and quite a lot of swapping takes place,
63   then memory is the bottleneck
64 */
65 (defrule dosmth
66   ?botfact <- (Bottleneck {memory == FALSE})
67   (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
68   (OSParameter {freePhysicalMem < ?*freeMemLowerLimit* && usedSwapMem > ?*usedSwapUpperLimit*})
69   (test (> ?*counter* 3))
70   =>
71   (modify ?botfact (memory TRUE))
72   (modify ?botfact (havingProblem TRUE))
73   (bind ?*counter* 1)
74 )
75 /*
76   This rule basically states that if the maxTime is over the limit,
77   but there is some free physical memory and there is little swapping taking place,
78   then memory is not the bottleneck
79 */
80 (defrule dosmthmemnoprob
81   (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
82   (OSParameter {freePhysicalMem >= ?*freeMemLowerLimit* && usedSwapMem <= ?*usedSwapUpperLimit*})
83   ?botfact <- (Bottleneck {memory == TRUE || havingProblem == FALSE})
84   (test (> ?*counter* 3))
85   =>
86   (modify ?botfact (memory FALSE))
87   (modify ?botfact (havingProblem TRUE))
88   (bind ?*counter* 1)
89 )

Listing 5.24: The Jess rule

The results of the bottleneck determination performed by the rule engine are saved in the Bottleneck bean. The results are read from this bean, and if memory turns out to be the bottleneck, a message containing this information is sent by the VM Logic module to the Actuator module of the VM Manager. The Actuator module then sends a request for additional memory to the Cluster Manager.
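
The parsing step is not shown as a listing. A minimal sketch of how the Bottleneck fact returned by executeRules() might be inspected is given below; the accessor names isHavingProblem() and isMemory() are assumptions.

import java.util.List;

// Hypothetical sketch: inspect the fact list returned by executeRules() and
// report whether memory was identified as the cause of the bottleneck.
private boolean memoryBottleneckDetected(List results) {
    for (Object fact : results) {
        if (fact instanceof Bottleneck) {
            Bottleneck bottleneck = (Bottleneck) fact;
            return bottleneck.isHavingProblem() && bottleneck.isMemory();
        }
    }
    return false;
}

If such a check returns true, the VM Logic builds a corresponding message and hands it to the Actuator module of the VM Manager.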

5.4 The Cluster Controller

The Cluster Logic implements the functionality of the cluster controller. For the purpose of this thesis, the two algorithms described in Section 4.3 were implemented. Both algorithms are contained in the Cluster Logic component, but only one of them can be used at a time. As can be observed in Listing 5.25, the choice is made based on the controller_algorithm parameter (line 12), which specifies which algorithm should be used. In the example in Listing 5.25, Algorithm B (kbfs_algorithm) is used. As lines 15-16 show, each algorithm can have its own parameters.

1  ######################################
2  # Cluster Logic Module configuration #
3  ######################################
4  cluster_logic = de.fh_wiesbaden.cs.vs.manager.modules.logic.cluster_logic.ClusterLogic
5
6  #reference to the cluster actuator module
7  cluster_logic.actuator_module_reference = cluster_actuator
8
9  #the controller algorithm to use, choose one of the following:
10 # simple_greedy_algorithm
11 # kbfs_algorithm
12 cluster_logic.controller_algorithm = kbfs_algorithm
13
14 #parameters for the kbfs algorithm
15 cluster_logic.kbfs_algorithm.k = 3
16 cluster_logic.kbfs_algorithm.iterations = 3

Listing 5.25: The configuration parameters for the Cluster Logic
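
The dispatch on the controller_algorithm parameter is not reproduced as a listing. Conceptually, the Cluster Logic could select the controller roughly as sketched below; the method name startController() and the controllerAlgorithm field are assumptions, while the two algorithm methods are the ones shown in Listings 5.26 and 5.28.

// Hypothetical sketch of the algorithm selection inside the Cluster Logic.
// controllerAlgorithm holds the value of cluster_logic.controller_algorithm.
private void startController(String vmName, String physicalHost, Long addMemory) {
    if ("simple_greedy_algorithm".equals(controllerAlgorithm)) {
        startSimpleGreedyAlgorithm(vmName, physicalHost, addMemory);
    } else if ("kbfs_algorithm".equals(controllerAlgorithm)) {
        startKbfsAlgorithm(vmName, physicalHost, addMemory);
    } else {
        logger.error(this.moduleName, "unknown controller algorithm: " + controllerAlgorithm);
    }
}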


In the following subsections, the implementation of both algorithms is described.

5.4.1 Algorithm A

Of the two algorithms, Algorithm A is the simpler one. Its implementation is straightforward and shown in Listing 5.26. The implementation of Algorithm A is contained in a single method, called startSimpleGreedyAlgorithm(). This method receives as parameters the name of the virtual machine requesting more memory, the name of the physical machine on which the virtual machine is provisioned and the amount of additional memory required by the virtual machine. In lines 10-11, the total amount of memory required by the virtual machine is computed.

1  private void startSimpleGreedyAlgorithm(
2      String vmName, String physicalHost, Long addMemory){
3
4      long localAvailableMemory = ClusterResourceView.getFreeMemory(
5          physicalHost).longValue();
6
7      String remoteHost;
8      long remoteAvailableMemory;
9
10     long necessaryMemory = ClusterResourceView.getVMMemory(vmName).longValue()
11         + addMemory.longValue();
12
13     if (localAvailableMemory >= addMemory){
14         try {
15             localIncreaseVMMemory(vmName, physicalHost, necessaryMemory);
16         } catch (JMSException e){
17             logger.error(this.moduleName, e.getLocalizedMessage());
18         }
19     } else {
20         Enumeration physicalHosts = ClusterResourceView.getPhysicalHostNames();
21
22         physicalHosts = sortBasedOnUtilization(physicalHosts);
23
24         while (physicalHosts.hasMoreElements()) {
25             remoteHost = physicalHosts.nextElement();
26             remoteAvailableMemory = ClusterResourceView.getFreeMemory(
27                 remoteHost).longValue();
28
29             if (remoteAvailableMemory >= necessaryMemory){
30                 try {
31                     migrateVmOnMachine(vmName, physicalHost, remoteHost);
32
33                     Thread.sleep(MAX_MIGRATION_TIME);
34
35                     localIncreaseVMMemory(vmName, remoteHost, necessaryMemory);
36                 } catch (Exception e){
37                     logger.error(this.moduleName, e.getLocalizedMessage());
38                 }
39                 break;
40             }
41         }
42     }
43 }

Listing 5.26: Algorithm A

If the additional amount of memory required by the virtual machine is locally available (line 13), then the total amount of memory required by the virtual machine is allocated locally, from the physical machine on which the virtual machine is currently provisioned (line 15). If this is not the case, then the virtual machine is migrated to the first physical machine (line 31) that provides enough memory (line 29). After the migration is performed, the memory allocation of the virtual machine is increased to the required value. Note that the design of Algorithm A from Section 4.3 specifies that the physical machine with the highest resource utilization which still provides enough resources must be chosen. This is ensured in line 22 by sorting the list of physical machines in decreasing order of resource utilization, i.e., in increasing order of free memory.
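
The sortBasedOnUtilization() helper is not shown in Listing 5.26. A minimal sketch is given below, under the assumption that utilization is approximated by the amount of free memory, so that a host with less free memory counts as more utilized; the generic type parameters are added for clarity.

import java.util.Collections;
import java.util.Comparator;
import java.util.Enumeration;
import java.util.List;

// Hypothetical sketch of sortBasedOnUtilization(): orders the physical hosts
// by increasing free memory, i.e. by decreasing resource utilization, so that
// the most utilized host that still fits the enlarged VM is picked first.
private Enumeration<String> sortBasedOnUtilization(Enumeration<String> physicalHosts) {
    List<String> hosts = Collections.list(physicalHosts);
    hosts.sort(Comparator.comparingLong(
            host -> ClusterResourceView.getFreeMemory(host).longValue()));
    return Collections.enumeration(hosts);
}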

5.4.2 Algorithm B

Of the two controller algorithms, Algorithm B is the more complex one. The encoding of the virtual machine to physical machine mapping as a state, together with all the resources in the cluster, is implemented by the class ClusterState as described in Section 4.3. An extract of the ClusterState.java file is presented in Listing 5.27, showing the attributes and the two constructors of the class. Each state has a parent state (line 3), from which this state was generated by migrating a virtual machine (line 4) from one physical machine (line 5) to another physical machine (line 6) in the cluster. The cost attribute represents the number of migrations performed from the initial state to reach the current state. Furthermore, a state has a state utility (line 8) and a global profit (line 9), calculated by means of the formulas presented in Section 4.3. The resourceView attribute corresponds to the global view of the resources in the cluster, while allocationView is a helper data structure which encodes only the virtual machine to physical machine mapping. Although redundant, this information speeds up the execution of the algorithm.


1  public class ClusterState implements Comparable {
2
3      private ClusterState parent;
4      private String migratedVM;
5      private String fromPyhsical;
6      private String toPhysical;
7      private int cost;
8      private double stateUtility;
9      private double globalProfit;
10     private Hashtable>> resourceView;
11     private Hashtable allocationView;
12
13     // for the initial state
14     public ClusterState(
15         Hashtable>> resourceView,
16         Hashtable allocationView){
17
18         this.parent = this;
19         this.cost = 0;
20         this.allocationView = allocationView;
21         this.resourceView = resourceView;
22         this.globalProfit = calcuateGlobalProfit();
23         this.stateUtility = calculateStateUtility();
24     }
25
26     public ClusterState(
27         ClusterState parent, String vmName, String sourcePhysicalName,
28         String destinationPhysicalName){
29
30         this.parent = parent;
31         this.migratedVM = new String(vmName);
32         this.fromPyhsical = new String(sourcePhysicalName);
33         this.toPhysical = new String(destinationPhysicalName);
34         this.resourceView = adaptResourceView(vmName, sourcePhysicalName,
35             destinationPhysicalName);
36         this.allocationView = adaptAllocationView(vmName, sourcePhysicalName,
37             destinationPhysicalName);
38         this.cost = parent.getCost() + 1;
39         this.globalProfit = calcuateGlobalProfit();
40         this.stateUtility = calculateStateUtility();
41     }
42     ...
43 }

Listing 5.27: Extract from ClusterState.java

The constructor in lines 14-24 of Listing 5.27 is used to create the initial state. The initial state is its own parent (line 18), has no migration costs (line 19), and receives an allocationView (line 20) and a resourceView (line 21) as parameters of the constructor. Based on these attributes, the global profit (line 22) and the utility of the state (line 23) are calculated.


The constructor in lines 26-41 of Listing 5.27 is used to create a child state, given a certain parent state and the identifier of a virtual machine which is migrated from a certain source physical machine to another destination physical machine in the cluster. These are all parameters of the constructor, as shown in lines 26-28. Inside the constructor, the attributes are set to their corresponding values, either directly as in lines 30-33, or by calculating the new values of these attributes as in lines 34-40.

The encoding provided by the ClusterState class is then used by Algorithm B in the startKbfsAlgorithm() method, which is shown in Listing 5.28. The beginning of this method is similar to that of the startSimpleGreedyAlgorithm() method. In fact, if the necessary memory is locally available (line 10), this method will locally allocate the necessary memory (line 12) and return, thus behaving just like the startSimpleGreedyAlgorithm() method. It is when this additional amount of memory is not locally available (line 16) that the differences between the two algorithms become clear. First, two vectors are created, the first containing the states (nodes) that were already expanded (the closedList, line 17) and the second containing the states that were obtained as a result of this expansion (the openList, line 18). Next, the global resource view needs to be updated (lines 21-24), such that it reflects the additional amount of resources required by the virtual machine in question. An initial state is generated (line 26) which, in the terms used in Section 4.3, is invalid, since the total amount of memory required by the virtual machines on one of the physical machines is bigger than the total amount of memory available on that machine. This initial state is then expanded and its children are added to the openList (line 27). Lines 35-39 represent the KBFS core of Algorithm B. The best K children (lines 29 and 35) are expanded and their children are added to the openList. This process is repeated a specific number of times, determined by the kbfsIterations variable. Based on the configuration of the managed infrastructure and the constraints set on the total migration time, the value of the kbfsIterations variable can be determined so that all the states added to the open list have acceptable total migration costs. The values of the kbfsK and kbfsIterations variables are read from the configuration file of the Cluster Logic module, as shown in Listing 5.25, lines 15 and 16 respectively.


1  private void startKbfsAlgorithm(String vmName, String physicalHost, Long addMemory){
2
3      long localAvailableMemory = ClusterResourceView.getFreeMemory(
4          physicalHost).longValue();
5      String remoteHost = "";
6      long remoteAvailableMemory;
7      long necessaryMemory = ClusterResourceView.getVMMemory(vmName).longValue()
8          + addMemory.longValue();
9
10     if (localAvailableMemory >= addMemory){
11         try {
12             localIncreaseVMMemory(vmName, physicalHost, necessaryMemory);
13         } catch (JMSException e){
14             logger.error(this.moduleName, e.getLocalizedMessage());
15         }
16     } else {
17         Vector closedList = new Vector();
18         Vector openList = new Vector();
19
20         //do this because the vm has now a new size
21         Hashtable>> tmpResourceView
22             = updateResourceView(ClusterResourceView.getResourceView());
23         Hashtable tmpAllocationView
24             = updateAllocationView(ClusterResourceView.getAllocationView());
25
26         ClusterState initialState = new ClusterState(tmpResourceView, tmpAllocationView);
27         addChildrentoTheOpenList(initialState, openList, closedList);
28
29         int limit = Integer.parseInt(kbfsK);
30
31         if (openList.size() < limit){
32             limit = openList.size();
33         }
34
35         for (int j = 0; j < Integer.parseInt(kbfsIterations); j++) {
36             for (int i = 0; i < limit; i++) {
37                 addChildrentoTheOpenList(openList.get(i), openList, closedList);
38             }
39         }
40
41         closedList.addAll(openList);
42         ClusterState finalState = initialState;
43
44         for (ClusterState state : closedList){
45             if (state.getStateUtility() > finalState.getStateUtility()) {
46                 finalState = state;
47             }
48         }
49
50         Stack instructions = new Stack();
51
52         boolean running = true;
53         while (running){
54             if (finalState.getCost() != 0) {
55                 instructions.push(finalState.getMigratedVM());
56                 instructions.push(finalState.getFromPyhsical());
57                 instructions.push(finalState.getToPhysical());
58                 finalState = finalState.getParent();
59             } else {
60                 running = false;
61             }
62         }
63
64         while (!instructions.empty()) {
65             remoteHost = instructions.pop();
66             physicalHost = instructions.pop();
67             vmName = instructions.pop();
68             try {
69                 migrateVmOnMachine(vmName, physicalHost, remoteHost);
70             } catch (JMSException e){
71                 logger.error(this.moduleName, e.getLocalizedMessage());
72             }
73         }
74
75         try {
76             Thread.sleep(SLEEP_TIME);
77             localIncreaseVMMemory(vmName, remoteHost, necessaryMemory);
78         } catch (Exception e){
79             logger.error(this.moduleName, e.getLocalizedMessage());
80         }
81     }
82 }

Listing 5.28: The startKbfsAlgorithm() method

After this process finishes, the state with the highest utility is selected (lines 44-48). Next, the instructions for reaching this state must be determined. This is done by tracing the way back from this state to the initial state. Along the way, the steps are pushed onto a stack (lines 53-62) so that, once the initial state is reached, they can be popped in reverse order as migration instructions (lines 64-73). After all the migrations are performed, the necessary memory is allocated to the virtual machine in question (line 77).
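
The helper addChildrentoTheOpenList(), which generates the successor states, is not reproduced as a listing. A rough sketch of what it might do is shown below, under the assumptions that one child state is created for every possible migration of a single virtual machine to a different physical machine, that the open list is kept sorted by decreasing state utility (so that the first K entries are the K best states), and that ClusterState offers the accessors getAllocationView() and getPhysicalHostNames(), which are not part of Listing 5.27.

import java.util.Map;
import java.util.Vector;

// Hypothetical sketch of addChildrentoTheOpenList().
private void addChildrentoTheOpenList(ClusterState parent,
        Vector<ClusterState> openList, Vector<ClusterState> closedList) {

    closedList.add(parent);   // the parent has now been expanded
    openList.remove(parent);

    // assumed accessor: VM name -> name of the hosting physical machine
    Map<String, String> allocation = parent.getAllocationView();
    for (Map.Entry<String, String> entry : allocation.entrySet()) {
        String vmName = entry.getKey();
        String sourceHost = entry.getValue();
        // assumed accessor: names of all physical machines in the cluster
        for (String destinationHost : parent.getPhysicalHostNames()) {
            if (!destinationHost.equals(sourceHost)) {
                // one child state per possible single migration
                openList.add(new ClusterState(parent, vmName, sourceHost, destinationHost));
            }
        }
    }

    // keep the open list sorted by decreasing state utility, so that
    // openList.get(0) .. openList.get(limit - 1) are the best candidates
    openList.sort((a, b) -> Double.compare(b.getStateUtility(), a.getStateUtility()));
}

The actual implementation may expand states and order the open list differently; what Listing 5.28 relies on is only that expanded states end up in the closed list and that the first limit entries of the open list are the most promising candidates.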


5.5 Code Statistics

In Table 5.1, the complexity of this implementation, measured in lines of code (LOC), is presented. For the sake of completeness, both the Self-Manager core and the JMX module are present in this table, despite being developed outside of this thesis. This data does not include the lines of the configuration files.

Component           Module           Lines of Code
Physical Manager    Monitor          350
                    Actuator         324
                    Xen              782
                    Total            1456
VM Manager          VM Monitor       380
                    SL Monitor       442
                    JMX              (2273)
                    VM Logic         1011
                    Actuator         393
                    Total            4499
Cluster Manager     Monitor          364
                    Actuator         274
                    Cluster Logic    1413
                    Total            2051
Self-Manager Core                    (4961)
Total                                5733 (12967)

Table 5.1: Code statistics

Chapter 6

Evaluation

The previous two chapters described the design and the implementation of the framework, the bottleneck determination mechanism and the two controller algorithms. In this chapter, the results of this concept and its implementation are evaluated. First, the testbed used for this evaluation is presented in Section 6.1. Next, in Section 6.2, the management framework, including the bottleneck determination mechanism, is evaluated. Finally, in Section 6.3, the two controllers are evaluated separately and compared to each other.

6.1 The Testbed

The framework was tested in a lab environment using a Java implementation of the TPC-W benchmarking standard [Mar01, TPCb]. The testbed used for this evaluation is depicted in Figure 6.1. The TPC-W benchmark can be used to simulate the activities of a business-oriented transactional Web server. The Java implementation of TPC-W is basically a collection of servlets that implement this benchmark for J2EE application server/servlet containers. For this evaluation, the Java TPC-W e-commerce application suite is hosted inside a Tomcat server version 5.5.25. The Tomcat server runs on top of a Fedora Core 7 that runs inside a virtual machine managed by the VM Manager. Two Fujitsu Siemens Esprimo physical machines were used, each equipped with an Intel Core 2 Duo 1.86 GHz processor and 2 GB of RAM. They are operated by Fedora Core 7 containing the Xen hypervisor version 3.1. One of the two physical machines hosts the virtual machine, but both of them have access to the disk image of the virtual machine via NFS. The two physical machines are managed by two different Physical Managers.


[Figure: the testbed consists of two Xen 3.1 physical machines running Fedora Core 7 (Dom 0), one of which hosts the virtual machine (Fedora Core 7) with Tomcat 5.5.25 and the TPC-W Java web application, and a third physical machine hosting the TPC-W Java load generator, the Cluster Manager, the two Physical Managers and the VM Manager.]

Figure 6.1: The testbed

The entire cluster, here formed by the two physical machines, is managed by the Cluster Manager. The two Physical Managers, the VM Manager and the Cluster Manager are hosted on a third physical machine, which also hosts the TPC-W load generator. This third machine is not part of the managed infrastructure and thus not part of the cluster. The TPC-W load generator is a Java application and is also part of the Java implementation of the TPC-W benchmarking standard.

6.2 Evaluation of the Management Framework

In this part of the evaluation, the framework has been tested using the testbed described in the previous section. This consists of two steps: (1) test whether the VM Manager is capable of discovering a memory bottleneck and (2) test whether the components of the framework work together in order to allocate additional memory when such a bottleneck has been determined. For this purpose, tests were first conducted in order to determine a scenario in which memory is the bottleneck. The results of these empirical observations are shown in Figure 6.2. Here, a logarithmic scale is used to present the results of the five tests that were performed.

These tests consisted of measuring the response times of the new_products servlet, which is part of the Java TPC-W implementation. The load for this servlet was generated using the TPC-W Java load generator, which simulated the shopping actions of 30 users. For each test, 80, 90, 100, 110 and 120 MB of memory were allocated to the virtual machine. The response times of the servlet shown in Figure 6.2 clearly decrease with the increase in memory allocation. Thus, we were able to determine a scenario where the amount of memory allocated to a virtual machine can be the cause of a bottleneck.

[Figure: response times of the new_products servlet over time, plotted on a logarithmic scale for memory allocations of 80, 90, 100, 110 and 120 MB]

Figure 6.2: Empirical observations

After conducting these experiments, we set the memory allocation of our virtual machine to 70 MB and started the TPC-W Java load generator. Again, the load generator was set up to simulate 30 active users shopping on the e-commerce site hosted inside the Tomcat server. The results of this evaluation are shown in Figure 6.3. To determine when an SLO violation takes place, the same rules presented in Section 5.3 were used. As can be observed from Figure 6.3, four SLO violations took place within two minutes, caused by the increase in response times of the Tomcat server. While the first three SLO violations were tolerated, when the fourth SLO violation occurred, the VM Manager reacted and requested more memory. This is in accordance with the rules presented in Section 5.3. Up to the point where the fourth SLO violation occurred, the virtual machine had virtually no free memory available.


[Figure: influence of memory allocation on service response time, showing the VM OS free memory, the VM OS used swap and the measured SLO violations per minute over time]

Figure 6.3: Influence of available memory on SLO violations

Furthermore, the Linux system hosted in the virtual machine was heavily swapping. As can be observed from Figure 6.3, after the fourth SLO violation occurred, an additional 100 MB of free memory became available to the virtual machine. This demonstrates that the VM Manager was capable of determining a bottleneck and requesting additional memory, and that the components of the framework were able to work together to allocate additional memory to the requesting virtual machine. After the memory allocation, the level of free memory sank to approximately 80 MB, since 20 MB were allocated to the Tomcat server. The swapping level remained the same, which is common Linux behaviour. After the additional memory was allocated, no SLO violation took place over the rest of the interval. This shows that the bottleneck determined by the VM Manager was truly a bottleneck and thus that the bottleneck determination mechanism works correctly, at least for scenarios where memory is the cause of a bottleneck (other scenarios were not implemented and thus could not be tested).

6.3 Evaluation of the Cluster Controllers

The cluster controllers, as described in Sections 4.3 and 5.4, were first tested using the previously described testbed. The results shown in Figure 6.3 were obtained using Algorithm A as the controller. In that scenario, Algorithm A determined that there was enough memory locally available, and thus decided to allocate this memory locally. The same results were obtained using Algorithm B for this scenario. Next, the scenario was modified so that the necessary memory was not locally available, but only on the second physical machine. In this case, only one solution is possible: to migrate the virtual machine to the other physical machine. Both algorithms found this solution and performed a migration. The total migration time was under five seconds.

However, evaluating these algorithms in a cluster which only contains two physical machines is of little relevance. In both scenarios that were designed (with and without migration), the limitations of Algorithm A compared to Algorithm B were not visible. Because of the architecture of the framework and the internal architecture of the Cluster Logic, it is possible to isolate the two algorithms and test them by means of simulation. We have designed two scenarios for the simulation, in which migration must be performed in order to obtain a valid solution.

[Bar charts: initial and final memory allocation of VM 1 and VM 2 on Host 1 and Host 2; in the final allocation both VMs reside on Host 2]

Figure 6.4: A scenario comprising two physical machines

In the first scenario, two physical machines named Host 1 and Host 2 each provision one VM, VM 1 and VM 2 respectively (see Figure 6.4). Host 1 possesses a total of 500 MB RAM for VMs, while Host 2 provides 1000 MB. At the beginning, VM 1 uses 300 MB, while VM 2 uses 400 MB. As the TPC-W load generator stresses VM 1, the VM Manager responsible for VM 1 requests additional memory (in this case an extra 300 MB) from the Cluster Manager. In this scenario, both Cluster Logic algorithms deliver the same solution, namely to migrate VM 1 from Host 1 to Host 2, as Host 1 is not able to provide an extra 300 MB of memory.


[Bar charts: initial, intermediate and final memory allocation of VM 1, VM 2 and VM 3 on Host 1, Host 2 and Host 3; VM 2 is first migrated to Host 3, then VM 1 is migrated to Host 2]

Figure 6.5: A scenario comprising three physical machines

The second scenario comprises three physical machines: Host 1, Host 2 and Host 3 (see Figure 6.5, left chart). At the beginning, each of them provisions one VM: VM 1, VM 2 and VM 3 respectively. In this scenario, Host 1 possesses a total of 500 MB RAM for VMs, Host 2 600 MB and Host 3 1100 MB, while VM 1 uses 300 MB, VM 2 400 MB and VM 3 600 MB. Again, the VM Manager responsible for VM 1 requests 300 MB of additional memory. Unlike in the first scenario, the two algorithms perform differently. In fact, Algorithm A is not even capable of finding a valid solution, as neither Host 2 nor Host 3 has 600 MB of free memory available. This scenario clearly shows the limitations of this simple algorithm. The more sophisticated Algorithm B is able to find an optimal solution: VM 2 is migrated to Host 3 before VM 1 is migrated to Host 2. As a side effect, this solution results in a high resource utilization on both Host 2 and Host 3, while Host 1 provisions no VM and can thus be set offline. The evolution of Algorithm B from the initial state to this final state can be observed in Figure 6.5.
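
To make the infeasibility for Algorithm A explicit: in the initial state, Host 2 has 600 - 400 = 200 MB free and Host 3 has 1100 - 600 = 500 MB free, while the enlarged VM 1 requires 300 + 300 = 600 MB, so no single host can accept it directly. After Algorithm B migrates VM 2 to Host 3, which leaves 500 - 400 = 100 MB free there, Host 2 offers its full 600 MB and can host the enlarged VM 1.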

Chapter 7

Summary

The goal of this thesis was to design and evaluate self-management approaches for virtual machine-based environments. To this end, an in-depth state-of-the-art analysis was first performed. From this analysis, it was observed that although various controller approaches exist for the autonomic management of virtualization-based environments, they use different architectural approaches and manage different infrastructures, and are thus impossible to compare with each other. It was argued that in order to design and evaluate various controller approaches, a generic framework is required in which different controllers can be plugged in, thus permitting an objective form of evaluation. For such a framework, a set of requirements was defined. The first contribution of this thesis is the design and implementation of such a framework using the principles of the autonomic computing paradigm. The resulting framework exhibits a modular architecture and consists of three types of components that manage the three types of entities found in the cluster:

• The VM Manager manages a virtual machine

• The Physical Manager manages a physical machine

• The Cluster Manager manages the entire cluster

These components are arranged in a hierarchical fashion, with the higher-ranked Cluster Manager placed over the lower-ranked VM Managers and Physical Managers. A VM Manager is responsible for signaling when the virtual machine to which it is associated requires more resources, while the Physical Managers allocate additional resources based on decisions taken by the Cluster Manager.


Three types of modules are present in the framework: monitor, actuator and “intelligent” modules. Furthermore, two different types of “intelligent” modules have been introduced:

• The bottleneck determination mechanism, located in a VM Manager

• The global cluster controller, located in the Cluster Manager

The modular, loosely coupled approach used to design the framework had the effect of increasing the complexity of the implementation. The main reason for this was the asynchronous form of communication between modules that the architecture imposed. However, precisely this architectural approach is what provides the generic character of the framework. Technology-specific modules like the JMX module or the Xen module can easily be replaced with modules developed for other technologies. Furthermore, various cluster controllers and bottleneck determination mechanisms can be plugged into the framework and evaluated.

The bottleneck determination mechanism is responsible for determining the resource causing a bottleneck in a virtual machine. This is done by monitoring both the virtual machine’s resource utilization and the response times of the service running inside the virtual machine. Thus, a service level management approach is used instead of a simple, resource-oriented approach. Each of these “intelligent” modules is loosely coupled with the framework and can be plugged in.

For the bottleneck determination mechanism, three possible approaches were analyzed:

• Rule-based

• Case-based reasoning

• Time series

The rule-based approach was selected, designed and implemented. The resulting bottleneck determination mechanism represents the second contribution of this thesis. The evaluation performed and described in Chapter 6 shows that this mechanism works well for a scenario where memory is the cause of a bottleneck.

One of the main features of the framework is that it reduces the complex task of managing a virtualization-based environment to a resource allocation problem. This level of abstraction significantly simplified the design of cluster controllers. On the other hand,

using such a two-layered hierarchical architecture was a source of overhead for the implementation of the framework.

To develop such a controller, an analysis was first performed. The analysis showed that the resource allocation problem that needs to be solved by the controllers belongs to the knapsack family of problems and is NP-hard. It is commonly believed that for such problems, there is no algorithm that finds an optimal solution in polynomial time. For our problem, this means that either heuristics or approximation algorithms can be used to implement the controller, and the results of the algorithm used by the controller are only approximate solutions.

Different techniques were analyzed with respect to how they can be used for the design of cluster controllers. Finally, two different controllers were designed, called Algorithm A and Algorithm B. Both of them use heuristic-based strategies and were implemented and evaluated. The design and implementation of these controllers represents the third contribution of this thesis.

The evaluation of these controllers proved their fitness but also indicated some limitations. While Algorithm A is not always capable of finding a solution when solutions exist, Algorithm B addresses this limitation and is always capable of finding a solution if one exists. However, both Algorithm A and Algorithm B are heuristic-based algorithms and thus offer no guaranteed performance ratio.

Finally, it is necessary to mention the virtualization technology-agnostic character of the framework. Although the implementation of the framework uses the Xen hypervisor, adapters for other virtualization technologies can easily be developed and plugged into the framework. This is the fourth contribution of this thesis and represents an important step towards the management of heterogeneous virtual machine-based data centers.

The generic, modular approach of the framework provides great support for extending the results of this thesis. In the following, possible ways of continuing this work are presented.

The implementation of the framework dealt only with one type of resource, namely memory. Future work could extend the framework so that it deals with all four resource types defined in Section 3.2: memory, CPU, network bandwidth and I/O bandwidth.

Furthermore, as part of the implementation, adapters for Xen were developed. Part of future work should be developing adapters for other virtualization technologies, like VMware ESX, Virtuozzo and those that will follow. This could also include adapters based on the now emerging standards from the DMTF System Virtualization, Partitioning,

127 Chapter 7. Summary and Clustering Working Group (SVPC WG). Similarly, adapters for various monitoring technologies can be incorporated. In the design and implementation chapters, a rule-based bottleneck determination mech- anism was presented. Besides extending this rule-based approach to deal with all four types of resources, one could also design and evaluate various other approaches, like the ones analyzed in Section 3.4. For example, it might be possible to use time series analy- sis to forecast a future bottleneck, and thus the management system can proactively react before a bottleneck occurs and not only after the fact. One limitation of the current framework implementation is that it only deals with under- provisioned virtual machines that require additional resources, but not with overprovi- sioned virtual machines. However, extending the framework with this capability simply means extending the bottleneck determination mechanism and the cluster controller with a mechanism for detecting resources that are overprovisioned respectively a mechanism to deallocate the amount of resources that are not required. Such future work should be based on the analysis of this problem performed in Section 3.5. Finally, in Section 4.3 we have seen that both Algorithm A and Algorithm B have a serious limitation: they are heuristic-based algorithms. However, we have shown in 3.3 that our problem belongs to the knapsack family of problems. As seen in Section 2.4, knapsack problems have approximation algorithms. This means that, if our problem can be reduced to a concrete knapsack problem that has an approximation algorithm, then we have found an algorithm for our problem that has a guaranteed performance ratio. Future work could thus design, implement and evaluate a controller based on an approximation algorithm. One way to do this is to first reduce the problem to a multidimensional bin packing problem. This can be achieved by using physical machines that have identical hardware configurations. After this reduction, one of the existent multidimensional bin packing algorithms [CKP01] can be adapted and implemented.

Chapter 8

Bibliography

[AA06] ADAMS, Keith ; AGESEN, Ole: A comparison of software and hardware techniques for x86 virtualization. In: ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems. ACM Press. – ISBN 1595934510, 2–13

[ACK+00] AUSIELLO, G. ; CRESCENZI, P. ; KANN, V. ; GAMBOSI, Giorgio ; MARCHETTI-SPACCAMELA, Alberto M.: Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 2000. – ISBN 3540654313

[AMD05] AMD: AMD64 Virtualization Codenamed "Pacifica" Technology: Secure Virtual Machine Architecture Reference Manual, May 2005

[BCS06] BANSAL, Nikhil ; CAPRARA, Alberto ; SVIRIDENKO, Maxim: Improved approximation algorithms for multidimensional bin packing problems. In: FOCS ’06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). IEEE Computer Society, 697–708

[BDF+03] BARHAM, Paul ; DRAGOVIC, Boris ; FRASER, Keir ; HAND, Steven ; HARRIS, Tim ; HO, Alex ; NEUGEBAUER, Rolf ; PRATT, Ian ; WARFIELD, Andrew: Xen and the art of virtualization. In: SOSP ’03: Proceedings of the nineteenth ACM symposium on Operating systems principles. ACM Press. – ISBN 1581137575, 164–177

[BJR94] BOX, George ; JENKINS, Gwilym M. ; REINSEL, Gregory: Time Series Analysis: Forecasting & Control (3rd Edition). Prentice Hall. – ISBN 0130607746

[BKB07] BOBROFF, Norman ; KOCHUT, Andrzej ; BEATY, Kirk: Dynamic Placement of Virtual Machines for Managing SLA Violations, 119–128

[CFH+05] CLARK, Christopher ; FRASER, Keir ; HAND, Steven ; HANSEN, Jacob G. ; JUL, Eric ; LIMPACH, Christian ; PRATT, Ian ; WARFIELD, Andrew: Live Migration of Virtual Machines. In: Proceedings of the 2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI). Boston, MA, May 2005, S. 273–286

[CKP01] CAPRARA, Alberto ; KELLERER, Hans ; PFERSCHY, Ulrich: Approximation Schemes for Ordered Vector Packing Problems. In: Lecture Notes in Computer Science 2129 (2001), 63–?? http://citeseer.ist.psu.edu/470961.html

[DC99] DUDA, Kenneth J. ; CHERITON, David R.: Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In: SOSP ’99: Proceedings of the seventeenth ACM symposium on Operating systems principles Bd. 33. ACM Press. – ISSN 0163–5980, 261–276

[Deb04] DEBUSMANN, Markus: Modellbasiertes Service Level Management verteilter Anwendungssysteme, Universität Kassel, Diss., December 2004

[EGCGJ97] COFFMAN, E. G., Jr. ; GAREY, M. R. ; JOHNSON, D. S.: Approximation algorithms for bin packing: a survey. (1997), S. 46–93. ISBN 0–534–94968–1

[FKK03] FELNER, Ariel ; KRAUS, Sarit ; KORF, Richard E.: KBFS: K-Best-First Search. In: Annals of Mathematics and Artificial Intelligence 39 (2003), Nr. 1, 19–39. http://dx.doi.org/10.1023/A:1024452529781.– DOI 10.1023/A:1024452529781

[FZ05] FROMM, Jochen ; ZAPF, Michael: Selbst-Eigenschaften in verteilten Systemen. In: Praxis der Informationsverarbeitung und Kommunikation (PIK) 2005 (2005), oct, Nr. 4, 189–198. http://www.vs.uni-kassel.de/publications/2005/FZ05


[GC03] GANEK, A. G. ; CORBI, T. A.: The dawning of the autonomic computing era. In: IBM Syst. J. 42 (2003), January, Nr. 1, 5–18. http://portal.acm.org/citation.cfm?id=1014770. – ISSN 0018–8670

[GIYC] GRIT, Laura ; IRWIN, David ; YUMEREFENDI, Aydan ; CHASE, Jeff: Virtual Machine Hosting for Networked Clusters: Building the Foundations for Autonomic Orchestration

[HDPT04] HELLERSTEIN, Joseph L. ; DIAO, Yixin ; PAREKH, Sujay ; TILBURY, Dawn M.: Feedback Control of Computing Systems. John Wiley & Sons, 2004. – ISBN 047126637X

[HJ04] HANSEN, Jacob G. ; JUL, Eric: Self-migration of operating systems. In: EW11: Proceedings of the 11th workshop on ACM SIGOPS European workshop: beyond the PC. ACM Press

[HMG04] HERRMANN, Klaus ; MÜHL, Gero ; GEIHS, Kurt: Self-Management - Potentiale, Probleme, Perspektiven. In: Praxis der Informationsverarbeitung und Kommunikation (PIK) 27 (2004), April, Nr. 2, 74–79. http://www.kbs.cs.tu-berlin.de/publications/fulltext/pik2004.pdf. – (in German). ISBN 3–598–01248–9

[Hor01] HORN, Paul: Autonomic computing: IBM’s Perspective on the State of Information Technology. 2001

[IBM] IBM. – http://www.research.ibm.com/oceanoproject/ (Last visited 20.01.2008)

[Inf] INFOWORLD: Gartner Says Virtualization will Top IT Agenda Until 2010. http://weblog.infoworld.com/virtualization/archives/general_news/

[Int05] INTEL: Intel Virtualization Technology Specification for the IA-32 Intel Architecture, April 2005

[Jac98] JACKSON, Peter: Introduction to expert systems. Boston, MA, USA : Addison-Wesley Longman Publishing Co., Inc., 1998. – ISBN 0–201–87686–8

[Jav] Java RMI. – http://java.sun.com/javase/technologies/core/basic/rmi/index.jsp (Last visited 24.02.2008)


[Kan92] KANN, Viggo: On the Approximability of NP-complete Optimization Problems, Royal Institute of Technology Stockholm, Diss., 1992

[Kar91] KARLOFF, Howard: Linear programming. Cambridge, MA, USA : Birkhauser Boston Inc., 1991. – ISBN 0–8176–3561–0

[KBC+00] KUBIATOWICZ, John ; BINDEL, David ; CHEN, Yan ; CZERWINSKI, Steven ; EATON, Patrick ; GEELS, Dennis ; GUMMADI, Ramakrishna ; RHEA, Sean ; WEATHERSPOON, Hakim ; WELLS, Chris ; ZHAO, Ben: OceanStore: an architecture for global-scale persistent storage. In: SIGARCH Comput. Archit. News 28 (2000), Nr. 5, S. 190–201. http://dx.doi.org/10.1145/378995.379239. – DOI 10.1145/378995.379239. – ISSN 0163–5964

[KC03] KEPHART, J. O. ; CHESS, D. M.: The Vision of Autonomic Computing. In: Computer 36 (2003), S. 41–50

[KPP04] KELLERER, H. ; PFERSCHY, U. ; PISINGER, D.: Knapsack Problems. Springer, Berlin, Germany, 2004

[LL02] LOHMAN, Guy M. ; LIGHTSTONE, Sam S.: SMART: making DB2 (more) autonomic. In: VLDB ’02: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, 2002, S. 877–879

[LLS02] LYMBEROPOULOS, L. ; LUPU, E. ; SLOMAN, M.: An Adaptive Policy Based Management Framework for Differentiated Services Networks. In: POLICY ’02: Proceedings of the 3rd International Workshop on Policies for Distributed Systems and Networks (POLICY’02). IEEE Computer Society. – ISBN 0769516114

[LMKB01] LUTFIYYA, Hanan ; MOLENKAMP, Gary ; KATCHABAW, Michael ; BAUER, Michael A.: Issues in Managing Soft QoS Requirements in Distributed Systems Using a Policy-Based Framework. In: POLICY ’01: Proceedings of the International Workshop on Policies for Distributed Systems and Networks. Springer-Verlag. – ISBN 3540416102, 185–201

[Mar01] MARDEN, Morris: An Architectural Evaluation of Java TPC-W. In: HPCA ’01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture. Washington, DC, USA : IEEE Computer Society, 2001. – ISBN 0–7695–1019–1, S. 229


[MB06] MENASCE, Daniel A. ; BENNANI, Mohamed N.: Autonomic Virtualized Environments. In: ICAS ’06: Proceedings of the International Conference on Autonomic and Autonomous Systems. Washington, DC, USA : IEEE Computer Society, 2006. – ISBN 0–7695–2653–5, S. 28

[Mel07] MELLOR, Ewan: The Xen-API. April 2007

[MSMW04] MÜLLER-SCHLOER, Christian ; MALSBURG, Christoph von d. ; WÜRT, Rolf P.: Organic Computing. In: Informatik-Spektrum 27 (2004), Nr. 4, 332–336. http://dx.doi.org/10.1007/s00287-004-0409-6. – DOI 10.1007/s00287–004–0409–6

[PG74] POPEK, Gerald J. ; GOLDBERG, Robert P.: Formal requirements for virtualizable third generation architectures. In: Commun. ACM 17 (1974), July, Nr. 7, 412–421. http://dx.doi.org/10.1145/361011.361073. – DOI 10.1145/361011.361073. – ISSN 0001–0782

[Pis95] PISINGER, D.: Algorithms for Knapsack Problems, DIKU, University of Copenhagen, Denmark, Diss., 1995

[pol04] An artificial intelligence perspective on autonomic computing policies. – 3–12 S.

[pro00] Process migration. In: ACM Comput. Surv. 32 (2000), September, Nr. 3, 241–299. http://dx.doi.org/10.1145/367701.367728.– DOI 10.1145/367701.367728. – ISSN 0360–0300

[PSZ+07] PADALA, Pradeep ; SHIN, Kang G. ; ZHU, Xiaoyun ; UYSAL, Mustafa ; WANG, Zhikui ; SINGHAL, Sharad ; MERCHANT, Arif ; SALEM, Kenneth: Adaptive control of virtualized resources in utility computing environments. In: EuroSys ’07: Proceedings of the 2007 conference on EuroSys. ACM Press. – ISSN 0163–5980, 289–302

[RG05] ROSENBLUM, M. ; GARFINKEL, T.: Virtual machine monitors: current technology and future trends. In: Computer 38 (2005), Nr. 5, 39–47. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430630

[RN04] RUSSEL, Stuart ; NORVIG, Peter: Künstliche Intelligenz. Pearson Studium. – ISBN 3827370892

[RRX+06] RUTH, P. ; RHEE, Junghwan ; XU, Dongyan ; KENNELL, R. ; GOASGUEN, S.: Autonomic Live Adaptation of Virtual Computational Environments in a Multi-Domain Infrastructure. In: Autonomic Computing, 2006. ICAC ’06. IEEE International Conference on, 5–14

[RUB] RUBiS. – http://rubis.objectweb.org/ (Last visited 11.12.2007)

[San] Sandia National Laboratories. – http://herzberg.ca.sandia.gov/ (Last visited 31.01.2008)

[Sch07] SCHMID, Markus: Ein Ansatz fuer das Service Level Management in dynamischen Architekturen. In: BRAUN, T. (Hrsg.) ; CARLE, G. (Hrsg.) ; STILLER, B. (Hrsg.): KiVS 2007 - Kommunikation in Verteilten Systemen - Industriebeitraege, Kurzbeitraege und Workshops, VDE Verlag, March 2007. – ISBN 978–3–8007–2980–7, S. 255–266

[SG06] SCHMID, Markus ; GEIHS, Kurt: Self-Organisation in the Context of QoS Management in Service Oriented Architectures. In: BOUDAOUD, Karima (Hrsg.) ; NOBELIS, Nicolas (Hrsg.) ; NEBE, Thomas (Hrsg.): Proceedings of the 13th Annual Workshop of HP OpenView University Association, Hosted by University of Nice at Cote d’Azur May 21 - 24, 2006. Stuttgart : Infonomics-Consulting, May 2006. – ISBN 3000187804, S. 153–164

[SN05] SMITH, J. E. ; NAIR, Ravi: The architecture of virtual machines. In: Computer 38 (2005), Nr. 5, 32–38. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430629

[Ten00] TENNENHOUSE, David: Proactive computing. In: Commun. ACM 43 (2000), May, Nr. 5, 43–50. http://dx.doi.org/10.1145/332833.332837. – DOI 10.1145/332833.332837. – ISSN 0001–0782

[TPCa] TPC-W. – http://www.tpc.org/tpcw/ (Last visited 11.12.2007)

[TPCb] TPC-W Java Implementation. – http://www.ece.wisc.edu/~pharm/tpcw.shtml (Last visited 11.12.2007)

[UNR+05] UHLIG, R. ; NEIGER, G. ; RODGERS, D. ; SANTONI, A. L. ; MARTINS, F. C. M. ; ANDERSON, A. V. ; BENNETT, S. M. ; KAGI, A. ; LEUNG, F. H. ; SMITH, L.: Intel virtualization technology. In: Computer 38 (2005), Nr. 5, 48–56. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430631

[Vaz01] VAZIRANI, Vijay V.: Approximation algorithms. New York, NY, USA : Springer-Verlag New York, Inc., 2001. – ISBN 3–540–65367–8

[Wal02] WALDSPURGER, Carl A.: Memory resource management in VMware ESX server. In: SIGOPS Oper. Syst. Rev. 36 (2002), Nr. SI, 181–194. http://dx.doi.org/10.1145/844128.844146. – DOI 10.1145/844128.844146. – ISSN 0163–5980

[WHW+04] WHITE, S. R. ; HANSON, J. E. ; WHALLEY, I. ; CHESS, D. M. ; KEPHART, J. O.: An architectural approach to autonomic computing. In: Autonomic Computing, 2004. Proceedings. International Conference on, 2–9

[WTKD04] WALSH, W. E. ; TESAURO, G. ; KEPHART, J. O. ; DAS, R.: Utility functions in autonomic systems. In: Autonomic Computing, 2004. Proceedings. International Conference on, 70–77

[ZBG+05] ZHANG, Yuting ; BESTAVROS, Azer ; GUIRGUIS, Mina ; MATTA, Ibrahim ; WEST, Richard: Friendly virtual machines: leveraging a feedback-control model for application adaptation. In: VEE ’05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments. ACM Press. – ISBN 1595930477, 2–12
