Universidade de Aveiro
Departamento de Eletrónica, Telecomunicações e Informática
2020

Dzianis Bartashevich

Orquestração de Serviços Cloud com Componentes Críticos no SKA

Orchestration of Cloud Services with Critical Components in SKA


“The greatest challenge to any thinker is stating the problem in a way that will allow a solution”

— Bertrand Russell


Dissertação apresentada à Universidade de Aveiro para cumprimento dos requisitos necessários à obtenção do grau de Mestre em Engenharia Informática, realizada sob a orientação científica do Doutor João Paulo Barraca, Professor auxiliar do Departamento de Eletrónica, Telecomunicações e Informática da Universidade de Aveiro, e do Doutor Domingos Barbosa (co-orientador), investigador do Instituto de Telecomunicações da Universidade de Aveiro.

Este trabalho foi apoiado por ENGAGE SKA, POCI-01-0145-FEDER-022217, financiado pelo Programa Operacional Competitividade e Internacionalização (COMPETE 2020) e FCT, Portugal.

o júri / the jury

presidente / president
Prof. Doutor Joaquim Manuel Henriques de Sousa Pinto, professor auxiliar da Universidade de Aveiro

vogais / examiners committee
Doutor Nuno Pedro de Jesus Silva, Technical Manager, Critical Software

Prof. Doutor João Paulo Barraca, professor auxiliar da Universidade de Aveiro

agradecimentos / acknowledgements
This work would not have been possible without the contribution and help of many people and entities, whom I have to thank.

Firstly, I would like to thank Professor João Paulo Barraca for giving me the opportunity to work in my field of interest, through which I was able to learn and grow immensely. I thank him for the guidance and motivation throughout this year.

A big thanks to the rest of the research group, namely my co-supervisor Domingos Barbosa and Professor Miguel Bergano, for providing a supportive environment in which I learned every day.

To my girlfriend, Inês Tavares, who has gone above and beyond to help me through the most difficult times and to always encourage me to be better, not to give up, and not to lose faith in myself.

To my Mother, for providing me with this wonderful opportunity to grow and have a purposeful life, and for being a role model and my biggest supporter. And to my grandparents: their pride and faith in me will never be forgotten and will always push me to be the best professional and overall person I can be.

This work was supported by Enabling Green E-science for the Square Kilometre Array Research Infrastructure (ENGAGE SKA), POCI-01-0145-FEDER-022217, and funded by Programa Operacional Competitividade e Internacionalização (COMPETE 2020) and FCT, Portugal.

Palavras Chave computação em nuvem, openstack, virtualização, sla, monitorização, automatização, orquestração, métodos de disponibilidade, estratégias de disponibilidade, métodos de recuperação.

Resumo Esta dissertação propõe métodos de alta disponibilidade para aplicações críticas, a fim de manter a sua função normal e recuperar de falhas inesperadas. As aplicações podem ser desenvolvidas e alojadas para trabalhar no ambiente de nuvem para obter flexibilidade na manutenção, oferecendo também a opção de monitorização. Um sistema de monitorização pode vigiar as métricas do sistema, como o uso de CPU, ou apenas um serviço de aplicativo específico, esteja ele em execução ou não. Além disso, a criação de alarmes no sistema de monitorização permite acionar a notificação sobre uma ocorrência de evento não esperada, ajudando o orquestrador a recuperar a situação do estado crítico. A ocorrência da falha pode acontecer quando uma determinada métrica está acima do limite estabelecido, onde o SLA (Service Level Agreement) é violado. A solução implementada e testada usa a nuvem privada OpenStack como suporte à infraestrutura e, por meio do orquestrador Heat, do sistema de monitorização TICK Stack e de um mecanismo de recuperação, fornece uma solução capaz de monitorizar o estado das aplicações, oferecendo alta disponibilidade. Os resultados dos testes provaram que a solução é capaz de recuperar o serviço em diferentes cenários de teste, indicando os limites de monitorização do sistema, e de recuperar o serviço em tempo aceitável sem comprometer outros serviços.

Keywords cloud computing, openstack, virtualization, sla, monitoring, automatic deployment, orchestration, availability mechanisms, availability strategies, recovery methods.

Abstract This dissertation proposes methods of high-availability for critical applications to maintain their normal function and recover from unexpected failures. Applications can be developed and deployed to work within the cloud environment to achieve flexibility in maintenance, also giving the option of monitorization. A monitoring system can monitor system metrics such as CPU usage, or just a specific application service, checking whether it is running. Additionally, alarms created within the monitoring system trigger notifications upon the occurrence of a failure event, helping the orchestrator to recover from it. A failure occurs when a certain metric is above the established threshold, at which point the Service Level Agreement (SLA) is violated. The implemented and tested solution uses an OpenStack private cloud as infrastructure support and, through the use of the Heat orchestrator, the TICK Stack monitoring system, and a recovery engine, provides a capable solution for critical application monitoring, offering high-availability. The test results proved the solution's worth in different test scenarios, indicating the monitoring limits of the system, and showed the service recovery time to be reasonable without compromising other services.

Contents

Contents i

List of Figures v

List of Tables vii

Glossary ix

1 Introduction 1
1.1 Motivation ...... 1
1.2 Objectives ...... 2
1.3 Contributions ...... 2
1.4 Thesis structure ...... 2

2 State of the Art 5
2.1 Cloud Computing ...... 5
2.2 Virtualization ...... 8
2.3 Service Level Agreement and Availability Metrics ...... 11
2.4 Monitoring ...... 13
2.4.1 Related monitoring solutions ...... 14
2.5 Automatic Deployment ...... 17
2.5.1 Related automated deployment solutions ...... 18
2.6 Availability Mechanisms ...... 20
2.7 Availability Strategies ...... 22
2.7.1 Solutions ...... 22

3 SLA constraints in OpenStack Stacks 25
3.1 SLA Characteristics ...... 25
3.1.1 Availability ...... 26
3.1.2 Service Performance ...... 29
3.1.3 Latency ...... 35
3.1.4 Network ...... 36
3.1.5 Application ...... 37
3.2 Solution proposed ...... 38
3.3 Component diagram ...... 39
3.4 Template for SLA compliance ...... 40
3.5 Operation workflow ...... 46
3.5.1 Service Deployment ...... 47
3.5.2 Service Recovery ...... 48

4 Implementation 51
4.1 SLA enhanced template ...... 51
4.2 OpenStack implementation ...... 52
4.3 OpenStack Heat template ...... 54
4.4 Service manager ...... 58
4.5 Monitoring subsystem implementation ...... 60
4.5.1 Monitoring dashboards ...... 67
4.6 Recovery engine ...... 68

5 Evaluation and analysis 73
5.1 Scenario ...... 73
5.2 Limitations ...... 77
5.2.1 OpenStack platform ...... 77
5.2.2 SLA metric processing ...... 78
5.3 Results ...... 79
5.3.1 Service deployment steps ...... 80
5.3.2 Multiple service deployment time ...... 81
5.3.3 SLA violation detection time ...... 82
5.3.4 Recovery method comparison ...... 83
5.3.5 Multiple service recovery ...... 85
5.3.6 Network analysis ...... 86

6 Conclusion 91
6.1 Future work ...... 92

A GitLab runner template 95

References 97


List of Figures

2.1 Cloud environment architecture ...... 6
2.2 Comparison among three cloud service models ...... 7
2.3 Hypervisor types ...... 9
2.4 Hypervisor performance comparison [11] ...... 10

3.1 Verifying host state at a certain monitoring interval ...... 27
3.2 Swap memory usage by Nexus OSS ...... 31
3.3 IOSTAT real-time disk statistics ...... 32
3.4 Free memory tool in human-readable output ...... 33
3.5 Content of file /proc/meminfo split through four images ...... 33
3.6 Terminal output of mpstat tool of a VM with 46 vCPU ...... 34
3.7 netstat terminal output ...... 37
3.8 iperf3 terminal output ...... 37
3.9 Solution architecture ...... 40
3.10 Service deployment workflow ...... 47
3.11 Stack and SLA workflow from service deployment ...... 48
3.12 Service recovery workflow ...... 49

4.1 OpenStack module distribution ...... 53
4.2 Service manager implementation architecture ...... 59
4.3 Example of a Chronograf dashboard ...... 66
4.4 Monitoring dashboard of SLA availability percentage ...... 67
4.5 Monitoring dashboard of metric statistics ...... 67
4.6 Class diagram of the implemented solution and related classes ...... 69
4.7 Flow of the action inside api_post() function ...... 71
4.8 Recovery engine log file ...... 71

5.1 EngageSKA cluster at the Datacenter of Telecommunication Institute in Aveiro ...... 77
5.2 Simultaneous service stack deployment ...... 78
5.3 Kapacitor delay to process metrics ...... 79
5.4 Duration of each step from service deployment ...... 80
5.5 Deployment time over number of services ...... 82
5.6 Detection time of a SLA violation among different monitoring intervals ...... 83
5.7 Recovery method comparison ...... 85
5.8 Service recovery time with different quantity of services ...... 86
5.9 Throughput at monitoring system according to different monitoring intervals ...... 87
5.10 Throughput at monitoring system according to a different number of services ...... 89

List of Tables

2.1 Classes of system availability [14] ...... 12
2.2 Comparison of network monitoring options [17] ...... 14
2.3 Monitoring solution comparison ...... 17
2.4 Comparison of migration techniques [22] ...... 18
2.5 Orchestration system comparison ...... 24

3.1 External SLA monitoring ...... 43

4.1 Some OpenStack resource types ...... 56


Glossary

SKA        Square Kilometer Array
SLA        Service Level Agreement
EngageSKA  ENAbling Green E-science for SKA
LMC        Local Monitoring and Control
IaaS       Infrastructure-as-a-Service
PaaS       Platform-as-a-Service
SaaS       Software-as-a-Service
AWS        Amazon Web Services
HVAC       Heating, Ventilation, and Air Conditioning
VM         Virtual Machine
OS         Operating System
KVM        Kernel-based Virtual Machine
CPU        Central Processing Unit
vCPU       Virtual CPU
IO         Input/Output
YAML       YAML Ain’t Markup Language
HOT        Heat Orchestration Template
SDP        Science Data Processor
RAM        Random Access Memory
TICK       Telegraf, InfluxDB, Chronograf and Kapacitor
RTT        Round Trip Time
SSD        Solid-State Drive
DSL        Domain Specific Language
PTP        Precision Time Protocol
CI/CD      Continuous Integration and Continuous Delivery
QoS        Quality of Service
SNMP       Simple Network Management Protocol
HA         Highly-Available
FT         Fault Tolerant
SMP-FT     Symmetric for Multi-Processor Fault Tolerance
IOPS       Input/output operations per second
BSOD       Blue Screen Of Death
TTL        Time-To-Live
WAN        Wide Area Network
MTU        Maximum Transmission Unit
UPS        Uninterruptible Power Supply
NTP        Network Time Protocol
HPC        High-Performance Computing
RAID       Redundant Array of Inexpensive Drives
PSU        Power Supply Unit
NFS        Network File System


CHAPTER 1

Introduction

This dissertation work is intended to develop mechanisms for service deployment and high-availability, using a monitoring system to maintain the service working according to Square Kilometer Array (SKA) requirements. The SKA project uses Tango devices, which require special attention. A Tango device can control dishes, arrays of antennas, and more. The availability of those devices is critical, and the software that controls them cannot fail. Moreover, the cloud environment allows projects like SKA to quickly provision software, either for test or production purposes. However, the cloud environment does not by itself provide a way to monitor the deployed software: in case of a failure, there are no alerts about it. Even if someone notices the failure, it will still require manual attention. With all these issues, the software will suffer a high amount of downtime. In some scenarios, this can be critical. For example, failure of the emergency mechanisms that position the dish safely under high wind speed may result in the dish falling to the ground. To achieve the desired objectives in this dissertation, we abstracted components such as the cloud environment and the monitoring system by using OpenStack and the TICK Stack, respectively.

1.1 Motivation

ENAbling Green E-science for SKA (EngageSKA) is a project funded by FCT/POCI (POCI-01-0145-FEDER-022217) and P2020, developed at the Telecommunications Institute in Aveiro. The main goal of the project is to evaluate the state of the art with a sustainable plan for green e-science, by fostering infrastructure and participating in the ESFRI SKA project along the Big Data and Green Power axis. At the same time, it aims at securing contributions in the SKA consortia, with strong opportunities to participate in the construction and scientific exploration. The current aim of the project in the SKA consortia is to provide the infrastructure with cloud computation for scientific development and testing. Cloud computation enables faster and more flexible deployment, also simplifying the datacenter architecture. Performing tasks, operations, and maintenance becomes much more comfortable with cloud technologies, as resources are centralized. Due to virtualization, less hardware is needed in a datacenter, although it is more powerful, allowing more services, more testing, and more development on just one server.

1.2 Objectives

The main dissertation objective can be split into three parts. The first is to define a structure for the template file that describes the service and the SLA rules. The service description must include the resources required to provision and the installation script. The SLA rules must consist of the description of the metrics to monitor with their respective thresholds, also including the recovery methods associated with the monitored metrics. The second objective is to perform template analysis and deploy the service within the cloud environment using the orchestration mechanism. The third objective is to monitor the service continuously and, if one of the SLA metrics violates the established threshold, trigger the recovery methods in the order specified in the template. Upon recovery failure, the user must be notified. A sketch of what such a template could look like is shown below.
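The exact template format is defined later in this dissertation (Chapters 3 and 4); purely to illustrate the three parts described above, the sketch below shows a hypothetical SLA-enhanced template embedded as a YAML string and parsed with PyYAML. All field names (service, sla, recovery, and so on) are assumptions made for this illustration, not the format actually adopted.

```python
# Hypothetical sketch of an SLA-enhanced service template.
# Field names and values are illustrative only.
import yaml  # PyYAML

TEMPLATE = """
service:
  name: demo-api
  image: ubuntu-18.04          # placeholder image name
  flavor: m1.small             # resources to provision
  install_script: |
    #!/bin/bash
    apt-get update && apt-get install -y demo-api
sla:
  - metric: cpu_usage_percent
    threshold: 80              # SLA violated above this value
    interval: 30s              # monitoring interval
    recovery: [restart_service, rebuild_instance]   # tried in this order
  - metric: service_running
    threshold: 1               # below 1 means the process is down
    interval: 10s
    recovery: [restart_service]
"""

spec = yaml.safe_load(TEMPLATE)
for rule in spec["sla"]:
    print(f'{rule["metric"]}: threshold={rule["threshold"]}, '
          f'recovery order={rule["recovery"]}')
```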

1.3 Contributions

This dissertation work has contributed to the development of the SKA project by providing the template specification and configuration for future services deployed by SKA members [1]. Another contribution is the state of the art of monitoring solutions for Local Monitoring and Control (LMC) [2]. Furthermore, the developed work supports the CI/CD mechanisms using the OpenStack infrastructure, as described in the paper presented at the ICALEPCS'19 conference [3].

1.4 Thesis structure

To familiarise the reader with the topics discussed in this dissertation, Chapter 2 presents the current state of the art, introducing the latest available technologies and their usability for the work developed in this dissertation. Afterwards, Chapter 3 presents the solution, explaining from a general point of view the architecture and the required components. Chapter 4 shows the used technologies and the methods developed to achieve the architecture established in Chapter 3. The obtained results and the performed analysis are described in Chapter 5. Finally, the dissertation ends with a conclusion in Chapter 6, where future work is also presented. The Appendix and References are presented last.


CHAPTER 2

State of the Art

This chapter reviews the current technologies and methodologies related to this work, describing different solutions and showing the advantages and disadvantages of each one. Some of the presented solutions are for academic purposes only, while others are open-source or commercial products currently used by the community. This chapter will help to better understand the available strategies for availability, to create robust availability strategies, and to choose the right tools to use during the implementation phase.

2.1 Cloud Computing

Cloud computing is a relatively new field, consisting of providing resources such as computing power, storage space, and networking for users to use on their own PCs, smartphones, or tablets via a remote infrastructure [4]. Using a cloud environment simplifies the network infrastructure required to do the job of small or large companies, reducing the amount of equipment and the maintenance cost [4]. Cloud computing uses a distributed architecture, where resources are centralized and, through virtualization (described in Section 2.2), can be quickly provisioned on-demand [5]. In cloud computing, the available resources are shared, as well as the cost associated with them, where users only pay for what they need at a given time. Figure 2.1 illustrates a simple cloud computing architecture. Infrastructure is the hardware layer of cloud computing and consists of computing servers, networking, and storage units. Service is the software layer, and it runs on top of the infrastructure. The management is responsible for the work coordination among the different modules, like the infrastructure and service layers. All modules present in a cloud environment must be of restricted access, where only authorized users should be able to access them, verified by security modules. When an application runs on the cloud, the user accesses it through a browser, which sends requests to the server; if the request is valid, the browser shows the received answer.

Figure 2.1: Cloud environment architecture.

Since resources can be shared, the same hardware can be used by multiple users, increasing the efficiency of resource usage [5]. This can be seen as an advantage, but also as a disadvantage if not used properly. For example, if one of the users compromises their system, other users can also be affected [6]. The cloud environment deployment can differ depending on the user requirements and purpose [7], and there are three cloud service models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). IaaS consists of resource renting, such as processing, storage, and network capacity, providing users with ways to remotely run and develop applications [7]. With this service, hardware maintenance will not be a concern for the user; only the Operating System (OS) configuration will be their responsibility [4]. Amazon Web Services (AWS) and Windows Server are examples of this. The PaaS model allows both hardware and OS abstraction, where developers only need to focus on application design, development, testing, implementation, and hosting. Examples of this cloud model include App Engine and Windows Azure. The SaaS model is directed at common users, as it uses a browser as the platform to run the web application hosted on the cloud [7]. This is the most frequently used cloud model [4] and has several popular examples: social networks (Facebook), blogging platforms, webmail services, Internet banking portals, online payment systems (PayPal), and more. Figure 2.2 shows the summarized comparison among the three cloud service models previously described. Some applications make use of multiple cloud models; for example, the SkyDox web application (SaaS) uses Engine Yard (PaaS) for document collaboration, which in turn is deployed on AWS (IaaS) [4].

Figure 2.2: Comparison among three cloud service models

Depending on the company requirements, finances, and data privacy, there are several different cloud environments to choose from. A public cloud is available for general public usage, sharing the same cloud resources among all the clients [6], where the cloud is owned and managed by a third-party cloud provider [4]. In contrast, a private cloud is devoted to the internal usage of one company, and it might be hosted locally at the company or be outsourced to a third party that manages it [6]. The private cloud brings higher data privacy and the advantage of resource usage only by people inside the company or trusted members. Nonetheless, the cost of maintaining a running cloud can be high in some cases. On the other hand, a cloud managed by a third-party cloud provider releases the company from the responsibility of maintaining it, thus lowering the cost while the company focuses on development. As a downside, public cloud resources are used by a variety of users, and data privacy cannot be fully assured. Some applications might use both public and private clouds, being hosted on the public cloud while sensitive data is stored on the private cloud for security purposes. This type of cloud environment is called a hybrid cloud. Another example is when the application is hosted on the public cloud, but the data are stored locally [4].

Cloud computing can often be described as elastic or utility computing [6]. It follows a usage-based model, where a user only pays for what they need. In case of requiring more computational resources, those can be requested and acquired from the cloud [6]. Some examples are Amazon S3 and Amazon EC2 [6]. As mentioned before, this type of model brings the advantage of lowering up-front investments, using on-demand service and paying only for what is used. For Web applications, it brings the advantage of supporting 100 users at one instant and 10,000 users at another without wasting resources, by provisioning and de-provisioning [6]. To better explain cloud computing utility, an example was presented in [6]. Suppose 100 servers are needed for 3 years. The first possible solution is to lease the servers, at approximately $0.40 per hour each, for a total cost of

100 servers * $0.40 unit/hour * 3 years * 8760 hours/year = $1,051,200.

The second possibility is to buy the servers. Assuming that each server costs approximately $1,500, that it is necessary to hire two staff members at $100,000 per year to manage the servers, and that each server consumes 150 watts at a cost of $0.10 per kilowatt-hour (a total of $13,140 per year in electricity), the sum of all associated costs results in a total of

100 servers * $1,500 + 3 years * $13,140 electricity/year + 3 years * 2 staff * $100,000 salary/year = $789,420.

In conclusion, server leasing can be a little more costly when compared to owning the servers. However, there are more factors that were not taken into account for server ownership, such as space renting, Heating, Ventilation, and Air Conditioning (HVAC), hardware upgrades, and repairs. A small sketch of this comparison follows.
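The comparison above can be reproduced in a few lines of code; the figures are exactly the ones from the example in [6], so the sketch below is only a restatement of that arithmetic, not new data.

```python
# Reproduces the lease-versus-buy cost comparison from the example above.
HOURS_PER_YEAR = 8760

def lease_cost(servers: int, years: int, rate_per_hour: float) -> float:
    """Total cost of leasing servers at an hourly rate."""
    return servers * rate_per_hour * years * HOURS_PER_YEAR

def ownership_cost(servers: int, years: int, price: float,
                   staff: int, salary: float,
                   watts: float, kwh_price: float) -> float:
    """Purchase price plus staff salaries and electricity over the period."""
    electricity_per_year = servers * watts / 1000 * HOURS_PER_YEAR * kwh_price
    return servers * price + years * (staff * salary + electricity_per_year)

print(f"Lease:     ${lease_cost(100, 3, 0.40):,.0f}")                              # $1,051,200
print(f"Ownership: ${ownership_cost(100, 3, 1500, 2, 100000, 150, 0.10):,.0f}")    # $789,420
```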

2.2 Virtualization

Virtualization has changed the architecture of the cloud environment as it was previously known, where a single computational server was used exclusively to run a single OS. Virtualization enables a single physical machine to deploy multiple Virtual Machines (VMs) to execute multiple applications and services. This allows running multiple and different OSs, isolated from each other, on the same physical machine, abstracting from the infrastructure level [5]. One of the reasons why it is advantageous to adopt this technology is that it keeps up with the growth of large information transfers and increases datacenter capabilities [8].

Figure 2.3: Hypervisor types

To achieve virtualization, it is required to have a hypervisor present on the physical machine, and there are two main types of hypervisors: the bare-metal (Type-1) hypervisor and the hosted (Type-2) hypervisor. The bare-metal hypervisor runs directly on top of the hardware and does not require an OS. It has the disadvantage of requiring a specific hardware configuration, lacking hardware flexibility. On the other hand, the hosted hypervisor allows a wide variety of hardware configurations, as it runs on top of the OS. Yet, as shown in Figure 2.3, it has an extra layer compared to the bare-metal type, thus not being as efficient. The virtualization layer itself has three possible types: Full Virtualization, Paravirtualization, and OS-Level Virtualization. Full Virtualization has direct access to the resources of the physical machine, allowing quicker resource access. Paravirtualization is a type of virtualization in which the OS is aware of being executed inside a VM, where instead of making direct request operations as in full virtualization, it performs hypercalls, explicit calls to the hypervisor. OS-Level Virtualization does not require a hypervisor; the OS plays the role of the hypervisor. This type of virtualization forces all virtual machines to share the same OS and is usually used for containerization. Containerization is an alternative to VMs, giving developers the ability to increase their productivity [9]. Containers create an abstraction layer to deploy services quickly and can be started in a matter of seconds, much faster compared to a VM, which can take minutes. As said before, on the downside, all containers must share the same OS, resulting in poorer isolation compared to VMs and a lack of OS compatibility [9]. The use of virtualization improves resource usage, quickly providing isolated environments for testing purposes, and increases flexibility and utility. Also, it removes

dependencies of the OS level from the hardware level, where the same hypervisor can be executed on different OSs [10]. Hwang et al. [11] presented a comparison between the four most popular virtualization platform technologies: Hyper-V, Kernel-based Virtual Machine (KVM), vSphere, and Xen. Their methodology for hypervisor performance comparison is to use specific benchmark workloads for each resource with 1 Virtual CPU (vCPU) and 4 vCPUs. The Bytemark benchmark was used to stress the capabilities of the Central Processing Unit (CPU), and showed similar performance among the hypervisors, with only minor differences noticed. The Ramspeed1 benchmark was used to measure cache and memory bandwidth. It showed a similar performance as well; however, KVM at multiple vCPU levels showed a 25% performance decrease compared to the other hypervisors. The Bonnie++2 benchmark was used to test disk throughput, which revealed similar performance behavior, except in the case of Xen, which showed a decrease in performance in the test with character-level Input/Output (IO). They ran an additional FileBench3 benchmark that confirmed the decrease in performance for Xen. In this benchmark, KVM showed the best performance of all. The Netperf4 benchmark was used to test network performance, showing similar performance across most hypervisors but not for Xen, which had 22% lower performance compared to the others. In their discussion, they stated that there is no perfect hypervisor choice. Figure 2.4 presents a comparison between the hypervisors at different levels, where the best performance was obtained by vSphere and KVM. Different applications could benefit particularly from certain hypervisors, but in general, vSphere had the best performance in most benchmarks.

Figure 2.4: Hypervisor performance comparison [11]

1 https://github.com/cruvolo/ramspeed-smp/
2 https://linux.die.net/man/8/bonnie++/
3 https://linux.die.net/man/1/filebench/
4 https://linux.die.net/man/1/netperf/

2.3 Service Level Agreement and Availability Metrics

The need for clients to use cloud computing is increasing on a worldwide scale, creating demand for differentiated service quality. Different clients have different expectations and requirements, forcing cloud providers to establish a Service Level Agreement (SLA), a commitment between the provider and the client to assure an agreed quality of service. The cloud consumer establishes a set of requirements according to the cloud capabilities and available metrics, resulting in an agreement. If the cloud service provider does not obey the established SLA, penalization can occur. For example, Salman Baset [12] compared the SLAs of different cloud providers, where Amazon EC2 credits 10% of the customer bill if the availability is below 99.95%, and Azure Compute credits 10% if the availability is below 99.95%, or 25% if the availability is below 99%. To ensure the established set of requirements is being fulfilled, cloud providers must have a monitoring system performing periodical metric collection. Monitoring metrics can be used by the monitoring system to verify VM state, performance, and other information. The service performance can be differentiated using Quality of Service (QoS). To establish a differentiated QoS, the monitoring system will use different thresholds according to the clients' requirements. One physical server can provision multiple VMs, which demand more resources from the server and create high load, potentially leading to resource shortage; this creates the need to monitor each server's resource utilization and to balance the load among all available servers [13]. Resource usage is not the only metric to be monitored, as service availability and performance are also relevant. Some of the important metrics to monitor on a server to ensure performance and availability are: Pages/sec, Number of CPUs, Guest and Host Memory Usage, Memory Swap In/Out Usage, VM Disk Read and Write Rate, VM Network Data Receive and Transmit Rate, Virtual Machine Configuration, Virtual Machine State, Response Time, VM Startup and Release Time, Up Time, CPU Usage, Page Faults/Sec, Available Memory, among others [13]. Due to hardware evolution, availability has improved since 1980, when a typical server had an availability of 99% [14]. While this amount of availability may sound ideal, it results in 100 minutes of downtime per week. This kind of downtime can be acceptable for back-office computers where the work is done asynchronously. However, critical and online applications cannot undergo such downtime. For each instant of time the service is unavailable, the business can be losing profit, or, for some businesses, it ruins the client experience and, consequently, the company's reputation. Ideally, this type of application requires at least high-availability, which is 99.999% of availability, resulting in a maximum of 5 minutes of service denial per year, roughly 5 seconds per week.

System Type                    Category                 Unavailability (minutes/year)   Availability (%)   Availability Class
Personal clients               Unmanaged                50,000.00                       90                 1
Entry-level business systems   Managed                  5,000.00                        99                 2
E-commerce                     Well-managed             500.00                          99.9               3
Datacenter                     Fault-tolerant           50.00                           99.99              4
Telephone network              High-availability        5.00                            99.999             5
Military defense systems       Very-high-availability   0.50                            99.9999            6

Table 2.1: Classes of system availability [14]

An excellent example of high-availability is a telephone network, which requires level 5 availability (five nines, 99.999%), resulting in a maximum of 2 hours of downtime within 40 years. Furthermore, Table 2.1 describes other availability classes with the respective categories in which they are inserted. Buyya et al. [15] proposed a dynamic SLA-oriented build using Manjrasoft Aneka [16]. With it, for a given task, a certain amount of time to finish the task is established, with the possibility of extra resource allocation. The experiment had one task with four different deadlines: 1 hour, 45 minutes, 30 minutes, and 15 minutes. The experiment had 4 initial static machines, and when the job began, it executed Aneka's provisioning algorithm, performing cost-optimization by calculating the minimum amount of resources required to execute the task in the predefined time. If more resources were required to finish the job on time, it could allocate additional VMs at an extra cost. As a result, for the task to finish in 1 hour, no extra resources were required, and the task finished in 1:00:58. For the 45-minute deadline, 2 extra VMs were required at a total cost of U$ 0.17, finishing in 0:41:06. For the 30-minute deadline, 6 extra VMs were required at a total cost of U$ 0.51, finishing in 0:28:24. For the 15-minute deadline, 20 extra VMs were required at a total cost of U$ 1.70, finishing in 0:14:18. The experiment shows that if a certain task requires a shorter deadline, it is possible to provision extra resources and divide the task across them to finish earlier. With more resources the result comes faster, and the more it will cost. This method can be used to provide extra resources at a usage peak to load-balance the work across the resources.
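As a quick check of the numbers discussed above, the allowed downtime can be derived directly from the availability percentage; the sketch below reproduces that conversion for the classes of Table 2.1 (the table rounds these figures, e.g. listing 50,000 minutes per year for 90% availability).

```python
# Converts an availability percentage into the allowed downtime per year/week.
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime(availability_percent: float) -> tuple:
    """Return (minutes of downtime per year, seconds of downtime per week)."""
    unavailable_fraction = 1 - availability_percent / 100
    per_year_min = unavailable_fraction * MINUTES_PER_YEAR
    per_week_sec = per_year_min * 60 / 52
    return per_year_min, per_week_sec

for availability in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    per_year, per_week = downtime(availability)
    print(f"{availability:>8}% -> {per_year:10.2f} min/year, {per_week:9.2f} s/week")
```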

2.4 Monitoring

A monitoring system has the purpose of monitoring the host and issuing alerts about changes. The main components behind the system are the collector, storage, presentation, and alerting. To retrieve the monitoring metrics from the host, there are two methods: push and pull [17]. The metrics retrieved from the hosts under monitoring are stored and processed at the collector node. It is essential to choose an adequate method to store the monitoring data. Choosing the right storage method can improve performance and scalability. Usually, time-series databases are used in monitoring systems because they can quickly scale and have better usability [17]. The stored metrics can then be visualized on a dashboard in the form of graphs, tables, and more. The presentation helps administrators analyze historical events and detect abnormalities in the network or host behavior. It can also display real-time statistics [17]. Angelopoulos et al. [18] presented a visualization software called Grafana. It is an open-source tool for the visualization of metrics and time series, also supporting queries and multi-tenant dashboards. It also supports notification generation in case a specific metric exceeds the defined threshold. A typical example of notification generation is when the host becomes unreachable or unresponsive. Those notifications should be treated intelligently: when a specific metric bounces around the threshold, the notifications should be aggregated and only sent once, avoiding notification spamming [17]. Other purposes of monitoring are preventing security breaches and suspicious events, ensuring the QoS complies with the established SLA, and more. To keep up with cloud growth, the monitoring system also has to become more complex, requiring flexibility and scalability to keep the costs low. Monitoring systems can add overhead to the network, resulting in a loss of performance, an extra cost depending on the number of monitoring points, and greater difficulty to maintain. Monitoring data gathering is made through agents. There are two types of monitoring agents: agent-based and agentless. Table 2.2 shows a short comparison between the two types [17]. An agent-based monitoring system is third-party software deployed on every host under monitorization. This system is difficult to implement and maintain, as every newly added host requires the individual installation of the agent. In the future, if the configuration of the monitoring system changes, it will require changes to be made on each host. Nevertheless, it can use authentication protocols and only transmits the necessary data, lowering the amount of used bandwidth in the network. To completely remove the network overhead, it is required to separate the monitoring traffic from the general network, by adding another network exclusively dedicated to monitoring.

Feature comparison                Agentless   Agent-based
Deployment                        Easy        Hard
Security                          Good        Better
Network overhead                  Yes         Less
Breadth and depth of monitoring   Limited     Extensive

Table 2.2: Comparison of network monitoring options [17]

However, adding another network for monitoring causes issues in scalability, as it adds more complexity upon adding new hosts, requiring more maintenance [17]. The agentless monitoring system is easy to implement and does not require software to be installed on every host. It uses a well-known protocol, the Simple Network Management Protocol (SNMP), to push the data to the network so that the collector host can retrieve it. The collector host is a monitoring service responsible for gathering information from the network and storing it. The data sent by the host under monitorization is not filtered and may sometimes include irrelevant information, creating network overhead and higher bandwidth usage. Moreover, this type of agent has security issues. Agentless monitoring allows remote access to the server using the SNMP protocol, enabling the user not only to retrieve the server's performance information but also to obtain management access, performing actions such as reboot. Authentication protocols exist, but they are not as effective as the ones used by an agent-based system [17]. As previously stated, monitoring information gathering can be made through pull or push actions. In a pull action, the collector asks the host for specific metrics at a certain interval of time. In a push action, the host sends metrics to the collector at a certain interval of time or upon an event. This method can be adequate to keep the network overhead low if used properly [17]. A minimal sketch of a push-style agent is shown below.
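Purely as an illustration of the push approach, the sketch below shows a minimal agent that periodically collects a host metric and pushes it to a collector over HTTP. The collector URL and the JSON payload format are assumptions made for this example and do not correspond to any of the tools reviewed in the next subsection.

```python
# Minimal push-style monitoring agent (illustrative only).
# The collector endpoint and payload format are hypothetical.
import json
import os
import socket
import time
import urllib.request

COLLECTOR_URL = "http://collector.example.org:8080/metrics"  # assumed endpoint
PUSH_INTERVAL = 30  # seconds

def collect() -> dict:
    """Gather a few host metrics; load average stands in for richer data (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "load1": load1,
        "load5": load5,
        "load15": load15,
    }

def push(sample: dict) -> None:
    """POST one sample to the collector as JSON."""
    request = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(sample).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()

if __name__ == "__main__":
    while True:
        try:
            push(collect())
        except OSError as error:  # collector unreachable, keep trying
            print(f"push failed: {error}")
        time.sleep(PUSH_INTERVAL)
```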

2.4.1 Related monitoring solutions

Rodrigues et al. [19] presented an overview of generic monitoring solutions such as Cacti, MRTG, and Nagios, while also comparing them. Cacti and MRTG are used to measure network link consumption, but neither of them provides a self-configuration method or support for host discovery. Nagios was designed for traditional environments; its main feature is the support for plugins for gathering multiple metrics, thus adding flexibility and allowing monitoring in virtually any type of environment. As a downside, Cacti, MRTG, and Nagios do not meet elasticity requirements and do not support self-configuration. Some of the solutions presented in [19] focus on specific software, such as Amazon CloudWatch, the monitoring system for Amazon Web Services (AWS). It allows easy metric monitoring and alert management and, unlike the three

previously described solutions, this one has a self-configuration method, allowing users of AWS to configure their clouds easily. The downside of this solution is that it only works with AWS products. Angelopoulos et al. [18] presented possible monitoring solutions within the 5G environment. They presented the advantages of generic solutions such as Zabbix, Nagios, Sensu, and Consul. Zabbix is an enterprise-level distributed monitoring system for both network and software applications. It uses an agent-based collection method, having the capability of reporting through push or pull. Its downside is not using a time-series database, producing many false positives and triggering false alarms. Nagios has numerous add-ons and provides features like multi-tenancy. However, scalability is a big issue in this monitoring system, and health checks are difficult to manage in large infrastructure environments [19]. Sensu supports automatic host discovery, and metric processing is made by using a queuing system. Its scalability is superior to that of Nagios, but it suffers from a single point of failure. Consul performs better health checks compared to Nagios by using event-based push, reducing the network overhead and computational resource allocation. This solution is more scalable than the three previous ones, but it has the disadvantage that the agent availability is not monitored. Another monitoring solution proposed is Monasca, which is integrated within the OpenStack platform. Monasca offers multi-tenancy, scalability, excellent performance, and fault-tolerant monitoring-as-a-service communication through a REST API. Additionally, it has an alarm and notification engine. The control and monitoring of the scientific data provided by the SKA telescope is one of the main challenges of the SDP [20]. The SDP consortium from SKA is expected to invest in OpenStack, and Monasca is an option as a monitoring tool because it has direct integration with OpenStack. Yet, as a downside, it creates unnecessary network overhead, creating a higher possibility of losing critical information, and reduces compatibility with non-Monasca collecting agents, thus requiring higher investment costs in the integration of custom collection agents. As other monitoring framework options besides OpenStack Monasca, Collectd, Graylog, Prometheus, and ELK Stack were considered. Collectd is an old metric collection service with numerous integration plugins for specific applications, but it lacks a metric visualization platform and a log collection service. Graylog is a tool for log collection and analysis, and it uses ElasticSearch for storage and Fluentd for log collection, but it lacks monitoring features because it mainly aims at log analysis. Prometheus is a monitoring system featuring metric and alert solutions. It has an adequate performance when used for container monitorization and can be integrated with Grafana for information analysis and visualization, but it lacks logging solutions. ELK Stack offers monitoring and logging solutions based on ElasticSearch for storage, on LogStash for logs and metrics

collection, and on Kibana for visualization and analysis5. Another option for the monitoring and logging solutions for SKA SDP usage is ELK Stack with Prometheus, which can provide a full solution for monitoring and logging, but has the disadvantage of not supporting multi-tenancy. Also, Prometheus lacks a metric push method, which would be useful for specific applications. It is also stated that the Monasca Agent is a metric collection method designed to work with Python 2 and outside of a container. Thus, a problem occurred when trying to run the agent on recent Python 3, and because the Monasca agent plugins use a detection routine that is incompatible within a container by default, the detection routine had to be reconfigured manually in order to work properly6. M. Brattstrom, P. Morreale et al. [17] reviewed the InfluxData Platform solution, also known as the TICK Stack. This solution is agent-based, mostly constituted by open-source code, and it comes with all the necessary tools for system monitoring. The TICK Stack is composed of four components: Telegraf, InfluxDB, Chronograf, and Kapacitor (TICK). Telegraf is a plugin-driven server agent for collecting and reporting metrics, InfluxDB is a time-series database for metric storage that uses SQL-like queries to interact with the collected data, Chronograf is a platform for data presentation, and Kapacitor is a data processing engine for alert management. From all the presented and reviewed monitoring solutions, only OpenStack Monasca, Nagios, ELK Stack, Zabbix, and TICK Stack were selected as promising ones; Table 2.3 shows the comparison of the main features between them. The TICK Stack implementation has the lowest complexity of them all due to its modular and straightforward components, also adding more value due to the plugin support feature. Except for Monasca, the remaining monitoring solutions presented in Table 2.3 also support plugins. The monitoring data collection can be performed both agentless and agent-based using Nagios; the remaining monitoring solutions only support agent-based collection. The usage of a time-series database is helpful for future event prediction by analyzing past behavior; Monasca and TICK Stack use one while the others do not. A multi-tenancy feature, while possible in Nagios, is paid. In ELK Stack, it is not available, and the rest of them support it. Scalability is an essential feature in a cloud environment; unfortunately, only Nagios and Zabbix do not support it. Additional monitoring can be performed by reading log files; only ELK Stack, Zabbix, and Nagios support it. However, using the logging feature in Nagios requires additional costs. In summary, there is no one perfect monitoring solution. It will depend on the setup, the environment, and the requirements.

5 http://ska-sdp.org/sites/default/files/attachments/sdp_memo_053_-_monitoring_and_logging_for_the_sdp_part_1_-_signed.pdf
6 http://ska-sdp.org/sites/default/files/attachments/sdp_memo_068_p3-alaska_monitoring_logging_prototyping_part_1_-_signed.pdf

Features           Monasca [18][20]   Nagios [19][18]   ELK Stack [18]   Zabbix [20]   TICK Stack [17]
Complexity         Medium             Medium            High             Medium        Low
Plugin support     No                 Yes               Yes              Yes           Yes
Open-source        Yes                Yes               Yes              Yes           Yes
Agent/Agentless    Agent-based        Both              Agent-based      Agent-based   Agent-based
Uses time-series   Yes                No                No               No            Yes
Multi-tenancy      Yes                Paid              No               Yes           No
Scalability        Yes                No                Yes              No            Yes
Logging            No                 Paid              Yes              Yes           No

Table 2.3: Monitoring solution comparison
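As an illustration of how a TICK-based setup can be queried for SLA checks, the sketch below reads a recent CPU metric from InfluxDB 1.x through its HTTP /query endpoint and compares it against a threshold. The host, database name, measurement, and threshold are assumed values; in a real TICK deployment this check would more naturally be expressed as a Kapacitor alert.

```python
# Illustrative SLA threshold check against InfluxDB 1.x (host, database,
# measurement, and threshold are assumed values for this example).
import json
import urllib.parse
import urllib.request

INFLUX_URL = "http://monitoring.example.org:8086/query"  # assumed host
DATABASE = "telegraf"
CPU_THRESHOLD = 80.0  # percent, i.e. the SLA limit

# Telegraf's "cpu" measurement reports usage_idle; busy CPU = 100 - idle.
QUERY = 'SELECT mean("usage_idle") FROM "cpu" WHERE time > now() - 5m'

def mean_cpu_busy() -> float:
    params = urllib.parse.urlencode({"db": DATABASE, "q": QUERY})
    with urllib.request.urlopen(f"{INFLUX_URL}?{params}", timeout=5) as resp:
        payload = json.load(resp)
    # Response layout: results -> series -> values -> [timestamp, mean]
    idle = payload["results"][0]["series"][0]["values"][0][1]
    return 100.0 - idle

if __name__ == "__main__":
    busy = mean_cpu_busy()
    if busy > CPU_THRESHOLD:
        print(f"SLA violated: mean CPU {busy:.1f}% > {CPU_THRESHOLD}%")
    else:
        print(f"SLA respected: mean CPU {busy:.1f}%")
```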

2.5 Automatic Deployment

There is a variety of OSs to run, and it is impossible for cloud administrators to manually deploy a massive amount of virtual machines in a short amount of time. Therefore, orchestrator components were developed to assist humans in this task. The orchestrator is responsible for assigning the cloud resources to each VM instance, such as volume, memory, and computing resources [21]. Although the orchestrator can provision resources, it requires the user to specify the resources to provision in the form of a template. The orchestrator's purpose is not only to provision virtual resources, but also to react to unexpected failures and perform scheduled maintenance. As a fundamental feature, an orchestrator can perform VM migration, which consists of moving one VM from one host to another. As all the resources are virtual, it is easy to perform the migration, and there are two methods: live migration and cold migration [22]. Live migration occurs while the VM is still running. Usually, this method of migration is used in load-balancing situations. To migrate the VM from one host to another, a set of checks must be made. During migration, new resources are provisioned at the new host and, most importantly, they should match the original in terms of specifications before the start of the OS migration. After all of the checks have passed successfully, the migration of the OS takes place and the network traffic is redirected to the new VM, acknowledging the old instance of the successful migration and removing it. During live migrations, users are not aware of any change, although there is an instantaneous and insignificant latency increment [22]. Cold migration occurs while the VM is powered OFF. Since the VM content is stored in a file or volume, it can easily be migrated from one location to another. The new VM instance is then associated with the new host and, after the migration, the old instance is deleted.

Migration        VM state      Advantages
Live Migration   Powered ON    - quick migration; facilitates cloud maintenance
Cold Migration   Powered OFF   - simple architecture and implementation; shared storage not required

Table 2.4: Comparison of migration techniques [22]

On the occasion when the host requires maintenance, the operator needs to perform the migration of the VMs to another host. With live migration, the VM is in a powered ON state and the migration to another host occurs without downtime, not affecting the client's service. In the case of cold migration, the VM is in a powered OFF state; the VM is stored in a file or volume that can be easily transferred to another host. This method is more straightforward in architecture and more comfortable to implement. Table 2.4 summarizes the advantages of the two migration methods previously explained [22].

2.5.1 Related automated deployment solutions

József Kovács and Péter Kacsuk [23] presented Occopus, a multi-cloud orchestrator that deploys and manages complex scientific infrastructures. Occopus has features such as multi-cloud support, support for multiple configuration management tools, health monitoring, multiple node definition, scaling, on-the-fly dynamic reconfiguration of the infrastructure, interfaces, and error reporting support. They also reviewed some academic orchestration prototypes (Roboconf, Live Cloud, SALSA, IM by GryCAP) and commercial orchestration prototypes (Cloudify, Heat, CloudFormation, Terraform). Roboconf [24] is a cloud orchestrator that supports service deployment, maintenance, and migration between cloud and multi-cloud systems. However, Roboconf lacks support for configuration management tools. Live Cloud [25] is a management framework that provides service and resource provisioning. This orchestrator lacks a multi-cloud support system. SALSA [26] is a framework for dynamic infrastructure provisioning. It supports not only single-cloud but also multi-cloud systems. It uses its own configuration management tool, providing fine-grained configuration at different levels. However, this orchestrator consists of many services, creating a complex system (in terms of architecture and usage) that requires more human effort to configure and to maintain up-to-date. In contrast to SALSA, Occopus uses third-party configuration management tools, allowing the user to choose the most convenient and adequate configuration for the application.

Infrastructure Manager (IM) by GryCAP [27] is closer to Occopus than the others, providing a user-friendly interface by hiding irrelevant details from the user and focusing on Ansible as the configuration management tool. However, it is not flexible enough for users who may prefer other configuration management tools, such as Chef7. Also, IM contains cloud-specific details/attributes, eliminating multi-cloud portability. Cloudify8 was released in 2014, making it one of the latest orchestrators. It allows the user to set up a life cycle for services and applications, including monitoring of all details of the application, detecting problems, and automatically fixing them. Nevertheless, some advanced features such as the Web UI are only available in the premium edition. Heat9 is a template-based orchestrator that provides auto-scaling if integrated with the Telemetry module and allows configuration management tools such as Chef or Puppet. Interaction with Heat can be done via the CLI, the API, or the Horizon Dashboard. However, Heat only supports OpenStack clouds (a minimal template sketch is shown at the end of this subsection). CloudFormation10 is not an open-source solution, but it is the most mature and heavyweight orchestrator in comparison to the previous ones. This orchestrator was developed by Amazon for their AWS clouds with complete exclusivity. Terraform11 is an open-source solution focusing only on infrastructure deployment. There is no lifecycle management, scaling, or error handling. There is no UI, only a CLI can be used, and the learning curve is much steeper compared to the other orchestrators. Cloud portability was not designed for, making it significantly difficult to move to another cloud. An orchestrator can also be used at the container level. The most commonly used are Docker Swarm12, Kubernetes13, and Apache Mesos14. However, they only operate on already existing resources; resource allocation (containers) is not yet supported [23]. The orchestrator can be locked-in to a particular configuration management tool to install and configure the service on the deployed resources. Hochgeschwender et al. [28] made a comparison between existing configuration management tools. Those tools are mainly used to automate development and system administration tasks such as deployment, testing, and maintenance [23]. The most popular and mature are Chef and Ansible15. Other analyzed tools are Salt16, Puppet17,

7 https://www.chef.io/
8 https://cloudify.co/
9 https://wiki.openstack.org/wiki/Heat/
10 https://aws.amazon.com/cloudformation/
11 https://www.terraform.io/
12 https://docs.docker.com/engine/swarm/swarm-tutorial/
13 https://kubernetes.io/
14 http://mesos.apache.org/
15 https://www.ansible.com/
16 https://www.saltstack.com/
17 https://puppet.com/
18 https://www.ros.org/

and roslaunch18. The main difference between them is the requirement of an agent to operate. Ansible and roslaunch use an agentless architecture, using an SSH session to access the deployment environment and execute the commands. In order to install software, Ansible uses the environment's package manager or SCP to copy files. Chef, Salt, and Puppet use an agent-based architecture, using a master node to control the deployment and a client daemon running on the deployment environment to execute commands. In comparison to agent-based configuration management tools, agentless ones do not require installing an agent on the host, requiring only an SSH connection, which makes them the most popular.
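To make the template-based approach of Heat (Section 2.5.1) more concrete, the sketch below embeds a minimal Heat Orchestration Template (HOT) as a YAML string and parses it with PyYAML. The image, flavor, and network names are placeholders; a real template would normally live in its own file and be deployed with the OpenStack client rather than parsed locally.

```python
# Minimal HOT (Heat Orchestration Template) sketch; image/flavor/network names
# are placeholders, not values from this dissertation's deployment.
import yaml  # PyYAML

MINIMAL_HOT = """
heat_template_version: 2018-08-31

parameters:
  image:
    type: string
    default: ubuntu-18.04       # placeholder image name
  flavor:
    type: string
    default: m1.small           # placeholder flavor name

resources:
  server:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: { get_param: flavor }
      networks:
        - network: private      # placeholder network name

outputs:
  server_ip:
    value: { get_attr: [server, first_address] }
"""

template = yaml.safe_load(MINIMAL_HOT)
print("Resources declared:", list(template["resources"]))
# Deployment itself would typically be done with the OpenStack client, e.g.:
#   openstack stack create -t minimal.yaml my-stack
```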

2.6 Availability Mechanisms

Availability mechanisms are essential for overall system availability. Without them, the system will stop working upon an error, resulting in a service failure, which for some businesses can mean profit loss. A generic strategy for dealing with errors is manual intervention, where the system administrator tries to figure out what happened and how to fix the problem. Using this strategy incurs significant downtime, and the service becomes unreliable. Other mechanisms are required, more automated, dealing with errors without human interaction. Moreover, in order to maintain a system available, it is essential to know how and where it could fail. Nabi et al. [29] presented the state of the art in cloud availability, where they reviewed different types of failures at different levels. Those levels are:

• power failure: loss of energy given to an infrastructure, causing failure of the devices such as networking, computing or storage • hardware failure: loss of main or secondary system components such as CPU, memory, disk, ventilation and more • network failure: total or partial networking failure of devices such as a router, switch, firewall, or a virtual network function • VMM or hypervisor failure: VM manager or hypervisor failure will translate into failure of the virtual instances associated with it • VM failure: the OS running on top of the VM can fail, leading to unavailability of the VM • application failure: the application itself can malfunction or fail due to individual component failure

Not all of the failure types fall within the cloud provider's scope. For example, the IaaS cloud provider takes responsibility for power, hardware, network, hypervisor,


and VM availability, but the application level is out of their scope, the user being responsible for the components residing at that level. To prevent failures, the cloud provider can use mechanisms to increase availability. Those availability mechanisms can be categorized into three groups: fault tolerance, protective redundancy, and overload protection mechanisms. Fault tolerance means that service availability suffers short to no downtime through certain actions. Those actions can consist of failover/switchover, where the failure is handled by redirecting the workload from the failed node to a redundant healthy node [30]. Another action could be restarting the failed node, which cleans the actual failed state of the node to return it to an initial condition. Rollback and roll-forward consist of switching from the failed state to a state that is known to be healthy and correct, such as a backup or snapshot. Protective redundancy consists of having redundant elements, not required when the system is working correctly. The redundant element, which enables the service to work correctly in case of failure of the active node, can also be referred to as a standby node. Within standby nodes, there are two types: Hot Standby and Cold Standby. A Hot Standby is aware of the current state of the active node and, in case of failure, replaces it with low to no downtime. A Cold Standby, also known as a Spare, is a redundant element which can be instantiated or uninstantiated on demand. Since a Cold Standby node has no track of the active node state (as it is shut off), it will require some time before it can replace the failed node, thus translating into additional downtime. However, the spare node is not powered on, thus using fewer resources and lowering energy consumption. This type of standby is more effective for applications independent of the current state, in other words, stateless applications. By combining the different available methods and strategies, different levels of availability can be achieved. For instance, a geographical redundancy model can protect services even against natural disasters, such as flooding, earthquakes, and hurricanes, by distributing the services across different geographical sites [31]. Having multiple computational locations avoids a single point of failure, fastens the response to requests, and distributes the workload among them. As an example, Google uses clusters of servers in a distributed architecture across the world, resulting in high performance, high availability, high throughput, and the management of a significant number of requests from any part of the world. Overload protection consists in protecting the system against exceeding component limitations by way of auto-scaling and load-balancing. Auto-scaling can also be referred to as elasticity, which is a mechanism for provisioning and de-provisioning resources on a scheduled demand, or at the threshold of a specific workload. Auto-scaling can work side-by-side with the load-balancer in order to balance the workload between all nodes while optimizing resource usage [32][33][34]. A load-balancer is a mechanism used to

distribute the workload across the available nodes. If a node becomes highly loaded, the excess workload is transferred to another, lightly loaded node, ensuring every node has a comparable amount of work [35].

2.7 Availability Strategies

The availability strategy is used to maintain the service or system available according to a set of policies. The strategies proposed in [36] are about balancing the infrastructure resource usage, triggering automatic migration to the next available host under high load. The migration should be triggered on the hosts that are marked as hotspots due to being under high computational demand. A hotspot is a host at which the CPU utilization, memory, or occupied bandwidth is above the established threshold. The next available host is a host where the resource usage is below the threshold and lower than that of the other available hosts. This way, resource shortage is avoided and all the available hosts share the workload [36]. The strategy proposed in [37] is about ensuring the SLA established with the client. SLA constraints can be different for each service and, if the host cannot fulfill the SLA established by the user, migration is triggered to the next available host that fulfills it. Both strategies are well suited to be used simultaneously to provide better coverage for overload protection, and both translate into an adequate proposal for obtaining a highly available system that supports critical services against failures.
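A minimal sketch of the hotspot-selection logic described above follows; the thresholds, host names, and usage values are illustrative assumptions, not values taken from [36].

# Sketch of the hotspot/migration-target selection described in [36].
# Host names and thresholds are illustrative assumptions.

THRESHOLDS = {"cpu": 0.80, "memory": 0.85, "bandwidth": 0.75}

hosts = {
    "compute-1": {"cpu": 0.91, "memory": 0.70, "bandwidth": 0.40},  # hotspot (CPU)
    "compute-2": {"cpu": 0.35, "memory": 0.50, "bandwidth": 0.20},
    "compute-3": {"cpu": 0.60, "memory": 0.88, "bandwidth": 0.30},  # hotspot (memory)
}

def is_hotspot(usage):
    """A host is a hotspot when any monitored metric exceeds its threshold."""
    return any(usage[m] > limit for m, limit in THRESHOLDS.items())

def migration_target(hosts):
    """Pick the non-hotspot host with the lowest overall resource usage."""
    candidates = {h: u for h, u in hosts.items() if not is_hotspot(u)}
    if not candidates:
        return None  # no host below the thresholds; migration is not possible
    return min(candidates, key=lambda h: sum(candidates[h].values()))

hotspots = [h for h, u in hosts.items() if is_hotspot(u)]
print("hotspots:", hotspots, "-> migrate to:", migration_target(hosts))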

2.7.1 Solutions

Wubin Li and Ali Kanso in [38] compared containers to virtual machines with respect to achieving high availability. From all of the compared solutions, the following were found to be the most relevant for this dissertation: Docker Swarm, Kubernetes, and VMware. VMware is a commercial solution that allows virtualization using a bare-metal (type 1) hypervisor, without requiring an underlying OS. The strategy of VMware to handle failover is to use clustering, requiring all virtual disks to be on shared storage. Every host must have a Highly-Available (HA) agent communicating heartbeats to the cluster, to acknowledge its presence and state. VMware HA can protect against three types of failures: host failure, guest OS failure, and application failure. Upon host failure, VMware will restart the VM on another host. If the guest OS stops working or the application fails, VMware resets the VM on the same host. VMware also features a continuous availability functionality, which is a failover mechanism allowing zero downtime. Zero downtime is only possible if there are two VMs on different hosts, where the first VM synchronizes every hardware instruction with the second VM through the VMware Fault Tolerance (FT) link. This is possible using vLockstep technology

supported by Intel and AMD processors. So far, VMware FT only supports a single vCPU per FT-enabled VM, and requires a fast network of at least 1 Gbit/s. However, having only one vCPU is not enough for the majority of critical services. At VMworld 2012 and 2013, Jim Chow et al.19 presented Symmetric Multi-Processing Fault Tolerance (SMP-FT). SMP-FT allows FT to support multiprocessing, up to 8 vCPUs per FT-enabled VM, but it requires an even faster network link, of at least 10 Gbit/s. SMP-FT uses a new technology, called Fast Checkpointing, which allows increasing the number of vCPUs by executing the CPU instructions only at the primary VM and sending only their results to the secondary [39].

In the same article [38], Kubernetes, a containerization solution that provides service high availability, is analyzed. Kubernetes groups tightly coupled containers into pods and organizes loosely coupled ones through key/value labels. The labels consist of metadata which describes the semantic function of the service. The master node is responsible for maintaining the cluster status and the communication between the resources. Kubernetes also has the concept of a ReplicationController, where a given pod has a predefined number of replicas always running. In the case of a pod or service failure, traffic is redirected to a replica, and the failed pod is re-instantiated on a healthy node. In the case of a master node failure, traffic is redirected to a replica master, covering the host failure situation. Kubernetes is equipped with failure detection and failover mechanisms. Those mechanisms allow recovery from host, guest OS, and application failures. Although Kubernetes provides high availability, it lacks mechanisms for state preservation and service continuity, while remaining adequate for stateless applications.

Richter et al. [40] presented Docker Swarm as a solution for a microservice architecture. Docker Swarm has a Swarm manager, which controls the cluster of containers. The manager can have multiple replicas of itself to fail over to in case the primary manager is unavailable, covering the host failure scenario. The communication protocol used is RabbitMQ, which allows distributing and replicating messages across all RabbitMQ nodes to prevent message loss.

Heidari et al. [41] presented Heat, an orchestration service designed to work with OpenStack clouds. Heat provisions VMs in a stack according to the description provided in the template file and can monitor the VM/application state. Heat can restart the instance on VM failure or when the application is not responding, but it cannot perform failover. During the stack deployment, Heat sends the configuration to the metadata server, which communicates directly with the VM to configure the monitoring tools inside the instance. Periodically, the VM reports information about the VM and service state to the metadata server. Then, Heat will pull the monitoring information

19https://www.youtube.com/watch?v=gu69VjLhkdw/

Features                          VMware [38][39]   Kubernetes [38]   Docker Swarm [38][40]   Heat [41]
Live migration                    Yes               No                No                      Yes
Checkpoint/Restore                Yes               No                No                      Yes
Failure detection                 Yes               Yes               Yes                     Yes
Failover management               Yes               Yes               Yes                     Yes
Guest OS                          Any               Linux             Linux                   Any
Recover from host failure         Yes               Yes               Yes                     Yes
Recover from Guest OS failure     Yes               Yes               Yes                     Yes
Recover from application failure  Yes               Yes               Yes                     Yes
Service continuity                Yes               No                No                      No

Table 2.5: Orchestration system comparison

from the metadata server. However, if Heat does not receive information from the VM in time, it will assume that the VM failed and will try to recover by restarting the stack. The interval for the monitoring can be defined in a Heat template. The application layer works in a similar way: if Heat does not receive information from the application, it will restart the application. If it still does not receive the application status, Heat escalates to a VM failure and restarts the whole stack. Although Heat can protect against application and VM failures, it does not provide service continuity in case of failure. Table 2.5 provides a comparison of the availability features of the most relevant orchestrators. In a containerized environment (Docker Swarm and Kubernetes), live migration and restore methods are not possible. Failure detection and failover management are possible in all four orchestrators. As VMware and Heat provision VMs, they can support the deployment of any OS type; in contrast, Docker Swarm and Kubernetes allow only Linux-based OSs. Every presented orchestrator supports the recovery features, but only VMware can assure service continuity. In summary, VMware has overall better availability support, but it is a commercial solution and is not open source. The second-best orchestrator is Heat, an open-source solution that is free to use.

CHAPTER 3

SLA constraints in OpenStack Stacks

There are numerous ways of keeping a service running correctly. This chapter will focus on an architecture design that complies with SLA requirements and on the approaches to take when failures are found. An SLA is a commitment between the provider and the client to ensure the agreed quality of service. As described previously, some of the SLA attributes are Availability, Performance, Latency, Support, Network, and Application. Support is the availability time of the support team to assist the customer when it requires help. The support attribute will be out of the scope of this dissertation, since the cloud provider itself provides support. This dissertation's objective is to cover the IaaS cloud model, where the critical application should be guaranteed to run according to an established set of SLAs. Also, in the IaaS cloud model, the service provider is responsible for ensuring the SLA of the infrastructure, middleware, and application layers. To cover all those levels according to the SLA, it is essential to monitor various aspects of the system, from the hardware to the software layer. To better formulate the solution towards the objective, it was necessary to settle on a cloud environment technology, as creating a generic architecture solution without a specific cloud environment could lead to major changes and further investigation during the implementation phase. This resulted in choosing OpenStack, an open-source cloud environment, since it will also be used by SKA members and the SKA SDP also plans to use it.

3.1 SLA Characteristics

This section describes the SLA characteristics, explaining how to quantify them, what metrics are associated with them, how to monitor them, and how to re-establish them after an unexpected failure.

3.1.1 Availability

Availability can be described as the probability of the system working correctly when it is required to. Having a more reliable system helps to improve availability: the higher the reliability, the higher the chance of the system being available. It is worth noting that reliability is only useful while the system has not failed yet. When the system fails, the important aspect to take into account is maintainability, which is the amount of time required to recover from a failure. If one of those aspects is not considered, availability will decrease, just as choosing cheaper hardware will decrease reliability, or creating a very complex system architecture will decrease the probability of quickly recovering from a failure. On the other hand, enhancing only one of the aspects without changing the other will still increase the overall availability. In conclusion, in order to achieve better availability, we must take both reliability and maintainability into account by evaluating and choosing the tools according to our needs and budget.

Service availability is the ability of a user to have access to the information at all times, provided they have clearance for it. A system malfunction or a compromise of data security will affect availability negatively when the information is not secure or not easily available. Thus, availability must be part of the SLA attributes, which internally are constituted by rules and thresholds used to keep the service running as required. Service reliability can also be increased by adding redundant equipment that prevents service failure by switching or redirecting the traffic to a redundant and healthy system. Those redundant devices are usually used at the scale of data centers, which can be classified into tiers according to the ANSI/TIA-942 Data Center Standards1. A Tier I data center has the lowest requirements, with an availability of 99.671% and no redundant equipment. Tier II is a more complete system, with an availability of 99.741%, one unit of spare equipment (N+1), and power and cooling redundancy. Tier III has an availability of 99.982%, where all IT devices are fault-tolerant, duplicated (N+1), and dual-powered from different energy sources; the network links are also duplicated, with two different service providers. Tier IV is the highest of them all, providing an availability of 99.995%, where equipment redundancy is 2N+1, with even more power, network, and cooling redundancy. This tier is usually used for mission-critical systems and designed for large organizations, with a maximum downtime of 26 minutes per year.

1 http://www.tia-942.org/
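As a quick worked example, the tier availability percentages above translate into the following maximum yearly downtime; the Tier IV result of roughly 26 minutes matches the figure quoted above.

# Worked example: converting the tier availability percentages above into the
# maximum allowed downtime per year (365 days = 525 600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60

for tier, availability in [("Tier I", 99.671), ("Tier II", 99.741),
                           ("Tier III", 99.982), ("Tier IV", 99.995)]:
    downtime_min = (1 - availability / 100) * MINUTES_PER_YEAR
    print(f"{tier}: {availability}% -> {downtime_min:.1f} minutes of downtime per year")
# Tier IV works out to roughly 26 minutes per year.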

The percentage of availability can be determined by monitoring the service at a particular time interval. The monitoring interval for metric collection can be set according to the needs. Nonetheless, within a given interval, the service could be found unexpectedly unavailable, even if only for a short period of time, for example during a reboot, and then become available again. Figure 3.1 shows a situation where the service is unavailable between the monitoring instants at which specific metrics are collected. The monitoring system will never know that the service was unavailable during that time, as the state of the service at collection time is always positive. The obvious solution would be to monitor the service at a higher frequency, but this can be costly in an environment with multiple services, due to the need to process and store the retrieved information, requiring more processing power and storage space and causing additional network overhead.

Figure 3.1: Verifying host state at a certain monitoring interval

Another method to determine the service state is to check the service uptime. The availability for a certain monitoring interval is the uptime multiplied by 100 and divided by the monitoring interval value, which returns the percentage of availability in that interval. In summary, if the uptime is greater than the monitoring interval, the service was available during that interval of time. If the uptime is lower than the monitoring interval value, it means the service was rebooted for some reason. Even if the service was unavailable for a particular interval, it does not mean that it suffered a failure. Service downtime can be justified by scheduled maintenance, a programmed update that required a reboot, or a manual reboot issued by the client. This should also be taken into account when calculating the availability. The service uptime is not the only thing to take into account: the host where the application runs should also be monitored, as the application depends on the host state. If the host is down, so is the service.
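A minimal sketch of the interval-based calculation described above follows; the interval length and the uptime sample are illustrative assumptions.

# Sketch of the interval availability calculation described above.
# uptime and interval are in seconds; the sample values are illustrative.

def interval_availability(uptime_s: float, interval_s: float) -> float:
    """Percentage of the monitoring interval during which the service was up."""
    return min(uptime_s, interval_s) * 100.0 / interval_s

def rebooted_during_interval(uptime_s: float, interval_s: float) -> bool:
    """An uptime shorter than the interval means the service restarted at some point."""
    return uptime_s < interval_s

interval = 300   # monitoring interval of 5 minutes
uptime = 120     # service reports 2 minutes of uptime

print(f"availability in this interval: {interval_availability(uptime, interval):.1f}%")
print("rebooted during interval:", rebooted_during_interval(uptime, interval))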

It is important to perform maintenance on the host when required. It is recommended to keep the system up to date for several reasons, such as patching security flaws, protecting data, avoiding software obsolescence, and obtaining better hardware performance. Moreover, maintenance should not count towards the SLA availability time. Physical or virtual hosts may suffer downtime due to maintenance and, if we monitored strictly according to the SLA, the host would appear as unavailable and there would be a reduction in availability time. To solve this issue, it is possible to establish a flexible SLA, applied only during business hours (for example, between 7 am and 6 pm). This way, a maintenance window can be scheduled and metric collection only happens during business hours, meaning that if we start a 5-hour maintenance on a server at 8 pm, the SLA requirements will not be violated, as the contract ensures availability only during business hours. This example applies to companies, such as accounting firms, that work in a single time zone. In contrast, online services accessed at a worldwide scale cannot undergo such a large maintenance window, as they need to be available at all times, including weekends.

Another important consideration is to understand when the host can be considered available or unavailable. In order to answer that question, we need to know all possible states. Excluding the states in which the host is clearly unavailable (powered OFF or in an unknown state), the host can be in three main states:

• the host is powered ON
• the host is powered ON and the operating system is working correctly
• the host is powered ON, with a working OS, and the service is responding correctly

The first state is primitive, where the host is only powered ON without running the operating system. Unless the host is a physical server used only to power an external device through a USB port, it should be considered unavailable. The second state is when the host is powered ON and running a fully operational operating system. Here we would need to perform checks on the host at the OS level. This state is usable and can be considered available in some circumstances. For the IaaS cloud model, it is only required to ensure that the host is up and the operating system is working correctly; ensuring service-level availability is out of the provider's scope, as they are not responsible for service maintenance. The third and last state is when the host is up, the operating system is working, and the service is responding correctly. This state requires a more complex verification compared to the second state, because we need to verify the health of the service. The checks for correct service execution are more complex, as each service works differently, and the way to check needs to be adapted for each service. So far, we have classified service availability in a binary way, available or unavailable, but availability can also be partial, expressed through service degradation, since, in most cases, the service is composed of multiple smaller services, and the failure of one small service does not mean a complete outage of the main service. In the example of an ATM (cash machine), the failure of the money dispenser is critical, but a failure of the receipt printer does not

necessarily compromise the main service. If the ATM is unable to communicate with the central station, the service degradation reaches 100%, as the ATM is completely unusable and is considered unavailable.

In order to calculate availability, there are specific methods that can be followed. There are two types of methods to determine availability: active monitoring, which issues a request to obtain the host status and then waits for the answer, and passive monitoring, whose objective is to observe and analyze the resource usage of that host. The first method is the more direct one: through well-defined interfaces or APIs, a request is formulated and sent to obtain the state of the host, OS, or service. The host state can be obtained through hypervisor statistics, where the state of all VMs and related components should be available. Verifying connectivity between hosts can also provide information about availability and, to some extent, verify the OS health, since some essential OS components are required to work in order to respond to the request. Also, verifying if the host is listening on a particular port can determine if a specific service is working. For example, if the host is listening on port 22, the SSH service is available; if it is listening on port 80, a web service might be running, but that does not mean it is working correctly, and further verification needs to be performed to understand if the service is running correctly. The second method is an indirect way of knowing the availability by monitoring resource usage, which indicates whether the host is available or not. The RAM usage can inform whether the host is using memory or is shut down. Disk Input/Output operations per second (IOPS), for write or read operations, may also be used to know the host state: if the host is using the disk, the host is up. The number of network packets sent or received can also be a factor in concluding about the host state. CPU activity alone cannot determine host availability with certainty, because if the OS is stuck at the BIOS the CPU will still generate some workload, and, in the case of a Windows OS stuck at a Blue Screen Of Death (BSOD) during boot, the CPU can generate up to 100% workload.
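A minimal sketch of the active port check mentioned above; the target address, ports, and timeout are illustrative assumptions.

# Sketch of an active availability check: try to open a TCP connection to a
# well-known port. A successful connect only proves the port is listening,
# not that the service behind it is working correctly.
import socket

def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Illustrative targets: SSH and HTTP on a hypothetical monitored host.
for port in (22, 80):
    state = "listening" if port_is_listening("192.0.2.10", port) else "unreachable"
    print(f"port {port}: {state}")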

3.1.2 Service Performance

Service performance can be described as how fast the service can process a received request. The faster the service responds, the lower the response time. Within the SLA, the cloud provider can agree with the customer on the maximum time the service has to respond to a request under a particular simultaneous workload. In order to comply with the SLA, the response time must be monitored by the cloud provider. If the response time is higher than established, the SLA is violated, and the cloud provider must find a way to lower the response time. For the cloud provider to know how to

lower the response time, they must know why the response time rose above the threshold. The response time can be influenced by excessive CPU workload, intensive disk usage, lack of RAM, or network latency. We must understand how each of those resources affects the service response time and how it can be improved. Each of those resources has different metrics, and each of them will be individually analyzed.

In order to be executed on the cloud, the service needs to either run on a single machine or be distributed across several virtual instances to ease future scalability. Assume that the service is running on a single virtual instance and there are more virtual instances running other services on the same hypervisor; all hypervisor resources, specifically disk, RAM, CPU, and networking, will be shared between them. Although the quantity of resources can be limited per instance, the hypervisor does not entirely limit the usage intensity. If one VM uses all of the disk performance, other instances will have higher latency when accessing the disk, resulting in lower service performance and higher response time. When using shared storage, all instances from every hypervisor host will share the same disks, making the situation even more complicated. Not only the disks can be overloaded, but also the network, if there is no separation between storage and Internet connectivity. To prevent this occurrence, disk performance and health should be monitored.

When measuring disk performance, there are three essential metrics: latency, IOPS, and queue length. Disk latency is the amount of time between the request for data and its return: the longer it takes to retrieve the information, the higher the latency and the lower the performance will be. IOPS refers to the number of read and write operations made to the disk. If the limit of disk IOPS is reached, the excess requests go into a queue, where they wait for their opportunity to be processed. The longer a request waits in the queue, the slower the response will be and the lower the performance the client will experience. Every time the disk queue is not empty, the disk is being overloaded with requests that are postponed. The monitoring system can pick up short-term spikes of intensive disk usage only during certain intervals. Those spikes should be carefully analyzed; usually, some of them could mean a backup of the system, defragmentation, or something else. However, if the disk usage is always high, the swap memory used in the virtual instances or host should be verified. This can happen in a scenario where the instance is under-provisioned and the available resources are not enough, so it has to compensate by using swap memory, which uses the disk as additional RAM. Disk access through swap memory is much slower than RAM, causing slower service delivery and, on top of that, since swap writes to disk, it may also slow the response time of other services that try to access the disk. Figure 3.2 shows the output of the htop tool with CPU and RAM statistics of a virtual machine that is currently executing the Nexus Repository OSS

service. The green bars represent the used memory pages, the blue bars are buffered pages, and the yellow bars are cache pages. Although not all the RAM is being used, the majority of it is cached. Without available RAM, when the OS asks to allocate memory, the VM will use swap memory, significantly slowing down performance.

Figure 3.2: Swap memory usage by Nexus OSS

To actually monitor the disk performance metrics, there must be a tool that provides real-time disk statistics for the following metrics: latency, IOPS, and queue length. There are many open-source and commercial options available. The first example is iostat2. It displays real-time information about the read and write speed, disk access latency, percentage of used swap memory, and the average queue size. Figure 3.3 shows some of the available metrics, such as:

• %user - percentage of CPU utilization at the user level
• %nice - percentage of CPU utilization at the user level with nice priority
• %system - percentage of CPU utilization at the system level
• %iowait - percentage of time the CPU was idle during an outstanding I/O request
• %steal - percentage of time the vCPU waited while the hypervisor was servicing another vCPU
• %idle - percentage of time the CPU was idle without an outstanding I/O request
• rrqm/s - number of read requests per second merged into the queue
• wrqm/s - number of write requests per second merged into the queue
• r/s - number of read requests issued per second
• w/s - number of write requests issued per second
• rkB/s - number of kilobytes read per second
• wkB/s - number of kilobytes written per second
• avgrq-sz - average size of the issued requests, in sectors
• avgqu-sz - average queue size of the issued requests
• await - average time in milliseconds for I/O requests to be served
• r_await - average time in milliseconds for read requests to be served
• w_await - average time in milliseconds for write requests to be served
• svctm - average service time in milliseconds of the issued I/O requests
• %util - percentage of CPU time used for I/O requests

2 https://linux.die.net/man/1/iostat/

Disk performance monitoring is required not only for the hypervisor host, but also for the virtual instances. This helps to understand which instance is using most of the disk performance and which lacks it.

Figure 3.3: IOSTAT real-time disk statistics
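Besides iostat, the same counters can be sampled directly from the Linux kernel. The sketch below, which assumes a Linux host and uses an illustrative device name, derives read/write IOPS and the number of in-flight requests from /proc/diskstats by taking two samples.

# Sketch: derive IOPS and in-flight requests from /proc/diskstats on Linux.
# Field layout (after major, minor, device name): reads completed, reads merged,
# sectors read, ms reading, writes completed, writes merged, sectors written,
# ms writing, I/Os currently in progress, ...
import time

def read_diskstats(device: str):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {"reads": int(fields[3]),
                        "writes": int(fields[7]),
                        "inflight": int(fields[11])}
    raise ValueError(f"device {device} not found")

DEVICE = "sda"   # illustrative device name
INTERVAL = 5     # seconds between samples

before = read_diskstats(DEVICE)
time.sleep(INTERVAL)
after = read_diskstats(DEVICE)

print(f"read IOPS:  {(after['reads'] - before['reads']) / INTERVAL:.1f}")
print(f"write IOPS: {(after['writes'] - before['writes']) / INTERVAL:.1f}")
print(f"requests in flight: {after['inflight']}")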

Like the disk, RAM is also shared between all the virtual instances provisioned on the same hypervisor host. The allocation of RAM should be prudent, as the hypervisor can sometimes over-provision, resulting in the total RAM allocated to the instances being higher than what the hypervisor owns. This is a widely used method since, on average, an instance will never use all the available memory. The three main metrics to monitor RAM performance are:

• Memory pages: showing how often the physical disk is used to compensate for the RAM shortage
• Free memory: showing the RAM available to be used by processes at that moment
• Memory pressure: showing the percentage of physical memory in use out of the total amount of memory

In a scenario where a service tries to allocate more memory, the virtual instance will request more memory from the hypervisor host. If the hypervisor denies it, the instance will start to swap memory pages, using the physical disk as additional memory. In that scenario, the free memory is close to none, and memory paging and memory pressure are at 100%. Using swap memory lowers performance, since the physical disk is not as fast as RAM. Memory monitoring does not require additional tools. Usually, in a Linux OS, the file /proc/meminfo can be used to check the RAM, as it contains all the relevant information about it. Figure 3.5 shows the content of the file on a physical server with 64 GB of RAM. The metrics considered relevant are: total memory (MemTotal), free memory (MemFree), total swap memory (SwapTotal), and free swap memory (SwapFree). Another way to obtain RAM statistics is through a native Linux tool named free. This tool shows information about the memory statistics, but with much less detail, as some metrics are missing. Figure 3.4 shows the output of the free tool in human-readable form, where each metric means:

• total - total installed memory
• used - used memory
• free - unused memory

• shared - memory used by temporary file storage
• buff/cache - sum of buffered and cached memory
• available - estimation of the memory available for starting new applications

Figure 3.4: Free memory tool in human-readable output


Figure 3.5: Content of the file /proc/meminfo, split across four images
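A minimal sketch of reading the metrics highlighted above (MemTotal, MemFree, SwapTotal, SwapFree) directly from /proc/meminfo on a Linux host:

# Sketch: read the RAM and swap metrics discussed above from /proc/meminfo.
# Values in the file are reported in kB.

def read_meminfo(keys=("MemTotal", "MemFree", "SwapTotal", "SwapFree")):
    metrics = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, value = line.split(":", 1)
            if name in keys:
                metrics[name] = int(value.strip().split()[0])  # strip the "kB" suffix
    return metrics

mem = read_meminfo()
swap_used_kb = mem["SwapTotal"] - mem["SwapFree"]
print(mem)
print(f"swap in use: {swap_used_kb} kB")  # a growing value hints at memory pressure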

After covering disk and memory, there is CPU performance, which can also influence performance and service response time. The first obvious metric to check is CPU usage, which shows the percentage of CPU used at that time by the service or virtual instance. A metric related to CPU usage is CPU demand, which is the amount of CPU workload the instance is requesting. In an ideal scenario, the CPU demand and usage should be about the same, but usually the demand is much higher, especially if the hypervisor host is CPU over-provisioned. The hypervisor is responsible for receiving and managing CPU usage requests coming from the virtual instances. Upon receiving a CPU usage request, the hypervisor decides which logical CPU will process it. If the hypervisor's CPU usage is low, or high yet spread asynchronously, all requests are processed at arrival time. However, if the CPU workload is intensive and occurs during the same interval of time, some of the requests will have to wait until the CPU is free. This event can be monitored through the CPU ready time metric: the more requests there are to process and the less CPU is available, the higher the CPU ready time will be.

A more generic metric for CPU performance is the CPU wait time per dispatch. It shows the amount of time the CPU takes to process a request. Although this can show the CPU performance, it is not detailed enough to understand the root cause of low performance. If, even after checking all of the above CPU metrics, the CPU usage is below the limit and the performance is still unacceptable, this could be due to the hypervisor CPU scheduler reaching its request limit. This situation can happen when the instance creates a large number of requests with small instructions, which do not saturate the physical CPU but do saturate the hypervisor scheduler, moving the remaining requests into a CPU queue. The CPU queue metric can only be obtained at the OS level of the virtual instance. Some CPU metrics can only be obtained on a specific hypervisor: CPU demand and CPU ready time are specific to the VMware vSphere hypervisor, and the CPU wait time per dispatch metric is specific to the Hyper-V hypervisor. Those metrics can be helpful; however, the other metrics are usually enough. If our hypervisor does not provide any specific CPU metric, we can use a tool named mpstat to measure CPU statistics in real time. Figure 3.6 shows the terminal output of the mpstat tool on a virtual machine with 46 vCPUs. This tool shows the available CPU metrics at the OS level. The most relevant are the CPU usage (%usr and %sys), the CPU idle time during outstanding disk I/O requests (%iowait), and the CPU steal time (%steal), the time the vCPU has to wait for the physical CPU while it is being used by the hypervisor to serve another virtual instance. An interesting and valuable feature of mpstat is the ability to see statistics of individual CPU cores.

Figure 3.6: Terminal output of mpstat tool of a virtual machine with 46 vCPU
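The same counters that mpstat reports can also be sampled from /proc/stat on a Linux guest; the sketch below takes two samples and derives the aggregate busy, iowait and steal percentages (the 5-second interval is an illustrative choice).

# Sketch: derive aggregate CPU busy, iowait and steal percentages from /proc/stat.
# The "cpu" line lists jiffies spent as: user nice system idle iowait irq softirq steal ...
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]          # aggregate "cpu" line
    user, nice, system, idle, iowait, irq, softirq, steal = [int(v) for v in fields[:8]]
    busy = user + nice + system + irq + softirq
    return busy, idle, iowait, steal

b1, i1, w1, s1 = cpu_times()
time.sleep(5)                                      # illustrative sampling interval
b2, i2, w2, s2 = cpu_times()

total = (b2 + i2 + w2 + s2) - (b1 + i1 + w1 + s1)
print(f"busy:   {100 * (b2 - b1) / total:.1f}%")
print(f"iowait: {100 * (w2 - w1) / total:.1f}%")
print(f"steal:  {100 * (s2 - s1) / total:.1f}%")   # time stolen by the hypervisor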

If the customer complains about the performance of the service, but the monitoring systems show that disk, memory, and CPU performance are adequate, it is important to check the network performance. The service may need to communicate with local or

even Internet services, and if the network connectivity is not stable or fast enough, it can delay the service workflow, causing a higher response time.

3.1.3 Latency

Network performance is often measured in megabits or gigabits of information per second pushed through the network connection to reach the other end, also known as throughput. Network speed is an important aspect, but it is not the only one: latency must also be taken into account. Network latency can be described as the amount of time, measured in milliseconds, between sending and receiving a packet or a group of packets. Some applications can be latency tolerant, where the connection speed is what matters most, but the list of latency-sensitive applications has been growing. Latency can severely affect service usability and the customer experience: the lower the service latency, the more enjoyable the customer experience will be. Latency can be challenging to predict and measure, since each customer may take a different network route to reach the service. Different network routes may involve a different number of router hops, where each hop adds delay. A particular router along the way can be overloaded with requests, leading to yet another delay. Service latency can be affected by many other causes, which can be categorized into types of latency such as network, Internet, virtualization, and interrupt latency.

Network latency is the amount of time between a request for data and the return of the requested data. Three factors contribute to network latency: propagation, transmission, and router processing time. The propagation time is the time required to send one bit of information from one end of the medium to the other, and it increases proportionally with distance. The transmission time is the time required by a network device to push one packet into the medium: the larger the network packet, the higher the transmission time and the higher the latency will be. Each time the packet travels through a router or gateway, the latency increases, as devices take time to process the packet and possibly change the header; the header change can, for example, decrement the Time-To-Live (TTL) field at each hop. Internet latency is a more specific type of latency that is part of network latency. As the Internet is a Wide Area Network (WAN), there are many hops, routes, and devices involved, where each hop has a different latency. The same factors of network latency affect Internet latency, but on a broader scale. Virtualization latency is the time the hypervisor needs to process a request received from the virtual instance and forward it to the physical component. Any given virtual instance must pass through the hypervisor to communicate with the Internet, which then redirects the network request to the network device.

Interrupt latency is the time the computer takes to act upon an interrupt. The interrupt tells the operating system to stop the current task until it can decide what to do in response to the event. This latency type affects the overall latency, but this dissertation will not cover it, because interrupts are a low-level concern. All of those types of latency are difficult to track and monitor without a specific tool, except network latency, which can be monitored using standard Linux tools. The simplest way of measuring network latency is calculating the time spent from the moment a packet is sent to the moment it returns, also called the Round Trip Time (RTT). The tool traceroute can measure the latency and show the path the packet took.
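A minimal sketch of an RTT measurement that avoids the elevated privileges normally required for raw ICMP by timing a TCP connection instead; the target host, port, and number of samples are illustrative assumptions.

# Sketch: approximate the round trip time by timing a TCP handshake,
# since sending raw ICMP echo requests usually requires elevated privileges.
import socket
import time

def tcp_rtt_ms(host: str, port: int = 80, timeout: float = 3.0) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

samples = [tcp_rtt_ms("example.org") for _ in range(5)]
print(f"min/avg/max RTT: {min(samples):.1f}/{sum(samples)/len(samples):.1f}/{max(samples):.1f} ms")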

3.1.4 Network

Network performance can be affected by numerous factors, such as packets dropped due to network congestion, a high number of packets arriving out of order, and many others. Those situations can be avoided by monitoring network performance. The metrics that can determine the network performance are latency, bandwidth, throughput, jitter, and error rate. Network latency is the amount of time the data takes to travel from one host to another. In an ideal world, the latency is close to zero; unfortunately, the speed is limited by the medium through which the data is traveling. Throughput is the quantity of data a certain device can send or receive. Throughput can be mistaken for bandwidth, despite being a different concept: bandwidth is the maximum amount of data per second that a certain medium can transport from one side to the other. To simplify the matter, bandwidth can be compared to a bus and throughput to its passengers. The bus has a capacity of 100 passengers, but there are only 75 passengers to transport; thus, the bus will only transport 75, even though it could transport more. Bandwidth will always limit throughput. Network jitter is the variation of the latency over time: the more the latency changes, the higher the jitter will be. It is normal to have some jitter, but it is a useful metric to determine the network performance and identify network abnormalities. Network errors are common, especially on the Internet, since there is no control over the Internet routing devices. Those errors will degrade the network performance. The common network errors are packet drops, out-of-order delivery, loss, and retransmission. Some of the network performance metrics can be obtained by reading the network device statistics, like the number of packets received or sent, the number of dropped packets, the number of corrupted packets (errors), or the number of packets which the device could not receive or send. Those metrics can be obtained using netstat, a network statistics tool. Figure 3.7 shows the terminal output of the netstat -i command, with an MTU column. The Maximum Transmission Unit (MTU) is the packet size that can

be supported by that device. The RX-OK and TX-OK columns are metrics for the number of received and sent packets marked as correct, respectively. RX-ERR and TX-ERR are incorrect packets, a possible cause being data corruption. RX-DRP and TX-DRP are dropped packets. RX-OVR and TX-OVR are packets that the interface could not receive or send.

Figure 3.7: netstat terminal output
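As with the disk and CPU metrics, these interface counters can also be read directly from the kernel. A minimal sketch using the per-interface statistics exposed under /sys/class/net on Linux follows; the interface name is an illustrative assumption.

# Sketch: passive network metrics read from the interface statistics in sysfs.
# Each counter is a plain text file holding a single integer.
from pathlib import Path

def net_counters(iface: str, names=("rx_packets", "tx_packets",
                                    "rx_dropped", "tx_dropped",
                                    "rx_errors", "tx_errors")):
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in names}

print(net_counters("eth0"))  # "eth0" is an illustrative interface name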

The previously presented network metrics are passive, as they do not cause any network overhead. They are retrieved from statistics saved in the network interface, obtaining only the few metrics required to analyze the network performance. Network jitter, throughput, packet loss, and out-of-order delivery can only be actively monitored by creating network workload and adding network overhead. Figure 3.8 shows the iperf3 tool, which injects packets into the network and measures the network performance. This tool has a client side that sends the packets and a server side that receives them, reporting statistics back to the client. The result of active network performance monitoring in Figure 3.8 shows 9.90 Gbit/s of bandwidth, 0.009 ms of network jitter, and less than 1% of lost packets.

Figure 3.8: iperf3 terminal output

3.1.5 Application

Multiple processes can constitute the service, and each process may or may not be crucial for the service functionality. In order to avoid service disruption, individual processes must be

continuously monitored. There are two types of application availability: the application is running, and the application is functioning correctly. To verify if the application is running, depending on the application type, we could confirm whether the process is running or, in case the application uses a port, we could verify if the port is in listening mode. This only verifies that the application is running; it does not confirm that it is not crashed or stuck. To verify application functionality, the application itself needs to have a method of self-diagnosis, or a normal operation must be performed and compared to the expected result. For a web service, for example, performing a GET request on the main page can determine the status. If the returned answer contains the 200 HTTP code, the service is functioning as intended; if we receive a 500 HTTP code, the service is experiencing an Internal Server Error. Another method to monitor a service is by reading its log file. Frequently, services produce a log file where their status is posted, as well as event updates. Monitoring the application log file can be useful to look for error events. An error event does not mean the application is currently in a failed state, but it could translate into a failure. Creating test methods for all possible cases of the application can be very challenging, and log monitoring can help identify those errors. The log file contains the log messages, each accompanied by its type, which can be error, warning, or info. The error message is clear, stating that something has gone wrong and requires immediate attention. The warning message serves the purpose of warning the customer about a possible problem that might occur in the future; it is not an immediate problem, but it should be taken into consideration. The info messages are just standard informative texts about the application state.
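A minimal sketch of the HTTP health check described above, using only the Python standard library; the endpoint URL is an illustrative assumption.

# Sketch: verify application functionality by requesting a page and checking
# the HTTP status code, as described above. 200 means the service works as
# intended; 5xx codes indicate a server-side failure.
import urllib.error
import urllib.request

def http_healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:      # e.g. 500 Internal Server Error
        print(f"unhealthy: HTTP {err.code}")
        return False
    except (urllib.error.URLError, OSError):   # unreachable, timeout, DNS failure
        print("unhealthy: no response")
        return False

print(http_healthy("http://192.0.2.10/"))      # illustrative service endpoint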

3.2 Solution proposed

The objectives, as described in Chapter 1, can be split into three parts. The first is to define a structure for the template file to describe the service and the SLA rules. The service description must include the required resources to provision and the installation script. The SLA rules must include the description of the metrics to monitor, with their respective thresholds, as well as the recovery methods associated with the monitored metrics. The second objective is to perform the template analysis and deploy the service within the cloud environment using the orchestration mechanism. The third objective is to monitor the service continuously and, if one of the SLA metrics crosses the established threshold, trigger the recovery methods in the order specified in the template; upon recovery failure, notify the user about it. To reach the first objective, we have to choose an appropriate template format and

structure that will describe the service and the SLA rules. Since the service description will include the required resources to be deployed by the orchestrator, it is better to adapt the SLA enhanced template structure to the template used by the orchestrator. This way, it will be easier to separate the orchestrator-related information from the rest of the SLA enhanced template. To reach the second objective, we have to choose a tool to verify the template integrity according to the defined structure and format; on successful verification, the service deployment is triggered. For the deployment action, we must have in mind an orchestrator technology to provision the infrastructure described in the template to support the service. The third and last objective requires a monitoring system to monitor the SLA rules as described in the template, as well as a recovery engine to perform the recovery methods according to the SLA defined in the template. In summary, the solution will require four main components: a service manager, a cloud environment with an orchestrator, a monitoring system, and a recovery engine. The next section presents a component diagram explaining the interactions between them and describing each of them in detail.

3.3 Component diagram

Figure 3.9 shows the architecture of the proposed solution. The four main components are the service manager, the cloud environment with an orchestrator, the monitoring system, and the recovery engine. The service manager receives a service template, verifies and parses it, and returns four individual sections to be later used by the other components. The first section is the orchestrator template, used to provision the resources, i.e., the underlying infrastructure. The second is the software template, used to configure the service on top of the underlying infrastructure. The third is the SLA rules, used by the monitoring system to check for SLA violations. The last section is the recovery methods, stored at the recovery engine and used to instruct the orchestrator on how to recover the service. The orchestrator template, SLA rules, and recovery methods are sent, respectively, to the cloud environment, the monitoring system, and the recovery engine, while the software template must await the completion of the resource provisioning and then perform the installation according to the template.

The cloud environment provides on-demand resources, allowing flexibility and proper resource utilization. One of the required in-cloud services is the orchestrator, which enables automatic resource provisioning. The orchestrator will be used to provision resources according to the template received from the Service Manager. When the resource provisioning finishes, the Service Manager will perform the software configuration on top of the resources according to the software template. At the end of the software deployment, a monitoring agent is configured, reporting metrics to the monitoring system.

Figure 3.9: Solution architecture

The orchestrator is used not only for resource provisioning but also to manage the provisioned resources, performing actions such as rebooting a VM when receiving an instruction from the recovery engine. The monitoring system monitors the service health and alerts in case of a failure. It receives the monitoring metrics from the cloud environment and the SLA rules from the service manager. Metrics are stored in a database, allowing present and past metric visualization through a dashboard. Most importantly, the monitoring system compares the received metrics with the SLA rules, sending an alert to the recovery engine when an SLA rule violation occurs. The recovery engine receives the alert information from the monitoring system and the recovery methods from the service manager. It matches the alert information to the available recovery methods and sends an appropriate instruction to the orchestrator, which performs the recovery action on the service.
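A minimal sketch of how the service manager could split an SLA enhanced template into the four sections described above; the field names follow the structure of Listing 1, and the PyYAML library is assumed to be available.

# Sketch: parse an SLA enhanced template and split it into the four sections
# consumed by the orchestrator, software configuration, monitoring system and
# recovery engine. Field names follow the structure of Listing 1.
import yaml  # PyYAML, assumed available

def split_template(path: str):
    with open(path) as f:
        template = yaml.safe_load(f)

    orchestrator = {"heat_template_version": template["heat_template_version"],
                    "description": template.get("description", ""),
                    "resources": {}}
    software, sla_rules, recovery = {}, {}, {}

    for name, resource in template["resources"].items():
        sla = resource.pop("sla", {})
        software[name] = resource.pop("content", {})
        sla_rules[name] = {"external": sla.get("external", []),
                           "internal": sla.get("internal", [])}
        recovery[name] = sla.get("recovery", [])
        orchestrator["resources"][name] = resource  # what Heat actually receives

    return orchestrator, software, sla_rules, recovery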

3.4 Template for SLA compliance

The operation details of the final product can be described by the client using the template. The template is a configuration file used to deploy the service, the underlying infrastructure,

and to monitor the service. The client fully configures the template according to the need and the available template variables. We propose using the YAML Ain't Markup Language (YAML)3 template format, because it is the template format used by the Heat orchestrator. As it was decided to use OpenStack for the cloud environment, choosing the orchestrator was easy: since OpenStack has the Heat orchestrator, which is fully compatible with OpenStack, it would not make sense to use any other orchestrator. This way, it is possible to combine two different configurations (infrastructure and SLAs) in one, without additional effort. Besides, YAML is a human-friendly data serialization standard and has a minimal syntax, using Python-style indentation to indicate nesting.

Using the Heat template and adding the extra required fields, such as the SLA requirements and the software configuration, resulted in an SLA enhanced template, as depicted in Listing 1. The heat_template_version, description, type and properties fields are Heat variables. Throughout the years of Heat development, new features were added and some were discontinued. In order for Heat to know which specifications are valid and which are not, the property heat_template_version must be set to a Heat Orchestration Template (HOT) release date, e.g., 2015-10-15. Another Heat template related variable to set is the heat_stack_name property, used to distinguish the stack from other Heat stacks. This property is not part of the original HOT file; it is one of the added modifications. The description variable is optional, but its use is encouraged: it describes what users can do with this template when it is shared with someone not aware of it. When provisioning resources with Heat, all resources must be named (resource_name) and added under the resources property. Each resource has a type property to specify the resource type to provision. In this dissertation, the SLA rules will only be applied to the OS::Nova::Server resource type, i.e., virtual instances, although other resource types can still be used, for example OS::Cinder::Volume to provision Cinder volumes. The resource type can be further configured in the properties property. To embed the SLA rules into the template, we divided them into three different properties: external, internal and recovery. The external and internal properties are responsible for service monitoring from an external and internal point of view, respectively. The recovery property is used to specify the recovery methods in case of SLA non-compliance triggered by the external or internal properties.

3https://yaml.org/

heat_template_version: Date
heat_stack_name: String

description: String

resources:
  resource_name:
    type: OS::Nova::Server
    sla:
      external: List      # external methods
      internal: List      # internal methods
      recovery: List      # recovery methods
    properties:
      # virtual instance characteristics
    content:
      ansible:
        file: String
        extra-vars: String

Listing 1: Solution template

External monitoring

External monitoring allows performance monitoring from an external point of view. The monitoring system is also dedicated to receiving network traffic and reporting network metrics such as bandwidth, jitter, and packet loss, using either the TCP or UDP protocol. There is also the possibility of measuring the network performance between virtual instances defined in the template. The external network monitoring can be done using either the TCP or the UDP protocol, and each of them can return different metrics. TCP monitoring can measure the available bandwidth in megabits per second. UDP monitoring, like TCP, returns the available bandwidth, and also the associated jitter in milliseconds and the percentage of lost packets. Each SLA rule has an associated name property to distinguish it from the other rules. To create a network monitoring rule, the probe_bandwidth_mb property must be set to a higher value than the min_bandwidth_mb property, which is the minimum acceptable SLA bandwidth. Whenever the actual bandwidth is lower than min_bandwidth_mb, an alert is issued and the recovery_tag is used to perform an action that tries to solve the problem. The recovery_tag property is described further in this section. The protocol property only accepts the string values 'tcp' or 'udp'. Depending on the protocol, there are different available metrics, as shown in Table 3.1. If the host property is not defined or not used, the default value is used, which is the monitoring

Name                   Definition
name                   Unique name for the external SLA rule
protocol               Network protocol used to exchange streams of data
host                   Target host against which to measure network performance (Optional)
probe_bandwidth_mb     Bandwidth in Megabits to send to the target host
min_bandwidth_mb       Minimum accepted bandwidth
max_jitter_ms          Maximum accepted latency jitter in ms (Only for UDP)
max_lost_percentage    Maximum accepted percentage of lost packets (Only for UDP)
probe_interval         Delay interval between monitoring probes
recovery_tag           Name of the recovery methods to use when the alert is triggered
monitoring_function    Aggregation method for monitoring data

Table 3.1: External SLA monitoring

system already in place for external monitoring purposes. Otherwise, the user can use a custom host. The accepted value can be an IP address (e.g. 8.8.8.8), the resource name of another OS::Nova::Server resource, or localhost, for testing locally. The monitoring_function property is used to aggregate the monitoring data and has the following properties:

• period: aggregation time (e.g. 60s, 5m, 2h)
• every: aggregation delay (e.g. 30s, 5m)
• function: aggregation function (e.g. median, sum, max)

Listing 2 shows the external SLA monitoring definition structure within the SLA enhanced template.

...
sla:
  external:
    network:
      # TCP monitoring
      - { name: String, protocol: 'tcp', host: String,
          probe_bandwidth_mb: Float, min_bandwidth_mb: Float,
          probe_interval: Integer, recovery_tag: String,
          monitoring_function: { period: String, every: String, function: String } }

      # UDP monitoring
      - { name: String, protocol: 'udp', host: String,
          probe_bandwidth_mb: Float, min_bandwidth_mb: Float,
          max_jitter_ms: Float, max_lost_percentage: Float,
          probe_interval: Integer, recovery_tag: String,
          monitoring_function: { period: String, every: String, function: String } }
...

Listing 2: External SLA requirements

Internal monitoring

The internal SLA monitoring will be used for the health and status of local services, with the purpose of verifying the correct functionality from an internal point of view. Listing 3 shows the internal SLA monitoring definition structure within the SLA enhanced template. The net_response property is responsible for verifying the port status of the host. It sends a request, using the defined protocol, to the host on a certain listen_port. If the response arrives before the timeout and the host responds with success, the host is listening on that listen_port; otherwise, the listen_port is marked as unavailable, triggering the recovery_tag action. If the recovery action is an email notification, the name property is used to distinguish the SLA rules from each other, helping the user to identify the issue. The host property only accepts an IP address or a DNS name. The http_response property is responsible for verifying HTTP/HTTPS connections. It sends a request with a certain HTTP method to the url and compares the response code to the return_code. It also has the option to compare the page content with return_content, if set; the use of the return_content property is optional. If the response does not arrive before the response_timeout, the url is marked as unavailable,

triggering the recovery_tag action. A mismatch of the return_code or return_content properties also triggers the recovery_tag action. Extra properties can be used if required: follow_redirects, if set to True, follows the server redirects, and insecure_skip_verify, if set to True, skips certificate chain and host verification. The procstat property is responsible for verifying the status of system services. It obtains information about the service and compares it to the desired state. If the state does not match the obtained information, the recovery_tag action is triggered.

...
sla:
  internal:
    net_response:
      - { name: String, host: String, protocol: String,
          listen_port: Integer, timeout: String, recovery_tag: String }

    http_response:
      - { name: String, url: String, return_code: Integer, return_content: String,
          response_timeout: String, method: String, follow_redirects: Boolean,
          insecure_skip_verify: Boolean, recovery_tag: String }

    procstat:
      - { name: String, state: String, service: String, recovery_tag: String }
...

Listing 3: Internal SLA requirements

Recovery methods

The recovery methods are separated from the monitoring alerts, enabling the same recovery method to be used for multiple monitoring alerts. These methods have a unique recovery_tag_name and are added under the recovery property. Under recovery_tag_name there should be a list of recovery actions, ordered by the desired recovery action order.

Currently, there are four recovery actions: service_restart, instance_soft_restart, instance_hard_restart and notify. For each recovery action, the number of retries of that action and the delay_between_retries can be specified. These properties are optional; if not used, the default value for retries is 1 and for delay_between_retries is 60, in seconds. The service_restart property will remotely restart the service. From the available methods to restart the service, we propose to use SSH: with shell access, it is possible to remotely perform the systemctl restart command. The instance_soft_restart and instance_hard_restart properties are executed through the OpenStack API: the soft restart is executed using the cloud-init tool, which sends the command to the instance to perform a reboot, while with a hard reboot the virtual instance process is killed at the level of the compute host and then started again. The last recovery action is notify, consisting of sending an alert to the specified email with the information about the alert issue. To give some time between recovery actions, the delay property can be used to create a delay between retries. If the last recovery action has been executed and the service still has not recovered, it will be marked as FAILED and will require manual intervention to recover. Listing 4 shows the recovery definition structure within the SLA enhanced template.

...
sla:
  recovery:
    recovery_tag_name:
      - service_restart: { service: String, retries: Integer, delay_between_retries: Integer }
      - delay: { delay_between_retries: Integer }
      - instance_soft_restart: { retries: Integer, delay_between_retries: Integer }
      - delay: { delay_between_retries: Integer }
      - instance_hard_restart: { retries: Integer, delay_between_retries: Integer }
      - delay: { delay_between_retries: Integer }
      - notify: { email: String }
...

Listing 4: Recovery methods for SLA requirements
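A minimal sketch of how the recovery engine could dispatch the actions above by shelling out to the standard ssh and openstack clients; the host, service name, and instance identifier are illustrative assumptions, not part of the dissertation's implementation.

# Sketch: dispatch the recovery actions described above. SSH is used for
# service_restart; the OpenStack CLI is used for the soft and hard reboots.
import subprocess

def run(cmd: list[str]) -> bool:
    return subprocess.run(cmd).returncode == 0

def service_restart(host: str, service: str) -> bool:
    # Restart a systemd unit remotely over SSH, as proposed above.
    return run(["ssh", host, "sudo", "systemctl", "restart", service])

def instance_soft_restart(instance: str) -> bool:
    return run(["openstack", "server", "reboot", "--soft", instance])

def instance_hard_restart(instance: str) -> bool:
    return run(["openstack", "server", "reboot", "--hard", instance])

# Illustrative invocation order, mirroring Listing 4.
if not service_restart("ubuntu@192.0.2.10", "nginx"):
    if not instance_soft_restart("my-stack-server"):
        instance_hard_restart("my-stack-server")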

3.5 Operation workflow

This section describes the proposed workflow to meet this work's objectives. The workflow is divided into two subsections. The first subsection covers Service Deployment, explaining each step of how the service should be deployed. The second subsection covers Service Recovery, explaining the workflow and the steps that should happen during recovery.

3.5.1 Service Deployment Service deployment consists of a series of steps, from provisioning the infrastructure to configuring the service to be available. The workflow in Figure 3.10 depicts the steps required to deploy the service, while a detailed workflow of the composite states is depicted in Figure 3.11. The starting point is the configuration of the SLA enhanced template by the user, which will be used to deploy and monitor the service according to the defined properties. The template is then validated. If the validation fails to parse the template, a template structure error is issued. On approval, the workflow moves to the Stack deployment state. This state is responsible for deploying the infrastructure needed to support the service and for installing the software. Before creating a stack consisting of virtual instances, the existence of a stack with the name defined in the template is verified. If a stack with that name already exists, the system throws a warning and stops the entire workflow. Otherwise, the workflow continues to the next state, which deploys the resource stack using the OpenStack Heat orchestrator according to the properties defined in the template. The stack deployment can take some time, depending on the number of virtual resources to deploy. Thus, before continuing, we must ensure the stack deployment is complete.

Figure 3.10: Service deployment workflow

On stack completion, we can proceed with the workflow and install software on each instance of the stack, according to the template. By this stage, the Stack deployment state is complete, the service is initializing, and the workflow continues to the next state, SLA deployment. This state is responsible for the deployment of a service monitoring system, where a monitoring agent is installed and configured on each virtual instance of the stack. Afterwards, the agent receives plugins with the instructions to monitor and report SLA metrics to the monitoring node according to the template.

Figure 3.11: Stack and SLA workflow from service deployment

Then, the SLA alert definitions are stored in the monitoring node database and the actions to recover the service are stored in the recovery engine database. This is the last state of the workflow. If everything completes successfully, the service is running, monitoring is active, and, in the end, the IP addresses of all virtual machines from the stack should be displayed.

3.5.2 Service Recovery The workflow in Figure 3.12 shows the steps required to recover the service. The starting point is Monitor SLA metrics, where the monitoring system checks the metrics received from the agent inside the virtual instance for possible SLA violations. If the metric value is not in agreement with the alert threshold, the system moves to the next state to obtain the recovery methods associated with the alert. Otherwise, the monitoring system remains in the first state, checking for SLA violations. According to the recovery methods established in the SLA enhanced template, the system should execute the recovery method and return to the first state. If the system

manages to recover the service, meaning that the metric value comes back into agreement with the alert threshold, the service is marked as recovered. Otherwise, the system should move to the next recovery method associated with the alert until the service is recovered. If the attempt to recover the service fails, the service enters the Failed to recover state. Likewise, if there are no available recovery methods at all, the alert enters the Failed to recover state.

Figure 3.12: Service recovery workflow

Software Configuration Even though the HOT file supports shell script deployment (property user_data), which can be sufficient for some software configuration, it is limited and can make it challenging to set up complex environments. Many developers have good knowledge of Ansible and use it regularly to set up environments or to deploy software, due to the large number of available configuration modules. As one of the dissertation objectives is to create a more straightforward method to deploy software, it was decided to add an Ansible option to the HOT file. For the orchestrator to execute the Ansible playbook, the ansible property must be set with the path to the Ansible playbook, and the path must be relative to the location of the SLA enhanced template. Listing 5 shows the Ansible script definition within the SLA enhanced template.

...
resources:
  resource_name:
    type: OS::Nova::Server
    properties:
      ...
    content:
      ansible:
        file: String
        extra-vars: String
...

Listing 5: Definition of Ansible script
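As an illustration of how the ansible property can be consumed, the sketch below shows one way the orchestrating side could invoke the ansible-playbook tool against a freshly deployed instance, resolving the playbook path relative to the SLA enhanced template. This is a simplified sketch, not the implementation used in this work; the SSH user, key path and extra-vars value are placeholders.

import os
import subprocess

def run_ansible(template_path, ansible_section, target_ip):
    # Resolve the playbook path relative to the SLA enhanced template location.
    playbook = os.path.join(os.path.dirname(os.path.abspath(template_path)),
                            ansible_section["file"])
    cmd = [
        "ansible-playbook",
        "-i", "{0},".format(target_ip),      # ad-hoc inventory with a single host
        "-u", "centos",                      # placeholder SSH user
        "--private-key", "~/.ssh/id_rsa",    # placeholder key path
        playbook,
    ]
    extra_vars = ansible_section.get("extra-vars")
    if extra_vars:
        cmd += ["--extra-vars", extra_vars]
    subprocess.check_call(cmd)

# run_ansible("sla_template.yaml",
#             {"file": "gitlab-runner.yml", "extra-vars": "registration_token=..."},
#             "10.0.0.12")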

CHAPTER 4

Implementation

As described in the previous chapter, the proposed architecture is composed of four main components: the service manager, the cloud environment, the monitoring system, and the recovery engine. This chapter focuses on the implementation of those components and explains the reasons behind the choices of the technologies used.

4.1 SLA enhanced template

This section serves as an index page for the implemented template, directing each template property to the respective section where it is explained in detail. Listing 6 depicts the implemented SLA enhanced template in YAML format. The choice of the YAML format for the SLA enhanced template was based on the orchestrator template. Since the Heat orchestrator uses the YAML template format, it is easier to extract the orchestrator-related information during the SLA enhanced template split, not requiring additional changes or transformations. Besides, YAML is a human-friendly data serialization standard, making it easier for client usage. The final result, depicted in Listing 6, is the modified Heat template, adding the sla and content properties. The SLA enhanced template is analyzed and processed by the service manager (further explained), splitting it into four individual sections. The first section is the OpenStack Heat orchestrator template, describing the resources required to provision the underlying infrastructure. The properties heat_template_version, heat_stack_name, description, resources, resource_name, type and properties are orchestrator related. This template part is further explained in Section 4.3. The second section is the Ansible template used to configure the service on top of the underlying infrastructure. The properties content, ansible, file and extra-vars are

software configuration related. This template part is used by the service manager to configure the service, further explained in Section 4.4. The third section contains the SLA rules used by the TICK Stack monitoring system (further explained) to check for SLA violations. The external and internal properties are responsible for service monitoring from an external and internal point of view, respectively. This template part and the monitoring system are further explained in Section 4.5. The last section contains the recovery methods, used by the recovery engine (further explained) to instruct the orchestrator on how to recover the service. The recovery property is recovery engine related, further explained in Section 4.6.

heat_template_version: Date
heat_stack_name: String

description: String

resources:
  resource_name:
    type: OS::Nova::Server
    sla:
      external: List    # external methods
      internal: List    # internal methods
      recovery: List    # recovery methods
    properties:
      # virtual instance characteristics
    content:
      ansible:
        file: String
        extra-vars: String

Listing 6: SLA enhanced template
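The following simplified Python sketch illustrates the split described above: the yaml.safe_load result is divided into the HOT part, the Ansible content, the SLA monitoring rules and the recovery methods. It is illustrative only; the real service manager implementation (Section 4.4) and the exact return structure may differ, and the file name is hypothetical.

import yaml

def template_analysis(path):
    with open(path) as fh:
        return yaml.safe_load(fh)          # a parse error here means an invalid structure

def template_split(template):
    heat = dict(template)                  # what remains at the end is a plain HOT file
    stack_name = heat.pop("heat_stack_name", None)
    content, sla, recovery = {}, {}, {}
    for name, resource in heat.get("resources", {}).items():
        if "content" in resource:
            content[name] = resource.pop("content")     # Ansible section
        if "sla" in resource:
            rules = resource.pop("sla")
            recovery[name] = rules.pop("recovery", [])  # recovery engine section
            sla[name] = rules                           # external/internal monitoring
    return stack_name, heat, content, sla, recovery

stack_name, hot, ansible_part, sla_part, recovery_part = template_split(
    template_analysis("sla_template.yaml"))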

4.2 OpenStack implementation

To achieve the solution described in Chapter 3, considering the work complexity and data dimension, the best strategy for prototyping is to use a cloud environment, dividing resources into modules, such as computing, networking, storage, and others. This way, it is possible to use available resources more efficiently. OpenStack is considered to be the best-fit platform for managing resources because it allows Virtual Machine (VM) deployment, use of container technologies and even running VMs on top of bare metal for better performance. Another reason for choosing OpenStack was based on the SKA project decisions. Since SKA SDP was planning to

use OpenStack in their implementation, it would make sense to follow their decision and contribute if possible. OpenStack is an open-source software platform used in cloud computing to create Infrastructure-as-a-Service (IaaS), with the ability to interconnect separate modules into a single cloud service, therefore providing a reliable abstraction over the physical infrastructure. This allows an entire server to be dedicated to a single resource and, when high availability is used, an exact server replica to be created with the traffic load-balanced between them. Traffic is thus spread evenly across the servers and, in case of failure, all traffic from the failed server is redirected to a healthy one. Figure 4.1 represents the currently deployed OpenStack structure of the platform.

Figure 4.1: OpenStack module distribution

The OpenStack setup required the installation and configuration of several modules to allow the work to progress towards the dissertation objective. The main OpenStack module is the computing service, called Nova, responsible for VM deployment. In order to deploy VMs, the Nova module has to use virtualization. According to the State of the Art in Chapter 2, there are three types of virtualization: full virtualization, paravirtualization and OS-level virtualization. In this dissertation work, performance is

critical, leading to the choice of full virtualization, as it is the only type that allows direct hardware access, resulting in quicker operations. Having a virtualization type in mind, what is left is to choose an appropriate virtualization technology. According to the State of the Art, the virtualization technologies vSphere and Kernel-based Virtual Machine (KVM) showed the best performance results out of the four tested hypervisors. Since the chosen cloud environment was OpenStack, vSphere had to be removed from the hypervisor candidate list, because it is not compatible. This left KVM, which runs on top of a Linux OS, converting it into a bare-metal (type-1) hypervisor, ideal for achieving better performance. For VMs to access the Internet or to communicate with other VMs, the Neutron module was configured. As Neutron does not require a lot of computation or storage resources, it can be placed on the same node where the controller and monitoring system reside. Some VMs might require extra storage space, and for that the Cinder module is required. The file system used for the storage node was GlusterFS. The choice of GlusterFS as the underlying file system for Cinder volume storage was based on the fact that it can handle 64 Terabytes (TB) per node, a significant volume size that can be easily scaled, because GlusterFS stores information in blocks that can be easily moved, changed and even mirrored between storage nodes. The implementation of the cloud environment is part of the established requirements. Another requirement is the implementation of an orchestrator to automatically provision the resources required to run the service. After the choice of OpenStack for the cloud environment, choosing the orchestrator technology was easier. OpenStack has an official orchestrator, Heat, fully compatible with OpenStack. It would not make sense to use any other orchestrator, as it would never be as compatible as Heat. OpenStack also provides a web-based dashboard to interact with the cloud, called Horizon, allowing clients to create their own environments with little knowledge.

4.3 OpenStack Heat template

This subsection describes the Heat Orchestration Template (HOT) file structure and some of the available parameters. This information was retrieved from the OpenStack HOT guide1 and OpenStack Resource Types2. A HOT file uses the YAML format and is processed by the OpenStack Heat module to provision resources, with the possibility of having more than one resource setup, allowing the whole project architecture to be built from a single configuration file.

1 https://docs.openstack.org/heat/rocky/template_guide/hot_guide.html
2 https://docs.openstack.org/heat/rocky/template_guide/openstack.html

Also, it is possible to configure networks and subnets, assign SSH key pairs to instances, assign floating IPs, provision and manage Cinder volumes, deploy a monitoring system, and more. Throughout the years of Heat development, new features were added and some were discontinued. For Heat to know which specifications are valid and which are not, the heat_template_version property must be set with a HOT release date, e.g. 2015-10-15. Another Heat template related variable to set is the heat_stack_name property, used to distinguish between other Heat stacks. This property is not part of the original HOT file; it is one of the added modifications. The description variable is optional, but its use is encouraged: it describes what users can do with the template when it is shared with someone not aware of it. When provisioning resources with Heat, all resources must be named (resource_name) and added under the resources property, as shown in Listing 7. Each resource has a type property, used to specify the resource type to provision. The most commonly used resource types are displayed in Table 4.1. After choosing the right resource type, it is possible to define properties associated with the resource. In the case of OS::Nova::Server, the most common properties are:

• flavor - name of the pre-defined OpenStack flavor, used to define virtual instance characteristics, such as CPU, RAM and more.
• image - name of the OS image for the virtual instance to boot.
• key_name - name of the SSH key pair injected into the virtual instance.
• networks - list of networks connected to the virtual instance.
• user_data - user data script to be executed by cloud-init during boot.

heat_template_version: Date
heat_stack_name: String

description: String

resources:
  resource_name:
    type: OS::Nova::Server
    properties:
      flavor: String
      image: String
      key_name: String
      networks: List
      user_data: String

Listing 7: Basic Heat template
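For reference, a HOT file like the one in Listing 7 can also be handed to Heat programmatically. The sketch below assumes the python-heatclient library and an already authenticated keystoneauth session (session creation omitted); the stack name and file path are placeholders and the polling of the stack status is reduced to a comment.

import yaml
from heatclient import client as heat_client

def deploy_hot(session, stack_name, hot_path):
    heat = heat_client.Client("1", session=session)
    with open(hot_path) as fh:
        template = yaml.safe_load(fh)
    # Ask Heat to create the stack described by the HOT file.
    heat.stacks.create(stack_name=stack_name, template=template)
    # The caller should poll heat.stacks.get(stack_name) until the
    # stack_status becomes CREATE_COMPLETE (or fails).
    return heat.stacks.get(stack_name)

# stack = deploy_hot(session, "gitlab-runner-stack", "basic_server.yaml")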

In OpenStack, a flavor defines the compute, memory and storage capacity of the virtual instance; in summary, the available hardware configuration the instance can use. Having the hardware, an underlying OS is required to run tasks. The Glance module is responsible for managing the OS images, which can be available in a variety of formats, the most used being qcow2. Some of the properties of OS::Nova::Server are exemplified in Listing 7. This template will provision a virtual instance with default tools, if the OS image was not changed.

OpenStack resource type                Definition
OS::Nova::Server                       Manage virtual instances
OS::Nova::KeyPair                      Manage SSH key pairs
OS::Cinder::Volume                     Manage volumes
OS::Cinder::VolumeAttachment           Manage volume attachments
OS::Neutron::FloatingIP                Manage floating IPs
OS::Neutron::FloatingIPAssociation     Manage floating IP associations

Table 4.1: Some OpenStack resource types

In some cases, users might want to set up their own environment inside the virtual machine. To achieve that, a software configuration script can be used with Heat. Shell script content specified in the user_data property is passed to the instance and executed by cloud-init during boot time. Cloud-init is a standard method used in cloud environments to initialize the virtual instance according to a set of defined parameters. It can also execute custom commands specified by the client. To remotely access the virtual instance, either the OS image comes with a predefined remote access password, or it is required to associate an SSH key. An SSH key pair is created by the OS::Nova::KeyPair resource, associated with a name property to be used further in the file. To import an already existing key, the public_key field must be set with the content of the public SSH key. If the public_key field is not specified and the save_private_key field is set to true, a new key pair will be created and the private key saved. An example of how to define the KeyPair resource is shown in Listing 8.

resources:
  my_key_name:
    type: OS::Nova::KeyPair
    properties:
      save_private_key: Boolean
      public_key: String
      name: String

Listing 8: Create a key pair

To increase the storage space, a VM supports multiple volume attachments, managed by the OpenStack Cinder module. By default, during volume creation, if the image

field is not set, the created volume is empty. Setting the image field with a valid OS image will create a bootable volume. When creating a new volume, it is required to specify the size property of the volume, unless the new volume originates from another volume (source_volid property), a volume snapshot (snapshot_id property) or a volume backup (backup_id property), in which case the size of the source volume is used. During the volume attachment, the volume_id property must be set with a valid ID of the volume and the instance_uuid property with a valid ID of the VM to which the volume will be attached. Listing 9 exemplifies the volume provisioning and attachment.

resources:
  new_volume_name:
    type: OS::Cinder::Volume
    properties:
      description: String
      backup_id: String
      image: String
      snapshot_id: String
      source_volid: String
      size: Integer

  volume_attachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      volume_id: String
      instance_uuid: String

Listing 9: Create and attach a volume to an instance

In most cloud environments, the number of available public IPs is very limited, requiring the use of internal networks for VMs. If a VM needs to be exposed to the public, a FloatingIP can be used. The OS::Neutron::FloatingIP resource is responsible for the allocation of floating IPs from any network. The floating_network property must be set with a valid network name from which to allocate the IP. To associate the IP with a VM, the OS::Neutron::FloatingIPAssociation resource is used, where the floatingip_id property must be set with the ID of the floating IP and the port_id property with the network port ID associated with the VM. When the VM is deleted, the FloatingIP is released to the IP pool or reassigned to another VM upon user request. Listing 10 shows how to allocate a floating IP and associate it with a VM.

resources:
  floating_ip:
    type: OS::Neutron::FloatingIP
    properties:
      floating_network: String

  association:
    type: OS::Neutron::FloatingIPAssociation
    properties:
      floatingip_id: String
      port_id: String

Listing 10: Allocate and associate a floating IP to an instance

4.4 Service manager

The service manager is the component where all the action starts. Figure 4.2 depicts the class diagram of the service manager implementation and the directly associated classes. The main variables of the class are openstack_auth, the OpenStack authentication credentials used to interact with the OpenStack API, template_path, the path to the SLA enhanced template, and action_type, which specifies whether the service should be deployed or destroyed. The action_type variable can be either deploy, to trigger service deployment, or remove, to remove the service. The function service_deployment() is the main function of the service manager, while the other functions are secondary and used by the main function. After template submission by the client, the template is analyzed using the template_analysis() function and split into four parts using the template_split() function: the orchestrator template, the Ansible template, the SLA template and the recovery template. The template_analysis() function uses the yaml.load() function to verify the template structure. If yaml.load() executes without errors, the template structure analysis has passed successfully. The template_split() function splits the template by extracting the properties that are not Heat related, meaning every variable that is not used by the Heat orchestrator is extracted from the template into a separate one. The SLA enhanced template is available in Section 4.1, Listing 6. The content property is extracted from the template, creating an Ansible template, later used by the service manager to configure the service using the ansible-playbook tool. Within the sla property, external and internal are extracted to create an SLA template. This is used by the monitoring system to create monitoring metrics, alert definitions and dashboard statistics. The remaining property within sla is recovery, also extracted, creating a recovery template used by the recovery engine. After the extraction of the previous properties from the template, the only properties left are

Figure 4.2: Service manager implementation architecture

Heat related, resulting in a HOT file, used by the OpenStack Heat orchestrator to deploy the underlying infrastructure. After the template analysis, if the action_type variable is set to deploy, the service manager uses the OpenStack API to verify whether the stack already exists (stack_exist()); if it does not, OpenStack Heat receives the HOT file and deploys the stack using deploy_stack(). A stack is a collection of resources provisioned by Heat. If action_type is set to remove, OpenStack Heat executes remove_stack(String heat_stack_name), using the heat_stack_name property from the template, to remove the stack. Upon stack provisioning completion, the service manager installs and configures the service using the ansible-playbook tool, according to the Ansible template. After the service manager finishes configuring the service using the perform_software_installation() function, it deploys the Telegraf monitoring agent in the VM environment using the perform_agent_configuration() function, explained in Section 4.5, used to report SLA metrics to the monitoring system. Following the monitoring agent configuration, we define in the monitoring system the thresholds for each alert from the SLA template using the Kapacitor store_sla_rules() function. Also, using the Chronograf create_sla_dashboard() function, a visualization dashboard is created to view the current SLA monitoring statistics. If the action_type variable is set to remove, the dashboards associated with the service are removed using the remove_sla_dashboard() function. Finally, the recovery template is stored in the OpenStack database using the store_recovery_methods function, and the IP address to access the service is displayed.
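Putting the pieces together, the following Python sketch condenses the service_deployment() flow described above, using the function names from the class diagram. It is not the literal implementation: error handling is reduced to a single check, manager stands for the service manager instance, and wait_for_stack() and get_stack_ips() are hypothetical helper names for the stack-completion wait and the final IP listing.

def service_deployment(manager):
    template = manager.template_analysis(manager.template_path)       # structure check
    stack_name, hot, content, sla, recovery = manager.template_split(template)

    if manager.action_type == "remove":
        manager.remove_stack(stack_name)
        manager.remove_sla_dashboard(stack_name)
        return

    if manager.stack_exist(stack_name):                                # deploy path
        raise RuntimeError("stack '%s' already exists" % stack_name)

    manager.deploy_stack(hot)                                          # Heat provisioning
    manager.wait_for_stack(stack_name)                                 # until CREATE_COMPLETE
    manager.perform_software_installation(content)                     # ansible-playbook
    manager.perform_agent_configuration(sla)                           # Telegraf agents
    manager.store_sla_rules(sla)                                       # Kapacitor alerts
    manager.create_sla_dashboard(sla)                                  # Chronograf dashboards
    manager.store_recovery_methods(recovery)                           # recovery engine database
    return manager.get_stack_ips(stack_name)                           # displayed to the client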

4.5 Monitoring subsystem implementation

The technology chosen for the monitoring system implementation is the TICK Stack, which offers low complexity, supports a high number of plugins and uses a time-series database. TICK Stack monitoring is constituted by Telegraf, InfluxDB, Chronograf and Kapacitor (TICK). Telegraf is an agent deployed in each VM to collect and report measured metrics. The Telegraf agent uses plugins to monitor different metrics and allows the creation of generic plugins, enabling the implementation of custom monitoring scripts. A complete list of official plugins can be found in the documentation on the InfluxDB official website3.

3 https://docs.influxdata.com/telegraf/v1.11/plugins/plugin-list/

The creation of generic plugins helps to obtain certain metrics, such as network performance, that are not supported by official Telegraf plugins. As network performance is part of our solution, we had to implement it, creating a Python script that uses the iPerf2

tool to obtain the network performance, returning the required metrics, such as bandwidth, network jitter and percentage of lost network packets. The reason for choosing iPerf2 instead of iPerf3 (the newer version) is the possibility of receiving multiple network measurement requests simultaneously, which was discontinued in the newer version because there was considered to be no good reason to support it4. The developed script accepts the variables host, protocol, interval, bandwidth and name as program arguments. The script output varies according to the protocol. If the protocol is tcp, the only output is the TCP bandwidth. If the protocol is udp, the output metrics are the UDP bandwidth, network jitter, and percentage of lost network packets. This Python script is then executed by the generic Telegraf plugin depicted in Listing 11, which gathers the output and reports it to the monitoring node.

[[inputs.exec]]
  ## Commands array
  commands = [
  {% for detail in external.network %}
    "/bin/python /etc/telegraf/scripts/iperf-monitor.py {{ detail.host }} {{ detail.protocol }} {{ detail.probe_interval }} {{ detail.probe_bandwidth_mb }} {{ detail.name }}",
  {% endfor %}
  ]

  ## Timeout for each command to complete.
  timeout = "10s"

  ## Data format to consume.
  data_format = "influx"

Listing 11: Custom Telegraf plugin
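For illustration, a minimal version of the iPerf2 wrapper script described above could look like the sketch below. It is not the exact script developed in this work: the CSV column indices, the measurement name (external-monitoring) and the field names are assumptions, and iperf (version 2) is expected to be installed on the instance.

import subprocess
import sys

def measure(host, protocol, interval, bandwidth_mb, name):
    cmd = ["iperf", "-c", host, "-t", str(interval), "-y", "C"]   # -y C = CSV report
    if protocol == "udp":
        cmd += ["-u", "-b", "{0}M".format(bandwidth_mb)]
    out = subprocess.check_output(cmd).decode().strip().splitlines()
    fields = out[-1].split(",")                                    # last CSV report line

    if protocol == "udp":
        # Assumed CSV columns: ..., bits/s, jitter (ms), lost, total, loss (%)
        metrics = "bandwidth={0},jitter_ms={1},lost_percent={2}".format(
            fields[8], fields[9], fields[12])
    else:
        metrics = "bandwidth={0}".format(fields[8])

    # InfluxDB line protocol, as expected by the exec plugin (data_format = "influx").
    print("external-monitoring,alert_name={0} {1}".format(name, metrics))

if __name__ == "__main__":
    host, protocol, interval, bandwidth_mb, name = sys.argv[1:6]
    measure(host, protocol, interval, bandwidth_mb, name)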

In our case, the commands property is filled with a command to execute the Python iPerf2 monitoring script, and the rest of the plugin configuration is made using Ansible, through a Jinja2 configuration template. With Jinja2, we configure the custom Telegraf plugin according to the SLA template. Each SLA template parameter translates to the plugin configuration. For example, the name and protocol parameters shown in Listing 12 translate into detail.name and detail.protocol shown in Listing 11. These and other parameters define the monitoring metrics reported to the monitoring system, except for the min_bandwidth_mb, max_jitter_ms, max_lost_percentage, recovery_tag and monitoring_function parameters, which are used to create alerts in the Kapacitor module. Those alerts are used to monitor SLA violations, reporting them to the recovery engine. A for loop inside the commands property allows the configuration of multiple iPerf monitors, in case the external monitoring has more than one rule.

4 https://code.google.com/archive/p/iperf/issues/130/

...
sla:
  external:
    network:
      # TCP monitoring
      - { name: String, protocol: 'tcp', host: String, probe_bandwidth_mb: Float,
          min_bandwidth_mb: Float, probe_interval: Integer, recovery_tag: String,
          monitoring_function: { period: String, every: String, function: String } }

      # UDP protocol
      - { name: String, protocol: 'udp', host: String, probe_bandwidth_mb: Float,
          min_bandwidth_mb: Float, max_jitter_ms: Float, max_lost_percentage: Float,
          probe_interval: Integer, recovery_tag: String,
          monitoring_function: { period: String, every: String, function: String } }
...

Listing 12: External SLA requirements

After the command execution, the returned output needs to be in the right format to be interpreted by the monitoring system and stored in the database. If data_format is set to influx, the output format of the executed command should be as shown in Listing 13. The first parameter to define is the monitoring name associated with the monitoring plugin, e.g. external-monitoring, followed by a comma and the tags associated with the monitoring, e.g. alert_name=database_monitoring, ending with a space character and the metrics with their values separated by commas. The value field can be a float, integer, string, or boolean. By default, InfluxDB assumes all numerical values are floats; to specify an integer value, add i after the number. String values require double quoting.

monitoring_name,tag1=value metric1=value,metric2=value

Listing 13: Influx output format

The official Telegraf plugin implementation is similar to the generic one, but for each plugin there are different properties to define. There are three official plugins used in this dissertation:

• Network Response Input Plugin (net_response SLA template property)
• HTTP Response Input Plugin (http_response SLA template property)
• Procstat Input Plugin (procstat SLA template property)

Network Response Input Plugin This plugin is responsible for verifying the port status of a host. It sends a request using the defined protocol to the host on a certain listen_port. If the response to the request arrives before the timeout and the host responds with success, the host is considered to be listening on that listen_port. Otherwise, the listen_port is marked as unavailable, triggering the recovery_tag action. If the recovery action is an email notification, the name property is used to distinguish between other alerts, helping the user identify the issue. The host property only accepts an IP address or a DNS name. The plugin configuration is made using the Jinja2 templating language and deployed with the ansible-playbook tool. For each net_response property, a new monitoring metric is created. The template parameters in Listing 15 are translated into the plugin configuration depicted in Listing 14, except for the name and recovery_tag parameters, which are used for the alert definition in the Kapacitor module.

{% for detail in internal.net_response %}

[[inputs.net_response]]
  protocol = "{{ detail.protocol }}"
  address = "{{ detail.host }}:{{ detail.listen_port }}"
  timeout = "{{ detail.timeout }}"

{% endfor %}

Listing 14: Network Response Input Plugin configuration file

...
sla:
  internal:
    net_response:
      - { name: String, host: String, protocol: String, listen_port: Integer,
          timeout: String, recovery_tag: String }
...

Listing 15: Template properties used to configure network response input plugin

HTTP Response Input Plugin This plugin is responsible for verifying HTTP/HTTPS connections. It sends a request using a certain HTTP method to the url and compares the response code to the return_code. It also has an option to compare the page content with return_content; however, the use of the return_content property is optional. If the response does not arrive before the response_timeout, the url is marked as unavailable, triggering the recovery_tag action. Mismatches in the return_code and return_content properties also trigger the recovery_tag action. Extra properties can be used if required: follow_redirects, when set to True, follows server redirects, and insecure_skip_verify, when set to True, skips chain and host verification. The plugin configuration is made using the Jinja2 templating language and deployed with the ansible-playbook tool. For each http_response property, a new monitoring metric is created. The template parameters in Listing 17 are translated into the plugin configuration depicted in Listing 16, except for the name, return_code and recovery_tag parameters, which are used for the alert definition in the Kapacitor module.

{% for detail in internal.http_response %}

[[inputs.http_response]]
  address = "{{ detail.url }}"
  response_timeout = "{{ detail.response_timeout }}"
  method = "{{ detail.method }}"
  follow_redirects = {{ detail.follow_redirects }}
  insecure_skip_verify = {{ detail.insecure_skip_verify }}

{% if detail.return_content is defined and detail.return_content|length %}
  response_string_match = "{{ detail.return_content }}"
{% endif %}

{% endfor %}

Listing 16: HTTP Response Input Plugin configuration file

...
sla:
  internal:
    http_response:
      - { name: String, url: String, return_code: Integer, return_content: String,
          response_timeout: String, method: String, follow_redirects: Boolean,
          insecure_skip_verify: Boolean, recovery_tag: String }
...

Listing 17: Template properties used to configure HTTP response input plugin

Procstat Input Plugin This plugin is responsible for verifying the status of a system service. It obtains information about the service and compares it to the desired state. If the state does not match the obtained information, the recovery_tag action is triggered. The plugin configuration is made using the Jinja2 templating language and deployed with the ansible-playbook tool. For each procstat property, a new monitoring metric is created. The template parameters in Listing 19 are translated into the plugin configuration depicted in Listing 18, except for the name, state and recovery_tag parameters, which are used for the alert definition in the Kapacitor module.

{% for detail in internal.procstat %}

[[inputs.procstat]]
  systemd_unit = "{{ detail.service }}.service"

{% endfor %}

Listing 18: Procstat Input Plugin configuration file

Telegraf plugins can be configured in a single configuration file, but for organization purposes we separated them, one plugin per file. Using an Ansible playbook, the Telegraf plugins are dynamically configured in each VM, and the SLA requirements are loaded as Ansible variables. Then, according to the existence of the variables, a plugin is deployed with the defined properties. After Telegraf collects the metrics and formats them according to the correct data format, the data is sent to the monitoring system. The database used is InfluxDB, and it was set up out-of-the-box, without additional configuration. InfluxDB is a high-performance database specially made for time-series data, handling high write and query loads.

...
sla:
  internal:
    procstat:
      - { name: String, state: String, service: String, recovery_tag: String }
...

Listing 19: Template properties used to configure procstat input plugin

To visualize the metrics stored in InfluxDB, the TICK Stack provides a visualization tool called Chronograf, which can also be used to interact with Kapacitor to create and manage alert rules. One of the main features of Chronograf is the creation of dashboards to visualize different metrics on the same page simultaneously. Figure 4.3 shows an example dashboard with iPerf metrics.

Figure 4.3: Example of a Chronograf dashboard

Creating Kapacitor alerts requires selecting the metric to monitor and configuring the threshold value. When the metric value goes above the threshold, an alert is triggered. The action for the trigger is a POST request to the recovery engine with a Header Key specifying the recovery_tag associated with the alert and a Header Value specifying the metric name. With the received information, the recovery engine acts accordingly to re-establish the SLA rules. Kapacitor uses a Domain Specific Language (DSL), named TICKscript, to define tasks related to alert creation and trigger actions. In this dissertation, we used TICKscript to create alerts and trigger actions dynamically.

The TICKscript is sent to Kapacitor using a shell command, where the developed script accepts the SLA template parameters as arguments.
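The sketch below illustrates one way such a script could render a threshold alert as TICKscript and load it through the kapacitor command-line tool. It is illustrative only: the measurement and field names, the recovery engine URL and the database/retention policy (telegraf.autogen) are placeholders, and the real TICKscript used in this work also sets the header key/value pair described above.

import subprocess
import tempfile

TICK_TEMPLATE = """
stream
    |from()
        .measurement('{measurement}')
    |alert()
        .id('{name}')
        .crit(lambda: "{field}" < {threshold})
        .post('{recovery_engine_url}')
"""

def define_alert(name, measurement, field, threshold, recovery_engine_url):
    script = TICK_TEMPLATE.format(name=name, measurement=measurement, field=field,
                                  threshold=threshold,
                                  recovery_engine_url=recovery_engine_url)
    with tempfile.NamedTemporaryFile("w", suffix=".tick", delete=False) as fh:
        fh.write(script)
        path = fh.name
    subprocess.check_call(["kapacitor", "define", name,
                           "-type", "stream",
                           "-tick", path,
                           "-dbrp", "telegraf.autogen"])
    subprocess.check_call(["kapacitor", "enable", name])

# define_alert("tcp_bandwidth", "external-monitoring", "bandwidth", 50.0,
#              "http://monitoring-node:9094/alert")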

4.5.1 Monitoring dashboards During the service deployment, not only are the alert rules created; dashboards are also automatically deployed to visualize real-time information about the service status, including the dashboard showing the availability of SLA metrics as a percentage, shown in Figure 4.4, and the dashboard of metric statistics, shown in Figure 4.5.

Figure 4.4: Monitoring dashboard of SLA availability percentage

Figure 4.5: Monitoring dashboard of metric statistics

The dashboard in Figure 4.4 can be displayed to the client, ensuring that the SLA requirements are being met. On the other hand, Figure 4.5 shows the dashboard only

available to the cloud provider, to help understand when a failure happened and what kind of error occurred. The two dashboards are related, and the main difference between them is the amount of information displayed. The dashboard in Figure 4.4 only shows the percentage of time the service was considered available. The dashboard in Figure 4.5 shows the exact state of the service metric at the exact time.

4.6 Recovery engine

The recovery engine was designed and developed due to the need to recover the service from an SLA violation. The recovery methods are associated with the SLA metrics and defined in the SLA enhanced template. The essence of the recovery engine is that, upon receiving an alert with detailed information about the violated SLA metric, it matches the SLA metric to the associated recovery methods. Knowing the recovery methods, the recovery engine starts trying to recover the service, in the order defined in the template. The recovery engine only stops the recovery in two scenarios: the monitoring system reports that the SLA violation was successfully solved and the service state changes to OK, or the recovery engine runs out of recovery methods to try and the service state is changed to FAILED. The focus of this section is explaining the implementation of the recovery engine and the flow of the actions within it. Figure 4.6 depicts the implemented class diagram and the directly-associated classes. The Web Server class is what enables the recovery engine to receive requests from other components through POST requests. The main variables of the class are openstack_auth, the OpenStack authentication credentials used to interact with the OpenStack API, and database_auth, the OpenStack database credentials exclusively for recovery engine use, to store and retrieve the recovery methods. The reasons for using the OpenStack database, rather than deploying a separate one, are, first, avoiding the addition of another database to deploy and maintain, and second, availability: the OpenStack database is deployed in a highly-available mode, with better availability mechanisms than a separately deployed database would have. The recovery engine operation was separated into three primary functions. The api_post() function is the main one. Upon receiving a POST request from the monitoring system, it triggers a specific recovery action according to the rules stored in the database. As the monitoring system only reports metric state changes, meaning it only reports when the SLA metric changes status from OK to CRITICAL or the other way around, the api_post() function would only attempt to recover the service once. During the first service recovery attempt, if the service does not recover, the monitoring system will not alert again. To make sure the service is recovered, the second function, check_alert_status(), verifies every 10 seconds the status of the service recovery that was previously

Figure 4.6: Class diagram of the implemented solution and related classes

initiated by the api_post() function and, if possible, triggers another recovery method. The last function is store_recovery_methods(String json_methods), used to store the recovery methods received from the service manager during the service deployment. The recovery engine also has secondary functions: send_email() and service_restart(). The send_email() function is used to notify the client by sending an email with information about the alert (email_content) to the email_address specified in the template. The information can be either a failure to recover the service or a warning message about the SLA violation. The service_restart() function restarts the service; the ip_address and ssh_user variables it needs are obtained from the OpenStack Nova service. The send_email() and service_restart() functions are some of the implemented recovery methods. The remaining recovery methods are perform_vm_soft_reboot and perform_vm_hard_reboot. These methods are performed using the OpenStack API, resulting in a VM reboot. The soft reboot is executed using the cloud-init tool, which sends the command to the VM for it to perform a reboot. With a hard reboot, the VM process is killed at the hypervisor level and then started again. To demonstrate the implemented work, Figure 4.7 depicts the flow of the actions inside the api_post() function, which is the main function. The initial state is waiting for POST requests coming from the monitoring system. With no new requests, the program remains in the same state. After receiving a new request, the function extracts the alert information from the request and obtains the associated recovery rules from the database. If the recovery rules do not exist in the database, the function responds to the monitoring system with a rules_not_found message and a 200 success HTTP status code. The reason for the 200 code is that the monitoring system only verifies whether the alert message was successfully received or not; responding to the request with an error code would generate an error in the monitoring system log. Recovery rules may be missing if, during the SLA template specification, the client did not specify any recovery rules to deal with the SLA metric violation. Alternatively, something could have gone wrong during the service deployment, and the recovery methods were not stored in the database. For that reason, every significant event happening in the recovery engine is logged to a file. It can be useful to debug the system or to detect and understand wrong system behaviors. Figure 4.8 depicts the content of the log file produced by the recovery engine. Upon success in retrieving rules from the database, the function checks the alert level associated with the received request. If it is OK, the recovery status is set to OK, the recovery counter and recovery delay are reset, and the information is stored in the database and the event logged. The recovery counter is the number of recovery attempts

Figure 4.7: Flow of the action inside api_post() function

Figure 4.8: Recovery engine log file

performed so far. Each method is associated with the maximum number of times it can be performed. The recovery delay is the amount of time required between recovery attempts. Alternatively, if the alert status is CRITICAL, a recovery method is selected in the order specified in the template. Furthermore, the function verifies whether the selected method meets the conditions to be executed. Those conditions are:

• the method has not been executed more times than specified in the template;
• the delay time has passed since the last recovery attempt.

If any of the conditions is not satisfied, the currently selected recovery method is skipped, and the function tries to obtain the next recovery method. If there are no more methods to try, the recovery status is marked as FAILED. Otherwise, another method is selected and, if it passes the conditions, the function executes the recovery action. If the recovery action is notify, the send_email() function from the recovery engine is used to send the alert information to the client. If the recovery action is instance_soft_restart or instance_hard_restart, the OpenStack API is used to perform a soft or hard reboot, respectively. If the recovery action is service_restart, the OpenStack API is used to obtain the SSH user and IP address of the VM where the service is hosted. Having the SSH user and the IP address of the VM, a remote connection is made to execute the systemctl service restart command to restart the service. The check_alert_status() function is similar to the api_post() function. The first change is that the Wait for POST requests from the monitoring system state is replaced by a wait 10 seconds state. The Extract information from received alert state is removed and, instead of the Incoming request state, the function checks the database for any recovery status that is not OK. If it does not find anything, it returns to the wait 10 seconds state. Otherwise, it performs the remaining workflow in the same way as the api_post() function.
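A condensed Python sketch of the decision logic just described is shown below. Database access and the actual recovery calls are reduced to stub objects (db and engine), so the snippet is illustrative rather than the actual recovery engine code; the condition checks mirror the two conditions listed above.

import time

def handle_alert(recovery_tag, metric_name, level, db, engine):
    rules = db.get_recovery_rules(recovery_tag)
    if rules is None:
        return "rules_not_found", 200       # 200 so the monitoring system logs no error

    if level == "OK":
        db.reset_state(recovery_tag)        # status OK, counter and delay reset
        return "recovered", 200

    now = time.time()
    for action, params in rules:            # ordered as defined in the template
        state = db.get_action_state(recovery_tag, action)
        if state.attempts >= params["retries"]:
            continue                        # condition 1: retries exhausted
        if now - state.last_attempt < params["delay_between_retries"]:
            continue                        # condition 2: delay not yet elapsed
        db.record_attempt(recovery_tag, action)
        engine.execute(action, params)      # service_restart, reboot or notify
        return "recovering", 200

    db.mark_failed(recovery_tag)            # no method left to try
    return "failed", 200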

CHAPTER 5

Evaluation and analysis

To evaluate and validate the dissertation work, we performed tests at different levels to measure different aspects. Through testing, it is possible to find limiting aspects before releasing the work to production.

5.1 Scenario

The scenario used for all of the tests is the scenario currently deployed in the SKA project. This scenario consists of deploying a GitLab runner service that is used by the Continuous Integration and Continuous Delivery (CI/CD) components of the GitLab platform to perform the desired tests related to the SKA repositories. The service configuration is performed using an Ansible script, with extra variables set in the template to automatically connect the runner to the desired GitLab repository. The service runs on top of a virtual machine with an m1.small flavor that consists of 1 vCPU, 2 GB of RAM and 10 GB of root disk. Additionally, there is a 10 GB Cinder volume to store the service-related data. The OS used is from a CentOS-7-x86_64-GenericCloud-1604-UP image, with the latest updates already installed to lower the setup time. The reason for a low resource specification is to lower the number of resources used and to perform faster tests. A separate virtual network was set up exclusively for these tests, to isolate them from other virtual machines present in OpenStack which are not related to this work. The separate virtual network also allows a large range of available IP addresses, which is useful as some tests require up to 100 IP addresses. The SLA monitoring is set up to monitor network performance and the sshd and gitlab-runner services. The recovery methods associated with the gitlab-runner service are service restart, instance soft reboot, and instance hard reboot, in that order. In case of failure, a recovery method is performed and, if unsuccessful, another recovery method is attempted after a delay of 60 seconds. The GitLab runner service

is not being used by CI/CD during the tests, but it is configured and active for GitLab usage. The complete template file is available in Appendix A. As part of the SKA project objective, it was intended to create a Highly Available OpenStack platform, where the total number of servers in the cluster is 12. From those, 8 servers are dedicated to compute, 2 are dedicated to controller functions, and 3 serve as storage. From the 3 storage servers, 2 are used for high volume storage, and the last is used as the root disk for virtual instances, with high-speed SSD access. The network module (Neutron) is configured in the servers dedicated to the controller nodes. Each server has CentOS 7 Minimal Edition installed, with only the necessary tools, and the newest stable OpenStack release at the time this dissertation started, Pike. Figure 5.1 shows the EngageSKA cluster composition. At the physical level, the cluster is inside a datacenter with redundancy features such as power, network, and HVAC redundancy. The power redundancy includes a UPS (Uninterruptible Power Supply) for short power outages, holding up to 30 minutes; for more extended periods, there is a diesel generator outside the building, providing a longer working time. Two core switches provide the network redundancy with two independent uplinks and, at the cluster level, there are two 10G switches connected to two ports of each server. They are configured to load-balance the traffic at the OS level.

Controller node The controller service is constituted by two servers, to provide redundancy and high availability, and controls the OpenStack modules. The controller node is responsible for providing an authentication method (Keystone) and metadata storage (MariaDB) for each module (such as Keystone, Cinder, Nova, and others). Using an API, the controller node manages other nodes, like Nova (compute node), Cinder (storage node), and Neutron (network node), in order to get them to work in synchrony with each other. It is essential to have a synchronized rhythm; otherwise, due to wrong timing, the modules may issue errors. With the help of the Network Time Protocol (NTP), all nodes are synchronized and get timing information regularly. The NTP server could be local or, as in our case, provided by the University of Aveiro Computing Centre NTP server. Precision Time Protocol (PTP) could also be used. RabbitMQ is an open-source message broker responsible for the communication between nodes in an OpenStack infrastructure; the RabbitMQ server is located at the controller node. In addition, the controller node has:

• An image service (Glance) that allows finding, creating and retrieving virtual machine images;
• A web-based user interface (Horizon) that allows easy maintenance of and interaction with the OpenStack platform;
• An orchestration mechanism (Heat) that allows the automation of resource provisioning.

Compute node The compute service is constituted by eight servers, each with two Intel Xeon E5-2650 v4 processors. Each processor has a base clock speed of 2.2GHz, 30MB of Intel Smart Cache memory, and a system bus speed of 9.6 GT/s with two QPI links. With Intel Turbo Boost technology, the clock speed can go up to 2.9GHz when there is a need for more computing power. Additionally, it has 12 cores and 24 threads provided by Intel Hyper-Threading Technology, allowing a highly-threaded application to finish its work more quickly. In total, the cluster has 192 CPU cores and 384 threads for the compute service. For virtualization purposes, this processor has the Intel Virtualization Technology for Directed I/O (VT-d) feature, allowing a virtual machine to have direct access to hardware components, while also having the benefit of improving virtualization performance. One of the most significant advantages of virtualization is the possibility of over-provisioning resources, making it possible to offer more virtual resources than are physically available. Each server has two Intel S3710 Enterprise Performance SATA 3.0 solid-state drives, each with a capacity of 400GB. This drive interface can provide bandwidth of up to 6Gbps, with high write performance and endurance, ideal for High-Performance Computing (HPC), configured in a RAID (Redundant Array of Inexpensive Disks) 1 configuration. Among the eight servers, 3 have 8 RAM blades of 64GB each, for a total of 512GB of RAM, while the five remaining ones have four RAM blades of 32GB each, for a total of 128GB of RAM. All RAM blades are of type RDIMM with a speed of 2400 MT/s, dual rank, and x4 data width. Adding all available RAM gives a total of 2176 GB (≈2.18 TB). Each server is equipped with an Intel(R) 2P X710/2P I350 rNDC network adapter with two dual-port 10 Gigabit SFP+ interfaces, each port connected to one of two stacked Dell N4032F switches, allowing 10 GigE link redundancy. In addition, the network adapter also has dual-port 1 Gigabit Ethernet, which is not used at the moment but could be used to increase redundancy in case both 10 Gigabit SFP+ ports are lost. For power redundancy, there are two Power Supply Units (PSUs) of 750W each, giving the

possibility of distributing the power load between them or of continuing regular server operation in case one of the PSUs fails. The compute node has the purpose of providing computing resources, such as RAM and processing units, to the virtual machines that are deployed and run on it. To be able to create virtual machines, a technology that enables virtualization is required. Concerning that requirement, Libvirt is used, an open-source API that can manage virtualization platforms like KVM, Xen, VMware ESX, QEMU, and other virtualization technologies.

Storage node Two different storage nodes constitute the storage service. Both use GlusterFS, because this technology can handle significant volumes of data and easily scales up, handling 64 Terabytes (TB) per node, as recommended. Unfortunately, OpenStack stopped supporting the GlusterFS driver with the Newton release; however, it is possible to expose the volumes through a Network File System (NFS) server. The first storage service is powered by one server with eight 1.92 Terabyte (TB) Solid-State Drives (SSDs), for a total of 15.36TB, to provide higher IOPS and better access times. The second storage service is powered by two servers, each with eight 1.8 Terabyte (TB) SAS disks, for a total of 14.4 Terabytes (TB) per server. This storage service can have one of two purposes: providing a large backup volume or a high-availability volume. In the case of backup storage, GlusterFS creates a distributed volume between the two servers, giving a total of 28.8 Terabytes (TB) of disk storage. If there is more demand for highly-available storage, GlusterFS creates a replicated volume, mirroring data between the two servers and, in case of failure, providing a replica.

Figure 5.1: EngageSKA cluster at the Datacenter of Telecommunication Institute in Aveiro

5.2 Limitations

During the initial tests, we found some limitations related to the Nova and Heat APIs. Also, during the measurement of recovery time, we found a significant delay between the detection of an SLA violation and the alert processing.

5.2.1 OpenStack platform This limitation consists of only allowing 10 simultaneous requests to the Nova or Heat API. If the request limit is exceeded, the service crashes, canceling the present requests and also the previous ones not yet completed. This issue was found during an attempt to deploy more than 15 service stacks simultaneously using the Heat API. The same happened when trying to gather information such as IP address, hostname, and other characteristics about the virtual instances from the Nova API. Issuing around 15

requests resulted in a service crash, returning a 503 error code. The connection to the Heat and Nova APIs was made using the Python heatclient and novaclient libraries, respectively. Figure 5.2 depicts the limit of simultaneous stack deployments using the Heat API. In this test, we found a limitation where the deployment of more than 11 simultaneous stacks resulted in a crash of the Heat service. The purpose of this test is to demonstrate the service crash during the deployment of 11 stacks, not to measure the time to deploy multiple stacks. To prevent the service crash, we limited the maximum number of simultaneous requests to 5.

Figure 5.2: Simultaneous service stack deployment
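A simple way to enforce such a bound is to funnel all deployment requests through a worker pool with five workers, as in the sketch below (shown with Python 3's concurrent.futures; the original test scripts used Python 2.7). The deploy_stack callable stands for the actual call made through the heatclient library.

from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_SIMULTANEOUS_REQUESTS = 5

def deploy_many(stack_definitions, deploy_stack):
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_SIMULTANEOUS_REQUESTS) as pool:
        futures = {pool.submit(deploy_stack, d): d["name"] for d in stack_definitions}
        for future in as_completed(futures):
            results[futures[future]] = future.result()   # raises if a deployment failed
    return results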

5.2.2 SLA metric processing This limitation was found while measuring the SLA violation detection time against the monitoring interval. When the monitoring interval is relatively small (0 to 10 seconds), the registered service recovery time is about the same. This is caused by the Kapacitor module, as it only processes alerts in intervals of 10 seconds, thus evaluating metrics for SLA violations only 7 times per minute. For example, if the evaluation starts at 10:00 and finishes at 10:01, Kapacitor checks at the following times: 10:00:00, 10:00:10, 10:00:20, 10:00:30, 10:00:40, 10:00:50 and 10:01:00. We were not able to find any variable in the configuration file to change that interval. Figure 5.3 shows the results of the tests, which consisted of measuring the average time between the detection of an SLA metric violation and the alert from the monitoring system being received by the recovery engine. This test was performed 20 times, the first 10 as

initial test results, and the second 10 to confirm the test results. The graphical representation of the obtained results is a v-shaped line, where the odd monitoring intervals (5, 15, 25, 35, 45, and 55) have a lower delay time, on average 8 seconds, while the even intervals (10, 20, 30, 40, 50, and 60) have a higher delay, always 10 seconds. This can be explained with the following examples.

Figure 5.3: Kapacitor delay to process metrics

In the case of an odd interval, if the SLA metric arrives at 10:00:05, the next Kapacitor check is at 10:00:10, taking 5 seconds. If the metric arrives at 10:00:40, Kapacitor will not process it in that check, only being able to process it at 10:00:50, taking 10 seconds. This results in an average of about 8 seconds to process an odd monitoring interval. In the case of an even interval, the SLA metric always arrives at the same time as Kapacitor starts to process. For example, if the metric arrives at 10:00:30, Kapacitor will only be able to process it at 10:00:40, so an even monitoring interval always takes 10 seconds to process. From this test, we can conclude that an odd interval results in a faster overall service recovery time, due to the lower delay in processing the SLA metrics.

5.3 Results

The results were obtained in an automated way using Python 2.7.5 and Bash scripting, allowing test consistency, and the repetition of the test in case of a failure. This section will explain the performed tests and the obtained results.

5.3.1 Service deployment steps In this test, we measure the amount of time required to deploy the service and each of its steps. The objective of this test is to obtain the time required to deploy the service, dividing the time into independent steps. With the time obtained for each step, we can predict how much time will be needed to deploy a service. For this test, we used the template available in Appendix A. The service deployment is divided into four significant steps. The first step is the template analysis, checking for possible structural errors. The second step is deploying the virtual resources to support the service in the form of a Heat stack. After the virtual resources are ready, the third step is to perform the software configuration using an Ansible playbook. The last step is the configuration of the monitoring agent on the virtual resources and the establishment of the SLA rules in the monitoring node. The test was repeated ten times, with the average values presented in Figure 5.4. The template analysis took 0.15 seconds, the stack deployment took 39.8 seconds, the software configuration took 96.11 seconds, and the monitoring configuration took 40.58 seconds.

Figure 5.4: Duration of each step from service deployment

The template analysis time is insignificant, not even reaching one second. The stack deployment time depends on the number of virtual resources to deploy. In the used template, only two virtual resources are deployed: a virtual instance and a volume. Deploying more resources will add time to this step. The software configuration can take a varying amount of time, depending on the content to

install, the Internet connection, the virtual instance configuration, and other factors. The monitoring configuration time depends on the number of resources to be monitored and on the Internet connection, which is required to install the monitoring agent. On average, the service deployment time is around 182 seconds (3 minutes). More complex template configurations can take longer to deploy, but going from nothing to infrastructure support, service configuration, and a monitoring agent with SLA rules in 3 minutes is, in our view, a suitable and satisfying result. Another important time is the service removal time, that is, the time required to remove the infrastructure and the monitoring rules at the monitoring node. This step took around 18.6 seconds.
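As a rough illustration of how such per-step timings can be collected, the sketch below wraps each deployment phase in a simple timer. The step functions are hypothetical placeholders standing in for the real calls to the service manager, Heat, Ansible, and the monitoring configuration; the sleeps only stand in for the actual work.

import time


def timed(label, step, *args):
    # Run one deployment step and report how long it took.
    start = time.time()
    result = step(*args)
    print("%-26s %6.2f s" % (label, time.time() - start))
    return result


# Placeholder step functions (hypothetical names, not the real implementation).
def analyze_template(path):
    time.sleep(0.1)


def deploy_stack(stack_name):
    time.sleep(1.0)


def run_playbook(playbook):
    time.sleep(1.5)


def configure_monitoring(stack_name):
    time.sleep(1.0)


timed("template analysis", analyze_template, "gitlab_runner.yaml")
timed("stack deployment", deploy_stack, "runner-stack")
timed("software configuration", run_playbook, "deploy_runners.yaml")
timed("monitoring configuration", configure_monitoring, "runner-stack")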

5.3.2 Multiple service deployment time
In this test, we measure the amount of time required to deploy multiple services. The objective is to obtain the time required to deploy a high number of services and to analyze how that time evolves as the number of services increases, trying to reach a limit. With the obtained results, we can understand the limits of the developed work and estimate the deployment time of multiple services. For this test, we used the template available in Appendix A.
As explained in Section 5.2, the maximum number of simultaneous requests to the Nova and Heat APIs is around 10. We decided to round it down to a maximum of 5 simultaneous requests because the test would sometimes fail with 10 simultaneous requests. Since the maximum simultaneous service deployment is 5, the number of deployed services in this test and the following ones is a multiple of 5. This means that deploying 1 or 5 services results (on average) in the same deployment time, as they are deployed in parallel.
Figure 5.5 depicts the obtained results of this test. As the cluster is shared, being currently also used by members of SKA, this test was only performed once. It required 1 x 100 = 100 vCPUs, 2 GB x 100 = 200 GB of RAM, 10 GB x 100 = 1 TB of root disk, 10 GB x 100 = 1 TB of Cinder volumes, and 1 IP x 100 = 100 IP addresses. Of all the performed tests, this is the most resource-demanding and time-consuming one. Another reason for not repeating the test is that the obtained results were as expected.
As expected, the service deployment time grows linearly with the number of services to deploy. Deploying 5 services takes around 3 minutes and 12 seconds, and deploying 100 services took around 1 hour and 6 minutes. The limitation of 5 simultaneous deployments greatly influences the deployment time for a high number of services. This time could be improved by increasing the number of simultaneous requests, but careful verification is needed not to exceed the maximum allowed number of requests.

Figure 5.5: Deployment time over number of services

Increasing the maximum allowed number of requests in the Nova and Heat APIs would also help decrease the deployment time significantly. However, in the current situation, it is safer to stay at 5 simultaneous requests.
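A minimal sketch of how such a concurrency cap can be enforced is shown below. It assumes a hypothetical deploy_service function that creates one Heat stack from the GitLab Runner template and blocks until the stack is complete; the thread pool simply limits the number of in-flight deployments to 5.

from multiprocessing.pool import ThreadPool

MAX_SIMULTANEOUS = 5  # empirically safe limit for simultaneous Nova/Heat requests


def deploy_service(index):
    # Hypothetical placeholder: create one Heat stack from the template in
    # Appendix A and wait until it reaches CREATE_COMPLETE.
    return "runner-stack-%d" % index


def deploy_many(count):
    # Deploy `count` services with at most MAX_SIMULTANEOUS in flight.
    pool = ThreadPool(MAX_SIMULTANEOUS)
    try:
        return pool.map(deploy_service, range(count))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print("%d services deployed" % len(deploy_many(100)))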

5.3.3 SLA violation detection time
In this test, we measure the amount of time required for the monitoring service to receive the information about an SLA violation with different SLA monitoring intervals. The objective is to obtain the time required to detect an SLA violation with monitoring intervals from 5 to 60 seconds. With the obtained results, we can estimate the average time required for the monitoring system to detect an SLA violation. This way, each service can be given an adequate SLA monitoring interval according to its requirements. For this test, we used the template available in Appendix A.
The relevant part of the template for this test is the SLA monitoring definition of the GitLab Runner, depicted in Listing 20. If the service state is different from running, the monitoring system will mark the SLA metric as violated and issue the alert. The test consists of creating an SLA violation by stopping the gitlab-runner service and measuring the time between the service stop and the monitoring service receiving the information about the SLA violation. Stopping the service was chosen because of the simplicity of creating an SLA violation this way; a violation of, for example, network bandwidth is more complex to create and would result in the same detection interval, because every SLA metric is evaluated at the same time.

...
resources:
  resource_name:
    ...
    sla:
      internal:
        procstat:
          - { name: 'gitlab-runner', state: 'running', service: 'gitlab-runner',
              recovery_tag: 'gitlab-runner_recovery' }

Listing 20: GitLab Runner SLA monitoring

Figure 5.6 depicts the results obtained for this test. The test was repeated 20 times, and average values are used to present the results. The detection time increases as the monitoring interval increases, which is expected, although the 40-second monitoring interval took on average 16 seconds to detect an SLA violation while the 35-second interval took on average 21 seconds; more tests would be required to refine these results. To obtain the detection time, we stored the time of the service failure and manually compared it with the time of the SLA violation at the monitoring system, so this test could not be completely automated. With the obtained results, we can see that the average violation detection time tends to be half of the monitoring interval.
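This tendency is consistent with a simple model: if the failure occurs at a uniformly random instant within a monitoring interval of length T, the expected wait until the next metric report is

E[t_detect] ≈ T / 2

which matches the overall trend observed in Figure 5.6, with the remaining deviations attributable to measurement noise and the metric processing delay discussed in Section 5.2.2.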

Figure 5.6: Detection time of an SLA violation for different monitoring intervals

5.3.4 Recovery method comparison
In this test, we measure the amount of time required to recover the service from an SLA violation with different recovery methods. The objective of this test is to obtain the

time required by each recovery method to recover the service and to find which one is the fastest. With the obtained results, we can estimate the average time to recover the service. This way, each service can be given an adequate recovery method according to its requirements. For this test, we used the template available in Appendix A.
The relevant part of the template for this test is the recovery method definition for the GitLab Runner, as depicted in Listing 21. To trigger the service_restart method, it has to be the first entry of the gitlab-runner_recovery list under the recovery property, as depicted in Listing 21. To trigger instance_soft_restart, we changed the recovery order by putting instance_soft_restart in the first place, and we did the same for instance_hard_restart. If the service state is different from running, the monitoring system marks the SLA metric as violated and issues the alert to the recovery engine. The recovery engine then selects the recovery method and issues the request to the orchestrator.

...
resources:
  resources_name:
    ...
    sla:
      internal:
        procstat:
          - { name: 'gitlab-runner', state: 'running', service: 'gitlab-runner',
              recovery_tag: 'gitlab-runner_recovery' }
    recovery:
      gitlab-runner_recovery:
        - service_restart: { service: 'gitlab-runner', retries: 1, delay_between_retries: 60 }
        - instance_soft_restart: { retries: 1, delay_between_retries: 60 }
        - instance_hard_restart: { retries: 1, delay_between_retries: 60 }

Listing 21: Recovery methods for GitLab Runner service

There are three available recovery methods: service restart, instance soft reboot, and instance hard reboot. To compare the three recovery methods, we used one gitlab-runner service, issued the service stop command, and waited for the service to become available again. The time to detect the SLA violation is not taken into account in this test. The measurement was performed 10 times for each recovery method. The test results are depicted in Figure 5.7. From the test, we can conclude that the fastest recovery method is the service restart, taking on average 6 seconds. However, if SSH access is compromised, it is only possible to soft or hard reboot the instance. Between those two, the measured times are similar, taking on average 23 to 24 seconds.

Figure 5.7: Recovery method comparison
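As an illustration of the measurement procedure, the sketch below is a simplified, hypothetical version of the test harness: it stops the gitlab-runner service and polls until the recovery engine has brought it back. It assumes the service is managed by systemd on the instance and that the script runs there with sufficient privileges; unlike the reported results, this simple form does not subtract the SLA detection time.

import subprocess
import time


def service_active(name):
    # "systemctl is-active --quiet" exits with 0 only when the unit is running.
    return subprocess.call(["systemctl", "is-active", "--quiet", name]) == 0


def measure_recovery(name="gitlab-runner", poll_interval=0.5):
    # Stop the service to provoke the SLA violation, then wait until the
    # recovery engine has restarted it and return the observed downtime.
    subprocess.call(["sudo", "systemctl", "stop", name])
    start = time.time()
    while not service_active(name):
        time.sleep(poll_interval)
    return time.time() - start


if __name__ == "__main__":
    print("service recovered after %.1f s" % measure_recovery())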

5.3.5 Multiple service recovery
In this test, we measure the amount of time required to recover multiple services from simultaneous SLA violations. The objective is to understand whether the developed system can handle multiple simultaneous SLA violations and how they influence the recovery time. With the obtained results, we can understand the limits and the potential of the system to counteract simultaneous failures. This test also helps to find and establish a limit on the number of services under the system's monitoring before reaching a system failure. For this test, we used the template available in Appendix A.
The test was repeated 10 times and consisted of creating an SLA violation by simultaneously stopping all deployed gitlab-runner services and measuring the time between when a service stopped working and when it got back online. Stopping the services was chosen because of the simplicity of creating an SLA violation this way; a violation of, for example, network bandwidth is more complex to create and would result in the same measurements. The monitoring interval used in this test is 30 seconds, the default value of the monitoring system. We started by measuring from 1 to 5 simultaneous service failures and then increased in steps of 10, starting at 10 and ending at 50 services.
The test results are depicted in Figure 5.8. The first six results were expected to be similar, as the number of services is low, with a recovery time of 35 seconds on average. After that, the recovery time started to increase. The increase in

time between 10 and 20 simultaneous SLA violations is on average 5 seconds, between 20 and 30 it is 13 seconds, between 30 and 40 it is 9 seconds, and the biggest difference is between 40 and 50, at 22 seconds. During the test, all of the services were recovered with a 100% success rate. However, they required different amounts of time to recover according to the number of services being recovered. We assume those time increments are due to reaching the maximum number of simultaneous Nova API requests (the limitation described in Section 5.2), as the Nova API is used to obtain the service IP address needed to perform the recovery. When the Nova API limit is exceeded, some requests go into a queue and wait until the number of requests is below the limit. This test shows that the dissertation work can handle a high number of simultaneous service failures.

Figure 5.8: Service recovery time with different quantity of services
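A possible way to provoke such simultaneous failures is sketched below: each deployed instance is contacted over SSH, its gitlab-runner service is stopped at approximately the same time, and the per-service downtime is recorded. The instance IP addresses and the SSH user are hypothetical and assume key-based access to the instances.

import subprocess
import time
from multiprocessing.pool import ThreadPool

# Hypothetical floating IPs of the deployed GitLab Runner instances.
INSTANCES = ["10.0.0.%d" % i for i in range(11, 21)]


def stop_and_wait(ip):
    # Stop the gitlab-runner service over SSH and poll until the recovery
    # engine has restarted it, returning the observed downtime in seconds.
    ssh = ["ssh", "centos@%s" % ip]
    subprocess.call(ssh + ["sudo", "systemctl", "stop", "gitlab-runner"])
    start = time.time()
    while subprocess.call(ssh + ["systemctl", "is-active", "--quiet",
                                 "gitlab-runner"]) != 0:
        time.sleep(1)
    return time.time() - start


# One thread per instance so all services are stopped at (roughly) the same time.
pool = ThreadPool(len(INSTANCES))
downtimes = pool.map(stop_and_wait, INSTANCES)
pool.close()
pool.join()
print("average recovery time: %.1f s" % (sum(downtimes) / len(downtimes)))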

5.3.6 Network analysis
The monitoring system uses the same network as the regular services, generating additional network traffic. The following tests analyze that additional traffic and how harmful it is to the regular network traffic.

Network bandwidth according to different monitoring intervals
In this test, we measure the throughput coming into and going out of the monitoring service. The objective is to understand how the monitoring system affects the network load with different monitoring intervals. With the obtained results, we can predict the additional network load caused by the monitoring system and understand whether it is harmful to the regular traffic on the network, that is, the traffic caused by the services themselves.

The service periodically reports SLA metrics to the monitoring system, generating additional traffic on the network. The monitoring system also sends information back to the service, confirming the received SLA metrics. By observing the input and output network traffic of the monitoring system, we can understand the amount of monitoring data traveling across the network in a given interval.
This test was repeated 10 times and consisted of measuring the average throughput at the monitoring system during a 30-minute interval, generated by one deployed gitlab-runner service (template available in Appendix A). The throughput is calculated using the number of bytes received (rx_bytes) or transmitted (tx_bytes), obtained from the network interface of the monitoring system. After 30 minutes, we obtain the difference, which translates into the number of bytes received or transmitted during the 30-minute interval. We then divide the difference by the 30-minute interval and convert it to Kilobytes. After obtaining the throughput, we change the monitoring interval and repeat the same procedure. The current number of bytes received or transmitted at a network device in Linux can be obtained from a file: the number of bytes received is available at /sys/class/net/eth0/statistics/rx_bytes and the number of bytes transmitted at /sys/class/net/eth0/statistics/tx_bytes.
The test results are depicted in Figure 5.9. An obvious observation from this test is the decrease in both received and transmitted throughput as the monitoring interval increases. This behavior was expected: the higher the monitoring interval, the fewer times the service reports SLA metrics to the monitoring system (received throughput), resulting in fewer responses to the service confirming the reception of the SLA metrics (transmitted throughput). We also consider the transmitted throughput irrelevant, because, on average, it is 235 times smaller than the received throughput.
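The sketch below follows the measurement procedure just described; it is a minimal version of the measurement script, assuming it runs on the monitoring node and that eth0 is the interface carrying the monitoring traffic.

import time

RX = "/sys/class/net/eth0/statistics/rx_bytes"
TX = "/sys/class/net/eth0/statistics/tx_bytes"


def read_counter(path):
    # The kernel exposes the cumulative byte counters as plain text files.
    with open(path) as counter:
        return int(counter.read())


def throughput_kbps(duration=30 * 60):
    # Sample both counters, wait for the measurement window, sample again,
    # and convert the byte difference into KB/s.
    rx0, tx0 = read_counter(RX), read_counter(TX)
    time.sleep(duration)
    rx1, tx1 = read_counter(RX), read_counter(TX)
    return ((rx1 - rx0) / 1024.0 / duration,
            (tx1 - tx0) / 1024.0 / duration)


if __name__ == "__main__":
    rx_kbps, tx_kbps = throughput_kbps()
    print("received:    %.2f KB/s" % rx_kbps)
    print("transmitted: %.2f KB/s" % tx_kbps)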

Figure 5.9: Throughput at monitoring system according to different monitoring intervals

Monitoring SLA metrics of one service, with a 5-second interval, resulted in a

received throughput of, on average, 550 KB/s. This should not affect the regular traffic on the network. However, if we assume we had 100 services and each additional service added 550 KB/s of monitoring traffic to the network, it would result in a total throughput of 55 MB/s. In a network with 100 MB/s of bandwidth, that would cause a problem, as the monitoring data would occupy 55% of the available bandwidth, leaving only 45% for the regular traffic of the 100 services. However, in our scenario, we have 10 Gigabit (1.25 GB/s) of available bandwidth; the monitoring data would occupy around 4.4% of the network throughput and leave around 95.6% of the available bandwidth to the regular traffic of the 100 services.
Monitoring the SLA metrics with a 5-second interval gives greater precision compared to other monitoring intervals, but the majority of the services may not require such precision. The default value of the monitoring interval is 30 seconds, resulting in a received throughput of 93 KB/s. Multiplying this by 100 services, we obtain a throughput of 9.3 MB/s. That amount of throughput can be supported in a 100 MB/s network without disrupting the regular network traffic: it occupies 9.3% of the available bandwidth and leaves 90.7% to the regular traffic. We cannot decide on one perfect monitoring interval, as services have different requirements. While critical services might require more frequent monitoring checks to reach high availability, others might not. We decided to leave the monitoring interval at the default value (30 seconds) and, if more frequent checks are necessary for a critical service, to change it manually for that service alone.

Network bandwidth according to service quantity
In this test, we measure the throughput coming into and going out of the monitoring service for different numbers of services. The objective is to understand how the monitoring system affects the network load with a different number of services. With the obtained results, we can predict the additional network load caused by the monitoring system and understand whether it is harmful to the regular traffic on the network, that is, the traffic caused by the services.
This test is very similar to the previous one, in which we obtained the network throughput at different monitoring intervals. However, in this test, we vary the number of services and fix the monitoring interval at the default value of 30 seconds. The method used to obtain the throughput is the one described in the previous test. The test depicted in Figure 5.10 was repeated 10 times. An obvious observation from this test is the increase in both received and transmitted throughput as the number of services increases. This behavior was expected: more services reporting SLA metrics

means more SLA metrics received by the monitoring system (received throughput), resulting in more responses to the services confirming the reception of the SLA metrics (transmitted throughput). We also consider the transmitted throughput irrelevant, because, on average, it is 350 times smaller than the received throughput. Another observation from this test is the linear growth of the throughput: for each additional service, the received throughput increases by 93 KB/s and the transmitted throughput by, on average, 0.22 KB/s. The reason for including 0 services in the test was to understand how much throughput the monitoring system receives and transmits without any SLA metrics, which is 0.23 KB/s of received throughput and 0.22 KB/s of transmitted throughput.
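From these averages, and for the default 30-second interval, the monitoring load at the monitoring node can be approximated by a simple linear model in the number of monitored services n:

R_rx(n) ≈ 0.23 + 93 n   KB/s        R_tx(n) ≈ 0.22 + 0.22 n   KB/s

For n = 10 this gives about 930 KB/s of received throughput, matching the measurement reported below.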

Figure 5.10: Throughput at monitoring system according to a different number of services

Monitoring the SLA metrics of 10 services, with a 30-second interval, resulted in a received throughput of, on average, 930 KB/s. This should not affect the regular traffic on the network. However, if we assume we had 1000 services and each additional service added 93 KB/s of monitoring traffic to the network, it would result in a total throughput of 93 MB/s. In a network with 100 MB/s of bandwidth, that would cause a problem, as the monitoring data would occupy 93% of the available bandwidth, leaving only 7% for the regular traffic of the 1000 services. However, in our scenario, the monitoring data would occupy around 7.45% of the network throughput and leave around 92.55% of the available bandwidth to the regular traffic of the 1000 services.
With these two tests, we concluded that the number of services is not the limiting aspect, but rather the monitoring interval. The lower the monitoring interval, the more monitoring information flows through the network, so the monitoring interval should be chosen according to each service's needs.


CHAPTER 6

Conclusion

This work presented the path of joining the benefits of a cloud environment, which provisions virtual resources on demand, with a monitoring system that monitors and ensures the availability of the service. Furthermore, by adding automatic recovery methods, the service availability is reestablished in case of a failure. Being part of the SKA project and observing the work developed there created the motivation to build mechanisms for the high availability of critical services hosted in a cloud environment.
Thus, we committed to creating a solution able to solve the availability issues of critical services, leading to an objective that can be split into three parts. The first is to define a structure for the template file that describes the service and the SLA rules. The service description must include the required resources to provision and the installation script in Ansible format. The SLA rules must include the description of the metrics to monitor with their respective thresholds, and should also include the recovery methods associated with the monitored metrics. The second objective is to perform the template analysis and to deploy the service within the cloud environment using the orchestrator. The third objective is to monitor the service continuously: if one of the service metrics crosses the established threshold, the recovery methods must be triggered in the order specified in the template, and upon recovery failure, the recovery engine must notify the user.
The state of the art in this dissertation had the role of presenting recent technologies and currently used methods, covering different cloud environments, monitoring systems, and availability metrics, among others, and discussing the advantages and drawbacks of each technology. With the motivation, the defined objectives, and the state of the art analysis, a design for a solution was proposed. The solution includes an analysis of possible SLA metrics to monitor, a component diagram describing the interaction between the components, the possible

template structure for the service definition, the operation workflow, and the recovery methods in case of an SLA violation.
Moreover, the implementation followed the proposed solution architecture. The solution made use of the OpenStack cloud environment and monitored the SLA with the TICK Stack monitoring system. Template analysis and service recovery required the development of new modules, namely the service manager and the recovery engine. The implementation successfully realized the proposed solution and completed the objectives.
Finally, to validate the implemented system, an evaluation and analysis took place. By performing tests, presenting the obtained results, and discussing them, the dissertation presented satisfactory results. However, the obtained results also revealed some aspects to be improved in the future. It is important to note the amount of work required to set up the OpenStack platform and add the highly available mode to it, which is crucial for sustaining the work presented in this dissertation. The OpenStack community considered implementing SLA monitoring at the Enterprise level, and this dissertation work demonstrated SLA monitoring in practice, also adding automated mechanisms to reestablish the agreed SLA to an extent.

6.1 Future work

Although the work developed in this dissertation is functional and currently used in a real-world scenario, some of the planned functionalities were not developed due to lack of time. The functionalities to implement in the future can be split into two categories.

Monitoring metrics
• Monitor Internet connectivity bandwidth: measure the Internet connection speed between the local service and a remote site, for example by downloading a large file and measuring the time taken to download it.
• Accurate service availability: use the uptime of the virtual instance to obtain the exact service uptime.
• Measure disk performance: obtain the disk performance of each virtual instance using the iostat tool.

Recovery methods
• Service re-build: re-build the entire service stack or only the failed resources.
• Migration recovery: migrate virtual resources to another healthy server if the current server is under high load.

Another feature to include in the future is the specification of the monitoring interval in each service template, allowing custom monitoring intervals for critical services. This way, each monitoring interval would match the priority of the service, lowering the monitoring traffic on the network compared to setting the same low monitoring interval for every service.


APPENDIX A

GitLab runner template

heat_template_version: 2013-05-23
heat_stack_name: runner-stack
description: GitLab Runner installation

resources:
  GitLabRunnerNode:
    type: OS::Nova::Server
    content:
      ansible:
        file: 'ansible-playbooks/deploy_runners.yaml'
        extra-vars: "token='TOKEN' runnername='auto-runner-stack' taglist='tester'"
    sla:
      external:
        network:
          - { name: 'tcp_MonitoringNode', protocol: 'tcp', probe_bandwidth_mb: '2',
              min_bandwidth_mb: '1', probe_interval: '5', recovery_tag: 'network_alert',
              monitoring_function: { period: "5m", every: "30s", function: "median" } }
          - { name: 'udp_MonitoringNode', protocol: 'udp', probe_bandwidth_mb: '2',
              min_bandwidth_mb: '1', max_jitter_ms: '0.1', max_loss_percentage: '5',
              probe_interval: '5', recovery_tag: 'network_alert',
              monitoring_function: { period: "5m", every: "30s", function: "median" } }
      internal:
        net_response:
          - { name: 'sshd', host: 'localhost', protocol: 'tcp', listen_port: '22',
              timeout: '1s', recovery_tag: 'sshd_recovery' }
        procstat:
          - { name: 'gitlab-runner', state: 'running', service: 'gitlab-runner',
              recovery_tag: 'gitlab-runner_recovery' }
          - { name: 'sshd', state: 'running', service: 'sshd', recovery_tag: 'sshd_recovery' }
        http_response:
          - { name: 'google', url: 'http://google.com', return_code: '200',
              response_timeout: '10s', method: 'GET', follow_redirects: 'true',
              insecure_skip_verify: 'true', recovery_tag: 'network_alert' }
    recovery:
      sshd_recovery:
        - instance_soft_restart: { retries: 1, delay_between_retries: 60 }
        - delay: { delay_between_retries: 60 }
        - instance_hard_restart: { retries: 1, delay_between_retries: 60 }
        - delay: { delay_between_retries: 60 }
        - notify: { email: '[email protected]' }
      gitlab-runner_recovery:
        - service_restart: { service: 'gitlab-runner', retries: 1, delay_between_retries: 60 }
        - delay: { delay_between_retries: 60 }
        - instance_soft_restart: { retries: 1, delay_between_retries: 60 }
        - delay: { delay_between_retries: 60 }
        - instance_hard_restart: { retries: 1, delay_between_retries: 60 }
        - delay: { delay_between_retries: 60 }
        - notify: { email: '[email protected]' }
      network_alert:
        - notify: { email: '[email protected]' }
    properties:
      image: CentOS-7-x86_64-GenericCloud-1604-UP
      flavor: m1.small
      key_name: tick-stack
      networks:
        - network: int_net

  GitLabRunnerVolume:
    type: OS::Cinder::Volume
    properties:
      description: GitLabRunnerVolume
      size: 10

  GitLabRunnerVolumeAttachment:
    type: OS::Cinder::VolumeAttachment
    properties:
      volume_id: { get_resource: GitLabRunnerVolume }
      instance_uuid: { get_resource: GitLabRunnerNode }

Listing 22: GitLab runner template
