UPTEC IT 19013, Degree Project 30 hp, August 2019

A Service for Provisioning Compute Infrastructure in the Cloud

Tony Wang

Department of Information Technology

Abstract
A Service for Provisioning Compute Infrastructure in the Cloud
Tony Wang

The amount of data has grown tremendously over the last decade. Cloud computing is a solution for handling large-scale computations and immense data sets. However, cloud computing comes with a multitude of challenges that the scientists who use the data have to tackle. Provisioning and orchestrating cloud infrastructure is a challenge in itself, given the wide variety of applications and cloud providers that are available. This thesis explores the idea of simplifying the provisioning of compute applications in the cloud. The result of this work is a service which can seamlessly provision and execute cloud computations using different applications and cloud providers.


Supervisor: Salman Toor
Subject reviewer: Sverker Holmgren
Examiner: Lars-Åke Nordén
UPTEC IT 19013
Printed by: Reprocentralen ITC

Contents

1 Introduction

2 Background
   2.1 Cloud Computing Concepts and Obstacles
   2.2 Scientific Computing
   2.3 HASTE Project
   2.4 Motivation
   2.5 Purpose

3 Related Work

4 System Implementation
   4.1 System Overview
   4.2 Terraform
   4.3 REST Service
   4.4 Message Queue
   4.5 Data Aware Functionality
   4.6 Negotiator Module
      4.6.1 Resource Availability
      4.6.2 Terraform Configuration Generation
      4.6.3 Executing Terraform Scripts
   4.7 Tracing
   4.8 Infrastructure Implementations
      4.8.1 Spark Standalone Cluster
      4.8.2 HarmonicIO cluster
      4.8.3 Loading Microscopy Images
      4.8.4 Single Container Application
   4.9 Simple Web User Interface

5 Results
   5.1 Spark Standalone Cluster
   5.2 HarmonicIO Cluster
   5.3 Image Loader
   5.4 Running a Trivial Container

6 Discussion & Evaluation
   6.1 Comparison Against Other Methods
      6.1.1 SparkNow
      6.1.2 KubeSpray
      6.1.3 Manual Provisioning
   6.2 Future Development Complexity
   6.3 Tracing
   6.4 Data Aware Function
   6.5 Security Issues
   6.6 Limitations of This Service

7 Future Work

8 Conclusion


1 Introduction

There has been tremendous growth in data over the past decade. This trend can be observed in almost every field. The Large Hadron Collider experiment at CERN [2] and the Square Kilometre Array project [7] are examples of scientific experiments dealing with data beyond the petascale. This requires efficient, scalable and resilient platforms for the management of large datasets. Furthermore, to continue with the analysis, these large datasets must be made available to the computational resources. Recently, together with cloud infrastructures, a new concept has emerged: Infrastructure-as-Code (IaC). IaC enables run-time orchestration, contextualization and high availability of resources using programmable interfaces [4]. The concept allows mobility and high availability of customized computational environments. AWS Cloud Foundry, OpenStack HOT and Google App Engine are platforms aligned with the concept of IaC. However, it is still overwhelming and time-consuming to capitalize on this concept. In order to satisfy researchers and give them seamless access to customized computational environments for analysis, a level of abstraction is required that hides the platform-specific details and intelligently places the computational environment close to the datasets required for the analysis.

This thesis proposes a software service that aims to support the researchers in the Hierarchical Analysis of Spatial and Temporal Data (HASTE [3]) project in seamlessly running compute applications on different cloud services. The main capabilities of the software are its cloud-agnostic design, its tracing of the build process of the compute infrastructure, and its data awareness, meaning that it can locate the data resource that is used in the proposed computation.

2 Background

Cloud computing has emerged as a major trend in the ICT sector due to the wide array of services the cloud can provide. Many companies, such as Google and Amazon, offer different kinds of cloud services, for example Google App Engine1 and Amazon Web Services (AWS)2 respectively. Each provider manages its own infrastructure in its own fashion. The cloud providers control large pools of computers and profit from the cloud by renting out user-requested resources. Users are billed either on a subscription basis, for example per month, or on a usage basis where they pay depending on the workload of the rented resources. Beyond commercial use, cloud computing is also expanding in scientific

1https://cloud.google.com/appengine/ 2https://aws.amazon.com/

research, using platforms such as OpenStack3 to provide computation. However, cloud computing comes with many challenges that must be tackled both by businesses using the cloud commercially and by scientists looking to the cloud to run scientific computations.

2.1 Cloud Computing Concepts and Obstacles

The term cloud computing has existed since the 1960s; however, the concept gained popularity in 2006. There is no single agreed definition of the term. The National Institute of Standards and Technology (NIST) [13] describes cloud computing as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal effort.

Generally speaking, the cloud can be divided into four architectural layers. Zhang et al. [24] describe the layers in the following way. The lowest level is the hardware layer; this is where the bare metal resides, such as routers, switches, power and cooling systems. Next is the infrastructure layer, which creates a set of resources that are configured on the hardware through virtualization technologies. Above the infrastructure layer is the platform layer, where operating systems and application frameworks lie. The final layer is the application layer, where software applications are deployed.

The business model of the cloud can be categorized into different services that are derived from the architectural layers. NIST defines the services as follows. Infrastructure as a Service (IaaS) provides processing, storage and networks; the user has the ability to deploy and run software on the infrastructure, such as operating systems and applications. Examples of IaaS providers are Google and Amazon. Platform as a Service (PaaS) allows the user to use the cloud infrastructure through provided tools; the user does not control the underlying networks, operating systems or storage, only the self-deployed applications. Software as a Service (SaaS) is the highest level of user abstraction, where the user only accesses the cloud through the provider's interface, commonly in the form of a thin client or a web browser.

As mentioned in the introduction, the newly coined concept Infrastructure as Code (IaC) is on the rise. The principle of IaC is to treat the infrastructure as code and then use that code to provision and configure the infrastructure, most importantly for provisioning virtual machines (VMs) in IaaS. The code describes the desired state

3https://www.openstack.org/

of the infrastructure without the need to walk through manual steps and previous configurations [16]. This concept makes it possible to apply software engineering techniques from programming and software development when building one's infrastructure. This means that a blueprint or state of the infrastructure can be version controlled, shared and re-used. The end purpose of IaC is to improve the quality of one's infrastructure [21].

Another important concept is the container, which is growing in the cloud computing field; containers are most often used at the application level to replace virtual machines. There are many advantages of using containers: they are more lightweight than VMs, and start time and resource usage are reduced [14]. Docker4 is one of the most well-known and widely used tools for containerizing applications. Docker provides services that build and assemble applications. Each Docker container is based on a system image, a static snapshot of a system configuration. A Docker container is run by the client; when a container is to be run, Docker looks for the image on the local machine or downloads it from a remote registry. Once the image is ready, Docker creates a container, allocates a file system with a read and write layer and creates a network to interact with the host machine [19]. A main principle of using Docker containers is to avoid conflicting dependencies; for example, if two websites need to run two different versions of a framework, each version can be installed in a separate container. Also, all dependencies are bound to a container, which means that there is no need to re-install dependencies if the application is re-deployed. Furthermore, Docker containers are not very platform dependent; the only requirement is that the host machine runs Docker [14].

As a consequence of having multiple providers, many individuals who use cloud services face the issue of adapting to each and every cloud provider. One of the main obstacles is vendor lock-in [11], meaning that the cost of changing vendor is too high to justify the change, which leaves the user locked into one vendor. The lack of standards makes it increasingly difficult to manage interfaces to different cloud vendors. Several recent works have tackled the problem of vendor lock-in by developing APIs that interface to various types of clouds. Developing standards could be a good solution; however, the larger cloud vendors who lead the cloud business do not seem to agree on proposed standards.
4https://www.docker.com/


2.2 Scientific Computing

Scientific computing is a research field that uses computer science to solve scientific problems. The research is often related to large-scale computing and simulation, which frequently require large amounts of computer resources. Recently, scientific computing has progressively required more and more computation to cope with the immense amount of data that is generated. The amount of data produced by massive-scale simulations, sensor deployments, high-throughput lab equipment and so on has increased in recent years. Already in 2012, it was predicted that the amount of data generated would pass 7 trillion gigabytes [17]. When the amount of data used for computing exceeds the power of an individual computer, distributed computing systems are therefore used in some cases. Cloud computing proposes an alternative for running such computations, which can be specifically beneficial for scientific computing. Researchers can take advantage of the potentially lower cost of running cloud computations by reducing administration costs and exploiting the flexible scalability of the cloud. Cloud computing also offers researchers located in different areas an opportunity to ease collaboration. Compare this to running computations on personal computers or campus-exclusive resources, where there may be limited resources, security issues and difficulties in sharing data.

In the field of large-scale scientific computing, one of the popular frameworks is Apache Spark5 (Spark) [22]. Spark is one of the largest open source projects for unified programming and big data processing. Spark has a programming model based on Resilient Distributed Datasets (RDDs) that supports a wide range of processing techniques, including SQL, machine learning and graph processing. The key point of RDDs is that they are collections of data partitioned across a compute cluster, on which functions can be run in parallel. This of course requires that the user has access to cloud infrastructure. Users operate on RDDs by applying specific functions, for example map, filter and groupBy. The main speedup of Spark comes from its data sharing capabilities: instead of storing intermediate data on disk, it keeps the data in memory to allow faster sharing. Spark was developed as a tool for a wide range of users, including scientists.
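As an illustration only (not code from the thesis), a minimal PySpark sketch of the RDD operations mentioned above could look as follows; the master URL, application name and data are hypothetical:

from pyspark import SparkContext

sc = SparkContext(master="spark://master:7077", appName="rdd-example")

numbers = sc.parallelize(range(1000000))      # distribute the data across the cluster
squares = numbers.map(lambda x: x * x)        # lazy transformation
evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
print(evens.count())                          # action: triggers parallel execution
sc.stop()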

2.3 HASTE Project

Part of the work of this thesis is to assist the HASTE project, whose aim is to intelligently process and manage microscopy image data. The HASTE project is funded by the Swedish Foundation for Strategic Research [1]. Its main objectives are to discover interestingness in image data using data mining and machine learning techniques

5https://spark.apache.org/

and to develop intelligent and efficient cloud systems. To find the interestingness of a microscopy image, different machine learning techniques are used, and these are processed in the cloud. The scientists in the project have different academic backgrounds, and the whole project consists of smaller related projects, several of which use the cloud to store data and execute computations.

SNIC Science Cloud6 is a community cloud run by the Swedish National Infrastructure for Computing (SNIC) with the purpose of providing large-scale computing and storage for research in Sweden. SNIC mainly provides IaaS together with higher-level PaaS. The HASTE project runs its computations on the SNIC Science Cloud, and the work in this thesis exclusively uses the SNIC Science Cloud to provision infrastructure for the users.

2.4 Motivation

In order to demonstrate the demand for this service, a motivational example is presented. As of now, researchers in the HASTE project, and potentially other individuals who have to run scientific cloud computations, have their own procedures to provision and cluster their infrastructure using their own scripts and various command line and graphical interfaces. This can be rather time consuming, and a researcher would perhaps rather spend time doing actual research than provisioning infrastructure. The problem of vendor lock-in comes in here: scripts and interfaces may vary a lot between cloud vendors, so a scientist who has successfully executed experiments on one cloud infrastructure has to repeat the whole process when the need to change cloud provider arises. Another important factor to consider is the physical placement of the data; a scientist may have to manually look up the metadata inside a potentially hidden or difficult-to-access file. A further difficulty in cloud orchestration is the error-prone nature of the long provisioning process: various errors can arise that are difficult to find and debug. An example scenario of a HASTE researcher who would like to run a machine learning computation can play out in the following fashion. The researcher has to locate the credentials and other metadata regarding the cloud provider before starting a machine. The next step is to provision the requested data to the machine, and to execute the code the researcher has to install all the required packages. This is a lengthy process that is preferable not to have to repeat. Another example is the orchestration of multiple instances. If a researcher wants to run a compute cluster, the researcher has to start multiple machines and then, with considerable effort, connect them together into a cluster. Doing this can arguably be even more time consuming than the previous example.

6https://cloud.snic.se/


2.5 Purpose

The purpose of this thesis is to support the researchers in the HASTE project and to build a general software service for automatic provisioning of cloud infrastructure with intelligent data-aware aspects. The data-aware aspects come from pre-provisioning the service with metadata, allowing users to skip setting cloud metadata variables. The main purpose is to provide options for the HASTE researchers to seamlessly run HASTE-relevant software on the SNIC Science Cloud through the service, and to simplify the provisioning process compared to running HASTE cloud projects through manual provisioning. A general use case of provisioning a Spark compute cluster and a case where a container application is run are also provided, to exemplify a process from which a non-HASTE scientist can benefit. The researchers should have the ability to provision compute infrastructure through easily accessible command line and graphical interfaces. The service includes the potential to provision not only to the SNIC Science Cloud but also to other OpenStack cloud projects and to non-OpenStack cloud providers. Additionally, a tracing mechanism is implemented to provide transparency and feedback on the underlying orchestration process, giving the user insight into any potential errors that occur. Furthermore, a conceptual web interface is created for the purpose of granting the user a simple graphical interface for creating their infrastructure.

3 Related Work

There exist several other cloud computing frameworks for the purpose of abstracting the cloud orchestration layer and counteracting the problem of vendor lock-in, where it is too burdensome and difficult to deploy applications on different cloud providers while keeping important aspects such as security and quality of service consistent. A few have used model-driven design as their main method for designing their frameworks. Model-driven design is not the focus of this work; however, it is an interesting approach to development that one can draw inspiration from. Other frameworks and software applications have also been developed and published as open source to help the developer community deploy infrastructure.

Specifically, Ardagna et al. [8] present the idea of cloud deployment with MODACLOUDS, which uses model-driven design to provide a framework and IDE for developing and deploying applications on multiple cloud providers. MODACLOUDS is intended to offer a run-time environment for observing the system during execution, allowing developers to proactively determine the performance of the system. Their ambitions

are to run MODACLOUDS as a platform for deployment, development, monitoring and adaptation of applications in the cloud.

Chen et al. present MORE [10], a framework that uses model-driven design to ease the challenges of deploying and configuring a system. MORE provides the user with a tool to model the topology of a system without demanding much domain knowledge. The model is then transformed into executable code that abstracts the orchestration of the system, and the user eventually gains access to the cloud infrastructure.

Other tools that are not model-driven also exist. For example, Sandobalin, Insfran and Abrahao present an infrastructure modelling tool for cloud provisioning called ARGON [15]. The tool is intended to solve the management of infrastructure as code (IaC). Their goal is to take the DevOps concept and apply it to IaC. Through a domain-specific language, ARGON reduces the workload for operations personnel. With ARGON, developers have the opportunity to version control and manage their infrastructure without the need to consider the interoperability of different cloud providers.

To further investigate cloud interoperability and approaches to avoiding vendor lock-in, Repschlaeger, Wind, Zarnekow and Turowski [23] implemented a classification framework for comparing different cloud providers. Their purpose was to help e-governments with the problem of selecting an appropriate cloud vendor with regard to prices, security and other important features. Their method was based on literature surveys and expert interviews.

Furthermore, Capuccini, Larsson, Toor and Spjuth developed KubeNow [9], a framework for rapid deployment of cloud infrastructure using the concept of IaC through the Kubernetes framework. The goal of KubeNow is to deliver cloud infrastructure for on-demand scientific applications. Specifically, KubeNow offers deployment on Amazon Web Services, OpenStack and Google Compute Engine.

Additionally, there are other well-known frameworks that bring the benefits of IaC. Some examples include Ansible7, Puppet8, AWS OpsWorks9 (which uses Chef and Puppet) and Terraform10.

Unruh, Bardas, Zhuang, Ou and DeLoach present ANCOR [20], a prototype of a system built from their specification. The specification is designed to separate user

7https://www.ansible.com/ 8https://puppet.com/ 9https://aws.amazon.com/opsworks/ 10https://terraform.io

requirements from the underlying infrastructure and to be cloud agnostic. ANCOR uses Puppet as its configuration management tool; however, ANCOR supports other configuration management tools such as Chef, SaltStack, bcfg2 and CFEngine. ANCOR mainly targets OpenStack, though there is a possibility of using AWS as well. ANCOR was developed with a domain-specific language based on YAML. Their conclusions show that ANCOR can improve manageability and maintainability and enable dynamic cloud configuration of deployments without performance loss.

SparkNow11 is provisioning software that focuses on rapid deployment and teardown of Spark clusters on OpenStack. It simplifies the provisioning process by providing pre-written provisioning scripts, and through user arguments it can provision the requested infrastructure without requiring the user to learn the orchestration process. KubeSpray12 is similar to SparkNow in the sense that it simplifies infrastructure provisioning; however, its focus is on rapid deployment of Kubernetes13 clusters instead of Spark clusters, on OpenStack and AWS clouds.

The related work mentioned in this section focuses on developing standalone tools and different domain-specific languages for creating infrastructure. Much effort is put into deploying cloud applications with ease through these tools, reducing the complexity of creating cloud infrastructure. The IaC concept is repeatedly explored and used efficiently to provision infrastructure. Cloud vendor lock-in is discussed as well, concerning the ability to deploy applications on different providers, which is important for the users. This thesis proposes the ability for a user to request cloud infrastructure with less domain knowledge, options for choosing which provider to deploy infrastructure on, and monitoring of the provisioning process. Another proposal is to explore further infrastructure abstraction requiring even less knowledge, adding another abstraction layer over existing software and using data-aware aspects that take advantage of metadata to pre-provision the orchestration service. Furthermore, this work implements a tracer for the orchestration process to track the orchestration flow.

4 System Implementation

To develop the provisioning software, numerous technologies were used. The service is split into modular parts that communicate with each other. The user communicates with a server through interfaces, which in turn communicates with another

11https://github.com/mcapuccini/SparkNow 12https://github.com/kubernetes-incubator/kubespray 13https://kubernetes.io/

module that uses external frameworks to provision infrastructure. Overall, the system can be seen as a client-server application. The whole system is also traced using external libraries that are integrated throughout the system.

4.1 System Overview

To start off, there is a conceptual graphical user interface built as a web interface using the common web languages HTML, CSS and JavaScript. Moreover, the React14 library is used as the main library for writing and structuring the interface. Using React, the business logic and markup are split into components, which allows for more flexibility and re-usability.

The service that handles the requests and provisions the infrastructure is called the negotiator. Between the user and the negotiator lies a Representational State Transfer (REST) service, written in Python 2.7 with the Flask15 library, that can be called for communication. The middleman, or broker, between the REST server and the negotiator is RabbitMQ16, a message broker that receives the requests from the client and sends them to the negotiator. The negotiator is designed so that calls to new cloud providers can be integrated into the module by constructing new classes for each provider. Applying a REST service gives the system an interface between the user and the negotiator, which makes it possible to seamlessly alter the communication with the negotiator. By defining the REST endpoints, the module can consistently accept the expected arguments for creating infrastructure.

SNIC Science Cloud runs OpenStack, which is a platform for orchestrating and provisioning infrastructure. This project uses Terraform in conjunction with OpenStack to provision the compute infrastructure on SNIC Science Cloud, where Terraform is used as the framework that provides IaC. Tracing is performed with OpenTracing17, an open source tracing API that is available in multiple languages.

The general step-by-step process of the system, from a user's perspective, for requesting infrastructure can be described in the following steps:

(a) The user creates a POST request to the REST service from any interface, which can be a web interface or a command line interface.

14https://reactjs.org/ 15http://flask.pocoo.org/ 16https://www.rabbitmq.com/ 17https://opentracing.io/


(b) The request arrives at the REST server, which forwards it to the message broker and returns the web URL of the tracing interface to the user.
(c) The message broker receives the request and puts it, now as a message, in the queue for consumption.
(d) The consumer forwards the request to the negotiator module, which handles the request and provisions the infrastructure.
(e) After orchestration, the user is sent feedback regarding the infrastructure.
(f) The process can be traced during and after each request.

Figure 1: A high-level overview of the system.

A high-level overview of the system is shown in Figure 1 to give a better abstract understanding of how the system communicates. It shows the user, who interacts with the system through the REST service, implemented as a Flask server. The request is forwarded to the negotiator, which then, depending on the request, provisions infrastructure with the cloud provider requested by the user.

The system from a user's or scientist's perspective can be seen in Figure 2. The user may interact with the system by requesting or deleting infrastructure. The user may also access the trace of the requested process.


Figure 2: User perspective of the system


4.2 Terraform

Terraform is a tool that applies IaC to provision infrastructure. Terraform can be used to build, change and version control cloud infrastructure. The desired state of the cloud infrastructure is described in Terraform configuration files written by the users; after successfully executing the configuration with the Terraform binary, the infrastructure requested in the configuration file is provisioned. The main motivation for using Terraform is the simplicity of changing and adding new infrastructure for different providers, together with the power of IaC, which is used to dynamically provision infrastructure. Furthermore, using Terraform may avoid the problem of vendor lock-in because of the multitude of providers Terraform supports. Software such as Heat18 works similarly but only targets one platform (OpenStack), while Terraform can perform the same tasks and also handle multiple providers; for example, it can orchestrate an AWS and an OpenStack cluster at the same time. Terraform is cloud agnostic in the sense that the tool itself can be used with various providers. One might think that a single configuration can be used for different providers, but that is not the case: to create an equivalent copy of an infrastructure on two different providers, one has to write two different configurations, although some parts, such as variables, can be shared. Still, it is simple to change provider; the syntax, functions and thought process for writing the code stay the same. The configuration files are written in HCL (HashiCorp Configuration Language)19, a configuration language built by HashiCorp20, the founders of Terraform. The same language is used for all the providers that Terraform supports. HCL can also be used in conjunction with the JSON format to allow for more flexibility. An example configuration can be seen in Listing 1, which shows a configuration for an AWS cloud; when executed, Terraform creates one instance of type t2.micro in the us-east-1 region using the user's access and secret keys. The provider block determines the provider and the resource block describes which resources are provisioned. Additionally, Terraform can provision more than just compute instances, including storage, networking, DNS entries, SaaS features and much more.

Listing 1:

provider "aws" {
  access_key = "ACCESS_KEY_HERE"
  secret_key = "SECRET_KEY_HERE"
  region     = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-2757f631"
  instance_type = "t2.micro"
  ...  # Additional blocks
}

18https://docs.openstack.org/heat/latest/ 19https://github.com/hashicorp/hcl 20https://www.hashicorp.com/

This work uses Terraform's OpenStack provider to provision the OpenStack-based infrastructure used by SNIC Science Cloud. The basic configuration for Terraform's OpenStack provider consists of a provider block, similar to the AWS example above, which configures the OpenStack provider, and resource blocks that describe the provisioned resources. Listing 2 shows an example of an OpenStack configuration where a single instance is created under a specific user. Additional connection variables, auth_url, tenant_name, tenant_id and user_domain_name, are given to connect to the specific cloud. The single instance is created using parameters that specify the image name, the flavor, the key pair and the security groups. In this example, variables are used as input parameters instead of static strings; each parameter references a variable that stores the argument. Using this method, variables can be set from exterior sources, for example the command line, environment variables or external files.

Listing 2:

provider "openstack" {
  user_name        = "${var.user_name}"
  password         = "${var.password}"
  tenant_id        = "${var.tenant_id}"
  tenant_name      = "${var.project_name}"
  auth_url         = "${var.auth_url}"
  user_domain_name = "${var.user_domain_name}"
}

resource "openstack_compute_instance_v2" "example" {
  name            = "example"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]
  count           = 1
  ...  # Additional instance variables
}
...  # Additional blocks

4.3 REST Service

REST is an architectural design pattern for machine-to-machine communication [12]. By applying a REST architecture, the separation of concerns principle is applied, that is, the separation of the user interface from the system back-end. The result is that the portability and scalability of the system are improved, and the REST service and the rest of the system can be developed independently. A REST service requires that the client makes requests to the service. A request contains an HTTP verb, which defines the operation to perform, and a header containing the data to pass to an endpoint. The four basic verbs are POST, PUT, DELETE and GET. The negotiator REST service has two callable endpoints, POST and DELETE.

Using the POST request, the endpoint accepts the user arguments for provisioning the infrastructure. The DELETE endpoint is then used to delete existing infrastructure. The endpoints themselves use the functions of the negotiator module when called upon. This allows a flexible REST implementation where changes, such as new endpoints, can be made to the REST service without affecting the negotiator module. The REST service expects the data in the request to be in JSON format and then replies with data in JSON format. The JSON format is human readable, simple to use and supported by most languages, which eases integration. The REST server must be asynchronous; otherwise the user would have to make a request and then wait for the result, considering that provisioning a cluster may take several minutes. To solve this problem, the user is returned an id for the request. The id is bound to the request, and any future calls concerning the requested infrastructure are made using the id.
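As a rough sketch of this design (not the exact implementation), the two endpoints can be thought of as thin Flask handlers that validate the JSON body, assign a request id and hand the message over to the broker; the route names and the publish_to_queue helper are assumptions made for illustration:

import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/infrastructure', methods=['POST'])
def create_infrastructure():
    payload = request.get_json()
    if not payload or 'provider' not in payload:
        return jsonify({'error': 'the field "provider" is required'}), 400
    payload['id'] = str(uuid.uuid4())   # id returned immediately, work is done asynchronously
    publish_to_queue(payload)           # hypothetical helper that sends the message to RabbitMQ
    return jsonify({'id': payload['id']}), 202

@app.route('/infrastructure/<request_id>', methods=['DELETE'])
def delete_infrastructure(request_id):
    publish_to_queue({'action': 'delete', 'id': request_id})
    return jsonify({'id': request_id}), 202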

Calling the REST server with a valid JSON object will eventually trigger the negotiator module. However, for everything to execute without errors, the JSON request must include valid input arguments. There is only one hard requirement for the REST service: a JSON object with the field provider. Listing 3 describes the minimum requirement for a valid REST call. The provider field describes which provider implementation the negotiator module is supposed to call to continue the process. Additional parameters are unique to each infrastructure configuration implementation.


Listing 3:

{
  "provider": "some_provider (openstack, aws, google, etc.)"
}
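For illustration, a minimal client call against such a service could look as follows; the host, port, endpoint path and the extra worker_count parameter are assumptions, and only the provider field is strictly required:

import requests

response = requests.post('http://localhost:5000/infrastructure',
                         json={'provider': 'openstack',
                               'worker_count': 3})   # provider-specific extra parameter
print(response.json())                               # e.g. an id used for DELETE and tracing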

4.4 Message Queue

The system uses RabbitMQ, which implements the Advanced Message Queuing Protocol (AMQP)21. The message queue is placed between the REST service and the negotiator. The idea behind using a message queue is to avoid blocking on long-running and resource-intensive tasks; the tasks are instead scheduled to be executed when a worker is ready. To summarize the process, tasks are turned into messages and put in a queue until they are ready to be executed. RabbitMQ itself is the broker, which receives and delivers messages. A queue lives inside RabbitMQ; producers, programs that send messages, produce messages which the broker stores in its queue, and a consumer program consumes the messages in the queue to handle the producers' messages. Figure 3 depicts the workflow of the message queue. The producer, which in this work is the REST service, puts new messages (requests from the users) into the queue. The consumer side of the system is then ready to execute the requests from the queue.

The benefit of a message broker is that it can accept messages and thus reduce the load on the other programs, such as the REST service. Consider the fact that the provisioning process takes several minutes: a synchronously implemented service would be on hold for the whole process and therefore block other clients from connecting. The message queue avoids this problem; the REST service can submit a request to the queue and then immediately become free for something else. Another important benefit of message queues is modularity: the queue is separated from the rest of the system, can be written in any language, and can be started and run separately from the REST server and the negotiator [5].

The system's message queue is the middleman between the REST service and the negotiator. The REST service sends the parsed POST or DELETE request as a message from the user to the broker, which stores the message in the queue and waits for it to be consumed. After consuming the message, the receiving part of the message queue calls the negotiator to start the requested provisioning process.
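A minimal sketch of this producer/consumer pattern with the pika client (1.x style API) is shown below; the queue name, broker host and the commented negotiator call are illustrative assumptions:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='provision_requests', durable=True)

def publish_to_queue(payload):
    # Producer side: called by the REST service for each parsed request.
    channel.basic_publish(exchange='',
                          routing_key='provision_requests',
                          body=json.dumps(payload))

def on_message(ch, method, properties, body):
    # Consumer side: hands the request over to the negotiator module.
    request = json.loads(body)
    # negotiator.handle(request)              # provision or delete infrastructure
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='provision_requests', on_message_callback=on_message)
channel.start_consuming()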

21https://www.amqp.org/


Figure 3: Producers add tasks to the queue, which consumers consume [6]

4.5 Data Aware Functionality

The data-aware aspect is one of the main characteristics of the negotiator. The purpose is to pre-store metadata about the cloud provider to avoid having the user configure infrastructure metadata arguments. During run-time, the metadata is fetched from a metadata store that holds values pre-stored by the user or another user. The metadata store uses key-value based storage, where the key is the name of the data and the value contains the relevant metadata that the negotiator module needs to locate the data. Since different providers require different metadata, the metadata is stored under a provider key, which could be, for example, aws or openstack.

Listing 4 shows an example of metadata for an OpenStack provider. The variables are required to start an instance on an OpenStack cloud. These variables are tedious to manage, are mostly kept the same and rarely change. The external_network_id and tenant_id are, for example, two variables that a user most probably does not want to be responsible for. By pre-storing these variables, the users of this cloud do not have to keep track of them, which reduces the number of input parameters on the user side. When a metadata parameter does change, one user has to update it manually, but this also has a positive effect for the other users of the system, who then do not have to change the same variable. Compare this to storing the variables locally on each user's machine, where every user of the system has to change the variable whenever it changes.

Listing 4:

"openstack": {
  "example_data": {
    "external_network_id": "b8eigkt4-w0g84-bkeog-93833-shb029biskv",
    "floating_ip_pool": "Network pool",
    "image_name": "Ubuntu",
    "auth_url": "https://cloud.se:5000/v3",
    "user_domain_name": "cloud",
    "region": "Region One",
    "tenant_id": "r2039rsovbobsaboeeubocacce",
    "project_name": "tenant"
  }
}

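A minimal sketch of the lookup, assuming the metadata store is a JSON document keyed by provider and data set name as in Listing 4 (the file name and function names are illustrative):

import json

def load_metadata(provider, data_name, store_path='metadata.json'):
    # Fetch the pre-stored metadata for a given provider and data set.
    with open(store_path) as f:
        store = json.load(f)
    return store[provider][data_name]          # e.g. store['openstack']['example_data']

meta = load_metadata('openstack', 'example_data')
region = meta['region']                        # used to place the compute close to the data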
4.6 Negotiator Module

The negotiator module accepts the user arguments from the message broker, which received the request from the user via the REST service, and then provisions the requested infrastructure according to those arguments. To start, the module reads the provider argument, which was mentioned in Section 4.3. By looking at the provider argument, the module can find the implementation corresponding to the provider, and the provisioning can begin.

4.6.1 Resource Availability

The first part of provisioning is determining in advance whether the provisioning is possible with regard to the available resources. What available resources means depends on the provider and the implementation. For cost-based providers, which offer virtually unlimited computation, the module can check whether there is enough balance to provision the resources, while for non-cost-based providers the general determining factor is the amount of computing capacity that is available. By pre-determining the available resources, the negotiator can detect whether the process would otherwise be stopped later by an insufficient-resources error.

Using the provider value, the negotiator finds the file that corresponds to the provider. The file should contain a function check_resources(resources) that determines whether the resources are available. This is similar to how interfaces are built in object-oriented design: each provider file is expected to implement the check_resources(resources) function. As an example, if the user requests resources with OpenStack as the provider value, the module will look for the file called OpenStack. This step is skipped if there is no implementation of the check_resources(resources) function or if the file does not exist.

The end result of the resource check is a boolean value: the negotiator exits the process if the resources are not sufficient, or continues

if there are existing resources available.
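A sketch of how the negotiator might locate the provider file and call its optional check_resources function is shown below; the module layout (a providers package) is an assumption made for illustration:

import importlib

def resources_available(provider, resources):
    try:
        module = importlib.import_module('providers.' + provider)   # e.g. providers.openstack
    except ImportError:
        return True                      # no provider file: skip the check
    check = getattr(module, 'check_resources', None)
    if check is None:
        return True                      # no implementation: skip the check
    return bool(check(resources))        # True: continue, False: abort provisioning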

4.6.2 Terraform Configuration Generation

The second step of the provisioning process is to generate a Terraform configuration file that represents the infrastructure the user requested. Similar to the first step where resources are checked, the module looks for the folder and file that both have the same name as the provider and calls a function in the file. The requirement for the implementation is that the file must be located in a folder with the same name as the provider and must have a function called orchestrate_resources(request) that accepts the request as a parameter. The function must return a valid Terraform JSON configuration. The Terraform configuration is generated programmatically depending on the user's request; different infrastructure implementations use the user input differently.

One of the core strengths of the module is its use of metadata that is bound to certain data blobs; this is the data awareness function. The metadata is collected using the name of the data that the user has requested, in the cases where the user passes the name of the data. Using the potential metadata together with the user input, a Python dictionary that corresponds to a valid Terraform JSON configuration is generated and returned to the module, which later converts it into a Terraform JSON file. The Terraform configuration implementations are described in Section 4.8.
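A simplified sketch of what an orchestrate_resources implementation might return, and how it could be written to disk, is shown below; the resource fields are illustrative and far from a complete configuration:

import json

def orchestrate_resources(request, metadata):
    # Build a Python dict in Terraform's JSON syntax from the request and metadata.
    return {
        'provider': {
            'openstack': {
                'auth_url': metadata['auth_url'],
                'tenant_id': metadata['tenant_id'],
            }
        },
        'resource': {
            'openstack_compute_instance_v2': {
                'worker': {
                    'count': request.get('worker_count', 1),
                    'flavor_name': request['flavor_name'],
                    'image_name': metadata['image_name'],
                }
            }
        }
    }

def write_configuration(config, path='main.tf.json'):
    # Terraform treats *.tf.json files as JSON-syntax configuration.
    with open(path, 'w') as f:
        json.dump(config, f, indent=2)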

4.6.3 Executing Terraform Scripts

The previous steps create the Terraform configuration file, and the last step is to execute it. The command terraform apply is used to execute a Terraform configuration and provision the infrastructure. However, the command has to be executed in the same location as the configuration files. Since each configuration is specific to a request id, the module moves the files for each configuration into a folder with the corresponding id; any files under the provider folder are moved into that folder as well. The terraform apply command is then executed to start the provisioning. When the execution is finished and the infrastructure has been created, the user can be notified by different means.
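A minimal sketch of this step, assuming the generated configuration and provider files have already been copied into a folder named after the request id (terraform init is added here since it is needed before the first apply):

import subprocess

def apply_terraform(request_id):
    workdir = './' + request_id
    subprocess.check_call(['terraform', 'init'], cwd=workdir)
    subprocess.check_call(['terraform', 'apply', '-auto-approve'], cwd=workdir)

def destroy_terraform(request_id):
    subprocess.check_call(['terraform', 'destroy', '-auto-approve'], cwd='./' + request_id)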


4.7 Tracing

Considering that a whole range of errors can occur in the provisioning process, everything from name errors to network errors, debugging and finding errors is a time-consuming process. Integrating a tracing system can assist in locating the errors in the process. This work implements a tracer to trace the provisioning process from top to bottom. Each unique request is traced, starting from the user request until the process is complete in the cloud. Tracing through the REST server and the negotiator is the same for all requests; however, the tracing is implemented differently for each type of orchestration. There are different methods to track the process; this project uses OpenTracing in combination with Jaeger22, using the Jaeger bindings for Python23, to trace the whole process and present it in a web interface.

OpenTracing is an open source tracing API used to trace distributed systems, and Jaeger provides a Python library implementing the API. A trace describes the flow of a process as a whole; a trace propagates through a system and creates a tree-like graph of spans, which represent segments of processing or work. Using a tracing framework, one can then trace the error-prone or time-consuming processes. The trace is implemented so that spans are created for each part of the system.

Listing 5 shows the initialization of a tracer. A tracer object is created from the jaeger_client Python library, which is Jaeger's Python implementation. The important parameter to look at is the reporting host, which is the address of the machine that is hosting the Jaeger server; the tracer object forwards the traces to that server.

Listing 5:

from jaeger_client import Config

def init_tracer(service):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': 'x.x.x.x',
                'reporting_port': 5775
            },
            'logging': True,
            'reporter_batch_size': 1,
        },
        service_name=service,
    )
    return config.new_tracer()

tracer = init_tracer('Trace')

22https://github.com/jaegertracing/jaeger 23https://github.com/jaegertracing/jaeger-client-python

To trace a distributed system, each part of the system must be bound to the same trace. A span object, essentially a set of key-value pairs, is passed through the process starting from where the trace is created. The trace begins when the REST server accepts the user request. The span context is then forwarded through the system: when the REST server reaches the RabbitMQ sender, the span context is forwarded in the header properties of the message that is sent to the receiver. The span continues after the message is received, until Terraform provisions the infrastructure. The span context is sent to the requested infrastructure by writing its value to a Terraform variable; the negotiator can then use this Terraform variable to hand the span over to the infrastructure and continue the trace there. The trace is then continued by the Python scripts that run inside the machines of the infrastructure.
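A sketch of how the span context might be injected into a plain dictionary, which can then travel as AMQP header properties or be rendered into a Terraform variable, and how the receiving side continues the trace; carrier handling and names are illustrative:

from opentracing.propagation import Format

def inject_span(tracer, span):
    # Serialize the span context into a dict, e.g. a Jaeger trace id header.
    carrier = {}
    tracer.inject(span.context, Format.TEXT_MAP, carrier)
    return carrier

def continue_trace(tracer, carrier, operation_name):
    # Deserialize the context on the receiving side and start a child span.
    parent_context = tracer.extract(Format.TEXT_MAP, carrier)
    return tracer.start_span(operation_name, child_of=parent_context)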

4.8 Infrastructure Implementations

Four different configurations are implemented. However, as mentioned in previous sections, more implementations can be created as long as the implementation rules are followed and the requested configuration is supported by Terraform. The configurations implemented in this work are the following, each described in the coming sections:

• A general Spark Standalone cluster
• A HASTE-specific HarmonicIO cluster
• A HASTE-specific application to load microscopy images
• A configuration to run a container application

The method for deploying infrastructure that requires software to be installed, such as Spark


and HarmonicIO, is the use of Docker containers. By tying the applications inside Docker containers, the difficulty of deploying the applications on different operating systems is solved. The only limitation for Docker containers is that the machine must be able to run Docker, and most common distributions can. This allows ease of deployment on different operating systems and versions of operating systems; the same cluster can, for example, be run on Ubuntu and CentOS. Using Docker containers in combination with Docker Compose24, the deployment is reduced to configuring the compose file that runs the container. After Terraform is complete, the negotiator sends an email to the user to notify them of the completion; the email address is given in the request to the service.

Figure 4: Example of the orchestration of a compute cluster.

Figure 4 shows a typical example of how a compute cluster is created using Docker containers. The system accesses the machines in the cloud and starts the orchestration by communicating with the machines, which then use Docker images from a remote Docker repository to download the containers that contain the programs needed to deploy the distributed applications. The Spark standalone cluster, the HarmonicIO cluster and the image loading application (the latter with only one machine) use the same method.

24https://docs.docker.com/compose/


The Terraform techniques used to execute scripts inside the virtual machines are provisioner blocks, which include methods to upload files and to execute commands through ssh. Additionally, the data block is used for rendering run-time variables into script files.

4.8.1 Spark Standalone Cluster

The main idea behind a Spark cluster is to run functions on a distributed compute cluster, meaning a cluster that spans several machines to increase the processing power. A Spark Standalone cluster25 is a Spark cluster that does not use additional resource managers such as YARN26 or Mesos27. The required parameters for this configuration are the worker count, which is the number of Spark workers the cluster runs; the data, which is used to determine in which region the cluster is to be placed; the public key of the user, used to later access the machines; and lastly the name of the flavor to be used for the virtual machines.

To provide a Spark cluster on the infrastructure, Docker containers were used. By tying the Spark application inside a Docker container, the difficulty of deploying the Spark cluster is reduced. To fetch and start the containers, two separate Docker Compose files are used, one to start the Spark master and the other to start the Spark workers.

The Terraform configuration file for the Spark cluster is pre-written, meaning that the configuration representing the Spark infrastructure is already written except for some variables that adjust the cluster to the user's request and are left to be interpolated. To configure the cluster according to the user's request, the variables in the pre-written configuration file are set through variable interpolation.

The first step of creating the Spark cluster is to spawn the master, give it a floating IP for external access, and spawn the number of worker machines that was requested; this number is interpolated through a variable that is set from the user request. After the machines are spawned, the scripts that are used on the machines are uploaded to the master through Terraform's file provisioner. A snippet of how the scripts are uploaded can be seen in Listing 6: all files in the scripts folder are uploaded to the host machine given in the connection block, using a private key to ssh to the machine. The master machine is given multiple scripts: one bash script to run its commands, the two

25https://spark.apache.org/docs/latest/spark-standalone.html 26http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html 27http://mesos.apache.org/

previously mentioned Docker Compose files for the master and the workers, and one script that starts the worker machines.

Listing 6:

provisioner "file" {
  connection {
    host        = "${openstack_networking_floatingip_v2.floating_ip.address}"
    type        = "ssh"
    user        = "ubuntu"
    private_key = "${file("${var.ssh_key_file}")}"
  }

  source      = "./scripts/"
  destination = "scripts"
}

After the master machine is instantiated, it executes one of the bash scripts to download Docker and Docker Compose and to start the Spark master container using the Spark master Docker Compose file. It then transfers the worker Docker Compose file and the worker script to the worker machines using the scp command. Finally, it executes the worker script inside the worker machines through the ssh command to start the Spark worker containers. The worker script downloads Docker and Docker Compose like the master script, but uses the worker Docker Compose file to start the Spark worker, which connects to the master to form a cluster.

The script used for the Spark master contains a comma-separated string of the private IP addresses of the worker machines. The Terraform method template_file is used to render the private IP addresses into the master script. The method of rendering variables can be seen in Listings 7 and 8: the worker instances' IP addresses are joined together and rendered into the slaves variable in the master script. To connect to the master from a worker, the master IP address is required in the worker Docker Compose file, so in the same fashion the master IP address is rendered into the worker Docker Compose file using the same template_file method. When the worker container starts, it connects to the master, and the cluster is complete when all the workers have finished executing. The end result is one Spark master running in a Docker container and one or multiple Spark workers on separate machines running containers that connect to the master. Figure 4, which was previously shown, describes the final result and part of the orchestration process: the system communicates with

the machines, which in turn communicate with a remote Docker repository to form a cluster.

Listing 7:

data "template_file" "master_template" {
  template = "${file("scripts/master_script.sh")}"

  vars {
    slave_adresses = "${join(",", openstack_compute_instance_v2.slave.*.access_ip_v4)}"
  }
}

Listing 8:

#master_script.sh
slaves=${slave_adresses}

4.8.2 HarmonicIO cluster

HarmonicIO [18] is a streaming framework developed within the HASTE project. In summary, HarmonicIO is a peer-to-peer distributed processing framework. Its purpose is to let users stream any data to HarmonicIO, have the data processed directly in HarmonicIO worker nodes and then store the processed data in data repositories, while also letting the users preview the data before the processing is complete. A HarmonicIO cluster operates with a master-worker architecture, similar to a Spark cluster: there is an individual master machine and one or multiple worker machines that handle the data processing. Manual orchestration of a HarmonicIO cluster is similar to orchestrating a Spark cluster. The important steps are the following, starting with the master node:

1. Instantiate a master node machine.

2. Download the HarmonicIO remote repository

3. Run the bash script to install dependencies

4. Change the IP address in the configuration file for the master node

5. Finally run the master script to start the master node


Then, for each worker:

1. Instantiate a worker machine

2. Download the HarmonicIO remote repository

3. Install docker

4. Change the master address and the internal address in the configuration file

5. Finally run the worker script to start the worker node

The implementation that deploys a HarmonicIO cluster is similar to that of the Spark cluster. The user parameters for the HarmonicIO cluster are the number of workers, the flavor of the instances, the region in which to place the cluster and, lastly, the public key of the user for later access to the master node.

The Terraform configuration file is again pre-written to fit a HarmonicIO cluster, with dynamic variables taken from the user input. The configuration is similar to that of the Spark cluster: one resource block is used for the master and another for the workers, and a count variable in the worker block determines how many workers are to be deployed. Listings 9 and 10 show the configuration for the nodes. The master node is a single machine, while the number of workers is determined by the count variable. Other variables are shown as well: the flavor, the image and the key pair for the instances.

Listing 9:

resource "openstack_compute_instance_v2" "master" {
  name            = "HIO Master"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default", "Tony"]

  network {
    name = "${var.network_name}"
  }
}

Listing 10:

resource "openstack_compute_instance_v2" "worker" {
  count           = "${var.instance_count}"
  name            = "${format("HIO-Worker-%02d", count.index + 1)}"
  image_name      = "${var.image_name}"
  flavor_name     = "${var.flavor_name}"
  key_pair        = "${var.key_pair_name}"
  security_groups = ["default"]

  network {
    name = "${var.network_name}"
  }
}

Once the machines are created, the next step is to connect them into a HarmonicIO cluster. A Python script is provided for the master and another one for the worker machines. The master Python script follows steps (2) and onwards in Section 4.8.2: it downloads HarmonicIO from the remote repository, installs the dependencies, sets the configuration file and executes the script that starts the master. It then uses the scp command to transfer the worker Python script and a worker bash script to the workers, one at a time. The bash script for the worker installs the Python dependencies, and the Python script runs steps (2) and onwards for the worker. Both the master and worker Python scripts initiate a Jaeger tracer object that continues the trace, and each step is wrapped inside a trace span to inform the trace server and the user that the steps are running. To ensure that the trace is a continuation of the request trace from the negotiator, the Terraform variable that contains the span context is interpolated into the Python scripts.
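For illustration, the master and worker scripts could wrap each step roughly as follows, continuing from a carrier dict recovered from the Terraform variable (the step names and commands are hypothetical):

import subprocess
from opentracing.propagation import Format

def run_step(tracer, parent_context, name, command):
    # Wrap one provisioning step in its own span and report it to the Jaeger server.
    with tracer.start_span(name, child_of=parent_context) as span:
        span.set_tag('command', ' '.join(command))
        subprocess.check_call(command)

# parent_context = tracer.extract(Format.TEXT_MAP, carrier)
# run_step(tracer, parent_context, 'install-dependencies', ['bash', 'install.sh'])
# run_step(tracer, parent_context, 'start-worker', ['python', 'start_worker.py'])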

4.8.3 Loading Microscopy Images

This configuration implements the deployment and execution of a HASTE-specific program that loads a certain set of microscopy images from a larger set of images.

The input parameters are slightly different in this case, since this configuration is not a cluster running the master-worker architecture. The required input parameters are the source object store container and the destination object store container, that is, where to read the input and where to store the output. The data aware function is used here to locate the region in which the process should be executed: given the name of the object store container to download from, the files are downloaded to the machine. The public key and flavor are given as in the previously implemented provisioners, and the return address is given so that the user can be notified when the execution is complete. Lastly, the user can choose, using a boolean value, whether the machine should be destroyed after it has finished running.

Since this infrastructure is a single machine rather than a cluster, the Terraform configuration becomes simpler. The pre-written Terraform configuration file contains one resource block that creates a single machine. A Python script, used for downloading dependencies, downloading object store files, tracing and running a container, is provided along with a Docker Compose file used to download and run the container.

After the machine is created, the Python script is transferred to it and installs its own dependencies. The script then runs as follows. It starts a tracer with a span continuing from the negotiator, and every following step is wrapped in a span of its own. The script installs Docker and Docker Compose, downloads the object store files from the given container and starts the container image using Docker Compose, with the downloaded files mounted as container volumes.
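A minimal sketch of what such a script might do is shown below, assuming the input files live in an OpenStack Swift container and that a Docker Compose file mounting the download directory already exists on the machine; the credentials, container name, paths and compose file are assumptions, not the thesis implementation.

# Hypothetical image-loader step: download all objects from a Swift container
# and start the processing container via Docker Compose (names illustrative).
import os
import subprocess

import swiftclient

conn = swiftclient.Connection(
    authurl=os.environ["OS_AUTH_URL"],
    user=os.environ["OS_USERNAME"],
    key=os.environ["OS_PASSWORD"],
    os_options={
        "project_name": os.environ["OS_PROJECT_NAME"],
        "user_domain_name": "Default",
        "project_domain_name": "Default",
    },
    auth_version="3",
)

source_container = "microscopy-input"   # hypothetical container name
local_dir = "/opt/haste/input"          # mounted into the container by the compose file
if not os.path.isdir(local_dir):
    os.makedirs(local_dir)

# Download every object in the source container (flat layout assumed).
_, objects = conn.get_container(source_container)
for obj in objects:
    _, content = conn.get_object(source_container, obj["name"])
    with open(os.path.join(local_dir, os.path.basename(obj["name"])), "wb") as f:
        f.write(content)

# Start the application container with the downloaded files mounted as a volume.
subprocess.check_call(["docker-compose", "-f", "/opt/haste/docker-compose.yml", "up"])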

The application inside the Docker container is written to read files from the mounted directory, execute its program and store the result in a result file. The host machine can then take the result file and upload it to the object store. After this step, the host machine has finished executing, and the negotiator checks whether the user requested that the VM be terminated when finished; if so, the VM is destroyed.

4.8.4 Single Container Application

Lastly, there is a configuration that deploys a given container application from Docker Hub28 on OpenStack. The purpose of this configuration is to provide a general way of deploying containers, and it is similar to the previous one. The parameters are the URL to the Docker container on Docker Hub, the commands that the user wants to execute, the region, the return address and the public key. The configuration creates a single machine in the requested region, using the metadata store to fetch the region data. The machine runs a Python script that sets up tracing, installs the required dependencies, downloads Docker and then executes all the requested commands.

28https://hub.docker.com/
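As an illustration, a request for this configuration could look like the sketch below. The endpoint path and parameter names are assumptions; the thesis only specifies which pieces of information the request carries.

# Hypothetical REST request for the single container configuration.
# Endpoint and field names are assumptions; values are placeholders.
import requests

payload = {
    "configuration": "single-container",
    "container_url": "https://hub.docker.com/_/hello-world",  # Docker Hub URL
    "commands": ["echo started", "cat /etc/os-release"],      # commands to execute
    "region": "UPPMAX",                                       # target region
    "return_address": "user@example.com",                     # where to send the notification
    "public_key": "ssh-rsa AAAA... user@laptop",               # user's public key
}

response = requests.post("http://service.example.org/api/infrastructure", json=payload)
print(response.json())  # expected to contain the trace id and the trace URL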


4.9 Simple Web User Interface

To improve the user experience, a graphical user interface was developed. The web interface is built with the React library as a simple single-page application and is essentially a proof of concept. It guides the user through three steps. Figure 5 shows three images, one per step: the first lets the user choose between creating new infrastructure and deleting existing infrastructure, the second lets the user choose the configuration and, in the last one, the input parameters for the configuration are filled in before pressing create, which sends a REST request to start the process.

1. Select whether to create new infrastructure or delete existing infrastructure.

2. Select which configuration to request.

3. Fill in the parameters for the chosen configuration and create the infrastructure.

4. The trace URL is returned to the user, giving access to the trace of the requested infrastructure.

5 Results

The result of this work is a service that allows users to create infrastructure for scientific computing in the cloud with a few clicks or a simple REST request. The benefit of the service is the ability to, in theory, create any type of infrastructure on any provider supported by Terraform; that is, the system can be extended with more infrastructure provisioning configurations than the four described in the method section. This work implements four options for infrastructure using the OpenStack provider: an option to create a Spark cluster, an option to create a cluster for the HASTE-specific HarmonicIO application, an option to run a single container and, lastly, an option to run a green channel application used for image processing on a single machine, with the extra features of automatic execution, automatic storing of the files and automatic tear-down of the machine.

The service includes two key features. The data aware feature lets metadata be pre-provided in the system, so that the user can omit providing it. Tracing improves the transparency between the service and the user by giving the user access to an overview of the state of the process.


Figure 5: Example web interface (panels (a)-(c), one per step).


Figure 6: Example trace of the Spark Standalone configuration.

5.1 Spark Standalone Cluster

A request containing one worker, with a trivial data set in a region inside SNIC Science Cloud, was sent to the service together with the user's public key, a trivial flavor and the user's email address. The trace id and the URL to the web interface are returned immediately after the request, and the service starts orchestrating the cluster through Terraform using the pre-configured scripts. The result is three created machines: one master with a floating IP attached and two worker machines. The trace in Figure 6 shows how the master is created, which in turn starts a worker.

5.2 HarmonicIO Cluster

A HarmonicIO cluster was requested with two workers. The request includes the worker count, data, flavor, the public key and the user's email address. After sending the request, the trace id and the URL to the web interface containing the trace are returned. The REST server accepts the request and starts orchestration of a cluster with three machines, where one machine is the master with a floating IP attached and the other two are workers.

Figure 7: Spans of the HarmonicIO trace (panels (a) and (b)).

The trace can be seen in Figure 7 and is split into two sub-figures. Sub-figure 7(a) shows the starting point of the trace: the negotiator receives the request and executes the Terraform configuration to start the orchestration, after which the master machine accepts the continuation of the trace and runs its own scripts to start the HarmonicIO master. Sub-figure 7(b) shows how the workers are started. The first worker receives the continuation of the trace and uses its script to start the HarmonicIO worker process; when it has finished, the second worker goes through the same process and the orchestration is complete.

5.3 Image Loader

A request was passed to the image loader configuration to load a set of images from a trivial container. A couple of containers in different SNIC regions were pre-provisioned in the metadata store with data related to the region and the configuration, for example the network id and the project name used by Terraform. The request was sent to the system with the name of a container in the UPPMAX region as the source container to read images from, along with the other parameters, most importantly the return address set to the sender's email address. The request also specifies that the running VM should be destroyed when the Docker container has finished processing. After sending the request, the trace id and a URL are returned to the user. The service accepts the request and creates a VM in the UPPMAX region, because the negotiator understands from the metadata that the process should be executed in UPPMAX. The execution ends with a set of images loaded into the same container and a notification to the email address explaining that the process is finished.

Figure 8: Trace including spans and time of the image loader configuration.

A full trace of the whole process is available in the Jaeger client interface, which can be accessed with the trace id. The trace can be seen in Figure 8. It shows that the REST server receives the request and pushes it to the receiver of the message broker. The negotiator then handles the request and begins creating the infrastructure: the generated Terraform configuration is executed to provision the infrastructure, and the trace is handed over to the VM, where the execution of the script can be followed. The dependencies and the container objects are downloaded and the Docker container is started. Finally, a notification is sent to the user by email.

5.4 Running a Trivial Container

To test the single container configuration, a request with a Docker Hub URL to a trivial container application was sent to the service. The request also included the public key, the floating IP address and some trivial commands. The service returns the trace URL and starts creating the infrastructure. A single machine then downloads the dependencies and the container from the given URL, the commands are executed and the process is finished. The spans of the trace can be seen in Figure 9: Figure 9(a) contains the negotiator trace and Figure 9(b) contains the machine trace, where the container is downloaded and the commands are executed.

Figure 9: Spans of the container application (panels (a) and (b)).

6 Discussion & Evaluation

This section evaluates the service developed in this work. Comparisons are made against other software that uses similar methods, the selling points of this work, namely the tracing and the data aware function, are evaluated, and some of the system's drawbacks and weaknesses are reviewed.

6.1 Comparison Against Other Methods

To compare this work with other works, a general overview of the process for provisioning different types of infrastructure using similar methods is presented: two different applications for creating compute clusters, and a discussion regarding manual infrastructure provisioning using little or no additional tooling.


6.1.1 SparkNow

SparkNow, as mentioned in the related work section, is an open source project used to deploy a Spark cluster on OpenStack. In summary, the workflow to deploy a Spark cluster with SparkNow is to download the repository, export a set of environment variables that come from OpenStack metadata, use the source Linux command on the OpenStack RC file to set additional environment variables, use Packer (an image building tool) to build an image, configure additional metadata and variables for the cluster architecture and, finally, orchestrate the cluster using Terraform.

SparkNow is perhaps more difficult to deploy for the average user. It requires a considerable amount of OpenStack knowledge to set the environment variables and to know exactly which variables should be used and where. The user also has to install multiple binaries, including Terraform, Packer and Git, and some Linux familiarity is required to deploy with SparkNow.

The work of this thesis also provides a Spark cluster, but with additional features and the ability to skip most of the required deployment steps. The main differences are the data aware function, which lets the user, or someone else, pre-provide the metadata and the variables, and the tracing mechanism. In addition, no installations are required, because this work provides a REST service accessible from the web or from the command line. On the other hand, SparkNow offers many options for configuring the Spark cluster differently depending on the user's needs, while the only configuration this work exposes is the worker count and the flavor.

6.1.2 KubeSpray

KubeSpray is also an open source project that similarly uses Terraform configurations to provision a Kubernetes cluster. It offers different deployment methods, allowing for more options: one uses Terraform and another uses Ansible. Using Terraform, it is possible to deploy on both AWS and OpenStack.

To create a cluster, KubeSpray requires multiple applications to be installed and many variables to be set in different files. The software and variables also differ between the OpenStack and the AWS deployment. Still, KubeSpray makes a considerable effort to ease the deployment of Kubernetes. From this work's point of view, however, the deployment can be made even simpler by this work's REST service: installing software is only required once, and providing the metadata is also only required once. Just like SparkNow, KubeSpray offers much more configuration.

6.1.3 Manual Provisioning

There are multiple ways to manually deploy any type of cluster involving multiple machines, or to run computations inside a single VM. The manual process is, however, laborious compared to the multitude of solutions that have been developed so far, and compared to this work and most other works it requires much more extensive knowledge of the deployment process. Not only is it required to know how to deploy a Spark cluster, that is, installing the dependencies and the required software on both master and worker machines, but also how to use the cloud provider, which could be OpenStack, AWS, Google App Engine or any other provider. Deploying a HarmonicIO or Spark cluster manually is no easy feat either. A few HASTE members know how to deploy HarmonicIO, and otherwise there are manual instructions29. For a new member who does not know how to deploy HarmonicIO, doing it manually would be difficult and perhaps troublesome for other HASTE members. The work of this thesis could then reduce the workload of the members of HASTE.

6.2 Future Development Complexity

The main selling point of this service is the potential of being cloud agnostic. It is already theoretically possible for the service to provide infrastructure for different providers, as long as Terraform supports the provider. To achieve this potential, however, the service needs to be further developed by adding more configurations than the four that have been mentioned, and by adding the same configurations for different providers. Adding more configurations is not necessarily easy. As of now, the minimum requirement for adding a new configuration is to add a folder with a file that returns a valid Terraform configuration, or a folder with a ready Terraform configuration. To develop a configuration that is actually useful, however, sufficient knowledge about Terraform and the Terraform language is required.

The diagram in Figure 10 describes the current configurations and how a new configuration is added. The requirement is to add a new class that implements the negotiator interface and provides the function orchestrate resources, which returns a Python dictionary. Since Python does not have explicit interfaces, the system is programmed to simulate them.

29https://github.com/HASTE-project/HarmonicIO/blob/master/Readme.md


Figure 10: New configurations are created by interfacing.

The difficulty, however, is that the returned dictionary has to be a valid Terraform configuration, equivalent to a Terraform configuration in JSON format. Writing such a configuration is not necessarily easy, and the programmer must have sufficient knowledge about both the provider and Terraform. Moreover, a bare Terraform configuration is often not enough to create a full infrastructure: exterior scripts usually have to be provided alongside it to execute commands or install dependencies inside the machines of the infrastructure, most importantly a Python script that supports the tracing mechanism during the orchestration process. Because this script uses Jaeger tracing, the machines also need the corresponding dependencies installed.
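To make the simulated-interface idea concrete, the following sketch shows one possible shape of such a class; it is an assumption rather than the thesis code, and the class names, parameters and resource attributes are illustrative only.

# Hypothetical sketch of the simulated negotiator interface. A new configuration
# is added by subclassing Negotiator and returning a dictionary that serializes
# to a valid Terraform configuration in JSON format. All names are illustrative.


class Negotiator(object):
    """Simulated interface: subclasses must override orchestrate_resources."""

    def orchestrate_resources(self, params):
        raise NotImplementedError


class SingleMachineNegotiator(Negotiator):
    """Illustrative configuration that provisions one OpenStack instance."""

    def orchestrate_resources(self, params):
        # Equivalent to a Terraform resource block written in JSON syntax.
        return {
            "resource": {
                "openstack_compute_instance_v2": {
                    "vm": {
                        "name": "single-machine",
                        "image_name": params["image_name"],
                        "flavor_name": params["flavor_name"],
                        "key_pair": params["key_pair_name"],
                        "network": [{"name": params["network_name"]}],
                    }
                }
            }
        }

A dictionary of this shape can be written to a *.tf.json file and handed to Terraform, which, roughly, is what the negotiator module does with the generated configuration.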

6.3 Tracing

The tracing implementation does give the user more transparency regarding the process compared to having no tracing at all, and problems that occur become easier to understand. For example, if fewer workers than requested are created for the Spark cluster, it could be possible to see which code blocks were not executed and perhaps detect what the problem was. There are, however, some issues with the implementation. This work places traces around code blocks, which means that it is not possible to trace inside imported functions, and since the purpose of tracing is to see what went wrong, problems inside imported functions remain difficult to diagnose. The main issue concerns some of the longer-running commands, such as Terraform apply. This is one of the longest commands to run, since it provisions the actual infrastructure, and it can generate different errors; with the current tracing it is only possible to see that an error occurred, not which error. If the tracing could be injected into the function, there would be more possibilities to detect the types of errors that may occur.

Adding tracing further increases the code development complexity. Since each provisioning configuration has a different implementation and different scripts, each script needs to include tracing in its own way. It is, however, possible to skip the tracing part for future configurations. Adding a trace that is actually useful is also time consuming.

It is also interesting to discuss the usefulness of the trace to different types of users. To the scientist, the trace might be incomprehensible and essentially useless. On the other hand, someone who understands the provisioning process well could certainly use the trace to understand any issues that arise during the process.

6.4 Data Aware Function

The data aware function reduces the metadata the user has to supply. The issue with the function is that the required metadata has to be pre-provided, so it does not remove the need for the metadata entirely: someone has to add it at some point in time. Moreover, adding the metadata requires knowledge about the negotiator and about how the configuration implementation expects the metadata to be stored. Any change to the implementation of a configuration may require a corresponding change in the metadata, and a mismatch between the two could break the service. This means that the users must rely on someone, or on themselves, to provide the metadata; the service works well in the scenario where someone provides the metadata for the user, but otherwise the data aware function loses its point.

6.5 Security Issues

Users may also be reluctant to use the service because of security issues. The security issues depend on the configuration, but one of them is that, for Terraform to execute scripts inside the OpenStack virtual machines, Terraform requires a private key that is stored on the server. If the key is acquired, the user's machines might be compromised. An issue specific to the Spark and HarmonicIO configurations is that the private key is uploaded to the master machine, because the key is required for the master to connect to the workers. For the image loader, the user credentials are transferred to the machine to authenticate and download the object store containers. There is also a security issue in sending sensitive data to the REST service, due to the risk of man-in-the-middle attacks. This is another trade-off to consider: thanks to the REST service implementation, users do not have to install anything, but interacting with the service requires a connection over the Internet, and the resulting security issues might repel users.


6.6 Limitations of This Service

The main limitations of this service lie in the implementations of the four configurations. For each configuration, the user is locked into a certain set of parameters and a specific infrastructure configuration. It is important to note that the main thing the user can change in the Spark and HarmonicIO clusters is the number of workers. If the user requires other functionality or changes to the cluster, for example another Spark version, another Spark configuration or external tools such as YARN or Mesos, this is not possible unless the user configures the cluster manually after the service has completed the initial creation. This limitation exists for the image loader as well: it has one purpose only, which is to run its Docker container.

This can be solved by changing the configurations to allow for more parameters. Such changes, however, increase the complexity of the configuration, because more code has to be added. Each new parameter means more Terraform configuration and, depending on the change, possibly changes to the Python code, which in turn means that more tracing code has to be added as well.

Because the foundation of this work is built on Terraform, the complete service is limited by Terraform; on the other hand, Terraform is an open source tool that is continuously updated. There is of course the risk that, if Terraform becomes obsolete, this work cannot progress any further unless it is extended with additional tools. It is not easy to extend the software beyond Terraform, and adding configurations outside Terraform's scope would be increasingly difficult.

It has been mentioned that Docker containers make it possible to deploy applications with ease across operating systems. There is, however, a limitation in the current implementation. Because a Python script has to be run, along with bash scripts, on the machines during the orchestration of the configurations, the operating system must be able to run these scripts. For Python to run, a number of dependencies have to be installed, and a major issue is that Jaeger requires a Python 2.x version to run. This is the main reason a Python script is included, and it also means that the operating system must support a Python 2.x version.


7 Future Work

Because this work focused on OpenStack deployment, it would be interesting to implement the cluster configurations for different providers. Even though it is theoretically possible to run on, for example, Amazon, because Terraform supports the Amazon provider, actually implementing this and seeing it work remains to be done. This would mean that a user would have the option to deploy a single cluster configuration on two different cloud providers, keep data on different cloud providers, or eventually create a cluster on the most appropriate provider depending on parameters such as cost or availability.

To further add more configurations and more parameter options to each configuration, it is important to keep the design simple. One of the main points of discussion is how difficult it is to further develop the system. From a design perspective, it is possible to treat this work like any other software system and apply more design principles, such as interfaces, to increase the system's longevity.

There are also tools and frameworks other than Terraform that this thesis has not explored. Another future extension would be to write an abstraction layer over multiple infrastructure provisioning tools, for example combining the capabilities of Terraform and Ansible under one layer so that the user can interact with both tools instead of just one, or using the previously mentioned SparkNow and KubeNow in combination with this system to create Spark and Kubernetes clusters.

The previously mentioned limitation of requiring the machines to run a Python 2.x version can be solved by using Docker containers that include all the required dependencies, such as the Python libraries and Docker itself. This, however, adds development complexity: the initiation of the machine then requires a Docker container, and the application inside that container has to itself start Docker containers. For example, a Spark cluster must run a Docker container which itself starts the Docker master container.

8 Conclusion

This work implements a service which eases the infrastructure provisioning process for users who want to create a Spark cluster and for users inside the HASTE project who need to run a HarmonicIO cluster or an image filtering application. It is also possible to further improve the service and add more configurations, not only for OpenStack, which this work explores, but for other cloud providers as well, creating a service more in line with the cloud agnostic philosophy. The service's tracing function adds transparency for the user, and the data aware function potentially reduces the number of parameters the user has to provide to deploy infrastructure.

The main difference between this work and the other similar works that have been discussed is the trade-off between the changeability of the infrastructure and the effortlessness of its deployment. Allowing more changes and configuration options makes the infrastructure more complicated to deploy. The other deployment methods compared against do come with more power to change the infrastructure depending on the user's needs, while this work provides several static infrastructures with few possibilities for change but with a compensating ease of deployment. For users who are not well informed about cloud concepts and application deployment, but who would like to use cloud infrastructures to run computations, this service is a good fit, while users who are well informed and require a specific cluster would perhaps prefer other deployment methods or manual deployment.

Applying another layer of abstraction over already existing software works quite well, but adding more layers imposes limitations on the users. Reducing those limitations requires more code complexity, and the implementation difficulties increase. For the service to fully reach the cloud agnostic ideal, it would have to provide configurations for all providers and for all possible types of infrastructure, which leads to an immensely large project that is challenging and demanding to maintain.


References

[1] https://strategiska.se/pressmeddelande/200-miljoner-till-big-data-och-berakningsvetenskap/. Accessed: 2018-06-13.

[2] CERN data centre passes the 200-petabyte milestone. https://www.home.cern/about/updates/2017/07/cern-data-centre-passes-200-petabyte-milestone. Accessed: 2018-04-22.

[3] HASTE: Hierarchical analysis of spatial and temporal data. http://haste.research.it.uu.se. Accessed: 2018-04-07.

[4] IaaC for DevOps: Infrastructure automation using AWS CloudFormation. https://community.toadworld.com/platforms/oracle/w/wiki/11715.iaac-for-devops-infrastructure-automation-using-aws-cloudformation. Accessed: 2018-04-22.

[5] RabbitMQ. https://www.cloudamqp.com/blog/2015-05-18-part1-rabbitmq-for-beginners-what-is-rabbitmq.html. Accessed: 2018-07-12.

[6] RabbitMQ. https://www.rabbitmq.com/tutorials/tutorial-one-python.html. Accessed: 2018-08-13.

[7] SKA project. https://www.skatelescope.org/project/. Accessed: 2018-04-22.

[8] Danilo Ardagna, Elisabetta Di Nitto, Parastoo Mohagheghi, Sébastien Mosser, Cyril Ballagny, Francesco D'Andria, Giuliano Casale, Peter Matthews, Cosmin-Septimiu Nechifor, Dana Petcu, Anke Gericke, and Craig Sheridan. MODAClouds: A model-driven approach for the design and execution of applications on multiple clouds. pages 50–56, 06 2012.

[9] Marco Capuccini, Anders Larsson, Salman Toor, and Ola Spjuth. KubeNow: A Cloud Agnostic Platform for Microservice-Oriented Applications. In Fergus Leahy and Juliana Franco, editors, 2017 Imperial College Computing Student Workshop (ICCSW 2017), volume 60 of OpenAccess Series in Informatics (OASIcs), pages 9:1–9:2, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[10] W. Chen, C. Liang, Y. Wan, C. Gao, G. Wu, J. Wei, and T. Huang. More: A model-driven operation service for cloud-based IT systems. In 2016 IEEE International Conference on Services Computing (SCC), pages 633–640, June 2016.


[11] T. Dillon, C. Wu, and E. Chang. Cloud computing: Issues and challenges. In 2010 24th IEEE International Conference on Advanced Information Networking and Applications, pages 27–33, April 2010.

[12] Roy T Fielding and Richard N Taylor. Architectural styles and the design of network-based software architectures, volume 7. University of California, Irvine Doctoral dissertation, 2000.

[13] Peter M. Mell and Timothy Grance. SP 800-145. The NIST definition of cloud computing. Technical report, Gaithersburg, MD, United States, 2011.

[14] Dirk Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014.

[15] J. Sandobalin, E. Insfran, and S. Abrahao. An infrastructure modelling tool for cloud provisioning. In 2017 IEEE International Conference on Services Computing (SCC), pages 354–361, June 2017.

[16] J. Scheuner, P. Leitner, J. Cito, and H. Gall. Cloud work bench – infrastructure-as-code based cloud benchmarking. In 2014 IEEE 6th International Conference on Cloud Computing Technology and Science, pages 246–253, Dec 2014.

[17] Yassine Tabaa, Abdellatif Medouri, and M Tetouan. Towards a next generation of scientific computing in the cloud. International Journal of Computer Science, 9(6):177–183, 2012.

[18] Preechakorn Torruangwatthana. S3DA: A Stream-based Solution for Scalable Data Analysis. Master's thesis, Uppsala University, 2017.

[19] Andrea Tosatto, Pietro Ruiu, and Antonio Attanasio. Container-based orchestration in cloud: state of the art and challenges. In Complex, Intelligent, and Software Intensive Systems (CISIS), 2015 Ninth International Conference on, pages 70–75. IEEE, 2015.

[20] Ian Unruh, Alexandru G. Bardas, Rui Zhuang, Xinming Ou, and Scott A. DeLoach. Compiling abstract specifications into concrete systems—bringing order to the cloud. In 28th Large Installation System Administration Conference (LISA14), pages 26–42, Seattle, WA, 2014. USENIX Association.

[21] Andreas Wittig and Michael Wittig. Amazon Web Services in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2015.

[22] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, October 2016.

[23] R. Zarnekow, S. Wind, K. Turowski, and J. Repschlaeger. A reference guide to cloud computing dimensions: Infrastructure as a service classification framework. In 2012 45th Hawaii International Conference on System Sciences (HICSS), volume 00, pages 2178–2188, 01 2012.

[24] Qi Zhang, Lu Cheng, and Raouf Boutaba. Cloud computing: state-of-the-art and research challenges. Journal of Internet Services and Applications, 1(1):7–18, May 2010.
