Escuela Politécnica Superior de Linares
Master in Telecommunication Engineering

DEVELOPMENT OF THE SCHEDULING STRATEGIES PLATFORM IN CLOUD COMPUTING FOR BIG DATA

Department: Telecommunication Engineering
Student: Yossra Noureldien
Supervisors: Dr José Enrique Muñoz Expósito, Dr Sebastián García Galán

June, 2020

Declaration

I declare that this thesis is my own work and that I have referenced all the sources used throughout it. The thesis has not been submitted for any other degree before.

2020

Acknowledgments

I am grateful to Almighty God for His grace upon me.

I would like to thank the University of Jaén for offering me the opportunity to complete my postgraduate master's degree. In addition, my sincere gratitude goes to my professors José Enrique and Sebastián García, whose guidance and patience enabled me to complete this work.

My special appreciation goes to my family, my lovely parents and my brothers, for their continuous support and encouragement throughout my life.

Abstract

Cloud computing has been an evolving topic over the last decade. Clouds work as infrastructures for storing and accessing data. Cloud scheduling is used to map task requirements to the available resources. Extensive research has been done to develop scheduling methodologies. In general, the goal is to optimize a specific objective such as time, cost, energy or quality of service. The literature offers a large number of scheduling strategies, such as the MinMin, MaxMin and Round Robin algorithms. In this work, we introduce a platform in which it is possible to select between multiple scheduling strategies to run simulation scenarios in cloud computing environments. The platform mainly serves simulations, avoiding the need for a real data center. The simulation environment selected is the WorkflowSim-DVFS simulator, an improved version of the original WorkflowSim simulator that takes energy consumption into account while performing simulations. The proposed platform could be used by researchers as a testing tool when running a wide range of experiments.

Summary

Cloud computing has been an evolving topic over the last decade. Clouds work as infrastructures for storing and accessing data. Cloud scheduling is used to map task requirements according to the availability of resources. Extensive research has been done to develop scheduling methodologies. In general, the goal is to optimize a specific objective such as time, cost, energy or quality of service. The literature offers a large number of scheduling strategies, such as the MinMin, MaxMin and Round Robin algorithms. In this work, we present a platform in which it is possible to select between multiple scheduling strategies to run simulation scenarios in cloud computing environments. The platform mainly serves simulations, avoiding the need for a real data center. The simulation environment selected is the WorkflowSim-DVFS simulator, an improved version of the original WorkflowSim simulator that takes energy consumption into account while performing simulations. The proposed platform could be used by researchers as a testing tool when running a wide range of experiments.

Objectives

This work aims to implement a platform that allows the selection of different scheduling strategies in cloud computing in order to perform simulation scenarios. The user will be able to select multiple options to set different parameters. The various objectives behind this work are listed below:

• Development of a user-friendly interface for scheduling strategies in cloud computing. The interface will provide multiple options from which the user can select the different parameters.

• Allow the user to perform scheduling simulations. This will be beneficial when performing test scenarios, such as a set of workloads and a set of scheduling strategies, where the goal is to analyze the performance and observe the behaviour of the various strategies.

• Obtain more accurate performance results by offering multi-parameter selection. This refers to the fact that the platform will support a wide range of parameters, including multiple input files, scheduling algorithms, simulation machine setup, etc.

• Launch an open source version of the platform to be used by others and welcome contributions to extend its scale and improve its performance.

• Contribute to the cloud computing research field by providing a powerful tool that can support other research on scheduling techniques.

Conclusions

This work introduced the development of a web platform that can be used to run scheduling simulations in cloud computing environments. The proposed platform aims to provide an environment for performing test scenarios in which it is possible to select a set of workloads and a set of scheduling strategies, with the goal of analyzing the performance and observing the behaviour of the various strategies.

We began by presenting, in chapter 3, the various concepts related to our work, from cloud computing and scheduling methodologies to fuzzy logic and learning approaches. We also provided an overview of the simulation environment we used, as well as the different programming languages and virtualization platforms we needed.

Then, in chapter 4, we provided a detailed explanation of the procedure we followed to develop the proposed platform. We ended the chapter by presenting the method we followed to integrate Big Data applications so that the platform supports them.

In chapter 5, we showed the configurations we established for our experiments to test the platform. We provided, in a step-by-step explanation, three simulation scenarios that can be served by the platform. We ended the chapter with a discussion of how the platform can be beneficial for analyzing and comparing simulation results.

To conclude, we developed the proposed web platform; nevertheless, in the following section we set out some future goals that could further improve it, and we welcome any contribution from other researchers to this work.

Contents

List of Figures 10
List of Tables 12

1 INTRODUCTION 13

2 OBJECTIVES 14

3 ANTECEDENTS 15
3.1 Cloud computing 15
3.1.1 Cloud computing types 15
3.2 Cloud scheduling 16
3.2.1 Scheduling algorithms 17
3.3 Fuzzy logic 18
3.4 Knowledge acquisition systems 20
3.4.1 Genetic Fuzzy systems 21
3.4.1.1 Pittsburgh approach 22
3.4.2 Particle Swarm Optimization Fuzzy systems 23
3.4.2.1 KASIA approach 24
3.5 Simulation environment: WorkflowSim with DVFS 26
3.5.1 WorkflowSim 26
3.5.2 Power model 27
3.5.3 DVFS 29
3.5.4 WorkflowSim-DVFS operation 29
3.5.5 Fuzzy schedulers 31
3.6 Programming languages, databases and virtualization platforms 33
3.6.1 NodeJS 33
3.6.2 Python 34
3.6.3 Java 35
3.6.4 MySQL 36

3.6.5 Docker Containers 37
3.7 Big Data 38

4 METHODOLOGY 41
4.1 Framework overview 41
4.2 Platform development 42
4.2.1 Conceptual model 42
4.2.1.1 Layered architecture 42
4.2.1.2 Layers communication 43
4.2.1.3 Main building blocks 45
4.2.2 User interface layer 49
4.2.2.1 UI design 49
4.2.2.2 UI development 56
4.2.3 Operational layer 65
4.2.3.1 Operations program implementation 65
4.2.3.2 Scheduling simulator modifications 71
4.2.4 Infrastructure layer 73
4.2.4.1 Dockerization 73
4.2.4.2 Cluster managers 74
4.3 Big Data integration 75
4.3.1 Integration method 75
4.3.2 Integration issues 76

5 RESULTS AND DISCUSSION 77
5.1 Simulation scenarios 77
5.1.1 Configurations 77
5.1.1.1 Infrastructure configuration 77
5.1.1.2 Database configuration 78
5.1.1.3 Knowledge acquisition algorithm configuration 78
5.1.1.4 Ports set up 81
5.1.2 Simulation scenario 1 82
5.1.3 Simulation scenario 2 85
5.1.4 Simulation scenario 3 88

5.2 Discussion 90

6 CONCLUSION AND FUTURE GOALS 93
6.1 Conclusions 93
6.2 Future goals 94

Bibliography 95

List of Figures

3.1 Operation of a fuzzy rule based system FRBS 20
3.2 Genetic fuzzy systems 21
3.3 WorkflowSim Overview 27
3.4 Montage workflow with 25 jobs 28
3.5 Docker structure 37
3.6 Virtual Machine structure 38
3.7 Layers of resource management in Big Data cluster 40

4.1 Layered architecture of the platform 43
4.2 Interaction between layers 44
4.3 Tree graph for building blocks components of the platform 46
4.4 PowerAwareFuzzySeq Procedure 47
4.5 Checking the scheduling strategy and simulation type 47
4.6 Main window of the user interface 49
4.7 Input window 50
4.8 Schedule window 51
4.9 Data Center window 52
4.10 Virtualization window 53
4.11 Simulation window, basic view 54
4.12 Simulation window, training view 55
4.13 Convert EJS into HTML 57
4.14 Different HTML files of the user interface imported by index.ejs 58
4.15 Project directory folders 58
4.16 Class Diagram represents CreateResults() function 61
4.17 Flow chart for the main file, service.py, in the operations program 67
4.18 Flow chart for Pittsburgh learning method in the operations program 70
4.19 Flow chart for reference A 71
4.20 Packages extension 72

5.1 Scenario 1, selecting the input from interface 82
5.2 Scenario 1, selecting the scheduling algorithm 82
5.3 Scenario 1, selecting the simulation parameters 83
5.4 Scenario 1, initial values stored in the database 83
5.5 Scenario 1, final values stored in the database 84
5.6 Scenario 1, final values displayed in task list window 84
5.7 Scenarios similar to scenario 1 84
5.8 Scenario 2, selecting the input from interface 85
5.9 Scenario 2, selecting the scheduling algorithm 85
5.10 Scenario 2, selecting the simulation parameters 86
5.11 Scenario 2, initial values stored in the database 86
5.12 Scenario 2, iterations values stored in the database 87
5.13 Scenario 2, final values stored in the database 87
5.14 Scenario 2, final values displayed in task list window 87
5.15 Scenarios similar to scenario 2 88
5.16 Scenario 3, selecting simulation parameters 88
5.17 Scenario 3, final values stored in the database 89
5.18 Bar chart for the results indicated in table 5.5 91
5.19 Bar chart for the results indicated in table 5.6 91

List of Tables

4.1 Input files 50
4.2 Scheduling algorithms 51
4.3 Data center characteristics. First part 52
4.4 Data center characteristics. Second part 53
4.5 Virtualization parameters 54
4.6 Simulation parameters 56

5.1 Machine specifications 78
5.2 Simulations table 79
5.3 Results table 79
5.4 Learning algorithm parameters 80
5.5 Results for Inspiral 100 file 90
5.6 Results for different input files 92

CHAPTER 1

INTRODUCTION

Cloud computing refers to a type of platform that acts as an infrastructure for accessing, computing and storing data [28]. Cloud computing has become very popular in the last decade; its powerful features allow it to behave as a platform that serves a wide range of applications. Cloud scheduling is an evolving topic in cloud computing research. The general definition of scheduling is the ability of a scheduling algorithm to map job requirements to the availability of resources [70]. However, jobs have to be assigned to resources efficiently; thus, scheduling needs to be performed in an optimized way. The benefit of this is to improve throughput and system load balance, reduce energy consumption, reduce execution time and improve the performance of the cloud environment. In general, scheduling methodologies are based on objectives such as time, cost, energy and quality of service [8]. Some of the most popular scheduling algorithms are the Min Min algorithm, the Max Min algorithm and the Round Robin algorithm.

On the other hand, web application development has become very popular these days, as it facilitates the presentation of a specific application. There are various software packages and tools available to build a web application, whether for front end development, referred to as the client side, or back end development, that is, the server side. Web platforms are easily accessible through browsers and can be run on different operating systems.

There are different levels at which it is possible to integrate different scheduling strategies into a platform. One level is to have multiple scheduling strategies and build a scheduler selector which automatically selects the appropriate scheduler depending on the workload type. Another approach is to implement a platform that integrates different scheduling strategies and gives users the option to access different schedulers. In the latter approach, it is possible to run different sets of workloads with different sets of scheduling strategies and analyze the performance of each. In this work, we implement the latter approach.

CHAPTER 2

OBJECTIVES

This work aims to implement a platform that allows the selection of different scheduling strategies in cloud computing to perform simulation scenarios. The user will be able to select multiple options to set different parameters. The various objectives behind this work are listed below:

• Development of a user-friendly interface for scheduling strategies in cloud computing. The interface will provide multiple options from which the user can select the different parameters.

• Allow the user to perform scheduling simulations. This will be beneficial when performing test scenarios, such as a set of workloads and a set of scheduling strategies, where the goal is to analyze the performance and observe the behaviour of the various strategies.

• Obtain more accurate performance results by offering multi-parameter selection. This refers to the fact that the platform will support a wide range of parameters, including multiple input files, scheduling algorithms, simulation machine setup, etc.

• Launch an open source version of the platform to be used by others and welcome contributions to extend its scale and improve its performance.

• Contribute to the cloud computing research field by providing a powerful tool that can support other research on scheduling techniques.

CHAPTER 3

ANTECEDENTS

3.1 Cloud computing

Cloud computing refers to a type of platform that acts as an infrastructure for accessing, computing and storing data [28]. It has become very popular in the last decade; its powerful features allow it to behave as a platform that serves a wide range of applications. Cloud computing has shown enormous benefits in terms of:

(a) Reduction in costs: customers are billed only for the specific resources they use and for the period of time they use them.

(b) Cloud scalability: this refers to the ability to scale the resources up or down depending on the requirements.

(c) Security: cloud ensures secure access, data encryption and security intelligence [61].

(d) Simplicity: accessing resources is easy, and dealing with the cloud is simple and can be done by all kinds of customers.

3.1.1 Cloud computing types

Due to the previously mentioned benefits, it was essential to have different types of cloud computing, as well as different services, to provide support according to the specific requirements of the data. The cloud computing types are:

(a) Public Cloud
Public cloud providers aim to provide services and resources to customers who want to pay only for what they consume to build their applications.

(b) Private Cloud
The main benefits of setting up a private cloud rather than using a public one are to ensure security for organizations and to reduce the overhead of accessing public clouds.

(c) Hybrid Cloud
A combination of private and public clouds, providing higher flexibility and efficiency.

Cloud computing services are referred to as the Service Models of Cloud Computing [59]. The three main models are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), as follows:

(a) Infrastructure as a Service (IaaS)
This service provides the infrastructure for the customer to support application development. IaaS aims to provide computing and storage resources in order to cover the customers' needs for physical resources. Some examples of IaaS include Amazon's EC2 and Google Compute Engine.

(b) Platform as a Service (PaaS)
The service provided is a platform for developing cloud programs. PaaS aims to simplify the procedure for customers when developing their applications and services by providing a development environment. Some examples of PaaS include IBM Cloud and Google AppEngine.

(c) Software as a Service (SaaS)
This service provides built-in software to be used by customers. SaaS can be accessed through the internet; examples of SaaS include Google Apps and Google Docs.

3.2 Cloud scheduling

Cloud scheduling is an evolving topic in cloud computing research. The general definition of scheduling is the ability of a scheduling algorithm to map job requirements to the availability of resources [70]. In cloud computing, quality of service (QoS) as well as energy consumption are crucial elements that must be taken into consideration when performing scheduling. The elasticity of using a shared resource pool is the main reason that the cloud is able to provide virtually unlimited resources [70]. However, jobs have to be assigned to resources efficiently; thus, scheduling needs to be performed in an optimized way. The benefit behind this is to

improve throughput and system load balance, reduce energy consumption, reduce execution time and improve the performance of the cloud environment as a whole. In general, scheduling methodologies are based on objectives such as time, cost, energy and quality of service [8]. Task/job scheduling in cloud computing is done by allocating the virtual machine (VM) resources according to the user demands [24]. There are two levels of scheduling in cloud computing [44]: the VM level and the host level. The VM level is responsible for reserving virtual machines for jobs using a scheduler; this is called task scheduling. At the host level, the virtual machines are allocated to physical machines, hence the name VM scheduling. There are many types of scheduling that can be performed in the cloud, including centralized vs. decentralized, static vs. dynamic and heuristic vs. metaheuristic scheduling.
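To make the two levels concrete, the minimal Java sketch below separates the two decisions behind hypothetical Task, Vm and Host types; it is an illustration of the concept only, not CloudSim's or WorkflowSim's actual API.

```java
import java.util.Comparator;
import java.util.List;

public class SchedulingLevelsSketch {
    record Task(String id, long lengthMi) {}          // task length in million instructions
    record Vm(String id, double mips, int hostId) {}  // virtual machine capacity
    record Host(int id, double freeMips) {}           // physical machine capacity

    // Task scheduling (VM level): map a task to the fastest available VM.
    static Vm scheduleTask(Task task, List<Vm> vms) {
        return vms.stream().max(Comparator.comparingDouble(Vm::mips)).orElseThrow();
    }

    // VM scheduling (host level): place a VM on the host with the most free capacity.
    static Host scheduleVm(Vm vm, List<Host> hosts) {
        return hosts.stream().max(Comparator.comparingDouble(Host::freeMips)).orElseThrow();
    }

    public static void main(String[] args) {
        List<Host> hosts = List.of(new Host(1, 4000), new Host(2, 8000));
        List<Vm> vms = List.of(new Vm("vm1", 1000, 1), new Vm("vm2", 2000, 2));
        Task t = new Task("t1", 5000);
        System.out.println("Task " + t.id() + " -> " + scheduleTask(t, vms).id());
        System.out.println("vm1 -> host " + scheduleVm(vms.get(0), hosts).id());
    }
}
```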

3.2.1 Scheduling algorithms

There is a wide variety of scheduling strategies, and much research has been done in this field. In cloud computing, some of the most popular scheduling algorithms are mentioned below.

(a) Min Min scheduling algorithm
The concept of the Min Min algorithm is based on assigning resources to the tasks that have the minimum execution time first [33][56]. When the tasks arrive in the queue, the algorithm checks the completion time of all tasks and then assigns the task with the least completion time to the corresponding resource. The process is repeated until all tasks are completed (see the sketch after this list).

(b) Max Min scheduling algorithm
The Max Min algorithm works in a similar way to the Min Min algorithm, except that the resources are assigned to the tasks that have the maximum execution time first [33][56]. The benefit of using this algorithm is to minimize the waiting time of larger tasks [31].

(c) Round Robin scheduling algorithm
The Round Robin (RR) algorithm is based on the concept of time slicing. Jobs that do not find free resources wait in the queue. Each job in the waiting queue is given a time slice referred to as a time quantum (TQ) [62][2], which corresponds to the amount of time the job will be served by the resources. After this TQ finishes, the job is returned to the end of the queue [50][3][2].

(d) Power Aware Scheduling Algorithm
A power-aware algorithm uses dynamic voltage scaling (DVS) to choose both the next job and the processor's operating frequency [53]. This type of algorithm is mainly used to minimize energy consumption. For instance, PowerAwareSeqSchedulingAlgorithm orders the tasks sequentially and then selects the VM with the least power consumption to serve each task in each turn. Thus, the VM that consumes the least power will be busy serving the first task, while the VM with the second lowest consumption will be assigned to the second task, and so on. Once the first VMs finish, they are reused before moving to other VMs with higher consumption.

(e) Static scheduling algorithm
Static scheduling is also referred to as offline or deterministic scheduling. In this type of scheduling, no run-time scheduling is needed; the algorithm runs at compile time [58]. The tasks are ordered and the algorithm schedules them based on that order [69].
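As an illustration of how such heuristics are usually expressed, the following minimal Java sketch implements the Min Min idea over hypothetical Task and Vm types (it is not the WorkflowSim implementation): at every step it picks the task/VM pair with the smallest completion time, assigns it, and updates that VM's ready time.

```java
import java.util.*;

public class MinMinSketch {
    record Task(String id, double lengthMi) {}   // task length in million instructions
    record Vm(String id, double mips) {}          // processing capacity in MIPS

    public static Map<Task, Vm> schedule(List<Task> tasks, List<Vm> vms) {
        Map<Task, Vm> plan = new HashMap<>();
        Map<Vm, Double> readyTime = new HashMap<>();
        vms.forEach(vm -> readyTime.put(vm, 0.0));
        List<Task> pending = new ArrayList<>(tasks);

        while (!pending.isEmpty()) {
            Task bestTask = null; Vm bestVm = null; double bestCompletion = Double.MAX_VALUE;
            // Min Min: among all pending tasks, pick the (task, VM) pair
            // with the minimum completion time.
            for (Task t : pending) {
                for (Vm vm : vms) {
                    double completion = readyTime.get(vm) + t.lengthMi() / vm.mips();
                    if (completion < bestCompletion) {
                        bestCompletion = completion; bestTask = t; bestVm = vm;
                    }
                }
            }
            plan.put(bestTask, bestVm);
            readyTime.put(bestVm, bestCompletion);   // the chosen VM is busy until this time
            pending.remove(bestTask);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Task> tasks = List.of(new Task("t1", 4000), new Task("t2", 1000), new Task("t3", 8000));
        List<Vm> vms = List.of(new Vm("vm1", 1000), new Vm("vm2", 500));
        schedule(tasks, vms).forEach((t, vm) -> System.out.println(t.id() + " -> " + vm.id()));
    }
}
```

Max Min follows the same loop but, among the candidate pairs, favours the task with the maximum completion time.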

3.3 Fuzzy logic

The fuzzy logic concept was initiated by Lotfi A. Zadeh, professor of computer science at the University of California, Berkeley, in 1965 [20][73][55]. Fuzzy logic can be described as the logic that permits the use of ranges between conventional evaluations such as true/false, yes/no, high/low, etc. [55]. Descriptions like "very slow" or "short" can be used as inputs for computers to be processed with a methodology close to the human way of thinking [74][55]. The idea behind fuzzy logic started back in the fifties, when professor Zadeh started to work on analytical methods and system theory. At the beginning of the sixties, he believed that even complex real problems could be solved by system analysis techniques. Thus, he published a new theory, named "Fuzzy Set Theory", which defined what is called a fuzzy set. A fuzzy set is a set of fuzzy objects with a membership described as "continuous", in which there is a gradual transition from a membership subset to a non-membership subset [55]. For a specific problem, a non-fuzzy approach aims to include the detailed relations between the problem parameters, while a

fuzzy logic approach only cares about the essentials of the problem. Fuzzy logic is usually represented with simple IF-THEN rules; therefore, the fuzzy system is named a Fuzzy Rule Based System (FRBS). A fuzzy logic system, like every other system, has inputs, outputs and an operation that is performed on the inputs to produce the outputs. For a fuzzy system, the operation is composed of three components: membership functions, combination functions (and/or) and conditional statements (rules). An FRBS process goes through five steps, as described in the Matlab Fuzzy Logic Toolbox handbook [37], as follows:

1. "Fuzzify" the inputs This step is required to measure to which degree the input fits a specific input fuzzy set. Thus, for a specific fuzzy system, a mapping process needs to be done to "fuzzify" the input parameters: membership functions are required to map inputs, which can take any value, to fuzzy variables in the range between 0 and 1. The inputs are always named antecedents in fuzzy logic.

2. Apply the fuzzy operator The operators are used to define the fuzzy rules. The major operators that are used are the minimum function for intersection statements (AND), and the maximum function for union (OR) statements. Either of those operators could be performed on the fuzzified inputs.

3. Apply the implication method This method is used to produce outputs. The process is done by mapping each of the conditional statements to a value of the output. The output membership function of the output fuzzy set is truncated using the minimum function.

4. Aggregate all outputs It could happen that several rules correspond to a single output, so aggregation of all rules into a single fuzzy value is needed. The outputs could be combined by a summation, for instance, of the set of truncated output membership functions.

5. "Defuzzify" the fuzzy output This step is required to transform the fuzzy outputs back into crisp (classical) outputs. Consequent is the common name for an output in fuzzy logic. The steps are represented in figure 3.1.

Figure 3.1: Operation of a fuzzy rule based system FRBS

For a better understanding of how a fuzzy system operates, the reader can refer to the example demonstrated in [5] of an air conditioner control system created using an FRBS, which illustrates the previously mentioned terminology and operations. In this work we used a fuzzy rule based system (FRBS) to build a task scheduler.
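As a minimal illustration of the five steps above in code, the sketch below uses the jFuzzyLogic library (which is also the library used later, in section 3.5.5, for the fuzzy schedulers) to load a rule base written in FCL and evaluate it. The file name fuzzy_scheduler.fcl and the variable names are hypothetical placeholders, not the ones used in this work.

```java
import net.sourceforge.jFuzzyLogic.FIS;

public class FuzzyEvaluationSketch {
    public static void main(String[] args) {
        // Load a Fuzzy Control Language (FCL) file containing the membership
        // functions and the IF-THEN rule base (steps 1-5 are handled internally).
        FIS fis = FIS.load("fuzzy_scheduler.fcl", true);   // hypothetical file name
        if (fis == null) {
            System.err.println("Could not load the FCL file");
            return;
        }
        // Set the crisp inputs (antecedents)...
        fis.setVariable("vm_mips", 1000.0);                // hypothetical variable names
        fis.setVariable("power_consumption", 80.0);
        // ...run fuzzification, rule evaluation, aggregation and defuzzification...
        fis.evaluate();
        // ...and read back the crisp output (consequent).
        double suitability = fis.getVariable("suitability").getValue();
        System.out.println("Defuzzified output: " + suitability);
    }
}
```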

3.4 Knowledge acquisition systems

Knowledge acquisition can be defined as obtaining knowledge from the experts of a specific domain so that it can be used by a computer system [10]. For a certain problem, having well defined information is the first step in building a professional system, referred to as an expert system. Four steps are needed for the knowledge acquisition process [10]:

1. Planning: The first step is to explore the problem in hand, identify parameters, study knowledge acquisition techniques and build a design.

2. Knowledge capturing: To get knowledge about the specific problem domain using the different knowledge acquisition techniques.

3. Knowledge analysis: this process analyzes the captured information so that it can be represented in a clear format, for instance as a fuzzy logic set.

4. Knowledge verification: to build the knowledge base, a verification process is needed. In this process, experts make sure that the knowledge base is sufficient for the problem at hand.

Two major knowledge acquisition fuzzy systems are Genetic Fuzzy systems and Knowledge Acquisition in Fuzzy-Rule-Based Systems With Particle-Swarm Optimization. Those systems are explained in the following sections.

3.4.1 Genetic Fuzzy systems

Classical methods have proved themselves in solving engineering problems; however, complex advanced problems require new, improved methods [34][25]. The different computational intelligence techniques are examples of those new methods; these include genetic algorithms (GAs) [21][27], fuzzy logic [72] and artificial neural networks [49]. The combination of fuzzy logic and genetic algorithms is referred to as genetic fuzzy systems (GFSs) [14]. A GFS can be defined as a usual fuzzy system which uses evolutionary computation to perform a learning process. The basic concepts of fuzzy logic have been explained in section 3.3. On the other hand, a genetic algorithm is mainly a global search method that looks for optimal solutions in a complex search space. GAs are powerful in their ability to perform knowledge acquisition, and they can be combined with fuzzy systems to let them gain initial knowledge [25], for instance in their fuzzy rules, number of rules, fuzzy membership functions, etc.

Figure 3.2: Genetic fuzzy systems

In section 3.3, the main components of a fuzzy rule based system (FRBS) have been explained. More terminology is introduced in this section. There are different kinds of FRBSs; however, the most important one is referred to as Mamdani FRBSs or linguistic FRBSs [36]. Another, more complex type of FRBSs is called TS-type fuzzy systems [63]. In this work, the concentration will be on linguistic FRBSs. An important concept in linguistic FRBSs is the knowledge base (KB). The knowledge base (KB) has two elements [25]:

(a) A data base (DB), this includes the parameters used in the rules set as well as the membership functions.

(b) A rule base (RB), a group of rule sets, simply, the IF-THEN fuzzy rules.

There are two types of genetic fuzzy system approaches: genetic tuning and genetic learning. In this work we focus on genetic learning. Genetic learning uses rule coding and competition between individuals to learn rule sets. There are two ways to encode rules within a population of individuals:

(a) In the first one, each individual encodes a set of rules. This is referred to as the Pittsburgh approach [60].

(b) In the second approach, each individual encodes only one rule. Some examples include the Michigan approach, the iterative rule learning approach and genetic cooperative-competitive learning.

3.4.1.1 Pittsburgh approach

As explained in section 3.4.1, in a Pittsburgh approach, each individual encodes a set of rules. This behaviour avoids interaction between individuals. The following elements are significant in the rule selection methodology on Pittsburgh [14]:

• The population of rule bases: Pittsburgh considers the rule bases as solutions to the computational problem. A random or specific solution, i.e. a rule base, is chosen at the beginning of the process. A population of rule bases is then created.

• The evaluation system: also referred to as the fitness function. This component differs widely from one application to another, depending on the problem at hand.

• The rule base discovery system: the discovery system will provide new populations, hence, new sets of rules bases. This process is done by implementing evolutionary methods to the initial solution, and hence, to any new solution that is generated.

Algorithm 3.1: Pittsburgh approach

Result: return the best individual

1 Initialize population;

2 Evaluate fitness;

3 while Number of iterations do

4 Perform Crossover and get children;

5 Children mutation;

6 Evaluate children;

7 Iteration ++;

8 end

• Representation: an important concept in the field of evolutionary computation methods. In the Pittsburgh approach, the solution represented as a string of genes is the rule base.

• Crossover and mutation: also important concepts in evolutionary computation and genetic algorithms. In the Pittsburgh approach, crossover provides new combinations of rules, while mutation tends to generate new rules [14].

Algorithm 3.1 shows the pseudocode of the Pittsburgh approach.
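The following minimal Java sketch mirrors Algorithm 3.1 under simplifying assumptions (a rule base is encoded as an int matrix, selection is omitted and the fitness function is left abstract); it is only an illustration of the Pittsburgh loop, not the Matlab program used later in this work.

```java
import java.util.*;

public class PittsburghSketch {
    interface Fitness { double evaluate(int[][] ruleBase); }   // problem-specific evaluation

    static final Random RNG = new Random(42);

    public static int[][] run(List<int[][]> population, Fitness fitness, int iterations) {
        for (int it = 0; it < iterations; it++) {
            // Pick two parents (each parent is a whole rule base) and cross them over.
            int[][] p1 = population.get(RNG.nextInt(population.size()));
            int[][] p2 = population.get(RNG.nextInt(population.size()));
            int[][] child = crossover(p1, p2);
            mutate(child);                     // mutation may create new rules
            population.add(child);             // keep the child in the pool (evaluated below)
        }
        // Return the best individual (the rule base with the highest fitness).
        return population.stream().max(Comparator.comparingDouble(fitness::evaluate)).orElseThrow();
    }

    static int[][] crossover(int[][] a, int[][] b) {
        int cut = RNG.nextInt(a.length);       // one-point crossover over whole rules
        int[][] child = new int[a.length][];
        for (int i = 0; i < a.length; i++) child[i] = (i < cut ? a[i] : b[i]).clone();
        return child;
    }

    static void mutate(int[][] ruleBase) {
        int r = RNG.nextInt(ruleBase.length), c = RNG.nextInt(ruleBase[r].length);
        ruleBase[r][c] = RNG.nextInt(3);       // e.g. 3 linguistic labels per variable
    }
}
```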

3.4.2 Particle Swarm Optimization Fuzzy systems

Particle swarm optimization (PSO) was introduced by Kennedy and Eberhart [17]. The idea of particle swarm optimization was inspired by how animals, such as birds, behave in order to find food. The individuals of these swarms collaborate by exploring the search space, gaining experience from other individuals or from their own learning [71]. Before Kennedy and Eberhart, previous work by Reynolds [47] suggested that boids (bird-oid objects) might follow simple rules to improve their behaviour. Eberhart and Kennedy benefited from Reynolds' work and established the idea behind the PSO algorithm. The algorithm is based on the idea of searching an optimum region of the search space, and the solutions are considered as the particles of the swarm. The canonical particle swarm optimizer is based on this evolutionary idea, and the algorithm relies on swarm intelligence. PSO has been used in many research areas in order to solve complex multidimensional

problems. It has been compared to the genetic algorithm (GA) and has proved to outperform it in some scenarios [48]. PSO does not need the crossover and mutation operations required by GA, which makes it simpler to use. In PSO, particles update themselves taking velocity into consideration. Other features of the particles are that they have memory and that they communicate in an effective way. Also, convergence in PSO outperforms GA by requiring fewer evaluations. Canonical PSO proposes that the space where the particles search is a multidimensional space. The procedure starts by defining a number of particles and a number of iterations to repeat the procedure. An evaluation function, named the fitness function, is defined to measure the quality of solutions. The particles update their positions in each iteration, checking the new positions using the fitness function. The process is repeated, depending on the number of iterations, until an optimal solution is found. There are three main components that control how the particles update their positions: an inertial component, an inner (cognitive) component, defined by the particle itself to determine its best position, and lastly a social component, in which each particle is affected by the findings of the other particles in the swarm. By taking these points into consideration, the velocity and the position of each particle are updated in each iteration. Fuzzy logic and fuzzy rule based systems (FRBSs) have been introduced in section 3.3. Other knowledge acquisition methods in fuzzy systems exist in the literature, such as the Pittsburgh approach introduced in section 3.4.1.1. PSO has been used with fuzzy logic in multiple applications, such as [4] and [68]. Knowledge acquisition in fuzzy systems using PSO was introduced by [19], in which an individual of the swarm represents a rule base (RB) in such a way that the knowledge base (KB) is improved. Hence, knowledge acquisition with a swarm-intelligence approach (KASIA) was born.
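The canonical update that combines these three components is usually written as v = w·v + c1·r1·(pbest − x) + c2·r2·(gbest − x), followed by x = x + v. The short Java sketch below applies it to a single particle; the inertia weight w and the acceleration coefficients c1 and c2 are assumed example values, not parameters taken from this work.

```java
import java.util.Random;

public class PsoUpdateSketch {
    static final Random RNG = new Random();

    // One velocity/position update for a single particle in a multidimensional space.
    static void updateParticle(double[] x, double[] v, double[] pBest, double[] gBest) {
        double w = 0.7, c1 = 1.5, c2 = 1.5;          // assumed example coefficients
        for (int d = 0; d < x.length; d++) {
            double r1 = RNG.nextDouble(), r2 = RNG.nextDouble();
            v[d] = w * v[d]                           // inertial component
                 + c1 * r1 * (pBest[d] - x[d])        // cognitive (own best) component
                 + c2 * r2 * (gBest[d] - x[d]);       // social (swarm best) component
            x[d] += v[d];                             // move the particle
        }
    }
}
```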

3.4.2.1 KASIA approach

KASIA, knowledge acquisition with a swarm-intelligence approach, was established to improve the knowledge base of fuzzy systems using particle swarm optimization, as introduced in the previous section. In KASIA, the following assumptions are made [19]:

• Each particle represents a rule base (RB) formatted as a matrix in which the rows are the fuzzy rules. The rules are constructed from antecedents, consequents and a connector (AND/OR). Therefore, the goal is to optimize this matrix.

• The learning process does not affect the rule membership functions. Also, the particle size remains the same; thus, the rule base size does not change either.

• The algorithm assumes some constraints. Firstly, the antecedents cannot be equal to zero; this condition ensures rule base coherence. Secondly, the particles cannot exceed the solution space, and a set of equations is taken into account to enforce this condition. In each iteration the particles and the rules are tested to fulfil these conditions.

• A velocity matrix is constructed to handle the updating of particle positions. Hence, each particle has two matrices, one to represent the RB and the other to represent its velocity.

• The updating process is based on a fitness function to measure the quality of each particle, thus, a solution.

• The process starts by selecting an initial population of particles generated randomly, with random velocities. The constraints are checked and then the particle positions are updated according to the fitness function.

• The process is repeated depending on the number of iterations, or until a specific criterion is met.

• The best rule base (RB) for the particles and the best position found in the swarm will be specified by the end of each iteration.

• KASIA has been proved to converge to a local optimum, while convergence to a global optimum has not been proved yet.

• The control properties associated with KASIA convergence are similar to the ones of the original particle swarm optimization.

The pseudocode of the KASIA algorithm, as proposed by [19], is represented by algorithm 3.2. In this work we have used the Pittsburgh approach as the learning methodology, and we propose KASIA and the Michigan approach for future work.

Algorithm 3.2: KASIA

Result: return the best position

1 Initialization of swarm parameters;

2 Random setting of position and velocity;

3 Check conditions;

4 while Number of iterations do

5 while Number of particles do

6 Update position and velocity;

7 Check conditions;

8 Evaluate fitness;

9 Particles ++;

10 end

11 Iteration ++

12 end

3.5 Simulation environment: WorkflowSim with DVFS

In this work the simulation platform that has been used is WorkflowSim-DVFS. Before explaining how this simulator works, the following sections will provide a basic understanding of important elements involved in the behaviour of this simulator.

3.5.1 WorkflowSim

Scientific workflows have shown powerful abilities in representing complex computations. Those workflows regularly tend to be very complex, being constructed of complicated modules. As a result, measuring their performance has a high time cost; thus, scheduling algorithms have become the standard methodology to evaluate the performance of workflow systems [9]. Due to those facts, a simulation environment based on scientific workflows was developed. WorkflowSim was created by Weiwei Chen, a PhD student at the University of Southern California. WorkflowSim extended the features of an already existing simulator, CloudSim. CloudSim [7] is a well known simulation environment for cloud computing resources and infrastructure. However, CloudSim lacks support for workflow simulation features such as task clustering. Furthermore, workflows manage the order in which the tasks will be executed, in contrast to what happens with CloudSim. As a result,

WorkflowSim came to fulfil these features. WorkflowSim extends the functionalities of CloudSim by adding extra components on top of CloudSim scheduling. Some of those components are the workflow engine, the workflow mapper, the clustering engine, the failure generator and the failure monitor. Figure 3.3 shows those components. The workflow mapper is responsible for obtaining concrete workflows from abstract ones, while the workflow engine works with data dependencies. The clustering engine converts smaller tasks into a single larger task, which results in minimizing the execution time. Another important concept combined with WorkflowSim is the fact that it uses Directed

Figure 3.3: WorkflowSim Overview.

Acyclic Graph (DAG) [64] files in XML format, called DAX. DAX files can be generated by a software tool named Pegasus [45]. It is worth mentioning here some of the most popular DAX workflows; these include Montage, Inspiral and SIPHT workflows [15]. Montage workflows [38] are input images in FITS format related to astronomy and have been developed by NASA. SIPHT workflows [30] have been developed by the SIPHT program, which is related to a bioinformatics project at Harvard University. Finally, Inspiral workflows [30] are collected from the Laser Interferometer Gravitational Wave Observatory (LIGO), an observatory that deals with gravitational waves based on Einstein's theory. DAX files can be mapped into images; figure 3.4 represents an example of the workflow order for a Montage DAX file [13] in which 25 jobs are included.

Figure 3.4: Montage workflow with 25 jobs

3.5.2 Power model

A fitness function refers to the objective function that measures to which degree a specific solution is good. There are several power models, depending on the application at hand. The model used by WorkflowSim-DVFS follows equation 3.1:

E = P · t (3.1)

where E is the energy consumption, P is the power and t is the execution time. Equation 3.1 states that the total energy needed to perform a specific job depends on both the time to execute the job and the power consumed by the resources which serve that job. An important trade-off appears here between the power and the time needed: if more resources are used to serve a specific job, less time is needed but more power is consumed, and vice versa. Another model [32] follows equation 3.2:

E = E_d + E_s (3.2)

where E is the total energy consumption, E_d is the dynamic energy and E_s is the static energy. E_d depends on the processor and E_s depends on the other elements in the data center. The dynamic energy can be rewritten as in equation 3.3:

E_d = P_d · t (3.3)

where E_d is the dynamic energy consumption, P_d is the dynamic power and t is the execution time.

On the other hand, P_d can be rewritten as in equation 3.4:

P_d = k_d · C · V² · f (3.4)

where k_d is a constant, C is the processor capacitance, V is the processor voltage and f is the processor frequency. Also, E_s can be reformulated as in equation 3.5:

E_s = k_s · E_d (3.5)

where k_s is a constant. Finally, the overall energy can be reconstructed by combining equations 3.3 and 3.5:

E = E_s + E_d = k_s · E_d + E_d = (1 + k_s) · E_d ∝ E_d(V, f) (3.6)

Equation 3.6 explicitly introduces the importance of the voltage and frequency in the overall energy consumption. Section 3.5.3 explains this concept in detail.
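As a small worked example of equations 3.3 to 3.6, the following Java fragment computes the dynamic, static and total energy for assumed, purely illustrative values of the constants, capacitance, voltage, frequency and execution time (none of these figures come from this work).

```java
public class EnergyModelSketch {
    public static void main(String[] args) {
        // Assumed illustrative values, not measurements from this work.
        double kd = 1.0;          // dynamic constant k_d
        double ks = 0.3;          // static constant k_s
        double C  = 1.0e-9;       // processor capacitance (F)
        double V  = 1.2;          // processor voltage (V)
        double f  = 2.0e9;        // processor frequency (Hz)
        double t  = 120.0;        // execution time (s)

        double Pd = kd * C * V * V * f;        // equation 3.4: dynamic power
        double Ed = Pd * t;                    // equation 3.3: dynamic energy
        double Es = ks * Ed;                   // equation 3.5: static energy
        double E  = Ed + Es;                   // equations 3.2 / 3.6: total energy

        System.out.printf("Pd = %.2f W, Ed = %.1f J, Es = %.1f J, E = %.1f J%n", Pd, Ed, Es, E);
    }
}
```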

3.5.3 DVFS

Dynamic voltage and frequency scaling (DVFS) refers to the concept of varying the voltage and frequency of a hardware device to reduce the total amount of energy consumption. For cloud computing, this concept is crucial, since it can critically affect the performance [18][35]. Many methodologies have been developed in the literature to minimize energy consumption, and the DVFS algorithm is one of the best. DVFS scales the CPU voltage and frequency according to the need. Governors are the controllers for the voltage and frequency: static governors fix the values of both the voltage and the frequency, while dynamic governors vary them based on thresholds.

3.5.4 WorkflowSim-DVFS operation

Sections 3.5.1, 3.5.2 and 3.5.3 were necessary to explain how WorkflowSim-DVFS works. This simulator was introduced by [15]; it is an open source platform developed in Java and its source code is available at [1]. It extends the regular WorkflowSim by providing an energy-aware mechanism through the dynamic voltage and frequency scaling method.

It provides a computation and communication power model. The simulator behaves in a similar way to real-world CPUs, as it works with the DVFS technique and provides different types of governors. WorkflowSim-DVFS is an energy-aware simulator, meaning that it takes energy consumption into account. It has the ability to simulate workloads similar to real ones in the cloud. In addition, it offers task clustering. The DVFS algorithm, as explained earlier, is implemented in this simulator through five governors which can adjust the frequency depending on the workload that will be processed. These five governors are [15]:

• Performance: This is a static governor where the highest frequency will be selected.

• Powersave: As well a static governor but rather the lowest frequency will be selected

• Userspace: The frequency is selected based on the user selection.

• Ondemand: A dynamic governor where the frequency is selected based on CPU utilization and a specific threshold (see the sketch after this list).

• Conservative: Another dynamic governor, very similar to Ondemand; however, the frequency selection is performed gradually.
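The following minimal Java sketch illustrates the idea behind a threshold-based (Ondemand-like) governor; the threshold value and the list of available frequencies are assumptions for illustration, not the values used by WorkflowSim-DVFS.

```java
import java.util.List;

public class OndemandGovernorSketch {
    private static final double UP_THRESHOLD = 0.8;   // assumed utilization threshold

    // Available P-state frequencies in MHz, sorted ascending (assumed values).
    private static final List<Double> FREQUENCIES = List.of(800.0, 1600.0, 2400.0, 3200.0);

    /** Pick the next CPU frequency from the current utilization (0.0 - 1.0). */
    public static double nextFrequency(double utilization) {
        if (utilization >= UP_THRESHOLD) {
            // Busy CPU: jump straight to the highest frequency.
            return FREQUENCIES.get(FREQUENCIES.size() - 1);
        }
        // Otherwise pick the lowest frequency that still covers the current load.
        double needed = utilization * FREQUENCIES.get(FREQUENCIES.size() - 1);
        for (double f : FREQUENCIES) {
            if (f >= needed) return f;
        }
        return FREQUENCIES.get(FREQUENCIES.size() - 1);
    }

    public static void main(String[] args) {
        System.out.println(nextFrequency(0.15));  // -> 800.0
        System.out.println(nextFrequency(0.55));  // -> 2400.0
        System.out.println(nextFrequency(0.90));  // -> 3200.0
    }
}
```

A Conservative-style governor would instead step one frequency up or down per decision rather than jumping directly.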

It is important to mention that WorkflowSim-DVFS has the capability to compute the average power needed, the consumed energy and the execution time. These features enable the simulator to give a performance estimation of new methodologies designed for energy saving. The power model used in this simulator is inspired by [23], which states that the power consumed by a CPU can be represented by equation 3.7:

P_total(α, V, f) = P_idle(V, f) + [P_full(V, f) − P_idle(V, f)] · α (3.7)

where α is the utilization rate of the CPU. Hence, when α is 0 the machine is idle, whereas when α is 1 the machine is fully operational. The simulator returns four output values to the user after finishing the simulation, namely the total execution time, the total power consumption, the average power and the energy consumption. The main components of WorkflowSim-DVFS are the planner, the merger, the engine, the scheduler and the datacenter. The process starts at the planner, where the DAX files are analyzed and converted to tasks that are pushed to the merger. The merger, as its name suggests, is the clustering engine responsible for gathering the tasks into jobs. Those jobs are passed to the engine, which checks the

order of the workflow. Then the scheduler selects from the datacenter the virtual machines that will process the tasks.
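To make equation 3.7 and the E = P · t relation concrete, the short Java fragment below estimates power and energy for an assumed host; the idle and full power figures are illustrative placeholders, not values taken from the simulator.

```java
public class UtilizationPowerSketch {
    /** Equation 3.7: linear interpolation between idle and full power. */
    static double totalPower(double utilization, double pIdle, double pFull) {
        return pIdle + (pFull - pIdle) * utilization;
    }

    public static void main(String[] args) {
        double pIdle = 70.0;     // assumed idle power of a host (W)
        double pFull = 250.0;    // assumed fully loaded power (W)
        double alpha = 0.6;      // CPU utilization during the job
        double time  = 300.0;    // execution time (s)

        double power  = totalPower(alpha, pIdle, pFull);    // 178 W
        double energy = power * time;                       // E = P * t = 53,400 J
        System.out.printf("P = %.1f W, E = %.1f J%n", power, energy);
    }
}
```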

3.5.5 Fuzzy schedulers

The previous section gave a general overview of how WorkflowSim-DVFS works. However, there are important points that need to be mentioned here [51]:

• The original WorkflowSim was created with an internal VM scheduler that is power aware; however, WorkflowSim does not support a power-aware task scheduler. In fact, this is left to the user, who can select from the classic Min Min and Max Min scheduling strategies. Together with the development of WorkflowSim-DVFS, work was done to make both algorithms power aware, resulting in powerawareMinMin and powerawareMaxMin. Thus, WorkflowSim-DVFS is now able to test those new algorithms, as it provides power-aware task scheduling. The difference between VM scheduling and task scheduling has been explained in section 3.2, and the Min Min and Max Min algorithms were introduced in section 3.2.1.

• Although the new simulator, WorkflowSim-DVFS, can now work with power-aware scheduling algorithms, these classic algorithms consider only one parameter, for instance time or power. Another methodology was developed to consider multiple parameters: the FRBS. Using a fuzzy rule based system (FRBS), new schedulers were created that support multiple parameters. Hence, two schedulers were developed, one for VM scheduling and the other for task scheduling, using FRBSs, and they were tested with WorkflowSim-DVFS. This section provides an overview of how these schedulers were developed.

As explained in section 3.3, a fuzzy system is composed of antecedents, consequents and operators. The antecedents are the inputs, the consequents are the outputs, and a set of membership functions is defined. The rule base is a critical element for obtaining a good final decision. In this case, the decision refers to selecting which host will serve a VM in the case of VM scheduling, and which VM will serve a specific task in the case of task scheduling. To obtain a good rule base, a Matlab program was developed to help select the best rule base and, therefore, to create good schedulers. The Matlab program uses the Pittsburgh approach explained in section 3.4.1.1. The procedure implemented for both schedulers is quite similar, as follows [51]:

• To fuzzify the inputs, apply the rule base and then defuzzify the output, jFuzzyLogic [12][11] was used.

• The rule base is important as it directly affects the final decision of both schedulers: which host will serve a VM in the case of VM scheduling, and which VM will serve a specific task in the case of task scheduling. For this reason, a Matlab program using the Pittsburgh approach was developed to find, for each scheduler, the optimum rule set corresponding to the minimum energy consumption.

• For each scheduler a different set of antecedents was defined. For VM scheduling, four variables were used, namely the MIPS requested by the VM, the total MIPS of the host, the utilization of the host and the power consumption. On the other hand, the antecedents for task scheduling are the VM MIPS, the power consumption, the task length, the execution time and the total energy.

• After creating the initial rule bases, they are sent to the simulator in JSON format to be evaluated. The simulator evaluates the quality of each rule base.

• The simulator evaluates the quality of the rule set based on the energy consumption. The rule set is then improved by the Pittsburgh algorithm.

• The Pittsburgh function is the approach that improves the rule bases; its pseudocode was shown in section 3.4.1.1. The improved rule set is sent again to the simulator to be evaluated. This process is repeated, depending on the number of iterations, until the optimum rule set is found for each scheduler, corresponding to the minimum energy consumption.

• After finishing all the iterations, the best rule base, and hence the best decisions, is sent to each scheduler to be used, resulting in the system with the minimum energy consumption.

• The performance of these developed schedulers was tested in comparison with other classic schedulers and showed significant results.

WorkflowSim-DVFS was used in this work as the scheduling simulator, as we will see in the next chapter.
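The overall training loop described above can be summarized by the sketch below. It is written in Java purely for illustration (the thesis drives this loop from a Matlab program), and runSimulation(), toJson() and the RuleBase type are hypothetical stand-ins for the real interface between the learning program and WorkflowSim-DVFS.

```java
import java.util.Comparator;
import java.util.List;

public class SchedulerTrainingSketch {
    // Hypothetical rule base representation and simulator interface.
    interface RuleBase { String toJson(); }
    interface Simulator { double runSimulation(String ruleBaseJson); }   // returns energy (J)
    interface Learner { List<RuleBase> improve(List<RuleBase> population, List<Double> energies); }

    /** Iterate: evaluate each rule base in the simulator, then let the learner improve the population. */
    static RuleBase train(List<RuleBase> population, Simulator sim, Learner pittsburgh, int iterations) {
        for (int it = 0; it < iterations; it++) {
            List<Double> energies = population.stream()
                    .map(rb -> sim.runSimulation(rb.toJson()))     // fitness = consumed energy
                    .toList();
            population = pittsburgh.improve(population, energies); // e.g. Pittsburgh crossover + mutation
        }
        // Keep the rule base that produced the minimum energy consumption.
        return population.stream()
                .min(Comparator.comparingDouble(rb -> sim.runSimulation(rb.toJson())))
                .orElseThrow();
    }
}
```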

3.6 Programming languages, databases and virtualization platforms

This section will give basic concepts about the programming languages, software tools, databases and virtualization platforms that have been used throughout the work.

3.6.1 NodeJS

Node.js is an open source platform for web development that works as a runtime environment for JavaScript [39]. The four major concepts involved with Node.js are [54]:

(a) Google's V8 engine
This is the core engine used by Node.js. It is written in C++ and works as a compiler for JavaScript functions.

(b) Non-blocking I/O operations
This is related to the way Node.js handles requests: Node.js provides asynchronous I/O behaviour, which makes it more effective to use.

(c) Event-driven
Node.js is not thread-based but rather event-based. This refers to the way events are handled and processed.

(d) Node.js package manager (npm)
Provides the Node.js tooling as well as a registry containing a large number of libraries.

The above-mentioned components give Node.js advantages over other web development tools. More benefits are introduced below:

(a) Node.js is an open source platform that could be accessed by all kind of users.

(b) It doesn’t have the complications of front end/back end development. In fact, it provides full working environment for developers.

(c) Node.js is scalable and extendable.

(d) It can be used with different OS platforms such as Windows, Linux...etc.

(e) It provides a sufficient level of performance, speed and resource-efficiency.

(f) Node.js has a big community support behind it.

Node.js is the software platform that was used to build the web platform interface for this work.

3.6.2 Python

Python is a general purpose programming language that uses dynamic semantics [46]. This refers to its ability to provide dynamic typing and dynamic binding. Its simplicity has made it one of the most widely used programming languages. Python scripts can be run using the Python interpreter, which works at the level of the Unix shell: by running the command python <script name>, the script is passed to the interpreter. The Python organization (Python's creator) summarizes the benefits of using the Python language in the following points [46]:

(a) It has a simple syntax, which not only makes it easy to learn but also reduces program maintenance cost.

(b) It provides modules and packages, which enable program modularity.

(c) Open source with all its libraries and sources.

(d) There is no compilation phase which increases productivity.

(e) The debug process is very fast, and a bug will not cause a segmentation fault.

(f) The debugger is very powerful, allowing inspection at all levels.

(g) The fast edit-test-debug cycle simplifies the process of debugging by providing an effective debugging methodology.

Python has been extensively used in many computer science areas. The fact that Python is very flexible and has a huge stock of libraries has made it suitable to contribute to the fields of artificial intelligence, machine learning and deep learning. This work includes a Python application to run the simulation scenarios.

3.6.3 Java

Java is an object oriented programming language that was created by Sun Microsystems in 1995 [29]. It has many features that have made many software tools depend on it. The Java Language Environment [43] explains in detail the different features that make Java powerful; some are listed below:

(a) Simple and object oriented
Java is well known for being very simple to use and learn. Its simplicity is related to the fact that it is an object-based language.

(b) Robust and secure
Reliable programs can be designed with Java thanks to the compilation process followed by run-time checking; thus, reliable software can be easily built. Security features are also built into Java, which allows developers to build secure applications.

(c) Architecture neutral and portable
Architecture neutrality is related to the capability of working on different architectural designs and operating environments; in fact, a portable system is architecturally neutral. Java supports this feature through the Java Virtual Machine.

(d) High performance
Java can be used to build applications that require a large amount of computational power due to its high performance capabilities.

The source file extension for Java files is .java. The compilation process results in class files constructed of bytecodes, which can be processed by the Java Virtual Machine. Java can operate on different operating system platforms. Two elements compose the Java platform: the Java Virtual Machine and the Java Application Programming Interface (API). Developers have extensive capabilities with Java [42]: they are provided with compact development tools such as the javac compiler and the java launcher. Java also supports user interface toolkits that enable users to create graphical user interfaces, an example being the Java 2D toolkit. Another capability is the ability to deploy Java applications through deployment software such as the Java Plug-In. Furthermore, it offers classes and libraries you can integrate into your applications. It is also important to mention two terms that are always combined with Java: the Java Runtime Environment (JRE) and the Java Development Kit (JDK). The difference between these two packages is that the JDK is responsible for the creation and compilation processes, while the JRE is the environment for running programs. For this work, the simulator for the web platform was developed in Java.

3.6.4 MySQL

MySQL [41] is an open source SQL database management system developed by Oracle. SQL stands for "Structured Query Language". The MySQL management system performs CRUD operations (create, read, update and delete) on a collection of data structured as databases [40]. MySQL is a relational database, which means it supports relationships between data fields. The MySQL database server is very powerful in managing huge databases. It is very secure and fast to use and can be deployed on any local machine. Furthermore, it supports both client/server and embedded systems and can be provided as a library to support standalone applications. The MySQL server is a core tool in multiple popular applications. MySQL is supported by different operating systems such as Windows and Linux. In addition, a wide range of data types is supported by MySQL. Different query statements are also defined, including the select-where clause, group statements and group functions. A security system is built into MySQL, which ensures secure access and connections. Clients can connect to the server using TCP/IP sockets, named pipes or Unix domain socket files. Client programs can be written in different programming languages, and MySQL has various connectors that support ODBC connections, JDBC connections and .NET applications. There are many extra facilitating tools and programs provided by MySQL, such as MySQL Workbench, which is a popular graphical interface for managing MySQL servers. It provides control over server instances and capabilities to set connection parameters, execute queries on databases using the SQL Editor, and facilitate data modeling and data migration. Other tools include mysqldump, mysqladmin and mysqlcheck. In this work, we used MySQL as the database management system for the platform that we developed.
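As a small illustration of the CRUD-style access mentioned above, the following Java/JDBC sketch connects to a local MySQL server and reads one table, assuming the MySQL Connector/J driver is on the classpath. The connection URL, credentials and the simulations table name are illustrative placeholders, not the platform's actual configuration (the platform itself accesses MySQL from its Node.js and Python components).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL, user and password for a local MySQL instance.
        String url = "jdbc:mysql://localhost:3306/scheduling_platform";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // Hypothetical table holding one row per simulation run.
             ResultSet rs = stmt.executeQuery("SELECT id, status FROM simulations")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + " -> " + rs.getString("status"));
            }
        }
    }
}
```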

3.6.5 Docker Containers

Docker is an open source platform that runs software and applications in a packaged format in containers. Containers are isolated from each other and they run using a single operating system [6]. The concept of Docker containers has a big influence on how developers work on their applications. It facilitates the development procedure by dividing a large scale application into small scale applications deployed into containers, and then providing control over these containers. Developers are able to deploy their applications in production mode as soon as possible with Docker, as stated by [52]. There are four important terms related to Docker: the Docker server and client, Docker containers, Docker images and Docker registries. The server and client are the core elements, in which the client is served by the daemon/server; they can run on different machines or on the same machine if needed [65]. Docker images are the templates that hold the applications. A Docker image can be built either from a base image, such as an operating system, where modifying the base image results in a new image, or by building images using Dockerfiles. Docker registries are simply the repositories for Docker images; for instance, Docker Hub is a public registry, and private registries are also available. After creating a Docker image using the Docker build command, the Docker run command can be used to create a Docker container that runs the application.

Figure 3.5: Docker structure

The benefits of using Docker are numerous. Docker is very fast when building and running applications in containers [66]. Docker is also scalable, as it can run on various platforms and systems. Thus, Docker provides a reliable environment and portable applications that can be easily developed and deployed [6]. However, Docker doesn't support full virtualisation as it is based on the Linux kernel, so on Windows and Mac the full environment must be virtualized.

It is important to mention the difference between virtual machines and Docker as they are based on similar concepts. While virtual machines use a hypervisor, Docker is more efficient by simply using containers, which reduces system overhead. Figure 3.5 represents the Docker structure and figure 3.6 represents the virtual machine structure.

Figure 3.6: Virtual Machine structure

In this work, we used Docker to package the different lightweight applications (simulator, web interface, etc.) that were running separately into a single dockerized application.

3.7 Big Data

The term big data has become very popular in the last decade. This term doesn't refer only to the fact that the size of the data is big, but also to the amount of input data per time unit and to the different types of data. This is the 3V factor: volume, velocity and variety [28]. Due to the complexity of big data, processing platforms have been developed to handle it. These platforms can be classified into batch, stream and hybrid based platforms. The reason for this classification is the various types of applications that are evolving and the need for more complex services. A typical batch-based platform is Apache Hadoop, introduced in 2005 by the Yahoo company. Hadoop is an open source platform that uses commodity hardware to perform task processing. The two main elements in the Hadoop structure are the Hadoop distributed file system (HDFS) and Hadoop Yarn. The purpose of the file system is to divide the data into data blocks, thus reducing the cost of operations on large files. Furthermore, the distributed file system uses networks in an efficient way. On the other hand, Yarn [67] is the manager of Hadoop resources; it aims to control the process of handling applications through its master and slave nodes. Hence, the job of Yarn is to manage the tasks. Hadoop uses the MapReduce programming concept

[16]. MapReduce refers to using mappers and reducers; more information about mapper and reducer operations can be found in [57]. Several problems arise in Hadoop: it lacks the ability to work with iterative methods in machine learning, and such methods cause huge amounts of data to be stored on disk, which affects the system overhead. Another important platform to be mentioned is the hybrid Apache Spark platform, developed at the University of California at Berkeley. The similarity between Apache Spark and Hadoop is that Spark is also open source software for processing big data; however, it supports both batch-based and stream applications. Stream applications are real-time streaming applications. Spark also outperforms Hadoop in terms of performance capabilities, due to its ability to use in-memory computations. Thus, the problem of large-scale data overflowing the disk in Hadoop doesn't exist in Spark. This gives it the privilege of supporting machine learning applications as well as iterative algorithms. Spark can run locally or in the cloud, and it supports languages such as Java and Python. It has been shown that Spark can be around a hundred times faster than Hadoop when data is in memory and around ten times faster when data is on disk [57]. It is worth mentioning here the most used cluster managers in the big data field. Those include Hadoop Yarn, Apache Mesos and Google Kubernetes. Hadoop Yarn [67] works by having a resource manager in the master node and node managers in the slave nodes; both elements collaborate to serve application tasks that run in containers. Mesos [26], on the other hand, has the ability to support multiple big data platforms: these platforms (such as Hadoop and Apache Spark mentioned earlier) reserve resources through their schedulers via Mesos. Lastly, Google Kubernetes [22] works at production level. It is composed of a master that controls the cluster and performs scheduling, and nodes which negotiate with the master to run containers of tasks. Resource management in big data has been explained in [28]. The authors simplified the process with a layered framework, as represented in figure 3.7. The setup layer is composed of three elements: the resource selection process, the deployment of a cluster manager and the deployment of a big data framework. These elements ensure the cluster set up. The resources, whether physical or virtual, are defined depending on the application. A cluster manager is selected (for instance, Google Kubernetes) and deployed to serve the tasks from the big data processing platforms. The second layer is the operation layer, which is composed of three elements.

Figure 3.7: Layers of resource management in a Big Data cluster

The first element is performance modeling, where all the service level agreement requirements are set. The second is resource allocation, in which tasks can reserve resources to accomplish their jobs; resource allocation can be static, set by the user, or dynamic. The third element of the operation layer is crucial for resource management in big data, namely job scheduling. Job scheduling refers to ordering the jobs in a specific manner to be served later by the resources. Many scheduling methodologies exist in the literature; some popular algorithms have been explained in section 3.2.1. The last layer in the resource management stack is the maintenance layer, where the cluster can be monitored and failures can be detected. It also supports scaling the cluster up or down depending on the demand. In this work, we aimed to include big data applications in the proposed platform. In the following chapter we will explain what we have done.
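To make the map/reduce style of processing described above concrete, here is a minimal PySpark sketch that counts words with a map step and a reduce step; in Hadoop MapReduce the same logic would be split into mapper and reducer classes. It assumes a local Spark installation and a hypothetical input file input.txt.

from pyspark.sql import SparkSession

# Local mode: master and worker run on the same machine.
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("input.txt")               # hypothetical input file
counts = (
    lines.flatMap(lambda line: line.split())   # "map" side: emit words
         .map(lambda word: (word, 1))          # key-value pairs
         .reduceByKey(lambda a, b: a + b)      # "reduce" side: sum the counts
)
for word, count in counts.take(10):
    print(word, count)

spark.stop()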

CHAPTER 4

METHODOLOGY

4.1 Framework overview

The development procedure of the work has been divided into two main phases of implementation, as follows:

• Platform development In this phase, we developed the proposed web platform for scheduling strategies. We created its different elements and, to keep the work structured, we split this phase into three main steps. The first step was to develop the graphical user interface of the platform. In the second step, we prepared the simulation environment. This step was related to the last step, which was establishing the required infrastructure components. We also implemented the required connection and communication links between the different stages.

• Big Data integration In this phase, we tried to connect our developed platform to a big data framework, specifically Apache Spark, so that our simulation platform could serve not only regular applications but also big data applications. However, the complex nature of big data platforms, as well as the overall complicated procedure of resource management for big data applications, was a barrier to full integration with our platform. Nevertheless, we present what we have done with Apache Spark and the implementation process we followed.

This chapter is sectioned according to the previously mentioned phases. We will follow a top-down approach in explaining the detailed architecture of the work, starting by presenting the main building blocks of the crucial components of the platform, together with their interactions as well as their connection and

communication. This is provided in the conceptual model section. In the next sections, a detailed explanation of the functional behaviour and the implementation procedures of each building block will be provided. This is presented in the three layer sections. Finally, the Big Data work is provided in the last section of this chapter.

4.2 Platform development

4.2.1 Conceptual model

This section provides a top-down explanation of our work. We start by presenting the layered architecture of the work, followed by a detailed description.

4.2.1.1 Layered architecture

The development procedure of the simulation platform was partitioned into three layers. The first layer is the user interface layer, the second is the operational layer and the third is the infrastructure layer, as follows:

• User interface layer This is the layer where the platform interface was developed. The interface could be described as a web application. The user will interact with this application and will be able to access it through a web browser. Using this interface, the user will be able to set the elements required for running the simulations which will be handled by the operational layer. The services provided by this layer are:

(a) Store user’s parameters in a JSON format file.

(b) Send a client request to the operational layer once the user presses the "run button". In this request, the different simulation values (JSON) are attached.

(c) Connect to database to store simulation data and simulation results.

• Operational layer This is the layer that connects the user interface layer and the infrastructure layer. This layer is responsible for running the simulation scenarios. It performs simulations based on the user's parameters that were passed by the first layer. The operational layer accomplishes its task with the assistance of the infrastructure layer. The services provided are:

(a) Accept the request sent by the user interface layer.

(b) Perform simulations based on the user’s parameters.

(c) Send results to the user interface layer to be stored in the database.

• Infrastructure layer This is the bottom layer in the development methodology stack. In this layer the data center structure is set. This layer is composed of the physical and virtual machines required to run the simulations, as well as the docker node(s) used to simulate the behaviour of the data center. Thus, the infrastructure layer supports the operational layer. The services provided are:

(a) Provide resources that will handle the request sent by the operational layer to serve tasks. This includes both physical and virtual machines.

(b) Use containers and cluster managers.

Figure 4.1 provides an overview of the layered architecture.

Figure 4.1: Layered architecture of the platform

4.2.1.2 Layers communication

The development of the platform was done on a local host; however, deployment on a server is also possible. As can be seen in figure 4.2, the communication between the layers

is as follows:

• The user interface layer listens on port 8081, so the user can access it in a web browser through the URL localhost:8081. The user can interact with the web application to set the values of the simulation. The values will be stored in a JSON file. JSON stands for JavaScript Object Notation, a file format that transmits data objects in a readable structure. When the user presses the "run button" in the interface, the application sends an HTTP client request using Axios with the JSON file to the operational layer. Axios is a JavaScript library that performs HTTP requests in Node.js.

• The operational layer is composed of the operations program and the scheduling simulator. The operations program contains the main method and listens on port 5000, printing a confirmation message "worker is here" that ensures it is working properly. It accepts the HTTP client request and creates a variable to save the JSON file that holds the user's parameters. The operations program uses the scheduling simulator to perform the scenarios and then returns the results as well as a "done" message to the interface layer, where we have programmed a pool of services that use a database.

Figure 4.2: Interaction between layers

• The scheduling simulator will use the infrastructure layer, that is, the physical resources, to serve the tasks. When performing a simulation, the resources that are used are virtual

ones. Hence, VMs are installed in the physical machines. However, in our case, establishing the virtual machines is the simulator's job and we don't have to install them manually. As mentioned before, there are two types of scheduling, VM scheduling and task scheduling. VM scheduling is to select which physical host will serve each VM, while task scheduling is to select which VM will serve each task. In our platform, we focus on task scheduling.

4.2.1.3 Main building blocks

Figure 4.3 represents a tree graph for the elements that are the fundamental building blocks of each of the layers. The building blocks are the crucial components required for creating the proposed platform. We developed the user interface layer using different frameworks such as HTML, CSS and other technologies, as we will see in the next section. However, the main file in this layer is the Node.js file that launches the UI application, namely app.js. App.js is a JavaScript file that is responsible for the connection with the operational layer; it sends an HTTP request, combined with the JSON file that contains the user parameters, to the Python operations program, specifically to the service.py file. As can be seen in the tree graph, service.py is the main Python file in the operational layer. The purpose of this file is to save the simulation parameters sent by app.js and pass them to the learning algorithm, if needed. In this work, we are using Pittsburgh as the learning approach. This learning algorithm is used, with the help of fuzzy rule-based systems (FRBSs), to implement a scheduler that optimizes energy consumption. This scheduler corresponds to one of the scheduling strategies present in the platform, namely PowerAwareFuzzySeq. Thus, the Pittsburgh algorithm is not necessarily called in all simulation scenarios; in fact, it is mainly used with this scheduling strategy. The PowerAwareFuzzySeq scheduling algorithm is an algorithm specifically designed to work with the Workflowsim-DVFS simulator. This algorithm corresponds to having a scheduler that optimizes energy consumption. To build this scheduler, an FRBS is used. As mentioned in chapter 3, an FRBS has the privilege of being a multi-parameter system. To establish an FRBS, it needs inputs (antecedents) and a rule base to produce an output (consequent). The output, in this case, corresponds to the scheduling decision. The inputs considered are MIPS, power, utilization, task length and maximum power. However, the rule base is a key factor in creating

Figure 4.3: Tree graph of the building block components of the platform

a good FRBS, in our case a good scheduler. For that reason, the Pittsburgh.py file will run to find the best possible configuration of rules for the FRBS.
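The following sketch is only meant to illustrate the antecedent/rule-base/consequent idea behind such an FRBS scheduler; it is not the platform's actual fuzzy system. The membership functions, the two rules and the use of only two of the five inputs (normalized MIPS and host utilization) are made up for the example.

def tri(x, a, b, c):
    """Triangular membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Made-up fuzzy sets over inputs normalized to [0, 1].
low = lambda x: tri(x, -0.5, 0.0, 0.5)
high = lambda x: tri(x, 0.5, 1.0, 1.5)

def suitability(mips, utilization):
    """Tiny rule base producing a crisp 'suitability' score for a VM.
    Rule 1: IF mips is high AND utilization is low THEN suitability is high (1.0)
    Rule 2: IF mips is low  OR  utilization is high THEN suitability is low (0.0)
    Defuzzification here is a simple weighted average of the rule outputs."""
    r1 = min(high(mips), low(utilization))
    r2 = max(low(mips), high(utilization))
    total = r1 + r2
    return (r1 * 1.0 + r2 * 0.0) / total if total else 0.5

# Scheduling decision: pick the VM with the highest suitability (toy data).
vms = {"vm0": (0.9, 0.2), "vm1": (0.4, 0.8)}   # (normalized MIPS, utilization)
best = max(vms, key=lambda v: suitability(*vms[v]))
print(best)

In the real scheduler the rule base is not written by hand as above; it is exactly what the Pittsburgh learning algorithm searches for.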

Figure 4.4 represents the procedure followed when PowerAwareFuzzySeq is the selected scheduling strategy and the user wants a training simulation to be performed. In the case of selecting other scheduling strategies, Pittsburgh.py won't run and the simulation parameters will be passed directly to the simulator to perform the simulation scenario and get the results. However, the explanation in this chapter is based on the existence of this learning approach, as

Figure 4.4: PowerAwareFuzzySeq procedure

it gives the big picture of how the simulation process works. Wfs-learn.jar is the Java program of the Workflowsim-DVFS simulator. In the case of the PowerAwareFuzzySeq strategy, this jar file is called in Pittsburgh.py and it is responsible for evaluating the generated rules. The evaluation process is done in multiple iterations, improving the rules in each iteration until reaching the best possible rule set that optimizes energy consumption. The rules are encoded into a JSON format file and passed to wfs-learn.jar, which decodes them to be evaluated. However, in the case that the scheduling strategy is not PowerAwareFuzzySeq and no training simulation is needed, hence no learning algorithm, the simulator will be called directly to perform the simulation. Figure 4.5 represents this behaviour.

Figure 4.5: Checking the scheduling strategy and simulation type

The simulator has the capability to compute the average power needed, the consumed energy and the execution time based on the scheduling strategy selected. In this work we concentrate on the consumed energy, that is, the amount of energy needed to serve the tasks. Hence, once the simulator is called, it will provide this result. The simulation results are sent back to app.js in the user interface layer. App.js calls another file that is responsible for the interaction with the database, namely Model.js. Furthermore, Model.js contains a group of functions that depend on the Db.js file. Db.js defines the necessary parameters to perform the connection with the database. It contains MySQL queries that will store the simulation results. The database was constructed using MySQL by creating the workflow.sql file. The database is composed of two tables, the simulations table and the results table. The simulations table holds general information about the simulations as well as the final results obtained, while the results table contains the workflow results of the learning algorithm, if it is used. We will demonstrate the content of these two tables in chapter 5.

4.2.2 User interface layer

This layer provides the interface where the user will be able to set and submit the simulation parameters. This section is divided into two parts, the first part shows the UI design, while the second one provides the UI development.

4.2.2.1 UI design

We designed the user interface to be user friendly. The layout is simple, so that the user will be able to set the different simulation parameters in an easy way. We created the interface as a single page application. The main tab of this page corresponds to multiple windows. These windows are:

• Main window: As can be seen in figure 4.6, this is the main page the user will see when launching the app. This window shows a simple diagram that explains how the simulator works as well as a reference article.

Figure 4.6: Main window of the user interface

• Input window: Through this window, the user will be able to select between different input DAX files that contain the input data the simulator will operate on. The input files available in the interface are presented in table 4.1. The input files used are montage workflows, inspiral workflows and sipht workflows. As mentioned in chapter 3, montage workflows [38] are input images in FITS format related to astronomy and have been developed by NASA. Sipht workflows [30] have been developed by the SIPHT program, which is related to a bioinformatics project at Harvard University. Finally, inspiral workflows [30] are collected from the Laser Interferometer Gravitational Wave

Observatory (LIGO); this observatory deals with gravitational waves based on Einstein's theory. Each of the input files has multiple versions depending on the number of jobs. Figure 4.7 shows the Input window, where the input files are available in a drop down menu.

Table 4.1: Input files

DAX file: Number of jobs
Inspiral: 30, 50, 100, 1000
Montage: 25, 50, 100, 1000
Sipht: 30, 60, 100, 1000

Figure 4.7: Input window

• Schedule window: This is where to select the scheduling algorithm the simulator will use to schedule the tasks. There are different scheduling algorithms the platform supports, which are presented in table 4.2. These algorithms were explained in section 3.2.1. Figure 4.8 shows the Schedule window.

Table 4.2: Scheduling algorithms

Algorithms:
Minimum Minimum scheduling algorithm
Maximum Minimum scheduling algorithm
Round Robin scheduling algorithm
Power aware fuzzy seq scheduling algorithm
Static scheduling algorithm

Figure 4.8: Schedule window

• Data center window: The user will be able to display the different characteristics of the data center used as infrastructure. As shown in figure 4.9, the data center window has quite a range of characteristics that can be set. These characteristics are divided into two parts. The first one covers the physical machine characteristics. The second part allows the user to configure various sets of resources composed of multiple hosts. Tables 4.3 and 4.4 define the different parameters that appear in each part.

Figure 4.9: Data Center window

Table 4.3: Data center characteristics. First part

Characteristic: Definition
Machines: specifies the machine type, i.e. whether the physical machines used are identical in features or not, for instance homogeneous machines or variable machines.
Architecture and OS: the operating system features of the machines.
Virtualization technology: the technology supported to create virtual machines, for instance Xen and VMware.
Cost of performance: the cost when the machines operate; it includes multiple features such as total cost, cost per bandwidth and cost per storage.
Machine time zone: depends on the location of the machines.
Transfer rate: the maximum speed for sending information supported by the machines.

Table 4.4: Data center characteristics. Second part

Characteristic: Definition
Number of hosts: the number of hosts used in each set.
Processing element: the number of processors used.
Governors: the DVFS governors mentioned in section 3.5.4.
RAM: the size of the random access memory.
Storage: the capacity of the hard disk drive.
Bandwidth: the amount of data that can be transmitted in a specific time.

• Virtualization window: This is where to set up the elements regarding virtual machines. As explained previously, VMs need to be installed in the physical hosts to perform the simulations. However, in our case, establishing the virtual machines is the simulator's job and we don't have to install them manually. Table 4.5 presents the elements that appear in the Virtualization window (figure 4.10).

Figure 4.10: Virtualization window

Table 4.5: Virtualization parameters

Characteristic: Definition
Number of virtual machines: the number of VMs used in the scenario.
Size: the image size of the VMs (MB).
RAM: VM memory (MB).
MIPS: the number of "million instructions per second" supported.
Bandwidth: total bandwidth of the VM.
PesNumber: number of CPUs.
Virtualization technology: the technology used to create the virtual machines, for instance Xen and VMware.
Variable mips: specifies whether the MIPS used are variable or not.

• Simulation window: This window contains the "Run button" that, when pressed, will start the current simulation scenario. The parameters available in this window have a direct effect on the simulation. These parameters are presented in table 4.6.

Figure 4.11: Simulation window, basic view

Figure 4.12: Simulation window, training view

The simulation window has multiple views; these views appear interchangeably depending on other parameters. The basic view is shown in figure 4.11. This view appears when selecting any scheduling strategy apart from PowerAwareFuzzySeq; in this case, the user must set the simulation type to validation, as there is no possibility of performing a training simulation with classic scheduling strategies. On the other hand, when selecting PowerAwareFuzzySeq, the view depends on the simulation type. For instance, when selecting the training simulation, the view is the one in figure 4.12; it can be seen that it is similar to the basic view but with additional options such as the learning approach, the number of iterations and the number of simulations. Another view appears when selecting PowerAwareFuzzySeq with a validation simulation; in this case, the view is similar to the basic view but with one more option to upload a file of rule bases. Simulation scenarios for all these cases are available in chapter 5. There are other windows available in the interface: the Task list window shows the data of the simulations represented in a table, while the FAQ and About us windows provide information about the interface objectives and the UI developer respectively. We created the user interface to include a wide range of parameters; however, it is important to point out that some windows and parameters are only for demonstration purposes. For instance, both the Data center and Virtualization windows are only for displaying the characteristics of the physical and virtual machines used by the user; therefore, they don't provide control over the resources and the simulation will depend on the actual setup of the infrastructure. Nevertheless, these windows can be used to display the setup implemented.

Table 4.6: Simulation parameters

Simulation setup: Definition
Simulation type: two types are available, namely training and validation; validation is for testing purposes while training enables running a learning algorithm.
Simulation name: provides a name for the current simulation scenario.
Learning approach: specifies the learning algorithm that will be used when running the simulation.
Goal: specifies the purpose of the simulation, for instance to find out the energy consumed, the execution time, or multiple objectives.
Iterations: defines the number of iterations when launching the learning algorithm.
Number of simulations: defines the number of simulations to be performed.
Rules bases: allows a set of rule bases to be validated.

4.2.2.2 UI development

We developed the user interface using different technologies. In general, the integration between HTML, CSS and JavaScript is well known in web development. HTML, the HyperText Markup Language, creates the basic structure of documents to be displayed in web browsers. CSS refers to Cascading Style Sheets, a style sheet language used to format and present the layout of the plain text documents created with HTML. On the other hand, JavaScript is a programming language that is used to script web pages; it controls the behaviour of the different elements and objects, and almost all web browsers can execute it. For our work, we used the previously mentioned technologies as follows:

• We used Node.js, explained in section 3.6.1, as our run-time environment to execute JavaScript code. In addition, we used Express.js, which is a framework for Node.js, to create our web app, hence our user interface. We started by installing Node.js as well as NPM, the Node Package Manager that allows the installation of modules such as Express.js. After having Node running, the next step was to install Express, as it allows us to launch a web app at a specific URL. By that, we had our server side environment established.

• We used HTML combined with CSS to build the layouts of the different windows of the user interface. However, to integrate the HTML files with our Express application, we had to use a templating engine. A template engine enables the usage of static files in web apps; its purpose is to render and display HTML files in the browser. There are multiple template engines that work with Express, such as Pug, Mustache and EJS. The one we chose was EJS, Embedded JavaScript, which is the simplest of them to use.

Figure 4.13: Convert EJS into HTML

Figure 4.13 shows the process of converting static files into viewable content in the web browser using the EJS template engine. The file is initially created using HTML; however, it won't be served by our Express application unless it is converted by the templating engine. For that reason, an intermediate file with the ejs extension needs to be created. This file is an embedded JavaScript file that contains JavaScript variables as well as the elements defined in the initial file, but within ejs tags. The engine understands this file and converts its content to produce the final HTML file that is viewable in the browser.

In our case, we created an initial HTML file for each of the major windows displayed in the user interface. We designed the different windows using a CSS styling document with the aid of a CSS framework called Bootstrap. However, to view them in the web browser, we loaded the different HTML files into one EJS file, namely index.ejs, as shown in figure 4.14. This file is the intermediate file that will be converted by the EJS engine and it includes the behavioral elements that enable the interface to be interactive.

Figure 4.14: Different HTML files of the user interface imported by index.ejs

The templating engine is also called the view engine as it will enable the HTML files to be viewable in the browser. For that reason, a views folder is included in the main project directory with the index.ejs file inside.

Figure 4.15: Project directory folders

Figure 4.15 shows the folders that exist in the main project directory. The node_modules folder includes the Node dependencies. On the other hand, the public folder contains the different HTML and CSS files created. App.js is the main file in the user interface; it listens on

port 8081, and by executing the command node app.js the user interface will be running at http://localhost:8081/ in the web browser. Furthermore, model.js and Db.js are files needed to perform the connection with the database. Both package-lock and package are Node JSON files generated to hold information that enables the node package manager (npm) to identify the project and to handle the project's dependencies. Finally, the simulation.json file is where the simulation parameters will be stored to be sent later to the operational layer. To let the Express application know that we are using EJS as our view engine, we had to include the line app.set("view engine", "ejs") in our app.js file. This line enables the Express app to set EJS as the view engine. The following code snippet 4.1 from app.js shows the initialization of the express app as well as setting EJS as the view engine. The express app uses the public folder, and when a GET request is made to the homepage, the response is made with the rendered index.ejs to view the content in the browser.

var app = express(); // initialization of the express app
app.set("view engine", "ejs"); // set view engine
app.use(express.static('public'));
app.get("/", function(req, res) { // GET request for the homepage
    res.render("index"); // respond with the rendered index.ejs
});

Listing 4.1: Code snippet for application initialization and EJS connection

After we had the different windows of the user interface viewable in the browser, it was time to perform the functional implementation to enable the interface to save and submit the simulation parameters to the operational layer. To do that, the Axios module was used in our Express app to perform an HTTP POST request. Axios is a JavaScript library that works in Node.js to perform such requests; therefore, this module was added to the main file app.js. Apart from the Axios and Express modules, other modules were also added to the main file, such as BodyParser, which parses the JSON encoded data submitted by the HTTP POST request. Also, the Fs module was used as a file system module. The following code snippet 4.2 shows all the above mentioned modules defined in the app.js file.

var express = require("express");
var bodyParser = require('body-parser');
var fs = require('fs');
var axios = require('axios');

Listing 4.2: Code snippet for the list of modules required by the user interface

To perform the HTTP request, a "Run button" exists in the simulation window so that once the user has entered all the simulation parameters, he will be able to submit them by pressing this button. The "Run button" was created in the simulation.html file with the behaviour that, once clicked, a function named "sendConfig()" saves the configured parameters in one JSON object called "config". Next, to inform the server that we are sending a JSON file to another URL, that is, the URL of the operational layer, we had to use AJAX. AJAX stands for Asynchronous JavaScript And XML; it is a technique used to create asynchronous web applications. AJAX will access the web server from the interface page, hold the JSON object and convert it to a string named "data" without the need to refresh the page of the user interface. Code snippet 4.3 shows this part of the code in the simulation.html file.

function sendConfig() {
    var config = { // JSON object of simulation parameters
        "input_file": $('#dax').val(),
        "scheduler": $('#scheduler').val(),
        "simulation": {
            ......
        }
        .....
    };
    $.ajax({ // AJAX call to hold and send the JSON data
        method: "post",
        dataType: "json",
        url: 'config',
        data: 'data=' + JSON.stringify(config)
    });
}

Listing 4.3: Code snippet for the run button to submit simulation parameters

The main file in the operational layer is service.py, which listens on port 5000 at http://127.0.0.1:5000. Hence, Axios sends the HTTP POST request with the JSON data to that port. The data will be stored there by a specific service at the "/config" endpoint, as can be seen in code snippet 4.4.

// post the data to a specific service of the URL
axios.post('http://127.0.0.1:5000/config', data, {
    headers: {
        'Content-Type': 'application/json', // type of data is JSON
    }
});

Listing 4.4: Code snippet for sending the JSON data to the operational layer port

The operational layer will then use the data sent to run the simulation. The functional implementation of the operational layer is explained in detail in the next section. However, to finalize the user interface implementation, it is important to show how we made the database connection so that both the simulation workflow and the simulation results are stored in the database tables. As mentioned before, the main file app.js calls another file that

Figure 4.16: Class diagram representing the CreateResults() function

is responsible for the interaction with the database, namely Model.js. Furthermore, Model.js contains a group of functions that depend on the Db.js file. Db.js defines the necessary parameters

to perform the connection needed with the database. It contains MySQL queries that will store the simulation workflow and results. We also mentioned that we have two tables in the database, the simulations table and the results table. The simulations table holds general information about the simulations as well as the final results obtained, while the results table contains the workflow results of the learning algorithm if it is used. We created nine functions to interact with the database tables, these are: CreateSimulation(), ReadSimulation(), CreateValidation(), CreateResults(), ReadDetail(), CreateBestalliter(), CreateRules(), ReadAllSimulation() and EndSimulation(). Both the CreateSimulation() and ReadSimulation() functions are called immediately once the user hits the run button, while the rest of the functions are executed later, after the operational layer performs the scenario. The purpose of each of these functions is as follows:

• CreateSimulation(): stores values about the scenario in the simulations table in the database through the "INSERT" query. These are the values that were earlier inserted by the user through the user interface. It also sets the status of the simulation to "running" mode.

• ReadSimulation(): sets an Id number for each simulation scenario in the simulations table in the database.

• CreateBestalliter(): stores the obtained consumed energy value in the simulations table through the "UPDATE" query.

• ReadDetail(): reads details from the results table through the "SELECT" query.

• CreateValidation(): if the simulation type is validation, this function will be called to store the obtained value of the validation scenario.

• CreateRules(): stores the rules of the learning algorithm in the simulations table through the "UPDATE" query.

• CreateResults(): stores the values of the results table through the "INSERT" query.

• ReadAllSimulation(): reads all columns from the simulations table through the "SE- LECT" query.

• EndSimulation(): ends the simulation by changing the status of the simulation from "running" to "finished" through the "UPDATE" query.

Complete simulation scenarios including both the database tables and the previously mentioned functions are given in chapter 5. Let's take the function CreateResults() as an example in more detail. Figure 4.16 shows the class diagram which represents the attributes, operations, and the relationships among objects for this function. The connection between App.js, model.js and Db.js can also be seen in the figure. To explain how this function works, and for a better understanding of the class diagram, let's take a look at the following three code snippets.

app.post("/results",function(req,res){//post request var dao= new model.DaoWorkflow(); var data=req.body;//JSON data dao.createResults(data).then(r=>{// function called res.setHeader(’Content-Type’,’application/json’); res.send("{’status’:’ok’}"); }); });

Listing 4.5: Code snippet for CreateResults function called in app.js file

Code snippet 4.5 shows the CreateResults() function being called in the app.js file. The function is called inside a POST request in a service at the "/results" endpoint. This function is called after the operational layer runs the simulation; it is mainly used to store the values of the results table in the database. A new body object containing the parsed result data is created with the name "data". This "data" object is passed as an argument to the function. However, we created an object named "dao" to be able to use the function, as the function definition exists in the model.js file in the "DaoWorkflow" class, as represented in code snippet 4.6.

class DaoWorkflow {
    constructor() {}

    createResults(data) { // function definition
        var db = new model.Db();
        return new Promise(resolve => {
            db.createResultsIntoDB(data).then(response => {
                resolve(response);
            });
        });
    }
}

Listing 4.6: Code snippet for CreateResults function defined in model.js file

On the other hand, CreateResults() uses an auxiliary function named CreateResultsIntoDB() to execute MySQL queries. Again, we defined an object named "db" to be able to use this function. The definition of CreateResultsIntoDB() exists in the Db.js file, as shown in code snippet 4.7. The constructor of the "Db" class is responsible for the configuration to connect to the database.

var mysql = require('mysql'); // MySQL driver providing the connection pool

class Db {
    constructor() { // Establish connection pool
        this.config = mysql.createPool({
            host: "localhost",
            user: "root",
            password: "",
            database: "workflow",
        });
    }

    createResultsIntoDB(data) { // auxiliary function
        var conn = this.config;
        return new Promise(function(resolve, reject) {
            // insert query
            var sql = "INSERT INTO results (simulation_id,iter,results) VALUES('" +
                data.id + "','" + data.iter + "','" + data.value + "')";
            conn.query(sql, function(err, result) { // execute query
                resolve(result);
            });
        });
    }
}

Listing 4.7: Code snippet for CreateResultsIntoDB function defined in Db.js file

We established a connection pool object named "conn", while the query is built in the string "sql" and executed with conn.query(). The query shown in snippet 4.7 is an insert query for the values held by the "data" object into the columns of the results table. In the same way, all the other functions mentioned before operate and perform changes to the database tables.

4.2.3 Operational layer

The operational layer is responsible for running the simulations. It will handle the simulation parameters sent by the user interface layer to perform the scenario. It is composed of two main components:

• The operations program We developed this program to perform all the required operations of the operational layer. The function of this program is to call the scheduling simulator based on the simulation parameters. We mentioned this before in section 4.2.1.3, where we stated that the scheduling strategy PowerAwareFuzzySeq corresponds to an FRBS scheduler. This scheduler aims to optimize energy consumption and it uses a learning algorithm to produce the best possible set of rule bases. The operations program calls the scheduling simulator to evaluate the quality of the solutions generated by the learning method. We also mentioned that the learning method is not necessarily needed in all simulation scenarios; in fact, it is mainly used with the PowerAwareFuzzySeq strategy. The operations program works differently if the scheduling method is not PowerAwareFuzzySeq and no training simulation is needed, that is, no learning algorithm: in this case it calls the scheduling simulator directly to obtain the results. We developed the operations program in Python.

• The scheduling simulator The simulator is responsible for calculating the corresponding parameters of the simulation, such as the total execution time, the total power consumption, the average power and the energy consumption, and it evaluates the quality of the solutions generated by the learning method. The scheduling simulator is developed in Java.

The above mentioned components are further explained in the next sections.

4.2.3.1 Operations program implementation

To start, the operations program will run once the user submits the simulation parameters. The main file in this program is service.py. Service.py launches the program as a web application using the Flask framework. Flask is a well-known microframework written in Python that simplifies web development procedures. Microframework refers to the fact that it does

not require extra tools or libraries. Service.py listens on port 5000, where the user interface will post the simulation parameters; these parameters are stored in a specific file in the operations program directory. Figure 4.17 shows the flow chart of how service.py works. As can be seen in the chart, service.py imports Flask to launch an app instance, and the files that contain the different learning approaches are also imported. Then, a confirmation message is printed at the corresponding URL to ensure that the program is listening on the required port. Next, an HTTP request is used to save the simulation parameters in the project directory in a JSON file. The saved data are checked to identify the simulation type the user selected. If the simulation type is "validation", it is temporarily stored as a learning method to be checked by a switcher. The switch operator checks the potential values for the learning method; a simplified sketch of this dispatch logic is given after the list below. There are four values expected:

• Validation In fact this is not a learning approach; as we mentioned before in section 4.2.2.1, this option is presented in the simulation window as a simulation type. There are two simulation types, validation and training: validation is for testing purposes while training enables running a learning algorithm. Hence, the switcher checks this option first, so that if the simulation type is validation, it skips the learning algorithm and executes the validation.py file instead. Validation.py calls the simulator directly to calculate the value of the consumed energy and sends the results to the user interface to store them in the database. On the other hand, if the simulation type is training, then the following three learning methods are available.

• Pittsburgh If the learning method is the Pittsburgh approach, the execution of Pittsburgh.py will start. In this work, we have mainly implemented this approach, while KASIA and Michigan are present in the UI but proposed as future work to be added to the operational layer.

• KASIA KASIA.py will execute if the KASIA learning approach is the selected methodology. The file will only return the message "KASIA pending".

• Michigan Same as KASIA, however, michigan.py will run with a message "Michigan pending".
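The following Flask sketch illustrates the dispatch logic just described. It is a simplification, not the actual service.py: the endpoint name, port and confirmation message follow the text, but the JSON key names ("type", "learning") and the helper functions run_validation() and run_pittsburgh() are assumptions standing in for the real modules.

import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    return "worker is here"   # confirmation message mentioned in the text

def run_validation(params):   # placeholder for the validation.py logic
    pass

def run_pittsburgh(params):   # placeholder for the Pittsburgh.py logic
    pass

@app.route("/config", methods=["POST"])
def config():
    params = request.get_json(force=True)    # simulation parameters from the UI
    with open("simulation.json", "w") as f:  # keep a copy in the program directory
        json.dump(params, f)

    # Decide what to run based on simulation type / learning approach.
    sim = params.get("simulation", {})
    method = "validation" if sim.get("type") == "validation" else sim.get("learning")
    if method == "validation":
        run_validation(params)
    elif method == "pittsburgh":
        run_pittsburgh(params)
    else:
        print(f"{method} pending")            # KASIA / Michigan not implemented yet
    return {"status": "done"}

if __name__ == "__main__":
    app.run(port=5000)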

Figure 4.17: Flow chart for the main file, service.py, in the operations program

After the learning method execution finishes, a "done" message is sent back to the user interface layer to indicate that the simulation has finished. We will proceed with the explanation by taking Pittsburgh as the selected learning method. As we explained in section 4.2.1.3, this algorithm's goal is to find the best possible configuration of rules for an FRBS. The FRBS is used to build a scheduler that optimizes energy consumption. To build this scheduler, an FRBS is chosen as it has the privilege of being a multi-parameter system. To establish an FRBS, it needs inputs (antecedents) and a rule base to produce an output (consequent). The output, in this case, corresponds to the scheduling decision. However, the rule base is a key factor in creating a good FRBS, in our case a good scheduler. For that reason, the Pittsburgh.py file will run to find the best possible rule base for the FRBS. After the operations program has obtained the rule base that optimizes the energy consumption, the rules are sent to the simulator where they are used in the scheduler. The simulator is WorkflowSimDVFS as it works with such types of schedulers. The five variables that are considered as inputs (antecedents) are: MIPS of the VM, power consumed, length of the task, maximum power consumption and utilization of the host. To start explaining how we developed the proposed Pittsburgh approach for our work, let's recap the original procedure of the Pittsburgh approach. At the beginning, an initial population of rule bases is created; Pittsburgh considers the rule bases as solutions to the computational problem. Then, an evaluation system is used to evaluate the quality of the generated rules. By using genetic crossover and mutation processes, a new population of rule bases is created. The initial population is given the name "parents" while the second generation are the "children". The process is repeated over multiple iterations until reaching the best possible solution or until a stopping criterion is met. The pseudo code of Pittsburgh is available in algorithm 3.1. For our work, we have implemented a procedure composed of multiple steps. With the aid of charts 4.18 and 4.19 we represent these steps as follows (a simplified sketch of the resulting loop is given after the list):

• We started by initializing our population of FISs (fuzzy inference systems), hence our individuals. We set the different parameters of the process; this includes the number of FISs used, the elitism reject number, the number of iterations and all the required elements. A complete scenario for setting these elements is available in chapter 5.

• Then, we defined the FIS rule parameters that we will use in our FRBS. This includes the number of inputs (antecedents), outputs (consequent) and connection parameters.

• We followed that by defining the rule base of the FRBS and generating the KBs. Each FIS has a certain number of rules, and the rules for each FIS were generated in a random process.

• The next step was to upload the FIS file that contains the membership functions for our FRBS. The FIS file was initially generated in MATLAB, but then converted to a JSON file. This file is used in the simulator.

• After that we have defined the required variables to save the fitness for every individual.

• The iterations will start and all individuals will be evaluated through an external program. This program is simply our simulator, wfs-learn. The simulator will measure the power for the specific individual and save the result. The power value is always tested, so that if it is zero, it is replaced with an infinity value and will be discarded later. The power value will be calculated for each individual in all iterations.

• Next, the different power results were reordered to reject the worst values.

• The fitness for every individual as well as the best results obtained for every iteration will be saved. The different results are sent to the user interface to be stored in the database.

• Then, the genetic learning methods were applied, namely, crossover and mutation. This will generate the children, hence, the next population.

• The process will be repeated, depending on the number of iterations, until reaching the best possible rule set. This rule set is the one that optimizes energy consumption and will be used by the scheduler.
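The sketch below condenses the loop described in the steps above. It is only a toy illustration of the Pittsburgh idea, not the platform's Pittsburgh.py: the population sizes, rule encoding, genetic operators and, in particular, the command-line contract with wfs-learn.jar (passing a rules.json file and reading one energy value from standard output) are assumptions.

import json
import random
import subprocess

NUM_RULES, NUM_INPUTS, NUM_LABELS = 20, 5, 3   # toy sizes, not the real configuration

def random_rule_base():
    # Each rule: one fuzzy label index per antecedent plus one consequent label.
    return [[random.randrange(NUM_LABELS) for _ in range(NUM_INPUTS + 1)]
            for _ in range(NUM_RULES)]

def evaluate(rule_base):
    """Ask the external simulator for the energy of this rule base.
    The jar name follows the text; the argument/output format is assumed."""
    with open("rules.json", "w") as f:
        json.dump(rule_base, f)
    out = subprocess.run(["java", "-jar", "wfs-learn.jar", "rules.json"],
                         capture_output=True, text=True)
    energy = float(out.stdout.strip() or 0.0)
    return energy if energy > 0 else float("inf")   # zero results are discarded later

def crossover(a, b):
    cut = random.randrange(1, NUM_RULES)
    return a[:cut] + b[cut:]

def mutate(rb, rate=0.05):
    return [[random.randrange(NUM_LABELS) if random.random() < rate else g
             for g in rule] for rule in rb]

population = [random_rule_base() for _ in range(10)]
for iteration in range(30):
    scored = sorted((evaluate(rb), rb) for rb in population)   # best (lowest energy) first
    best_energy, best_rb = scored[0]
    parents = [rb for _, rb in scored[:5]]                     # elitism: keep the best half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

# best_rb is the rule base that would be handed to the scheduler.
print("best energy:", best_energy)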

Figure 4.18: Flow chart for Pittsburgh learning method in the operations program

Figure 4.19: Flow chart for reference A

4.2.3.2 Scheduling simulator modifications

The simulator we used throughout this work is Workflowsim-DVFS. An overview of the original Workflowsim simulator and the improvements done by researchers to achieve Workflowsim-DVFS is available in section 3.5. Both simulators were developed in Java. The DVFS version has a power awareness feature, meaning that it takes energy consumption into account. The work done in the original Workflowsim allowed the usage of classic scheduling algorithms and provided examples for testing. The modifications of Workflowsim to achieve Workflowsim-DVFS required adding and adapting multiple classes of the original Workflowsim Java program. Workflowsim-DVFS has an additional new package named "dvfs" which was created to develop the new functionalities of the simulator. The new "org.workflowsim.dvfs" package has two classes which inherit some of the features of the original classes of the "org.workflowsim" package, with some modifications to provide the power awareness feature. Furthermore, another package named "org.workflowsim.examples.dvfs" was created as an extension of the original "org.workflowsim.examples" package, in which the simulation example "WorkflowSimBasicExample1" was used as a basis to create the

"WorkflowDVFSBasicNoBrite" class for the new simulator. "WorkflowDVFSBasicNoBrite" contains the main method of WorkflowSimDVFS; it uses the new WorkflowDVFSDatacenter class and prints the final output of the simulation. The package extensions are shown in figure 4.20. Some of the original Workflowsim packages are shown in gray, while the red packages were added for Workflowsim-DVFS. On the other hand, for our work we added the classes shown in purple.

Figure 4.20: Packages extension

We added two classes to the "org.workflowsim.examples.dvfs" package, "SimulationConfig" and "SimulatorJSON". The SimulationConfig class is used to capture the simulation parameters sent by the interface layer, while "SimulatorJSON" contains the main method and is a copy of the "WorkflowDVFSBasicNoBrite" class with additional features. We imported "SimulationConfig" in the "SimulatorJSON" class and then we performed a check on the parameters that were passed, so the simulator behaves according to these parameters.

4.2.4 Infrastructure layer

The infrastructure layer operates as the resources layer of our platform. In the simplest terms, it is the physical hosts that serve the simulations. In addition, as we mentioned before, the simulations usually run on VMs and hence these VMs have to be installed in the physical machines. However, in our case, establishing the virtual machines is the simulator's job and we don't have to install them manually. Also, we mentioned that there are two types of scheduling, VM scheduling and task scheduling. VM scheduling is to select which physical host will serve each VM, while task scheduling is to select which VM will serve each task. In terms of our work, these two scheduling methods can be defined as follows:

• VM scheduling, in general, is a complex problem. In this work, we don't focus on this type of scheduling; instead, we used the built-in VM scheduling algorithm of our WorkflowSimDVFS simulator. WorkflowSimDVFS has a scheduler that allocates VMs to physical machines. This scheduler can be described as power aware, in order to reduce the energy consumption: when the scheduler receives a request for a VM, it estimates the energy consumption of each physical machine and then chooses the minimum consumer (a simplified illustration of this greedy choice is given after the list).

• Task scheduling, on the other hand, is the major scheduling type we use. In fact, in our platform this refers to the different scheduling strategies that are available, such as MinMin, MaxMin and even PowerAwareFuzzySeq that we built using FRBS.
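The sketch below illustrates the greedy minimum-energy placement idea described for VM scheduling. It is not the WorkflowSimDVFS implementation: the linear power model, the host attributes and the capacity check are assumptions made only for the example.

def estimated_power(host, extra_mips):
    """Hypothetical linear power model: idle power plus a term
    proportional to the utilization after placing the new VM."""
    utilization = min(1.0, (host["used_mips"] + extra_mips) / host["mips"])
    return host["idle_w"] + (host["max_w"] - host["idle_w"]) * utilization

def place_vm(hosts, vm_mips):
    """Pick the host whose estimated power consumption is lowest
    among those that still have enough capacity for the VM."""
    candidates = [h for h in hosts if h["mips"] - h["used_mips"] >= vm_mips]
    if not candidates:
        return None
    best = min(candidates, key=lambda h: estimated_power(h, vm_mips))
    best["used_mips"] += vm_mips
    return best["name"]

hosts = [
    {"name": "host0", "mips": 4000, "used_mips": 1000, "idle_w": 90, "max_w": 250},
    {"name": "host1", "mips": 4000, "used_mips": 3000, "idle_w": 90, "max_w": 250},
]
print(place_vm(hosts, vm_mips=800))   # expected: host0 (lower resulting power)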

4.2.4.1 Dockerization

The term dockerization comes from the word "docker", referring to the process of migrating an application to run in docker containers. In section 3.6.5, we defined docker and mentioned its benefits for developers, who use it to facilitate the application development process by dividing a large-scale application into small-scale applications deployed into containers and then providing control over these containers. In our work, we took advantage of docker containers to package the different applications explained in the user interface and operational layers. Hence, all the different components, developed using various programming languages and multiple technologies, could be run as a single application. We containerized each of the major applications in our

platform, and by using the docker compose tool we were able to run these containers as a single app. Docker compose is a method for running multi-container docker applications. With this method, it is possible to use a single file to configure the different services of the application. Then, with a single command, all the services start from that configuration. This is the case in our platform, where we have the user interface with the Node.js application service, the operational layer with the Python application service and, finally, the database service. To establish docker compose, here is the process we followed:

• We started by installing docker and all its dependencies in our machine.

• Next, rather than using the "docker build" and "docker run" commands to establish images and specify ports and volumes for our services, we created a single docker compose YAML file. This is preferable because we have multiple applications and services and it is simpler to define all of them in one file.

• We established docker files for our applications so that they could be used in the docker compose file. A docker file is a text document that contains the commands necessary to create an image for an application. The image is an instance of the application.

• After that, we defined the apps and services that make up our platform in docker-compose.yml to run them together in a single environment. Docker compose will build all the required images and run them.

• Then, by running the command "docker-compose up", the entire platform could start as a single application.

4.2.4.2 Cluster managers

Cluster managers are software systems that control a cluster of nodes. Cluster managers work together with cluster agents; these agents run on each node of the cluster to manage applications. Some of the most used cluster managers are Google Kubernetes and Docker Swarm. Google Kubernetes is a container management platform composed of a master that controls the cluster, and nodes which negotiate with the master to run containers of tasks. Docker Swarm, on the other hand, works in a similar way to Kubernetes, however it is a service-based manager. This idea is very useful in our work for performing research experiments: it could be done by initializing a set of docker containers running on multiple

physical machines, where the cluster manager will manage these containers. For instance, our dockerized application could be installed and run on a set of hosts, and the cluster manager would then manage the different simulations.

4.3 Big Data integration

After we completed the basic idea of the platform, that is, a web platform for scheduling strategies, we aimed to integrate big data applications to be served by the platform. Our goal was to create a platform that not only serves regular data applications, but also applications that work with big data. To achieve this goal, we tried to connect a big data processing framework, namely Apache Spark, to the platform. In section 3.7 we introduced big data applications as well as the most popular big data processing frameworks. However, the complex nature of big data platforms, as well as the overall complicated procedure of resource management for big data applications, was a barrier to full integration with our platform. Nevertheless, we present here what we have done with Apache Spark and the implementation process we followed. We also include the problems we faced during the integration process.

4.3.1 Integration method

To support big data through our platform, the steps we followed could be summarized in the following points:

• Setting up the required infrastructure layer for big data We already had the infrastructure layer developed for our platform, as explained in section 4.2.4. It is composed of the hardware and virtual resources and the dockerized application, with the ability to deploy a cluster manager to manage the resources. Thus, for integrating big data applications we needed an extra component to provide the required support for handling such types of applications. This component is the big data processing framework. We decided to work with Apache Spark due to its powerful features and performance capabilities.

• Updating the operational layer to support big data. The responsibility of the operational layer we developed in section 4.2.3 is to run the simulations: the selected scheduler schedules the tasks and passes them to the

simulator. In general, for big data, resource management and task scheduling are sophisticated procedures. Furthermore, the service-level agreement (SLA) is greatly affected when running big data applications in cloud environments. In our case, we have to be able to submit our Python operations program to Apache Spark, since it handles the operations of the operational layer. In this way, we would be able to serve big data applications in our platform. To achieve this purpose we performed the following steps:

– PySpark installation. Apache Spark is Java Virtual Machine based, so to access its functionality from Python we had to install PySpark, the Python API for Apache Spark. We performed the installation in a single-node environment, treating a single computer as both master and worker, with no cluster involved.

– PySpark program deployment. To execute PySpark programs, it is recommended to use Jupyter Notebook, a web application that enables the creation of computer code and rich text documents. As a result, we integrated PySpark with Jupyter Notebook to run our Python program. It is also worth mentioning that Apache Spark would allow us to use the Pittsburgh learning approach in the operations program because it provides a machine learning library called MLlib. A minimal PySpark sketch is shown after this list.

• Upgrading the user interface layer for big data. We aimed to upgrade the user interface layer described in section 4.2.2 to include a new window for big data that allows users to upload their big data applications. The updated operational layer would then run the PySpark program to perform their simulations.
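As a minimal illustration, the following PySpark sketch sets up a single-node (local) session, matching the single master/worker configuration described above; the application name and the toy computation are assumptions, not the platform's actual code.

from pyspark.sql import SparkSession

# One machine acts as both master and worker, hence the local master URL.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SchedulingPlatformBigData") \
    .getOrCreate()

# Toy computation standing in for a real big data workload.
data = spark.sparkContext.parallelize(range(1, 1001))
total = data.map(lambda x: x * 2).sum()
print("Sum of doubled values:", total)

spark.stop()

When launched from a Jupyter Notebook, the same code can be run interactively in a notebook cell.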

4.3.2 Integration issues

We proceeded with the integration procedure described above; however, because the Python operations program calls an external Java program, the simulator, PySpark was not able to handle the process (the kind of call involved is sketched below). In addition, we observed that, in terms of efficiency, the integrated platform has much lower performance than our first version executed in Docker containers. Nevertheless, in chapter 6 we set big data integration as one of the future goals and present some possible solutions to the issues we faced.
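The exact way the operations program launches the simulator is not shown in this chapter, so the JAR name and arguments below are assumptions; the sketch only illustrates the kind of external Java call that works from plain Python but could not be handled from within PySpark in our integration attempt.

import subprocess

# Launch the Java-based simulator as an external process and capture its output.
result = subprocess.run(
    ["java", "-jar", "WorkflowSimDVFS.jar", "--input", "Montage_25.xml"],
    capture_output=True,
    text=True,
)
print(result.stdout)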

CHAPTER 5

RESULTS AND DISCUSSION

This chapter shows our proposed platform successfully running by performing different simulation scenarios and obtaining the required results. We will start by presenting the configurations we set for our experiments; these include the infrastructure, the database, the knowledge acquisition algorithm and the port setup. Then, we will show three simulation scenarios that can be performed by the platform. After that, we will finish the chapter with a discussion of how this platform could be useful in comparing different scenarios.

5.1 Simulation scenarios

5.1.1 Configurations

In this section, we provide the configurations we set up to run the experiments. We will show the infrastructure used, the database construction, the knowledge acquisition algorithm parameters and the port configurations.

5.1.1.1 Infrastructure configuration

For the infrastructure, we start with the specifications of the hardware devices. To run our experiments, we used a single hardware machine, a personal computer (PC), with the specifications given in table 5.1. We used this device for all the experiments. As we mentioned in chapter 4, we run our platform as a single application using Docker, with the possibility of having multiple Docker nodes managed by a cluster manager. However, for our scenarios we used a single Docker node, which we refer to as worker 0. Nevertheless, if there were many Docker nodes, we added the ability to retrieve the name of the Docker node currently running. In our web interface, the user is able to set the configurations for display purposes through the Data center and Virtualization windows. For our experiments, we did not install any virtual machines; this job is left to the simulator.

Table 5.1: Machine specifications

Characteristics             Description
Type                        Dell PC
System type                 x64-based PC
Operating system            Linux Ubuntu 18.04
Processor                   Intel(R) Core(TM) i7-4600U
RAM                         8 GB installed
Virtual memory available    15 GB
Time zone                   Central European

5.1.1.2 Database configuration

The database is responsible for storing the simulation results. It saves the different values sent by the user interface layer. We established a database of two tables, the simulations table and the results table. The simulations table is composed of ten columns and holds general information about the simulations as well as the final results obtained, while the results table is composed of four columns and stores the per-iteration results of the learning algorithm, if one is used. The columns of these tables and their definitions are presented in tables 5.2 and 5.3. We constructed the database using phpMyAdmin, an administration tool for MySQL. We created several functions that execute MySQL queries to store the simulation values, as sketched below. Some values are stored as soon as a simulation starts, such as the Id of the current simulation and the input and scheduler the user selected, while other values are stored later, after the simulation is done. The status of the simulation is set to "running" at the beginning and, once all the values are stored, it changes to the "finished" state.
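Purely as an illustration of the kind of query a function such as CreateSimulation() could issue when a simulation starts, here is a small Python sketch. The connection parameters and schema name are assumptions (only the column names come from table 5.2), and the platform's own functions are not necessarily written in Python.

import mysql.connector

# Connect to the MySQL service on its default port (see section 5.1.1.4).
conn = mysql.connector.connect(
    host="localhost", port=3306,
    user="root", password="example",   # placeholder credentials
    database="simulations",            # assumed schema name
)
cur = conn.cursor()

# Store the initial values of a new simulation with status "running".
cur.execute(
    "INSERT INTO simulations (Type, Name, Status, Input, Scheduler, Worker) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ("validation", "Scenario1", "running", "Montage 25", "MinMin", "worker 0"),
)
conn.commit()
cur.close()
conn.close()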

5.1.1.3 Knowledge acquisition algorithm configuration

The knowledge acquisition algorithm is our learning algorithm. We used the Pittsburgh algorithm, with the help of fuzzy rule-based systems (FRBSs), to implement a scheduler that optimizes energy consumption. This scheduler corresponds to one of the scheduling strategies present in the platform, namely PowerAwareFuzzySeq. To build the scheduler, an FRBS is used. An FRBS requires inputs (antecedents) and a rule base to produce an output (consequent). The output, in this case, corresponds to the scheduling decision.

Table 5.2: Simulations table

Column name     Definition
Id              Defines the Id given to each simulation scenario.
Type            Specifies the type of simulation, training or validation.
Name            Defines the name given to the scenario.
Status          Shows the status of the simulation, running or finished.
BestAllIters    Shows the value obtained by running the scenario in terms of the consumed energy.
Input           Specifies the input file the user selected.
Scheduler       Specifies the scheduling method used.
Worker          Defines the docker worker selected.
Learning        Shows the learning algorithm used.
Rules           Defines the rules used by the learning algorithm.

Table 5.3: Results table

Column name     Definition
Id              Defines the Id given to the current results.
Simulation Id   Specifies the corresponding simulation Id.
Iteration       Specifies the current iteration number.
Results         Shows the results set for each iteration.

The inputs we considered are MIPS, power, utilization, task length and maximum power. The rule base is a key factor in creating a good FRBS and, in our case, a good scheduler. For this reason, the learning algorithm is used to obtain a suitable rule set. In this work, we use the Pittsburgh approach as the learning method. Table 5.4 presents the configuration we set for the learning method in our experiments. In addition, we have three membership functions for the antecedents, [low, average, high], while for the consequent we established a wider range of five membership functions, [very low, low, average, high, very high]. Below we show some samples of the best rules optimized by the Pittsburgh approach for the scheduler:

Table 5.4: Learning algorithm parameters

Parameter                                                 Value
Original population size (number of FISs)                 10
Permanence rate (crossed population)                      0.8
Mutation probability                                      0.1
Elitism reject number (number of elements not to stay)    permanence rate * original population size = 8
Number of elements to stay                                original population size - elitism reject number = 2
Maximum number of rules initially for each FIS            15
Minimum number of rules initially for each FIS            3
Number of inputs of the FRBS                              5
Number of outputs of the FRBS                             1
Number of operators                                       1
Number of weights                                         1

(a) If (MIPS is high) AND (power is average) AND (length is low) AND (utilization is low) AND (powermax is low) THEN (selection is very high)

(b) If (MIPS is average) AND (power is low) AND (length is high) AND (utilization is average) AND (powermax is low) THEN (selection is average)

(c) If (MIPS is low) AND (power is high) AND (length is low) AND (utilization is high) AND (powermax is high) THEN (selection is very low)

MIPS, power, powermax and utilization are elements related to the host, while length is related to the task. As an interpretation of the rules, the first rule states that if the MIPS of the host is high, with low utilization and low power consumption, then the host has a very high probability of being selected to process the tasks. The second rule contains mostly average values, which corresponds to an average selection. The third rule is almost the opposite of the first one, which corresponds to a very low selection probability. A minimal sketch of how the Pittsburgh search over such rule bases proceeds, using the parameters of table 5.4, is given below.
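The following is a minimal Python sketch of a Pittsburgh-style search over rule bases, using the parameter values of table 5.4 (population of 10, crossover rate 0.8, mutation probability 0.1, 2 elites, 3 to 15 rules per rule base). It is an illustration only: the fitness function is a stub, whereas in the platform a rule base is evaluated by running the simulator and measuring the consumed energy; the real implementation is not reproduced here.

import random

POP_SIZE, CROSS_RATE, MUT_PROB, ELITES = 10, 0.8, 0.1, 2
MIN_RULES, MAX_RULES = 3, 15
LABELS = ["low", "average", "high"]
OUT_LABELS = ["very low", "low", "average", "high", "very high"]
INPUTS = ["MIPS", "power", "length", "utilization", "powermax"]

def random_rule():
    # One rule: a linguistic label for each antecedent plus an output label.
    rule = {name: random.choice(LABELS) for name in INPUTS}
    rule["selection"] = random.choice(OUT_LABELS)
    return rule

def random_rule_base():
    return [random_rule() for _ in range(random.randint(MIN_RULES, MAX_RULES))]

def fitness(rule_base):
    # Stub: in the platform this runs the simulator with the rule base and
    # returns the consumed energy of the resulting schedule (lower is better).
    return random.uniform(30.0, 40.0)

def crossover(a, b):
    # Exchange rules beyond a random cut point, copying rules to keep parents intact.
    cut = random.randint(1, min(len(a), len(b)) - 1)
    return ([dict(r) for r in a[:cut] + b[cut:]],
            [dict(r) for r in b[:cut] + a[cut:]])

def mutate(rule_base):
    # With a small probability, change one antecedent label of a rule.
    for rule in rule_base:
        if random.random() < MUT_PROB:
            rule[random.choice(INPUTS)] = random.choice(LABELS)
    return rule_base

population = [random_rule_base() for _ in range(POP_SIZE)]
for iteration in range(5):                      # five iterations, as in scenario 2
    scored = sorted(population, key=fitness)    # lower energy first
    new_pop = scored[:ELITES]                   # elitism: keep the best individuals
    while len(new_pop) < POP_SIZE:
        p1, p2 = random.sample(scored[:5], 2)
        if random.random() < CROSS_RATE:
            c1, c2 = crossover(p1, p2)
        else:
            c1, c2 = [dict(r) for r in p1], [dict(r) for r in p2]
        new_pop += [mutate(c1), mutate(c2)]
    population = new_pop[:POP_SIZE]

best = min(population, key=fitness)
print("Best rule base has", len(best), "rules")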

5.1.1.4 Port setup

As the platform is composed of different layers and services, we had to set up different ports to make sure that each service works as required, as follows:

• The user interface service: this is our user interface layer; as mentioned before, we set up this service to listen on port 8081.

• The operational service: this is the operational layer, which listens on port 5000, the default port of a Flask application (a minimal sketch of such a service is shown after this list).

• The database service: the database listens on the default port of the MySQL service, TCP 3306.
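As an illustration of the operational service configuration, here is a minimal Flask sketch listening on the default port 5000. The endpoint name and the JSON fields are assumptions; the real operations program dispatches the request to the simulator (for example through Validation.py or Pittsburgh.py).

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/run", methods=["POST"])
def run_simulation():
    params = request.get_json()
    # Here the real service would launch the selected scheduler and simulation.
    return jsonify({"status": "done", "received": params})

if __name__ == "__main__":
    # Listen on all interfaces so other containers can reach the service.
    app.run(host="0.0.0.0", port=5000)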

After setting up all the required configurations, we are now ready to perform our scenarios.

5.1.2 Simulation scenario 1

The first scenario uses one of the classic scheduling strategies, MinMin, to schedule the tasks. The goal is to obtain the total energy consumed when running this algorithm with an input file composed of 25 tasks. The scenario is implemented as follows:

1. We start by selecting the required input file from the user interface in the input window. In this case, it is Montage 25.

Figure 5.1: Scenario 1, selecting the input from interface

2. Next, we select the scheduling strategy. All the scheduling strategies apart from PowerAwareFuzzySeq behave in a very similar way when they run in the platform. In this scenario we select the MinMin strategy.

Figure 5.2: Scenario 1, selecting the scheduling algorithm

3. The following step is to set the simulation parameters. We mentioned in section 4.2.2.1 that the simulation window has multiple views and that, depending on the scheduling strategy selected, the corresponding view will appear. In this case, the basic view appears and the user can run only a validation simulation. The validation type indicates that we only want the final result and no learning algorithm is needed. In addition, we set the name of the simulation to "Scenario1" in this window.

Figure 5.3: Scenario 1, selecting the simulation parameters

4. Once we hit the run button on the simulation window, two functions are executed to store initial values in the database, namely the CreateSimulation() and ReadSimulation() functions. By checking the database, we find that the simulations table is filled with data related to the scenario, such as the input, the scheduler and the type of simulation we selected. An Id is also given to the simulation. Furthermore, our Docker worker 0 value is stored, and no learning values are needed in this scenario. In addition, it can be seen that the status of the simulation is set to the running state.

Figure 5.4: Scenario 1, initial values stored in the database

5. At this point, the simulation data are sent to the operational layer using an HTTP POST request (a sketch of such a request is shown at the end of this scenario). The operational layer, through the operations program, uses the data of the scenario in the simulator to retrieve the value of the consumed energy by executing the Validation.py file.

6. Finally, once the operational layer finishes the process, it sends the result back to the interface layer with a "done" message. The interface layer uses the CreateValidation() function to store the result in the database, in the bestalliter column, while changing the status of the simulation to finished. The stored result is the total consumed energy.

Figure 5.5: Scenario 1, final values stored in the database

7. The simulations table can be viewed in the task list window of the user interface as a result of the ReadAllSimulation() function.

Figure 5.6: Scenario 1, final values displayed in task list window

In the case of scenario 1, the total energy needed is 31.0756 Wh. The same procedure runs when selecting other input files and classic scheduling strategies. We can see this in figure 5.7, where we have a combination of different scenarios similar to scenario 1; the consumed energy varies accordingly.
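For illustration, the following sketch shows the kind of HTTP POST request the interface layer could send to the operational service for this scenario. The endpoint and field names are assumptions matching the sketch in section 5.1.1.4, not the platform's exact API, and the call only succeeds while the operational service is running.

import requests

payload = {
    "input": "Montage 25",
    "scheduler": "MinMin",
    "type": "validation",
    "name": "Scenario1",
    "worker": "worker 0",
}
# Send the simulation parameters to the operational service on port 5000.
response = requests.post("http://localhost:5000/run", json=payload, timeout=600)
print(response.json())  # expected to contain the "done" message and the energy result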

Figure 5.7: Scenarios similar to scenario 1

5.1.3 Simulation scenario 2

In the second scenario, we show a different type of simulation, in which a learning method is run. For that, we select the PowerAwareFuzzySeq scheduling strategy with the learning method set to the Pittsburgh algorithm. The configuration of the learning algorithm is the one given in section 5.1.1.3. The goal is to find the total energy consumed by scheduling the tasks with this strategy. We use the same input file as in scenario 1. The scenario is implemented as follows:

1. We start by selecting the input file from the user interface in the input window, that is, Montage 25.

Figure 5.8: Scenario 2, selecting the input from interface

2. Next, we select the scheduling methodology. In this case it is PowerAwareFuzzySeq.

Figure 5.9: Scenario 2, selecting the scheduling algorithm

3. Then, as we would like to perform a learning scenario, we select the training simulation type. In this case, the simulation window shows the training view, which is similar to the basic view in scenario 1 but with more options available. The learning approach option enables us to select the learning algorithm. In this scenario, we select the Pittsburgh method with five iterations. As we have only one

Docker worker, we do not set any number of simulations. We give the name "Scenario 2" to this simulation.

Figure 5.10: Scenario 2, selecting the simulation parameters

4. Once we hit the run button on the simulation window, the functions CreateSimulation() and ReadSimulation() store initial values similar to those of scenario 1 in the simulations table, but with the learning method set to Pittsburgh and the simulation type set to training.

Figure 5.11: Scenario 2, initial values stored in the database

5. Meanwhile, the operational layer, through the operations program, executes the Pittsburgh.py file. The learning process takes five iterations and, in each iteration, sends the values of the consumed energy to the user interface layer. The interface layer uses the CreateResults() function to insert the data into the results table.

6. Once the operational layer finishes the whole process, it sends the best result found to the interface layer with a "done" message. The interface layer uses the CreateRules() and CreateBestalliter() functions to store the best set of rules and the optimized energy obtained. The EndSimulation() function is responsible for changing the status of the simulation to "finished".

Figure 5.12: Scenario 2, iteration values stored in the database

Figure 5.13: Scenario 2, final values stored in the database

7. The simulations table can be viewed in the task list window of the user interface as a result of the ReadAllSimulation() function. By clicking Pittsburgh in the training column of this window, the ReadDetail() function downloads the data of the results table.

Figure 5.14: Scenario 2, final values displayed in task list window

The consumed energy in this case is 31.0756 Wh, which is the same as in scenario 1 because the number of tasks is small. Nevertheless, this value varies depending on the input file, as can be seen in figure 5.15. In this work, we implemented the Pittsburgh approach as the learning method; Michigan and KASIA are proposed for future work.

Figure 5.15: Scenarios similar to scenario 2

5.1.4 Simulation scenario 3

In this third scenario we validate the result obtained in the second scenario. Using the set of rules found by running the PowerAwareFuzzySeq scheduling strategy with the Pittsburgh learning approach, we can perform the validation process. The scenario is implemented as follows:

1. The same Montage 25 file is used and selected from the input window in the interface, while PowerAwareFuzzySeq is set in the schedule window.

2. When selecting PowerAwareFuzzySeq with the validation simulation type, the simulation window displays the basic view, but with an additional option to upload a rule base. Thus, we upload the same rule base obtained from scenario 2.

Figure 5.16: Scenario 3, selecting simulation parameters

3. The same procedure follows as in scenario 2; however, this time the operational layer, through the operations program, directly calls the simulator, attaching the rule base to process the simulation. The final result is sent to the user interface layer, which uses the CreateValidation() function to store the final value in the database.

Figure 5.17: Scenario 3, final values stored in the database

It can be seen that we validated the result of scenario 2 by obtaining the same value of the consumed energy, that is, 31.0756 Wh. The validation process can be very useful if one wants to check different sets of rules obtained with a learning approach and perform comparisons.

5.2 Discussion

In this section, we show how our platform can be used for comparing and analysing the results obtained from multiple scenarios for research purposes. We do this by presenting a group of experiments and comparing the results obtained. For these experiments, we established the same configurations indicated in section 5.1.1. The first experiment uses the Inspiral 100 input file with all the schedulers existing in the platform. The results obtained are shown in table 5.5.

Table 5.5: Results for Inspiral 100 file

Input          Scheduler            Energy (Wh)
Inspiral 100   MinMin               974.6213
Inspiral 100   MaxMin               841.9384
Inspiral 100   Round Robin          839.8463
Inspiral 100   Static scheduling    11644.2984
Inspiral 100   PowerAwareFuzzySeq   839.8758

The final output indicates the consumed energy in Wh, that is, the amount of energy needed to execute the specific scenario. Figure 5.18 gives a better view, where it can be noticed that static scheduling consumes far more energy than the other strategies. Also, for this specific scenario, MinMin and MaxMin show intermediate values, while Round Robin and PowerAwareFuzzySeq have the minimum energy consumption. On the other hand, table 5.6 shows other scenarios using the input files Inspiral 50, Montage 100 and Sipht 100; the bar chart of this table is shown in figure 5.19. Static scheduling consumes the largest amount of energy in all cases. If we compare the results of Inspiral 100 in the first experiment with those of Inspiral 50, we can observe that the larger the input file, the more energy is consumed, because there are more tasks to be served in the file. Furthermore, when comparing input files with the same number of tasks, for instance Montage 100 and Sipht 100 against Inspiral 100, we can see that the type of file significantly affects the consumed energy: in this case, Montage tasks are lightweight compared to Inspiral ones, while Sipht tasks are very heavyweight.

[Bar chart omitted: consumed energy (Wh) for the Inspiral 100 workflow, one bar per scheduler (MinMin, MaxMin, Round Robin, Static scheduling, PowerAwareFuzzySeq).]

Figure 5.18: Bar chart for the results indicated in table 5.5

[Bar chart omitted: consumed energy (Wh) for the Inspiral 100, Inspiral 50, Montage 100 and Sipht 100 workflows, one bar per scheduler (MinMin, MaxMin, Round Robin, Static scheduling, PowerAwareFuzzySeq).]

Figure 5.19: Bar chart for the results indicated in table 5.6

Table 5.6: Results for different input files

Input          Scheduler            Energy (Wh)
Inspiral 50    MinMin               779.2802
Inspiral 50    MaxMin               779.2802
Inspiral 50    Round Robin          779.2802
Inspiral 50    Static scheduling    6510.0232
Inspiral 50    PowerAwareFuzzySeq   779.2802
Montage 100    MinMin               67.5611
Montage 100    MaxMin               67.2356
Montage 100    Round Robin          67.4871
Montage 100    Static scheduling    640.9878
Montage 100    PowerAwareFuzzySeq   67.4871
Sipht 100      MinMin               2479.6434
Sipht 100      MaxMin               2476.9803
Sipht 100      Round Robin          2478.7705
Sipht 100      Static scheduling    9616.736
Sipht 100      PowerAwareFuzzySeq   2478.7705

Hence, the platform can be very useful for this type of analysis, especially if a wider scope of experiments is performed on multiple Docker nodes. The scope could be even larger if the platform included other performance goals, such as time, cost or multi-objective optimization, as well as multiple learning approaches. A small example of this kind of analysis is shown below.
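As a small illustration of this kind of analysis, the following sketch computes the energy saving of each scheduler relative to static scheduling using the Inspiral 100 values of table 5.5; the script itself is ours, while the numbers are taken directly from the table.

# Inspiral 100 energies from table 5.5.
energy_wh = {
    "MinMin": 974.6213,
    "MaxMin": 841.9384,
    "Round Robin": 839.8463,
    "Static scheduling": 11644.2984,
    "PowerAwareFuzzySeq": 839.8758,
}
baseline = energy_wh["Static scheduling"]
for scheduler, energy in energy_wh.items():
    saving = 100.0 * (baseline - energy) / baseline
    print(f"{scheduler:20s} {energy:12.4f} Wh  ({saving:5.1f}% less than static scheduling)")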

CHAPTER 6

CONCLUSION AND FUTURE GOALS

6.1 Conclusions

This work introduced the development of a web platform that can be used to run scheduling simulations in cloud computing environments. The proposed platform aims to provide an environment for test scenarios in which it is possible to select a set of workloads and a set of scheduling strategies, with the goal of analysing the performance and behaviour of the various strategies. We started by introducing, in chapter 3, the various concepts related to our work, from cloud computing and scheduling methodologies to fuzzy logic and learning approaches. We also gave a general overview of the simulation environment we used, as well as the different programming languages and virtualization platforms we needed. Then, in chapter 4, we gave a detailed explanation of the procedure we followed for the development of the proposed platform. We explained the layered architecture we built to simplify the process, dividing the work into three layers: the user interface layer, the operational layer and the infrastructure layer. The user interface layer is where the user submits the simulation parameters; we designed the interface to be user friendly, and the user is able to select multiple parameters through the various options provided in the interface. The operational layer is responsible for running the simulation and is composed of the operations program and the simulator. It handles the parameters passed by the interface layer and, depending on these parameters, responds accordingly. Next, we introduced the infrastructure layer required by the platform and the specific infrastructure we built. We finished the chapter by presenting the method we followed to integrate big data applications into the platform. In chapter 5, we showed the configurations we established for our experiments to test the platform. We provided, in a step by step explanation, three simulation scenarios that can be

served by the platform. We finished the chapter with a discussion of how the platform can be beneficial in performing analyses and comparisons of the simulation results. To conclude, we developed the proposed web platform; nevertheless, in the next section we set some future goals that could be pursued to improve the platform further, and we also welcome contributions to this work by other researchers.

6.2 Future goals

There are several goals we set for future work; we propose them in the following points:

• Different level of implementation. The platform we developed is based on an approach in which we integrated different scheduling strategies so that the user can access multiple schedulers and select between them. However, there is another approach in which we could build a scheduler selector that automatically chooses the appropriate scheduler depending on the workload type. In this second approach, the platform would work as a selector in which the best strategy is chosen for the specific scenario, rather than testing all the existing strategies and then deciding which is the best.

• Scope extension. The scope of the existing platform could be extended further. The platform could include more options for inputs and scheduling strategies. In addition, the current platform supports only Pittsburgh as a learning approach; nevertheless, more learning methods could be added. We have already proposed the KASIA and Michigan learning algorithms and included them in the interface. Moreover, the platform could include not only energy as a performance goal but also time, cost or multi-objective optimization. Furthermore, experiments could be constructed that include a cluster of hosts. This would enable scenarios with more resources, whose results would vary considerably and be closer to real-world scenarios.

• Big data integration. We aimed to allow the platform to support not only regular applications but also big data ones. However, we faced some issues in the integration process: Apache Spark could not handle a Python application that directly calls an external Java program, and it provided lower performance in this integration. As a result, we propose using another big data processing framework, such as Hadoop or Flink. Moreover, to solve the external program problem, it is possible to build an intermediate auxiliary program.
