Masaryk University Faculty of Informatics

Clusterable Task Scheduler

Bachelor’s Thesis

Ján Michalov

Brno, Fall 2019


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Ján Michalov

Advisor: RNDr. Adam Rambousek, Ph.D.


Acknowledgements

I would like to sincerely thank my advisor RNDr. Adam Rambousek, Ph.D. for his guidance, patience and precious advice. I am also grateful to my consultant Bc. Matej Lazar, who directed me and helped with the design across countless hours of meetings. I wish to thank my friends and family for their support during stressful days.

Abstract

The purpose of this thesis is to create a microservice that would schedule tasks happening on other devices. The tasks can have different tasks declared as dependencies, and the microservice must execute them in the correct order. Additionally, the microservice must be able to be deployed in a clustered environment, which means ensuring data consistency and preventing duplicate execution of a task. The chosen platform for the microservice is Java.

Keywords

microservice, dependency resolution, scheduling, Java, data consistency, cluster


Contents

Introduction

1 Theoretical background
  1.1 Scheduling
  1.2 Application clustering
  1.3 Microservice architecture

2 JBoss MSC
  2.1 Architecture
    2.1.1 Service
    2.1.2 ServiceController
    2.1.3 ServiceRegistrationImpl
    2.1.4 Dependency and Dependent
    2.1.5 ServiceRegistry
    2.1.6 ServiceTarget
    2.1.7 ServiceContainer
  2.2 Transitional period of a ServiceController
  2.3 Concurrency and synchronization
  2.4 Pros and cons
  2.5 Conclusion

3 Infinispan
  3.1 Client-server mode
    3.1.1 Network protocols
    3.1.2 Server
    3.1.3 Hot Rod Java client

4 Application platform
  4.1 JBoss EAP
  4.2 Thorntail
  4.3 Quarkus
  4.4 Decision

5 Design
  5.1 Requirements
    5.1.1 Must have
    5.1.2 Should have
    5.1.3 Could have
    5.1.4 Remote entity requirements
  5.2 Differences and similarities to JBoss MSC
    5.2.1 Naming changes
    5.2.2 What stayed
    5.2.3 What changed
  5.3 States, Transitions, Modes and Jobs
    5.3.1 Modes
    5.3.2 Jobs
    5.3.3 StageGroups
    5.3.4 States
  5.4 Modules

6 Implementation
  6.1 Context dependency injection
    6.1.1 Maven module problem
  6.2 Transactions
    6.2.1 Partial updates
    6.2.2 Prevention of duplicate execution
  6.3 Mapping
  6.4 REST
  6.5 Installation
    6.5.1 Prerequisites
    6.5.2 Setting up an Infinispan server
    6.5.3 Compilation and execution

7 Testing
  7.1 Local integration testing
  7.2 Clustered testing

8 Conclusion
  8.1 Future improvements

Bibliography

A Attached files

List of Figures

2.1 State-diagram of JBoss MSC. Source: [6]
5.1 State-machine diagram of a Task
5.2 The diagram of package dependencies in the scheduler

Introduction

This thesis was created as an effort by the company Red Hat to improve the scalability of a product called Project Newcastle. Nowadays, scalability is a common concern across products. There are two ways to scale a product: vertically, by adding resources in the form of memory and CPU cores, or horizontally, with clustering. However, some products suffer from a convoluted monolithic design, and this problem also affects Project Newcastle. One of the techniques that address this dilemma is the microservice architecture. It aims to dissect such monoliths into smaller parts, each with its own function, that communicate with each other. These parts are simple, therefore easier to maintain, and should be designed to scale in a cluster. One of the functions of Project Newcastle is to schedule tasks, which are executed remotely. Additionally, these tasks have defined dependencies and therefore have to be scheduled in the correct order. The goal of this thesis is to create an open-source remote scheduler with the ability to scale in a cluster, designed with microservice architecture in mind.

The thesis consists of seven chapters excluding the conclusion. The first chapter focuses on the theoretical aspects of the thesis and introduces the reader to the complexities of scheduling, clustering and microservice architecture. The following chapter contains an analysis of the JBoss MSC library. The library implements a scheduling solution with a different use-case but a flexible implementation, whose concepts are used in the design. Chapter three concentrates on Infinispan, a datastore solution developed by Red Hat whose intent is to enable clustering for a variety of applications. The next chapter briefly introduces the available Red Hat application platforms, their strong and weak aspects, and which of them is the most suitable for the scheduler. Chapter five presents the design of the application. The chapter defines and explains the requirements, points out the major distinctions from JBoss MSC and defines the states of a task and other essential models. Chapter six delves into the implementation part of the thesis. This chapter points out some flaws of the used libraries/frameworks, describes how data consistency is guaranteed and concludes with

a guide for compiling the scheduler from source and subsequently executing it. The last chapter before the conclusion is focused on testing. The testing includes local integration tests and clustered tests, which the chapter describes in detail.

1 Theoretical background

1.1 Scheduling

In a scheduling problem, there is a set of tasks and a set of constraints. These constraints state that executing a specific task could depend on other tasks being completed beforehand. These sets can be mapped into a directed graph, with the tasks as the nodes and the direct prerequisite constraints as the edges. If this graph has a cycle, then there exists a task which is transitively dependent on itself. Such a task would have to complete before it can start, which does not make sense. Hence, the graph cannot have cycles and can be expressed as a directed acyclic graph (DAG). To schedule a set of tasks, an ordering is needed. The order has to respect the dependency constraints. For a DAG, this ordering is called a topological sort. Every finite DAG has a topological sort; however, there can be more than one possible topological sort.[1] A topological sort can be found by iteratively marking a task that either has no dependencies or has all of its dependencies marked. The order of the marks yields a topological sort. This algorithm can produce a different order whenever it has more than one task available to mark. For parallel task scheduling, the algorithm mentioned above can be modified: instead of marking one task each iteration, it marks all tasks ready for marking. All tasks marked in one iteration are independent of each other and can execute concurrently (see the sketch below). A further modification could be to allow only a certain number of marks in one iteration, which simulates a situation where resources are limited.
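The marking algorithm can be sketched in plain Java. The following is a minimal illustration (not the scheduler's actual code); it assumes the dependencies map contains every task as a key, mapped to the set of its direct dependencies.

import java.util.*;

public class TopologicalRounds {

    /**
     * Groups tasks into rounds; tasks within one round are mutually
     * independent and can execute concurrently. Throws on a cycle.
     */
    public static List<List<String>> sort(Map<String, Set<String>> dependencies) {
        Set<String> done = new HashSet<>();
        Set<String> pending = new HashSet<>(dependencies.keySet());
        List<List<String>> rounds = new ArrayList<>();
        while (!pending.isEmpty()) {
            List<String> round = new ArrayList<>();
            for (String task : pending) {
                if (done.containsAll(dependencies.get(task))) {
                    round.add(task); // all dependencies are already marked
                }
            }
            if (round.isEmpty()) {
                throw new IllegalStateException("cycle detected");
            }
            done.addAll(round);
            pending.removeAll(round);
            rounds.add(round);
        }
        return rounds;
    }
}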

1.2 Application clustering

Application clustering typically refers to a method of grouping multiple computer servers into an entity that behaves as a single system. A server in a cluster is referred to as a node. Typically, each node runs the same copy of an application, usually deployed on an application server which provides clustering features. For

instance, Wildfly1 application servers can discover each other on a network and replicate the state of a deployed application [2]. Benefits of clustering include [3]:

1. Load Balancing (Scalability): Incoming requests are distributed across the cluster nodes. The main objective of load balancing is to prevent nodes from getting overloaded and possibly shutting down. Adding more nodes to a cluster increases the cluster’s overall computing capability.

2. Fail-over (High Availability): Clusters enable services to remain available for longer periods. A singular server is a single point of failure; it can fail unexpectedly due to unforeseen causes such as infrastructure issues, networking problems or software crashes. A cluster, on the other hand, is more resilient: if one node crashes, the other nodes can still handle incoming requests.

A direct method of developing a clusterable application is to keep no state. A stateless application does not retain data for later use. However, the state can be stored in a database instead: each stateless application connects to the database, where it keeps all of its information. Stateless applications are easily scalable.[4]

1.3 Microservice architecture

Microservice architecture is an architectural style motivated by service-oriented architecture (SOA) that appeared due to a need for flexible and conveniently scalable applications, as opposed to the monolithic style, which is challenging to use in distributed systems. Microservices handle the gradually increasing complexity of large systems by decomposing them into a set of independent services. These services are loosely coupled, and each should provide a specific functionality. Moreover, microservices can be developed independently, which simplifies creating features and maintaining code.[5]

1. https://wildfly.org/

2 JBoss MSC

JBoss MSC1 (JBoss Modular Service Container) is a “lightweight highly concurrent open-source dependency injection container”2 developed by Red Hat and written in Java. It is internally used by JBoss Enterprise Application Platform3 (JBoss EAP), which is the Red Hat supported version of WildFly4. JBoss EAP consists of hundreds of components that are dependent on each other and have to start in the correct order. JBoss MSC manages this complexity by providing the ability to define these relations, along with features such as concurrent execution, failure handling and retry mechanisms. This chapter is based on an analysis of the source code and documentation[6] of JBoss MSC. It addresses the internal architecture and design, along with the positives and negatives of the implementation with respect to the requirements of this thesis.

2.1 Architecture

This section analyzes specific classes and interfaces which serve a significant role in the library’s logic. The analysis is based on version 1.3.2.Final5 of JBoss MSC, which is less complicated. Version 1.4.0.Final6 and later deprecate most of the library and provide a new API that adds support for Services returning multiple values.

2.1.1 Service

Service is the central abstraction of the MSC. A Service is defined as something that can be started and stopped. To create a Service, a user has to create a class implementing the Service interface and define the start, stop and getValue methods. The getValue method serves as

1. https://github.com/jboss-msc/jboss-msc
2. https://jboss-msc.github.io/jboss-msc/manual/
3. https://www.redhat.com/en/technologies/jboss-middleware/application-platform/
4. https://wildfly.org/
5. https://github.com/jboss-msc/jboss-msc/tree/1.3/
6. https://github.com/jboss-msc/jboss-msc/tree/1.4/


Figure 2.1: State-diagram of JBoss MSC. Source: [6]

a way for a Service to provide a value, because, for a Service to start, it may need more information or a specific value that another Service produces. For example, an HTTP server Service may need a value from a Service providing a host and port. Therefore, Services create a scheduling problem defined by a directed acyclic graph, where each node is a Service and each edge signifies a dependency. A minimal Service implementation is sketched below.
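The following sketch implements the interface described above; the HttpServer class and the way it obtains its host and port are hypothetical.

import org.jboss.msc.service.Service;
import org.jboss.msc.service.StartContext;
import org.jboss.msc.service.StartException;
import org.jboss.msc.service.StopContext;

public class HttpServerService implements Service<HttpServer> {

    private volatile HttpServer server; // HttpServer is a hypothetical class

    @Override
    public void start(StartContext context) throws StartException {
        server = new HttpServer(); // in a real Service, host and port would
        server.start();            // come from the values of dependency Services
    }

    @Override
    public void stop(StopContext context) {
        server.stop();
        server = null;
    }

    @Override
    public HttpServer getValue() {
        return server; // the value this Service provides to its dependents
    }
}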

2.1.2 ServiceController

ServiceController is an interface used to manipulate the metadata required for scheduling operations of the Service. Its implementation (ServiceControllerImpl) is both the owner and the manipulator of the metadata. Each ServiceController has a unique identifier and an associated instance of Service. Essential metadata:


• ServiceName name: a unique identifier for the ServiceController that can be imagined as a string in a format similar to Java package names (e.g. “org.jboss.msc.identifier”)

• Enumerations State and Substate: represent the state of a ServiceController in the scheduling process. Substate is a fine-grained representation of State. For a visualization, see Figure 2.1.

• Dependency[] dependencies: an array of dependencies that the Controller uses for sending messages to its dependencies.

• ServiceRegistrationImpl registration: the registration used for retrieving dependents.

• Additional integers that keep track of information about dependencies/dependents, examples being runningDependents and stoppingDependencies.

The purpose of a ServiceController is to handle transitions between states, handle communication with dependents and dependencies, and provide features for removing Services from the ServiceContainer.

2.1.3 ServiceRegistrationImpl

ServiceRegistrationImpl is a class that encapsulates a ServiceController and implements further functionality, namely adding dependents at runtime and read/write locking support through the Lockable class. Notable properties:

• ServiceControllerImpl instance: the instance of ServiceControllerImpl that the registration owns.

• Set dependents: a set of dependents on this registration.

2.1.4 Dependency and Dependent

The Dependency and Dependent interfaces serve as a medium for communication between a ServiceController and its dependencies/dependents. Suppose a ServiceController needs to alert its dependents that it stopped. Firstly, the Controller gets a collection of Dependents (available through

a property of ServiceRegistrationImpl). Secondly, for each dependent, it invokes the dependencyDown() method. Because ServiceControllerImpl implements the Dependent interface, dependencyDown() is invoked on the ServiceController of the dependent. The method changes the metadata (in this particular case, increasing the stoppingDependencies integer by one), which may cause a transition between states. Moreover, ServiceRegistrationImpl implements the Dependency interface, but the implementation is mostly delegated to ServiceControllerImpl.

2.1.5 ServiceRegistry

ServiceRegistry is an interface that acts as a registry for Services and their Controllers. This interface exposes an API for retrieving installed ServiceControllers. To retrieve a specific ServiceController, a user needs to know the ServiceName that uniquely identifies it. There are two methods to retrieve a ServiceController; they differ in handling the case where the ServiceController is missing: one returns null, and the other raises a ServiceNotFoundException. The implementation of ServiceRegistry uses a ConcurrentHashMap as its data store, with the ServiceName as the identifying key and the ServiceRegistrationImpl as the value.

2.1.6 ServiceTarget

ServiceTarget is a target for Service installations. To install a Service, the user invokes the addService(ServiceName name) method, which returns a ServiceBuilder. The ServiceBuilder, using the builder pattern7, is used to define all the information that a ServiceController requires to function in the container. ServiceBuilder API:

• setInitialMode(Mode mode): ServiceController’s starting mode.

• addDependency(ServiceName dependency): declares a dependency of the created Service.

• setInstance(Service service): sets the instance of the Service to execute.

7. For additional information: https://www.journaldev.com/1425/builder-design-pattern-in-java


• install(): Invoked at the end to finalize the declaration of Service and its relations.

Installing the ServiceBuilder firstly causes the Target to create a new unique ServiceRegistrationImpl and ServiceControllerImpl. Secondly, it informs existing dependencies about the new dependent, initializes the initial Mode and State of the Controller and checks for circular dependencies. In the end, it adds the new ServiceRegistrationImpl to the registry. A sketch of a complete installation follows.
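Put together, an installation might look like the following sketch, which follows the builder API as described above (exact signatures differ between MSC versions; the ServiceNames and the HttpServerService are illustrative).

import org.jboss.msc.service.ServiceContainer;
import org.jboss.msc.service.ServiceController.Mode;
import org.jboss.msc.service.ServiceName;

public class InstallExample {
    public static void main(String[] args) {
        ServiceContainer container = ServiceContainer.Factory.create();
        ServiceName configName = ServiceName.of("org", "example", "config");
        ServiceName serverName = ServiceName.of("org", "example", "http");

        // the builder collects everything the ServiceController needs
        container.addService(serverName)
                 .setInstance(new HttpServerService())
                 .addDependency(configName) // start only after the config Service
                 .setInitialMode(Mode.ACTIVE)
                 .install();
    }
}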

2.1.7 ServiceContainer

ServiceContainer extends ServiceTarget and ServiceRegistry. Therefore, the implementation of ServiceContainer operates both as a registry and as a target for Service installations. Moreover, the interface provides a method for shutting down the ServiceContainer. To shut down, the ServiceContainer changes the Mode of each registered ServiceController to Mode.REMOVE. The shutdown can also be initiated by the JVM process running the container receiving a SIGINT signal; to achieve this, ServiceContainer registers a shutdown hook with the Java Runtime8, which executes on JVM termination. Furthermore, ServiceContainer provides an ExecutorService9 that ServiceControllers need during transition periods.

2.2 Transitional period of a ServiceController

A ServiceController’s decision to transition is based solely on its own metadata. Hence, the metadata has to be modified before a transition can start. Transitions between States can be initiated in two ways:

1. A different ServiceController/Registration invokes a method through the Dependent interface to inform the invoked Controller that something occurred. Based on this information, the Controller adjusts the metadata and checks whether the change causes a transition.

8. https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#addShutdownHook-java.lang.Thread-
9. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html


2. Changing the Mode through the setMode() method.

All possible transitions are pre-defined and reflected in an enumeration Transition. Therefore, an invalid transition cannot arise. If a transition does occur, the ServiceController enters a transition period. During this period, no other transition can occur until the period completes. Each transition has a pre-defined list of ControllerTasks which are run after a Transition has been chosen. A ControllerTask is a class that acts as a unit of execution. ControllerTasks implement the Runnable10 interface and are executed asynchronously in the ExecutorService provided by the ServiceContainer. Notably, ControllerTasks have direct access to the data of the ServiceControllerImpl they are instantiated in, due to being nested inner classes11 of ServiceControllerImpl. Therefore, each ControllerTask is bound to some ServiceControllerImpl. There are numerous types of ControllerTasks:

• StartTask: executes a Service through invocation of Service.start().

• StopTask: stops the execution of a Service through invocation of Service.stop().

• DependentsControllerTask: an abstract class which implements a way to send a message to all dependents through the Dependent interface.

• DependenciesControllerTask: an abstract class which implements a way to send a message to all dependencies through the Dependency interface.

• DependencyStartedTask: implements DependentsControllerTask and informs all dependents that their dependency has started through the Dependent.dependencyUp() method.

• ...

A transitional period concludes once all bound ControllerTasks have completed. In conclusion, the whole workflow can be divided into several parts:

10. https://docs.oracle.com/javase/8/docs/api/java/lang/Runnable.html
11. For additional information: https://docs.oracle.com/javase/tutorial/java/javaOO/nested.html


1. Invoke setMode() or a Dependency/Dependent interface method.
2. Alter the metadata.
3. Check that no bound ControllerTasks are running.
4. Get the current State.
5. Check for transitions from the current State according to the altered metadata.
6. If affirmative, get the ControllerTasks associated with the transition. If negative, return.
7. Set the new State.
8. Enter the transition period.
9. Get an ExecutorService from the ServiceContainer.
10. Execute the ControllerTasks.
11. Exit the transition period.

2.3 Concurrency and synchronization

JBoss MSC is a highly concurrent library, with asynchronous ControllerTasks being the leading cause. Hence we can assume that the data will be accessed from more than one thread.

For example, suppose two Services with a common dependent have started at the same time. Both enter a transition period to the UP state and want to inform the common dependent about it. Their ControllerTasks (DependencyStartedTasks) invoke dependencyUp() on the ServiceController of the common dependent. In the method, the stoppingDependencies attribute (value 2) is read by both at the same time, so it has the same value for both dependencies. Both reduce the value by one (to 1) and save it. The variable now holds an inconsistent value (1 instead of 0), because the dependencies did not have serialized access to it. If one of the dependencies waited until the other one saved, the inconsistency would not occur.

One of the resolutions is defining borders in the code that can be accessed by only one thread at a time. Java has an answer in the form of synchronized blocks. The synchronized block construct takes as a parameter any instantiated Java Object. When a thread enters a synchronized block, it acquires a lock based on the passed parameter. Until it exits the block, no other thread can enter a synchronized block with the same parameter. Therefore, for synchronization to take effect, both threads have to enter

the same parameter. In this case, the instance of ServiceControllerImpl is the solution. JBoss MSC uses synchronized blocks in every ControllerTask and in the implementations of the Dependency/Dependent methods. To avoid entering methods without synchronization, the library uses the Thread.holdsLock()12 method, which takes an Object instance as a parameter and returns true if the current thread holds the lock associated with that object.
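A condensed sketch of this locking pattern follows; the field and method names are illustrative, not the library's exact code.

public class ControllerSketch {

    private int stoppingDependencies;

    /** A dependency has started; the caller must hold this controller's monitor. */
    void dependencyUp() {
        assert Thread.holdsLock(this); // guard against unsynchronized entry
        stoppingDependencies--;        // safe: access is serialized per controller
        // ...check whether the altered metadata triggers a transition...
    }

    /** How a DependencyStartedTask would notify a dependent controller. */
    static void notifyUp(ControllerSketch dependent) {
        synchronized (dependent) {     // lock on the dependent's instance
            dependent.dependencyUp();
        }
    }
}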

2.4 Pros and cons

The purpose of the analysis of JBoss MSC is to investigate whether it is deployable in a cluster and whether, if modified, it could meet the conditions defined in the requirements of this thesis.

Positives

• Well-thought-out architecture with every component clearly defined.

• Highly flexible design that allows full modification of States, Transitions and ControllerTasks to accommodate every demand.

• Adding functionality is a matter of creating a new ControllerTask and adding it to the desired Transitions.

Negatives

• The use of synchronized blocks hinders the ability to run in a cluster, as they are bound to a single JVM, which clusters do not share.

• In-memory datastore with ConcurrentMap.

• A very complicated code base resulting from supporting multiple features, most of which are not necessary for this thesis (retry mechanism, value injections). Consequently, modifying the library would add unnecessary complexity.

• Services run locally, whereas the intention is to run them remotely.

• Missing feature for submitting multiple Services at once.

12. https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#holdsLock-java.lang.Object-

2.5 Conclusion

Directly using the codebase for a clustered solution is not desirable due to the in-memory datastore and JVM-bound locking. With all the functionality to support, modifying the codebase would prove complicated, because each modification would require significant effort to implement. Therefore, the decision is to write a new codebase from scratch. The codebase will be written with an architecture close to JBoss MSC, but adjusted for remote scheduling and native clustering.


3 Infinispan

Infinispan is an open-source in-memory, distributed NoSQL key/value datastore solution. Its primary focus is to be used in conjunction with clusters, where Infinispan offers high-speed access with replication and distribution of data. Infinispan is designed to work in a highly concurrent environment, while also providing functionality such as transactions, clustered locks, querying or distributed processing. Apart from that, Infinispan implements and integrates with a myriad of known frameworks like CDI, Hibernate, Apache Lucene, Apache Spark, Quarkus and Wildfly. The supported version of Infinispan is Red Hat Data Grid1 (RHDG). Infinispan provides two modes:

1. Embedded: In the embedded mode, Infinispan co-exists with the application in the same JVM. To enable clustering and horizontal scaling of the application, Infinispan automatically discovers neighbouring instances of the application. Therefore, creating a cluster is a matter of creating multiple nodes and verifying that they can interact.

2. Client-server: In the client-server mode, Infinispan is separated from the application and runs as a remote server. The application uses an Infinispan client to connect to the server using a network protocol.

This chapter focuses on the client-server mode of Infinispan, as it was the one chosen for the implementation; the reason is the limited support for embedded mode in the Quarkus application platform2. The information in this chapter is based on the Infinispan user guide[7].

3.1 Client-server mode

In client-server mode, applications access the data stored in a remote Infinispan server through a network protocol. The Infinispan server

1. https://www.redhat.com/en/technologies/jboss-middleware/data-grid
2. More in chapter Application platform

itself is capable of horizontal scaling in a cluster and can support hundreds of nodes. Therefore, the client-server mode achieves elasticity due to the independent scaling of server and client.

3.1.1 Network protocols

The network protocols used by Infinispan are language-neutral. Consequently, clients can be written in other programming languages like C++, C#, Perl, Javascript, Ruby and Python3. Infinispan supports three distinct network protocols for interacting with a server:

• Hot Rod protocol: a binary protocol developed directly by the Infinispan developers. It is the recommended protocol when running Java.

• REST endpoint: the Infinispan server exposes a RESTful HTTP interface to access caches. It is recommended for environments where the HTTP port is the only one allowed.

• Memcached protocol: a text-based protocol aimed at users who desire failover capabilities for their Memcached server, which is not natively clusterable.

Hot Rod is the protocol with the most significant support and functionality, with features such as topology awareness, partial transaction support, bulk operations, listeners and server queries.

3.1.2 Server

The Infinispan server is a standalone server that provides caching ability through a variety of network protocols. The server exposes additional services for management, persistence, logging, security and transactions. Moreover, the server can be extensively configured. The documentation for configuring the server can be found in the server guide

3. https://infinispan.org/hotrod-clients/


on the Infinispan website4, and all available configuration options are accessible here5.

3.1.3 Hot Rod Java client

This section describes the configuration of the Hot Rod client, how to access and manipulate a cache, data marshallers, transactions and their limitations, and querying through the Query DSL. Additionally, the Hot Rod protocol supports authentication, data encryption and version interoperability (different client and server versions).

Configuration

Before accessing a cache, the Hot Rod client requires configuration. The client can be configured either programmatically or through a configuration file. A user has to specify at least the host and port of the Infinispan server. Furthermore, the user can define marshallers, transaction modes and managers, balancing strategies, a maximum number of near cache entries and much more6. The configuration is then passed into a RemoteCacheManager7, which can retrieve RemoteCaches.
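A minimal programmatic configuration might look like the following sketch; the cache name "tasks" is illustrative.

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

public class HotRodClientExample {
    public static void main(String[] args) {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.addServer().host("localhost").port(11222); // Infinispan server address

        RemoteCacheManager manager = new RemoteCacheManager(builder.build());
        RemoteCache<String, String> cache = manager.getCache("tasks");
        cache.put("hello", "world"); // each invocation talks to the server
        manager.stop();
    }
}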

Remote cache

All caches in Infinispan are named. Therefore, to retrieve a cache, the user has to know its name. Remote caches implement the ConcurrentMap interface, which is the primary API for interacting with a Remote cache. However, each invocation causes communication with the server. Therefore, to improve performance, the client can be configured to use near caching. A near cache is a local cache that stores recently used data. If another client modifies the data, the local cache is invalidated, and the next access causes communication with the server.

4. https://infinispan.org/docs/9.4.x/server_guide/server_guide.html#infinispan_subsystem_configuration
5. https://docs.jboss.org/infinispan/9.4/configdocs/infinispan-config-9.4.html
6. https://docs.jboss.org/infinispan/9.4/apidocs/org/infinispan/client/hotrod/configuration/ConfigurationBuilder.html
7. https://docs.jboss.org/infinispan/9.4/apidocs/org/infinispan/client/hotrod/RemoteCacheManager.html


Versioned API

Remote cache provides an additional API similar to the primary one. The difference is that its methods return or take an extra parameter in the form of a version. For each key/value pair, the version is an integer that uniquely identifies each modification. The versioned API can prevent data inconsistencies: if a client saves a value but has specified an older version, the save will fail, and the client has to react appropriately.
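A sketch of such an optimistic update using the versioned API (the key and value types are illustrative):

import org.infinispan.client.hotrod.MetadataValue;
import org.infinispan.client.hotrod.RemoteCache;

public class VersionedUpdate {

    /** Saves newValue only if nobody changed the entry since it was read. */
    static boolean update(RemoteCache<String, String> cache, String key, String newValue) {
        MetadataValue<String> current = cache.getWithMetadata(key);
        boolean saved = cache.replaceWithVersion(key, newValue, current.getVersion());
        if (!saved) {
            // another client won the race: the caller should reload and retry
        }
        return saved;
    }
}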

Marshalling

Marshalling is the process of mapping Java Objects to a format that can be transferred over the wire. The Infinispan client natively supports several marshalling libraries, such as JBoss Marshalling8 or the Infinispan ProtoStream marshaller9, which is based on the Protobuf serialization format developed by Google10.

Transactions

Hot Rod clients are able to participate in Java Transaction API (JTA) transactions11. However, the support is limited:

• The isolation level has to be REPEATABLE_READ, which guarantees that data will not change for the duration of the transaction once it has been read for the first time.

• Locking must be PESSIMISTIC: locks are obtained on a write operation, while in the OPTIMISTIC setting the lock is acquired at the end of a transaction, where the transaction manager checks for data consistency.

• The transaction mode can be either NON_XA or NON_DURABLE_XA. NON_XA transactions are local and specific to one resource or database. On the other hand, NON_DURABLE_XA transactions are global and can span multiple resources.

8. https://jbossmarshalling.jboss.org
9. https://github.com/infinispan/protostream
10. https://developers.google.com/protocol-buffers
11. https://javaee.github.io/javaee-spec/javadocs/javax/transaction/TransactionManager.html


Querying

In client-server mode, Infinispan supports querying through its own query language called the Infinispan Query DSL. Example:

import org.infinispan.query.dsl.*;

// get the DSL query factory from the cache:
QueryFactory qf = org.infinispan.query.Search.getQueryFactory(cache);

// create a query for all the books with a title containing "engine":
Query query = qf.from(Book.class)
    .having("title").like("%engine%")
    .build();

// get the results:
List<Book> list = query.list();

To enable indexing of entities, the server has to know the structure of the data. Therefore, Infinispan encourages users to utilise the ProtoStream marshaller, which requires users to create entity schemas in the form of “.proto” files, with an option to specify indexed fields. These schemas are sent to the server on client initialization. Moreover, Infinispan provides an automatic way to generate the schemas through an annotation processor.
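A sketch of the annotation-based approach follows; the Book entity mirrors the query example above, and the exact annotation set is an assumption based on the ProtoStream annotation processor.

import org.infinispan.protostream.annotations.ProtoDoc;
import org.infinispan.protostream.annotations.ProtoField;

@ProtoDoc("@Indexed") // ask the server to index this entity
public class Book {

    @ProtoDoc("@Field") // mark the field as indexed
    @ProtoField(number = 1)
    String title;

    @ProtoField(number = 2, defaultValue = "0")
    int year;
}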


4 Application platform

An application server is a framework that provides an environment for creating enterprise web applications. Java application servers are based on the Java Enterprise Edition (Java EE) standard, with the latest version being Java EE 8. For an application server to be Java EE compliant, it has to provide implementations for numerous specifications, the most notable being:

• Java Persistence API (JPA): interacts with the database, provides object-relational mapping with annotations

• Java Transaction API (JTA): adds functionality for transactions that can span multiple resources

• Java API for RESTful Web Services (JAX-RS): provides the functionality to create web services corresponding to the RESTful architectural style

• Contexts and Dependency Injection (CDI): a framework enabling the code to follow the SOLID principles1

This chapter analyses the application servers provided by Red Hat one by one and ends with a decision on which one to utilise for the implementation.

4.1 JBoss EAP

JBoss Enterprise Application Platform is the Red Hat supported application server based on Wildfly that provides a modular, cloud-ready architecture and robust tools for management and administration. JBoss EAP is Java EE 8 certified and offers tools for high-availability clustering, distributed caching, messaging and transactions. Moreover, JBoss EAP provides a command-line interface for run-time modification of the configuration[8]. It is a mature and thoroughly tested product with a history spanning at least 17 years2.

1. https://itnext.io/solid-principles-explanation-and-examples-715b975dcad4
2. Archive - https://jbossas.jboss.org/downloads

4.2 Thorntail

Thorntail3 is an application server primarily tailored for the development of microservices. The core of Thorntail is also Wildfly, but Thorntail lets users choose which parts they need for their service instead of shipping the full unit. Therefore, the service is smaller in size and generates a smaller memory footprint. Furthermore, Thorntail implements the MicroProfile4 specification, which includes standards optimized for microservices such as health checks, metrics, reactive messaging, OpenApi5 generation, fault tolerance and more.

4.3 Quarkus

Quarkus6 is a new and innovative application platform similar to Thorntail in its basic concepts, which include targeting microservices and implementing most of the Java EE 8 and MicroProfile specifications. Quarkus is revolutionary in the sense that it enables native Java compilation, which is possible thanks to a new universal virtual machine called GraalVM7. GraalVM allows for immensely better boot times, even smaller memory footprints and overall better performance. However, GraalVM has a considerable number of limitations. Java libraries and frameworks that use reflection8 will not compile unless the specific classes that use reflection are registered ahead of compilation. Additionally, dynamic class loading is not supported, which is a routine approach in Java to achieve modularity. The full list is posted here9. To compensate, Quarkus offers a vast collection of supported libraries that suffice for most cases, as well as compilation for the regular JVM.

3. https://thorntail.io/
4. https://microprofile.io/
5. https://swagger.io/docs/specification/about/
6. https://quarkus.io/
7. https://www.graalvm.org/
8. https://www.oracle.com/technical-resources/articles/java/javareflection.html
9. https://github.com/oracle/graal/blob/master/substratevm/LIMITATIONS.md


Another innovation that Quarkus brings is live coding with a development mode, background compilation10 and overall more comfortable support for integration testing.

4.4 Decision

As one of the goals of microservice architecture is small and fast deployment, JBoss EAP is the least suitable, as it provides all the functionality of Java EE 8 in contrast to both Thorntail and Quarkus. Thorntail has the advantage of being much more mature, with more comprehensive documentation and fewer potential bugs. However, Thorntail ceased proactive development in favour of Quarkus.[9] Therefore, the decision is to use Quarkus.

10. https://quarkus.io/guides/getting-started#development-mode


5 Design

5.1 Requirements

The requirements were defined in cooperation with prospective future users and with the aid of the MoSCoW method.1

5.1.1 Must have

• Dependency resolution: For a task to start, it has to wait for dependencies. (implemented)

• Remote execution: The scheduler has to have the ability to invoke execution on a remote entity. (A remote entity could be, for example, a server where the task executes.) (implemented)

• Cycle detection: Prevent requests that would create a cycle in the dependency graph. (implemented)

• Ability to run in a cluster. (verified)

• Task cancellation: The user has to have an option to cancel a task. The scheduler needs to inform the remote entity if the task is running and prevent the execution of dependent tasks. (implemented)

• Public RESTful API: The scheduler has to provide a public API that users can use to submit, cancel and get information about tasks. (implemented)

• Data consistency: In a state of failure, data cannot be left in an inconsistent state and has to be rolled back. (verified)

5.1.2 Should have

• Grouping: User should have an option to schedule multiple tasks in one request. (implemented)

1. https://www.productplan.com/glossary/moscow-prioritization/


• Notification on state updates: Send a message when a task has transitioned to a new state. (not implemented)

• Prevention against duplicate execution: The scheduler has to invoke the remote entity exactly once per task. (implemented)

5.1.3 Could have

• Graceful shutdown: If the scheduler is shut down, it should stop accepting new tasks and wait for existing ones to complete. (not implemented)

5.1.4 Remote entity requirements

• REST endpoints for starting and cancelling the task.

• On the invocation to start a task:

– if the task starts successfully, return a positive HTTP status code (200)

– if the task fails to start, return a negative HTTP status code (400)

• Inform the scheduler of successful and failed executions through a callback given by the scheduler in the first request. (A minimal sketch of a conforming remote entity follows this list.)
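A remote entity satisfying these requirements could be as small as the following JAX-RS sketch; the paths, the callback parameter and the helper methods are illustrative, not the mock entity's actual code.

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@Path("/entity")
public class RemoteEntityEndpoint {

    @POST
    @Path("/start")
    public Response start(String callbackUrl) {
        boolean started = tryStart(callbackUrl); // begin the work asynchronously
        return started
                ? Response.ok().build()          // 200: the task started
                : Response.status(400).build();  // 400: the task failed to start
    }

    @POST
    @Path("/cancel")
    public Response cancel() {
        stopWork(); // terminate the running task
        return Response.ok().build();
    }

    private boolean tryStart(String callbackUrl) { return true; } // illustrative
    private void stopWork() { }                                   // illustrative
}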

5.2 Differences and similarities to JBoss MSC

The core of the implementation is inspired by the design from the analysis of JBoss MSC, considerably adjusted for working with transactions, remote entities and a clusterable cache.

5.2.1 Naming changes

The naming changes are introduced because the original names take on an unsuitable meaning when transitioning from MSC to the scheduler. In MSC, Services can be compared to processes, where the goal is to have all dependencies in a state of execution in order to use their resources. In contrast, the scheduler’s goal is to have all dependencies finished. This leads


to the decision to change the naming from Services to Tasks. As Task already had a meaning in MSC’s context, another rename was needed; the adequate candidate is Job.

Changes:

• Service -> Task

• Task -> Job

5.2.2 What stayed

• API for Target, Registry, Container, Dependent and Controller is mostly the same.

• Principles of States, Modes, Transitions and ControllerTasks.

• Identification with ServiceName

5.2.3 What changed

• TaskController data moved to a separate entity (Task).

• Data consistency is assured by transactions instead of synchronized blocks.

• The TaskRegistry carries data in an Infinispan RemoteCache instead of a ConcurrentMap.

• The Dependency interface is removed. MSC used this interface to place demands on dependencies to start, whereas this situation does not occur for the scheduler (users place the demand).

• Removed ServiceRegistrationImpl.

• Removed the Service interface, as that interface served for local execution. (moved to remote entities)

• The Target supports requests for multiple Tasks.

• StartTask renamed to InvokeStartJob, which invokes the starting endpoint of the remote entity.


• StopTask renamed to InvokeStopJob, which invokes the cancelling endpoint of the remote entity.

• Jobs are not asynchronous due to a limitation imposed by transactions.

• The Controller has two new methods, accept() and fail(), which signify a positive or negative response from a remote entity.

• States, Jobs, Modes and Transitions have been remodelled.


Figure 5.1: State-machine diagram of a Task

5.3 States, Transitions, Modes and Jobs

The States, Transitions and Jobs described below are taken from Figure 5.1.


5.3.1 Modes

• IDLE: Controller does not attempt to start a Task.

• ACTIVE: Controller is actively trying to start a Task.

• CANCEL: Signal to a Controller to cancel the Task and its dependents recursively.

5.3.2 Jobs

• InvokeStartJob: Invokes the start endpoint on a remote entity. According to the status code in the response, the Job calls accept() or fail().

• InvokeStopJob: Invokes the stop endpoint on a remote entity. According to the status code in the response, the Job calls accept() or fail().

• DependencyStoppedJob: Calls Dependent.dependencyStopped() on each dependent. The method sets the StopFlag to DEPENDENCY_FAILED. Additionally, calling dependencyStopped() will cause a transition to the STOPPED state, which will force the dependent Task to create another DependencyStoppedJob. Therefore, executing one DependencyStoppedJob will recursively stop every dependent.

• DependencyCancelledJob: Calls Dependent.dependencyCancelled() on each dependent. The method sets the StopFlag to CANCELLED. It has the same propagating mechanism as DependencyStoppedJob.

• DependencySuccessfulJob: Calls Dependent.dependencySucceeded() on each dependent. The method decreases the number of unfinishedDependencies by one. If the resulting number is 0, the dependent will attempt to start.

5.3.3 StageGroups

• IDLE: Task is unproductive and either waiting for dependencies or is in IDLE Mode.

• RUNNING: Task is remotely active.


• FINAL: Task is in a final state, and it cannot transition further.

5.3.4 States

• NEW: Initial state.

• WAITING: Task is waiting for a dependency to succeed.

• STARTING: Task is attempting to start on a remote entity. Transition into this state causes the invocation of the start endpoint on the remote entity.

• START FAILED: Start on the remote entity failed. The Task has to inform its dependents that it failed, therefore it runs DependencyStoppedJob.

• UP: Task is running on a remote entity.

• STOPPING: The user has requested the cancellation of a task. The Task is remotely active, therefore the scheduler has to inform the remote entity to terminate the Task. That means the scheduler has to execute InvokeStopJob.

• STOP FAILED: Stopping on the remote entity failed. Therefore the scheduler runs DependencyStoppedJob.

• FAILED: The Task failed to complete on the remote entity. The remote entity invoked the internal endpoint of the scheduler with a negative message. Therefore, the scheduler runs DependencyStoppedJob.

• SUCCESSFUL: The Task completed on the remote entity. The remote entity invoked the internal endpoint of the scheduler with a positive message. The Task needs to inform its dependents that it has succeeded.

• STOPPED: The Task was informed to stop; its direct dependency was either cancelled or failed. In case the dependency failed, it runs DependencyStoppedJob, and if it was cancelled, it runs DependencyCancelledJob. (The decision is based on the StopFlag.)


Figure 5.2: The diagram of package dependencies in the scheduler

5.4 Modules

Project hierarchy:

• common: Classes and enums shared across the whole project.

• core: The module that contains the whole business logic and datastore communication. It exposes an API in the form of interfaces for other modules to use.

• model: The contents of the module are the data models of entities that are saved in the datastore.


• dto: Holds Data transfer object2 entities, used by REST for responses and requests.

• facade: Connects rest and core modules. Additionally, it includes mappers from the data model to the dto model and vice versa.

• rest: Exposes RESTful endpoints for users.

2. https://www.javaguides.net/2018/08/data-transfer-object-design-pattern-in-java.html

6 Implementation

This chapter goes through the details of working with libraries and frameworks such as Quarkus, CDI, Infinispan, JTA, JAX-RS and Mapstruct. Furthermore, it mentions problems, obstacles and difficulties encountered during the implementation, with occasional setbacks due to lacking documentation. It concludes with a guide on how to start up the scheduler.

6.1 Context dependency injection

Contexts and Dependency Injection (CDI) is a standard from Java EE. Dependency injection is a programming technique that makes a class independent of its dependencies by delegating object creation to a container. The container makes sure the dependencies are created beforehand and injects them into the dependent class. The class therefore has its dependencies available, and additionally it becomes loosely coupled.

Quarkus provides its own implementation of the CDI standard by the name of ArC. The usual implementations of CDI, like Weld, Google Guice and Spring, compute the dependencies at run-time. However, Quarkus shifts the computation to compile-time, which introduces a couple of difficulties. Firstly, the CDI standard is not implemented to the fullest1. Secondly, the injection is quirky across Maven modules.
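A minimal sketch of the technique: TaskRegistry is one of the core module's interfaces, while the class and method names here are illustrative.

import javax.enterprise.context.ApplicationScoped;
import javax.inject.Inject;

@ApplicationScoped
public class TaskFacade {

    @Inject
    TaskRegistry registry; // ArC resolves and injects the implementation

    public Task find(String name) {
        return registry.getTask(name); // illustrative method on the interface
    }
}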

6.1.1 Maven module problem

The first version of the scheduler was designed with the rest, facade and endpoint packages in separate Maven modules, linked by dependency injection. As Quarkus needs every dependency defined during build-time, misconfiguration should have been ruled out. Nonetheless, after a successful build, the scheduler would get stuck on initialization without any logs displayed. Debugging the problem was therefore complicated and ultimately led to the decision to merge the modules into one.

1. Limitations: https://quarkus.io/guides/cdi-reference#limitations

6.2 Transactions

Transactions are vital to ensure data consistency across the cluster. Each REST endpoint invocation is handled by a separate thread, which means that even with a single node, there is a possibility of concurrent updates of the same data.

For example, suppose there are two running tasks with a common dependent. Both complete at the same time, therefore their associated remote entities send a notification. The notifications are accepted at the same time by a single node, and a distinct thread handles each request. The common dependent is loaded from the datastore and updated by each thread. This produces a data inconsistency, as each thread modifies data of the same origin and version. With transactions, one of the threads would get rolled back, because the datastore can detect these inconsistencies. However, the rolled-back notification about the completion of a Task is necessary for the common dependent ever to start. Therefore, there needs to be a mechanism that retries the operation. The retried operation would retrieve the updated data and successfully commit.

The most significant benefit of solving transaction conflicts on a thread level is that it does not matter where a thread originates from. The situation would be the same if the notifications were sent to different nodes: one of the transactions would get rolled back, and the thread would have to retry the operation.

6.2.1 Partial updates

The transaction boundaries have to be set in a way that avoids partial updates. For that reason, every request to the scheduler has to be atomic, which implies that either everything happens or nothing happens.

6.2.2 Prevention of duplicate execution

Duplicate execution could occur if a transaction that invoked a start on a remote entity failed and was consequently retried. As retrying repeats all operations, it would invoke the remote entity again. To prevent this incident from happening, the invocation has to be separated from the transaction and only take place if the transaction succeeds. The solution


is to implement a unique type of Job. This Job executes in a new thread and creates a separate transaction, but only if the transaction of the thread this Job originated from succeeds. Propagating the status of the transaction to another thread was made possible by applying the MicroProfile Context Propagation2 specification. Therefore, for an invocation to happen, the underlying transaction has to succeed.
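The mechanism can be illustrated with plain JTA synchronizations; the scheduler itself relies on MicroProfile Context Propagation, so this is only an equivalent sketch.

import java.util.concurrent.ExecutorService;
import javax.transaction.Status;
import javax.transaction.Synchronization;
import javax.transaction.TransactionManager;

public class AfterCommitInvoker {

    /** Runs the remote invocation only if the surrounding transaction commits. */
    static void invokeAfterCommit(TransactionManager tm, ExecutorService executor,
                                  Runnable invokeStartJob) throws Exception {
        tm.getTransaction().registerSynchronization(new Synchronization() {
            public void beforeCompletion() { }

            public void afterCompletion(int status) {
                if (status == Status.STATUS_COMMITTED) {
                    executor.submit(invokeStartJob); // new thread, new transaction
                }
            }
        });
    }
}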

6.3 Mapping

Mapping fields from one object to another can be done by hand or by using a mapping framework. Writing mappers by hand is tedious and error-prone. Additionally, if the schema of the object changes, the number of places that a programmer has to modify increases. It is typically more beneficial to leave the work to a library.

Most mapping frameworks, like Dozer3 and ModelMapper4, use reflection to inspect the fields of an object, which is problematic with Quarkus. However, libraries like Mapstruct5 use an annotation processor to generate injectable mappers with just standard Java.

Mapstruct defines a set of annotations by which the user specifies which field from one object maps to which field of the other. If a field has the same name and type on both sides, Mapstruct creates the mapping automatically. Additionally, Mapstruct automatically maps collections and nested classes, if a suitable mapper is available.
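A sketch of a Mapstruct mapper between the model and dto modules; the entities and the differing field name are illustrative.

import org.mapstruct.Mapper;
import org.mapstruct.Mapping;

@Mapper(componentModel = "cdi") // the generated mapper becomes an injectable bean
public interface TaskMapper {

    @Mapping(source = "name", target = "taskName") // names differ: map explicitly
    TaskDTO toDto(Task task);

    // fields with the same name and type are mapped automatically
    Task toModel(TaskDTO dto);
}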

6.4 REST

The scheduler exposes two major endpoints: an internal one, which serves as a mechanism for remote entities to inform the scheduler about Task completion or failure, and a public one for users of the scheduler to request scheduling, cancel Tasks and retrieve information about them. The endpoints are implemented with the JAX-RS standard.

Additionally, an OpenAPI document is generated for the whole public API. This is possible thanks to an extension called smallrye-open-

2. https://github.com/eclipse/microprofile-context-propagation/
3. https://github.com/DozerMapper/dozer
4. http://modelmapper.org/
5. https://mapstruct.org/

api6 that Quarkus provides. It enables the user to use annotations on their endpoints, from which the OpenAPI document is produced. Furthermore, Quarkus exposes a Swagger UI7 that enables users to visualize and use the endpoints directly through the browser.

Internal endpoint:

• POST “/rest/internal/{TaskID}/finish”: The endpoint requires a path parameter “TaskID” to identify which Task is in context, and a request body with a boolean signifying completion or failure.

Task endpoint:

• POST “/rest/tasks”: The purpose of the endpoint is to schedule Tasks. The body of the request contains an array of TaskDTOs. Additionally, the scheduler returns 400 Bad Request if the array creates a cycle.

• GET “/rest/tasks”: The endpoint returns all Tasks it has in the datastore. There is an optional body where the user can specify filters by StateGroup.

• GET “/rest/tasks/{TaskID}”: The endpoint returns information about the specific Task identified by the path parameter “TaskID”.

• PUT “/rest/tasks/{TaskID}”: The endpoint cancels the execution of the Task specified in the path parameter.

6.5 Installation

This section describes the necessary prerequisites, the process of setting up and configuring an Infinispan cache-store, and how to compile and subsequently run the scheduler.

6.5.1 Prerequisites

1. Apache Maven 3.5.3+ (to compile and build from the source code).

6. https://github.com/smallrye/smallrye-open-api
7. https://swagger.io/tools/swagger-ui/


2. OpenJDK 1.8 or OpenJDK 11 of any distribution (OpenJDK 11 preferred).

3. Infinispan Server version 9.4.16.Final, available for download8.

6.5.2 Setting up an Infinispan server

To start up the scheduler, an available datastore is required. Additionally, the datastore has to be pre-configured with a cache definition for Tasks. To set up the server, follow these steps:

1. Download the infinispan-server-9.4.16.Final.zip.

wget https://downloads.jboss.org/infinispan/9.4.16.Final/infinispan-server-9.4.16.Final.zip

2. Unzip the file.

unzip infinispan-server-9.4.16.Final.zip

3. Start the server.

infinispan-server-9.4.16.Final/bin/standalone.sh -c clustered.xml

(to make it available on LAN, supply the host/ip-address with an additional option -b="ip-address", otherwise it defaults to localhost)

4. Wait until the server is operational.

5. Open another terminal on the same path.

6. Use the configuration script available in the scheduler’s repository. The script configures the needed cache, avoids potential port conflicts with the Infinispan REST server and restarts it.

infinispan-server-9.4.16.Final/bin/ispn-cli.sh --file="path/to/scheduler/server-config.cli"

8. https://downloads.jboss.org/infinispan/9.4.16.Final/infinispan-server-9.4.16.Final.zip

6.5.3 Compilation and execution

During the Maven packaging phase, Quarkus creates a runner jar file which is executable with the java -jar command. To run the scheduler, follow these steps:

1. Open the scheduler repository directory in terminal.

2. Compile the scheduler without running the tests.

mvn clean install -DskipTests

3. Execute the scheduler.

java [options] -jar core/target/core-0.1-SNAPSHOT-runner.jar

Additional options:

(a) “-Dquarkus.infinispan-client.server-list=<host>:11222”: host/ip-address of the Infinispan Hot Rod server. The default value is “localhost:11222”.

(b) “-Dscheduler.baseUrl=<url>”: URL address of the scheduler, which is used for callbacks sent to remote entities. The default value is “http://localhost:8080/”.

(c) “-Dquarkus.http.port=<port>”: port where the application is deployed. The default value is “8080”.

7 Testing

Testing is a crucial part of software development. Tests ensure that the implementation behaves as expected and meets the requirements, and they prevent unexpected regressions from appearing. This chapter describes local integration testing and manual tests performed in a simulated cluster.

7.1 Local integration testing

Integration tests are the second level of software testing, after unit tests, which test individual components of the system. Integration testing is used to determine whether separately developed components work together as expected.

To define the tests, the JUnit1 5 testing framework is utilized. JUnit allows users to set the boundaries of a test and define common pre-conditions and post-conditions. Occasionally, tests tend to be complex and confusing; therefore, to achieve a fluent and more comprehensible syntax, the AssertJ library is used. Examples of AssertJ usage can be found here2.

Furthermore, Quarkus supports dependency injection in integration tests, which makes the testing very convenient. The tests made for the scheduler verify cases such as:

• simple datastore retrieval and store

• correct Task format

• dependencies triggering the execution of dependent tasks

• Infinispan queries

• cycles in a complex graph of Tasks

• cancellation propagation to all dependents

• execution of a sophisticated Task graph ending up with all Tasks in a successful state

To run the integration tests, these actions are needed:

1. Start a configured Infinispan Server as described in subsection 6.5.2.

2. Open a new terminal.

1. https://junit.org/junit5/
2. Examples: https://assertj.github.io/doc/

3. Change the directory to the repository with the scheduler.

4. Build the scheduler and run the tests.

mvn clean install

7.2 Clustered testing

The first step of clustered testing is to set up a simulated cluster of scheduler nodes. The nodes can be either on one machine or on multiple machines sharing a LAN. In the case of one machine, the tester needs to ensure that the schedulers are deployed on different ports.

Furthermore, a separate project with an example of a remote entity was created3. The entity has two endpoints: one notifies the scheduler back through a callback, and the other does not. Therefore, we are able to test the communication between the services and simulate the scheduling of Tasks. These steps were taken to simulate a cluster:

1. Start a configured Infinispan Server as described in subsection 6.5.2 and set the ip-address of the server to the machine’s local network address.

2. Start several instances of the scheduler (on a single machine or multiple machines) as described in subsection 6.5.3 and specify the baseUrl, the address of the Infinispan server and the HTTP port if required.

3. Choose a machine for a mocked remote entity.

4. Clone the repository with the entity.

git clone https://github.com/michalovjan/mock-remote-entity.git

5. Enter the directory.

cd mock-remote-entity

6. Compile the entity.

mvn clean install

3. https://github.com/michalovjan/mock-remote-entity.git


7. Run the entity. (The entity deploys on port 8090 by default.)

java -jar target/mockremoteentity-1.0.0-SNAPSHOT-runner.jar

Concurrent updates on a single node

Testing of concurrent updates on a single node is done with a request to schedule three Tasks, where one Task has two dependencies. The request must include the mocked remote entity. The invocations to start the Tasks and the consequent responses are performed at the same time. The responses create concurrent transactions, and one of them fails due to a write to the same place (the common dependent). The failed transaction is rolled back and retried to ensure data consistency.

Concurrent updates on distinct nodes

Testing of concurrent updates on different nodes is done with the same request mentioned above. The difference is that the re- mote entity does not inform the scheduler about the completion of a Task. The notifications are sent manually to different nodes at the same time, which can be done with "curl" command and two terminals. The behaviour is the same as on a single node.


8 Conclusion

The goal of this thesis was to create a general solution for remote task dependency resolution with the ability to run in a cluster. The solution has to prevent data inconsistencies in a concurrent environment, allow cancelling a remote task and prevent duplicate execution. Users have to be able to access the scheduler through a publicly available REST API. Additionally, the implementation was to be published as an open-source project.

To implement the solution, I conducted thorough and detailed research of existing and partial solutions in the form of analyses of JBoss MSC, Infinispan and application platforms, described in chapters two, three and four. Using the knowledge gained, I designed a solution and subsequently created an implementation. The implementation was tested and verified to conform to the requirements set by future users. Afterwards, the solution was published as an open-source project on GitHub1.

8.1 Future improvements

Upper limit of running concurrent Tasks

To further configure the scheduler, there could be a cluster-wide upper limit on running Tasks. To implement this feature, I would use a clustered counter provided by Infinispan, which provides an atomic API for use in a concurrent environment. The next task to execute could be chosen randomly, by creation date, or by a custom algorithm with a heuristic.

1. https://github.com/michalovjan/remote-scheduler


Graceful shutdown

A graceful shutdown was mentioned in the requirements as a could-have. The feature could be implemented by a cluster-wide flag that could be triggered by any of the nodes in the cluster. The trigger would block incoming requests and prevent further Tasks from executing.

Notification on state updates

It was mentioned as a should-have in the requirements. For each transition, a new Job would be executed, which would send the information to a message queue or web-sockets. It would be up to users to intercept the message and act accordingly. Additionally, the mentioned Job would have to have the same prevention against duplicate execution as InvokeStartJob has.

Bibliography

1. LEHMAN, E.; LEIGHTON, F. T.; MEYER, A. R. Mathematics for Computer Science. 2018. Available also from: https://courses.csail.mit.edu/6.042/spring18/mcs.pdf. Chapter 10.5 Directed Acyclic Graphs and Scheduling.

2. Wildfly 18 - High Availability guide [online] [visited on 2019-12-12]. Available from: https://docs.wildfly.org/18/High_Availability_Guide.html.

3. Application Clustering For Scalability and High Availability [online] [visited on 2019-12-12]. Available from: https://dzone.com/articles/application-clustering.

4. Stateless Over Stateful Applications [online] [visited on 2019-12-12]. Available from: https://medium.com/@rachna3singhal/stateless-over-stateful-applications-73cbe025f07.

5. DRAGONI, Nicola; GIALLORENZO, Saverio; LAFUENTE, Alberto Lluch; MAZZARA, Manuel; MONTESI, Fabrizio; MUSTAFIN, Ruslan; SAFINA, Larisa. Microservices: Yesterday, Today, and Tomorrow. In: Present and Ulterior Software Engineering. Ed. by MAZZARA, Manuel; MEYER, Bertrand. Cham: Springer International Publishing, 2017, pp. 195–216. ISBN 978-3-319-67425-4. Available from DOI: 10.1007/978-3-319-67425-4_12.

6. API Documentation of JBoss MSC [online] [visited on 2019-12-09]. Available from: https://jboss-msc.github.io/jboss-msc/apidocs/.

7. User and reference guide of Infinispan [online] [visited on 2019-12-10]. Available from: https://infinispan.org/docs/9.4.x/user_guide/user_guide.html.

8. Red Hat JBoss Enterprise Application Platform technology overview [online] [visited on 2019-12-11]. Available from: https://www.redhat.com/en/resources/resources-red-hat-jboss-enterprise-application-platform-technology-overview-html.

9. Thorntail Community Announcement on Quarkus [online] [visited on 2019-12-11]. Available from: https://thorntail.io/posts/thorntail-community-announcement-on-quarkus/.


A Attached files

scheduler.zip

• A zip file containing the maven-based source code of the scheduler and a README.md file with instructions for installation.

• The source code is open-source and also available on https://github.com/michalovjan/remote-scheduler.
