Master Thesis

Dynamic Thread Allocation for Distributed Jobs using Resource Tokens

Author: Ali Smesseim
Publication Date: August 25, 2019
Permanent Link: https://doi.org/10.3929/ethz-b-000362308
Rights / License: In Copyright - Non-Commercial Use Permitted

Advisors: Prof. Dr. G. Alonso (Department of Computer Science, ETH Zurich),
Dr. I. Psaroudakis (Oracle Labs Zurich), Dr. V. Trigonakis (Oracle Labs Zurich)

Contents

1 Introduction
2 Background
  2.1 Related work
    2.1.1 Parallel programming models
    2.1.2 Simple admission control
    2.1.3 Thread scheduling
    2.1.4 Cluster scheduling
  2.2 System overview
    2.2.1 PGX.D overview
    2.2.2 Callisto runtime system
    2.2.3 Relation to literature
3 Solution Design
  3.1 Dynamic thread allocation
  3.2 Scheduler API
  3.3 Policies for distributed jobs
    3.3.1 Outgoing message policy
    3.3.2 CPU time policy
    3.3.3 Policy combination
    3.3.4 Sliding window
  3.4 Operator assignment within job
  3.5 Admission control
4 Evaluation
  4.1 Methodology
    4.1.1 Workloads
  4.2 Parameter configuration
    4.2.1 Message handling prioritization
    4.2.2 Network policy configuration
    4.2.3 Combination function configuration
  4.3 Microbenchmarks
  4.4 Mixture of CPU and realistic jobs
  4.5 Mixture of workloads
  4.6 Varying number of machines
  4.7 Results summary
5 Conclusion
  5.1 Future work
Bibliography

Chapter 1 Introduction

Job scheduling in distributed systems is a widely studied subject [5, 10, 14, 22, 26, 32, 39]. This comes as no surprise, as distributed systems become ever more prominent. Public cloud computing services generated worldwide revenue of $182.4 billion in 2018, a figure expected to grow further in the coming years [16]. Job schedulers are responsible for allocating system resources to jobs. To maximize the efficient usage of their clusters and minimize their energy consumption, cloud providers therefore continuously try to improve their job schedulers.

Organizations nowadays have to deal with large amounts of data. With the rise of fields such as big data analytics and machine learning, the need for scalable data processing has grown beyond what single-machine runtimes can offer. Distributed data processing systems such as Spark [41] and Hadoop [35] have been introduced to address this scalability issue. Since it is hardly economical to dedicate one of these large and expensive systems to each user, they support job submission by multiple users, so there can be many jobs in the system at once. Each job has certain resource needs, and the scheduler is responsible for making sure that those needs are met. A common type of scheduler, used in the distributed systems named above, is the fair scheduler: it allocates resources in equal amounts to all jobs, whenever possible. As the name suggests, this is a fair way to allocate resources.
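As a minimal illustration (a sketch, not code from the thesis), the core step of such a fair scheduler can be written as an even split of the machine's worker threads across jobs, with any remainder going to the jobs listed first:

```python
def fair_allocation(total_threads, jobs):
    """Split threads as evenly as possible among jobs.

    Earlier jobs in the list receive one extra thread each
    until the remainder is exhausted.
    """
    base, extra = divmod(total_threads, len(jobs))
    return {job: base + (1 if i < extra else 0)
            for i, job in enumerate(jobs)}
```

For example, 16 threads split over two jobs yields 8 threads each, regardless of how efficiently either job would actually use them; this indifference to runtime behavior is exactly the limitation discussed next.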
However, to increase system utilization beyond what fair scheduling can offer, we need to know more about each job's behavior, such as its resource usage during runtime. A smarter scheduler could leverage this information to tailor the allocation to the job's current needs.

In this thesis, we explore the possibilities of resource-aware decentralized job scheduling in clusters. To understand what this entails, we present a distributed system model (figure 1.1). In this model, jobs are executed on all nodes in the cluster, and the scheduler only concerns itself with allocating threads to jobs.

[Figure 1.1: A high-level overview of the target distributed system. This example shows four machines, each with four CPU cores, and two running jobs. The figure annotates the four scheduling approaches: (1) admission control at the master node, (2) core allocation, (3) thread behavior within a job, and (4) distributed scheduling.]

Any other resource (e.g. network bandwidth, memory) is not allocated through the scheduler. Instead, jobs access these resources directly from the system. Resources are finite; if jobs attempt to use more of a resource than is available, they will have to spend time waiting for the resource to become sufficiently available again. We call this resource contention. To improve job performance and system utilization, resource contention should be avoided.

To mitigate resource contention, we distinguish four job scheduling approaches that could be applied in our model. These approaches are visualized in figure 1.1:

1. Admission control. Instead of performing naive actions, such as admitting jobs as long as their minimum requirements can be met, or admitting jobs in first-come-first-served order, one could admit jobs in a different order, or impose additional constraints before a job is admitted.
Admission control decisions could be taken either on one node or on multiple nodes.

2. Core allocation. Instead of giving a fair allocation to admitted jobs, one could reschedule threads such that resource contention is prevented. These decisions are taken by each machine independently.

3. Thread behavior within a job. A job can define multiple parallel tasks, and it is possible to alter which task gets prioritized. For example, if we know that one task heavily contributes to resource contention, then other tasks ought to be prioritized.

4. Distributed scheduling. A general concept referring to machine-local schedulers exchanging information with schedulers on other machines. This could help achieve further increases in performance.

In this thesis, we have explored the possibilities of increasing job throughput and mitigating resource contention through two of these approaches: core allocation and altering thread behavior within jobs. We present a decentralized, modular, and workload-agnostic scheduler that makes local decisions to allocate threads to distributed jobs. Decisions are based on resource tokens and their corresponding user-defined policies. A resource token of a certain type represents the usage of a particular system resource. The scheduler is agnostic to the nature of those tokens (i.e. what resource they represent); the application is responsible for informing the scheduler about the resource tokens of each running job.

Once the scheduler knows the resource usage of all jobs through tokens, it assigns an efficiency factor to each job. User-defined policies specify how this assignment is done. The efficiency factor is equal to the proportion of threads the job will receive. The scheduler so far is still generic. We define two token types, with corresponding policies, that are applicable to our generic system model. They are:

1.
Outgoing message tokens, which represent the number of messages sent by a job but not yet acknowledged by the destination node. An increasing number of outgoing message tokens is a sign of network contention; the outgoing message policy therefore lowers the thread allocation of jobs with an increasing number of outgoing message tokens.

2. CPU tokens, which represent the CPU usage of each job. The higher the number of CPU tokens, the more efficiently the job is perceived to perform, and the higher the thread allocation it should receive.

The scheduler also considers incoming message tokens, which represent the number of messages received by a job but not yet handled. An increasing number of these tokens suggests that the job receives more messages than it can handle, and message handling should therefore be prioritized.

To evaluate our solution in a real-life system, we implemented this scheduler for PGX.D [20], a distributed graph processing engine developed by Oracle Labs. On this system, we run a variety of both synthetic and realistic benchmarks. The average job latencies improve in all experiments. Individual job latencies decrease by up to 50% for highly CPU-time-efficient jobs, due to them receiving a higher thread allocation. Even jobs that receive a lowered thread allocation due to contention still benefit, with latencies decreasing by up to 30%.

In this thesis, we have also briefly touched upon admission control, and present the design of a scheduler that incorporates it.

Chapter 2 explores relevant work in the domain of both single-machine and distributed scheduling, and presents a high-level overview of the PGX.D system. Chapter 3 motivates and presents the design of the scheduler, including its API. Chapter 4 evaluates the performance of the solution. Chapter 5 summarizes and concludes the thesis work, and provides future work.
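To make the token-based allocation described above concrete, the following sketch shows how per-job tokens could be turned into efficiency factors and then into a proportional thread allocation. All names are illustrative, not the thesis's actual API, and the multiplicative combination of the two policies is an assumption for this sketch (the thesis studies the combination function separately in chapter 3):

```python
def network_policy(outgoing_tokens):
    # More unacknowledged outgoing messages -> more network
    # contention -> lower efficiency factor.
    return 1.0 / (1.0 + outgoing_tokens)

def cpu_policy(cpu_tokens, max_cpu_tokens):
    # Higher CPU usage relative to the busiest job -> the job is
    # perceived as more efficient -> higher factor.
    return cpu_tokens / max_cpu_tokens if max_cpu_tokens else 0.0

def allocate_threads(total_threads, jobs):
    """jobs: dict mapping job name -> (outgoing_tokens, cpu_tokens).

    Returns a dict mapping job name -> number of threads, with each
    job's share proportional to its combined efficiency factor.
    """
    max_cpu = max((cpu for _, cpu in jobs.values()), default=0)
    # Assumed combination: multiply the per-policy factors.
    factors = {name: network_policy(out) * cpu_policy(cpu, max_cpu)
               for name, (out, cpu) in jobs.items()}
    total = sum(factors.values()) or 1.0
    return {name: round(total_threads * f / total)
            for name, f in factors.items()}
```

With two equally efficient, uncongested jobs and 16 threads, each job receives 8 threads; a job with many unacknowledged messages sees its share shrink in favor of the uncongested one, which is the behavior the outgoing message policy aims for.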
Chapter 2 Background

To understand what kinds of scheduling solutions are applicable, it helps to gain an understanding both of job schedulers in the literature and of the system at hand. In this chapter, we perform a literature review and obtain a high-level overview of the PGX.D system.

2.1 Related work

Distributed job scheduling has been the subject of intensive research, and there are various angles from which one could approach the topic. We shall first look at literature concerning single-machine task and job scheduling, as the concepts are relevant even in the case of distributed systems.