Research Collection

Master Thesis

Dynamic Allocation for Distributed Jobs using Resource Tokens

Author(s): Smesseim, Ali

Publication Date: 2019

Permanent Link: https://doi.org/10.3929/ethz-b-000362308

Rights / License: In Copyright - Non-Commercial Use Permitted


Dynamic Thread Allocation for Distributed Jobs using Resource Tokens

Master Thesis
Ali Smesseim
August 25, 2019

Advisors: Prof. Dr. G. Alonso (Department of Computer Science, ETH Zurich), Dr. I. Psaroudakis (Oracle Labs Zurich), Dr. V. Trigonakis (Oracle Labs Zurich)

Contents

1 Introduction
2 Background
  2.1 Related work
    2.1.1 Parallel programming models
    2.1.2 Simple admission control
    2.1.3 Thread scheduling
    2.1.4 Cluster scheduling
  2.2 System overview
    2.2.1 PGX.D overview
    2.2.2 Callisto runtime system
    2.2.3 Relation to literature
3 Solution Design
  3.1 Dynamic thread allocation
  3.2 Scheduler API
  3.3 Policies for distributed jobs
    3.3.1 Outgoing message policy
    3.3.2 CPU time policy
    3.3.3 Policy combination
    3.3.4 Sliding window
  3.4 Operator assignment within job
  3.5 Admission control
4 Evaluation
  4.1 Methodology
    4.1.1 Workloads
  4.2 Parameter configuration
    4.2.1 Message handling prioritization
    4.2.2 Network policy configuration
    4.2.3 Combination function configuration
  4.3 Microbenchmarks
  4.4 Mixture of CPU and realistic jobs
  4.5 Mixture of workloads
  4.6 Varying number of machines
  4.7 Results summary
5 Conclusion
  5.1 Future work

Bibliography

Chapter 1

Introduction

The topic of job scheduling in distributed systems is a widely studied subject [5, 10, 14, 22, 26, 32, 39]. This comes as no surprise, as distributed systems become ever more prominent. Public cloud computing services were responsible for a worldwide revenue of $182.4 billion in 2018, and this number is expected to increase further in the coming years [16]. Job schedulers are responsible for allocating system resources to jobs. Therefore, in order to maximize efficient usage and minimize energy consumption of their clusters, cloud providers continuously try to improve their job schedulers.

Nowadays, organizations have to deal with large amounts of data. With the rise and prominence of fields such as big data analytics and machine learning, the need for scalable data processing has grown beyond what single-machine runtimes can offer. Distributed data processing systems such as Spark [41] and Hadoop [35] have been introduced to deal with the issue of scalability. It is hardly economical to have one of these large and expensive systems for each user. These systems thus support job submission by multiple users, so there can be many jobs in the system. Each job has certain resource needs, and the scheduler is responsible for making sure that those needs are met. A common type of scheduler, which is used in the aforementioned distributed systems, is the fair scheduler. Fair schedulers allocate the resources in equal amounts to all jobs, if possible. As the name suggests, this is a fair way to allocate resources. However, to increase system utilization beyond what fair scheduling can offer, we need to know more about the jobs' behavior, such as resource usage during runtime. A smarter scheduler could leverage this information to tailor the allocation to the job's current needs.

In this thesis, we explore the possibilities of resource-aware decentralized job scheduling in clusters. To understand what this entails, we present a distributed system model (figure 1.1). In this model, jobs are executed on all nodes in the cluster. Also, the scheduler only concerns itself with allocating threads to jobs.


[Figure 1.1: A high level overview of the target distributed system. This example shows four machines, each machine having four CPU cores. The system has two running jobs. The numbered annotations mark the four scheduling approaches discussed below: (1) admission control on the master node, (2) core allocation on each slave node, (3) thread behavior within a job, and (4) distributed scheduling.]

Any other resource (e.g. network bandwidth, memory) is not allocated through the scheduler. Instead, jobs access these resources directly from the system. Resources are finite; if jobs attempt to use more of a resource than is available, they will have to spend some time waiting for the resource to become sufficiently available again. We call this resource contention. In order to improve job performance and system utilization, resource contention should be avoided. To mitigate resource contention, we distinguish four job scheduling approaches that could be applied to our model. These approaches are visualized in figure 1.1, and are:

1. Admission control. Instead of performing naive actions, such as admitting jobs as long as their minimum requirements can be met, or admitting jobs in first-come-first-served order, one could admit jobs in a different order, or impose additional constraints for a job to be admitted. Admission control decisions could be taken either on one node or on multiple nodes.

2. Core allocation. Instead of giving a fair allocation to admitted jobs, one could reschedule threads such that resource contentions are prevented. These decisions are taken by each machine independently.

3. Thread behavior within job. A job can define multiple parallel tasks. It is possible to alter which task gets prioritized. For example, if we know that one task heavily contributed to resource contention, then other tasks ought to be prioritized.

4. Distributed scheduling. A general concept referring to machine-local schedulers exchanging information with schedulers on other machines. This could help achieve further increases in performance.

In this thesis, we explore the possibilities of increasing job throughput and mitigating resource contention through two of the mentioned approaches: core allocation and altering thread behavior within jobs.

We present a decentralized, modular, and workload-agnostic scheduler that makes local decisions to schedule threads to distributed jobs. Decisions are based on resource tokens and their corresponding user-defined policies. A resource token of a certain type represents the usage of a particular system resource. The scheduler is agnostic to the nature of those tokens (i.e. what resource they represent). The application is responsible for informing the scheduler about the resource tokens of each running job.

Once the scheduler knows about the resource usage of all jobs through tokens, it tries to assign an efficiency factor to each job. User-defined policies specify how such an assignment is done. The efficiency factor is equal to the proportion of threads the job will receive.

The scheduler so far is still generic. We define two token types and policies that are applicable to our generic system model. They are:

1. Outgoing message tokens, which represent the number of messages sent by a job but not yet acknowledged by the destination node. An increasing number of outgoing message tokens is a sign of network contention. The outgoing message policy will therefore lower the thread allocation for jobs with an increasing number of outgoing message tokens.

2. CPU tokens, which represent the CPU usage of each job. The higher the number of CPU tokens, the more efficiently the job is perceived to perform, and the higher the thread allocation it should receive.

The scheduler also considers incoming message tokens, which represent the number of messages received by a job but not yet handled. An increasing number of those tokens suggests that the job receives more messages than it can handle, and therefore message handling should be prioritized.

In order to evaluate our solution in a real-life system, we implemented this scheduler for PGX.D [20], a distributed graph processing engine developed by Oracle Labs. On this system, we run a variety of both synthetic and realistic benchmarks. The average job latencies are improved in all experiments. Individual job latencies decrease by up to 50% for highly CPU-time-efficient jobs, due to them receiving a higher thread allocation. Even jobs that, due to contention, receive a lowered thread allocation still benefit, with a decrease in latency of up to 30%.


In this thesis, we also briefly touch upon admission control, and present the design for a scheduler that incorporates it.

Chapter 2 explores the relevant work in the domain of both single-machine and distributed scheduling, and presents a high-level overview of the PGX.D system. Chapter 3 motivates and presents the design of the scheduler, including its API. Chapter 4 evaluates the performance of the solution. Chapter 5 summarizes and concludes the thesis work, and outlines future work.

Chapter 2

Background

To understand what kinds of scheduling solutions are applicable, it helps to gain an understanding of both job schedulers in the literature and the system at hand. In this chapter, we perform a literature review and give a high-level overview of the PGX.D system.

2.1 Related work

Distributed job scheduling has been the subject of intensive research. There are various angles from which one could approach the topic. We shall first look at literature concerning single-machine task and job scheduling, as the concepts are relevant even in the case of distributed systems. We shall then look at how distributed systems, specifically clusters, approach job scheduling.

2.1.1 Parallel programming models

We will start with single-machine parallel programming models, of which there are many. OpenMP [12] is a prominent example of a model in this class. It exposes a set of compiler directives that make it possible to easily parallelize serial loops. An example of the use of OpenMP is shown in figure 2.1. We can specify how the iterations are scheduled to threads through the SCHED_TYPE parameter [9]. Setting it to static gives each thread a static and more or less equal assignment of iterations. This works well if the durations of the iterations are close to equal; otherwise, the workload is imbalanced across threads. Alternatively, SCHED_TYPE can be set to dynamic, in which case each thread executes a chunk of iterations of size CHUNK_SIZE, then requests another chunk until all chunks are distributed. The lower the value of CHUNK_SIZE, the smaller the work imbalance across threads, but the higher the synchronization overheads. Tuning this parameter based on workload and infrastructure is therefore required.


std::vector<int> vec(N, 0);

#pragma omp parallel for schedule(SCHED_TYPE, CHUNK_SIZE)
for (size_t i = 0; i < N; i++) {
    vec[i] = i;  // per-element work
}

Figure 2.1: An example of the ease of parallelization using OpenMP. In order to turn the serial for-loop into a parallel loop, only the pragma directive is added.

Alternative shared-memory task schedulers include Intel TBB [30] and Intel Cilk Plus [1].

Callisto [18] is a parallel runtime system that also supports the fine-grained scheduling of loops. This is the runtime system used by PGX.D. Callisto largely removes the need to tune the chunk size by lowering the synchronization overheads through per-core iteration counts and a novel synchronous request combining technique. Callisto is not strictly a task scheduler; it supports the execution of jobs, which in turn contain tasks (e.g. a parallel loop). Callisto schedules an equal number of threads to all running jobs. A job can then use its thread allocation to parallelize its tasks.

2.1.2 Simple admission control

Traditionally, jobs submitted to clusters are assumed to have a predefined and static requirement for the number of processors. This requirement is certainly necessary if the job is programmed to use a certain number of processors. However, with the rise and prominence of parallel programming models, such as the ones discussed previously or MapReduce [13], jobs no longer have a strict processor requirement: they can use as many processors as are assigned to them.

Let us first consider the case where the processor requirement is the only requirement in a single-machine, space-shared environment. A straightforward approach to admitting jobs to the system is first-come-first-served (FCFS) [36]. With this approach, jobs are admitted in the order of their arrival, as long as the processor requirement can be satisfied. Consider the case where jobs x0, x1, x2, x3 arrive in that order, with processor requirements of 2, 2, 5, 1 respectively. If the system has 5 processors in total, then only jobs x0 and x1 will be admitted. Job x2 cannot be admitted at this point, as there is only 1 available processor. There is space for job x3, but as this job arrived after x2, it cannot be admitted at this point. While FCFS is an easy-to-understand and easy-to-implement strategy that often works well enough, the fact that jobs in the queue must wait while there is space for them on the system is an obvious inefficiency.

To combat this, FCFS is often paired with backfilling. A backfilling strategy skips over queued jobs that cannot be scheduled, and admits the first job that can. Backfilling in the previous example would lead to job x3 being admitted. While this backfilling approach may increase system utilization, it has a major downside: starvation. If, in the previous example, jobs with small requirements continuously arrived, they would continuously serve as the backfill, and job x2 would never be admitted. Aggressive (a.k.a. EASY) backfilling [25] only considers a job for backfilling if admitting that job does not delay the first job in the queue. Aggressive backfilling is only possible if the user gives a job duration estimate upon submission. Other backfilling strategies exist; conservative backfilling [15] only allows jobs to backfill as long as they do not delay any other waiting job.

The previous examples perform admission control to maximize system utilization in a fair way, using a heuristic (FCFS with backfilling). If one desires to admit jobs purely to maximize system utilization, without consideration for fairness, the problem reduces to the bin-packing problem [24]. Sheikhalishahi et al. [34] propose a more complex multi-resource scheduling system based on bin-packing. The paper defines a heuristic for the (NP-hard) bin-packing problem, as well as a policy which calculates the most suitable node for a job to run on.
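To make the difference between plain FCFS and backfilling concrete, the following is a minimal sketch (our own illustration, not code from any of the cited systems); the queued_job structure and the free-processor count are assumptions of this sketch.

#include <deque>
#include <vector>

// Illustrative job description: only a processor requirement, as in the example above.
struct queued_job {
    int id;
    int required_processors;
};

// Admit jobs from the queue while processors are available.
// With backfill = false this is plain FCFS: stop at the first job that does not fit.
// With backfill = true, jobs that do not fit are skipped and later jobs may be admitted.
std::vector<int> admit(std::deque<queued_job>& queue, int free_processors, bool backfill) {
    std::vector<int> admitted;
    for (auto it = queue.begin(); it != queue.end(); ) {
        if (it->required_processors <= free_processors) {
            free_processors -= it->required_processors;
            admitted.push_back(it->id);
            it = queue.erase(it);
        } else if (backfill) {
            ++it;    // skip the job that does not fit, keep scanning
        } else {
            break;   // strict FCFS: the head of the queue blocks everyone behind it
        }
    }
    return admitted;
}

With the example above (requirements 2, 2, 5, 1 on a 5-processor machine), plain FCFS admits x0 and x1; with backfill enabled, x3 is admitted as well while x2 keeps waiting, which is exactly the starvation risk described above.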

2.1.3 Thread scheduling

An example of a widely used task scheduler is Linux's Completely Fair Scheduler (CFS), where CPU cycles are divided among threads in proportion to their weight [27]. As the name says, this is an example of a fair scheduler. Such schedulers balance load across CPU cores and minimize their idle time. What fair schedulers fail to consider is that tasks live on the same system and share resources with each other. Resource contention can negatively affect task performance.

Zhuravlev et al. explore contention mitigation via thread scheduling [42]. In their work, they developed two scheduling algorithms that attempt to mitigate contention. This is done by distributing the threads across caches such that the cache miss rates are evenly distributed among them. Finding a valid metric that represents resource contention (in this case cache miss rates) is important. This work evaluated several metrics, but ultimately chose the cache miss rate, as the miss rate is both easy to obtain online compared to the other covered metrics, and simpler, making it more likely to be adopted in general-purpose operating systems. The two introduced schedulers are Distributed Intensity (DI) and Distributed Intensity Online (DIO). The difference lies in the way they calculate the miss rate: DI relies upon stack-distance profiles [11], whereas DIO dynamically obtains the miss rates via performance counters.


Both DI and DIO achieved similar results, with the added advantage of DIO being online. DIO improves performance by up to 13% and causes a 4% decrease in performance in the isolated case where it performs worse than the baseline. Interestingly, the variance in completion times is also much reduced compared to the baseline. Applying a contention-aware scheduler is therefore interesting from a quality-of-service (QoS) point of view. A review paper [43] names [7, 21, 23, 28, 37] as other contention-aware thread schedulers.

The previous contention-aware schedulers used predefined input metrics as the sole guiding factor for contention mitigation. Improvements could be achieved by letting a statistical model make predictions that guide thread scheduling and thread migrations. Such a system is FACT [31], which obtains the performance metrics online, learns to predict the effects of interfering threads on the performance, and then makes thread scheduling decisions. FACT consists of two phases: first, it trains a statistical, supervised learning model to predict the instructions per cycle (IPC) of co-scheduled workloads. FACT allows for any kind of input predictor to the model, but as the goal is to maximize the IPC, the paper names MPI (misses per instruction), API (accesses per instruction) and of course IPC as possible predictors. FACT uses a fuzzy-rule-based model to map the input vector to a scalar output. Training the model can be done offline. Next, as workloads are submitted to the system, FACT finds the best pairs of workloads to co-schedule (i.e. attempting to maximize their IPC). While these workloads run, performance data is collected and used to improve the model online. As with DI and DIO, the performance compared to a baseline scheduler is increased, and the variance of execution time of workloads is decreased.

2.1.4 Cluster scheduling

In this section, we explore job scheduling solutions used in large-scale computing clusters. We do this by looking at several prominent schedulers used in industry, and relating them to each other.

Google's Borg system [39] is a cluster manager that combines approaches such as admission control and over-commitment. Borg is internal-facing (i.e. used by Google employees) and is used to run Google's applications and services. The application domain of large-scale clusters demands certain design choices. In a cluster ("cell" in Borg terminology), there is one machine designated as the centralized controller ("Borgmaster"). Every other machine runs a local Borg agent ("Borglet"). Jobs, which contain one or more tasks, are submitted to the Borgmaster. Borgmaster's scheduler scans tasks in decreasing priority order and performs two things: it finds out on which machines the task could run (feasibility checking), then picks the best feasible machine for the task to run on (scoring). In feasibility checking, the scheduler finds a set of machines that meet the task's constraints and also have enough "available" resources, which includes resources assigned to lower-priority tasks that can be evicted.

In scoring, the scheduler determines the "goodness" of each feasible machine. It uses a hybrid approach between E-PVM [6], which spreads load across all machines, and "best fit", which fills machines as tightly as possible. The task is then set to be run by a Borglet. If the Borglet does not have enough resources to satisfy the requirements of an incoming task, it kills tasks with a lower priority until it does. Not all jobs will be considered for scheduling, however. Borg uses quotas to determine which jobs will be considered. A quota is a vector of resource requirements; it specifies the maximum amount of each resource a job can ask for, over a period of time. Strictly adhering to the quotas, and admitting no more jobs if the quota cannot be fulfilled, ensures that the contract is upheld, but leads to resource underutilization. To combat this, the Borg system oversells quota for lower-priority jobs.

Google Borg makes use of quotas to reduce occurrences where the resource requirements of a job cannot be satisfied. A prominent alternative to that approach is Dominant Resource Fairness (DRF) [17]. Where many other (earlier) fair resource schedulers support only one resource type, DRF explicitly attempts to achieve fair scheduling for multiple types. DRF is a generalization of max-min fairness, which maximizes the minimum allocation for each user. It does that by first computing the share of each resource allocated to each user. The dominant share is the maximum of those shares for each user. The DRF algorithm maximizes the allocation across resources under the constraint that the dominant shares of all users are equal. DRF has several nice properties, which include incentivizing resource sharing, strategy-proofness (lying about requirements does not increase allocation), and envy-freeness. Commonly used fair schedulers in clusters, such as the Hadoop Fair Scheduler [40], divide nodes with all of their resources into slots, and allocate entire slots to users. For example, a machine with 8 CPU cores and 32 GB of RAM can be divided into two slots with 4 CPU cores and 16 GB of RAM each. DRF achieves a significant performance increase compared to those slot-based schedulers; a small sketch of the dominant-share step is given at the end of this subsection.

Google Borg is an example of a mainly centralized scheduler, as the Borgmaster is responsible for the major scheduling decisions. More decentralized approaches are also possible. Apollo [10] is a distributed system by Microsoft. Each running job has a scheduler that independently performs scheduling decisions. The per-job scheduler decides on which worker node the tasks will be queued [22]. Apollo uses an estimation-based approach to guide scheduling decisions. Each node maintains a prediction matrix of waiting times based on the requested amount of CPU and memory. The job then queues its tasks at a suitable worker node, by finding the node for which the estimated completion time of the task is the earliest. Since scheduling decisions are made per job, and can compete with decisions of other jobs, Apollo has correction mechanisms, which reevaluate scheduling decisions.


These mechanisms are not invoked during scheduling time (pessimistic), but only after the tasks have been queued at the node (optimistic). Apollo also supports "opportunistic" task scheduling, where tasks use as much unused system resources as possible. This increases system utilization. As is the case with decentralized schedulers in general, the local scheduling decisions are generally not globally optimal; what one gains instead is a more scalable system.

A common theme seen in these schedulers is the desire to maximize the use of system resources. In the case of Borg, this is done through over-commitment; in the case of Apollo, through the option to schedule tasks opportunistically. Apache Mesos [19] is "a platform for fine-grained resource sharing in the data center". Mesos tackles the case where multiple frameworks, such as Hadoop and MPI [8], are running on a single system. Some of these frameworks like to assume full control over their assigned resources, such as processors or machines. If resources are allocated to these frameworks, they cannot be used for other purposes, and risk idling. Mesos addresses this by offering resources to the frameworks, then letting them decide which resources to use. This two-level mechanism is generic and agnostic to the frameworks at hand. However, as with Apollo, these scheduling decisions may very well not be optimal. Mesos is used as a part of more elaborate systems. One such system is Apache Spark [41]. Spark, in turn, uses delay scheduling [40] to schedule resources to tasks, which causes tasks to be scheduled locally to their data. If this constraint cannot be fulfilled for a certain amount of time, then the task is scheduled non-locally. This approach largely achieves both locality and fairness.

YARN [38] is a resource manager for Hadoop. Like Mesos, YARN is a two-level scheduler. Unlike Mesos, where the scheduler offers resources to the frameworks, YARN is request-based, and requires the application to ask for resources.
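Returning to DRF, the dominant-share step referenced above can be sketched as follows (a sketch of the idea from [17], not code from any of the named systems; the user_state structure and the capacity/remaining vectors are assumptions of this sketch):

#include <vector>

struct user_state {
    std::vector<double> demand;     // per-task demand of each resource, e.g. {CPU, memory}
    std::vector<double> allocated;  // resources allocated to this user so far
    double dominant_share = 0.0;    // max over resources of allocated[r] / capacity[r]
};

// One DRF step: pick the user with the lowest dominant share and allocate one of
// its tasks, if the remaining capacity permits. Returns the chosen user index, or -1.
int drf_step(std::vector<user_state>& users, std::vector<double>& remaining,
             const std::vector<double>& capacity) {
    int chosen = -1;
    for (int u = 0; u < (int)users.size(); ++u)
        if (chosen < 0 || users[u].dominant_share < users[chosen].dominant_share)
            chosen = u;
    if (chosen < 0) return -1;

    // Check that one task of the chosen user fits in the remaining capacity.
    for (size_t r = 0; r < remaining.size(); ++r)
        if (users[chosen].demand[r] > remaining[r]) return -1;

    // Allocate the task and update the dominant share.
    for (size_t r = 0; r < remaining.size(); ++r) {
        remaining[r] -= users[chosen].demand[r];
        users[chosen].allocated[r] += users[chosen].demand[r];
        double share = users[chosen].allocated[r] / capacity[r];
        if (share > users[chosen].dominant_share)
            users[chosen].dominant_share = share;
    }
    return chosen;
}

Repeating this step until no further task fits is what equalizes the users' dominant shares over time, which is the property the DRF paper argues for.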

2.2 System overview

Now that prior literature about job scheduling has been covered, we look at a high-level overview of our use case and how it relates to the concepts covered in the literature.

2.2.1 PGX.D overview

PGX.D [20] is a distributed graph processing engine. The primary use cases for PGX.D are loading graphs and performing queries or algorithms on them. Graph queries and algorithms are submitted to PGX.D in the form of jobs. For performance reasons, all graphs are loaded in-memory. The advantage the distributed aspect brings compared to a shared-memory counterpart is that PGX.D can load larger graphs, for it can use the total memory of multiple machines.

An additional advantage is the ability to leverage more processors to execute jobs.

PGX.D partitions graphs over all nodes in the system. This is an important point, as it has far-reaching implications for the eventual design of the job scheduler. Running a graph query or algorithm necessarily involves all nodes; therefore, a job needs to be submitted to each node in the system. Because jobs have incomplete information about the graph they are processing, they inevitably have to communicate with jobs on other nodes. This will become clear through examples later in this section.

A user can submit a job by sending a command to the designated "master" node. The master node propagates this command to all other nodes. Each node then constructs a job object. Where the command only contained a description of what is to be run, a job contains a precise definition in terms of code (e.g. in C++), which can now be executed. We demonstrate this using an example. If a user wants to execute the PageRank algorithm [29] on a certain previously loaded graph A, it sends a message with the meaning "execute PageRank on graph A" to the master node. This node then propagates the same message to all other nodes. All nodes create a job object containing the PageRank algorithm code. The code is equal for all nodes. There are as many job objects as there are nodes; each job object executes the algorithm on the partition of the graph found on the local node. We shall use the notation job (i, j) to refer to a job object of kind i found on node j, or simply job i if the particular node is not relevant. Jobs originating from the same command are of the same kind. If the PageRank command were sent to PGX.D running on four nodes, this would result in job (0, 0), job (0, 1), job (0, 2) and job (0, 3). We use the term corresponding jobs to refer to jobs of the same kind found on other machines.

A job only has access to the local partition of the graph. A partition includes vertices and the outgoing edges connected to the local vertices. Of course, edges might lead to vertices that are not found on the local node (remote vertices). If an algorithm depends on interacting with a vertex's neighbors, as PageRank does, this presents a problem, as the neighbor might not be present on the current node. To solve this issue, message passing is used. A job (i, j) sends a message whenever it follows an edge. This message travels to the node where the neighbor is found. There, node k passes the message to job (i, k), which is responsible for handling it. Any job that uses message passing hence has two responsibilities: performing the algorithm on the local partition (which includes sending messages if interaction with a remote vertex is required), and handling incoming messages from the corresponding jobs. These operations are performed concurrently. A high-level overview of this structure is shown in figure 2.2. The interaction of threads with these operations shall be covered later in this section.


[Figure 2.2: The anatomy of a job. In this example, two threads are assigned to the job. Each thread repeatedly chooses an operator: performing local work (including sending messages) or handling incoming messages.]

There are several addenda to the presentation of the job model. First of all, graph algorithms only serve as a specific use case where the dual-operation job model can be applied; it is certainly not the only use case. We shall therefore abstract from the graph aspect and define the two operations as: performing work, which may include message sending, and message handling. Also, partly following from the previous abstraction, the topic of graph partitioning is not a subject of this thesis.

This model causes a job and its corresponding jobs to be tightly coupled, as they communicate constantly. Jobs of the same kind must be running on all nodes in order for them to function. Admitting jobs only on certain nodes and not on others would cause the admitted jobs to not perform work while still occupying system resources, causing those resources to be wasted. Similarly, preempting jobs of the same kind only on certain nodes would lead to the same result. These actions require global coordination.

This model stands in contrast with many of the cluster scheduling solutions covered in section 2.1.4. A major component of cluster schedulers is deciding on which node a task should run. With our model, however, there is no such option: a job must run on all machines, or not at all. Thread scheduling does remain a viable route, however. To understand how to navigate that route, we must first understand the runtime system at hand.

2.2.2 Callisto runtime system

Job execution is the responsibility of the Callisto runtime system [18], which was also covered in section 2.1.1. Callisto is a single-machine runtime, and executes job objects such as those described previously. Callisto itself has no knowledge of other nodes.

Instead, the distributed aspect of the job is encoded in its operators, which Callisto executes.

Upon startup, Callisto creates as many software threads as the machine has hardware threads. In this thesis, we use "threads" to refer to software threads, and "cores" interchangeably with "threads". A job is executed by having threads assigned to it. The threads execute operators fairly: threads randomly pick any of the available operators, execute it for a certain amount of time, then randomly pick an operator again. This repeats until the job is finished. The threads' behavior is shown schematically in figure 2.2. Because multiple threads can be assigned to a single operator, the programming model must support such execution. Operators in Callisto are defined by a domain (e.g. integers from 0 to n, or all local vertices) and a function that takes an element from the domain as input. Multiple threads can thus independently execute the operator function in parallel; a minimal sketch of this idea is shown below. Workload distribution across threads in Callisto is described in [18].

Looking specifically at our dual-operation job model, we can see how this process plays out. We must analyze what it means for an operator to be available. The local work operator has a finite workload. This operator is available as long as its workload has not yet been fully executed; once it has, the operator is finished, never to become available again. On the other hand, the message handling operator is only available when there are incoming messages in the job's queue; otherwise it is not. It can therefore switch between being available and unavailable continuously. The operator is only truly finished when the local work operators of all corresponding jobs have finished and there are no more messages to be received. PGX.D implements a termination protocol that detects when this is the case.

Callisto assumes tight control over the threads in the system. It is designed in such a way as to minimize thread context switches. In theory, at any given time, all threads would be executing operator functions. Therefore, whenever a thread is measured to have a low CPU usage, it is because the operator function (and by extension the job) is not able to make efficient use of the assigned CPU time.

Threads in Callisto are assigned fairly to all jobs. Reallocations occur only whenever a job is admitted or leaves the system. This way, all running jobs receive an equal share of the threads. An example of this process is shown in figure 2.3. Callisto uses an FCFS approach to determine which jobs will be allowed to run. Jobs must specify a processor requirement. As job tasks are programmed through operator functions to be agnostic to the number of threads assigned to them, this requirement is typically set to 1. This means that there can only be as many jobs as there are processors. The admission process is shown in figure 2.4.

The processor requirement is the only specifiable requirement.
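The sketch referenced above illustrates the operator idea only; it is our own illustration, not Callisto's actual interface, whose details are not given here.

#include <atomic>
#include <cstdint>
#include <functional>

// Illustrative operator: a domain [0, size) and a function applied to each element.
struct operator_t {
    uint64_t size;
    std::function<void(uint64_t)> body;
    std::atomic<uint64_t> next{0};  // shared iteration counter

    // Called concurrently by every thread assigned to the job; each call
    // claims one element until the domain is exhausted.
    bool execute_next() {
        uint64_t i = next.fetch_add(1, std::memory_order_relaxed);
        if (i >= size) return false;  // no elements left: the operator is finished
        body(i);
        return true;
    }
};

Callisto itself lowers the synchronization cost of exactly this kind of shared counter through per-core iteration counts and request combining, as noted in section 2.1.1.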


[Figure 2.3: An example of Callisto's fair scheduling on a 4-core machine. Two jobs arrive at the same time and each receives the same allocation (two cores).]

If a job wants to use other resources, such as memory or network bandwidth, it will directly access them from the system. All jobs share the same resource space. For example, if a job allocates a chunk of memory, the remaining available memory is reduced for all jobs. This increases resource utilization and simplifies the design of the runtime system. However, it also means that jobs are not guaranteed an allocation of a resource. If, due to usage by other threads, a thread is not able to access a resource, we call this resource contention.

It can occur during normal operation that the operator function has to wait for a certain event. Callisto provides a mechanism, which can be called from operators, to park the running thread and resume it at a later time. This is called blocking. When a thread decides to block, another thread is resumed in its place. Callisto is designed to minimize context switches; however, whenever block is called, a context switch is incurred.


[Figure 2.4: An example of Callisto's admission control on a 4-core machine. Three jobs arrive at the same time while two jobs are already running.]

Evidently, we want to minimize the occurrences of blocks. For this, we need to understand why threads running PGX.D might decide to block.

In distributed graph systems, especially those that deal with a lot of data, jobs across machines have to communicate to process the data. With larger graphs, this puts strain on the network. Whenever an operator function wants to send a message, it requests a thread-local and destination-specific buffer from the buffer pool (if the thread does not have such a buffer yet). These buffers are registered with the network card; writing to a buffer means writing directly to the network card. If there are n nodes in the system, a thread can have at most n − 1 such buffers at any single time. The message is then written to the buffer. When the buffer is full, it is passed to a low-level library that will finally transmit it over the network. The operator then requests another thread-local, destination-specific buffer. The sent buffer will only be re-added to the buffer pool once the receiving end acknowledges the message.


Figure 2.5 shows this process in schematic form. PGX.D will call block in the following cases:

1. If the thread requests a buffer from the pool, but the pool is empty, block is called.

2. If the thread tries to send a buffer, but the number of unacknowledged messages is already higher than a preset configuration value, block is called.

3. If the lower-level library that is responsible for message transmission reports an error for any reason, block is called.

4. If an application that is built on top of PGX.D (e.g. an asynchronous query execution engine) has its own flow control limits, then block is called.

To reduce the occurrences of blocks, it is therefore imperative to reduce the intensity of messaging when the network can no longer handle the messaging load.

2.2.3 Relation to literature

Now that the system design is clear, we can see if and how schedulers found in the literature could be applied to our distributed system model and, specifically, PGX.D.

Solutions such as FACT [31] are designed for single-machine runtimes, such as general-purpose operating systems. They do not consider the distributed nature of jobs. Communication between nodes can be a bottleneck and a major cause of contention. By merely attempting to improve processor performance metrics, global oversight is neglected, which may cause adverse effects.

Cluster schedulers require jobs to specify their resource needs. The cluster will (largely) respect those needs, giving users certain guarantees about performance and job function. In the case where clusters are exposed to external users, there are few alternatives to this approach: if users have paid for their resource needs, then they should be able to make use of what they paid for. However, if the goal is to improve total system utilization, resource guarantees and contracts are an obstacle. User estimates of resources are notoriously unreliable and err on the high side [44]. By removing the resource contracts, one can achieve a higher sharing potential of resources.

Admission control is a challenging concept in our distributed system model. Jobs either have to be running on all nodes, or not at all.


[Figure 2.5: Schematic diagrams showing the message sending process. (a) Normal message sending: local work writes messages to a thread-local buffer, which is sent to the network once full, replaced with a fresh buffer from the pool, and returned to the pool once acknowledged. (b) No buffers available in the pool; the operator function will block.]


Because of this, if a "smart" admission control scheme is to be implemented, the decision is better taken centrally. In cluster schedulers such as Google Borg, admission control is in fact done centrally. In the next chapter, we further relate our design choices to the related work.

Chapter 3

Solution Design

The high-level system design shows an important limitation: jobs do not know about the resource usage of other jobs running in the system. This presents a problem when a resource is highly contested: jobs will have to spend a part of their runtime waiting or context switching, and both situations are highly undesirable. There are several ways to mitigate resource contention between jobs. In chapter 1 we presented several dimensions along which resource contention can be mitigated. In this chapter we look at the design of dynamic thread allocation, operator assignment within jobs, and admission control, with a focus on the former two.

3.1 Dynamic thread allocation

Fair scheduling entails allocating an approximately equal amount of CPU time to each of the tasks. Fair scheduling is commonly used, for example in Linux's Completely Fair Scheduler [27]. If fairness is the goal, then fair scheduling is the way to go. However, it is possible to improve job performance (e.g. throughput) beyond what fair scheduling can offer, at the expense of fairness. In this section, we describe a "smart" thread allocation algorithm that reschedules threads based on the jobs' resource usage during runtime. This solution is generic and easily extensible to various resource types.

We shall start with an example. If, given a certain allocation, a resource is oversaturated, then a scheduler could reallocate threads from jobs that participate in the resource oversaturation to jobs that do not. The effect is twofold: firstly, it alleviates the load on the resource, potentially improving the performance of the jobs accessing this resource; secondly, the jobs that receive additional threads can now run faster. An example of these claimed positive effects with two jobs is shown in figure 3.1. With fair scheduling, both jobs initially receive an equal share of the cores.


[Figure 3.1: An example of the twofold purported effects of core reallocation. These graphs show the CPU usage over time of two running jobs on a 40-core machine: (a) fair scheduling, where each job receives 20 cores; (b) smart reallocation, where job 0 receives 30 cores and job 1 receives 10, giving job 0 a speedup due to the higher allocation and job 1 a speedup due to lower contention.]

It is clear that job 0 makes much better use of its CPU time than job 1. A low CPU time signifies an inefficient job. Examples of when this happens are contention for system calls (e.g. malloc) or waiting for signals (e.g. when acquiring a mutex). If the scheduler instead decides to reallocate a part of the cores of the inefficient job to the efficient job, both jobs can achieve a speedup: the efficient job because it can use more cores efficiently, and the inefficient job because it has to wait less due to lowered contention.

This kind of smarter reallocation is only possible if the scheduler understands what resources the running jobs are using. We introduce the concept of tokens, which represent the usage of a system resource. For example, one CPU token could represent one percentage point of CPU usage of a job. Therefore, if a job has 100 CPU tokens, its CPU usage is 100% and it is perfectly efficient in terms of CPU time.

Tokens of different types are not inherently comparable. Suppose, for example, that there were also a memory bandwidth token, where one such token represents one GiB/s of bandwidth. Such a token is naturally incomparable to a CPU token.


Job      Tokens (type A)   Tokens (type B)   Efficiency factor (type A)   Efficiency factor (type B)
job 0    x_{A,0}           x_{B,0}           f_A(x_{A,0})                 f_B(x_{B,0})
job 1    x_{A,1}           x_{B,1}           f_A(x_{A,1})                 f_B(x_{B,1})

Table 3.1: Two running jobs and two defined resource types. The exact definitions of policies f_A and f_B are not relevant. Each job gets two efficiency factors.

In order to compare and combine tokens, the concept of an efficiency factor is introduced. An efficiency factor is a value between 0 and 1 that represents how efficiently a job is using a certain resource. The higher the efficiency factor, the more efficiently the job is perceived to use the resource. Jobs with a high efficiency factor will ultimately also receive a higher thread allocation.

We now have two concepts: tokens (which represent a job's usage of a certain resource) and efficiency factors (which represent a job's perceived efficiency regarding a resource). Tokens are mapped to efficiency factors through policies. Let us use the case of CPU usage again. If the CPU usage of a job is 100%, then the number of CPU tokens for that job is 100. We know that this job is as efficient as it can get in terms of CPU usage. Therefore, 100 CPU tokens should map to an efficiency factor of 1. This can simply be achieved using the following function:

\[ f_{\mathrm{CPU}}(x) = \frac{x}{100} \]

It is useful to keep the ultimate purpose of policies in mind. To mitigate contention and increase system utilization, we want jobs that do not participate in resource contention to receive a high efficiency factor, and jobs that do participate to receive a low factor.

To summarize, there are three concepts: tokens, policies, and efficiency factors. Every job has a number of tokens, which through policies are mapped to an efficiency factor. It is imperative that this is understood. In table 3.1, there are two running jobs (job 0 and 1) and two defined resource types (type A and B). Functions f_A and f_B are the policies for the tokens of type A and B, respectively. The only constraint on policy functions is that their result must lie between 0 and 1. Each job receives as many efficiency factors as there are resource types.

Efficiency factors are used to determine the job's thread allocation. A job receives a proportion of the threads equal to its efficiency factor divided by the sum of the efficiency factors of all jobs.
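Written out, with g_j denoting the (combined) efficiency factor of job j and n the number of running jobs, the thread proportion p_j described above is:

\[ p_j = \frac{g_j}{\sum_{k=1}^{n} g_k} \]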


Job      CPU tokens          CPU efficiency factor          Thread proportion
job 0    x_{CPU,0} = 100     f_CPU(x_{CPU,0}) = 1.0         0.66
job 1    x_{CPU,1} = 50      f_CPU(x_{CPU,1}) = 0.5         0.33

Table 3.2: Example of the thread allocation calculation. Two running jobs and one defined resource type.

To unpack this, table 3.2 shows an example with two jobs and one defined resource type: CPU usage. f_CPU reprises its role as the policy for CPU tokens. Firstly, the scheduler collects the tokens of each job. The policy function converts the CPU tokens to CPU efficiency factors. The sum of the efficiency factors of all jobs is 1.5. The proportion of threads allocated to each job is therefore equal to its efficiency factor divided by 1.5.

The previous scheme assumes that each job has one efficiency factor. This is clearly not always the case, as we have seen in table 3.1. There must therefore be a way to convert efficiency factors of different types into one combined factor. This is the responsibility of the combination function. Table 3.3 continues the example of table 3.1, where now multiple efficiency factors are combined into one, using combination function g. The combination function must always output a valid efficiency factor. In our example, g could simply take the average of the input efficiency factors, or be more complex.

This rescheduling algorithm is designed to be run periodically. Running it frequently makes the thread allocation reflect the current efficiency factors better. This is done by creating a thread which collects the tokens and reschedules on every interval.

In summary, the rescheduler collects the tokens of all running jobs. Through user-defined policies, each job receives several efficiency factors. Next, these factors are combined into an all-encompassing factor, using a combination function. The scheduler then reallocates threads based on these combined efficiency factors. This solution is still generic and not system-specific. In the next few sections, we will employ this scheduling scheme for our distributed system model by defining our own policies. Hopefully, those sections can also help the reader devise their own policies.
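To summarize the algorithm in code, the following is a minimal sketch of one rescheduling pass. The job_info, token_t and policy_fn types are assumptions of this sketch, not the system's actual data structures; rounding and leftover threads are ignored.

#include <cstdint>
#include <functional>
#include <map>
#include <vector>

using token_t = int;
using policy_fn = std::function<double(int64_t)>;

struct job_info {
    std::map<token_t, int64_t> tokens;  // collected token counts, per token type
    int allocated_threads = 0;
};

// One pass of the periodic rescheduler: tokens -> efficiency factors ->
// combined factor -> proportional thread allocation.
void reschedule(std::vector<job_info>& jobs,
                const std::map<token_t, policy_fn>& policies,
                const std::function<double(const std::vector<double>&)>& combine,
                int total_threads) {
    if (jobs.empty()) return;
    std::vector<double> combined(jobs.size(), 0.0);
    double sum = 0.0;
    for (size_t j = 0; j < jobs.size(); ++j) {
        std::vector<double> factors;
        for (const auto& [type, policy] : policies)
            factors.push_back(policy(jobs[j].tokens.count(type) ? jobs[j].tokens.at(type) : 0));
        combined[j] = combine(factors);  // must lie in [0, 1]
        sum += combined[j];
    }
    for (size_t j = 0; j < jobs.size(); ++j)
        jobs[j].allocated_threads =
            sum > 0.0 ? (int)(total_threads * combined[j] / sum)
                      : total_threads / (int)jobs.size();  // fall back to a fair split
}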

3.2 Scheduler API

The solution design is intentionally defined in a generic manner. It supports any number of token types and policies. Of course, such a generic solution does not do anything by itself; its function depends on the defined tokens and policies.


Job      Tokens (type A)   Tokens (type B)   Efficiency factor (type A)   Efficiency factor (type B)   Combined factor
job 0    x_{A,0}           x_{B,0}           f_A(x_{A,0})                 f_B(x_{B,0})                 g(f_A(x_{A,0}), f_B(x_{B,0}))
job 1    x_{A,1}           x_{B,1}           f_A(x_{A,1})                 f_B(x_{B,1})                 g(f_A(x_{A,1}), f_B(x_{B,1}))

Table 3.3: Two running jobs and two defined resource types. Combination function g converts multiple efficiency factors into one.

class resource_manager {
public:
    resource_manager(job_t job_id);
    void acquire_tokens(int64_t tokens, token_t token_type);
    void release_tokens(int64_t tokens, token_t token_type);
    void set_tokens(int64_t tokens, token_t token_type);
};

Figure 3.2: API of resource manager

These definitions are the responsibility of the user. In this section, we present an API that allows for such definitions.

Firstly, for each job and for each token type, the number of tokens in use needs to be tracked. For this, we introduce resource managers (see figure 3.2), which are job-specific objects used to track the token usage of the various types. A resource manager is created with a unique identifier of the running job. Typically, this is either a pointer to a job object, if such an object exists, or a unique job ID. Using the resource manager object, one can then manipulate the number of tokens (of each token type) in use. Three methods are provided: acquire_tokens to increase the number of tokens of a given type, release_tokens to decrease it, and set_tokens to set it to a specific number. Each method of resource_manager is thread-safe, because jobs are executed on multiple threads, so information about token usage can also come from multiple threads.
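For illustration, tracking the outgoing message tokens of section 3.3.1 with this API could look as follows. This is a sketch: the messaging_hooks wrapper, the send/acknowledge callbacks and the TOKEN_OUTGOING_MSG token type constant are assumptions; only the resource_manager calls follow the API of figure 3.2.

// Sketch: wiring the resource manager into a job's messaging path.
struct messaging_hooks {
    resource_manager rm;
    explicit messaging_hooks(job_t job) : rm(job) {}

    // Message handed off for sending: it is now unacknowledged.
    void on_message_sent()         { rm.acquire_tokens(1, TOKEN_OUTGOING_MSG); }
    // Destination confirmed receipt: one fewer unacknowledged message.
    void on_message_acknowledged() { rm.release_tokens(1, TOKEN_OUTGOING_MSG); }
};

The number of CPU tokens, by contrast, is computed by the rescheduling thread itself and written with set_tokens, as section 3.3.2 describes.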

Tokens are converted to efficiency factors using policies. The method set_policy is provided by the scheduler (see figure 3.3) and allows the user to define a mapping from tokens to an efficiency factor for a certain token type. The method set_combination_function defines the combination function, which takes as input the efficiency factors of a job and outputs the combined efficiency factor.


class scheduler {
public:
    // Register the policy for one token type: a function mapping a token
    // count to an efficiency factor in [0, 1].
    void set_policy(token_t token_type,
                    const std::function<double(int64_t)>& fn);
    // Register the combination function, mapping a job's efficiency
    // factors to one combined factor.
    void set_combination_function(
        const std::function<double(const std::vector<double>&)>& fn);
    // Recompute the thread allocation of all running jobs.
    void reallocate();
};

Figure 3.3: API of the scheduler
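Putting the two figures together, a user could register the linear CPU policy from section 3.1 and a simple averaging combination function as follows. This is a hedged sketch: the TOKEN_CPU constant and the exact parameter types of set_policy and set_combination_function are assumptions reconstructed from the prose above.

scheduler sched;

// CPU tokens are percent CPU usage (0-100); map them linearly to [0, 1].
sched.set_policy(TOKEN_CPU, [](int64_t tokens) {
    return static_cast<double>(tokens) / 100.0;
});

// Combine a job's per-type efficiency factors by taking their average.
sched.set_combination_function([](const std::vector<double>& factors) {
    double sum = 0.0;
    for (double f : factors) sum += f;
    return factors.empty() ? 0.0 : sum / factors.size();
});

// A dedicated rescheduling thread then calls reallocate() on every interval.
sched.reallocate();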

3.3 Policies for distributed jobs

The thread allocation algorithm presented above is generic and allows for application in various domains (e.g. single-machine scheduling, cluster scheduling). The algorithm relies entirely upon user-specified policies. In this section, we introduce two policies which we argue are applicable to distributed systems that perform data processing in general.

3.3.1 Outgoing message policy

Network is a scarce resource in distributed systems, particularly in distributed data processing engines. As jobs can only directly access local data, nodes need to communicate with each other to access remote data. The goal of the scheduler is to mitigate contention; the outgoing message policy addresses contention when sending messages.

First, we must understand how contention can occur when sending messages. An obvious limit to the message sending capability is the link bandwidth between the nodes. It is physically not possible to send data over the link faster than the bandwidth. If a job attempts to send data faster than the link supports, then the job will need to spend some time waiting so that the data transfer rate matches the bandwidth. The link bandwidth is by far not the only possible bottleneck; any networking middleware could be responsible for restricting the rate of messaging. A policy that targets contention when sending messages should therefore not rely upon where exactly the bottleneck lies.

[Figure 3.4: Efficiency factor plot for the network policy, where c = 128. The x-axis shows the number of outgoing message tokens (0 to 2000); the y-axis shows the resulting efficiency factor (0 to 1).]

The input to the policy is the number of unacknowledged messages (which we call the network tokens). Messages are classified as unacknowledged from the moment the job instructs to send them until the moment an acknowledgment is received from the destination node. This metric has several desirable properties:

1. An increasing number of unacknowledged messages means that the job is sending messages faster than the "network" can handle. The bottleneck may lie anywhere in the network and the metric would display the same behavior. An increasing number of unacknowledged messages is thus an indication of network contention.

2. Obtaining the number of unacknowledged messages is trivial if a message acknowledgement mechanism has already been implemented, which is often the case.

Outgoing message tokens are tracked with the resource manager API. Whenever a job sends a message, acquire_tokens is called. When that message is finally acknowledged, release_tokens is called.

An extremely high number of unacknowledged messages is very suggestive of network contention, and should therefore result in a low efficiency factor (so that jobs that participate in network contention receive a lowered allocation). A low number of unacknowledged messages, however, is expected during normal operation, as jobs will invariably send messages to other nodes, and those messages will at some point be unacknowledged. Therefore, such cases must not receive a lowered efficiency factor. These constraints can be achieved through many policy functions. We present the following function (with corresponding plot in figure 3.4):

\[
f_{\mathrm{network}}(x) =
\begin{cases}
1 & \text{if } x < c \\
\dfrac{c}{x} & \text{otherwise}
\end{cases}
\]

Here, c is a parameter that signifies from what point onward the number of unacknowledged messages is considered to represent network congestion. If the number of unacknowledged messages is lower than c, the network is not considered congested at all, and the function output will be 1.


The parameter c should be carefully configured over several representative workloads. Setting the parameter too low will unnecessarily penalize jobs that use the network, even though they may not actually congest it. Similarly, setting the parameter too high will render this policy ineffective. Tuning this parameter for PGX.D workloads is performed in chapter 4. A similar tuning approach can be applied to other systems as well.
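The piecewise function above can be handed to the scheduler as a policy; the following sketch reuses the TOKEN_OUTGOING_MSG constant and the set_policy signature assumed in section 3.2, and the value 128 mirrors the plot of figure 3.4.

const int64_t c = 128;  // congestion threshold; tuned per deployment (chapter 4)

sched.set_policy(TOKEN_OUTGOING_MSG, [c](int64_t unacked) {
    if (unacked < c) return 1.0;              // not considered congested
    return static_cast<double>(c) / unacked;  // decays as c / x beyond the threshold
});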

3.3.2 CPU time policy

As explained in section 2.2, a job is typically either performing work on the local dataset, which includes sending messages to other nodes, or handling incoming messages. If a job has an allocation of threads, these threads can be put to work for either operator. If a job on one machine has finished performing local work, but the corresponding jobs on other machines have not, the job will use its assigned threads only to handle incoming messages. This presents a problem when there are few incoming messages: threads spend part of their time idling because there are no incoming messages to handle. These threads would be better served assigned to another job.

There are several ways to build a policy targeting this phenomenon. The policy could assign a low efficiency factor to jobs which have finished their local work but are receiving few messages. This effectively targets the situation where threads idle due to few incoming messages, but on the other hand, this is the only situation such a policy can handle. Rather, it is more useful for the policy to target the underlying reason why the situation is undesirable, namely that threads are idling. This can happen for multiple reasons, of which few incoming messages is only one. Other common reasons why threads could be idling are waiting for signals (e.g. when acquiring a mutex) or contention on system calls (e.g. when multiple threads attempt to get a memory allocation, and the OS cannot serve them fast enough).

The policy must then take as input the "idleness" or "non-idleness" of each thread. A straightforward way of measuring "non-idleness" is simply taking the average CPU time of each thread assigned to a job. Linux provides several interfaces for this purpose, which are described shortly. Another measure that can be used is IPC, which was the primary input metric for FACT [31]. Using either metric is valid, but there are implications to using one over the other. Consider a situation in which there are two jobs, job A and job B. When they are running alongside each other, job A has an inherently higher IPC than job B, but their CPU time as reported by the OS is exactly equal. If IPC is chosen as the measure, then job A would receive a higher allocation of cores, whereas with CPU time, they would receive an equal allocation. The latter is more desirable from a fairness perspective, as both jobs are performing meaningful work for the same amount of time.


[Figure 3.5: Efficiency factor plot for the CPU policy, showing the efficiency factor (0 to 1) as a function of the number of CPU tokens (0 to 1).]

One job just happens to be a workload that inherently executes more instructions per cycle. Note that, in any case, should a user desire that IPC be considered in thread scheduling, the API allows for its inclusion.

In our case, we have set the input x of the policy function to be the average CPU time of the threads assigned to the job in the past time interval, divided by the length of that interval. In other words, it measures the average CPU usage of the threads assigned to the job. This input logically has a range of [0, 1]. The policy function is set to be:

f_CPU(x) = x^2

The plot for this function is shown in figure 3.5. This plot is not linear; while it certainly could have been defined as linear, a quadratic function would penalize jobs with low CPU usage disproportionately more than jobs with a higher CPU usage. This is desirable, because jobs with a low CPU usage are clearly contending for resources and should be penalized heavily, whereas jobs with a relatively higher CPU usage are not necessarily contending for resources as much, so they should be penalized relatively less.
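To make the shape of the penalty concrete, the following small sketch evaluates the policy, assuming (as above) that the input is the measured average CPU usage in [0, 1]; the function name is illustrative and not part of PGX.D.

#include <algorithm>

// Quadratic CPU policy: maps average CPU usage in [0, 1] to an efficiency
// factor in [0, 1]. A job at 90% usage keeps a factor of 0.81, while a job
// at 30% usage drops to 0.09, so low-usage jobs are penalized
// disproportionately more.
double cpu_policy(double avg_cpu_usage) {
    double x = std::clamp(avg_cpu_usage, 0.0, 1.0);
    return x * x;  // f_CPU(x) = x^2
}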

Implementation details

The collection of CPU tokens is not specific to the distributed system, but rather to the platform. We will show how it can be performed on Linux. While network tokens fit perfectly into the acquire/release paradigm, CPU tokens cannot be seen as individual units of a resource of which jobs can possess a finite amount. Instead, CPU tokens represent the CPU time of a job.


In Linux and other POSIX systems, thread-specific CPU time can be gathered either through system calls (such as getrusage or pthread_getcpuclockid) or by accessing thread statistics from the proc filesystem [4]. These methods are not equivalent; different steps are required to extract the thread-specific CPU time depending on the interface chosen. We will describe how this can be achieved through all three interfaces.

The system call getrusage reports, among other statistics, the system time and user time of the calling thread. Every thread in the system therefore has to call this function periodically and pass the information to the job the thread is running. This is slightly convoluted; aside from being responsible for executing operator functions, threads now also have to periodically perform a system call unrelated to the job function. The system call pthread_getcpuclockid and the proc filesystem allow threads to gather CPU statistics of other threads. This simplifies the process, as the CPU statistics can simply be gathered by the rescheduling thread. This approach is less intrusive, as the application logic is now contained in the rescheduling component and nowhere else. In the end, we have opted to query the statistics from the proc filesystem, as this approach is slightly easier to implement. A drawback of using the proc filesystem is that statistics are gathered through file accesses, which are typically slower than system calls such as pthread_getcpuclockid. For reference, on a 72-thread machine, the file accesses contribute several milliseconds of execution time.

The relevant statistics are found in the files /proc/<pid>/task/<tid>/stat, where <pid> is the process ID of the entire application and <tid> is the thread ID of a worker thread. The scheduler can, in a separate thread, retrieve the system time and user time of the worker threads. Added together, they form the CPU time. Note that the system and user time are cumulative since the start of the thread. We are not interested in information over such a long duration. Instead, every time the rescheduler is executed, it collects the CPU time for each thread, subtracts the CPU time gathered in the previous invocation, and divides the result by the length of the interval. This yields the CPU usage as a fraction over the past interval.
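The sketch below illustrates this collection step under the stated assumptions: Linux, utime and stime taken from fields 14 and 15 of /proc/<pid>/task/<tid>/stat (reported in clock ticks), and successive samples differenced by the rescheduling thread. The class and function names are illustrative and not taken from PGX.D.

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <sys/types.h>
#include <unistd.h>

// Samples per-thread CPU time from /proc/<pid>/task/<tid>/stat and turns two
// successive samples into a usage fraction for the elapsed interval.
class ThreadCpuSampler {
public:
    // Returns utime + stime of the given thread in seconds, or -1 on error.
    static double cpu_time_seconds(pid_t pid, pid_t tid) {
        std::ifstream stat("/proc/" + std::to_string(pid) +
                           "/task/" + std::to_string(tid) + "/stat");
        std::string line;
        if (!std::getline(stat, line)) return -1.0;
        // The comm field (2) may contain spaces; parse after the closing ')'.
        std::istringstream rest(line.substr(line.rfind(')') + 2));
        std::string field;
        unsigned long utime = 0, stime = 0;
        // Fields 3..15 of the stat line; field 14 = utime, field 15 = stime.
        for (int i = 3; i <= 15 && rest >> field; ++i) {
            if (i == 14) utime = std::stoul(field);
            if (i == 15) stime = std::stoul(field);
        }
        return double(utime + stime) / sysconf(_SC_CLK_TCK);
    }

    // Average usage of a thread over the interval since the previous sample.
    double usage_since_last(pid_t pid, pid_t tid, double interval_seconds) {
        double now = cpu_time_seconds(pid, tid);
        double prev = last_.count(tid) ? last_[tid] : now;
        last_[tid] = now;
        return interval_seconds > 0 ? (now - prev) / interval_seconds : 0.0;
    }

private:
    std::unordered_map<pid_t, double> last_;
};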

3.3.3 Policy combination

Now that two policies have been introduced, the only question that remains is how to combine their efficiency factors. One could simply take the arithmetic mean of the efficiency factors, which is an easy-to-understand method. Alternative methods include those which dynamically prioritize either policy depending on other circumstances. We have chosen the latter approach. If the node sends little to no messages, then the network efficiency factors are all 1.0. In this case, the network efficiency factor is not a useful metric to base the allocation on, as it is equal for all jobs. Instead, in this case one should look


more at the CPU policy, as that policy still outputs a useful efficiency factor to allocate with. However, if the node has a steadily increasing number of unacknowledged messages, then it is more useful to prioritize the network policy, in order to quickly alleviate the problem. Note that the prioritization is based on the number of outgoing message tokens of all jobs running on the system, not merely of the current job. By basing the prioritization on a system-specific metric, rather than a job-specific one, all jobs will prioritize the same policy. Collectively, the jobs are better able to achieve the goal of prioritizing a policy, such as quickly reducing the number of unacknowledged messages.

Figure 3.6: Weight of the CPU and network policy, as functions over the number of outgoing message tokens. d = 1024 in this example.

The weight of the CPU policy is given by the function:

h_d(p) = max(0, 1 − (p/d)^2)

The parameter d here signifies the number of outgoing message tokens that causes the network policy to be fully prioritized at the expense of the CPU policy. The weight of the network policy is given as (1 − h_d(p)). Therefore, to combine both policies, the function will be:

g_{p,d}(x_CPU, x_network) = h_d(p) · f_CPU(x_CPU) + (1 − h_d(p)) · f_network(x_network)

Here, p is the number of outgoing message tokens (i.e. outgoing message buffers) of all jobs on the node. A plot of the weights of both policies is shown in figure 3.6. As can be seen in this figure, the functions are once again quadratic. This form is chosen such that the weights change little initially. An example of the combination function in use is shown in figure 4.8.
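Putting the pieces together, the following is a compact sketch of the combination under the definitions above; the free-function form and the parameter names are illustrative (in our evaluation, c = 128 and d = 1024 are ultimately used, see chapter 4).

#include <algorithm>

// Weight of the CPU policy: 1 when no outgoing message tokens are pending
// on the node, falling quadratically to 0 once the total number of tokens
// p reaches d.
double cpu_weight(double p, double d) {
    double w = 1.0 - (p / d) * (p / d);  // h_d(p) = max(0, 1 - (p/d)^2)
    return std::max(0.0, w);
}

double cpu_policy(double x) { return x * x; }                              // f_CPU
double network_policy(double x, double c) { return x < c ? 1.0 : c / x; } // f_network

// Combined efficiency factor g_{p,d}(x_CPU, x_network).
double combined_factor(double x_cpu, double x_net,
                       double p, double c, double d) {
    double h = cpu_weight(p, d);
    return h * cpu_policy(x_cpu) + (1.0 - h) * network_policy(x_net, c);
}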


3.3.4 Sliding window

As we do not wish to make erratic and drastic changes in allocation based on outliers in token information, tokens of each type are added to a sliding window at every interval at which the rescheduling algorithm is run. The rescheduling algorithm then considers not only the tokens found during this interval, but the average of the tokens found across the last n intervals. This smooths out the token information over a time period longer than the rescheduling interval, thereby preventing drastic changes based on short-term outliers.
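A sliding-window average over the last n rescheduling intervals could be kept as in the following sketch; the class name and the ring-buffer layout are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed-size sliding window: stores the token counts observed in the last
// n rescheduling intervals and exposes their average.
class TokenWindow {
public:
    explicit TokenWindow(std::size_t n) : samples_(n, 0.0) {}

    // Called once per rescheduling interval with the tokens observed in it.
    void record(double tokens) {
        samples_[next_ % samples_.size()] = tokens;
        ++next_;
    }

    // Average over the intervals recorded so far (at most n of them).
    double average() const {
        std::size_t filled = std::min(next_, samples_.size());
        if (filled == 0) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < filled; ++i) sum += samples_[i];
        return sum / filled;
    }

private:
    std::vector<double> samples_;
    std::size_t next_ = 0;
};

With the configuration used in chapter 4 (a 50 ms rescheduling interval and n = 10), such a window spans the last 500 ms of token information.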

3.4 Operator assignment within job

After the thread allocation algorithm has run, each job has an allocation of threads it can use. A typical job can either perform local work with those threads or handle incoming messages. Callisto by default tries to treat these operators fairly. Each thread assigned to a job has an equal chance of picking up either operator. The thread works on the operator for a fixed amount of time. After the time is up, the thread again picks up either operator at random. This approach does not consider what kind of work the operators actually perform, and is therefore very fair to either. The process was shown in figure 2.2.

With an understanding of the distributed system, however, one could design a smarter way for the threads to be assigned to operators. We know that only one operator can send messages and the other cannot. By assigning threads to the first operator more often, one increases the peak intensity of message sending. This can cause network congestion. Likewise, by assigning threads to the second operator more often instead, the peak intensity of message sending is decreased, making network contention less likely. The latter situation is more desirable.

The remaining question is how much the second operator should be prioritized over the first. A naive way is to assign all threads to message handling, as long as there are messages to be handled. Consider the case where there is one running job on node 0, which has both remaining local work and many remaining messages to be handled. The corresponding jobs on the other machines have already finished performing their local work. Those jobs' responsibility is to handle incoming messages from the job on node 0. If the job on node 0 decides to only handle messages, the corresponding jobs will be idling. If, however, that job also spends some time performing local work (which includes sending messages), the corresponding jobs need not idle any more.

We distinguish between two ways to perform the prioritization. Static prioritization sets the proportion of CPU time each of the operators will receive. For example, without any prioritization, the proportion of CPU time to each

of the two operators is 50%. With prioritization, one operator could receive a higher proportion than the other. This type of prioritization is ultimately used in our evaluation (see section 4.2.1). Dynamic prioritization uses a token metric, similar to the outgoing message tokens, for this purpose. Let us define incoming message tokens, which represent the number of received but unhandled messages. The higher the number of incoming message tokens, the more the message handling operator should be prioritized. Finding a specific function that achieves this with dynamic prioritization is left as future work.
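As an illustration of static prioritization, a thread could draw its next operator with a fixed bias towards message handling, while never starving the work operator when there is nothing to handle. The names and the bias parameter are hypothetical; this is a sketch of the idea, not Callisto's actual operator-picking code.

#include <random>

enum class Operator { LocalWork, MessageHandling };

// Static prioritization: choose the next operator with a fixed bias towards
// message handling. A share of 0.5 corresponds to the default fair pick;
// a share of 1.0 always handles messages as long as any are pending.
Operator pick_operator(double msg_handling_share,
                       bool has_unhandled_messages,
                       std::mt19937& rng) {
    if (!has_unhandled_messages) return Operator::LocalWork;
    std::bernoulli_distribution pick_msg(msg_handling_share);
    return pick_msg(rng) ? Operator::MessageHandling : Operator::LocalWork;
}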

3.5 Admission control

If it is known beforehand how much of a resource a job will use at most, then one can schedule jobs accordingly. Prior literature distinguishes between optimistic and pessimistic scheduling strategies. Pessimistic strategies assume that a job will use as much of a resource as is specified beforehand, and admit jobs in such a way that the resource limit is not reached. An example of such a strategy is FCFS with backfilling. This can mitigate resource contention entirely, but often leads to an underuse of resources, as jobs often use less than the specified maximum. Optimistic strategies account for this empirical observation, and correspondingly admit jobs even though their total specified maximum resource usage exceeds the limit. Both strategies require a priori knowledge about the job's resource usage, something which is not always available.

"A priori resource usage knowledge" is itself a vague term, and it is worth delving into what it means for a job to have certain resource needs. First of all, there are different types of resources on the system. A resource usage requirement should therefore encode requirements for every applicable resource type, e.g. through a vector. There is another, temporal dimension to the resource requirement. Tasks within a job can have vastly different needs. For example, in the case of PGX.D's graph querying engine [33], performing a GROUP BY [3] operation is initially CPU intensive before becoming network intensive. By only having one requirement vector for each job, we fail to consider the temporal aspect. Adding a temporal dimension to the requirement vector (i.e. having a requirement vector for each task) solves this issue. Of course, creating such a matrix is a complicated endeavor.

As the network is the most prominently contended resource in our distributed system model, we can restrict ourselves to considering only the network requirement to determine admission. This network requirement could be formulated as the average number of network tokens the job or task will use during its execution. We consider the average resource usage, not the maximum as is common with many admission control solutions. This is because no allocation will be performed based on this token requirement. Jobs and tasks


can freely exceed this amount; the amount is only used as an indication for the scheduler. Now if the user specifies a network token requirement for the entire job, the following steps could be taken:

1. Users send commands, along with the network token requirement, to the master node.

2. The master node keeps a count of the network token requirements of running jobs. It will propagate the command to all slave nodes if doing so would not let the count exceed a preset maximum.

3. The master node will continuously backfill with jobs that, despite arriving later than a previously queued job, would not cause the token count to exceed the maximum.

4. To avoid starvation, the first job in the queue (that is waiting because admission would exceed the token maximum) keeps a count of how often another job has served as backfill. If that count reaches a limit, the job will be admitted even if doing so would exceed the token maximum.

This approach lowers the peak intensity of messaging, as admission control maintains a maximum for the network tokens of running jobs. The approach operates on a job level, despite the fact that we previously established that tasks within jobs can have widely differing resource needs. The above algorithm could fairly simply be amended to operate on a task level. Firstly, the user will need to specify the network token requirement for each task pertaining to the command. Secondly, the master node's job queue will be substituted with a task queue. When commands initially arrive at the system, the first task of the job is put into the queue, along with its token requirement. Admission control then takes the same actions as before, the only difference being that tasks are admitted based on their token requirement instead of entire jobs. One final major amendment remains. Previously, the jobs would, once admitted, run to completion and then exit the system. Now, since a job can contain multiple tasks, once a task has finished running, the job it belongs to cannot simply exit the system. There might very well still be tasks that have not been run. The remaining amendment is that tasks would, upon completion, enqueue their following tasks at the master node. Each such task would again go through the process of admission control. This process is repeated until all tasks of a job have finished running.

Aside from decreasing messaging intensity, performing token-based admission control has another theorized benefit: the admitted tasks are likely to be more diverse in terms of resource requirements. The described thread allocation algorithm works by taking advantage of the differences in perceived efficiency between jobs and, by extension, the differences in resource requirements. A more diverse set of admitted tasks therefore gives the allocation algorithm more room to exploit those differences.
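The following sketch outlines the job-level variant of this scheme on the master node. Since the design was not implemented in this thesis, all names, the queue structure, and the starvation limit are hypothetical.

#include <cstdint>
#include <deque>

struct PendingJob {
    std::uint64_t id;
    double network_tokens;   // user-specified average network token requirement
    int backfill_count = 0;  // how often a later job was admitted before this one
};

class TokenAdmissionControl {
public:
    TokenAdmissionControl(double max_tokens, int starvation_limit)
        : max_tokens_(max_tokens), starvation_limit_(starvation_limit) {}

    void submit(PendingJob job) { queue_.push_back(job); }
    void on_job_finished(double tokens) { running_tokens_ -= tokens; }

    // Returns true and fills `out` if a job can be started now; scans the queue
    // in arrival order so later jobs may backfill around a blocked head job.
    bool try_admit(PendingJob& out) {
        for (auto it = queue_.begin(); it != queue_.end(); ++it) {
            bool fits = running_tokens_ + it->network_tokens <= max_tokens_;
            bool starving = it == queue_.begin() &&
                            it->backfill_count >= starvation_limit_;
            if (fits || starving) {
                out = *it;
                running_tokens_ += it->network_tokens;
                if (it != queue_.begin()) ++queue_.front().backfill_count;
                queue_.erase(it);
                return true;
            }
        }
        return false;
    }

private:
    double max_tokens_;
    int starvation_limit_;
    double running_tokens_ = 0.0;
    std::deque<PendingJob> queue_;
};

The task-level variant described above would replace the per-job entry with a per-task entry and re-submit each finished task's successor through the same admission path.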

Chapter 4

Evaluation

This chapter will evaluate the solution as described in chapter 3. First, we will determine the best parameters for the network policy and the combination function. Once these policy parameters are set, we will evaluate the reallocation algorithm on various workloads. We will start with purely synthetic workloads, before evaluating more realistic and elaborate ones.

4.1 Methodology

PGX.D can be run on a cluster of heterogeneous machines. However, for reproducibility and explainability reasons, the machines used are all of a single kind. Unless otherwise noted, the experiments run on four machines, each equipped with a 36-core Intel Xeon E5-2699 v3 CPU clocked at 2.30 GHz. Intel HyperThreading is enabled, so 72 threads are available for allocation. All experiments are repeated 7 times. During tests where the rescheduler is enabled, the rescheduler thread is woken up every 50 ms. The size of the sliding window is set to 10. In other words: every 50 ms, the rescheduler takes decisions based on the token information of the last 500 ms.

4.1.1 Workloads

To gain a comprehensive idea of the scheduler's performance, we have assembled a wide variety of jobs, each with different behaviors and characteristics. Here, we will describe them.

CPU job performs a set amount of pause assembly instructions (i.e. it only spins). This job has a CPU usage of close to 100%. It receives the maximum efficiency factor from both the CPU policy (as the CPU usage is high) and the network policy (as it sends no messages), and is therefore considered a highly efficient job.


Network job only sends messages to each corresponding job, and concurrently handles incoming messages. This job achieves a high network bandwidth utilization and saturates the system's networking capabilities. The job can be parameterized by a network/cpu ratio, which specifies the fraction of time the job spends on messaging. In the rest of the time, the job executes nop instructions. For example, if the ratio is 0.3, the job's work operator will spend 30% of its time sending messages and 70% of its time spinning. The other operator will still handle incoming messages as usual. The periods of message sending and spinning are interleaved. This parameter is introduced to emulate more realistic workloads, as jobs typically do not exclusively send messages.

SELECT graph query job executes a graph query that generates a resulting table. The rows of the table consist of vertices a, b, c, where a has an outgoing edge to b, and b has an outgoing edge to c. A shorthand notation for this query is (a) → (b) → (c). With the graphs used in our experiments, the number of rows could easily be in the billions. To construct the final frame, this job will need to spend a significant portion of its time allocating memory.

COUNT graph query job counts the occurrences of (a) → (b) → (c) in a graph. This job is similar to the SELECT job, and therefore, in order to understand the differences between their results, it makes sense to compare them:

• The COUNT job does strictly less work than the SELECT job; the former does not need to construct a (large) resulting table.

• As the COUNT job does not create a large resulting table, it tries to allocate much less memory.

• As both jobs need to send an equal number of messages, but the COUNT job is shorter in duration, the job sends messages much more intensely.

GROUP-BY job resembles the SQL "GROUP BY" construct [2]. It aggregates groups of rows, thereby turning a large table into a smaller one. Both the input and the resulting table are partitioned across all nodes.

PageRank job executes the PageRank algorithm with a fixed number of rounds. Note that there are synchronization barriers after each round; if the workload is imbalanced across machines, the job on certain machines might spend a significant portion of its time waiting at the barrier.

All jobs are configured to take an equal amount of time under the baseline fair scheduler. In all experiments, we measure the average latencies of each type of tested job, the average latencies of all jobs, and the time it takes for the experiment (i.e. the last job) to finish. We call the last measurement the experiment latency.


4.2 Parameter configuration

The message handling prioritization, the network policy, and the combination function as described in chapter 3 were defined with yet-to-be-specified parameters. We have argued that parameterizing these functions makes them more generic and applicable to various distributed systems. In order to apply them to our system, we designed several experiments that help us find the best parameters. These experiments can serve as a guide to finding the best parameters for other systems as well.

4.2.1 Message handling prioritization

Message handling prioritization reduces the peak intensity of message sending, thereby making network contention less likely. The open question is how much message handling should be prioritized over the work operator. To determine this, we designed an experiment that evaluates the influence of various levels of message handling prioritization. Specifically, we run two jobs alongside each other: one CPU job and one network job. We vary the network/cpu ratio of the network job, such that the network job represents a wide range of workloads. The influence of the message handling prioritization is measured as the number of blocks per second. Recall that threads in a job block if they attempt to use an unavailable resource. In the case of the network job, blocks are primarily issued due to reaching the sending capability limit of the system. The result of this experiment is shown in figure 4.1a.

Figure 4.1a shows that for the network/cpu ratio range of [0.2, 1.0], message handling prioritization has a minor effect. This is because, even when prioritizing message handling completely, the messaging intensity is still higher than the system can handle. Typically, realistic workloads do not fall within the higher end of the network/cpu ratio range in any case. However, for a low ratio, the difference in performance is profound. It is important to note that prioritizing message handling completely does not actually prevent messages from being sent altogether; sending is only deferred while there are unhandled received messages.

It is worth considering a smaller, lower range of ratios and examining the prioritization's effects there. These results are shown in figure 4.1b. Any prioritization gives overall positive results. However, the difference between the various levels of prioritization is not profound. While for this restricted range higher prioritization levels appear to perform better, it cannot be said with complete confidence that the higher the prioritization, the better the performance will be. Nonetheless, in any future experiment where prioritization is enabled, the prioritization level is set to the highest.


Figure 4.1: The effects of message handling prioritization on various network/cpu ratios, measured in blocks per second (normalized by the baseline). On the x-axis, the top and bottom numbers refer to the message handling and work operator prioritization, respectively. (a) wide range of ratios; (b) restricted range of ratios.


4.2.2 Network policy configuration

Recall that the network policy is defined as:

f_network(x) = 1 if x < c, and c/x otherwise

Here, x is the number of network tokens, and c is the parameter that signifies at what point the policy starts penalizing the job (inversely to the number of tokens). The higher the value of c, the less aggressive this policy is in penalization. We want to choose the highest value of c that is still effective in increasing performance. This desire stems from the fact that, while choosing a lower number might result in better performance for the workload of this sensitivity analysis, we also do not want to overfit on this workload. Choosing a higher c makes this policy less aggressive in penalization and likely more robust.

The workload of this analysis again consists of two jobs, the CPU job and the network job, and the network job is again parameterized by the network/cpu ratio. We measure the latencies of both jobs and the intensity of blocks while varying both the network/cpu ratio and c. As the CPU job does not consume any network tokens, the value of the network policy will always be 1 for it. The network policy with its parameter c will only affect the network job. The CPU policy is not in use for this experiment, and will be considered later. The results of this experiment are shown in figures 4.2 to 4.5.

Figure 4.2 shows the latencies of the CPU job. Low values of c correspond with a lower CPU job latency. This is to be expected; with lower values, the network job will have a low efficiency factor compared to the CPU job, thus giving the CPU job a higher allocation of threads. In terms of latency, the CPU job will always benefit from a higher allocation, as its workload is highly parallelizable. With high network/cpu ratios, even less aggressive values of c result in a high speedup for the CPU job. This is because the network job has so many network tokens that even a less aggressive policy will give it a low efficiency factor.

If we instead consider the network job latencies as shown in figure 4.3, we see a more interesting pattern. Aggressively penalizing the network job (the leftmost columns) disadvantages it in terms of performance compared to less aggressive policies. This conforms to expectation; the scheduler is designed in such a way that it might disadvantage certain less efficient jobs in favor of highly efficient ones. More interesting is the sweet spot in performance where c ∈ [8, 64]. We hypothesize that in this range, the policy is aggressive enough to mitigate contention for network buffers (compared to higher values of c), while not disadvantaging the network job too much (compared to lower values of c).


Figure 4.2: Network policy configuration; CPU job latencies (seconds) for varying network/cpu ratio and c.

Figure 4.3: Network policy configuration; network job latencies (seconds) for varying network/cpu ratio and c.

Figure 4.4: Network policy configuration; average job latencies.

We see that the CPU job and the network job are affected differently by c. To recap, the CPU job benefits from the lowest value of c, whereas the network job has more of a sweet spot where c ∈ [8, 64]. While the rescheduler inherently allocates threads in favor of highly efficient jobs, we would still like to maintain some degree of fairness to each job, if possible. Figure 4.4 shows the average latencies of both jobs, and figure 4.5 shows the experiment latencies. Both display a pronounced sweet spot in the range of c ∈ [8, 64], where both jobs experience a benefit. In any further experiment, the value c = 128 is used. This value is purposefully higher than the "optimal" value from the experiment results, because we desire to use the least aggressive policy that still displays good results, which c = 128 conforms to. Also note that this experiment strictly compares the effects of several levels of aggressiveness of the network policy, but does not compare the performance of the policy against a fair scheduling baseline. This is investigated in sections 4.3 to 4.5.

4.2.3 Combination function configuration

Both the CPU policy and the network policy have now been fully defined. For the sake of completeness, we will list the policies again:

f_CPU(x) = x^2

Figure 4.5: Network policy configuration; total experiment latencies.

f_network(x) = 1 if x < 128, and 128/x otherwise

Section 3.3.3 explained the rationale and form of a possible combination function. In summary, the network policy should be prioritized whenever the number of network tokens of all jobs on a machine (a value we call p) is high. Otherwise, the CPU policy should be prioritized. A formula h_d(p) that determines the weight of the CPU policy and satisfies our specified condition was given in section 3.3.3 and is listed again here:

h_d(p) = max(0, 1 − (p/d)^2)

The weight of the network policy is given as (1 − h_d(p)). Therefore, to combine both policies, the function is:

g_{p,d}(x_CPU, x_network) = h_d(p) · f_CPU(x_CPU) + (1 − h_d(p)) · f_network(x_network)

Note that the function h_d(p) has an unspecified parameter d. This parameter determines at what point the network policy should be fully prioritized at


the expense of the CPU policy. To find a valid value for d, we employ the same type of test as for the network policy configuration, where we varied the network/cpu ratio and the policy parameter (in this case d). Instead of performing the experiment with only one CPU job and one network job, we now run four CPU jobs and four network jobs. The reason why more jobs (specifically more network jobs) are necessary for this experiment is that p is the sum of the network tokens of all jobs. If only one network job is running, then p is equal to x_network of that job. It therefore gives us no information that the network policy does not already have as an input. If more network jobs are running, however, then p is different from the per-job x_network, and says something about the total state of the machine. As before, we vary the network/cpu ratio and the value of the parameter (now d). The range of d in this experiment is [c, 32768]. For our system, the upper value is much higher than the values of p typically encountered. The results are shown in figures 4.6 and 4.7.

Figure 4.6: Combination function configuration; CPU job latencies.

Firstly, we consider the latencies of the CPU jobs as shown in figure 4.6. There does not seem to be a single optimal value for d, as the latency is fairly uniform across the entire evaluated range of d. However, for very high values of d (i.e. when prioritizing the CPU policy), the performance is negatively affected, as can be seen in the lighter column to the right. Also interesting to note: for lower network/cpu ratios, the value of d required to maintain good performance is also lowered.

Figure 4.7: Combination function configuration; network job latencies.

The latencies of the network jobs are shown in figure 4.7. One can observe that the latencies decrease as the network/cpu ratio decreases, even though the amount of work being done increases. This supports the hypothesis outlined in section 3.4 that interleaving networking and non-networking workloads is beneficial. This figure also supports that setting the value of d too high results in worsened performance. It is interesting to observe, with a practical example, how the efficiency factors are now combined. Figure 4.8 shows such an example for a SELECT graph query job.

4.3 Microbenchmarks

To evaluate the policy parameters, we have relied upon synthetic microbenchmarks. So far, the experiments have only compared a range of policy parameters (e.g. c ∈ [1, 2048]). What has not been investigated is the employment of the CPU and network policies against a baseline. In this section, we will do exactly that. Prioritizing the message handling and rescheduling threads are two orthogonal approaches. In this section, as well as in all further experiments in this chapter, the behavior of fair thread allocation and fair operator allocation (baseline) will be evaluated against prioritizing the message


handling (prioritization), rescheduling threads based on tokens (reallocation), and finally against both approaches used concurrently (prio. + realloc.).

Figure 4.8: Practical example of efficiency factor combination, using a SELECT graph query job (CPU policy, network policy, and combined factor over time).

In some experiments (including this one), we additionally visualize the scheduling decisions taken by the baseline fair scheduler and by our solution with prioritization and reallocation enabled. These decisions are shown in two different figures, but the x-axis (while unlabeled) is of equal length in both figures. Also, the scheduling figures are visualized for only one repetition on only one machine. No aggregation is performed.

We start with a workload that we are familiar with and have already executed several times, which consists of 50% CPU jobs and 50% network jobs. The network/cpu ratio is set to 1.0, so the network job only sends messages and does not intermittently spin. This workload is executed for several numbers of concurrent jobs. For example, when the number of concurrent jobs is 8, 4 CPU jobs and 4 network jobs are running. The results from this experiment are shown in figures 4.9a to 4.9d. In these figures, "baseline" overlaps with "prioritization", and "reallocation" with "prio. + realloc.".

It serves to first outline the expectation for CPU job latencies. If the thread rescheduler allocates all threads to the CPU job (i.e. doubles the CPU job allocation compared to the baseline), then the CPU job latency should halve. From figure 4.1a, we expect the prioritization of message handling to display little to no effect for this workload. In figure 4.9a, this prediction does indeed


Figure 4.9: Workload consisting of 50% CPU & 50% network jobs. (a) CPU job latencies; (b) network job latencies; (c) average job latencies; (d) experiment latencies; (e) fair allocation over time; (f) token-based thread allocation over time. Latencies are shown as a factor of the baseline with 2 jobs.

hold for the CPU job. As the latencies are halved, we know that the allocation has doubled. To confirm, we consider the fair scheduling actions and the token-based actions, which are visualized in figures 4.9e and 4.9f respectively. Indeed, the token-based scheduler reallocates all threads to the CPU job.

Recall also that each job has a minimum allocation. The rescheduler cannot exactly double the allocation of the CPU jobs, as that would cause the network jobs to receive no threads. Because of this per-job minimum, the effects of rescheduling on CPU jobs become more muted as the number of jobs in the system increases. In fact, if that number is equal to the number of cores, then the rescheduling algorithm has no effect.

As a result of the reallocation, each network job receives the minimum allocation possible (2 threads). Although the allocation is smaller, figure 4.9b shows that this actually improves the performance of the network jobs as well, because of lowered contention and context switching. Again, the improvements are more muted at a high number of concurrent jobs, because then each network job receives the minimum allocation, which still makes the total allocation to network jobs fairly high.

In chapter 1, we outlined that the goal of prioritization and reallocation is to improve the job throughput, while not disadvantaging the experiment execution time (i.e. when the last job finishes) too much. Figure 4.9c shows the average job latencies and figure 4.9d shows the experiment latencies. We measure a maximum average job latency decrease of 35.47%, which suggests that the throughput of this system would be significantly increased. Also, the experiment latencies have all improved, by up to 28.56%.

In the previous experiment, prioritization did not have any measurable effect. From figure 4.1a, this was to be expected, as the messaging intensity is too high for prioritization to give any benefits. Instead, if the messaging intensity of the network job is decreased (by setting the network/cpu ratio to 0.3), we observe a different behavior, as can be seen in figures 4.10a to 4.10d. The CPU job latencies (figure 4.10a) are expectedly not affected by prioritization, because CPU jobs have no message handling operator, so their behavior remains the same. Compared to the workload where the network/cpu ratio was equal to 1.0 (figure 4.9a), the CPU jobs receive less of a benefit, because the network jobs are not penalized as much (as verified in figures 4.10e and 4.10f). The network job latencies (figure 4.10b) experience a minor positive effect from prioritization, which propagates to the aggregate latencies (figures 4.10c and 4.10d).


Figure 4.10: Workload consisting of 50% CPU & 50% network jobs (with network/cpu ratio 0.3). (a) CPU job latencies; (b) network job latencies; (c) average job latencies; (d) experiment latencies; (e) fair allocation over time; (f) token-based thread allocation over time.


Figure 4.11: Workload consisting of 50% CPU & 50% SELECT graph query jobs. (a) CPU job latencies; (b) SELECT graph query job latencies; (c) average job latencies; (d) experiment latencies.

4.4 Mixture of CPU and realistic jobs

So far, only synthetic benchmarks have been considered, and the implementation performs well for those. Of course, these synthetic microbenchmarks cannot give an exhaustive view of the performance of realistic jobs. The following experiments will run CPU jobs alongside realistic jobs of a certain type, taken from those described in section 4.1.1. The advantage of such a test is twofold: firstly, as each type of realistic job is run alongside the same kind of job (i.e. the CPU job), the effects of rescheduling and prioritization on different realistic jobs can be investigated. Secondly, the CPU job's behavior is known and predictable. For example, its latency is inversely proportional to its allocation.

First, we will investigate the performance of a workload consisting of CPU jobs and SELECT graph query jobs (figures 4.11a to 4.11d). The CPU job latencies (figure 4.11a) show a familiar decrease from receiving a higher allocation, with prioritization again not influencing the performance numbers. More


Figure 4.12: Workload consisting of 50% CPU & 50% COUNT graph query jobs. (a) CPU job latencies; (b) COUNT graph query job latencies; (c) average job latencies; (d) experiment latencies.

interesting are the SELECT job performance numbers (figure 4.11b), which are somewhat negatively affected by reallocation due to a lowered thread allocation. However, prioritization does give a sizable benefit for this workload. As the influence of prioritization is similar to what was seen with network jobs with a lower network/cpu ratio, we can also conclude that a low network/cpu ratio does meaningfully simulate realistic workloads. Combining reallocation and prioritization gives an even higher performance benefit.

Now it is interesting to compare this workload to another that is similar, yet with prominently different characteristics. One such workload combines CPU jobs with COUNT graph query jobs. It is known that a COUNT job is more network intensive than its SELECT counterpart. Figure 4.12a shows no major changes in CPU latencies compared to those run alongside the SELECT query (figure 4.11a). The difference mostly lies between "reallocation" and "prio. + realloc.". While one might think that prioritization should not have any effect on the performance of CPU jobs, this is only true to a certain degree. Prioritization does not affect CPU jobs directly. What


it does affect are jobs that use the network (of which the COUNT job is one). Prioritization has changed that job's behavior in such a way that it is deemed more efficient by our policies, resulting in a higher thread allocation. Of course, there is a fixed number of threads, and the CPU job now does not get as many threads as it would have gotten otherwise. This phenomenon was also visible with the CPU jobs in the previous experiment (figure 4.11a); however, there this effect was more muted.

Because the CPU job latencies were slightly worse under the influence of prioritization, one would hope that prioritization has a positive effect on the COUNT job latencies. Unfortunately, such an effect is not measured (figure 4.12b).

Both graph query jobs are fairly network intensive. It is worthwhile to investigate the effects of jobs that do not saturate the network capacities, such as a GROUP-BY or PageRank job. For the sake of consistency, these jobs will again be run alongside CPU jobs.

Figure 4.13: Workload consisting of 50% CPU & 50% GROUP-BY jobs. (a) CPU job latencies; (b) GROUP-BY job latencies.

Figures 4.13a and 4.13b show the results for the GROUP-BY experiment. Both prioritization and reallocation have little to no effect, which is understandable, as from the point of view of our policies, the CPU jobs and the GROUP-BY jobs are as efficient as each other. The GROUP-BY job both sends few messages (thus unpunished by the network policy) and has a high CPU usage (thus unpunished by the CPU policy). These figures let us conclude two points: one is that the rescheduling algorithm is fair if the efficiency factors are equal; the other is that the rescheduler carries little to no overhead.

The same experiment is repeated, but using PageRank jobs instead of GROUP-BY jobs (figures 4.14a and 4.14b). As both GROUP-BY and PageRank jobs are CPU intensive, we expect to see similar results, which we do. An interesting observation in the case of PageRank jobs, however, is that prioritization (without reallocation) has a positive influence on the PageRank job latencies,


Figure 4.14: Workload consisting of 50% CPU & 50% PageRank jobs. (a) CPU job latencies; (b) PageRank job latencies.

and it is not immediately clear why. The reason for this behavior will be given in the next section.

4.5 Mixture of workloads

We depart from the rigid structure of pairing CPU jobs with other jobs of a single type. Instead, in this section, we will execute workloads with a higher variety of jobs, as well as pair realistic workloads in single experiments. First, we consider the workload of 50% PageRank jobs and 50% SELECT jobs, whose results are visualized in figure 4.15. The large gains in performance are due to the PageRank job, especially due to the message handling prioritization. Recall that PageRank executes over several rounds. In between rounds, there is a synchronization barrier. With the baseline fair scheduler, if a job is waiting on such a barrier, it receives the minimum allocation until it has passed the barrier. In figure 4.15e, we see that this happens often. However, with prioritization and reallocation enabled, waiting on barriers almost never occurs (see figure 4.15f, where prioritization and reallocation are both enabled). This greatly increases performance. The most likely reason why the PageRank job spends less time waiting on barriers is that the round duration of the job across machines is less disparate.

We now run an experiment with a workload consisting of four job types: PageRank, GROUP-BY, CPU and SELECT. The results can be seen in figure 4.16. It is interesting to first look at the fair allocation graph. While the fair allocation scheme should give each job an equal allocation, this is not what happens. The PageRank job often receives the minimum allocation. This was also seen in the previous experiment. The cause for this is waiting on synchronization barriers. With prioritization and reallocation enabled, the


Figure 4.15: Workload consisting of 50% PageRank jobs & 50% SELECT graph query jobs. (a) SELECT graph query job latencies; (b) PageRank job latencies; (c) average job latencies; (d) experiment latencies; (e) fair allocation over time; (f) token-based thread allocation over time.


Figure 4.16: Workload consisting of 25% CPU, 25% GROUP-BY, 25% SELECT graph query & 25% PageRank jobs. (a) CPU job latencies; (b) GROUP-BY job latencies; (c) SELECT graph query job latencies; (d) PageRank job latencies; (e) average job latencies; (f) total experiment latencies; (g) fair allocation over time; (h) token-based thread allocation over time.

PageRank job does not wait on barriers any more. This results in a massive 54% decrease in latency. Figure 4.16d shows that this is due to prioritization alone. Recall that jobs wait less on synchronization barriers if the job duration across machines up to the point of reaching the barrier is more uniform. This result supports the argument that prioritization therefore improves the "balance" between jobs.

As far as the other jobs go, the CPU job receives an almost doubled allocation when both prioritizing and reallocating, which results in its latencies halving. While we previously established that prioritization should not affect the performance of the CPU job (as it has no message handling to prioritize), there is a difference between only enabling reallocation and enabling reallocation with prioritization. Prioritization does not directly affect the CPU job, but it does make the PageRank job much faster. As the PageRank job exits the system faster, there is more thread time available to be allocated to other jobs, including the CPU job.

Both the SELECT job and the GROUP-BY job perform slightly worse under prioritization and reallocation. This is due to those jobs receiving a lower allocation. In chapter 1, we stated that the goal is to improve the job throughput compared to a fair scheduler, without negatively affecting the experiment latency too much. This is exactly the case in this experiment. With 8 concurrent jobs, the improvement in average latency is 24.55%, while the increase in experiment latency is just 4.59%.

4.6 Varying number of machines

All experiments so far have been run on four machines. As the scheduler is presented as generic, we would also like to show that it improves performance across distributed system sizes. We execute one of the previously encountered workloads: 1 CPU job and 1 COUNT job running concurrently. This workload is run on a varying number of machines, all of type Intel Xeon E5-2690 v3 with 24 cores running at 2.60GHz. These machines have fewer threads than the machines used in the previous experiments, so the scope for reallocation is also smaller. This experiment only serves to point out that the results are fairly similar even when varying the number of machines. The results are shown in figure 4.17.

From figure 4.12, we expect to see a mild increase in performance for both jobs. Indeed, the experiment latencies are decreased by up to 5.92%. The aggregate latency figures show that the difference between the baseline and "prio. + realloc." is roughly equal for varying numbers of machines. This is the most important takeaway from this experiment, and shows that the scheduler is largely agnostic to the size of the distributed system.


Figure 4.17: Workload consisting of 1 CPU & 1 COUNT graph query job. (a) CPU job latencies; (b) COUNT graph query job latencies; (c) average job latencies; (d) experiment latencies. Latencies are shown as a factor of the baseline with 1 node.

4.7 Results summary

In this chapter, we first configured the static prioritization of message handling. It turned out that any prioritization of message handling is beneficial, and the jobs are set to make their threads always work on handling incoming messages if there are any. We further configured the parameters of the network policy and the combination function, which were found to perform well at c = 128 and d = 1024, respectively. When there is a highly efficient job running, such as the CPU job, the scheduler gives this job a higher allocation, increasing its performance. Less efficient jobs are given a lower allocation, which still often increases their performance as well. This leads to a decrease in average latencies of up to 40%, and a decrease in experiment latencies of up to 30%. In virtually all experiments, both the average latencies and the experiment latencies of the token-based scheduler are lower than or equal to the fair scheduling baseline. The overheads of the scheduler and related activities (such as token collection) were shown not to noticeably affect the performance, which is ideal, as there

are no detrimental effects from the scheduler merely running. An interesting observation from the results is that the job that relies heavily upon synchronization barriers (i.e. the PageRank job) received a 54% decrease in latency due to prioritization. This supports the hypothesis that considering the flow of incoming messages (through prioritization) balances the job's duration across machines, leading to less waiting on barriers. With more elaborate workloads, it can very well happen that the experiment latency increases because less efficient jobs are disadvantaged in favor of more efficient jobs. However, this effect is smaller than the decrease in average latency, making the trade-off well worth it, depending on the user's preference.


Chapter 5

Conclusion

In this thesis, we have presented a generic and modular thread scheduler that allocates threads dynamically during run time. The scheduler devises allocations based on user-defined resource tokens and policies. Users can, through the scheduler's API, tailor the scheduler to their own distributed system. We also presented a distributed system model, which is an abstraction of the PGX.D system used in the evaluation. For this model, we defined two policies that improve performance by mitigating contention and allocating threads to jobs that make better use of them. The outgoing message policy penalizes jobs that oversaturate the network capabilities. The CPU policy allocates more threads to jobs with a higher CPU usage. Finally, the handling of incoming messages is prioritized, which both decreases the likelihood of jobs oversaturating the network and improves the balance of job duration across machines.

When this solution is applied to PGX.D, the average job latencies are decreased significantly (up to 40%) compared to the fair scheduling baseline, and individual job latencies are down by up to 50%. This is both due to increasing the thread allocation of highly efficient jobs and due to decreasing the contention in other jobs. The experiment latencies (the time it takes for all jobs to finish) were also virtually always less than the baseline. This makes our scheduling solution a formidable improvement upon fair schedulers with regard to performance. The prioritization of message handling indeed improves the balance of job duration across machines.

The built-in policies could be used as-is in distributed systems, although there are parameters that need to be tuned to the system and workload. This configuration is performed for the PGX.D system in section 4.2, and can serve as a guide for other systems as well. The prioritization of message handling gives positive results by itself, and is


fairly unintrusive in terms of implementation. It is therefore recommended that other systems, if anything, at least implement this aspect of our solution.

5.1 Future work

This report is written in conclusion of a master thesis, and as with any master thesis, the research is limited in scope. Several topics have been touched upon, but were ultimately unfeasible to fully pursue in this thesis. One such topic is admission control. An approach for admission control was described in section 3.5. Due to time constraints, this design has unfortunately not been fully implemented and evaluated. This could be done as future work.

Regarding the built-in policies, we have presented some parameterized functions that, from interacting with and measuring the system, are known to work well for PGX.D. In fact, we believe these concepts will largely carry over to other distributed systems. However, no analysis has been done on other systems. Our conclusions could be formulated more strongly if the policies were implemented and evaluated on other distributed systems such as Apache Spark [41].

The policies as described in chapter 3 are derived through expert knowledge. The policy functions are defended based on desired characteristics. For example, it is desired that the network policy function maps a low number of network tokens to a high efficiency factor, and a high number to a low efficiency factor. A function is chosen that satisfies these characteristics. However, no claims regarding optimality are made. Thread scheduling solutions such as FACT [31] used a reinforcement learning approach to train their policies. This approach could also be applied to our scheduler, either to train the policy functions and/or to train the combination function.

The type of distributed scheduler where all nodes coordinate thread allocation was not touched upon in this thesis. We argued that machine-local scheduling, when considering both the number of incoming and outgoing messages, is able to achieve a more global overview of the entire system. Our thread reallocation solution did in fact give an improvement in performance on all machines. Still, it could be fruitful to explore how information from other machines could be leveraged to improve thread scheduling.

Bibliography

[1] Intel Cilk Plus. https://www.cilkplus.org/. Accessed: 20.08.2019.

[2] Oracle database SQL language reference. https://docs.oracle.com/database/121/SQLRF/. Accessed: 21.08.2019.

[3] PGQL 1.2 specification. http://pgql-lang.org/spec/1.2/. Accessed: 21.08.2019.

[4] POSIX Programmer’s Manual, 2013.

[5] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2015.

[6] Y. Amir, B. Awerbuch, A. Barak, R. S. Borgstrom, and A. Keren. An opportunity cost approach for job assignment in a scalable computing cluster. IEEE Transactions on Parallel and Distributed Systems, 11(7):760–768, July 2000.

[7] M. Banikazemi, D. Poff, and B. Abali. PAM: A novel performance/power aware meta-scheduler for multi-core systems. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Nov 2008.


[8] Brandon Barker. Message passing interface (MPI). In Workshop: High Performance Computing on Stampede, volume 262, 2015.

[9] OpenMP Architecture Review Board. OpenMP application programming interface, 2018.

[10] Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 285–300, Broomfield, CO, October 2014. USENIX Association.

[11] Calin Cascaval and David A. Padua. Estimating cache misses and locality using stack distances. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 150–159, New York, NY, USA, 2003. ACM.

[12] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46–55, Jan 1998.

[13] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[14] Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, and Willy Zwaenepoel. Hawk: Hybrid datacenter scheduling. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 499–510, Santa Clara, CA, July 2015. USENIX Association.

[15] D. G. Feitelson and A. M. Weil. Utilization and predictability in scheduling the IBM SP2 with backfilling. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pages 542–546, March 1998.

[16] Gartner Inc. Gartner forecasts worldwide public cloud revenue to grow 17.5 percent in 2019. https://www.gartner.com/en/newsroom/press-releases/2019-04-02-gartner-forecasts-worldwide-public-cloud-revenue-to-g, April 2019. Accessed: July 29, 2019.

[17] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 323–336, Berkeley, CA, USA, 2011. USENIX Association.


[18] Tim Harris and Stefan Kaestle. Callisto-RTS: Fine-grain parallel loops. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 45–56, Santa Clara, CA, July 2015. USENIX Association.

[19] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pages 295–308, Berkeley, CA, USA, 2011. USENIX Association.

[20] Sungpack Hong, Siegfried Depner, Thomas Manhardt, Jan Van Der Lugt, Merijn Verstraaten, and Hassan Chafi. PGX.D: A fast distributed graph processing engine. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 58:1–58:12, New York, NY, USA, 2015. ACM.

[21] Y. Jiang, X. Shen, C. Jie, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 220–229, Oct 2008.

[22] Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chaliparambil, Giovanni Matteo Fumarola, Solom Heddaya, Raghu Ramakrishnan, and Sarvesh Sakalanaga. Mercury: Hybrid centralized and distributed scheduling in large shared clusters. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 485–497, Santa Clara, CA, July 2015. USENIX Association.

[23] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54–66, May 2008.

[24] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Algorithms and Combinatorics. Springer Berlin Heidelberg, 2007.

[25] David A. Lifka. The ANL/IBM SP scheduling system. In Dror G. Feitelson and Larry Rudolph, editors, Job Scheduling Strategies for Parallel Processing, pages 295–303. Springer, 1995.

[26] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, pages 450–462, New York, NY, USA, 2015. ACM.


[27] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, and Alexandra Fedorova. The Linux scheduler: A decade of wasted cores. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys '16, pages 1:1–1:16, New York, NY, USA, 2016. ACM.

[28] Andreas Merkel, Jan Stoess, and Frank Bellosa. Resource-conscious scheduling for energy efficiency on multicore processors. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, pages 153–166, New York, NY, USA, 2010. ACM.

[29] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number: SIDL-WP-1999-0120.

[30] Chuck Pheatt. Intel® Threading Building Blocks. J. Comput. Sci. Coll., 23(4):298–298, April 2008.

[31] Kishore Kumar Pusukuri, David Vengerov, Alexandra Fedorova, and Vana Kalogeraki. FACT: A framework for adaptive contention-aware thread migrations. In Proceedings of the 8th ACM International Conference on Computing Frontiers, CF '11, pages 35:1–35:10, New York, NY, USA, 2011. ACM.

[32] Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, and Minlan Yu. Hopper: Decentralized speculation-aware cluster scheduling at scale. SIGCOMM Comput. Commun. Rev., 45(4):379–392, August 2015.

[33] Nicholas P. Roth, Vasileios Trigonakis, Sungpack Hong, Hassan Chafi, Anthony Potter, Boris Motik, and Ian Horrocks. PGX.D/Async: A scalable distributed graph pattern matching engine. In Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems, GRADES'17, pages 7:1–7:6, New York, NY, USA, 2017. ACM.

[34] Mehdi Sheikhalishahi, Richard M. Wallace, Lucio Grandinetti, José Luis Vazquez-Poletti, and Francesca Guerriero. A multi-dimensional job scheduling. Future Generation Computer Systems, 54:123–131, 2016.

[35] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.

[36] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Characterization of backfilling strategies for parallel job scheduling. In Proceedings. International Conference on Parallel Processing Workshop, pages 514–519, Aug 2002.


[37] Kai Tian, Yunlian Jiang, and Xipeng Shen. A study on optimally co-scheduling jobs of different lengths on chip multiprocessors. In Proceedings of the 6th ACM Conference on Computing Frontiers, CF '09, pages 41–50, New York, NY, USA, 2009. ACM.

[38] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 5:1–5:16, New York, NY, USA, 2013. ACM.

[39] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 18:1–18:17, New York, NY, USA, 2015. ACM.

[40] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, EuroSys '10, pages 265–278, New York, NY, USA, 2010. ACM.

[41] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.

[42] Sergey Zhuravlev, Sergey Blagodurov, and Alexandra Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 129–142, New York, NY, USA, 2010. ACM.

[43] Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Comput. Surv., 45(1):4:1–4:28, December 2012.

[44] D. Zotkin and P. J. Keleher. Job-length estimation and performance in backfilling schedulers. In Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469), pages 236–243, Aug 1999.


Declaration of originality

The signed declaration of originality is a component of every semester paper, Bachelor’s thesis, Master’s thesis and any other degree paper undertaken during the course of studies, including the respective electronic versions.

Lecturers may also require a declaration of originality for other written papers compiled for their courses.

I hereby confirm that I am the sole author of the written work here enclosed and that I have compiled it in my own words. Parts excepted are corrections of form and content by the supervisor.

Title of work (in block letters):

Dynamic Thread Allocation for Distributed Jobs using Resource Tokens

Authored by (in block letters):
For papers written by groups the names of all authors are required.

Name(s): Smesseim
First name(s): Ali

With my signature I confirm that
− I have committed none of the forms of plagiarism described in the 'Citation etiquette' information sheet.
− I have documented all methods, data and processes truthfully.
− I have not manipulated any data.
− I have mentioned all persons who were significant facilitators of the work.

I am aware that the work may be screened electronically for plagiarism.

Place, date: Zürich, 25.08.2019
Signature(s):

For papers written by groups the names of all authors are required. Their signatures collectively guarantee the entire content of the written paper.