heSRPT: Optimal Scheduling of Parallel Jobs with Known Sizes
Benjamin Berg (Carnegie Mellon University), Rein Vesilo (Macquarie University), Mor Harchol-Balter (Carnegie Mellon University)

ABSTRACT

Nearly all modern data centers serve workloads which are capable of exploiting parallelism. When a job parallelizes across multiple servers it will complete more quickly, but jobs receive diminishing returns from being allocated additional servers. Because allocating multiple servers to a single job is inefficient, it is unclear how best to share a fixed number of servers between many parallelizable jobs. In this paper, we provide the first closed form expression for the optimal allocation of servers to jobs. Specifically, we specify the number of servers that should be allocated to each job at every moment in time. Our solution is a combination of favoring small jobs (as in SRPT scheduling) while still ensuring high system efficiency. We call our scheduling policy high-efficiency SRPT (heSRPT).

1. INTRODUCTION

In this paper we consider a typical scenario where a data center composed of N servers is tasked with completing a set of M parallelizable jobs, where typically M is much smaller than N. In our scenario, each job has a different inherent size (service requirement) which is known up front to the system. In addition, each job can be run on any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads such as training neural networks using TensorFlow [1, 8]. Our objective is to allocate servers to jobs so as to minimize the mean flow time across all jobs, where the flow time of a job is the time until the job leaves the system. What makes this problem difficult is that jobs receive a concave, sublinear speedup from parallelization: jobs have a decreasing marginal benefit from being allocated additional servers (see Figure 1). Hence, in choosing a job to receive each additional server, one must keep the overall efficiency of the system in mind. In this paper, we will derive the optimal policy for allocating servers to jobs when all jobs follow a realistic sublinear speedup function.

The optimal allocation policy will depend heavily on how parallelizable the jobs are. To see this, first consider the case where jobs are embarrassingly parallel. In this case, we observe that the entire data center can be viewed as a single server that can be perfectly utilized by or shared between jobs. Hence, from the single server scheduling literature, it is known that the Shortest Remaining Processing Time policy (SRPT) will minimize the mean flow time across jobs. By contrast, consider the case where jobs are hardly parallelizable and a single job receives very little benefit from additional servers. In this case, the optimal policy is to divide the system equally between jobs, a policy called EQUI. When jobs are neither embarrassingly parallel nor completely sequential, the optimal policy with respect to mean flow time must split the difference between EQUI and SRPT, favoring short jobs while still respecting the overall efficiency of the system.

Figure 1: A variety of speedup functions of the form s(k) = k^p (speedup s(k) plotted against the number of servers k), shown with varying values of p (p = 0.25, 0.5, 0.75, 1). When p = 1 we say that jobs are embarrassingly parallel, and hence we consider cases where 0 < p < 1. Note that all functions in this family are concave and lie below the embarrassingly parallel speedup function (p = 1).

Our Model

Our model assumes that all M jobs are present at time t = 0. Job i is assumed to have some inherent size x_i where, WLOG, x_1 ≥ x_2 ≥ ... ≥ x_M. In general we will assume that all jobs follow the same speedup function, s : R+ → R+, which is of the form

    s(k) = k^p

for some 0 < p < 1. If a job i of size x_i is allocated k servers for its entire lifetime, it will complete at time x_i / s(k). In general, the number of servers allocated to a job can change over the course of the job's lifetime. It therefore helps to think of s(k) as a rate of service, where the remaining size of job i after running on k servers for a length of time t is

    x_i − t · s(k).

(WLOG we assume the service rate of a single server to be 1; more generally, we could assume the rate of each server to be µ, which would simply replace s(k) by s(k)µ in every formula.)

We choose the family of functions s(k) = k^p because these functions are sublinear and concave and can be fit to a variety of empirically measured speedup functions (see Figure 2). Similarly, [6] assumes s(k) = k^p where p = 0.5.

Figure 2: Various speedup functions of the form s(k) = k^p (dotted lines) which have been fit to real speedup curves (solid lines) measured from jobs in the PARSEC-3 parallel benchmarks [9]. The three jobs, blackscholes, bodytrack, and canneal, are best fit by the functions where p = .89, p = .82, and p = .69, respectively.

In general, we assume that there is some policy, P, which allocates servers to jobs at every time, t. When describing the state of the system, we will use m^P(t) to denote the number of remaining jobs in the system at time t, and x_i^P(t) to denote the remaining size of job i at time t. We also denote the completion time of job i under policy P as T_i^P. When the policy P is implied, we will drop the superscript.

We will assume that the number of servers allocated to a job need not be discrete. In general, we will think of the N servers as a single, continuously divisible resource. Hence, the policy P can be defined by an allocation function θ^P(t) where

    θ^P(t) = (θ_1^P(t), θ_2^P(t), ..., θ_M^P(t)).

Here, 0 ≤ θ_i^P(t) ≤ 1 for each job i, and ∑_{i=1}^{M} θ_i^P(t) ≤ 1. An allocation of θ_i^P(t) denotes that under policy P, at time t, job i receives a speedup of s(θ_i^P(t) · N). We will always assume that completed jobs receive an allocation of 0 servers.

We denote the optimal allocation function which minimizes mean flow time as θ*(t). Similarly, m*(t), x_i*(t), and T_i* denote the corresponding quantities under the optimal policy.
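To make these definitions concrete, here is a minimal Python sketch of the speedup and remaining-size formulas above (the function names and example numbers are our own, not from the paper):

```python
# Minimal sketch of the model; names and numbers are ours.
def s(k, p=0.5):
    """Speedup of a job running on k (possibly fractional) servers."""
    return k ** p

def remaining_size(x_i, k, t, p=0.5):
    """Remaining size of a job of inherent size x_i after running
    on k servers for a length of time t (never below zero)."""
    return max(x_i - t * s(k, p), 0.0)

def completion_time(x_i, k, p=0.5):
    """Completion time if the job holds k servers its whole lifetime."""
    return x_i / s(k, p)

# Diminishing returns: quadrupling the servers only halves the runtime.
print(completion_time(4.0, 4))    # 4 / 4**0.5  -> 2.0
print(completion_time(4.0, 16))   # 4 / 16**0.5 -> 1.0
```

Note that k need not be an integer, matching the assumption above that the N servers form a continuously divisible resource.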

Why Server Allocation is Counter-intuitive

Consider a simple system with N = 10 servers and M = 2 identical jobs of size 1, where s(k) = k^.5, and where we wish to minimize mean flow time. One intuitive argument would be that, since everything in this system is symmetric, the optimal policy would allocate half the servers to job one and half the servers to job two. Interestingly, while this does minimize the makespan of the jobs, it does not minimize their flow time. Alternately, a queueing theorist might look at the same problem and say that to minimize flow time, we should use the SRPT policy, allocating all servers to job one and then all servers to job two. However, this causes the system to be very inefficient. We will show that the optimal policy in this case is to allocate 75% of the servers to job one and 25% of the servers to job two. In our simple, symmetric system, the optimal allocation is very asymmetric! Note that this asymmetry is not an artifact of the form of the speedup function used. If we had instead assumed that s was Amdahl's Law [6] with a parallelizable fraction of f = .9, the optimal split would be to allocate 63.5% of the system to one of the jobs. If we imagine a set of M arbitrarily sized jobs, one suspects that the optimal policy favors shorter jobs, but calculating the exact allocations for this policy is not trivial.
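The 75/25 claim for s(k) = k^.5 is easy to check numerically. The sketch below (our own code, not from the paper) sweeps the fraction q of the N = 10 servers given to job one while both jobs are present; once job one finishes, job two receives all servers:

```python
# Numeric check of the two-job example: N = 10, two jobs of size 1,
# s(k) = k**0.5. Sweep q, the fraction of servers given to job one.
import numpy as np

N, p = 10, 0.5
s = lambda k: k ** p

def mean_flow_time(q):
    t1 = 1.0 / s(q * N)                # job one's completion time
    rem2 = 1.0 - t1 * s((1 - q) * N)   # job two's remaining size at t1
    t2 = t1 + rem2 / s(N)              # job two then gets all N servers
    return (t1 + t2) / 2

# For q >= 0.5, job one is (weakly) the first to finish, so rem2 >= 0.
qs = np.linspace(0.5, 0.99, 4901)
best = qs[np.argmin([mean_flow_time(q) for q in qs])]
print(round(best, 4))   # -> 0.75, matching the split claimed above
```

The minimum sits strictly between the EQUI split (q = 0.5) and the SRPT extreme (q near 1), which is exactly the trade-off the paper formalizes.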

Why Finding the Optimal Policy is Hard

At first glance, solving for the optimal policy seems amenable to classical optimization techniques. However, naive application of these techniques would require solving M! optimization problems (corresponding to each of the possible completion orders of the jobs), each consisting of O(M^2) variables (corresponding to each job's allocation between consecutive departures) and O(M) constraints (which enforce the completion order being considered). Furthermore, although these techniques could produce the optimal policy for a single problem instance, it is unlikely that they would yield a closed form solution. We instead advocate for finding a closed form solution for the optimal policy, which allows us to build intuition about the underlying dynamics of the system.
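To make the scale of this naive approach concrete, the sketch below sets up exactly one of these M! problems, for the fixed completion order (M, M−1, ..., 1), using SciPy's generic SLSQP solver. The code, job sizes, and solver choice are our own illustrative assumptions; the problem is nonconvex, so a generic solver is not even guaranteed to find the optimum for one order, let alone all M! of them:

```python
# Naive formulation for ONE fixed completion order (M, M-1, ..., 1):
# the O(M^2) variables are each job's allocation between consecutive
# departures; the O(M) constraints keep that completion order feasible.
# Illustrative sketch only; convergence to the optimum is not guaranteed.
import numpy as np
from scipy.optimize import minimize

N, p = 10, 0.5
x = np.array([3.0, 2.0, 1.0])          # job sizes, x_1 >= x_2 >= x_3
M = len(x)
s = lambda k: np.maximum(k, 0.0) ** p

def unpack(z):
    # theta[k, i]: share of the N servers held by job i+1 during the
    # k-th inter-departure interval; jobs 1..M-k are alive in interval k.
    theta, idx = np.zeros((M, M)), 0
    for k in range(M):
        alive = M - k
        theta[k, :alive] = z[idx:idx + alive]
        idx += alive
    return theta

def run(z):
    # Simulate the schedule; return (mean flow time, worst order slack).
    theta = unpack(z)
    rem, t, flows, slack = x.copy(), 0.0, [], np.inf
    for k in range(M):
        j = M - 1 - k                             # smallest alive job departs
        dt = rem[j] / max(s(theta[k, j] * N), 1e-9)
        rem[:j + 1] -= dt * s(theta[k, :j + 1] * N)
        if j > 0:
            slack = min(slack, rem[:j].min())     # others must not finish yet
        t += dt
        flows.append(t)
    return np.mean(flows), slack

n_vars = M * (M + 1) // 2
z0 = np.concatenate([np.full(M - k, 1.0 / (M - k)) for k in range(M)])
cons = [{'type': 'ineq', 'fun': lambda z: run(z)[1]}]
cons += [{'type': 'ineq',
          'fun': (lambda k: (lambda z: 1.0 - unpack(z)[k].sum()))(k)}
         for k in range(M)]
res = minimize(lambda z: run(z)[0], z0, method='SLSQP',
               bounds=[(1e-3, 1.0)] * n_vars, constraints=cons)
print(res.x[:M])   # shares while all 3 jobs remain; the closed form
                   # derived below gives (1/9, 3/9, 5/9) for comparison
```

Even at M = 3 this involves six allocation variables and one solver run for each of the 3! = 6 completion orders, and nothing in the solver's numeric output hints at the closed form derived in Section 3.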
2. PRIOR WORK

Despite the prevalence of parallelizable data center workloads, it is not known, in general, how to optimally allocate servers across a set of parallelizable jobs. There has been extensive work from the theoretical computer science community [7, 4, 5, 2] regarding how to schedule parallelizable jobs with speedup functions in order to minimize mean flow time. However, this work has concentrated on competitive analysis rather than direct optimization. Recently, the performance modeling community has considered this problem [3], but this work only considers jobs with unknown, exponentially distributed sizes. We therefore present the first closed form analysis of the optimal policy when parallelizable jobs have known sizes.

3. MINIMIZING MEAN FLOW TIME

The purpose of this section is to determine the optimal allocation function θ*(t) which defines the allocation for each job at any moment in time, t, and minimizes mean flow time. As previously noted, the space of potential allocation policies is too large to be amenable to direct optimization. Our approach is therefore to reduce this search space by proving a series of properties of the optimal allocation function, θ*(t) (Theorems 1-4). These properties will allow us to derive the optimal allocation function in Theorem 5.

Overview of Our Results

We begin by showing that the optimal allocation does not change between job departures, and hence it will suffice to consider the value of the allocation function only at times just after a job departure occurs. This is stated formally as Theorem 1.

Theorem 1. Consider any two times t_1 and t_2 where, WLOG, t_1 < t_2. Let m*(t) denote the number of jobs in the system at time t under the optimal policy. If m*(t_1) = m*(t_2), then

    θ*(t_1) = θ*(t_2).

Using just this very mild characterization of the optimal policy, we can show that the optimal policy completes jobs in Shortest-Job-First (SJF) order. This is stated in Theorem 2.

Theorem 2 (Optimal Completion Order). The optimal policy completes jobs in the order

    M, M−1, M−2, ..., 1

and hence jobs are completed in Shortest-Job-First (SJF) order.

Since jobs are completed in SJF order, we can conclude that, at time t, the jobs left in the system are specifically jobs 1, 2, ..., m(t). Besides the completion order, the other key property of the optimal allocation that we will exploit is the scale-free property. The scale-free property states that for any job i, job i's allocation relative to the jobs completed after job i (i.e. the jobs larger than job i) remains constant throughout job i's lifetime. The scale-free property is stated formally in Theorem 3.

Theorem 3 (Scale-free Property). Let t be a time when there are exactly i jobs in the system and hence m(t) = i. Consider any t′ such that t′ < t. Then,

    θ_i*(t′) / ∑_{j=1}^{i} θ_j*(t′) = θ_i*(t).

The scale-free property reveals the optimal substructure of the optimal policy: knowing the correct allocation to each job when there are M−1 jobs in the system reduces the problem of finding the correct allocation given M jobs to a single variable optimization problem.

The final property needed to reduce our search space is the size-invariant property, given in Theorem 4.

Theorem 4 (Size-invariant Property). Consider any two sets of jobs, A and B, each containing m(t) jobs at time t. If θ_A*(t) and θ_B*(t) are the optimal allocations to the jobs in sets A and B at time t, respectively, we have that

    θ_A*(t) = θ_B*(t).

That is, the allocation function depends only on the number of unfinished jobs in the system, not their remaining sizes. This is a counter-intuitive result because one might imagine that the optimal allocation should be different if two very differently sized jobs are in the system instead of two equally sized jobs.

Finally, Theorem 5 provides the optimal allocation function which minimizes mean flow time.

Theorem 5 (Optimal Allocation Function). At time t, when m(t) jobs remain in the system,

    θ_i*(t) = (i/m(t))^(1/(1−p)) − ((i−1)/m(t))^(1/(1−p))   if 1 ≤ i ≤ m(t),
    θ_i*(t) = 0                                             if i > m(t).

Note that the optimal allocation to job i is independent of its remaining size, in accordance with Theorem 4. Given the optimal allocation function θ*(t), we can also explicitly compute the optimal mean flow time. This is stated in Theorem 6.

Theorem 6 (Optimal Mean Flow Time). Given a set of M jobs of size x_1 > x_2 > ... > x_M, the mean flow time, T*, under the optimal allocation policy θ*(t) is given by

    T* = (1 / (s(N) M)) ∑_{k=1}^{M} x_k (k · s(1 + ω_k*) − (k − 1) · s(ω_k*)),

where

    ω_k* = 1 / ((k/(k−1))^(1/(1−p)) − 1)   for all 1 < k ≤ M,

and ω_1* = 0.
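Both theorems are directly implementable. The sketch below (our own code and example sizes) evaluates the two closed forms and cross-checks Theorem 6 against an interval-by-interval simulation of the policy, which follows the structure given by Theorems 1 and 2:

```python
# Closed forms from Theorems 5 and 6, plus a simulation sanity check.
# Function names and the example sizes are our own.
import numpy as np

def theta_star(m, p):
    """Theorem 5: optimal shares for jobs 1..m (largest job first)."""
    i = np.arange(1, m + 1)
    c = 1.0 / (1.0 - p)
    return (i / m) ** c - ((i - 1) / m) ** c

def mean_flow_time_star(x, N, p):
    """Theorem 6: optimal mean flow time; x sorted so x_1 >= ... >= x_M."""
    s = lambda k: k ** p
    c = 1.0 / (1.0 - p)
    M = len(x)
    total = 0.0
    for k in range(1, M + 1):
        w = 0.0 if k == 1 else 1.0 / ((k / (k - 1)) ** c - 1.0)
        total += x[k - 1] * (k * s(1 + w) - (k - 1) * s(w))
    return total / (s(N) * M)

def simulate_heSRPT(x, N, p):
    """Run the policy interval by interval; return the mean flow time."""
    s = lambda k: k ** p
    rem = np.array(sorted(x, reverse=True), dtype=float)
    t, flows = 0.0, []
    while rem.size:
        theta = theta_star(rem.size, p)
        dt = rem[-1] / s(theta[-1] * N)   # smallest job departs next
        rem -= dt * s(theta * N)
        t += dt
        flows.append(t)
        rem = rem[:-1]                    # drop the departed job
    return float(np.mean(flows))

x = [3.0, 2.0, 1.0]                       # hypothetical job sizes
print(theta_star(3, p=0.5))               # -> [0.111.. 0.333.. 0.555..]
print(mean_flow_time_star(x, N=10, p=0.5))
print(simulate_heSRPT(x, N=10, p=0.5))    # should match the line above
```

As a spot check, theta_star(2, 0.5) returns (0.25, 0.75), recovering the two-job example from the introduction.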
4. CONCLUSION

The optimal allocation policy derived in this paper biases towards short jobs, but does not give strict priority to these jobs, in order to maintain the overall efficiency of the system. That is, as stated in Theorem 5,

    0 < θ_1*(t) < θ_2*(t) < ... < θ_{m(t)}*(t).

We refer to the optimal policy as high-efficiency Shortest-Remaining-Processing-Time, or heSRPT.

5. REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.
[2] K. Agrawal, J. Li, K. Lu, and B. Moseley. Scheduling parallelizable jobs online to minimize the maximum flow time. In SPAA '16, pages 195-205. ACM, 2016.
[3] B. Berg, J.P. Dorsman, and M. Harchol-Balter. Towards optimality in parallel scheduling. ACM POMACS (SIGMETRICS), 1(2):40:1-40:30, 2018.
[4] J. Edmonds. Scheduling in the dark. Theoretical Computer Science, 235:109-141, 1999.
[5] J. Edmonds and K. Pruhs. Scalably scheduling processes with arbitrary speedup curves. In SODA '09, pages 685-692. ACM, 2009.
[6] M. D. Hill and M. R. Marty. Amdahl's law in the multicore era. Computer, 41:33-38, 2008.
[7] S. Im, B. Moseley, K. Pruhs, and E. Torng. Competitively scheduling tasks with intermediate parallelizability. ACM Transactions on Parallel Computing (TOPC), 3(1):4, 2016.
[8] S.-H. Lin, M. Paolieri, C.-F. Chou, and L. Golubchik. A model-based approach to streamlining distributed training for asynchronous SGD. In MASCOTS 2018, pages 306-318. IEEE, 2018.
[9] X. Zhan, Y. Bao, C. Bienia, and K. Li. PARSEC3.0: A multicore benchmark suite with network stacks and SPLASH-2X. ACM SIGARCH Computer Architecture News, 44:1-16, 2017.