Space-Efficient Scheduling of Nested Parallelism

GIRIJA J. NARLIKAR and GUY E. BLELLOCH Carnegie Mellon University

Many of today’s high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. This article presents an on-line scheduling algorithm that is provably space efficient and time efficient for nested-parallel languages. For a computation with depth D and serial space requirement S1, the algorithm generates a schedule that requires at most S1 + O(K · D · p) space (including scheduler space) on p processors. Here, K is a user-adjustable runtime parameter specifying the net amount of memory that a thread may allocate before it is preempted by the scheduler. Adjusting the value of K provides a trade-off between the running time and the memory requirement of a parallel computation. To allow the scheduler to scale with the number of processors, we also parallelize the scheduler and analyze the space and time bounds of the computation to include scheduling costs. In addition to showing that the scheduling algorithm is space and time efficient in theory, we demonstrate that it is effective in practice. We have implemented a runtime system that uses our algorithm to schedule lightweight parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming—Parallel programming; D.3.4 [Programming Languages]: Processors—Run-time environments; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms: Algorithms, Languages, Performance

Additional Key Words and Phrases: Dynamic scheduling, multithreading, nested parallelism, parallel language implementation, space efficiency

1. INTRODUCTION

Many of today’s high-level parallel programming languages provide constructs to express dynamic, fine-grained parallelism. Such languages include data-parallel languages such as Nesl [Blelloch et al. 1994] and HPF [HPF Forum 1993], as well as control-parallel languages such as ID [Arvind et al. 1989], Cilk [Blumofe et al. 1995], CC++ [Chandy and Kesselman 1992], Sisal [Feo et al. 1990], Multilisp [Halstead 1985], Proteus [Mills et al. 1990], and C or C++ with lightweight thread libraries [Powell et al. 1991; Bershad et al. 1988; Mueller 1993].

Authors’ address: School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213; email: {narlikar, blelloch}@cs.cmu.edu. This research is sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant DABT63-96-C-0071.

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 1999 ACM 0164-0925/99/0100-0138 $05.00


These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. The language implementation is responsible for scheduling this parallelism onto the processors. If the scheduling is done at runtime, then the performance of the high-level code relies heavily on the scheduling algorithm, which should have low scheduling overheads and good load balancing.

Several systems providing dynamic parallelism have been implemented with efficient runtime schedulers [Blumofe and Leiserson 1994; Chandra et al. 1994; Chase et al. 1989; Freeh et al. 1994; Goldstein et al. 1995; Hsieh et al. 1993; Hummel and Schonberg 1991; Nikhil 1994; Rinard et al. 1993; Rogers et al. 1995], resulting in good parallel performance. However, in addition to good time performance, the memory requirements of the parallel computation must be taken into consideration. In an attempt to expose a sufficient degree of parallelism to keep all processors busy, schedulers often create many more parallel threads than necessary, leading to excessive memory usage [Culler and Arvind 1988; Halstead 1985; Ruggiero and Sargeant 1987]. Further, the order in which the threads are scheduled can greatly affect the total size of the live data at any instant during the parallel execution, and unless the threads are scheduled carefully, the parallel execution of a program may require much more memory than its serial execution. Because the price of memory is a significant portion of the price of a parallel computer, and parallel computers are typically used to run big problem sizes, reducing memory usage is often as important as reducing running time.

Many researchers have addressed this problem in the past. Early attempts to reduce the memory usage of parallel computations were based on heuristics that limit the parallelism [Burton and Sleep 1981; Culler and Arvind 1988; Halstead 1985; Ruggiero and Sargeant 1987] and are not guaranteed to be space efficient in general. These were followed by scheduling techniques that provide proven space bounds for parallel programs [Blumofe and Leiserson 1993; 1994; Burton 1988; Burton and Simpson 1994]. If S1 is the space required by the serial execution, these techniques generate schedules for a multithreaded computation on p processors that require no more than p · S1 space. These ideas are used in the implementation of the Cilk programming language [Blumofe et al. 1995]. A recent scheduling algorithm improved these space bounds from a multiplicative factor on the number of processors to an additive factor [Blelloch et al. 1995]. The algorithm generates a schedule that uses only S1 + O(D · p) space, where D is the depth of the parallel computation (i.e., the length of the longest sequence of dependencies or the critical path in the computation). This bound is asymptotically lower than the previous bound of p · S1 when D = o(S1), which is true for parallel computations that have a sufficient degree of parallelism. For example, a simple algorithm to multiply two n × n matrices has depth D = Θ(log n) and serial space S1 = Θ(n²), giving space bounds of O(n² + p · log n) instead of O(n² · p) on previous systems.¹ The low space bound of S1 + O(D · p) is achieved by ensuring that the parallel execution follows an order that is as close as possible to the serial execution.

¹ More recent work provides a stronger upper bound than p · S1 for space requirements of regular divide-and-conquer algorithms in Cilk [Blumofe et al. 1996].

However, the algorithm has scheduling overheads that are too high for it to be practical. Since it is synchronous, threads need to be rescheduled after every instruction to guarantee the space bounds. Moreover, it ignores the issue of locality: a thread may be moved from processor to processor at every timestep.

In this article we present and analyze an asynchronous scheduling algorithm called AsyncDF. This algorithm is a variant of the synchronous scheduling algorithm proposed in previous work [Blelloch et al. 1995] and overcomes the above-mentioned problems. We also provide experimental results that demonstrate that the AsyncDF algorithm does achieve good performance both in terms of memory and time. The main goal in the design of the algorithm was to allow threads to execute nonpreemptively and asynchronously, allowing for better locality and lower scheduling overhead. This is achieved by allocating a pool of a constant K units of memory to each thread when it starts up, and allowing a thread to execute nonpreemptively on the same processor until it runs out of memory from that pool (and reaches an instruction that requires more memory), or until it suspends. In practice, instead of preallocating a pool of K units of memory for each thread, we can assign it a counter that keeps track of its net memory allocation. We call this runtime, user-defined constant K the memory threshold for the scheduler. When an executing thread suspends or is preempted on a memory allocation, the processor accesses a new thread in a nonblocking manner from a work queue; the threads in the work queue are prioritized according to their depth-first, sequential execution order. The algorithm also delays threads performing large block allocations by effectively lowering their priority. Although the nonpreemptive and asynchronous nature of the AsyncDF algorithm results in an execution order that differs from the order generated by the previous algorithm [Blelloch et al. 1995], we show that it maintains an asymptotic space bound of S1 + O(K · D · p). Since K is typically fixed to be a small, constant amount of memory, the space bound reduces to S1 + O(D · p), as before. This bound includes the space required by the scheduling data structures.

The scheduler in the AsyncDF algorithm is serialized, and it may become a bottleneck for a large number of processors. Therefore, to allow the scheduler to scale with the number of processors, this article also presents a parallelized version of the scheduler. We analyze both the space and time requirements of a parallel computation including the overheads of this parallelized scheduler. Using the parallelized scheduler, we show that a computation with W work (i.e., total number of operations), D depth, and a serial space requirement of S1 can be executed on p processors using S1 + O(K · D · p · log p) space. The additional log p factor arises because the parallelized scheduler creates more ready threads to keep the processors busy while the scheduler executes; this creation of additional parallelism is required to make the execution time efficient.
When the total space allocated in the computation is O(W) (e.g., when every allocated element is touched at least once), we show that the total time required for the parallel execution is O(W/p + D · log p).

We have built a runtime system that uses the AsyncDF algorithm to schedule parallel threads on the SGI Power Challenge. To test its effectiveness in reducing memory usage, we have executed a number of parallel programs on it and compared their space and time requirements with previous scheduling techniques.



Fig. 1. The variation of running time and memory usage with the memory threshold K (in bytes) for multiplying two 1024 × 1024 matrices using blocked recursive matrix multiplication on 8 processors. K = 500–2000 bytes results in both good performance and low memory usage.

The experimental results show that, compared to previous techniques, our system significantly reduces the maximum amount of live data at any time during the execution of the programs. In addition, good single-processor performance and high parallel speedups indicate that the scheduling overheads in our system are low; that is, memory requirements can be effectively reduced without compromising performance.

A larger value of the memory threshold K leads to a lower running time in practice, because it allows threads to run longer without preemption and reduces scheduling costs. However, a large K also results in a larger space bound. The memory threshold K therefore provides a trade-off between the running time and the memory requirement of a parallel computation. For example, Figure 1 experimentally demonstrates this trade-off for a parallel benchmark running on our runtime system. Section 7 describes the runtime system and the benchmark in detail. For all the benchmarks used in our experiments, a value of K = 1000 bytes yields good performance in both space and time.

The AsyncDF scheduling algorithm assumes a shared-memory programming model and applies to languages providing nested parallelism. These include nested data-parallel languages and control-parallel languages with a fork-join style of parallelism. Threads in our model are allowed to allocate memory from the shared heap or on their private stacks.

1.1 An Example

The following pseudocode illustrates the main ideas behind our scheduling algorithm, and how they result in lower memory usage compared to previous scheduling techniques.


In parallel for i = 1 to n
    Temporary B[n]
    In parallel for j = 1 to n
        F(B, i, j)
    Free B

This code has two levels of parallelism: the i-loop at the outer level and the j-loop at the inner level. In general, the number of iterations in each loop may not be known at compile time. Space for an array B is allocated at the start of each i-iteration, and is freed at the end of the iteration. Assuming that F(B, i, j) does not allocate any space, the serial execution requires O(n) space, since the space for array B is reused for each i-iteration. Now consider the parallel implementation of this function on p processors, where p
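For reference, the serial execution just analyzed can be sketched in C; this is a hedged illustration (F is the placeholder function from the pseudocode, and malloc/free stand in for Temporary/Free), not code from the original system:

    #include <stdlib.h>

    extern void F(char *B, int i, int j);    /* hypothetical loop body */

    void serial_version(int n)
    {
        for (int i = 1; i <= n; i++) {
            char *B = malloc(n);             /* Temporary B[n] */
            for (int j = 1; j <= n; j++)
                F(B, i, j);
            free(B);                         /* B's space is reused by the
                                                next i-iteration: O(n) total */
        }
    }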

² The additional O(D · p) memory is required due to the O(D · p) instructions that may execute “out of order” with respect to the serial execution order for this code.



Fig. 2. The state transition diagram for threads. Threads in the system are either executing, ready, or suspended.

… chunking the iterations. In general, there is a trade-off between memory use and scheduling efficiency.

1.2 Outline of the Article

We begin by defining the multithreaded programming model that our scheduling algorithm implements in Section 2. Section 3 describes how any parallel computation in this model can be represented as a directed acyclic graph; this representation is used in the proofs throughout the article. Section 4 presents our online scheduling algorithm; the space and time bounds for schedules generated by it are proved in Section 5. In Section 6 we describe a parallelized scheduler and analyze the space and time requirements to include scheduling overheads. The implementation of our runtime system and the results of executing parallel programs on it are described in Section 7. Finally, we summarize and describe future work in Section 8.

2. MODEL OF PARALLELISM

Our scheduling algorithm is applicable to languages that support nested parallelism, which include data-parallel languages (with nested parallel loops and nested parallel function calls), control-parallel languages (with fork-join constructs), and any mix of the two. The algorithm assumes a shared-memory programming model, in which parallel programs can be described in terms of threads. Threads may be either executing, ready to execute, or suspended. A thread is said to be scheduled when a processor begins executing its instructions, that is, when the thread moves from the ready state to the executing state. When the executing thread subsequently suspends or preempts itself, it gives up the processor and leaves the executing state. A thread may suspend on a synchronization with one or more of its child threads. When a thread preempts itself due to a memory allocation (as explained below), it remains in the ready state. A thread that suspends must be reactivated (made ready) before it can be scheduled again. An executing thread that terminates is removed from the system. Figure 2 shows the state transition diagram for threads.
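As a minimal sketch (ours, not the authors' data structure), the states and transitions of Figure 2 can be written down as follows:

    /* Thread states from Figure 2.  Transitions: scheduled (READY ->
       EXECUTING), preempted (EXECUTING -> READY), suspends (EXECUTING ->
       SUSPENDED), reactivated (SUSPENDED -> READY); an executing thread
       that terminates is deleted from the system. */
    typedef enum {
        READY,      /* created, reactivated, or preempted; awaiting a processor */
        EXECUTING,  /* running on a processor */
        SUSPENDED   /* waiting for its child threads to synchronize */
    } thread_state;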


Each thread can be viewed as a sequence of actions; an action is a unit of computation that must be executed serially, and takes exactly one timestep (clock cycle) to be executed on a processor. A single action may allocate or deallocate space. Since instructions do not necessarily complete in a single timestep, one machine instruction may translate to a series of multiple actions. The work of a parallel computation is the total number of actions executed in it, that is, the number of timesteps required to execute it serially (ignoring scheduling overheads). The depth of a parallel computation is the length of the critical path, or the time required to execute the computation on an infinite number of processors (again, ignoring scheduling overheads). The space requirement of an execution is the maximum of the total memory allocated across all processors at any time during the execution, that is, the high-water mark of total memory allocation.

The computation starts with one initial thread. On encountering a parallel loop (or fork), a thread forks one child thread for each iteration and suspends itself.³ Each child thread may, in turn, fork more threads. We assume that the child threads do not communicate with each other. The last child thread to terminate reactivates the suspended parent thread. We call the last action of a thread its synchronization point. A thread may fork any number of child threads, and this number need not be known at compile time. We assume the program is deterministic and does not include speculative computation.

To maintain our space bounds, we impose an additional restriction on the threads. Every time a ready thread is scheduled, it may perform a memory allocation from a global pool of memory only in its first action. The memory allocated becomes its private pool of memory, and may subsequently be used for a variety of purposes, such as dynamically allocated heap data or activation records on its stack. When the thread runs out of its private pool and reaches an action that needs to allocate more memory, the thread must preempt itself by giving up its processor and moving from the executing state back to the ready state. The next time the thread is scheduled, the first of its actions to execute may once again allocate a pool of memory from the global pool, and so on. The reason threads are preempted just before they allocate more space is to allow ready threads that have a higher priority (an earlier order in the serial execution) to be scheduled instead. The thread-scheduling policy is transparent to the programmer.
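The private-pool rule can be made concrete with the counter representation that Section 7 uses in place of an actual pool; the sketch below is illustrative, and the structure layout and function name are ours:

    /* K is the memory threshold.  mem_counter is reset to K each time the
       thread is scheduled, and is incremented when the thread deallocates. */
    typedef struct {
        long mem_counter;   /* bytes left in the thread's private pool */
        /* ... continuation state, synchronization counter, etc. ... */
    } thread;

    /* Returns 1 if the allocation fits in the private pool; returns 0 if
       the thread must preempt itself and move back to the ready state. */
    int try_allocate(thread *t, long bytes)
    {
        if (t->mem_counter >= bytes) {
            t->mem_counter -= bytes;
            return 1;
        }
        return 0;   /* preempt just before allocating more space */
    }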

3. REPRESENTING THE COMPUTATION AS A DAG

To formally define and analyze the space and time requirements of a parallel computation, we view the computation as a precedence graph, that is, a directed acyclic graph or DAG. Each node in the DAG corresponds to an action. The edges of the DAG express the dependencies between the actions. We will refer to the amount of memory allocated from the global pool by the action corresponding to a node v as m(v). If the action performs a deallocation, m(v) is negative; we assume that a single action does not perform both an allocation and a deallocation. The DAG unfolds dynamically and can be viewed as a trace of the execution.

For the nested parallelism model described in Section 2, the dynamically unfolding DAG has a series-parallel structure. A series-parallel DAG [Blelloch et al. 1995] can be defined inductively: the DAG G0 consisting of a single node (which is both its source and sink) and no edges is a series-parallel DAG.

³ As discussed later, the implementation actually forks the threads lazily, so that the space for a thread is allocated only when it is started.


Fig. 3. A portion of a program DAG and the threads that perform its computation. Each node corresponds to a single action; the edges express dependencies between actions. The threads are shown as outlines around the nodes. Dashed lines mark points in a thread where it gets suspended or preempted and is subsequently scheduled again. Only the initial node of each newly scheduled thread may allocate memory, that is, may allocate space from the global pool to be used as the local pool. Such nodes are shown in bold. Any node may deallocate memory.

If G1 and G2 are series-parallel DAGs, then the graph obtained by adding to G1 ∪ G2 a directed edge from the sink node of G1 to the source node of G2 is a series-parallel DAG. If G1, ..., Gn, for n ≥ 1, are series-parallel DAGs, then the DAG obtained by adding to G1 ∪ ... ∪ Gn a new source node u, with edges from u to the source nodes of G1, ..., Gn, and a new sink node v, with edges from the sink nodes of G1, ..., Gn to v, is also a series-parallel DAG.

The total number of nodes in the DAG corresponds to the total work of the computation, and the longest path in the DAG corresponds to the depth. A thread that performs w actions (units of computation) is represented as a sequence of w nodes in the DAG. When a thread forks child threads, edges from its current node to the initial nodes of the child threads are revealed. Similar dependency edges are revealed at the synchronization point. Because we do not restrict the number of threads that can be forked, a node may have an arbitrary in-degree and out-degree. For example, Figure 3 shows a small portion of a DAG. Only the first node of a newly scheduled thread may allocate space; a thread must preempt itself on reaching a subsequent node that performs an allocation. The points where threads are suspended or preempted, and subsequently scheduled again, are marked as dashed lines.

Definitions. In a DAG G = (V, E), for every edge (u, v) ∈ E, we call u a parent of v, and v a child of u. For our space and time analysis we assume that the clocks (timesteps) of the processors are synchronized. Therefore, although we are modeling asynchronous parallel computations, the schedules are represented as sets of nodes executed in discrete timesteps. With this assumption, any execution of the computation on p processors that takes T timesteps can be represented by a p-schedule sp = V1, V2, ..., VT, where Vi is the set of nodes executed at timestep i.


Fig. 4. The 1DF-schedule for a program DAG. Each node is labeled with the order in which it is executed. A node cannot be executed until all the nodes that have edges into it have been executed. Serial executions of computations on most systems execute nodes in this order.

Since each processor can execute at most one node in each timestep, each set Vi contains at most p nodes. A serial execution of a computation can be represented by a 1-schedule s1 = V1, V2, ..., VT, where each set Vi consists of at most one node. Every schedule must obey the dependency edges, that is, a node may appear in a set Vi only if all its parent nodes appear in previous sets. We say a node v is ready at timestep i if all the parents of v have been executed, but v has not been executed. We define the space requirement of a parallel (or serial) execution as the high-water mark of total memory allocated across all the processors. Therefore, the space required by a parallel execution represented as sp = V1, V2, ..., VT is defined as Sp = n + max_{j=1,...,T} Σ_{i=1}^{j} Σ_{v∈Vi} m(v), where n is the space required to store the input. We use the term S1 to refer to the space requirement of a serial execution.
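This definition translates directly into code. The following sketch (the schedule representation and names are ours) computes Sp as a high-water mark over the timesteps of a schedule:

    /* sched[i][k] is the k-th node executed at timestep i+1, count[i] is
       |Vi|, m[v] is the net allocation of node v (negative for a
       deallocation), and input_size is the space n needed for the input. */
    long space_of_schedule(int T, const int *count, int **sched,
                           const long *m, long input_size)
    {
        long cur = input_size, high = input_size;
        for (int i = 0; i < T; i++) {
            for (int k = 0; k < count[i]; k++)
                cur += m[sched[i][k]];   /* allocations of timestep i+1 */
            if (cur > high)
                high = cur;              /* running maximum over prefixes */
        }
        return high;                     /* = Sp */
    }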

1DF-schedule. Several different serial schedules (with different space requirements) may exist for a single DAG. However, serial implementations of most languages execute nodes according to a unique, left-to-right, depth-first schedule or 1DF-schedule of the DAG. The first step of a 1DF-schedule executes the root node; at every subsequent step, the leftmost ready child of the most recently executed node with a ready child is executed (when children are placed from left to right in program text order). The order in which the nodes are executed in a 1DF-schedule determines their 1DF-numbers. For example, Figure 4 shows the 1DF-numbering of a simple program DAG. For the class of nested parallel computations, the most natural serial execution follows a 1DF-schedule. Therefore, in the rest of this article, we refer to the 1DF-schedule as the serial schedule s1, and to its space requirement as the serial space requirement S1.

In practice, due to effects such as caching, the exact size of the DAG for a given computation can depend on the particular schedule. For example, the number of clock cycles (actions) required for a given instruction can vary based on the location of its data in the memory hierarchy, which may be affected by the scheduling order. For deterministic programs, however, the value of S1 is independent of the schedule (or execution), since the number or size of allocations is not affected by such timing effects.
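For the series-parallel DAGs considered here, the 1DF-numbers defined above can be computed by a left-to-right depth-first traversal that descends into a child only after all of the child's parents have executed. A sketch, assuming an adjacency-list representation of our choosing:

    /* children[v] lists v's children in program text (left-to-right) order,
       and indeg[v] is initialized to the number of parents of v.  Calling
       dfs_number(root, ...) with *next = 1 labels every node with its
       1DF-number, as in Figure 4. */
    void dfs_number(int v, int *indeg, int **children,
                    const int *nchildren, int *dfnum, int *next)
    {
        dfnum[v] = (*next)++;                 /* v executes now */
        for (int k = 0; k < nchildren[v]; k++) {
            int c = children[v][k];
            if (--indeg[c] == 0)              /* all of c's parents executed */
                dfs_number(c, indeg, children, nchildren, dfnum, next);
        }
    }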



Fig. 5. The movement of threads between the processors and the scheduling queues. Qin and Qout are FIFOs, whereas R allows insertions and deletions of intermediate threads. Every executing, suspended, or ready thread has an entry in R. Threads in R are stored from left to right in increasing order of the 1DF-numbers of their leading nodes (the first nodes to be executed when the thread gets scheduled).

4. THE ASYNCDF SCHEDULING ALGORITHM

The key idea behind our scheduling algorithm is to schedule threads in an order that is sufficiently close to a 1DF-numbering (serial ordering) of their nodes. Since we must schedule multiple threads simultaneously to keep all the processors busy, some threads get scheduled earlier than they would be in the serial execution. These are the threads that cause the parallel execution to require more memory; however, we can limit the number of such threads by following an order of execution dictated by the 1DF-numbers.

We first describe the data structures used in the system, followed by a description of the AsyncDF algorithm. We assume for now that every time a thread is scheduled, the number of bytes it allocates from the global pool is at most a constant K (the user-adjustable memory threshold), and that it preempts itself when it reaches an action requiring more memory. This assumption is valid if no single action in a thread needs to allocate more than K memory; we explain how to handle allocations larger than K later in this section.

Runtime Data Structures. The central data structure, R, is a priority queue containing ready threads (threads that are ready to be executed), suspended threads, and stubs that act as placeholders for threads that are currently being executed. Threads in R are stored from left to right in increasing order of the 1DF-numbers of their leading nodes (the first nodes to be executed when the thread is next scheduled). The lower the 1DF-number of the leading node of a thread, the higher is the thread's priority. Maintaining threads in this order requires R to support operations such as inserting or deleting from the middle of the queue. Implementing fast, concurrent operations on such a queue is difficult; instead, we introduce two additional queues called Qin and Qout. These queues are simple FIFOs that support efficient concurrent inserts and deletes. They act as input and output buffers to store threads that are to be inserted into, or that have been removed from, R (see Figure 5); processors can perform fast, nonblocking accesses to these queues. In this section we use a serial algorithm to move threads between the buffers and R; Section 6 describes how this can be done in parallel.
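As a rough picture of these structures (illustrative declarations only; the actual implementation is described in Section 7):

    typedef struct thread thread;

    typedef struct {            /* Qin, Qout: FIFO buffers supporting fast,
                                   nonblocking concurrent inserts and deletes */
        thread **slot;
        int head, tail, capacity;
    } fifo;

    struct rentry {             /* one entry of R: a ready or suspended
                                   thread, or a stub for an executing one */
        thread *t;
        struct rentry *next;    /* right neighbor in 1DF-number order */
    };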


begin worker
    while (there exist threads in the system)
        τ := remove-thread(Qout);
        if (τ is a scheduling thread) then
            scheduler()
        else
            execute the actions associated with τ;
            if (τ terminates) or (τ suspends) or (τ preempts itself) then
                insert-thread(τ, Qin);
end worker

begin scheduler
    acquire scheduler-lock;
    insert a scheduling thread into Qout;
    T := remove-all-threads(Qin);
    for each thread τ in T
        insert τ into R in its original position;
        if τ has terminated
            if τ is the last among its siblings to synchronize,
                reactivate τ’s parent;
            delete τ from R;
    select the leftmost p ready threads from R:
        if there are fewer than p ready threads, select them all;
        fork child threads in place if needed;
    insert these selected threads into Qout;
    release scheduler-lock;
end scheduler

Fig. 6. The AsyncDF scheduling algorithm. When the scheduler forks (creates) child threads, it inserts them into R to the immediate left of their parent thread. This maintains the invariant that the threads in R are always in the order of increasing 1DF-numbers of their leading nodes. Therefore, at every scheduling step, the p ready threads whose leading nodes have the smallest 1DF-numbers are moved to Qout. Child threads are forked only when they are to be added to Qout, that is, when they are among the leftmost p ready threads in R.

Algorithm Description. The pseudocode for the AsyncDF scheduling algorithm is given in Figure 6. The processors normally act as workers: they take threads from Qout, execute them until they preempt themselves, suspend, or terminate, and then return them to Qin. Every time a ready thread is picked from Qout and scheduled on a worker processor, it may allocate space from a global pool in its first action. The thread must preempt itself before any subsequent action that requires more space. A thread that performs a fork must suspend itself; it is reactivated when the last of its forked child threads terminates. A child thread terminates upon reaching the synchronization point.

In addition to acting as workers, the processors take turns in acting as the scheduler. For this purpose, we introduce special scheduling threads into the system. Whenever the thread taken from Qout by a processor turns out to be a scheduling thread, the processor assumes the role of the scheduling processor and executes the scheduler procedure. We call each execution of the scheduler procedure a scheduling step. Only one processor can be executing a scheduling step at a time, due to the scheduler lock. The algorithm begins with a scheduling thread and the first (root) thread of the program on Qout.

A processor that executes a scheduling step starts by putting a new scheduling thread on Qout.

Next, it moves all the threads from Qin to R. Each thread has a pointer to a stub that marks its original position relative to the other threads in R; it is inserted back in that position. All threads that were preempted due to a memory allocation are returned to R in the ready state. The scheduler then compacts R by removing threads that have terminated. These are child threads that have reached their synchronization point, and the root thread at the end of the entire computation. If a thread is the last among its siblings to reach its synchronization point, its suspended parent thread is reactivated. If a thread performs a fork, its child threads are inserted to its immediate left, and the forking thread suspends. The child threads are placed in R in order of the 1DF-numbers of their leading nodes. Finally, the scheduler moves the leftmost p ready threads from R to Qout, leaving behind stubs to mark their positions in R. If R contains fewer than p ready threads, the scheduler moves them all to Qout. The scheduling thread then completes, and the processor resumes the task of a worker.

This scheduling algorithm ensures that the total number of threads in Qin and Qout is at most 3p (see Lemma 5.1.4). Further, to limit the number of threads in R, we lazily create the child threads of a forking thread: a child thread is not explicitly created until it is to be moved to Qout, that is, until it is among the leftmost p threads represented in R. Until then, the parent thread implicitly represents the child thread. A single parent may represent several child threads. This optimization ensures that a thread does not have an entry in R until it has been scheduled at least once before, or is in (or about to be inserted into) Qout. If a thread τ is ready to fork child threads, all its child threads will be forked (created) and scheduled before any other threads in R to the right of τ can be scheduled.

Handling Large Allocations of Space. We assumed earlier in this section that every time a thread is scheduled, it allocates at most K bytes for its use from a global pool of memory, where K is the constant memory threshold. This does not allow any single action within a thread to allocate more than K bytes. We now show how such allocations are handled, similar to the technique suggested in previous work [Blelloch et al. 1995]. The key idea is to delay the big allocations, so that if threads with lower 1DF-numbers become ready, they will be executed instead.

Consider a thread with a node that allocates m units of space in the original DAG, where m > K. We transform the DAG by inserting a fork of m/K parallel threads before the memory allocation (see Figure 7). These new child threads perform a unit of work (a no-op), but do not allocate any space; they simply consist of dummy nodes. However, we treat the dummy nodes as if they allocate space, and the original thread is suspended at the fork. It is reactivated when all the dummy threads have been executed, and may then proceed with the allocation of m space. This transformation of the DAG increases its depth by at most a constant factor. If Sa is the total space allocated in the program (not counting the deallocations), the number of nodes in the transformed DAG is at most W + Sa/K. The transformation takes place at runtime, and the on-line AsyncDF algorithm generates a schedule for this transformed DAG. This ensures that the space requirement of the generated schedule does not exceed our space bounds, as proved in Section 5.
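In an implementation, the m/K dummy threads need not be created explicitly; Section 7 replaces them by a per-thread delay counter. A sketch of the accounting (the function is ours, with m/K read as rounded up):

    /* A thread about to allocate m > K bytes reenters R with a delay
       counter equal to the number of dummy nodes of Figure 7.  Scheduling
       steps decrement the counter in place of creating and scheduling
       dummy threads; the thread becomes ready when it reaches zero. */
    long delay_for_allocation(long m, long K)
    {
        return (m + K - 1) / K;   /* one unit-work, zero-space dummy per K bytes */
    }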



Fig. 7. A transformation of the DAG to handle a large allocation of space at a node without violating the space bound. Each node is labeled with the amount of memory its action allocates. When a thread needs to allocate m space (m > K), we insert m/K parallel child threads before the allocation. K is the constant memory threshold. Each child thread consists of a dummy node that does not allocate any space. After these child threads complete execution, the original thread performs the allocation and continues with its execution.

We now state the following lemma regarding the order of the threads in R maintained by the AsyncDF algorithm.

Lemma 4.1. The AsyncDF scheduling algorithm always maintains the threads in R in an increasing order of the 1DF-numbers of their leading nodes.

Proof. This lemma can be proved by induction. When the execution begins, R contains just the root thread, and therefore it is ordered by the 1DF-numbers. Assume that at the start of some scheduling step, the threads in R are in increasing order of the 1DF-numbers of their leading nodes. For a thread that forks, inserting its child threads before it in the order of their 1DF-numbers maintains the ordering by 1DF-numbers. A thread that preempts itself due to a memory allocation is returned to its original position in the ready state; its new leading node occupies the same position, relative to the leading nodes of the other threads, as its previous leading node. Deleting threads from R does not affect their ordering. Therefore the ordering of threads in R by 1DF-numbers is preserved after every operation performed by the scheduler.

Lemma 4.1 implies that when the scheduler moves the leftmost p threads from R to Qout, their leading nodes are the nodes with the lowest 1DF-numbers. We will use this fact to prove the space bounds of the schedule generated by our scheduling algorithm.

5. THEORETICAL RESULTS

In this section, we prove that a parallel computation with depth D and work W, which requires S1 space to execute on one processor, is executed by the AsyncDF scheduling algorithm on p processors using S1 + O(D · p) space (including scheduler space). In Section 6 we describe the implementation of a parallelized scheduler and analyze both the space and time bounds including scheduler overheads.

Recall that we defined a single action as the work done by a thread in one timestep (clock cycle). Since the granularity of a clock cycle is somewhat arbitrary, especially considering highly pipelined processors with multiple functional units, this would seem to make the exact value of the depth D somewhat arbitrary. For asymptotic bounds this is not problematic, since the granularity will only make constant factor differences.

In Appendix A, however, we modify the space bound to be independent of the granularity of actions, making it possible to bound the space requirement within tighter constant factors.

Timing Assumptions. As explained in Section 3, the timesteps are synchronized across all the processors. At the start of each timestep, we assume that a worker processor is either busy executing a thread or is accessing the queues Qout or Qin. An idle processor always busy waits for threads to appear in Qout. We assume a constant-time, atomic fetch-and-add operation in our system. This allows all worker processors to access Qin and Qout in constant time [Narlikar 1999]. Thus, at any timestep, if Qout has n threads, and pi processors are idle, then min(n, pi) of the pi idle processors are guaranteed to succeed in picking a thread from Qout within a constant number of timesteps. We do not need to limit the duration of each scheduling step to prove the space bound; we simply assume that it takes at least one timestep to execute.
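For illustration, such constant-time queue access can be built on a hardware fetch-and-add, written here with C11 atomics. This is a simplified, bounded sketch that ignores wraparound and the race between claiming and refilling slots; it is not the system's actual queue:

    #include <stdatomic.h>

    typedef struct {
        void      *slot[1024];  /* filled before 'published' is advanced */
        atomic_int head;        /* next index to claim */
        atomic_int published;   /* number of slots made visible so far */
    } out_queue;

    /* Each idle processor claims a distinct index with one atomic
       fetch-and-add, so min(n, pi) of pi idle processors succeed within
       a constant number of timesteps. */
    void *try_dequeue(out_queue *q)
    {
        if (atomic_load(&q->head) >= atomic_load(&q->published))
            return NULL;                          /* nothing available */
        int i = atomic_fetch_add(&q->head, 1);    /* claim slot i */
        return i < atomic_load(&q->published) ? q->slot[i] : NULL;
    }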

5.1 Space Bound

To prove the space bound, we partition the nodes of the computation DAG into heavy and light nodes. Every time a ready thread is scheduled, we call the node representing its first action (i.e., the thread's leading node) a heavy node, and all other nodes light nodes. Thus, heavy nodes may allocate space, while light nodes allocate no space (but may deallocate space). The space requirement is analyzed by bounding the number of heavy nodes that execute out of order with respect to the 1DF-schedule. When a thread is moved from R to Qout by a scheduling processor, we will say its leading heavy node has been inserted into Qout. A heavy node may get executed several timesteps after it becomes ready and after it is put into Qout. However, a light node is executed in the timestep it becomes ready, because a processor executes consecutive light nodes nonpreemptively.

Let sp = V1, ..., VT be the parallel schedule of the DAG generated by the AsyncDF algorithm. Here Vi is the set of nodes that are executed at timestep i. Let s1 be the 1DF-schedule for the same DAG. A prefix of sp is the set ∪_{i=1}^{j} Vi, that is, the set of all nodes executed during the first j timesteps of sp, for any 1 ≤ j ≤ T. Consider an arbitrary prefix, σp, of sp. Let σ1 be the largest prefix of s1 containing only nodes in σp, that is, σ1 does not contain any nodes that are not part of σp. Then σ1 is the corresponding serial prefix of σp. We call the nodes in σp − σ1 the premature nodes, because they have been executed out of order with respect to s1. All other nodes in σp (i.e., all nodes in the set σ1) are called nonpremature. For example, Figure 8 shows a simple DAG with a parallel prefix σp for some arbitrary p-schedule, and its corresponding serial prefix σ1.

The parallel execution has higher memory requirements because of the space allocated by the actions associated with the premature nodes. Hence we need to bound the space overhead of the premature nodes in σp. To get this bound, we need to consider only the heavy premature nodes, since the light nodes do not allocate any space. We will assume for now that the actions corresponding to all heavy nodes allocate at most K space each, where K is the user-specified memory threshold of the scheduler. Later we will relax this assumption to cover bigger allocations.



Fig. 8. (a) A simple program DAG in which the heavy nodes (a, b, c, e, g, h, k, and n) are shown in bold. The 1DF-schedule for this DAG is s1 = [a, b, c, d, e, f, g, h, i, j, k, l, m, n]. For p = 2, a possible parallel schedule is sp = [{a}, {b, h}, {c, i}, {d, j}, {e, k}, {f, l}, {g, m}, {n}]. In this schedule, once a heavy node is executed, the processor continues to execute the subsequent light nodes of the thread. (b) shows a prefix σp = {a, b, h, c, i, d, j, e, k} of sp (after the first 5 timesteps), and the corresponding prefix σ1 of s1, where σ1 = {a, b, c, d, e}. Thus, the premature nodes in σp (i.e., the nodes in σp − σ1) are h, i, j, and k, shown shaded in (a). Of these, the heavy premature nodes are h and k.

We first prove the following bound on the number of heavy nodes that get executed prematurely in any prefix of the parallel schedule.

Lemma 5.1.1. Let G be a directed acyclic graph (DAG) of W nodes and depth D. Let s1 be the 1DF-schedule for G, and let sp be a parallel schedule for G executed by the AsyncDF algorithm on p processors. Then the number of heavy premature nodes in any prefix of sp with respect to the corresponding prefix of s1 is at most O(D · p).

Proof. Consider an arbitrary prefix σp of sp, and let σ1 be the corresponding prefix of s1. Let v be the last nonpremature node to be executed in the prefix σp; if there are two or more such nodes, pick any one of them. Let P be a path in the DAG from the root to v constructed such that, for every edge (u, u′) along P, u is the last parent (or any one of the last parents) of u′ to be executed. (For example, for the σp shown in Figure 8, the path P is (a, b, e).) Since v is nonpremature, all the nodes in P are nonpremature. Let ui be the node on the path P at depth i; then u1 is the root, and uδ is the node v, where δ is the depth of v. Let ti be the timestep in which ui is executed; let tδ+1 be the last timestep in σp. For i = 1, ..., δ, let Ii be the interval {ti + 1, ..., ti+1}.

Consider any interval Ii for i = 1, ..., δ − 1. We now show that at most O(p) heavy premature nodes can be executed in this interval. At the end of timestep ti, ui has been executed. If ui+1 is a light node, it gets executed in the next timestep (which, by definition, is timestep ti+1), and at most another (p − 1) heavy premature nodes can be executed in the same timestep, that is, in interval Ii. Consider the case when ui+1 is a heavy node. After timestep ti, Qout may contain p nodes. Further, because access to Qin requires constant time, the thread τ that contains ui must be inserted into Qin within a constant number of timesteps after ti. During these timesteps, a constant number of scheduling steps may be executed, adding another O(p) threads into Qout. Thus, because Qout is a FIFO, a total of O(p) heavy nodes may be picked from Qout and executed before ui+1; all of these heavy nodes may be premature. However, once the thread τ is inserted into Qin, the next scheduling step must find it in Qin; since ui is the last parent of ui+1 to execute, this scheduling step makes ui+1 available for scheduling. Thus, this scheduling step or any subsequent scheduling step must put ui+1 on Qout before it puts any more premature nodes, because ui+1 has a lower 1DF-number. When ui+1 is picked from Qout and executed by a worker processor, another p − 1 heavy premature nodes may get executed by the remaining worker processors in the same timestep, which, by definition, is timestep ti+1. Thus, a total of O(p) heavy premature nodes may be executed in interval Ii. Similarly, since v = uδ is the last nonpremature node in σp, at most O(p) heavy premature nodes get executed in the last interval Iδ. Because δ ≤ D, σp contains a total of O(D · p) heavy premature nodes.

We have shown that any prefix of sp has at most O(D · p) heavy premature nodes. Since we have assumed that the action associated with each heavy node allocates at most K space, we can prove the following lemma.

Lemma 5.1.2. Let G be a program DAG with depth D, in which every heavy node allocates at most K space. If the serial execution of the DAG requires S1 space, then the AsyncDF scheduling algorithm results in an execution on p processors that requires at most S1 + O(K · D · p) space.

Proof. Consider any parallel prefix σp of the parallel schedule generated by algorithm AsyncDF; let σ1 be the corresponding serial prefix. The net memory allocation of the nodes in σ1 is at most S1, because σ1 is a prefix of the serial schedule. Further, according to Lemma 5.1.1, the set σp − σ1 has O(D · p) heavy nodes, each of which may allocate at most K space. Therefore, the net space allocated by all the nodes in σp is at most S1 + O(K · D · p). Since this bound holds for any arbitrary prefix of the parallel execution, the entire parallel execution also uses at most S1 + O(K · D · p) space.

Handling Allocations Bigger than the Memory Threshold K. We described how to transform the program DAG to handle allocations bigger than K bytes in Section 4. Consider any heavy premature node v that allocates m > K space. The m/K dummy nodes inserted before v must have been executed earlier. Being dummy nodes, they do not actually allocate any space, but are entitled to allocate a total of m space (K units each) according to our scheduling technique. Hence v can allocate these m units without exceeding the space bound in Lemma 5.1.2. With this transformation, a parallel computation with W work and D depth that allocates a total of Sa units of memory results in a DAG with at most W + Sa/K nodes and O(D) depth. Therefore, using Lemma 5.1.2, we can state the following lemma.

Lemma 5.1.3. A computation of depth D and work W , which requires S1 space to execute on one processor, is executed on p processors by the AsyncDF algorithm using S1 + O(K · D · p) space. Finally, we bound the space required by the scheduler to store the three queues.


Lemma 5.1.4. The space required by the scheduler is O(D · p).

Proof. When a processor starts executing a scheduling step, it first empties Qin. At this time, there can be at most p − 1 threads running on the other processors, and Qout can have another p threads in it. The scheduler adds at most another p threads (plus one scheduling thread) to Qout, and no more threads are added to Qout until the next scheduling step. Since all the threads executing on the processors can end up in Qin, Qin and Qout can have a total of at most 3p threads at any time.

Finally, we bound the number of threads in R. We will call a thread that has been created but not yet deleted from the system a live thread; R has one entry for each live thread. At any stage during the execution, the number of live threads is at most the number of premature nodes executed, plus the maximum number of threads in Qout (which is 2p + 1), plus the maximum number of live threads in the 1DF-schedule. Any step of the 1DF-schedule can have at most D live threads, because it executes threads in a depth-first manner. Since the number of premature nodes is at most O(D · p), R has at most O(D · p + D + 2p + 1) = O(D · p) threads. Because each thread requires a small, constant c units of memory to store its state,⁴ the total space required by the three scheduling queues is O(c · D · p) = O(D · p).

Using Lemmas 5.1.3 and 5.1.4, we can now state the following bound on the total space requirement of the parallel computation.

Theorem 5.1.5. A computation of depth D and work W, which requires S1 space to execute on one processor, is executed on p processors by the AsyncDF algorithm using S1 + O(K · D · p) space (including scheduler space).

In practice, K is set to a small, constant amount of memory throughout the execution of the program (see Section 7), reducing the space bound to S1 + O(D · p).

5.2 Time Bound

Finally, we bound the time required to execute the parallel schedule generated by the scheduling algorithm for a special case; Section 6 analyzes the time bound in the general case. In this special case, we assume that the worker processors never have to wait for the scheduler to add ready threads to Qout. Thus, when there are r ready threads in the system and n processors are idle, Qout has at least min(r, n) ready threads. Then min(r, n) of the idle processors are guaranteed to pick ready threads from Qout within a constant number of timesteps. We can show that the time required for such an execution is within a constant factor of the time required to execute a greedy schedule. A greedy schedule is one in which at every timestep, if n nodes are ready, min(n, p) of them get executed. Previous results have shown that greedy schedules for DAGs with W nodes and D depth require at most W/p + D timesteps to execute [Blumofe and Leiserson 1993]. Our transformed DAG has W + Sa/K nodes and O(D) depth. Therefore, we can show that our scheduler requires O(W/p + Sa/pK + D) timesteps to execute on p processors.

⁴ Recall that a thread allocates stack and heap data from the global pool of memory that is assigned to it every time it is scheduled; the data are hence accounted for in the space bound proved in Lemma 5.1.3. Therefore, the thread's state here refers simply to its register contents.

When the allocated space Sa is O(W), the number of timesteps required is O(W/p + D). For a more in-depth analysis of the running time that includes the cost of a parallelized scheduler in the general case, see Section 6.

6. A PARALLELIZED SCHEDULER

The time bound in Section 5 was proved for the special case when the scheduler procedure never becomes a bottleneck in making ready threads available to the worker processors. However, recall that the scheduler in the AsyncDF algorithm is a serial scheduler, that is, only one processor can be executing the scheduler procedure at a given time. Further, the time required for this procedure to execute may increase with the number of processors, causing idle worker processors to wait longer for ready threads to appear in Qout. Thus, the scheduler may indeed become a bottleneck on a large number of processors, and must be parallelized to scale with the number of processors. In this section, we describe a parallel implementation of the scheduler and analyze its space and time costs. We prove that a computation with W work and D depth can be executed in O(W/p + Sa/p + D · log p) time and S1 + O(D · p · log p) space on p processors; these bounds include the overheads of the parallelized scheduler. The additional log p term in the time bound arises due to the parallel prefix operations executed by the scheduler. The log p term in the space bound is due to the additional number of ready threads created to keep worker processors busy while the scheduler executes. We give only a theoretical description of a parallelized scheduler in this article; the experimental results presented in Section 7 have been obtained using the serial scheduler from Figure 6. As our results on up to 16 processors demonstrate, the serial scheduler provides good performance on this moderate number of processors.

6.1 Parallel Implementation of a Lazy Scheduler

Instead of using scheduler threads to periodically (and serially) execute the scheduler procedure as shown in Figure 6, we devote a constant fraction αp of the processors (0 < α < 1) to it. The remaining (1 − α)p processors always execute as workers. To amortize the cost of the scheduler, we place a larger number of threads (up to p log p instead of p) into Qout. As in Section 5, we assume that a thread can be inserted into or removed from Qin or Qout by any processor in constant time.

The data structure R is implemented as an array of threads, stored in decreasing order of priorities from left to right. As described in Section 4, threads are forked lazily; when a thread reaches a fork, it is simply marked as a seed thread. At a later time, when its child threads are to be scheduled, they are placed to the immediate left of the seed in order of their 1DF-numbers. Similarly, we perform all deletions lazily: every thread that terminates is simply marked in R as a dead thread, to be deleted in some subsequent timestep.

The synchronization (join) between child threads of a forking thread is implemented using a fetch-and-decrement operation on a synchronization counter associated with the fork. Each child that reaches the synchronization point decrements the counter by one and checks its value. If the counter has a nonzero value, the child simply marks itself as dead. The last child thread to reach the synchronization point (the one that decrements the counter to zero) marks itself as ready in R, and subsequently continues as the parent.

Thus, when all the child threads have been created, the seed that originally represented the parent thread can be deleted.

We will refer to all the threads that have an entry in R but are not dead as live threads. For every ready (or seed) thread τ that has an entry in R, we use a nonnegative integer c(τ) to describe its state: c(τ) equals the number of ready threads τ represents. Thus, for every seed τ, c(τ) is the number of child threads still to be created from it. For every other ready thread in R, c(τ) = 1, because it represents itself.

The scheduler procedure from Figure 6 can now be replaced by a while loop that runs until the entire computation has been executed. Each iteration of this loop, which we call a scheduling iteration, is executed in parallel by only the αp scheduler processors. Therefore, it need not be protected by a scheduler lock as in Figure 6. Let r be the total number of ready threads represented in R after threads from Qin are moved to R at the beginning of the iteration. Let qo = min(r, p log p − |Qout|) be the number of threads the scheduling iteration will move to Qout. The scheduling iteration of a lazy scheduler is defined as follows.

(1) Collect all the threads from Qin and move them to R, that is, update their states in R.

(2) Delete all the dead threads up to the leftmost (qo + 1) ready or seed threads.

(3) Perform a prefix-sums computation on the c(τ) values of the leftmost qo ready or seed threads to find the set C of the leftmost qo ready threads represented by these threads (a sketch of this computation appears below). For every thread in C that is represented implicitly by a seed, create an entry for the thread in R, marking it as a ready thread. Mark the seeds for which all child threads have been created as dead.

(4) Move the threads in the set C from R to Qout, leaving stubs in R to mark their positions.

Consider a thread τ that is the last child thread to reach the synchronization point in a fork, but was not the rightmost thread among its siblings. Some of τ's siblings, which have terminated, may be represented as dead threads to its right. Since τ now represents the parent thread after the synchronization point, it has a higher 1DF-number than these dead siblings to its immediate right. Thus, due to lazy deletions, dead threads may be out of order in R. However, the scheduler deletes all dead threads up to the first (qo + 1) ready or seed threads, that is, all dead threads to the immediate right of any ready thread (or seed representing a ready thread) before it is scheduled. Therefore, no descendants of a thread may be created until all dead threads out of order with respect to the thread are deleted. Thus, a thread may be out of order with only the dead threads to its immediate right.

We say a thread is active when it is either in Qout or Qin, or when it is being executed on a processor. Once a scheduling iteration empties Qin, at most p log p + (1 − α)p threads are active. The iteration creates at most another p log p active threads before it ends, and no more threads are made active until the next scheduling step. Therefore at most 2p log p + (1 − α)p threads can be active at any timestep, and each has one entry in R. We now prove the following bound on the time required to execute a scheduling iteration.
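Step (3) above can be sketched as follows. The real scheduler computes the prefix sums in parallel on the αp scheduling processors; this serial version (names are ours) shows only the accounting:

    /* c[i] is c(tau) for the i-th of the leftmost ready/seed entries of R;
       take[i] is set to how many of the ready threads that entry represents
       are selected, so that exactly qo threads are moved to Qout in all. */
    void select_leftmost_ready(int n, const long *c, long qo, long *take)
    {
        long before = 0;                     /* prefix sum of c[0..i-1] */
        for (int i = 0; i < n; i++) {
            long after = before + c[i];
            if (after <= qo)
                take[i] = c[i];              /* take every thread it represents */
            else if (before < qo)
                take[i] = qo - before;       /* seed split: partial take */
            else
                take[i] = 0;                 /* beyond the leftmost qo */
            before = after;
        }
    }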


We now prove the following bound on the time required to execute a scheduling iteration.

Lemma 6.1.1. For any 0 < α < 1, a scheduling iteration that deletes n dead threads runs in O(n/(αp) + (log p)/α) time on αp processors.

Proof. Let qo ≤ p log p be the number of threads the scheduling iteration must move to Qout. At the beginning of the scheduling iteration, Qin contains at most 2p log p + (1 − α)p threads. Since each of these threads has a pointer to its stub in R, αp processors can move the threads to R in O((log p)/α) time. Let τ be the (qo + 1)th ready or seed thread in R (starting from the left end). The scheduler needs to delete all dead threads to the left of τ. In the worst case, all the stubs are also to the left of τ in R. However, the number of stubs in R is at most 2p log p + (1 − α)p. Because there are n dead threads to the left of τ, they can be deleted from among n + 2p log p + (1 − α)p threads in O(n/(αp) + (log p)/α) timesteps on αp processors. After the deletions, the leftmost qo ≤ p log p ready threads are among the first 3p log p + (1 − α)p threads in R; therefore the prefix-sums computation will require O((log p)/α) time. Finally, qo new child threads can be created and added in order at the left end of R in O((log p)/α) time. Note that all deletions and additions are at the left end of R, which are simple operations in an array.5 Thus, the entire scheduling iteration runs in O(n/(αp) + (log p)/α) time.

6.2 Space and Time Bounds Using the Parallelized Scheduler

We now state the space and time bounds of a parallel computation, including scheduling overheads. The bounds assume that a constant fraction α of the p processors (for any 0 < α < 1) is dedicated to the task of scheduling. The detailed proofs are given in Appendix B.

Theorem 6.2.1. Let S1 be the space required by a 1DF-schedule for a computation with work W and depth D, and let Sa be the total space allocated in the computation. The parallelized scheduler with a memory threshold of K units generates a schedule on p processors that requires S1 + O(K · D · p · log p) space and O(W/p + Sa/(pK) + D · log p) time to execute.

These time and space bounds include scheduling overheads. The time bound is derived by counting the total number of timesteps during which the worker processors may be either idle or busy executing actions. The space bound is proved using an approach similar to that used in Section 5. When the total space allocated is Sa = O(W), the time bound reduces to O(W/p + D · log p). As with the serial scheduler, when the memory threshold K is set to a constant, the asymptotic space bound reduces to S1 + O(D · p · log p).
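In display form, the bounds of Theorem 6.2.1 and the specializations just noted can be summarized as follows (a restatement, not a new result):

\[
  T_p \;=\; O\!\left(\frac{W}{p} + \frac{S_a}{pK} + D\log p\right),
  \qquad
  S_p \;=\; S_1 + O(K \cdot D \cdot p \cdot \log p),
\]
which, when $S_a = O(W)$ and $K = O(1)$, reduce to
\[
  T_p \;=\; O\!\left(\frac{W}{p} + D\log p\right),
  \qquad
  S_p \;=\; S_1 + O(D \cdot p \cdot \log p).
\]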

7. EXPERIMENTAL RESULTS

We have built a runtime system that uses the AsyncDF algorithm to schedule parallel threads, and have run several experiments to analyze both the time and the memory required by parallel computations. In this section we briefly describe the implementation of the system and the benchmarks used to evaluate its performance, followed by the experimental results.

5The additions and deletions must skip over the stubs to the left of τ, which can add a delay of at most O((log p)/α).

7.1 Implementation

The runtime system has been implemented on a 16-processor SGI Power Challenge, which has a shared-memory architecture with processors and memory connected via a fast shared-bus interconnect. We implemented the serial version of the scheduler presented in Figure 6, because the number of processors on this architecture is not very large. The set of ready threads R is implemented as a simple, singly-linked list. Qin and Qout, which are accessed by both the scheduler and the workers, are required to support concurrent enqueue and dequeue operations. They are implemented using variants of previous lock-free algorithms based on atomic fetch-and-Φ primitives [Mellor-Crummey 1987].

The parallel programs executed using this system have been explicitly hand-coded in the continuation-passing style, similar to the code generated by the Cilk preprocessor6 [Blumofe et al. 1995]. Each continuation points to a C function representing the next computation of a thread, and a structure containing all its arguments. These continuations are created dynamically and moved between the queues. A worker processor takes a continuation off Qout and simply applies the function pointed to by the continuation to its arguments. The high-level program is broken into such functions at points where it executes a parallel fork, a recursive call, or a memory allocation.

For nested parallel loops, we group iterations of the innermost loop into equally sized chunks, provided it does not contain calls to any recursive functions.7 Scheduling a chunk at a time improves performance by reducing scheduling overheads and providing good locality, especially for fine-grained iterations.

Instead of preallocating a pool of memory for a thread every time it is scheduled, we use a memory counter to keep track of a thread's net memory allocation. The memory counter is initialized to the value of the memory threshold K when the thread is scheduled. The counter is appropriately decremented (incremented) when the thread allocates (deallocates) space. When the thread reaches a memory allocation that requires more memory than the current value of the counter, the thread is preempted, and the counter is reset to K units. Instead of explicitly creating dummy threads to delay an allocation of m bytes (m > K) in a thread, the thread is inserted into R with a delay counter initialized to the value m/K. The delay counter is appropriately decremented by the scheduling thread; each decrement by 1 represents the creation and scheduling of one dummy thread. The original thread is ready to execute once the value of the delay counter is reduced to zero. Unless stated otherwise, all the experiments described in this section were performed using K = 1000 bytes.
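The memory-counter and delay-counter mechanisms just described can be sketched in a few lines of C. The following is a simplification under our own naming assumptions, not the actual runtime code; in particular, the real system reinserts a preempted thread through R and the queues.

#define K 1000                 /* memory threshold, in bytes */

typedef struct thread {
    long mem_counter;          /* bytes this thread may still allocate    */
    long delay_counter;        /* pending dummy-thread decrements, if any */
    /* ... continuation and argument structure ... */
} thread_t;

/* Called when a scheduled thread requests m bytes (m < 0 on deallocation).
   Returns 0 if the thread may proceed, or 1 if it must be preempted and
   reinserted into R. */
int on_alloc(thread_t *t, long m)
{
    if (m > K) {
        /* A large allocation: instead of forking real dummy threads, delay
           the thread for m/K scheduling rounds; the scheduler decrements
           delay_counter once per round, each decrement standing for the
           creation and scheduling of one dummy thread. */
        t->delay_counter = m / K;
        return 1;
    }
    if (m > t->mem_counter) {
        t->mem_counter = K;    /* reset to K for the next time it runs */
        return 1;              /* preempt */
    }
    t->mem_counter -= m;       /* deallocations (m < 0) credit the counter */
    return 0;
}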

7.2 Benchmark Programs

We implemented five parallel programs on our runtime system. We briefly describe the implementation of these programs, along with the problem sizes we used in our experiments.

6We expect a preprocessor-generated version on our system to have efficiency similar to that of the straightforward hand-coded version.
7It should be possible to automate such coarsening with compiler support.


(1) Blocked recursive matrix multiply (Rec MM). This program multiplies two dense n × n matrices using a simple recursive divide-and-conquer method, and performs O(n^3) work. The recursion stops when the blocks are down to the size of 64 × 64, after which the standard row-column matrix multiply is executed serially. This algorithm significantly outperforms the row-column matrix multiply for large matrices (e.g., by a factor of over 4 for 1024 × 1024 matrices) because its use of blocks results in better cache locality. At each step, the eight recursive calls are made in parallel. Each recursive call needs to allocate temporary storage, which is deallocated before returning from the call (see the sketch following this list). The results reported are for the multiplication of two 1024 × 1024 matrices of double-precision floats.

(2) Strassen's matrix multiply (Str MM). The DAG for this algorithm is very similar to that of the blocked recursive matrix multiply, but it performs only O(n^2.807) work and makes seven recursive calls at each step [Strassen 1969]. Once again, a simple serial matrix multiply is used at the leaves of the recursion tree. The sizes of the matrices multiplied were the same as for the previous program.

(3) Fast multipole method (FMM). This is an n-body algorithm that calculates the forces between n bodies using O(n) work [Greengard 1987]. We have implemented the most time-consuming phases of the algorithm, which are a bottom-up traversal of the octree followed by a top-down traversal. In the top-down traversal, for each level of the octree, the forces on the cells in that level due to their neighboring cells are calculated in parallel. For each cell, the forces over all its neighbors are also calculated in parallel, for which temporary storage needs to be allocated. This storage is freed when the forces over the neighbors have been added to get the resulting force on that cell. With two levels of parallelism, the structure of this code looks very similar to the pseudocode described in Section 1. We executed the FMM on a uniform octree with 4 levels (8^3 leaves), using 5 multipole terms for the force calculation.

(4) Sparse matrix-vector multiplication (Sparse MV). This multiplies an m × n sparse matrix with an n × 1 dense vector. The dot product of each row of the matrix with the vector is calculated to get the corresponding element of the resulting vector. There are two levels of parallelism: over each row of the matrix, and over the elements of each row multiplied with the corresponding elements of the vector to calculate the dot product. For our experiments, we used m = 20 and n = 1,500,000, and 30% of the elements were nonzero. (Using a large value of n provides sufficient parallelism within a row, but using large values of m leads to a very large input matrix, making the amount of dynamic memory allocated in the program negligible in comparison.)

(5) ID3. The ID3 algorithm [Quinlan 1986] builds a decision tree from a set of training examples in a top-down manner, using a recursive divide-and-conquer strategy. At the root node, the attribute that best classifies the training data is picked, and recursive calls are made to build subtrees, with each subtree using only the training examples with a particular value of that attribute. Each recursive call is made in parallel, and the computation of picking the best attribute at a node, which involves counting the number of examples in each class for different values of each attribute, is also parallelized.
Temporary space is allocated to store the subset of training examples used to build each subtree and is freed once the subtree is built. We built a tree from 4 million test examples, each with 4 multivalued attributes.
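The structure of benchmark (1) above can be sketched as follows. This is our own serial C rendering of the control structure only; in the parallel version the eight quadrant multiplies are forked as child threads, and the matrix layout conventions here are our assumptions, not the benchmark's actual code.

#include <stdlib.h>

/* C += A*B for an n x n block of a row-major matrix with leading dimension ld. */
static void serial_mm(const double *A, const double *B, double *C, int n, int ld)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
}

/* Blocked recursive multiply; n is a power of two, as in the 1024 x 1024 runs.
   The top-level call is rec_mm(A, B, C, n, n) with C initialized to zero. */
void rec_mm(const double *A, const double *B, double *C, int n, int ld)
{
    if (n <= 64) { serial_mm(A, B, C, n, ld); return; }  /* serial leaf case */

    int h = n / 2;
    /* Temporary storage allocated by this call and freed before returning,
       as in the benchmark description. */
    double *T = calloc((size_t)4 * h * h, sizeof(double));

    #define Q(M, ldm, r, c) ((M) + (size_t)(r) * h * (ldm) + (size_t)(c) * h)
    /* Eight independent multiplies: in the runtime-system version each is a
       parallel fork.  Four accumulate into quadrants of C, four into T.   */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            rec_mm(Q(A, ld, i, 0), Q(B, ld, 0, j), Q(C, ld,  i, j), h, ld);
            rec_mm(Q(A, ld, i, 1), Q(B, ld, 1, j), Q(T, 2*h, i, j), h, 2*h);
        }
    #undef Q

    for (int i = 0; i < n; i++)          /* combine, then free the temporary */
        for (int j = 0; j < n; j++)
            C[i*ld + j] += T[i*2*h + j];
    free(T);
}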


Fig. 9. The speedups achieved on up to 16 R10000 processors of a Power Challenge machine, using a value of K = 1000 bytes. The speedup on p processors is the time taken for the serial C version of the program divided by the time for our runtime system to run it on p processors. For each application (FMM, Rec MM, Str MM, ID3, and Sparse MV), the solid line represents the speedup using our system, while the dashed line represents the speedup using the Cilk system. All programs were compiled using gcc -O2 -mips2.

7.3 Time Performance

Figure 9 shows the speedups for the above programs on up to 16 processors. The speedup for each program is with respect to its efficient serial C version, which does not use our runtime system. Since the serial C program runs faster than our runtime system on a single processor, the speedup shown for one processor is less than 1. However, for all the programs it is close to 1, implying that the overheads in our system are low. The timings on our system include the delay introduced before large allocations, in the form of m/K dummy nodes (K = 1000 bytes) for an allocation of m bytes.

Figure 9 also shows the speedups for the same programs running on an existing space-efficient system, Cilk [Blumofe et al. 1995], version 5.0. To make a fair comparison, we have chunked innermost iterations of the Cilk programs in the same manner as we did for our programs. The timings show that the performance on our system is comparable with that on Cilk. The memory-intensive programs such as sparse matrix-vector multiply do not scale well on either system beyond 12 processors; their performance is probably affected by bus contention as the number of processors increases.

Figure 10 shows the breakdown of the running time for one of the programs, blocked recursive matrix multiplication. The results show that the percentage of time spent by workers waiting for threads to appear in Qout increases as the number of processors increases (because we use a serial scheduler). A parallel implementation of the scheduler, such as the one described in Section 6, would be more efficient on a larger number of processors.


Fig. 10. The total processor time (the running time multiplied by the number of processors p) for blocked recursive matrix multiplication. "Serial work" is the time taken by a single processor executing the equivalent serial C program; for ideal speedups, all the other components would be zero. The other components are overheads of the parallel execution and the runtime system: "idle time" is the total time spent waiting for threads to appear in Qout; "queue access" is the total time spent by the worker processors inserting threads into Qin and removing them from Qout; "scheduling" is the total time spent executing the scheduler; and "work overhead" includes the overheads of creating continuations, building structures to hold arguments, executing dummy nodes, and (de)allocating memory from a shared pool, as well as the effects of cache misses and bus contention.

7.4 Space Performance

Figure 11 shows the memory usage for each application. Here we compare three implementations of each program: one in Cilk and the other two using our scheduling system. Of the two implementations on our system, one uses dummy nodes to delay large allocations, while the other does not. For the programs we have implemented, the version without the delay results in approximately the same space requirement as would result from scheduling the outermost level of parallelism. For example, in Strassen's matrix multiplication, our algorithm without the delay would allocate the temporary space required for p branches at the top level of the recursion tree before reverting to the execution of the subtree under the first branch. On the other hand, scheduling the outer parallelism would allocate space for the p branches at the top level, with each processor executing a subtree serially. Hence we use our algorithm without the delay to estimate the memory requirements of previous techniques [Chow and Harrison 1990; Hummel and Schonberg 1991], which schedule the outer parallelism with higher priority. Cilk uses less memory than this estimate due to its use of randomization: an idle processor steals the topmost thread (representing the outermost parallelism) from the private queue of a randomly picked processor; this thread may not represent the outermost parallelism in the entire computation. A number of previous techniques [Burton 1988; Burton and Simpson 1994; Goldstein et al. 1995; Halbherr et al. 1994; Mohr et al. 1991; Nikhil 1994; Vandevoorde and Roberts 1988] use a strategy similar to that of Cilk. Our results show that when big allocations are delayed with dummy nodes, our algorithm results in significantly lower memory usage, particularly as the number of processors increases.


Fig. 11. The memory requirements of the parallel programs, with one panel each for Recursive Matrix Multiply, Strassen's Matrix Multiply, Fast Multipole Method, Sparse Matrix-Vector Multiply, and ID3 Decision Tree Building. For p = 1 the memory usage shown is for the serial C version. We compare the memory usage of each program when the big memory allocations are delayed by inserting dummy threads (using K = 1000) with when they are allowed to proceed without any delay (as an approximation of previous schedulers), as well as with the memory usage on Cilk; each panel also shows the input size.

A notable exception is the ID3 benchmark, for which our scheduler results in a space requirement similar to that of Cilk. This is because the value of K (1000 bytes) is too large to sufficiently delay the large allocations of space until higher-priority threads become ready. Note that we have not compared our scheduler to naive scheduling techniques, such as the breadth-first schedules resulting from the use of FIFO queues, which have much higher memory requirements.

Space-Time Trade-off. The number of dummy nodes introduced before a large allocation and the frequency of thread preemptions depend on the value of the memory threshold K. Adjusting the value of K provides a trade-off between the memory usage and the running time of a parallel computation. (For instance, with K = 1000 bytes, a single 1 MB allocation is preceded by roughly a thousand dummy-node decrements.) For example, Figure 1 shows how the running time and memory usage for blocked recursive matrix multiplication are affected by K. For very small K, many dummy nodes are inserted, and threads are preempted often; this results in a high running time. However, the execution order (and therefore the space requirement) of the parallel schedule is close to that of the serial schedule, leading to a low memory requirement. For very large K, very few dummy nodes are inserted, and threads are allowed to execute longer without being preempted; this leads to a high memory requirement and a low running time. For all the programs we implemented, the trade-off curves looked similar; however, they may vary for other parallel programs. A default value of K = 1000 bytes resulted in a good balance between space and time performance for all our programs, although in practice it might be useful to allow users to tune the parameter for their needs.

8. SUMMARY AND DISCUSSION

We have presented an asynchronous scheduling algorithm, AsyncDF, for languages that support nested parallelism, and have shown that it is space efficient and time efficient in both theory and practice. The space bound presented in this article is significantly lower than the space bounds of existing space-efficient systems for programs with a sufficient amount of parallelism (D ≪ S1). Most parallel programs, including all programs in NC [Cook 1985], fall in this category. This article analyzes the AsyncDF algorithm with both a serial and a parallel scheduler. We have built a low-overhead runtime system that schedules parallel threads using the algorithm. The results demonstrate that our approach is more effective in reducing space usage than previous scheduling techniques, and at the same time yields good parallel performance. A more detailed specification of the theoretical framework on which the scheduling algorithm is based can be found elsewhere [Narlikar and Blelloch 1996].

We have added our scheduling technique to a native, lightweight implementation of POSIX standard threads [IEEE 1985], or Pthreads, on Solaris [Powell et al. 1991]. The results indicate that the new scheduler, unlike the existing FIFO scheduler, allows irregular and dynamic parallel programs written with a very large number of Pthreads to execute efficiently [Narlikar and Blelloch 1998]. We are currently working on methods to further improve the scheduling algorithm, particularly to provide better support for fine-grained threads. At present, fine-grained iterations of innermost loops are statically grouped into fixed-size chunks.

A dynamic, decreasing-size chunking scheme [Hummel et al. 1992; Polychronopoulos and Kuck 1987; Tzen and Ni 1993] can be used instead. We are working on an algorithm to automatically coarsen the computations at runtime by allowing the execution order to differ to a limited extent from the 1DF-numbers, and by using ordered, per-processor queues. Preliminary results indicate that this algorithm also results in reduced scheduling contention, because the processors access separate queues for a majority of the time.

APPENDIX

A. TIGHTER BOUND ON THE SPACE REQUIREMENT

In Section 5 we showed that algorithm AsyncDF executes a parallel computation with depth D and serial space requirement S1 on p processors using S1 + O(p · D) space. In particular, when actions that allocate space are represented by heavy nodes, and each heavy node allocates at most K space (the value of the memory threshold), we showed that any prefix of the parallel computation has O(p · D) heavy premature nodes. Here D is the maximum number of actions along any path in the program DAG. Therefore, the value of D and the number of premature nodes (and hence the space bound) depend on the definition of an action; recall that an action is a "unit" of work that may allocate or deallocate space, and requires a timestep to be executed. An action may be as small as a fraction of a machine instruction, or as large as several machine instructions, depending on the definition of a timestep. In this section, we give a more precise space bound by specifying the bound in terms of a ratio of the depth D and the number of actions between consecutive heavy nodes. Being a ratio of two values specified in terms of actions, it is no longer dependent on the granularity of an action.

Recall that heavy nodes may allocate K space and use it for subsequent actions, until the thread runs out of space and needs to perform another allocation. Thus, threads typically have heavy nodes followed by a large number of light nodes, and the number of allocations (heavy nodes) along any path may be much smaller than the depth of the computation. We define the granularity g of the computation to be the minimum number of actions (nodes) between two consecutive heavy nodes on any path in the program DAG, that is, the minimum number of actions executed nonpreemptively by a thread every time it is scheduled. Note that the granularity of a computation depends on the value of the memory threshold K. In this section, we prove that the number of heavy premature nodes is O(p · D/g), and therefore the parallel space requirement is S1 + O(p · D/g).
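In symbols (a restatement of the definitions above, using the article's notation):

\[
  g \;=\; \min_{\text{paths } P \text{ in } G}\;
  \bigl\{\text{number of actions between consecutive heavy nodes on } P\bigr\},
  \qquad
  S_p \;=\; S_1 + O\!\left(\frac{p \cdot D}{g}\right).
\]
Since $g \ge 1$, this refines, and never exceeds, the earlier bound $S_1 + O(p \cdot D)$.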

Lemma A.1. Let G be a DAG with W nodes, depth D, and granularity g, in which every node allocates at most K space. Let s1 be the 1DF-schedule for G, and sp the parallel schedule for G executed by the AsyncDF algorithm on p processors. Then the number of heavy premature nodes in any prefix of sp with respect to the corresponding prefix of s1 is O(p · D/g).

Proof. The proof is similar to the proof of Lemma 5.1.1. Consider an arbitrary prefix σp of sp, and let σ1 be the corresponding prefix of s1. As with Lemma 5.1.1, we pick a path P from the root to the last nonpremature node v to be executed in σp, such that for every edge (u, u′) along the path, u is the last parent of u′ to be executed. Let ui be the ith heavy node along P; let δ be the number of heavy nodes on P. Let ti be the timestep in which ui gets executed; let tδ+1 be the last timestep in σp. For i = 1, ..., δ, let Ii be the interval {ti + 1, ..., ti+1}.

Consider any interval Ii, for i = 1, ..., δ − 1. Let li be the number of nodes between ui and ui+1. Since all these nodes are light nodes, they get executed at timesteps ti + 1, ..., ti + li. During these timesteps, the other p − 1 worker processors may execute heavy nodes; however, for each heavy node, they must execute at least (g − 1) light nodes. Therefore, each of these worker processors may execute at most ⌈li/g⌉ heavy premature nodes during the first li timesteps of interval Ii. At the end of timestep ti + li, there may be at most p nodes in Qout. Before the thread τ containing ui is inserted into Qin, at most another O(p) heavy premature nodes may be added to Qout. Further, (p − 1) heavy premature nodes may execute along with ui+1. Hence, in addition to the ⌈li/g⌉ heavy premature nodes per worker processor counted above, at most O(p) heavy premature nodes may appear in interval Ii; a similar argument bounds the final interval Iδ. Summing over all the intervals, the number of heavy premature nodes in σp is at most O(Σi (p · ⌈li/g⌉ + p)) = O(p · D/g + p · δ). Since consecutive heavy nodes on P are separated by at least g actions, δ ≤ D/g; further, Σi li ≤ D. The total number of heavy premature nodes is therefore O(p · D/g).

Similarly, we can prove that the scheduling queues require O(p · D/g) space; therefore, the computation requires a total of S1 + O(p · D/g) space to execute on p processors.

Now consider a computation in which individual nodes may allocate more than K space. Let g1 be the original granularity of the DAG (obtained by simply treating the nodes that perform large allocations as heavy nodes). The granularity of this DAG may change when we add dummy nodes before large allocations. Let g2 be the number of actions associated with the creation and execution of each of the dummy threads that we add to the DAG. Then the granularity of the transformed DAG is g = min(g1, g2). The space bound using the parallelized scheduler can be similarly modified to S1 + O(D · p · log p/g). Besides making the space bound independent of the definition of an action, this modified bound is significantly lower than the original bound for programs with a high granularity g, that is, programs that perform a large number of actions between allocations, forks, or synchronizations.


B. PROOFS FOR THE SPACE AND TIME BOUNDS USING THE PARALLELIZED SCHEDULER

In this section, we prove the space and time bounds for a parallel computation executed using the parallelized scheduler, as stated in Section 6 (Theorem 6.2.1). These bounds include the space and time overheads of the scheduler. We first define a class of DAGs that is more general than the DAGs used to represent parallel computations so far. This class of DAGs will be used to represent the computation that is executed using the parallelized scheduler.

Latency-Weighted DAGs. We extend the definition of a program DAG by allowing nonnegative weights on the edges; we call this new DAG a latency-weighted DAG. Let G = (V, E) be a latency-weighted DAG representing a parallel computation. Each edge (u, v) ∈ E has a nonnegative weight l(u, v) which represents the latency between the actions of the nodes u and v. The latency-weighted length of a path in G is the total number of nodes in the path plus the sum of the latencies on the edges along the path. We define the latency-weighted depth Dl of G to be the maximum over the latency-weighted lengths of all paths in G. Since all latencies are nonnegative, Dl ≥ D. The program DAG described in Section 3 is a special case of a latency-weighted DAG, in which the latencies on all the edges are zero. We will use nonzero latencies to model the delays caused by the parallelized scheduler.

Let te(v) be the timestep in which a node v ∈ V gets executed. Then v becomes ready at a timestep i ≤ te(v) such that i = max_(u,v)∈E (te(u) + l(u, v) + 1). Thus, a p-schedule V1, V2, ..., VT for a latency-weighted DAG must obey the latencies on the edges, that is, ∀(u, v) ∈ E, u ∈ Vj and v ∈ Vi ⇒ i > j + l(u, v). We can now bound the time required to execute a greedy schedule for latency-weighted DAGs; our proof uses an approach similar to that used by Blumofe and Leiserson [1993] for DAGs without latencies.

Lemma B.1. Given a latency-weighted computation graph G with W nodes and latency-weighted depth Dl, any greedy p-schedule of G will require at most W/p + Dl timesteps.

Proof. We transform G into a DAG G′ without latencies by replacing each edge (u, v) with a chain of l(u, v) dummy nodes. The dummy nodes do not represent real work, but require a timestep to be executed. Any dummy node that becomes ready at the end of timestep t − 1 is automatically executed in timestep t. Therefore, replacing each edge (u, v) with l(u, v) dummy nodes imposes the required condition that v becomes ready l(u, v) timesteps after u is executed. The depth of G′ is Dl.

Consider any greedy p-schedule sp = (V1, ..., VT) of G. sp can be converted to a schedule s′p = (V′1, ..., V′T) of G′ by adding (executing) dummy nodes as soon as they become ready. Thus, for i = 1, ..., T, V′i may contain at most p real nodes and an arbitrary number of dummy nodes, because dummy nodes do not require processors to be executed. A real node in s′p becomes ready at the same timestep as it does in sp, since dummy nodes now represent the latencies on the edges. Therefore, s′p is also a "greedy" p-schedule for G′; that is, at a timestep when n real nodes are ready, min(n, p) of them get executed, and all dummy nodes ready in that timestep get executed.


We now prove that any greedy p-schedule s′p of G′ requires at most W/p + Dl timesteps to execute. Let G′i denote the subgraph of G′ containing the nodes that have not yet been executed at the beginning of timestep i; then G′1 = G′. Let ni be the number of real nodes (not including dummy nodes) executed in timestep i. Since s′p is a p-schedule, ni ≤ p. If ni = p, there can be at most W/p such timesteps, because there are W real nodes in the graph. If ni < p, then every node of G′i that is ready in timestep i gets executed (the real nodes because the schedule is greedy, and the dummy nodes automatically); hence the depth of G′i+1 is one less than the depth of G′i. Since G′1 has depth Dl, there can be at most Dl such timesteps. Therefore s′p, and hence sp, requires at most W/p + Dl timesteps to execute.

Lemma B.2. Consider a parallel computation with work W and depth D, in which every action allocates at most K space. Using the parallelized scheduler with αp processors acting as schedulers, the remaining (1 − α)p worker processors execute a latency-weighted DAG with O(W) work, O(D) depth, and a latency-weighted depth of O(W/(αp) + (D · log p)/α). Further, after the last parent of a node in the DAG is executed, at most one scheduling iteration may complete before the node becomes ready.

Proof. Let G be the resulting DAG executed by the worker processors. Each thread is executed nonpreemptively as long as it does not terminate or suspend, and does not need to allocate more than a net of K units of memory. Each time a thread is scheduled and then preempted or suspended, a processor performs two constant-time accesses to the queues Qout and Qin. As shown in Figure 12, we represent these accesses as a series of a constant number of actions (nodes) added to the thread; these nodes are added both before a heavy node (to model the delay while accessing Qout) and after the series of light nodes that follow the heavy node (to model the delay while accessing Qin). We will now consider the first of these added nodes to be the heavy node, instead of the real heavy node that allocates space; this gives us a conservative bound on the space requirement of the parallel computation, because we are assuming that the memory allocation has moved to an earlier time. A thread executes at least one action from the original computation every time it is scheduled. Since the original computation has W nodes, the total work performed by the worker processors is O(W); that is, the resulting DAG has



Fig. 12. (a) A portion of the original computation DAG and (b) the corresponding portion of the DAG as executed by the parallelized scheduler. This portion is the sequence of nodes in a thread executed nonpreemptively on a processor, and therefore consists of a heavy node followed by a series of light nodes. The DAG executed by the parallelized scheduler has latencies l1 and l2 imposed by scheduling iterations (shown as bold edges here), while the additional gray nodes represent the constant delay to access Qin and Qout. We consider the first of these gray nodes to be a heavy node, instead of the original heavy node that performs the real allocation.

O(W) work; similarly, its depth is O(D).

Next, we show that at most one scheduling iteration begins or completes after a node is executed and before its child becomes ready. Consider any thread τ in G, and let v be a node in the thread. Let t be the timestep in which the last parent u of v is completed. If v is a light node, it is executed in the next timestep. Otherwise, the thread containing u is placed in Qin at timestep t. (Recall that we have already added nodes such as u to represent the access overhead for Qin.) In the worst case, a scheduling iteration may be in progress. However, the next scheduling iteration must find u in Qin; this scheduling iteration moves u to R and makes v ready to be scheduled before the iteration completes.

Finally, we show that G has a latency-weighted depth of O(W/(αp) + (D · log p)/α). Consider any path in G. For any edge e = (u, v) along the path, if u is the last parent of v, we just showed that v becomes ready by the end of at most two scheduling iterations after u is executed. Therefore the latency l(u, v) is at most the duration of these two scheduling iterations. Let n and n′ be the numbers of dead threads deleted by these two scheduling iterations, respectively. Then, using Lemma 6.1.1, l(u, v) = O(n/(αp) + n′/(αp) + (log p)/α). Because each thread is deleted by the scheduler at most once, a total of O(W) deletions take place. Since any path in G has O(D) edges, the latency-weighted length of the path is O(D) plus the sum of the latencies on its O(D) edges, which is O(W/(αp) + (D · log p)/α).
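The summation in the last step can be displayed explicitly (a restatement of the argument above, not a new step): over a path with O(D) edges, each latency costs at most two scheduling iterations, so

\[
  \sum_{(u,v)} l(u,v)
  \;=\; O\!\left(\sum_{i}\frac{n_i}{\alpha p}\right)
        + O(D)\cdot O\!\left(\frac{\log p}{\alpha}\right)
  \;=\; O\!\left(\frac{W}{\alpha p} + \frac{D\log p}{\alpha}\right),
\]
where $n_i$ is the number of dead threads deleted by the $i$th scheduling iteration and $\sum_i n_i = O(W)$.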

The schedule generated by the parallelized scheduler for the latency-weighted DAG is a (1 − α)p-schedule, because it is executed on (1 − α)p worker processors.

B.1 Time Bound

We can now bound the total running time of the resulting schedule.


Lemma B.3. Consider a parallel computation with depth D and work W. For any 0 < α < 1, when αp of the processors are dedicated to execute as schedulers while the remaining act as worker processors, the parallel computation is executed in O(W/(α(1 − α)p) + (D · log p)/α) time.

Proof. Let G be the DAG executed by the parallelized scheduler for this computation. We will show that the generated schedule sp of G is a greedy schedule, with O(W/p) additional timesteps in which the worker processors may be idle. Consider any scheduling iteration. Let ti be the timestep at which the ith scheduling iteration ends. After threads are inserted into Qout by the ith scheduling iteration, there are two possibilities:

(1) |Qout| < p log p, in which case qo = r, that is, every thread that was ready at the start of the iteration has been moved to Qout; or (2) |Qout| = p log p, that is, Qout is full.


For a parallel computation with depth D, the parallelized scheduler executes a DAG of depth O(D) (using Lemma B.2). Therefore, using an approach similar to that of Lemma 5.1.1, we can prove the following bound for the parallelized scheduler.

Lemma B.4. For a parallel computation with depth D executing on p processors, the number of premature nodes in any prefix of the schedule generated by the parallelized scheduler is O(D · p · log p).

Lemma B.5. The total space required for storing threads in Qin, Qout, and R while executing a parallel computation of depth D on p processors is O(D · p · log p).

Proof. Qout may hold at most p log p threads at any time. Similarly, Qin may hold at most 2p · log p + (1 − α)p threads, which is the maximum number of active threads. Each thread can be represented using a constant amount of space. Therefore the space required for Qin and Qout is O(p · log p).

We now bound the space required for R. Recall that R consists of live threads and dead threads. Consider any prefix σp of the parallel schedule sp. σp may have at most O(D · p · log p) heavy premature nodes (using Lemma B.4). We call a thread a premature thread if at least one of its heavy nodes that was scheduled or put on Qout is premature. The number of live threads is at most the number of premature threads, plus the number of stubs (which is O(p · log p)), plus the number of live threads that are not premature (which is bounded by the maximum number of live threads in the serial schedule). A 1DF-schedule may have at most D live threads at any timestep. Therefore, the total number of live threads at any timestep is at most O(D · p · log p).

The scheduler performs lazy deletions of dead threads; therefore, the number of dead threads in R must also be counted. Let λi be the ith scheduling iteration that schedules at least one thread. Consider any such iteration λi. Recall that this iteration must delete at least all the dead threads up to the second ready or seed thread in R. We will show that after scheduling iteration λi performs its deletions, all remaining dead threads in R must be premature. Let τ1 and τ2 be the first two ready (or seed) threads in R. Since the scheduler deletes all dead threads up to τ2, there can be no more dead threads to the immediate right of τ1 that have a lower 1DF-number than τ1; that is, τ1 now has a lower 1DF-number than all the dead threads in R. Because all the remaining dead threads were executed before the ready thread τ1 (or the ready threads represented by τ1), they must be premature. Therefore, all dead threads in R at the end of scheduling iteration λi must be premature, and are therefore O(D · p · log p) in number (using Lemma B.4). The scheduling iterations between λi and λi+1 do not schedule any threads, and therefore do not create any new entries in R. They may, however, mark existing live threads as dead. Thus, the number of dead threads in R may increase, but the total number of threads in R remains the same. Scheduling iteration λi+1 must delete all dead threads up to the second ready thread in R. Therefore, before it creates any new threads, the iteration reduces the number of dead threads back to O(D · p · log p). Thus, at any time, the total space required for R is O(D · p · log p).


Since every premature node may allocate at most K space, we can now state the following space bound using Lemmas B.4 and B.5.

Lemma B.6. A parallel computation with work W and depth D, in which every node allocates at most K space, and which requires S1 space to execute on one processor, can be executed on p processors in S1 + O(K · D · p · log p) space (including scheduler space) using the parallelized scheduler.

As in Section 5, allocations larger than K units are handled by delaying the allocation with parallel dummy threads. If Sa is the total space allocated, the number of dummy nodes added is at most Sa/K, and the depth is increased by a constant factor. Therefore, using the parallelized scheduler, the final time bound of O(W/p + Sa/(pK) + D · log p) and the space bound of S1 + O(K · D · p · log p) follow, as stated in Theorem 6.2.1. These bounds include the scheduler overheads.

ACKNOWLEDGMENTS

Access to the SGI Power Challenge was provided by the National Center for Supercomputing Applications (NCSA), whose staff was extremely helpful when dealing with all our requests.

REFERENCES

Arvind, Nikhil, R. S., and Pingali, K. 1989. I-structures: Data structures for parallel computing. ACM Trans. on Programm. Lang. Syst. 11, 4 (Oct.), 598–632.
Bershad, B. N., Lazowska, E., and Levy, H. 1988. PRESTO: A system for object-oriented parallel programming. Softw. Pract. and Exper. 18, 8 (Aug.), 713–732.
Blelloch, G. E., Chatterjee, S., Hardwick, J. C., Sipelstein, J., and Zagha, M. 1994. Implementation of a portable nested data-parallel language. Journal of Parallel and Distributed Computing 21, 1 (April), 4–14.
Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1995. Provably efficient scheduling for languages with fine-grained parallelism. In Proc. Symposium on Parallel Algorithms and Architectures, Santa Barbara, CA, pp. 420–430.
Blumofe, R. D., Frigo, M., Joerg, C. F., Leiserson, C. E., and Randall, K. H. 1996. An analysis of dag-consistent distributed shared-memory algorithms. In Proc. Symposium on Parallel Algorithms and Architectures, pp. 297–308.
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1995. Cilk: An efficient multithreaded runtime system. In Proc. Symposium on Principles and Practice of Parallel Programming, pp. 207–216.
Blumofe, R. D. and Leiserson, C. E. 1993. Space-efficient scheduling of multithreaded computations. In Proc. 25th ACM Symposium on Theory of Computing, pp. 362–371.
Blumofe, R. D. and Leiserson, C. E. 1994. Scheduling multithreaded computations by work stealing. In Proc. 35th IEEE Symposium on Foundations of Computer Science, pp. 356–368.
Burton, F. W. 1988. Storage management in virtual tree machines. IEEE Trans. on Computers 37, 3, 321–328.
Burton, F. W. and Simpson, D. J. 1994. Space efficient execution of deterministic parallel programs. Manuscript.
Burton, F. W. and Sleep, M. R. 1981. Executing functional programs on a virtual tree of processors. In Conference on Functional Programming Languages and Computer Architecture.
Chandra, R., Gupta, A., and Hennessy, J. 1994. COOL: An object-based language for parallel programming. IEEE Computer 27, 8 (Aug.), 13–26.
Chandy, K. M. and Kesselman, C. 1992. Compositional C++: Compositional parallel programming. In Proc. 5th Intl. Wkshp. on Languages and Compilers for Parallel Computing, New Haven, CT, pp. 124–144.

ACM Transactions on Programming Languages and Systems, Vol. 21, No. 1, January 1999. 172 · G. J. Narlikar and G. E. Blelloch

Chase, J. S., Amador, F. G., and Lazowska, E. D. 1989. The Amber system: Parallel programming on a network of multiprocessors. In Proc. Symposium on Operating Systems Principles.
Chow, J. H. and Harrison, W. L. 1990. Switch-stacks: A scheme for microtasking nested parallel loops. In Proc. Supercomputing, New York, NY.
Cook, S. A. 1985. A taxonomy of problems with fast parallel algorithms. Information and Control 64, 2–22.
Culler, D. E. and Arvind 1988. Resource requirements of dataflow programs. In Proc. Intl. Symposium on Computer Architecture.
Feo, J. T., Cann, D. C., and Oldehoeft, R. R. 1990. A report on the Sisal language project. Journal of Parallel and Distributed Computing 10, 4 (Dec.), 349–366.
Freeh, V. W., Lowenthal, D. K., and Andrews, G. R. 1994. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In 1st Symposium on Operating Systems Design and Implementation, Monterey, CA, pp. 201–212.
Goldstein, S. C., Culler, D. E., and Schauser, K. E. 1995. Enabling primitives for compiling parallel languages. In 3rd Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, Rochester, NY.
Greengard, L. 1987. The Rapid Evaluation of Potential Fields in Particle Systems. The MIT Press.
Halbherr, M., Zhou, Y., and Joerg, C. F. 1994. Parallel programming based on continuation-passing thread. In Proc. 2nd International Workshop on Massive Parallelism: Hardware, Software and Applications, Capri, Italy.
Halstead, R. H. 1985. Multilisp: A language for concurrent symbolic computation. ACM Trans. on Programming Languages and Systems 7, 4, 501–538.
HPF Forum 1993. High Performance Fortran language specification, version 1.0.
Hsieh, W. E., Wang, P., and Weihl, W. E. 1993. Computation migration: Enhancing locality for distributed-memory parallel systems. In Proc. Symposium on Principles and Practice of Parallel Programming, San Francisco, CA.
Hummel, S. F. and Schonberg, E. 1991. Low-overhead scheduling of nested parallelism. IBM Journal of Research and Development 35, 5–6, 743–765.
Hummel, S. F., Schonberg, E., and Flynn, L. E. 1992. Factoring: A method for scheduling parallel loops. Commun. ACM 35, 8 (Aug.), 90–101.
IEEE 1985. Threads extension for portable operating systems (draft 6).
Polychronopoulos, C. D. and Kuck, D. J. 1987. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. on Computers C-36, 12 (Dec.), 1425–1439.
Mellor-Crummey, J. M. 1987. Concurrent queues: Practical fetch-and-Φ algorithms. Technical Report 229 (Nov.), University of Rochester.
Mills, P. H., Nyland, L. S., Prins, J. F., Reif, J. H., and Wagner, R. A. 1990. Prototyping parallel and distributed programs in Proteus. Technical Report UNC-CH TR90-041, Computer Science Department, University of North Carolina.
Mohr, E., Kranz, D., and Halstead, R. 1991. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Trans. on Parallel and Distributed Systems 2, 3 (July), 264–280.
Mueller, F. 1993. A library implementation of POSIX threads under UNIX. In Proc. Winter 1993 USENIX Technical Conference and Exhibition, San Diego, CA, pp. 29–41.
Narlikar, G. J. 1999. Space-efficient multithreading. Ph.D. thesis, School of Computer Science, Carnegie Mellon University. To appear.
Narlikar, G. J. and Blelloch, G. E. 1996. A framework for space and time efficient scheduling of parallelism. Technical Report CMU-CS-96-197, Computer Science Department, Carnegie Mellon University.
Narlikar, G. J. and Blelloch, G. E. 1998. Pthreads for dynamic and irregular parallelism. In Proc. SC98: High Performance Networking and Computing, Orlando, FL. IEEE.
Nikhil, R. S. 1994. Cid: A parallel, shared-memory C for distributed-memory machines. In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing, pp. 376–390.


Powell, M. L., Kleiman, S. R., Barton, S., Shah, D., Stein, D., and Weeks, M. 1991. SunOS multi-thread architecture. In Proc. Winter 1991 USENIX Conference, Dallas, TX, pp. 65–80.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1, 1, 81–106.
Rinard, M. C., Scales, D. J., and Lam, M. S. 1993. Jade: A high-level, machine-independent language for parallel programming. IEEE Computer 26, 6 (June), 28–38.
Rogers, A., Carlisle, M., Reppy, J., and Hendren, L. 1995. Supporting dynamic data structures on distributed memory machines. ACM Trans. on Programm. Lang. Syst. 17, 2 (March), 233–263.
Ruggiero, C. A. and Sargeant, J. 1987. Control of parallelism in the Manchester dataflow machine. In Functional Programming Languages and Computer Architecture, Volume 274 of Lecture Notes in Computer Science, pp. 1–15. Springer-Verlag.
Strassen, V. 1969. Gaussian elimination is not optimal. Numerische Mathematik 13, 354–356.
Tzen, T. H. and Ni, L. M. 1993. Trapezoid self-scheduling: A practical scheduling scheme for parallel compilers. IEEE Trans. on Parallel and Distributed Systems 4, 1 (Jan.), 87–98.
Vandevoorde, M. T. and Roberts, E. S. 1988. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming 17, 4 (Aug.), 347–366.

Received May 1998; accepted November 1998
