
Scalable Load Distribution and Load Balancing for Dynamic Parallel Programs

E. Berger and J. C. Browne
Department of Computer Science, University of Texas at Austin
Austin, Texas 78701 USA
01-512-471-{9734,9579}
{emery,browne}@cs.utexas.edu

ABSTRACT
This paper reports the design and preliminary evaluation of an integrated load distribution-load balancing algorithm which was targeted to be both efficient and scalable for dynamically structured computations. The computation is represented as a dynamic hierarchical dependence graph. Each node of the graph may be a subgraph or a computation, and the number of instances of each node is dynamically determined at runtime. The algorithm combines an initial partitioning of the graph with application of randomized work stealing on the basis of subgraphs, both to refine imbalances in the initial partitioning and to balance the additional computational work generated by runtime instantiation of subgraphs and nodes. Dynamic computations are modeled by an artificial program (knary) in which the amount of parallelism is parameterized. Experiments on IBM SP2s suggest that the load balancing algorithm is efficient and scalable for parallelism up to 10,000 parallel threads for closely coupled architectures.

Keywords: scalable, load distribution, load balancing, work stealing

1. INTRODUCTION
This research is motivated by dynamically structured computations executing on dynamic parallel and distributed resource sets. Maintenance of a "good" balance of work across processing elements is essential for such computational systems.

The parallel programs most in need of effective load distribution and load balancing (LD/LB) are those which are both large and dynamically structured. LD/LB is often rendered even more difficult by constraints on moving units of computation created by the large volumes of data often associated with them. Scalability of LD/LB has received relatively little attention.

Little attention has been paid to combining load distribution and load balancing for parallel programs which dynamically create and delete units of computation during execution. This paper proposes a load distribution/load balancing (LD/LB) algorithm which combines an initial static graph partitioning with dynamic workload distribution by the "work stealing" algorithm [3,4]. Work stealing is attractive as a dynamic scheduling algorithm because it can be shown to be optimal under plausible circumstances and because it admits derivation of bounding values for important problem parameters. This study, while ultimately experimental, is based on theoretical models of the variants of work stealing which we use.

We describe several methods of load distribution based on simple (non-optimal) but plausible graph partitioning and analyze the effect of combining these methods with work stealing. We show that integrating work stealing with load distribution improves performance, and that combining work stealing with a load distribution algorithm which uses information on data locality in the graph gives the best performance improvement. It is further shown that the algorithm scales according to the following definition of scalability.

Load Distribution - A load distribution is an assignment of work to a set of processing elements.

Load Balancing - Load balancing is the process of transferring units of work among processing elements during execution to maintain balance across processing elements.

Scalability - A scalable LD/LB algorithm must be both efficient in execution and must attain load balance for high degrees of parallelism and large numbers of processing units. Let n be the number of units of computation which are to be executed during the execution of the total computation. Let p be the number of processors over which the units of computation are to be distributed. Efficiency implies that the total time taken to execute the LD/LB method should be negligible for small values of n and should grow no faster than O(p) for large values of n and p. Effectiveness implies that the degree of load balance attained should be independent of n and p.
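These two requirements can be stated slightly more formally. Writing T_LDLB(n, p) for the time spent executing the LD/LB mechanism itself and W_i for the work assigned to processor i, one possible formalization (the particular imbalance measure below is an illustrative choice, not part of the definition) is:

    T_LDLB(n, p) = O(p) for large n and p (efficiency), and
    max_i W_i / ((1/p) * sum_i W_i) bounded by a constant independent of n and p (effectiveness).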
2. ALGORITHM DEFINITION

2.1 Overview
The most commonly used load distribution algorithms are based on graph partitioning algorithms [references]. We adopt the representation of computations as directed acyclic graphs (dags) in which arcs represent dependencies (serial and parallel) and nodes represent computations.

Let w be the total amount of computation, or work, in the dag. Let d be the depth of the dag (the critical path of the computation). Then w corresponds to the runtime for a serial execution of the program, while d is the runtime on an infinite-processor machine.

Let s be the speedup, the runtime for a serial execution divided by the runtime for a parallel execution. Then the maximum speedup for p processors is maxs(p) = w / (w/p + d); it is not possible to do better than to do all of the work in parallel and then add the critical path. The maximum exploitable parallelism in a given computation corresponds to the maximum possible speedup on an infinite number of processors, maxs(∞) = w/d.

The central result of [3,4] states that for any dag, if work stealing is used to schedule computations, the expected runtime on p processors is T_p = w/p + O(d). However, this result ignores the problem of load distribution, that is, the mapping of processes to processors. While the model of [4] allows computations to be executed on any processor, for many high-performance codes processes must be more or less permanently assigned to processors because they rely on large amounts of data that would be expensive to move.

We describe several methods of load distribution based on simple (non-optimal) but plausible graph partitioning and analyze the effect of combining these methods with work stealing. We show that integrating work stealing with load distribution improves performance, and that combining work stealing with a load distribution algorithm which uses information on data locality in the graph gives the best performance improvement. It is further suggested that the algorithm scales according to the preceding definition of scalability.
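The work/critical-path model used throughout this section is simple enough to capture in a few lines. The following C sketch (our illustration; the identifiers are not part of CODE) computes the greedy-scheduling runtime bound w/p + d and the corresponding maximum speedup; the example values are taken from the knary(4,6,0) case of Table 1 in Section 3.1.

#include <stdio.h>

/* Work-stealing/greedy runtime bound for a dag with total work w and
 * critical path d on p processors: T_p is roughly w/p + d. */
static double runtime_bound(double w, double d, double p) {
    return w / p + d;
}

/* Maximum attainable speedup on p processors: maxs(p) = w / (w/p + d). */
static double max_speedup(double w, double d, double p) {
    return w / runtime_bound(w, d, p);
}

int main(void) {
    double w = 259.0, d = 4.0;   /* example values, cf. the knary(4,6,0) row of Table 1 */
    printf("parallelism = %.2f\n", w / d);                 /* maxs(infinity) = w/d */
    printf("maxs(64)    = %.2f\n", max_speedup(w, d, 64.0));
    return 0;
}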
2.2 Program Representation
Effective load distribution algorithms for the class of computations of concern here must include consideration of data access patterns. The representation of the program which is used as the basis for partitioning is a dynamic, hierarchically structured dependence or data flow graph. The program representation we have chosen is the CODE [2,7] graph-oriented parallel programming language. A program can be constructed via a graphical user interface by placing nodes representing computations (typically calls to C or C++ functions) or subgraphs (a basic node is a degenerate case of a subgraph) and connecting them with arcs which represent both data and control flow. A CODE program is referred to as a graph. A graph may contain any number of nodes, each corresponding to a subgraph or an alias to a subgraph (thereby permitting recursion). CODE nodes contain firing rules that determine when the node is ready to be executed, the computation to be executed, and routing rules that determine where data should be sent after the computation. They may also contain static (i.e., persistent) variables and automatic variables. CODE graphs are dynamic: new nodes and subgraphs may be created at runtime (each instantiation of a new object has one or more indices associated with it).

The CODE model of computation is general and architecture-neutral. Programs written in CODE may be compiled for execution on shared-memory multiprocessors (SMPs), distributed-memory multiprocessors (DMPs), or cluster architectures. The problem is providing an efficient and scalable mechanism for executing CODE programs on DMPs and on clusters.

2.3 Load Distribution Algorithms
Because CODE graphs are dynamic, instantiations of graphs and computation nodes occur at runtime and it is not known at compile time how many objects will be created: placement of these objects must be managed at runtime.

2.3.1 Load Distribution at the Node Level
Vokkarne [11] describes an initial distributed version of CODE which maps objects to processors after the manner of the early versions of the widely used PVM [6] distributed programming environment. We use this mode of initial distribution as a basis for discussion of the issues in partitioning graphs of dynamic programs where the nodes may have substantial data associated with them. Each object has a path, defined as the list of graphs (with their indices) which contain the object, plus the object name and its indices. The path sum is defined as the sum of the indices along the path plus the unique identifiers of every object. The object (i.e., the execution and storage associated with the object) is placed on processor number (path sum) mod N, where N is the total number of processors and processors are indexed {0 ... N-1}. The algorithm we used is different in a small but important way: the "path sum" is defined as the product of the indices. This is important for scalability, since taking the original path sum mod N will generate substantial variations in workload for large numbers of processors.
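A minimal sketch of this placement rule follows. The function and type names are ours, not CODE's, and the exact composition of the path hash (in particular, whether unique identifiers are folded into the product exactly as shown) is our reading of the description above:

#include <stdint.h>

/* One element of an object's path: the unique identifier of the enclosing
 * graph (or of the object itself) and its runtime instantiation index.
 * Unique identifiers are assumed to be primes larger than the maximum
 * processor count (see flaw (iii) below); indices are assumed nonzero. */
typedef struct {
    uint64_t unique_id;
    uint64_t index;
} path_elem;

/* Node-level mapping: place the object on processor (product of path
 * components) mod N.  Using a product rather than a sum spreads objects
 * more evenly when N is large. */
static int node_level_map(const path_elem *path, int len, int nprocs) {
    uint64_t h = 1;
    for (int i = 0; i < len; i++) {
        h *= path[i].unique_id * path[i].index;
    }
    return (int)(h % (uint64_t)nprocs);
}

Graph-level mapping (Section 2.3.2) can reuse the same hash by simply truncating the path at the enclosing subgraph, i.e., omitting the object's own name and indices when the object is not a subgraph.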
This algorithm, which we subsequently refer to as node-level mapping, has possibly serious flaws for graphs with substantial communication among nodes:

(i) It destroys locality. Objects are dispersed across all processors, and no attempt is made to ensure that objects which are "close together" (e.g., in the same graph) are kept together. Opportunities for avoiding unnecessary communication are not exploited and in fact are almost inevitably discarded. This is only acceptable if communication is very small compared to computation.

(ii) It makes load balancing difficult. Since objects have a fixed processor assignment, it is not possible to balance the computational load among processors by moving computations.

(iii) Additionally, in the original formulation, the same nodes in recursive graphs would end up mapped to the same processor. This problem can be readily fixed. Since this failure can occur whenever the unique ID and the number of processors are not relatively prime, it can be solved by requiring unique IDs to be prime numbers larger than the pre-defined maximum number of processors. In the experiments, this modification has been made to the node-level mapping algorithm.

2.3.2 Load Distribution at the Graph Level
To avoid unnecessary communication, we also distinguish two types of computation nodes: static and stateless nodes. A static node has state that must be persistent. A node is static iff it has any static variables declared or if any firing rule depends on two or more inputs simultaneously (therefore, it must have a home processor). Stateless nodes may be executed on any processor; they are analogous to pure function calls.

To avoid destroying locality, we change the mapping algorithm so that only subgraphs are distributed. All nodes inside a subgraph will be mapped onto the same processor; the subgraphs (with their nodes) will be distributed randomly across all processors using the path sum algorithm described above (this is equivalent to computing the path sum by omitting the object's name and indices when the object is not a subgraph). This has the important effect of limiting communication to inter-graph arcs.

The major drawback to the graph-level mapping algorithm described above is that if there is a high degree of parallelism within a single graph, this parallelism will not be exploited. Intuitively, it seems that a load balancing scheme would help.

2.4 Work Stealing Model
To address load imbalance, we augment the placement algorithm with work stealing. Work stealing, as mentioned above, is a provably optimal load-balancing mechanism [3]. It functions as follows: when a processor is idle, it sends out a request for work to another processor chosen at random. If this processor is busy, it rejects the request and the idle processor must try again. However, if the request is accepted, the idle processor becomes a thief and the busy processor a victim. Work is stolen from the busy processor's work queue and is executed by the thief. The thief then returns the results of the execution to the victim.

The work-stealing algorithm is slightly altered for CODE. Every processor gets two work queues: a heavy queue and a light queue. The heavy queue contains static nodes while the light queue contains stateless nodes. Local execution prefers the heavy queue, while theft prefers the light queue. This is done to reduce communication costs: while work-stealing a stateless node just requires a unique identifier for the node and its inputs, work-stealing a static node also requires two-way communication of the node's state. Further, CODE attempts to first do work stealing locally (among other processors on an SMP). If this fails and no remote work-stealing requests are outstanding, a remote work-steal request is sent to a random machine on the network. This machine checks only one of its processors' queues for work, and transmits work if some is found or a work-steal denial message otherwise.
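The two-queue policy can be sketched in C as follows. The queue and messaging primitives below are assumed stand-ins, not CODE's actual runtime interface, and the fallback from the light to the heavy queue on the victim side is our assumption:

typedef struct node node;      /* a CODE computation node (opaque here)  */
typedef struct queue queue;    /* a per-processor work queue (opaque)    */

extern node *queue_pop(queue *q);            /* NULL if the queue is empty            */
extern node *steal_from_local_smp(void);     /* try other processors on the same SMP  */
extern node *request_remote_steal(void);     /* ask one random remote machine; it
                                                checks one processor's queues and
                                                replies with work or a denial         */
extern void  run_to_completion(node *n);

struct worker {
    queue *heavy;   /* static nodes    */
    queue *light;   /* stateless nodes */
};

/* Victim side: prefer to hand a thief a stateless (light) node, since a
 * static node's state would have to be shipped to the thief and back. */
node *choose_work_for_thief(struct worker *victim) {
    node *n = queue_pop(victim->light);
    return n ? n : queue_pop(victim->heavy);
}

/* Scheduling loop for one processor: local execution prefers the heavy
 * queue; an idle processor steals locally first, then remotely. */
void schedule(struct worker *self) {
    for (;;) {
        node *n = queue_pop(self->heavy);
        if (!n) n = queue_pop(self->light);
        if (!n) n = steal_from_local_smp();
        if (!n) n = request_remote_steal();
        if (n)  run_to_completion(n);
    }
}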
3. EXPERIMENTAL EVALUATION
We first define the experiments and then give a brief summary of the results from one experiment.

3.1 Experiment Definition
To evaluate the scalability, efficiency, and architecture-neutrality of this system, we wrote a CODE version of the knary benchmark [4]. This benchmark allows the creation of graphs with a wide range of degrees of parallelism, so we can verify scalability with respect to the degree of parallelism. The knary(h,d,s) synthetic benchmark grows a tree of height h and degree d in which, for each non-leaf node, the first s children are generated serially and the remaining children are generated in parallel. When it generates a node, the program first executes a fixed number of iterations of "work" before generating the children.

One of the advantages of knary is that it is possible to analytically solve for the work and critical path, as well as to place a loose upper bound on achievable speedup on a given number of processors:

    w(h,d,s) = (d^h - 1) / (d - 1),
    d(h,d,s) = (s^h - 1) / (s - 1).

By choosing d and s appropriately, we can vary the degree of parallelism and thereby test our algorithms to ensure that, as parallelism grows, speedup approaches the maximum possible speedup. We graph speedup (w/T_p) versus degree of parallelism (w(h,d,s) / d(h,d,s)). Table 1 defines the knary workloads for some of the experiments described below. The entries in the table are (i) parameters = (height, parallel degree, serial degree); (ii) degree of parallelism = work / critical path; and (iii) maximum speedup = work / (work/64 + critical path) (Amdahl's formula).

Parameters (h,d,s)   Degree of Parallelism   Maximum Speedup for 64 processors
(4,6,0)               64.75                   32.18
(3,14,0)              70.33                   33.3
(3,16,0)              91                      37.57
(3,18,0)             114.33                   41.03
(3,20,0)             140.33                   43.95
(4,8,0)              146.25                   44.52
(4,10,0)             277.75                   52.01
(5,6,0)              311                      53.07

Table 1 - Parameterization of the knary workload
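The table entries can be checked by walking the knary tree directly. The following C sketch (ours, independent of the CODE implementation) computes work and critical path by recursion over the tree structure described above; for the s = 0 cases in Table 1 it yields a critical path equal to the tree height h.

#include <stdio.h>

/* Work of knary(h,d,s): one unit of work per node of a height-h, degree-d tree. */
static double knary_work(int h, int d, int s) {
    (void)s;
    return h <= 1 ? 1.0 : 1.0 + d * knary_work(h - 1, d, s);
}

/* Critical path: the node's own work, plus the s serially generated
 * subtrees one after another, plus (if any remain) one parallel subtree. */
static double knary_span(int h, int d, int s) {
    if (h <= 1) return 1.0;
    double child = knary_span(h - 1, d, s);
    return 1.0 + s * child + (d > s ? child : 0.0);
}

int main(void) {
    int cases[][3] = { {4,6,0}, {3,14,0}, {3,16,0}, {3,18,0},
                       {3,20,0}, {4,8,0}, {4,10,0}, {5,6,0} };
    for (int i = 0; i < 8; i++) {
        int h = cases[i][0], d = cases[i][1], s = cases[i][2];
        double w = knary_work(h, d, s), cp = knary_span(h, d, s);
        printf("(%d,%d,%d): parallelism %.2f, maxs(64) = %.2f\n",
               h, d, s, w / cp, w / (w / 64.0 + cp));
    }
    return 0;
}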
3.2 Experimental Results
We have made runs on several different platforms. The results of interest come largely from the IBM SP2, which is a commonly used platform for large parallel computations. The results reported herein were obtained on 64 processors of a 128-processor SP2 configuration at the NSF NPACI Computation Facility. The metric we report is speedup on 64 processors. Using the knary benchmark enables us to estimate the maximum possible speedup for a given workload on a given configuration. (Zero-time communication is assumed for the computation of maximum possible speedup.)

Figure 1 gives the speedup for the benchmark specified by the knary cases given in Table 1. There are several features of interest in Figure 1. The first is that both load distribution and work stealing are important in attaining scalable load balancing. The most important feature is that as the number of threads (degree of parallelism) grows, the speedup attained by the best load balancing algorithms remains close to the maximum possible speedup. Bearing in mind that the maximum possible speedup assumes no communication time, that there is some communication cost in the benchmark, and that the runs are being made on 64 processors of a 128-processor machine with the workload on the other 64 processors sharing the switch resource, the asymptotic speedup obtained is quite satisfactory. (The granularity of the units of computation on this run was chosen to be large, six seconds, to minimize the effects of communication costs.) The dip in the speedup in the range of degree of parallelism of 100-150 is due to the large granularity of the computations. Recall that CODE uses run-to-completion scheduling for computation nodes. Therefore the six-second granularity, combined with a degree of parallelism close to the number of processors, means that work steals are delayed so that some processors cannot obtain work. This anomaly will be corrected by separating communication management and computation into separate threads. Additional experiments will be made with this extended version of CODE.

The mapping strategies we studied include node-mapping (where nodes are mapped across processors using their path hash function), graph-mapping (where graphs and all nodes within them are mapped), and a variant of graph-mapping where non-static computation nodes are always mapped locally, that is, to any processor that wants to execute them.

The results shown indicate that work stealing plus graph-mapping provides the best speedup for high degrees of parallelism. The non-work-stealing runs are virtually indistinguishable and perform substantially worse, as expected. It is clear that work stealing in combination with mappings improves their performance substantially.

It was somewhat surprising that graph-mapping with local non-statics performed somewhat more poorly than graph-mapping alone. We attribute this to the large grain size. For a smaller grain of computation, the savings in communication costs (required to obtain work via work stealing) is more significant than the load imbalance that can result. With large-grain computations (especially because of latency in message handling), it is more important to maintain an effective load balance.

[Figure 1 - Speedups for the knary cases in Table 1 on the IBM SP2 (64 processors, grain = 6 s): speedup versus degree of parallelism for ideal speedup; work stealing with graph mapping, node mapping, and graph mapping with local non-statics; and no work stealing with node mapping and graph mapping with local non-statics.]

4. Related Work
Space precludes any significant discussion of related work. The only system described in the literature with almost all of the above characteristics is Paralex [1]. Paralex is a distributed system that, like CODE, is a graphical, coarse-grain system. Paralex, like CODE, performs an initial mapping of tasks to processors that is subsequently revised by dynamic load balancing. However, the Paralex model is a quite restricted model of parallelism: the dataflow graph must be acyclic (while CODE graphs may be cyclic and can even support recursion), and the graph itself is static. Further, the load balancing mechanism is very restricted: load can only be migrated among those processors that host the same replicated tasks. In the case when no replication is used, no dynamic load balancing is performed.

We know of two other systems that are similar to CODE, in that the schemes they use for mapping are similar. Feitelson and Rudolph [5] describe a system that performs partially distributed scheduling. Although the intent of their system is different (to enable distributed gang-scheduling of very coarse-grain tasks), their mapping scheme is similar to CODE's, although theirs is restricted to static task graphs. The other somewhat similar system is described by Subhlok [8, 9, 10]. It provides mapping for task and data parallel programs by generating a static task graph from a source program and then partitioning the static task graph among processors. It allows refinement of mappings for better load balancing (using communication and memory requirements as parameters for mapping), but these mappings are entirely static.

5. Acknowledgements
This work was supported by DARPA/ITO under Contract N66001-97-C-8533, "End to End Performance Modeling of Large Heterogeneous Adaptive Parallel/Distributed Computer and Communication Systems".

6. REFERENCES

[1] Babaoglu, O., et al. Parallel Computing in Networks of Workstations with Paralex. IEEE Transactions on Parallel and Distributed Systems, 7(4), April 1996, pp. 371-384.
[2] Berger, E. The CODE Visual Parallel Programming System, WWW page. http://www.cs.utexas.edu/users/code.
[3] Blumofe, R. and Leiserson, C.E. Scheduling Multithreaded Computations by Work Stealing. Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), (Santa Fe, New Mexico, November 20-22, 1994), pp. 356-368.
[4] Blumofe, R., et al. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1), August 1996, pp. 55-69.
[5] Feitelson, D.G. and Rudolph, L. Mapping and Scheduling in a Shared Parallel Environment Using Distributed Hierarchical Control. Proceedings of the 1990 International Conference on Parallel Processing, Volume 1: Architecture, pp. 1-8.
[6] Geist, A., et al. PVM3 User's Guide and Reference Manual. Oak Ridge National Laboratory, Tennessee, 1994.
[7] Newton, P. The CODE 2.0 Graphical Parallel Programming Language. Proceedings of the 1992 International Conference on Supercomputing, (Washington, DC, July 1992), pp. 167-177.
[8] Subhlok, J. and Yang, B. A New Model for Integrated Nested Task and Data Parallel Programming. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (Las Vegas, NV, June 1997), pp. 1-12.
[9] Subhlok, J. and Vondran, G. Optimal Mapping of Sequences of Data Parallel Tasks. Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (Santa Barbara, CA, July 19-21, 1995), pp. 134-143.
[10] Subhlok, J., et al. Programming Task and Data Parallelism on a Multicomputer. Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (San Diego, CA, May 1993).
[11] Vokkarne, R. Distributed Execution Environments for the CODE 2.0 Parallel Programming System. Master's Thesis, Department of Computer Sciences, University of Texas at Austin, 1994.