IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996 1065 Scheduling In and Out Forests in the Presence of Communication Delays

Theodora A. Varvarigou, Member, IEEE, Vwani P. Roychowdhury, Member, IEEE, Thomas Kailath, Fellow, /€E€, and Eugene Lawler

Abstract-We consider the problem of scheduling tasks on multiprocessor architectures in the presence of communication delays. Given a set of dependent tasks, the schedulingproblem is to allocate the tasks to processors such that the pre-specified precedence constraints among the tasks are obeyed and certain cost-measures(such as the computation time) are minimized. Several cases of the scheduling problem have been proven to be NP-complete [16],[lo]. Nevertheless, there are polynomial time algorithms for interesting special cases of the general scheduling problem [12], [14], [lo]. Most of these results, however, do not take into consideration the delays due to message passing among processors. In this paper we study the increase in of scheduling problems due to the introduction of communication delays. In particular, we address the open problem of scheduling Out-forests (In-forests)in a multiprocessor system of m identical processors when communication delays are considered. The corresponding problem of scheduling Out-forests (In-forests) without communication delays admits an elegant polynomial time solution as presentedfirst by Hu in 1961 1121; however, the problem in the presence of communication delays has remained unsolved. We present here first known polynomial time algorithms for the computation of the optimal schedule when the number of available processors is given and bounded and both computation and communication delays are assumed to take one unit of time. Furthermore, we present a linear-time algorithm for computing a near-optimalschedulefor unit-delay out-forests. The schedule’s length exceeds the optimum by no more than (m- 2) time units, where m is the number of processors. Hence for two processors the computed schedule is strictly optimum.

Index Terms-Communication delays, out-forest precedence graphs, multiprocessor architectures, out-forest precedence graphs, optimal deterministic schedules, polynomial-time algorithms.

1 INTRODUCTION E consider deterministic scheduling problems in the Nevertheless, there are polynomial time algorithms for W presence of communication delays among the proc- several interesting special cases of the general scheduling essors. The area of deterministic scheduling is of obvious problem. One of the first polynomial time algorithms was importance, and has been extensively studied since the developed by Hu [12], who presented a linear time algo- early 1950s. In most of the cases, however, the communica- rithm for computing the optimum schedule of an In-forest tion delays among processors were ignored. We study here (or Out-forest) precedence graph, for a given number of the increase in the complexity of scheduling problems due available processors. The general strategy used in this algo- to the introduction of communication delays. rithm is the highest-level-fiust strategy, modified versions of Scheduling of dependent tasks in a multiprocessor sys- which give optimal results for other scheduling problems tem is such a well studied topic that any reference to the such as for the interval oder precedence graphs 1141, [71 or for available material has to be selective. The general problem the special case where only two processors are available for of scheduling a set of n unit delay and arbitrarily depend- the computation of the tasks [6], [8]. Polynomial time algo- ent tasks on a set of m identical processors where both m rithms have been presented for the special cases of oppos- and n are variables of the problem and unbounded has ing forests 1101, [51 as well as level order 151 precedence been proven to be NP-complete [161. Moreover, the sched- graphs. uling problem remains NP-complete even in some cases of The above mentioned results and algorithms assume restricted kind of dependencies among the tasks (see 1101, that the communication delays due to message passing 1131, [191, [Ill, [91). among processors that compute dependent tasks is negligi- ble. A more realistic approach is to take communication delays into coiisideration for the computation of the opti- T.A. Varvarigou is with the Department of Electronic and Computer Engi- mal schedule. Ilowever, introducing communication delays neering, Technical University of Crete, Chania 73100, Crete. may destroy the optimality of the schedules computed for Email: [email protected]. precedence graphs without communication delays. Hence a V.P.Roychowdhury is with the Department of Electrical Engineering, new analysis of the scheduling problem is needed when Univeusit?yof Cnlifurnia at Los Angeles, Los Angeles, CA90095. T. Knilath is with the Department of Electrical Engineering, Stanford Uni- communication delays are considered. Unfortunately, the versity, Stanford, CA 94305. general scheduling problem with communication delays is E. Lnzulev was with the Department of Electrical Engineering and Com- NP-complete; this is because it is a general version of the puter Science, University of California, Berkeley. He is deceased. usual scheduling problem (i.e., without communication Manuscript received Sept. I, 1992. delays), which is NP-complete [161. The problem remains For informatioil on obtaining reprints of this article, please send e-mail to: tmnspds~computcr.org,and reference IEEECS Log Number D95195. NP-complete wen when the number of available proces-

1045-9219/96$05.00 01996 IEEE 1066 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996 sors is arbitrary (number of processors is said to be arbitrary 1.1 The Model and Results if the algorithm can use unrestricted number of processors). The general model that we are going to assume in this pa- In particular, Chretienne showed that for general prece- per is similar to the models used by other researchers (see dence graphs, and for arbitrary computational and com- [l], [SI, [lo]) and can be described as follows: The model munication delays, the problem of computing the optimal consists of m identical processors PI,P,, . . ., P, where m is a schedule remains NP-complete, even if no restrictions are parameter of the problem and bounded by a constant c. imposed on the number of available processors (see [31). There exist n computational tasks TI, T,, ..., T, that can be Note that the corresponding scheduling problem with ar- executed in any of the m available processors. There is a bitrary num.ber of processors but without communication partial order among the tasks. This implies that if Ti+ TI delays admits a simple linear time algorithm [91. An even (i.e., Ti is dependent on Ti)then the computation of Tiin stronger NP-complete result was shown by Papadimitriou some processor Pj cannot start before task Tiis computed and Yannakakis in 1990. They proved that even if the com- (say in processor P,) and the results of this computation putation delays are restricted to be unit, and the communi- have been transmitted from processor PI to processor Pi. cation delays are restricted to be a positive integer t, the The partial order among the tasks introduces a precedence problem still remains NP-complete (see 1151). graph associated with the scheduling problem. The tasks {Til However, as already mentioned, there are polynomial are the nodes of the graph and we assume directed edges time algoritl~msfor different special cases of the scheduling between the nodes Ti + Ti whenever there is a dependency problem, such as In/Out forest precedence graphs, interval between tasks T,and Ti. order precedence graphs, etc., without communication de- For our purposes we shall assume that the precedence lays. The question we ask is: How does the complexity of these graph is of the Out-forest (or In-forest) form which is de- special cases change when communication delays are introduced? fined as follows: A precedence graph is an Out-Forest (In- In particula.r, we investigate the complexity of the schedul- Forest) if and only if: For every task T, there is only one task ing problem, with communication delays for the special case TI for which the dependency TI -+ Ti(Ti -+ Ti) exists. All the of In/Out forest precedence graphs. tasks are of unit computational time, i.e., it takes one unit of An atternpt to incorporate communication delays in time to compute any of the tasks TI,..., T, in any of the scheduling of general graphs was made by Colin and processors PI, ..., P,n. The processors are fully connected, Chretienne .in [4]. They consider general precedence graphs, i.e., any processor can communicate with any other proces- and they allow short communication delays, as well as du- sor. Moreover, the computation of the tasks in the proces- plication of tasks. The number of available processors is sors is independent of the communication among them. considered ito be arbitrary. Under these constraints, for each This implies that the processors can execute tasks at the task i, they propose polynomial time algorithms to compute same time that communication is taking place among them. lower boun'ds b, of the starting time of any of its copies. A The communication delay among any two processors takes first attempt to consider communication delays in In-tree unit time, i.e., it takes one unit of time for the transmission precedence graphs was made by Chretienne in [2]. He de- of the results of the computation of any task from any proc- veloped a greedy but optimal algorithm for computing the essor P, to any processor PI different than Pi. There is no optimal schLedule of In-tree precedence graphs under the communication delay involved when some task Ti and its assumptions of arbitrary number of available processors and dependent task Ti are computed in the same processor. short communication delays. The same problem was also con- Immediately after the completion of the computation of any sidered by Anger et al. in [l],and the proposed algorithm task T,, the results are sent to the processors where the tasks works for the case of In-forests/Out-forests precedence that depend on T,are going to be computed. graphs and arbitrary number of processors. None of these Given the above general model, the scheduling problem algorithms, however, is applicable to the case where the can be formulated as follows: number of ;processors at each time step are not as many as 1. Given a set of processors P,, . . ., P, and a set of de- needed (i.e.,arbitrary) but given as a variable of the problem. PROBLEM pendent tasks , , ., T, whose corresponding precedence We address here the open problem of scheduling Out- TI, graph is of the Out-forest (In-forest) form, find a valid forest/In-folrest precedence graphs with communication schedule for tke assignment of the given tasks to the proces- delays when the number of available processors is not arbi- sors, i.e., assign each task Ti to a processor P, and to a time trary but a uariable of the problem. This case where m is a con- slot t, in such a way that: stant or a variable of the problem (but bounded) is the most frequently encountercd case in practical applications. We pres- 1) If T, + TI then Ti is not scheduled to begin before T, is ent algorithms of O(n2"-') time complexity for the problem, completed and the results of T, have been sent from PI where m is the number of available processors, and the (where T, is computed), to Pj zuhere Tiis computed; and computation and communication delays are assumed to 2) The total completion time is minimum. take unit time. Hence, for constant number of processors In this paper we first solve Problem 1 and prove the fol- the problem of scheduling In/Out forest precedence graphs lowing result (see Theorem 6): admits polynomial time solution. This is the first known re- Given an out-forest G with communication delays, the op- sult of its type and demonstrates that certain scheduling problems timal schedule for G can be computed in O(n2"-') time. remain of pvlynomial time complexity even in the presence of communicat,;ondelays. We next give a detailed description of Note that our algorithm is of polynomial time complex- our model, assumptions, and our results. ity when m is bounded by a constant. The question whether ~

VARVARIGOU ET AL.: SCHEDULING IN AND OUT FORESTS IN THE PRESENCE OF COMMUNICATION DELAYS 1067 the problem is NP-complete for unbounded m remains un- The rest of tlhe paper is as follows: In Section 2, we pres- answered. The analysis of the In-forest precedence graph ent some results on scheduling dependent tasks when scheduling problem reduces to the Out-forest precedence communication delays are ignored. In Section 3.1, we dis- graph scheduling problem by just reversing the direction of cuss the favored child property that allows us to derive delay the dependencies between the tasks and then inverting the free graphs out of graphs that are subject to communication resulting optimal schedule. delays. In Section 3.2.1, we introduce the notion of the short- In the process of proving this theorem we develop effi- est delay free graph. In Sections 3.2.2 and 3.2.3, we develop cient ways of transforming the given graphs that are subject polynomial time algorithms for computing first the length to communication delays into delay free graphs that can be of the optimal schedule for a given set of dependent tasks scheduled without taking into consideration communica- and then for actually computing the optimal schedule. In tion delays between the processors, and whose optimal Section 4, we present a linear-time algorithm for computing schedules obey the precedence and communication delay con- a near-optimal schedule for unit-delay out-forests. Finally, in straints of the original graph and have the same length as the Section 5, we make some concluding remarks. optimal schedule of the original graph. This allows us to use techniques to obtain polyno- 2 THEMERGE ALGORITHM mial time algorithms for the computation of the optimal schedules. We shall discuss here the so-called MERGE algorithm, Our algorithm is a generalization of the algorithm pre- which was first developed in [5] for scheduling out-forests sented in [5] for graphs with no communication delays. without communication delays. We shall use the algorithm Using similar principles as in [5], we prove that we can ac- and the related concepts in Section 3.2, where communica- tually take communication delays among the processors tion delays will be taken into account as well. into consideration while keeping the time complexity of the DEF~NITION1. A component C of a graph G is any subgraph of algorithm polynomial. Moreover, it appears that our result G for which the following is true: there are no edges among regarding the equivalence of a given precedence graph to a nodes of C and nodes of G - C. That is, C is the union of corresponding delay free precedence graph (see Section 3.1 one or more disjoint connected compoizents of G. The and Theorem 3) is the first of its kind. We believe that such median of an out-forest G,p(G), is the height of some ideas can be used to generalize other algorithms that have mth highest tree of G plus one, where m is the number of been developed for scheduling precedence graphs without available processors. The high subgraph HG of a given communication delays, into polynomial time algorithms for out-forest G is the subgraph of G that contains all the the case when communication delays are considered. trees, with height strictly greater than the median of G Furthermore, we present a linear-time algorithm for (i.e., p (G)).The low subgraph L, of G,is the subgraph of computing a near-optimal schedule for unit-delay out- G that contains all the trees of G that are of height less or forests (see Theorem 8). The schedule’s length exceeds the equal to the median p(G). optimum by no more than M - 2 time units. Hence for two Given a graph G we can obviously compute the high and processors the computed schedule is strictly optimum. the low subgraphs of G, H,, and LG, respectively, in Oh) 1.2 Notations and an Outline of the Paper time. The following theorem suggests a way of computing The schedule that results as a solution of Problem 1 will be the length of the optimum schedule of an out-forest G, called an optimum schedule. Any assignment of the tasks to given the length of an optimal schedule of its high sub- graph and is due to Dolev and Warmuth the processors and to the time slots that obeys the prece- H,, 151. dence constraints and the communication delay constraints THEOREM1. For any delay free out-forest G, is called a valid schedule (or simply a schedule). The total number of time slots that a certain schedule occupies is A(G) = max{ A( HG), mp{k 1 km 2 \GI}} called the length of the schedule. We often refer to the The MERGE algorithm presented in 151 computes the op- precedence graph of the tasks simply as a graph. In the timum schedule for a precedence graph G, given an opti- analysis that follows we assume only Out-forest precedence mum schedule for the high subgraph H, of G (see also [51), graphs. The computational tasks are often referred to as in O(n)time. nodes. Whenever there is a dependency T,4 T, between The following theorem (see [5] for a proof), gives the tasks we refer to T,as the parent of T, and we refer to T, as time complexity for the MERGE algorithm. the child of TI.If there is a directed path in the precedence THEOREM2. Let be an out-forest precedence graph. Then given graph from a node T, to a node TI then we refer to TIas the G an optimum schedule for the high subgraph of G, there is predecessor of Ti and we refer to Ti as the successor of T,. an O(n)-time algorithm that computes an optimum sched- Height h(G) of a precedence graph G is the length of the ule for the whole graph G. longest path in G. A subgraph of a given out-forest is called a tree if there is a single node in the subgraph that has no predecessors. Root is a node that has no predecessors and leaf is a node with no successors. A graph F is called a sub- graph of a (precedence) graph G if every node in f has the same successors as it has in G. 1068 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996

3 POLYNOMIAL TIMEALGORITHMS FOR OPTIMAL the same length as the optimum schedule of the original graph. Be- SCHEDlULlNG IN THE PRESENCE OF fore proceeding into Lemmas l and 2 that prove the above COMMIJNICATIONDELAYS we define the following: Delay dependencies are the dependencies that are sub- 3.1 Delay Free Transformations of Precedence ject to communication delays. So, if a node n has more than Graphs one delay dependent children, only one of them can be As already mentioned in the introduction, if task Ti, sched- scheduled in the time slot immediately following the time uled in processor Pi,depends on task Ti scheduled in proc- slot n is scheduled. Delay free dependencies are the de- essor Piat time slot t, then task T, cannot be scheduled to pendencies that are not subject to communication delays. begin at time slot t + 1 or earlier. This is because the results So, if a node n has more than one delay free dependent of task Tj are needed for the computation of task T,and children, then there is no restriction on how many out of cannot be i-ransmitted from processor Pi to P, before time the n children that can be scheduled in the time slot, imme- slot t + 2. The transmission of results, as already stated in diately following the time slot where n is scheduled. the introduction, is independent of the actual computation DEFSNIT~ON2. Suppose that G is a graph whose dependencies are in the diffeirent processors, so processor Pi or PI can be used delay dependencies. Then we define the corresponding de- for the computation of other tasks while the communication lay free graph GSas the graph that results if we replace the between PI and Pi is taking place. The idea that is going to delay dependencies among every node n and its children allow us to compute optimum schedules for the case when n,, n2, . . . , with two stages of delay free dependences: communication delays have to be taken into account is the transformation of the given graph into an equivalent delay free 1) between n and some child ni of n (node ni will be re- graph. The delay free graph can then be scheduled under ferred to as the favored child), and precedence constraints only. 2) among nj and the rest of the children of n. So the general idea is to transform any given graph into a The following lemma states that given a precedence graph delay free graph, in which whenever a node n has more than G, any schedule of a corresponding delay free graph GS is a one child, namely n,, .. . nk (see Fig. 1a), then one of them, say valid schedule for G as well (for a proof, please see [171). n,, is chosen to be the favored child that is scheduled before all LEMMA1. Suppose that G is a delay graph and Gsis a coruespond- the other children of the parent n. Then the delay dependen- ing delay pee graph of G. Then if s" is a valid schedule for Gs cies between node n and its children n,, ..., nk are removed (that obeys the precedence constraints of GS)then Ssis a valid and delay free dependencies are added between node n and schedule for G as well (that obeys both the precedence and the the favored child n,,and between niand the rest of the chil- communication delay constraints). dren n,, ..., ni-,,n,+,, ..., nk of n as shown in Fig. lb. Of course one has to prove that the transformation results into a gvaph that Hence, any schedule valid for graph G5 is valid for the obeys the precedence constraints of the original graph and that there original graph C as well. What we are going to show now is always exist,s a delay free graph which has an optimum schedule of that for any graph G there exists a corresponding delay free graph Gs that has an optimum schedule of the same length as the optimum schedule of the original delay graph G. This 11 optimum schedule of GS is an optimum schedule of G as well. The following two definitions are going to be used in the presentation of Lemma 2. DEFINITION3. We say that a schedule S of graph G has the fa- vored child property, if and only if for every node i E G exactly one of i's children, j, is assigned to an eavliev time slot than the other children of i, if i has a child at all. Node j is called the favored child of i. DEFSNITION4. Given a schedule S of graph G, of length 1 we de- fine s,,l < i < l, to be the nudes of G scheduled in time slot i, in schedule S. LEMMA2. For every graph G thew exists an optimum schedule S zvith the favored child property. PROOF.Consider an optimum schedule S = (SI,. . ., S,)of G for which the favored child property does not hold. Let t be the latest time slot containing a node i without a fa- vored child. Let t', where t' 2 t + 2, be the earliest time slot containing a child j of i. Each node in St,-l has at most one child in S,,,and j itself is not the child of a node in St,-, . Hence, there is at least one node k in St,-, without a child in S,,.So let us swap nodes j and k, moving j from S,,to St,-, and k from Sf,-l to S,,.After Fig. 1. Delay free transformation of delay graphs: the strategy. this interchange, j is the favored child of i, and every VARVARIGOU ET AL.: SCHEDULING IN AND OUT FORESTS IN THE PRESENCE OF COMMUNICATION DELAYS 1069

node that had a favored child before the interchange then set node n, as the only children of node n. still has a favored child. It follows that a finite number If there are more than one nodes n, for which of such interchangescreates an optimum schedule with height[qs) = max height(T;’), the favored child property. For an alternative proof of ]=I, ,k this lemma you can also see [171,[181. CI then break the tie arbitrarily. However, the optimal schedule of a delay graph G that b) Add subtrees Tf, . . . ,Tis1, T,:,, . . . ,T: as succes- has the favored child property is also the optimum sched- ule of a delay free transformation of G. Hence, the follow- sors of node n, (keeping the rest of the succes- ing Corollary follows immediately from Lemma 2. sors of n, as they are in T,’) as shown in Fig. lb. COROLLARY1. Suppose G is a delay graph that has an optimum It is easy to see that the above algorithm is correct schedule of length L. Then there is a delay free graph GS and requires O(n)time. 0 corresponding to G, which has an optimum schedule of the The generalization of Lemma 3 for the case of the shortest same length L. delay free graph G’ corresponding to an out-forest G is Theorem 3 below follows easily from the above lemmas straightforward. One should just take the shortest delay and corollary. free trees that Correspond to the trees of graph G. Note also THEOREM3. Given an out-forest G, there exists a corresponding that in the process of computing Gs out of G, one computes delay free graph G’that has an optimum schedule of the same the shortest delay free trees that correspond to all the sub- length as the optimum schedule of the original graph G. trees of G, as well as their height. So we get the following corollary: The result of Theorem 3 plays a key role in the proof of Theorem 5. In fact, in the proof of Theorem 5, we show how COROLLARY2. Given an out-forest G, one can construct a short- to efficiently determine a delay free graph Gs that has an est delay pee out-forest Gs in O(n)time. Given G’, one can optimum schedule of the same length as the optimum find the shortest delay pee transformation of any subtree of schedule of any given graph G. G, and its corresponding height in 0(1)time. DEFINITION6. Given a subgraph F of graph G, we define the cor- 3.2 Scheduling in the Presence of Communication responding shortest delay free graph Fs to be the subgraph Delays of GS (where G“ is the shortest delay free graph of G) that In this section, we are going to present polynomial time contains only the nodes that belong to F. algorithms for the computation of the optimal schedule of out-forest graphs. The following section presents a linear- 3.2.2 High and Low Subgraphs for Unit-Delay Graphs time algorithm for computing a near-optimal schedule for In this section, we present a basic theorem that relates the unit-delay out-forests. length of the optimum schedule of a graph that is subject to 3.2. I Shortest Delay Free Graph communication delays, to the length of the optimal sched- First we present an algorithm for computing the shortest ule of its high subgraph, the latter being defined for graphs delay free graph that corresponds to a given precedence that are subject to communication delays as follows. graph which we define below. DEFINITION7 (High/Low subgraphs of a graph that is sub- DEFINITION5. Shortest delay free graph GSof a given graph G is ject to cornmunication delays). Given a graph F and the a delay free graph such that every subgraph of Gs has corresponding sortest delay free graph Fs = HFSU LFs, height less than or equal to the height of the corresponding wkere HFband LF7are the high and the low subgraphs of (i.e., containing the same nodes) subgraph in any other de- lay free graph for G. Fs, respectively, we define H;, and Lis to be the subgraphs LEMMA3. Given a tree T, one can construct a shortest delay free of F that correspond to HFsand LEE subgraphs of F’ (i.e., tree T‘, as well as the shovtest delay free trees that corre- contain the same nodes as HF5and LF”. We call Hiband spond to all the subtrees of T in O(n)time. L;c the high and low subgraphs of F, respectively. PROOF.A constructive algorithm can be described as follows: The Algorithm: Consider the out-tree T as shown in Fig. la For example gr,aph F of Fig. 2a has graph Fs of Fig. 2b as its rooted at a node n. Let n,, n2, ..., nk be the children of corresponding shortest delay free graph. H,, and LFb are il- n and Ti are the subtrees of corresponding to nodes T lustrated in Fig. 2b and H;%and L;s are illustrated in Fig. 2a. ni, i = 1, 2, ..., k (see Fig. lb). Construct the shortest delay free tree r‘ in the following recursive fashion: DEFINITION8. We define A(F) as the length of the optimum schedule ofa graph F. 1) Compute the shortest delay free trees qs,i = 1, . . ., k, corresponding to the subtrees Tiof the original tree T. As proven in Theorem 3 there exists a delay free graph H 2) Construct the shortest delay free tree r‘ as follows: corresponding to Hi5 that has a schedule of length equal to a) Suppose that the length of the optimal schedule for Hi., i.e.,

height[q’) = ,max height(?’) il(Hi,)= XH). The trees of H are at least as high as the ]=l,..., k 1070 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996 ...... Since A(F) is always greater than or equal to mink{k I kmr lFI),then

A(F) = max(A(Hf, j, mp{kI km 2 IFl}}. 0

@ 0; ; 3.2.3 A Dynamic Programming Algorithm for Computing the Optimal Schedule ' ...... As already proven in Theorem 4, given a subgraph F of G, (A) Graph /.' and the length of the optimal schedule for Hiq,we can ,...... , compute A(F) in O(n) time. So, in order to be able to com- ; pute the length of the optimal schedule for any subgraph F j of G, we need to compute A(H;$)for every subgraph F. Since we want to compute a schedule that corresponds to j a delay free transformation, it is useful to characterize the @ @ nodes of any given subgraph, such that the favored child 15 16 : property is maintained. Let us define as Init(G) the set of ...... , /,...... 1 nodes of G that have no predecessors. Let us also define two

$72 = 3 nodes n,, n2 in G to be brothers if these nodes have the same

mcdiaii p(P) = 3 parent in G. Then there are two kind of nodes in Init(Hi, 1 : (li) Graph Fa 1) The nodes whose all of their brothers do not belong in

Fig. 2. Examples of H;s and LL, , corresponding to an out-forest G. These nodes can be scheduled independently of the rest of the nodes in Init(Hi,) and are called initial trees of Hr5 (since both HF, and H are delay free graphs independent nodes; and 2) The nodes whose brothers are all in Init(Hf;,). corresponding to Hi., and HFI is the shortest delay free These nodes cannot be arbitrarily scheduled, but one of graph Corresponding to delay graph ff;> ). So H is the high them has to be scheduled before all its brothers, because it graph of the delay free graph H U LFh (since HFhwas the corresponds to the root of a tree in HFb.Hence we define a high tree of HF5U LF,). We claim that the following is true: bunch to be a set of nodes of a graph G such that every two nodes in the set are brothers and moreover, every brother CLAIM 1. aw U L~,)= max{jl(H), min{kl km 2 IH/+ /LF5/jj. of any node in the set is also in the set. A bunch that con- k sists of initial nodes is called an initial bunch. We now make the following claim: Given a subgraph F of a graph G we define n E Init(F) to CLAIM 2. a(F)= aw U L~,) be an initial independent node if all its brothers are not in F. We define initial components of a subgraph F of graph The above two claims lead to the following result G to be the set of all initial independent nodes and initial THEOREM4. Fov any subgraph F of a given out-forest G, bunches of F. For example, in the subgraph F in Fig. 3a, Initial independent nodes(F) = (1).The set of initial compo- L(F) = max[AjH;, j, mp{k I km 2 IF]}]. nents of F in Fig. 3a is {l,(8, 13), (17, 19, 20)). Note that the set of initial components of F and the set of initial nodes of PROOF.The proof follows immediately from Claim 2 by F contain the same nodes, with the only difference that in noting that: the set of initial components of F the nodes are grouped into initial bunches and initial independent nodes. We now 1) aw;,) = am make the following claim: 2) iL;,..l = /LF,l and CLAIM3. Consider graph G with shortest delay free transforma- 3) that the optimal schedules S and S* of and H tion Gs computed using the algorithm presented in Lemma 3. respectively have the same number of idle periods Then, given a subgraph F of G where some of F's initial nodes are initial independent nodes and some belong to initial p = p* (since they have the same length and the saime number of nodes). bunches, one can compute H;.and LFsof F in O(1) time. Thus,. The following claim follows easily from Definition 1 and Definition 6. CLAIM4. Given a subgraph F of G, its subgraph Hi, contains at most m - 1 initial components. VARVARIGOU ET AL.: SCHEDULING IN AND OUT FORESTS IN THE PRESENCE OF COMMUNICATION DELAYS 1071

node r______.__.______Iiidcpcndent initial PROOF.The algorithm that computes A(F) for every sub- Initial bunch graph F of G with no more than m - 1 initial compo- r_-.._.~ nents can be described as follows: Initial bunch -___. 1) 40)= 0 2) For i = 1 to n do:

For all subgraphs F of G with up to m - 1 initial com- ponents and I F I = i do:

A(F) = 1 min{A(Hf,,R U L~,.)F F'] (a) Graph F + I 2' where I R I 2 m. PROOF OF C0RR.ECTNESS. A proof of correctness of the above algorithm follows from the following comments: 1) As follows from Definition 9, sets R contain only ini- tial nodes of F and at most one node from each of the ini- tial bunches of F. It is easy to see why only initial nodes of F are allowed to be scheduled in the first (h) iiftei Stqi 1 time slot of the schedule of F (nodes are allowed to be scheduled only after all their predecessors have been computed). We next show why it is enough to consider only sets R that have at most one node from any initial bunch for the computation of the optimal schedule of F is more involved. In Theorem 3, we proved that there is a delay free transformation of F, namely Fs, that has an optimal schedule of the same length as the optimal schedule of F. But from Definition 2 we have that for every bunch {nl, .. ., nk]of F, Fqhas some nl as median p(P)= 3 the parent of all the n,s for i = 1, ..., j - 1, j + 1, ..., k. (c) Graph F" But in this optimal schedule of Fs (and of F as well) Fig. 3. Independent initial nodes, initial bunches. only n, is allowed to get scheduled when none of the {nl, ..., nk)have been scheduled yet. This translates in the following: F has an optimal schedule in which at DEFINITION9. Given a graph F we introduce the following nota- most one of the nodes of each initial bunch of F is sched- tion: F =R F'where R is a subset of Init(F) with the re- uled in the first time slot. Hence, it is enough to con- striction that no more than one node from the same initial sider only sets R that have at most one node from bunch of F aye in X.F'is the graph that results from F if we any initial bunch for the computation of the optimal delete the nodes of R. schedule of F. We know that Hi5 contains at most m - 1 initial compo- 2) h(H;,s IJ LF,,) can be computed in O(1) time. Consider nents. Hence, the set of all the subgraphs of G that have no a certain set R and the resulting graph F'. Then more than m - 1 initial components, contains the set of all from Claim 3 we know that given F', Hi,.and LF,$ the possible high graphs Hiq,corresponding to any sub- can be computed in O(1) time. But Hi,. is a subset graph F of G. So, if we compute the length of the optimal of G with up to m - 1 initial components itself schedule for all subgraphs of G that contains at most m - 1 initial components, we can then compute in O(n) time the (since it is the high subgraph of F'). Moreover, length of the optimal schedule for G itself or any subgraph IHf.,,I5 IF'[ < /FI = i. So A(H;,$) has been computed of G. In the following theorem we give a recursive proce- in an earlier step of the algorithm. But having dure for computing the length of the optimal schedule for a(H;,,), we can compute A(F) in O(1) time (see all subgraphs of G that contain at most m - 1 initial compo- nents. In Theorem 6, we show how to use the results of Theorem 4). So, for every given R, A(H;,, U LF,< Theorem 5 to compute the optimal schedule for G. can be computed in O(1) time. Our constructive proof for the following theorem is based 3) Among all the eligible sets R, we chose the one that on the results of Theorem 3, namely, the existence of a delay minimizes A(F'). It is then obvious that the total free graph for any G with the same length of optimal schedule. length of the schedule for F is going to be mini- THEOREM5. Let G be an out-forest. Then for every subgraph F of mum and equal to 1 + A(F'). G with no more than m - 1 initial components, A(F) can be Complexity issues: We can compute the time complexity of computed in ~(n~~-~)time. the algorithm presented above as follows: 1072 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996

1) There are O(n) possible independent nodes in 2) Find R such that dRG' and A(Hi,g) is graph G and O(n)different bunches (note that each minimum. bunch corresponds to a common parent; since there are n nodes in G, the number of different 3) Schedule nodes of R in the first time slot. bunches has to be less than n). So there are O(n) step 2: possible initial components for any graph F (total 1) SCHEDULE(H& LG,s) number of independent nodes and bunches). U 2) There are O(n"-') different sets that contain up to 2) SCHEDULE(Hi,) = R appended with m - 1 initial components (total number of ways we SCHEDULE(H;,, U cain choose up to m - 1 initial components out of the O(n) available ones). Hence, Step 2a gets exe- step 3: MERGE(SCHEDULE(H;, 1, L~.1 cuted at most O(n"-') times (one time for each of the different subgraphs F of G with up to m - 1 ini- PROOFOF CORRECTNESS.SCHEDULE(G) algorithm pre- tial components; note that to each set of initial sented above first computes Hi9 and LG5 subgraphs components corresponds one and only one sub- of G. Theorem 2 guarantees that if we find an opti- gr,aph of graph G and visa versa). 3) Every execution of Step 2.3 of the algorithm is of mum schedule for His,then we can find an optimum complexity O(n'"-') because: schedule for G using MERGE algorithm of Theorem 2. a) For every different graph F of Step 2.a there So all we need to prove is that the schedule that is are O(n'"-') different sets R (total number of computed for Hi5 in steps 1 and 2 of the SCHED- different ways to choose up to m - 1 nodes out ULE(G) algorithm is optimum. Then in step 3 MERGE of the O(n)initial nodes of F; note that F con- algorithm will be used to compute the optimum tains up to m - 1 initial components but may schedule for the whole graph G. contain O(n)initial nodes). b) Given a certain R, F' is uniqly determined and it Any schedule for Hi. will have some nodes R takes O(1) time to compute Hi,, and L):,. as scheduled in the first time slot and will have the rest of mentioned in Claim 3. the graph G' = - (R) scheduled in later time slots. Having determined Hi,>,one knows A(Hi,,) which The total length of the schedule of Hi5 then will be 1 + has been computed in an earlier step (since H;,$, A(G'). So all we really need to guarantee is that the choice of R is such that /z (G') is the minimum possible. from Definition 5, only contains up to m - 1 initial But schedule of minimum length for graph G', accord- components and (H;.;I 5 IF'I iIF1 = i). Hence, for a ing to Theorem 4, is the schedule that results in mini- specific R one can compute U')in O(1) time using mum A(H:,s). So R has to be such that the resulting Theorem 4. HE,3 has a schedule of minimum possible length. This 4) The whole algorithm is of O(n"-')O(n"'-') = O(.U~~''-~) time complexity. 0 is exactly how nodes R are chosen in step l.b. The total length of the schedule for Hi5 is 1 + A(G') = REMARK.The algorithm presented in the above theorem, can be used to efficiently determine a delay free graph 1 + A(Hi,, U LG" 1; hence the total length is minimum. G' that has an optimum schedule of the same length In Step 2 of the SCHEDULE algorithm we recur- as the optimum schedule of the any given graph G. sively compute the optimum schedule for G' which In the flollowing theorem a polynomial time algorithm is we append to the set R to find the optimal schedule presented for the computation of the optimal schedul'e of a for Hi5.The latter we use in step 3 to compute the given precedence graph G. optimum schedule for the whole graph G. THEOREM16. Given out-forest G, the optimal schedule for G can The resulting schedule has divided the nodes into be computed in O(n2"-') time. sets that correspond to different time slots. Using the al- gorithm ALLOCATE that is presented below we can de- PROOF.Applying the algorithm of Theorem 5, we can de- cide on which node is going to be scheduled in which termine A(F) for every subgraph F, of G with no more processor so that the communication delay constraints than m - 1 initial nodes in O(nZ"-') time. Having are not violated. ALLOCATE procedure takes the out- computed A(Hhq)we can compute A = XH;, IJ LG,) put of the SCHEDULE algorithm as its input. Suppose that S(k) is the set of nodes scheduled in time slot k. Then in O(1) time (see Theorem 4). The schedule icorre- ALLOCATE algorithm can be described as follows: sponding to the optimal length can be computed by the following recursive procedure: ALLOCATE (SCHEDULE (G)) SCHEDULE (G) 1) Assign nodes S(1) to processors at random. 2) Fork = 2 to L(G)do: step 1: For every node x in S(k), if the parent of x has been 1) Determine Hi, and LG'. scheduled in time slot k - 1 then assign node x to the __

VARVARIGOU ET AL.: SCHEDULING IN AND OUT FORESTS IN THE PRESENCE OF COMMUNICATION DELAYS 1073

same processor as its parent else assign node x at random. longest path from i to a leaf node. The following theorem The correctness of the ALLOCATE algorithm is first presented in [12] proves the optimality of the schedule obvious since at most one of the children of every and also establishes an important property that is used in node x is scheduled in the time slot immediately fol- the proof of Theorem 8. lowing the time slot in which is scheduled. So the x THEOREM7. Critical path scheduling yields an optimal schedule communication delay constraints are met (no com- for any delay-free out-forest Gdi munication delays are involved for dependent tasks scheduled in the same processor). We next establish the main result. Complexity issues: Suppose that SCHEDULE(G)is of T(n) THEOREM8. If Gs is a shortest delay-free out-forest correspond- complexity. Then the complexity of Step la of the al- ing to a given out-forest G, then gorithm is O(n) (see Theorem 4), the complexity of Step lb is O(n"), since there are O(n"') different ways A(GJ 5 A(G) + (m- 2). to choose R (assuming that A(F) has already been PROOF.Let S be an optimal schedule for G'. computed for all the subgraphs F of G of no more than m - 1 initial components). Step 2 is of at most T(n - 1) Case 1: All time slots, other than the first and last, are com- complexity and step 3 is of O(n) complexity (as fol- pletely filled. In this case the schedule is clearly opti- lows from Theorem 2). So T(n)= T(n - 1) + O(nm) mal for both Gs and G. T(n) = O(n"'). Hence the total time complexity of Case 2: There is a time slot, other than the first or last, that is step 2 is O(nm+'),and the SCHEDULE algorithm takes incompletely filled. Let t be the latest such time slot. O(n"'l) time. ALLOCATE algorithm is of O(n) com- From the proof of Theorem 7, we know that each node plexity. Taking into consideration the time complexity in this slot has a continuous chain-of predecessors exe- of computing for all subgraphs F of G with no A(F) cuted in slots t - 1, t - 2, ..., back to slot 1. Furthermore, more than m ~ 1 initial nodes we conclude that the at least one node in slot t must have height h(i) 2 2 in total complexity of the algorithm is O(n*"'-*). 0 GS,else t would be the last nonempty slot in S. Suppose node i has a predecessor k in Gs that is not the favored 4 A LINEARTIME APPROXIMATION ALGORITHM FOR child of its parents in G, where k is in slot t' < t. Then COMPUTINGA NEAR-OPTIMAL SCHEDULE FOR node j, the predecessor of k in G', is the favored child of U NIT-DELAYOUT- FORESTS the parent of both j and k in G and j is in slot t' - 1. This In Section 3.1, we showed that for every out-forest G with implies that h(j) 2 h(k) (since G' is a shortest delay-free communication delays, there exists a delay-free out-forest G5 out-forest), and that j has a successor in Gs, different such that the optimal schedule for GS is also an optimal from i, in slot t. It follows that node i has at most m - 2 schedule for G. The rest of the previous section discussed predecessors that are not favored children. Hence node how to determine such a delay-free transformation. In the i is executed at most m - 2 time units later than it could process we had also introduced the concept of shortest delay- possibly lbe executed in any feasible schedule for G. It free transformations; the optimal schedule of the resultant follows that S itself is no more than m - 2 time units out-forest, however, need not correspond to an optimal longer than an optimal schedule for G. schedule of the parent out-forest. In this section we show that the length of an optimal schedule of a shortest delay-free 5 CONCLUSION forest does not exceed the optimal schedule of the underlying In this paper, we have presented polynomial time algorithms out-forest by more than (m - 2) time units. Since there are linear time algorithms for both determining a shortest delay- for optimally scheduling n dependent tasks in a system of m free forest, and for determining its optimal schedule, it fol- processors when the precedence graph is in the form of Out- lows that there exists a linear time algorithm for determining forest or In-forest, m is a variable of the problem and a near-optimal schedule for any given out-forest with unit bounded by a constant, the tasks are of unit computational communication and computation delays. delay and the communication between the processors is Following the construction in Section 3.2.1, we can define taken into consideration and is assumed to be of unit delay. the height, h(i),of a shortest delay-free tree rooted at node i as We also presented a linear-time algorithm for computing a follows: near-optimal schedule for a unit-delay out-forest. Some inter- esting problems that could follow as possible extensions of 1 if i is a leaf node, the results presented in this paper are: = (1 + max{h(j) I j is a child of node i} otherwise. The nonstraight profile scheduling problem with com- We shall apply critical path scheduling for obtaining the munication delays, meaning the scheduling problem optimal schedule for a shortest delay-free out-forest. When when the number of processors available at each time critical path scheduling is applied to a delay-free forest Gd,+ step varies but is still bounded by a constant. tasks are assigned to time slots, one slot at a time, from first Scheduling problem with communication delays under to last. An unassigned task is said to be "available" for slot t the assurnption of non identical processors; for exam- if all of its predecessors have been assigned to prior slots. If ple an interesting generalization of the scheduling no more than m tasks are available at t, all the available problem would be the case where the available proces- tasks are put in St. If more than m tasks are available, then sors are of different computational capacity. m highest tasks are put in S,,where the height of a node i, is Scheduling with communication delays in more gen- the height of the subtree rooted at i, i.e., the length of a eral precedence graphs. 1074 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 7, NO. 10, OCTOBER 1996

ACKNOWLEDGMENTS Vwani P. Roychowdhury received the BTech degree from the Indian Institute of Technology, This work was supported in part by the SDIO/IST U. S. Kanpur, India, and the PhD degree from Stanford Army Research Office through Contract DAAL03-87-K-0033. University, Stanford, California, in 1982 and 1989, respectively, both in electrical engineering. He is currently a professor in the Department REFER ENC: ES of Electrical Engineering at the University of Cali- fornia, Los Angeles From August 1991 until June J.-J. F.D. Anger, Hwang, and Y.-C. Chow, “Scheduling with Suffi- 1996, he was a faculty member at the School of cient Loosely Coupled Processors,” J. Parallel and Distributed Coin- Electrical and Computer Engineering at Purdue puting, VOl. 9, pp. 87-92, 1990. University His research interests include parallel P. Chretienne, ”A Polynomial Algorithm to Optimally Schedule Tasks on a Virtual Distributed System Under Tree-Like Precedence algorithms and architectures, computational paradigms for nanoelectron- Graphs,” European J. Operafinnal Research, vol. 43, pp. 2255230,1988. ICS, design and analysis of neural networks, special purpose computing 1’. Chretienne, “Task Scheduling Over Distributed Memorv Ma- arrays, VLSl design, and fault-tolerant computation chines,” .Proc. Int’l Workshop Paralk and Distributed Algonthms,’I988. J. Colin and P. Chretienne, ”Cpm Scheduling with Small Communi- cation Delays and Task Duplication,” Technical Report M.A.S.190.1, Thomas Kailath was educated in Poona, India, Universite Pierre et Marie Curie, France, Jan. 1990. and at the Massachusetts Institute of Technology D. Dolev and M.K. Warmuth, ”Profile Scheduling of Opposing Until December 1962, he worked at the Jet Propul- Forests and Level Orders,” SIAM 1. Algorithms and Discrete Methods, sion Laboratories, Pasadena, California, and also vol. 6, PFI.87-92, Oct. 1985. taught part-time at the California Institute of Tech- M. Fujii, T. Kasami, and K. Ninomiya, ”Optimal Scheduling of Two nology He then went to Stanford University as an Equivalent Processors,” SIAM J, Applied Math., vol. 17, pp. 784-789, associate professor of electrical engineering He 1969. served as director of the Information Systems H.N. Gabow, “A Linear Time Recognition Algorithm for Interval Laboratory during a period of rapid growth from DAGS,” ihformation Processing Letters, vol. 12, pp. 20-22,1981. 1971 through 1980, as associate chairman of the H.N. Gabow, ”An Almost Linear Algorithm for Two Processor department until 1987, and was then appointed the Scheduling,” J. ACM, vol. 29, pp. 766-780,1982. first holder of the Hitachi America Professorship of Engineering Prof. M.R. Garey and D.S. Johnson, Computers and Infractability:A Guide to Kailath has also held shorter-term appointments at several institutions the Theor!( ofNP-Completeness. New York: W. H. Freeman, 1979. around the world including AT&T Bell Laboratories, University of California M.R. Garey, D.S. Johnson, R.E. Tarjan, and M. Yannakakis, at Berkeley, the Indian Institute of Science, Cambridge University, Delft ”Scheduling Opposing Forests,” SIAM 1. Algorithms and Discrete University, and the Massachusetts Institute of Technology Methods, vol. 4, pp. 72-93, Mar. 1983. His research has spanned a large number of disciplines, emphasizing R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan, information theory and communication in the 1960s, linear systems, esti- ”Optimization and Approximation in Deterministic Sequencing and mation, and control in the 1970s, and VLSl design and sensor array signal Scheduling: A Survey,” And5 Discrete Math., vol. 5, pp. 287-326, processing in the 1980s Alongside these have been contributions to sev- 1979. eral fields of applied mathematics His current interests emphasize applica- T.C. Hu, “Parallel Sequencing and Assembly Line Problems,” Op- tions of signal processing, computation, and control to problems in manu- erations Research, vol. 9, pp. 841-848,1961. facturing and telecommunications. E.W. Mayr, “Well Structured Parallel Programs Are Not Easy to Prof Kailath serves on the editorial boards of more than a dozen engi- Schedule,” Technical Report STAN-CS-81-880, Stanford Univ., Stan- neering and mathematicsjournals, and has been editor of the Prentice Hall ford, Calif., 1981. Series in Information and System Sciences since 1963 He is an author of C.H.Papadimitriou and M. Yannakakis, ”Scheduling Interval Order nearly 300 papers, two textbooks, and several research monographs and Tasks,” ,IDIMJ. Computing, vol. 10, pp. 405-409,1979, patents He is also active in several professional societies He has received C.H. Papadimitriou and M. Yannakakis, “Towards an Architecture- awards from the IEEE Information Theory Society, which he served as Independent Analysis of Parallel Algorithms,” SIAM J. Computing, president in 1975, the IEEE Signal Processing Society, and the American vol. 19, no. 2, pp. 322-328, Apr. 1990. Control Council He is the recipient of the 1995 IEEE Education Medal He J.D. Ullman, ”NP-complete Scheduling Problems,” J. Computer and has held Guggenheim and Churchill fellowships, among others, and has Systems Science, vol. 10, pp. 384-393,1975. T.A. Varvarigou, ”Scheduling and Fault Tolerance Issues in Multi- honorary doctorates from Linkoping University in Sweden and Strathclyde University in Scotland He is a fellow of the IEEE and of the Institute of processor Systems,” PhD thesis, Stanford Univ., Stanford, Calif., Dec. 1991. Mathematical Statistics, and a member of the National Academy of Engi- T. Varvarigou, V.P. Roychowdhury, and T. Kailath, ”Scheduling In neering, the American Academy of Arts and Sciences, and the Third World and Out Forests in the Presence of Communication Delays,” Proc. Academy of Sciences Int’l Parallel Processing Symp. (IPPS ’931, pp. 222-229, Newport Beach, Cnlif ,Apr 1993 M K Warmuth, ”Scheduling in Profiles of Constant Breadth,” PhD Eugene Lawler obtained an AM at Harvard Uni- thesis, Drpt of , Univ of Colorado, Boulder, Aug versity in 1957, and was a senior electrical engi- 1981 neer at Sylvania Electric Products in Needham, Massachusetts, from 1959 until 1961 when he returned to Harvard to obtain a PhD in 1962. He taught at the , Ann Arbor, Theodora A. Varvarigou received the BTech from 1962-1970 He was a member of the faculty degree from the National Technical University of at the University of California at Berkeley from Athens, Greece, in 1988, the MS degree from 1971 until his death in 1995 Stanford University, Stanford, California, in 1989, Dr Lawler studied algorithmic issues in combi- and the PhD degree from Stanford University in natorial optimization for more than 30 years His 1991 She IS currently an assistant professor at the contributions were fundamental in giving the discipline the breadth and Technical University of Crete, Greece From De- depth it has obtained He was the author of the textbook Combmatorla/ cember 1991 until December 1994, she was a Opf/m/Zafion~Networks and Mafroids (1976). He was a coeditor of The member of the technical staff at AT&T Bell Labs, Traveling Salesman Problem A Guided Tour of Combinafor/al Optimiza- Holmdel, New Jersey. Her research interests In- tion (1985) which became benchmark reference. He was also the coauthor clude parallel algorithms and architectures, fault- of a classic paper on branch-and-bound (with D E Wood) and of one on tolerant system design, and parallel scheduling on multiprocessorsystems. dynamic pwmmlng(with J M Moore)