The Bene T of Predicated Execution for Software Pipelining
Total Page:16
File Type:pdf, Size:1020Kb
Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 1 The Bene t of Predicated Execution for Software Pip elining Nancy J. Warter Daniel M. Lavery Wen-mei W. Hwu Center for Reliable and High-Performance Computing University of Illinois at Urbana/Champaign Urbana, IL 61801 Abstract implementations for the Warp [5] [3] and Cydra 5 [6] machines. Both implementations are based on the mo dulo scheduling techniques prop osed by Rau and Software pipelining is a compile-time scheduling Glaeser [7]. The Cydra 5 has sp ecial hardware sup- technique that overlaps successive loop iterations to p ort for mo dulo scheduling in the form of a rotating expose operation-level paral lelism. An important prob- register le and predicated op erations [8]. In the ab- lem with the development of e ective software pipelin- sence of sp ecial hardware supp ort in the Warp, Lam ing algorithms is how to hand le loops with conditional prop oses mo dulo variable expansion and hierarchical branches. Conditional branches increase the complex- reduction for handling register allo cation and condi- ity and decrease the e ectiveness of software pipelin- tional constructs resp ectively [3]. Rau has studied the ing algorithms by introducing many possible execution b ene ts of hardware supp ort for register allo cation [9]. paths into the scheduling scope. This paper presents In this pap er we analyze the implications and b en- an empirical study of the importanceofanarchi- e ts of predicated op erations for software pip elining tectural support, referredtoaspredicated execution, lo ops with conditional branches. In particular, the on the e ectiveness of software pipelining. In order analysis presented in this pap er is designed to help to perform an in-depth analysis, we focus on Rau's future micropro cessor designers to determine if predi- modulo scheduling algorithm for software pipelining. cated execution supp ort is worthwhile given their own Three versions of the modulo scheduling algorithm, estimation of the increased hardware cost. one with and two without predicated execution support, are implementedinaprototypecompiler. Experiments Although software pip elining has b een shown to b e basedonimportant loops from numeric applications e ective for many di erent architectures, in this pap er show that predicated execution support substantial ly we limit the discussion to VLIW and sup erscalar pro- improves the e ectiveness of the modulo scheduling al- cessors. For clarity,we use VLIW terminology.Thus, gorithm. an instruction refers to a very long instruction word which contains multiple op erations. 1 Intro duction 2 Mo dulo Scheduling Software pip elining has b een shown to b e an e ec- tivescheduling metho d for exploiting op eration-level parallelism [1][2][3] [4]. The basic idea b ehind soft- In a mo dulo scheduled software pip eline, a lo op it- eration is initiated every II cycles, where II is the Ini- ware pip elining is to overlap the iterations of a lo op tiation Interval [7] [3] [10] [11]. The II is constrained b o dy and thus exp ose sucient parallelism to utilize the underlying hardware. There are two fundamental by the most heavily utilized resource and the worst problems that must b e solved for software pip elining. case recurrence for the lo op. These constraints each The rst is to ensure that overlapping lifetimes of the form a lower b ound for II. A tightlower b ound is the same virtual register are allo cated to unique physical maximum of these lower b ounds. In this pap er, this tightlower b ound is referred to as the minimum II.If registers. The second problem is enabling lo op itera- tions with conditional branches to b e overlapp ed. an iteration uses a resource r for c cycles and there r are n copies of this resource, then the minimum II These problems have b een solved in the compiler r Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 2 til an II is found that satis- Algorithm Loop Data Dependence Modulo Resource This pro cess rep eats un Graph Reservation Table es the resource and dep endence constraints. If the 1. Generate data veany recurrences, a schedule can dependence graph LD functional units lo op do es not ha <f, 4> cycle mod II FU2 um II [7]. In the pres- 2. Determine Iteration FU1 usually b e found for the minim Interval (II) ADD ere develop ed in the 0 LD ADD ence of recurrence s, heuristics w - maximum of resource constraints <f, 1> {s = 0} {s = 4} Cydra 5 compiler to generate near-optimal schedules. 3. List schedule using MUL hedule the recurrence no de 1 MUL ST The basic principle is to sc modulo resource <f, 1> {s = 5} {s = 7} reservation table rst and then the no des not constrained by recur- ST - 2 functional units ] [11]. - uniform FU's s = start time rences [12 - II = 2 After II is scheduled, the co de for the software pip elined lo op is generated. Since not every in- Figure 1: Mo dulo scheduling a lo op without recur- struction is issued in the rst II cycles, a prologue rences. is required to \ ll" the pip eline. There are p = latest issue time 1 II's or stages in the prologue. II 7+1 e 1 = 3. The In the example in Figure 1, p = d due to resource constraints, RI I ,is 2 kernel of the software pip eline corresp onds to the p or- c r tion of the co de that is iterated. Since the lo op b o dy RI I = max ; r 2R n r spans multiple II's, a register lifetime mayoverlap it- self. If this happ ens, the registers for each lifetime where R is the set of all resources. If a dep endence must b e renamed. This can b e done in hardware us- edge, e, in a cycle has latency l and connects op era- e ing a rotating register le as in the Cydra 5 [6]. With tions that are d iterations apart, then the minimum e this supp ort, the kernel has one stage. Without sp ecial II due to dep endence cycles, CII,is hardware supp ort, modulo variable expansion can b e X 2 3 used to unroll the kernel and rename the registers [3]. l e 6 7 The numb er of times the kernel is unrolled, u,isde- e2E c 6 7 X CII = max ; 6 7 termined by the maximum register lifetime mo dulo II. c2C 6 d 7 e 6 7 The last part of the pip eline is the epilogue which con- e2E c sists of p stages needed to complete executing the last where C is the set of all dep endence cycles and E is c p iterations of the lo op. In total, the software pip eline the set of edges in dep endence cycle c. has 2p + u stages. Figure 1 shows how mo dulo scheduling is applied to a lo op without recurrences. The lo op has four op era- tions with dep endences shown in the data dep endence 3 Scheduling Lo ops with Conditional graph. The arcs on the dep endence graph show the Branches typ e of the dep endence ow and latency. In this ex- ample, a 2-issue VLIW with uniform function units In order to apply mo dulo scheduling to lo ops with is assumed. The load op eration has a 4-cycle latency conditional branches, the lo op b o dy must rst b e con- and the add and multiply have unit latencies. Since verted into straight-line co de. Two techniques, hier- there are no recurrences , the minimum II dep ends on archical reduction [13][3] and if-conversion [14] [15], the maximum of the resource constraints. Since there have b een develop ed for converting conditional con- are two uniform function units and four op erations in 4 structs e.g., if-then-else constructs into straight-line the lo op, the minimum II is d e = 2. With no recur- 2 co de for the Warp [5] and Cydra 5 [6] machines re- rences, the op erations can b e list scheduled using the sp ectively. In this section, we analyze the impact of mo dulo resource reservation table [12]. The mo dulo these two conversion techniques on the scheduling con- resource reservation table has a column for each func- straints. tion unit and a row for each cycle in II. Note that after the load, add, and multiply are scheduled, the 3.1 Hierarchical Reduction store op eration can b e scheduled in cycle 6. However, there are no available resources at cycle 0 6 mo d 2 Hierarchical reduction is a technique that con- in the reservation table and thus the store op eration verts co de with conditional constructs into straight- is delayed a cycle and scheduled in cycle 7. line co de by collapsing each conditional construct into If a schedule is not found for a given II, then a pseudo op eration [13] [16][3]. In this section, these II is incremented and the lo op b o dy is rescheduled. Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 3 op1: r1 <- 0 pseudo op erations will b e referred to as reduct op's. op2: r2 <- label_A hical technique since nested conditional It is a hierarc op3: r3 <- max*8 y collapsing from the inner- constructs are reduced b for (i = 0; i < max; i++) { op4: L1: r4 <- mem[r2+r1] if(i == 0) { hical reduction, most to the outermost.