Three Architectural Models for Compiler-Controlled Speculative Execution

Pohua P. Chang, Nancy J. Warter, Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu

The authors are with the Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, Illinois, 61801.

Abstract

To effectively exploit instruction level parallelism, the compiler must move instructions across branches. When an instruction is moved above a branch that it is control dependent on, it is considered to be speculatively executed since it is executed before it is known whether or not its result is needed. There are potential hazards when speculatively executing instructions. If these hazards can be eliminated, the compiler can more aggressively schedule the code. The hazards of speculative execution are outlined in this paper. Three architectural models (restricted, general and boosting), which have increasing amounts of support for removing these hazards, are discussed. The performance gained by each level of additional hardware support is analyzed using the IMPACT C compiler, which performs superblock scheduling for superscalar and superpipelined processors.

Index terms - Conditional branches, exception handling, speculative execution, static code scheduling, superblock, superpipelining, superscalar.

1 Introduction

For non-numeric programs, there is insufficient instruction level parallelism available within a basic block to exploit superscalar and superpipelined processors [1][2][3]. To schedule instructions beyond the basic block boundary, instructions have to be moved across conditional branches. There are two problems that need to be addressed in order for a scheduler to move instructions above branches. First, to schedule the code efficiently, the scheduler must identify the likely executed paths and then move instructions along these paths.
Second, when the branch is mispredicted, executing the instruction should not alter the behavior of the program.

Dynamically scheduled processors can use hardware branch prediction [4] to schedule instructions from the likely executed path or schedule instructions from both paths of a conditional branch such as in the IBM 360/91 [5]. Statically scheduled processors can either predict the branch direction using profiling or some other static branch prediction mechanism or use guarded instructions to schedule instructions along both paths [6]. For loop intensive code, static branch prediction is accurate and techniques such as loop unrolling and software pipelining are effective at scheduling code across iterations in a well-defined manner [7][8][9][10]. For control intensive code, profiling provides accurate branch prediction [11]. Once the direction of the branch is determined, blocks which tend to execute together can be grouped to form a trace [12][13]. To reduce some of the bookkeeping complexity, the side entrances to the trace can be removed to form a superblock [14].

In dynamically and statically scheduled processors in which the scheduling scope is enlarged by predicting the branch direction, there are possible hazards to moving instructions above branches if the instruction is speculatively executed. An instruction is speculatively executed if it is moved above a conditional branch that it is control dependent upon [15]. A speculatively executed instruction should neither cause an exception which terminates the program nor incorrectly overwrite a value when the branch is mispredicted. Various hardware techniques can be used to prevent such hazards. Buffers can be used to store the values of the moved instructions until the branch commits [16][2][17]. If the branch is taken, the values in the buffers are squashed. In this model, exception handling can be delayed until the branch commits.
Alternatively, non-trapping instructions can be used to guarantee that a moved instruction does not cause an exception [18].

In this paper we focus on static scheduling using profiling information to predict the branch direction. We present a superblock scheduling algorithm that supports three code percolation models which require varying degrees of hardware support to enable code motion across branches. We present the architecture support required for each model. Our experimental results show the performance of the three models on superscalar and superpipelined processors.

2 Superblock Scheduling

Superblock scheduling is an extension to trace scheduling [12] which reduces some of the bookkeeping complexity. The superblock scheduling algorithm is a four-step process:

1. trace selection,
2. superblock formation and enlarging,
3. dependence graph generation, and
4. list scheduling.

Steps 3 and 4 are used for both prepass and postpass code scheduling. Prepass code scheduling is performed prior to register allocation to reduce the effect of artificial data dependences that are introduced by register assignment [19][20]. Postpass code scheduling is performed after register allocation.

    avg = 0;
    weight = 0;
    count = 0;
    while(ptr != NIL) {
        count = count + 1;
        if(ptr->wt < 0)
            weight = weight - ptr->wt;
        else
            weight = weight + ptr->wt;
        ptr = ptr->next;
    }
    if(count != 0)
        avg = weight/count;

Figure 1: C code segment.

The C code segment in Figure 1 will be used in this paper to illustrate the superblock scheduling algorithm. Compiling the C code segment for a load/store architecture produces the assembly language shown in Figure 2. The assembly code format is

    opcode destination, source1, source2

where the number of source operands depends on the opcode. The weighted control flow graph of the assembly code segment is shown in Figure 3.
The weights on the arcs of the graph correspond to the execution frequency of the control transfers. For example, basic block 2 (BB2) is executed 100 times with the control going from BB2 to BB4 90% of the time and from BB2 to BB3 the remaining 10% of the time. This information can be gathered using profiling.

    (i1)        load  r1, _ptr
    (i2)        mov   r7, 0        // avg
    (i3)        mov   r2, 0        // count
    (i4)        mov   r3, 0        // weight
    (i5)        beq   L3, r1, 0
    (i6)   L0:  add   r2, r2, 1
    (i7)        load  r4, 0[r1]    // ptr->wt
    (i8)        bge   L1, r4, 0
    (i9)        sub   r3, r3, r4
    (i10)       br    L2
    (i11)  L1:  add   r3, r3, r4
    (i12)  L2:  load  r1, 4[r1]
    (i13)       bne   L0, r1, 0
    (i14)  L3:  beq   L4, r2, 0
    (i15)       div   r7, r3, r2
    (i16)       store _avg r7
    (i17)  L4:

Figure 2: Assembly code segment.

The first step of the superblock scheduling algorithm is to use trace selection to form traces from the most frequently executed paths of the program [12]. Figure 4 shows the portion of the control flow graph corresponding to the while loop after trace selection. The dashed box outlines the most frequently executed path of the loop. In addition to a top entry and a bottom exit point, traces can have multiple side entry and exit points. A side entry point is a branch into the middle of a trace and a side exit is a branch out of the middle of a trace. For example, the arc from BB2 to BB3 in Figure 4 is a side exit and the arc from BB3 to BB5 is a side entrance. To move code across a side entrance, complex bookkeeping is required to ensure correct program execution [12][21]. For example, to schedule the code within the trace efficiently, it may be desirable to move instruction i12 from BB5 to BB4. To ensure correct execution when the control flow is through BB3, i12 must also be copied into BB3 and the branch instruction i10 must be modified to point to instruction i13. If there were another path out of BB3, then a new basic block would need to be created between BB3 and BB5 to hold instruction i12 and a branch to BB5.
In this case, the branch instruction i10 would branch to the new basic block.

The second step of the superblock scheduling algorithm is to form superblocks. Superblocks avoid the complex repairs associated with moving code across side entrances by removing all side entrances from a trace. Side entrances to a trace can be removed using a technique called tail duplication [14]. A copy of the tail portion of the trace from the side entrance to the end of the trace is appended to the end of the function. All side entrances into the trace are then moved to the corresponding duplicate basic blocks. The remaining trace with only a single entrance is a superblock. Figure 5 shows the loop portion of the control flow graph after superblock formation

Figure 3: Weighted control flow graph. (Basic blocks: BB1 = i1-i5; BB2 = i6-i8; BB3 = i9-i10; BB4 = i11; BB5 = i12-i13; BB6 = i14; BB7 = i15-i16; BB8 = i17.)

Figure 4: Loop portion of control flow graph after trace selection.
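The tail duplication step described in this section can be illustrated with a small C sketch. It is not the IMPACT compiler's implementation: the types and functions (Cfg, Block, cfg_init, cfg_set_succs, tail_duplicate), the fixed table size, and the limit of two successors per block are all hypothetical choices made for illustration. The sketch finds the first block of a trace that has a side entrance, copies the tail of the trace from that block onward, and redirects every side entrance to the corresponding copy.

```c
#include <assert.h>

/* Hypothetical CFG representation, assumed for illustration only. */
#define MAX_BLOCKS 32
#define MAX_SUCCS  2
#define NONE       (-1)

typedef struct { int succ[MAX_SUCCS]; } Block;   /* successor ids or NONE */
typedef struct { Block blocks[MAX_BLOCKS]; int nblocks; } Cfg;

/* Create nblocks blocks with no outgoing edges. */
void cfg_init(Cfg *g, int nblocks)
{
    assert(nblocks >= 0 && nblocks <= MAX_BLOCKS);
    g->nblocks = nblocks;
    for (int b = 0; b < MAX_BLOCKS; b++)
        for (int s = 0; s < MAX_SUCCS; s++)
            g->blocks[b].succ[s] = NONE;
}

void cfg_set_succs(Cfg *g, int id, int s0, int s1)
{
    assert(id >= 0 && id < g->nblocks);
    g->blocks[id].succ[0] = s0;
    g->blocks[id].succ[1] = s1;
}

/* Trace position of the first non-head block entered from any block
 * other than its trace predecessor; returns len if there is none. */
static int first_side_entrance(const Cfg *g, const int *trace, int len)
{
    for (int k = 1; k < len; k++)
        for (int b = 0; b < g->nblocks; b++) {
            if (b == trace[k - 1]) continue;
            for (int s = 0; s < MAX_SUCCS; s++)
                if (g->blocks[b].succ[s] == trace[k]) return k;
        }
    return len;
}

/* Remove side entrances from trace[0..len-1] by tail duplication.
 * dup[j] receives the id of the copy of trace[j], or NONE if trace[j]
 * was not copied. Returns the number of copied blocks, or -1 if the
 * block table is full. */
int tail_duplicate(Cfg *g, const int *trace, int len, int *dup)
{
    int k = first_side_entrance(g, trace, len);
    for (int j = 0; j < len; j++) dup[j] = NONE;
    if (k == len) return 0;                       /* already a superblock */
    if (g->nblocks + (len - k) > MAX_BLOCKS) return -1;

    /* Append a copy of each tail block to the end of the block table. */
    for (int j = k; j < len; j++) {
        dup[j] = g->nblocks;
        g->blocks[g->nblocks++] = g->blocks[trace[j]];
    }
    /* Within the copy, edges that follow the trace stay inside the copy. */
    for (int j = k; j + 1 < len; j++)
        for (int s = 0; s < MAX_SUCCS; s++)
            if (g->blocks[dup[j]].succ[s] == trace[j + 1])
                g->blocks[dup[j]].succ[s] = dup[j + 1];
    /* Redirect every remaining side entrance into the tail to the copy. */
    for (int b = 0; b < g->nblocks; b++)
        for (int s = 0; s < MAX_SUCCS; s++)
            for (int j = k; j < len; j++)
                if (b != trace[j - 1] && g->blocks[b].succ[s] == trace[j]) {
                    g->blocks[b].succ[s] = dup[j];
                    break;
                }
    return len - k;
}
```

Applied to the loop of Figure 4 with block ids equal to the basic block numbers (BB2 branches to BB4 and BB3, BB3 and BB4 both reach BB5, BB5 branches to BB2 and BB6) and the trace BB2, BB4, BB5, this sketch copies only BB5. The side entrance from BB3 is moved to the copy, while the arc from BB4 to BB5 is left in place, so the original trace keeps a single entrance at BB2.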