to appear in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT) 1998

Dynamic Hammock Predication for Non-predicated Instruction Set Architectures


Artur Klauser    Todd Austin    Dirk Grunwald    Brad Calder

University of Colorado at Boulder, Department of Computer Science; Intel Microcomputer Research Lab; University of California, San Diego, Department of Computer Science and Engineering

Abstract

Conventional speculative architectures use branch prediction to evaluate the most likely execution path during program execution. However, certain branches are difficult to predict. One solution to this problem is to evaluate both paths following such a conditional branch. Predicated execution can be used to implement this form of multi-path execution. Predicated architectures fetch and issue instructions that have associated predicates. These predicates indicate if the instruction should commit its result. Predicating a branch reduces the number of branches executed, eliminating the chance of branch misprediction at the cost of executing additional instructions.

In this paper, we propose a restricted form of multi-path execution called Dynamic Predication for architectures with little or no support for predicated instructions in their instruction set. Dynamic predication dynamically predicates instruction sequences in the form of a branch hammock, concurrently executing both paths of the branch. A branch hammock is a short forward branch that spans a few instructions in the form of an if-then or if-then-else construct. We mark these and other constructs in the executable. When the decode stage detects such a sequence, it passes a predicated instruction sequence to a dynamically scheduled execution core. Our results show that dynamic predication can accrue speedups of up to 13%.

1 Introduction

Current processors already issue 4 instructions per cycle, and future designs will be able to issue 8 or more instructions per cycle. However, increasing the issue width of the processor beyond 4 instructions will have decreasing benefits unless compiler and architecture technology improves. Studies on current processors, which can issue 4 instructions per cycle, show that on average only 1 to 2 instructions are issued per cycle [6, 21]. Therefore, 50% to 75% of the processor's potential performance is not being utilized. Main contributors to this performance degradation are the small size of basic blocks and high branch misprediction penalties. Even with new branch prediction architectures, some branches are still very hard to predict. For example, the SPECint95 program go has only 80% branch prediction accuracy using the latest branch prediction architectures [19].

Predicated execution is an important part of future architecture design. Predication allows the removal of conditional branches from the instruction stream through conditional execution of instructions. Predicated architectures fetch and issue instructions that have predicates indicating if the instruction should commit its result.

This paper describes a mechanism for removing branch penalties by implementing a restricted form of multi-path execution on dynamically scheduled architectures. We call this Dynamic Predication (DP) because we do not assume support for predication in the instruction set architecture (ISA). Instead, we identify branches that can benefit from predicated execution using information provided by compile- or link-time transformations. During execution, dynamic predication converts the instruction sequences for both paths of these branches into predicated form. Internally, the processor instructions use full predication (i.e., all instructions can be predicated). The branch is then eliminated from the instruction stream and instructions from both paths are concurrently executed. The conversion to predicated form occurs while the program executes; hence the name Dynamic Predication.

In this paper, we concentrate on dynamic predication for simple branch hammocks [7], or branches that have a clear fork-join form with no nested hammocks. Adding the ability to predicate hammocks to a dynamically scheduled architecture helps (1) eliminate mispredicted branches, (2) reduce capacity demands on resources for branch prediction, and (3) improve instruction fetching by increasing the number of instructions between branches.

1.1 Dynamic Predication for Branch Hammocks

Figure 1 shows a code fragment corresponding to an "if-then-else" statement. The example has a single conditional branch at instruction (e) and a join point at (k). As shown in Figure 1, we break a branch hammock into four "contexts" that we use to discuss dynamic predication throughout this paper. The fork-context occurs up to the point of the original conditional branch. The then-context and optional else-context each contain half of the branch hammock. The join-context is the code following the branch hammock.

Since dynamic predication only converts branch hammocks into predicated form, it is more restrictive than a fully predicated ISA with an aggressive predicating compiler. In addition, if the hammock code is not laid out as one consecutive block of instructions, dynamic predication is not applied, since the advantage of consecutive fetch is lost. Despite these limitations, we show that dynamic predication can improve the performance of a dynamically scheduled architecture with a conventional ISA.

There are three problems to be solved in dynamic predication: identifying eligible branch hammocks, deciding whether to use predication to execute the branch hammock, and maintaining the processor state in the face of dynamic predication. In this work, we assume the conditional branch starting a predicatable hammock (i.e., instruction (e) in Figure 1) would be marked by a compiler or binary instrumentation tool.
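As a rough illustration of how a compiler or binary instrumentation tool might mark simple hammocks, the following sketch scans a linear instruction stream for the fork/join pattern. The instruction encoding (`"bcc"` conditional branch, `"bra"` unconditional branch, `"alu"` everything else) and the function name are our own illustrative assumptions, not part of the paper's toolchain.

```python
def find_simple_hammocks(code):
    """code: list of (op, branch_target_or_None) in program order.
    Returns (fork_index, kind) for each conditional forward branch
    that starts a simple (single basic block per path) hammock."""
    found = []
    for i, (op, tgt) in enumerate(code):
        if op != "bcc" or tgt is None or tgt <= i:
            continue                          # only conditional forward branches
        body = code[i + 1:tgt]                # fall-through region up to the target
        ctl = [j for j, (o, _) in enumerate(body) if o in ("bcc", "bra")]
        if not ctl:
            found.append((i, "single-sided")) # if-then: no embedded control flow
        elif ctl == [len(body) - 1] and body[-1][0] == "bra":
            join = body[-1][1]                # if-then-else: then-path ends 'bra join'
            if join > tgt and all(o not in ("bcc", "bra")
                                  for o, _ in code[tgt:join]):
                found.append((i, "double-sided"))
    return found
```

A branch with any other embedded control flow (calls, loops, nested hammocks) is left for ordinary branch speculation.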
Figure 1 (original code)            Figure 2 (translated code)
Fork context:
(a) R1 := ...                       (a) R1.a := ...
(b) R2 := ...                       (b) R2.b := ...
(c) R3 := ...                       (c) R3.c := ...
(d) R4 := ...                       (d) R4.d := ...
(e) Bcc (i)                         (e) P1.e := CC
Then context (CC is false):
(f) R1 := R1 + R2                   (f) R1.f := R1.a + R2.b
(g) R3 := R1 * 2                    (g) R3.g := R1.f * 2
(h) BRA (k)                         (h) BRA (k)
Else context (CC is true):
(i) R2 := R1 - R2                   (i) R2.i := R1.a - R2.b
(j) R3 := R2 * 2                    (j) R3.j := R2.i * 2
Join context:
                                    (k) R1.k := if P1.e then R1.a else R1.f
                                    (l) R2.l := if P1.e then R2.i else R2.b
                                    (m) R3.m := if P1.e then R3.j else R3.g
(k) RA := R1                        (n) RA.n := R1.k
(l) RB := R2                        (o) RB.o := R2.l
(m) RC := R3                        (p) RC.p := R3.m
(n) RD := R4                        (q) RD.q := R4.d

Figure 1. Branch Hammocks.
Figure 2. Translated Branch Hammocks. Rx.y: logical register x, physical register y.
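The Figure 1 to Figure 2 rewrite can be sketched in software. This is a minimal model, with our own naming scheme: path definitions get fresh physical names, and a cmove is injected for each logical register defined on either path. A side that does not define a register falls back to its logical name (standing in for the fork-context physical register). Predicate-true selects the else-path value, matching Figure 1, where the branch is taken when CC is true.

```python
def predicate_hammock(then_path, else_path, pred="P1"):
    """then_path/else_path: lists of (logical_dest, expr) in program order.
    Returns the predicated sequence dispatched in place of the hammock."""
    out = [(pred, "CC")]                     # (e) the branch becomes P1 := CC
    then_def, else_def = {}, {}              # logical name -> renamed definition
    for i, (dest, expr) in enumerate(then_path):
        then_def[dest] = name = f"{dest}.t{i}"
        out.append((name, expr))             # commits when the predicate is false
    for i, (dest, expr) in enumerate(else_path):
        else_def[dest] = name = f"{dest}.e{i}"
        out.append((name, expr))             # commits when the predicate is true
    # cmove injection: reconcile every register defined on either path;
    # an undefined side falls back to its fork-context value.
    for reg in sorted(set(then_def) | set(else_def)):
        t = then_def.get(reg, reg)
        e = else_def.get(reg, reg)
        out.append((f"{reg}.join", f"if {pred} then {e} else {t}"))
    return out
```

Running this on the Figure 1 hammock produces one renamed definition per path instruction plus three cmoves (for R1, R2, and R3), mirroring instructions (k), (l), and (m) of Figure 2.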

The decision to evaluate both paths can be made using off-line profiles or an on-line dynamic confidence estimator. If an off-line profile is used, the branch is always executed using predication. In the dynamic method, the branch is executed using predication only if a confidence estimator reports low confidence in the branch prediction. We use a dynamic confidence estimator similar to the design of Jacobsen et al. [10]. Clearly, dynamic predication will have the largest influence for branches with little predictability and few instructions. If a branch can be easily predicted, then speculation is more efficient than predication. However, wide-issue processors are likely to have fetched the instructions on both branch paths, and predication can be useful even for highly predictable branches.

Dynamic predication evaluates both sides of a branch, and places additional pressure on execution resources. In Figure 1, registers R1 and R2 are each modified on only one half of the hammock, whereas R3 is modified on both halves. Register R4 is not modified by either path. Since each path redefines some portion of the register name space, and instructions from both paths are issued to the same processor, some mechanism must be used to distinguish the state for each path. We inject instructions from both predicated paths into the same pipeline. We use a labeling mechanism, described later, to identify the instructions and registers from the different paths. We remove the conditional branch and replace it with a predicate definition instruction, rename the registers on each path of the branch to maintain independent execution semantics, and inject conditional move (cmove) operations to reconcile predicated definitions into a single context following the join point.

Figure 2 shows the code that is actually dispatched by our processor front-end. The conditional branch has been replaced by an assignment to a predicate register (P1). Registers defined by either path have been renamed. Cmoves, instructions (k), (l) and (m), have been dispatched prior to the join point. The cmoves merge the results from each branch path, providing a unique physical register for subsequent instructions. For example, (l) renames R2 to physical register l, using either the original value of R2 from (b) or the value computed at (i).

Alternatively, rather than using conditional moves, the processor front-end could cease issuing instructions when it reaches a join point and wait for the outstanding predicate computation (e) to resolve. Cmoves consume additional instruction window slots and register names. However, using cmoves does not stall the decoder, and instructions not dependent on the predicate can continue to issue; e.g., (q) depends on the value of R4, which is not redefined in the branch hammock, and can be issued once (d) completes.

2 Implementation of Dynamic Predication

To support dynamic predication, the pipeline must be extended to identify hammocks, predicate the instructions within hammocks, and support the correct execution of instructions within the created fork, then, and else contexts. In addition, this support must coexist harmoniously with general branch speculation. In the following text, we detail the enhancements we made to our baseline out-of-order issue processor pipeline. An overview of the pipeline extensions and instruction re-order buffer is shown in Figure 3.

2.1 Decoder Support

2.1.1 Dynamic Hammock Predication

Once the decoder determines that a hammock should be predicated, it indicates this in the re-order buffer entries for the predicated instructions. In addition, a predication context tag (described in Section 2.4) is assigned to the hammock instructions along with a predicate value. The predicate value for the predicate region, i.e., true or false, is compared against the resulting predicate in the context tag when committing the instructions. If the value of the context tag is equal to the predicate value then the instruction is committed; otherwise it is squashed.

The decoder uses the branch target of a hammock branch to determine when the then, else, and join points are encountered, as previously described. When the else point is encountered, the predicate value stored with the instructions is inverted. Also, note that the unconditional branch terminating the then-path is not inserted into the instruction window, since its effect is subsumed by predication. When the join point is encountered, predication of instructions is terminated until the next hammock.

2.2 Renamer Support

The renamer must be extended to correctly implement register communication in the presence of multiple predicated execution contexts. A predicated definition of a register must only be visible in the same predicated context, or in later contexts after the join point, once the predicate has resolved to be true. The renamer must be able to accomplish this task without knowledge of the predicate value, since stalling would be tantamount to stopping instruction fetch until the branch resolves.
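The decoder behavior of Section 2.1.1 can be sketched as a small state machine over the fetched instruction stream: replace the hammock branch with a predicate definition, tag hammock instructions with a context tag and predicate value, invert the value at the else point, drop the then-path terminator, and stop predicating at the join. The stream encoding and names below are ours, for illustration only.

```python
def decode_hammock(stream, fork_pc, else_pc, join_pc, tag):
    """stream: dict pc -> op, in fetch order.
    Yields (pc, op, context_tag, predicate_value) as dispatched."""
    pred_value = False                 # then-path commits when predicate is false
    for pc in sorted(stream):
        op = stream[pc]
        if pc == fork_pc:
            yield pc, "pred := CC", tag, None   # branch becomes a predicate def
            continue
        if pc == else_pc:
            pred_value = True          # invert the stored value at the else point
        if fork_pc < pc < join_pc:
            if op == "bra":            # then-path terminator subsumed by predication
                continue
            yield pc, op, tag, pred_value
        else:
            yield pc, op, None, None   # outside the hammock: un-predicated
    # at join_pc and beyond, predication is off until the next hammock
```

At commit, an instruction's stored predicate value is compared against the resolved predicate of its context tag, as described above.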

[Figure 3 appears here, spanning both columns. It shows the decode, rename, schedule, and commit stages with the dynamic predication additions: hammock detection, an optional confidence estimator, and CMOV injection in decode; fork, then, and else rename contexts with speculation checkpoints; context-tagged load/store queue entries in the memory scheduler; and re-order buffer entries extended with a context tag and predicate value, committed only if the attached tag is live. The CMOVE injection rules shown in the figure are: if a logical register is defined in both the then and else contexts, a cmove selects between the two definitions; if it is defined in only one of them, a cmove selects between that definition and the fork context; if it is defined in neither, no cmove is inserted.]

Figure 3. Pipeline Overview. Shown are the additions made to the baseline pipeline to support dynamic predication. The fields shown in gray are added to the instruction re-order buffer entries.
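The CMOVE injection rules depicted in Figure 3 reduce to a four-case table over where a logical register was defined. A minimal sketch, with our own function name and mapping representation (logical register to physical register per context):

```python
def cmove_for(reg, then_map, else_map, fork_map, pred):
    """Apply the Figure 3 injection rules for one logical register.
    Returns a cmove tuple, or None if no cmove is needed."""
    t_def, e_def = reg in then_map, reg in else_map
    if not t_def and not e_def:
        return None                              # undefined/undefined: no cmove
    then_src = then_map.get(reg, fork_map[reg])  # undefined side -> fork value
    else_src = else_map.get(reg, fork_map[reg])
    # physical cmove: 2 sources, 1 fresh destination, same logical register
    return ("cmov", reg, pred, then_src, else_src)
```

The renamer would call this once per logical register at the join point, allocating a fresh physical destination for each returned cmove.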

The rename table is extended to include three physical register definitions per logical register, as shown in the rename stage of Figure 3. The fork entry represents the definition before encountering the hammock branch. The then and else entries represent the definitions on the then- and else-paths of the hammock, and are only used while predicating a hammock. Each access to the register mapping table produces mappings for all three contexts. The context of the instruction determines which mapping is actually used. If the register is defined under the current predicate, the predicate entry is used. If not, the previous definition from the fork context is used. The predicated definitions are all cleared at the beginning of a predicate region. At the end of the predicate region, all predicated definitions are combined into a single physical storage location using physical cmove operations. Note that physical cmoves are different from logical cmoves, which are found in some ISAs. A physical cmove addresses three physical registers, 2 source and 1 destination, which all correspond to the same logical register but were generated to accommodate predicated multi-path execution. The renamer injects a cmove for each logical register that was defined in a predicated context, according to the rules stated in Figure 3. Each injected cmove takes up one dispatch slot when it is put into the instruction window, which delays the dispatch of operations fetched after the join point. Dynamic expansion of ISA instructions into micro-operations is a proven technique, used e.g. in the PentiumPro and K6 processor [4] implementations. Our technique is much simpler than this full translation of each instruction.

2.3 Scheduler Support

The memory scheduler must be extended to operate correctly in the presence of multiple predicated and un-predicated contexts, as shown in Figure 3. Data flow through memory cannot occur between disjoint predicate regions, and predicated definitions communicated through memory must be gated by their predicate values. To enforce this restriction, we added context tags to each entry of the load/store queue (context tags are detailed in Section 2.4). A reserved context identifier is assigned to loads and stores executed in un-predicated regions of code. Accordingly, a store in the un-predicated context may forward to any following load, and a store in a predicated context may forward to a load in its own context, or in the un-predicated context once the store's predicate value has been verified to be true.

It is interesting to note that dynamic scheduling becomes much more challenging in the presence of generalized compiler-based predication (e.g., as in [18]). If individual instructions can be decorated with arbitrary predicates, the dynamic scheduler must assume that any two instructions that share a definition are dependent unless it can "prove" that their unresolved predicate definitions do not imply each other.

Our predication support greatly simplifies the dynamic scheduler's task by restricting predicates to sequential blocks of logically disjoint code. Thus the dynamic scheduler can trivially determine that instructions outside of the predicate region will always execute independent of the predicate values, instructions within the same predicate region will execute together independent of the predicates, and instructions in the "then" and "else" contexts never execute together and thus are independent.
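The store-to-load forwarding restriction enforced by the context-tagged load/store queue can be sketched as a single predicate on (store context, load context) pairs. The reserved identifier and function name below are our own illustrative choices.

```python
UNPRED = 0  # reserved context id for loads/stores in un-predicated code

def may_forward(store_ctx, store_pred_ok, load_ctx):
    """Is the store allowed to forward its value to a later load?
    store_pred_ok: store's predicate has been verified to be true."""
    if store_ctx == UNPRED:
        return True                    # un-predicated store forwards to any load
    if load_ctx == store_ctx:
        return True                    # same predicated context
    if load_ctx == UNPRED:
        return store_pred_ok           # only once the predicate proved true
    return False                       # disjoint predicate regions never forward
```

In particular, a store in the then-context can never forward to a load in the else-context, since the two contexts never execute together.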

The register scheduling logic, for the most part, is unchanged. Using the renaming mechanism described in the previous section, instructions may begin execution as soon as their operands are ready. Predicated instructions do not have to wait for their predicates to be resolved, since unneeded results will be discarded by the cmove operation injected by the renamer. If the predicate arrives before both of the use values for a cmove instruction are calculated, the cmove can complete as soon as the value corresponding to the correct predicate context is produced.

2.4 Pipeline Recovery Support

Support for predicated execution requires a more selective pipeline recovery mechanism than those typically found in modern microprocessors. Most modern processors employ N-level speculation recovery support, where a processor checkpoints the architected state of the machine at up to N precise instruction points, typically at speculated branches. Recovery is initiated by rolling back the state of the machine to a checkpoint. This discards all instructions that were fetched after the restored checkpoint. In the presence of hammock predication, it is desirable to be more selective when killing instructions. Instead of a mechanism that squashes all instructions past a certain point, we use a mechanism that deactivates a range of instructions. This allows incorrect predicate regions that are embedded within un-predicated code to be "killed" without throwing away useful work in later instructions.

To implement this instruction deactivation support, we use an instruction tagging mechanism, where a unique tag is assigned to each predicated region of code. This tag is stored with the instructions within the predicate region in the instruction re-order buffer entry, as shown in Figure 3. The tags are held in a queue, and are assigned to hammocks by the decoder as they are encountered and predicated. The instructions themselves hold the tag under which they execute, and the predicate value of their context, i.e., predicate-true or -false. Tags are recovered for reuse when the last instruction in the predicate region assigned to that tag has retired. When a predicated branch condition is evaluated, instructions on the incorrect path associated with the tag are deactivated. The tag value, and its predicate result, are broadcast to all instructions in the re-order buffer. All instructions with a matching tag and opposite predicate value are marked as inactive. Once deactivated, an instruction will not commit its results to the architected state at retirement, nor will it declare any fault conditions it has detected, as shown in the commit stage of Figure 3. The scheduler does not issue inactive instructions; hence, once deactivated, they do not use execution bandwidth. Since they still have to be retired (but not committed), however, they do consume retirement bandwidth.

In addition to the instruction tagging mechanism, the pipeline also supports the normal recovery mechanism for branch speculation. There is a fine distinction between predication and speculation recovery support. With speculation support, the recovery mechanism checkpoints the state of the machine, and rolls back to the checkpoint when it recovers from mis-speculation. With predication, however, there is no need to roll back the state of the machine. The data flow accommodates the control conditions through the use of cmoves, and it is only necessary for the recovery mechanism to omit committing results from predicated instructions on the incorrect path. Since there is no rollback, there is no need to checkpoint state for predicated branches. Since additional speculation is not allowed within predicated regions, the speculation checkpoints do not contain any predicated state; the predication state is cleared when a speculation checkpoint is restored.

Our architecture uses precise exceptions. If an instruction incurs an exception during execution, the exception condition is stored instead of the result. The exception is handled at instruction commit. Exceptions of instructions in predicated paths can be handled in the normal way. At the time the exception is handled, the architecture can restore the same state as would be restored after speculating the hammock fork branch. No extra predicated state needs to be considered by exception handlers. Also, when execution is restarted after exception handling, it is always restarted in non-predicated and non-speculative mode, exactly as in a normal speculative architecture.

3 Prior Work

In order to increase instruction level parallelism and reduce the number of branches in a program, predicated execution has been proposed as an alternative for conditional code sequences. Full predication was used in the Cydra 5 [18] architecture. Mahlke et al. [15] proposed hyperblocks, or superblock scheduling extended to support predicated architectures. Tyson [23] and Mahlke et al. [13] studied the potential benefits of predication on branch prediction accuracy. Pnevmatikatos and Sohi [16] proposed guarded execution, where a single instruction specifies the predication information for subsequent instructions using a bit-mask.

Tyson studied both partial predication, where an architecture supports only a few, limited predicated instructions such as conditional moves, and full predication, where all instructions are labeled with a predicate, including support for speculative loads and stores. That study attempted to predicate short forward branches, which correspond to a "one-sided branch hammock" that only has a then-context. Both Tyson and Mahlke found that 30% of the dynamic branch count could be removed using full predication. Tyson also found that only 5% of the branches could be predicated using partial predication. Half of this difference was due to load and store instructions, and the other half was due to the presence of other branch instructions. Our architectural model addresses a more general control structure (two-sided branch hammocks) and supports speculative loads and stores within the dynamically predicated region. Mahlke et al. [14] provide more comparison of fully vs. partially predicated execution, and provide a detailed explanation of scheduling partially predicated code. Mahlke makes the point that simple predication schemes, such as conditional moves, increase the processor instruction count. Our model reduces the impact on the I-cache by dynamically inserting the conditional moves.

There is very little work on what we would term "dynamic predication". Sprangle and Patt [20] show that with predication, the renamer can reuse storage if it knows that the storage is predicated under complementary conditions. Chang et al. [3] studied a fully predicated ISA on a speculative architecture with register renaming. They use an additional source input to each instruction to allow copying of the old logical destination register value to the new physical destination register if the predicate is false. This reintroduces output dependencies (write-after-write) between instructions that produce the same logical destination register. Our solution avoids these artificial dependencies by using conditional moves to reconcile data dependencies.

Heil and Smith [9] and Tyson, Lick and Farrens [22] also discuss "dual path" execution. The paper by Tyson does not describe an implementation. The paper by Heil mentions duplicating the processor pipeline resources. By comparison, our method can pipeline multiple branch hammocks, meaning that more than two paths can be under evaluation at any one time. Furthermore, our modifications to the decoder, renamer and scheduler are much simpler than the proposal in [9]. In [24], Uht et al. describe Disjoint Eager Execution and minimal control dependencies, which allow execution of control- and data-independent instructions after the join, concurrently with instructions before the fork branch. Our approach has the same property but comes with a much smaller hardware overhead. Klauser et al. [12] describe multipath execution, which uses a path naming scheme that has similarities to our context tagging. However, their work only considers fork operations without rejoining paths.

Rau [17] proposed a mechanism to implement dynamically scheduled VLIW architectures that is similar to our injection of conditional move operations used to reconcile contexts. Our model is similar in that we insert a conditional move to "rename" predicated computation should the predicate be true, but we use dynamic dependence information to control the scheduling of the conditional move rather than using a fixed-delay queue.

Some current architectures implement limited support for predicated execution; e.g., the Alpha ISA [1] supports conditional move instructions and the HP-PA 2.0 ISA [11] supports conditional nullification of one instruction after most computational instructions. In comparison, our approach uses full predication in the microarchitecture without requiring additional instruction set support.

L1 Icache:         64 kB, 32 byte lines, 2-way set-associative, 2 cycle hit latency
L1 Dcache:         64 kB, 32 byte lines, 2-way set-associative, 2 cycle hit latency
L2 Cache:          combined 512 kB, 64 byte lines, direct mapped, 6 cycle hit latency
Memory:            128 bit wide, 26 cycle access latency
Branch Predictor:  McFarling combined, gshare + bimodal, 2k entries each
BTB:               1024 entries, 4-way set-associative; 32 entry return address stack
TLB:               64 entries (I), 128 entries (D), fully associative
Functional Units,  8 Int ALU (1/1), 2 Int Mult (3/1) / Div (20/19), 4 Ld/St (2/1),
Latency (total/issue):  8 FP Add (2/1), 2 FP Mult (4/1) / Div (12/12) / Sqrt (24/24)

Table 1. Machine configuration parameters.

                                   compress   gcc     perl    go      m88ksim  lisp   vortex  jpeg
Data-Set                           train**    amptjp  test*** 2stone9 dhry     train  train*  specmun*

Baseline:
  Instructions (M)                   80.4    250.9   227.9   548.2   416.5   183.3   180.9   252.0
  Cycles (M)                         38.3    197.7   174.1   507.0   197.5    96.4    53.7    84.7
  Branches (M)                       14.4     50.4    43.7    80.3    89.8    41.8    29.1    20.0
  Misprediction Rate                 9.9%    11.9%   11.2%   24.1%    4.3%    7.0%    1.7%   10.4%
  Inst. Per Branch                    5.6      5.0     5.2     6.8     4.6     4.4     6.2    12.6
  Dynamic Inst. Between Mispred.     56.7     41.9    46.7    28.3   106.9    63.1   365.8   121.1
  Inst. Per Cycle                     2.1      1.3     1.3     1.1     2.1     1.9     3.4     3.0

Static Hammocks:
  Cond. Forward Branches             1450    33568    7215    6765    3727    2022   11756    4992
  Single-sided                        403     6128    1319    1743     857     420    1423    1248
  Double-sided                         77     1640     380     321     218     101     343     202
  Simple                              282     3822     968    1176     660     316    1207     734
  Max. Nesting                          5       18       8      10       7       5       7      10
  Max. Size (Inst.)                    50      243     297     562     237      50     297     302

Simple Hammock Predication:
  Instructions (M)                   87.0    257.6   235.5   575.3   465.2   184.3   186.8   259.7
  Fork Branches (M)                   1.8      2.6     3.4     7.0    21.4     0.7     2.1     1.8
  Inserted Cmoves (M)                 3.9      2.6     3.8    11.1    24.3     0.6     2.2     3.2
  Killed Pred. Inst. (M)              2.7      4.1     3.8    16.0    24.4     0.4     3.6     4.4
  Committed Pred. Inst. (M)           9.2      5.5     8.7    20.7    35.8     2.6     6.2     2.7
  Removed Uncond. BR (M)              0.6      0.2     0.3     2.6     1.3     0.2     0.2     0.1
  Predicated Branches               12.7%     5.2%    7.7%    8.7%   23.8%    1.6%    7.1%    8.8%
  Avg. Hammock Size                   6.5      3.7     3.7     5.2     2.8     4.7     4.8     4.0
  Avg. Cmoves Per Hammock             2.1      1.0     1.1     1.6     1.1     0.9     1.1     1.8
  Avg. Committed / Killed             3.5      1.3     2.3     1.3     1.5     5.9     1.7     0.6

Table 2. Benchmark characteristics and baseline machine performance. The benchmark data sets marked (*) have reduced input sizes, the data set marked (**) has an increased input size, and the data set marked (***) is the test input from the source distribution. Baseline statistics represent the baseline speculative architecture without predicated execution. Simple Hammock Predication statistics are taken from size-based hammock selection with a threshold of 32 instructions. All instruction counts represent millions (M) of retired instructions.

              Size-Based (max. hammock size)        Profile-Based (size 32)   Perfect Branches
                                                                              Simple Hammock           Complex Hammock
                 6      10      16      24      32    Basic   Enhanced        predict  predict+fetch   predict  predict+fetch
compress      8.72%  10.19%  10.19%  10.20%  10.20%   8.02%    8.84%          10.80%   10.93%          10.76%   11.76%
gcc           2.85%   3.04%   3.11%   3.08%   3.11%   3.07%    3.07%           3.06%    3.26%           6.60%    7.09%
perl          5.62%   5.73%   5.87%   5.82%   5.84%   6.16%    6.16%           5.88%    6.34%           8.70%    9.36%
go            5.95%   6.30%   6.45%   6.50%   6.50%   6.47%    6.45%           6.70%    7.14%          24.78%   25.97%
m88ksim      11.48%  10.92%  12.07%  12.28%  12.40%  12.58%   13.18%          12.91%   13.66%          21.96%   23.12%
xlisp         1.03%   1.05%   1.17%   1.17%   1.17%  -0.01%   -0.01%           0.94%    1.01%           0.86%    1.02%
vortex        0.22%   0.58%   0.50%   0.37%   0.21%   1.66%    1.65%           1.98%    2.03%           3.41%    3.80%
jpeg          3.33%   3.35%   3.37%   3.35%   3.35%   5.20%    5.23%           5.00%    5.83%          10.27%   12.11%
g-mean        4.84%   5.08%   5.27%   5.27%   5.27%   5.33%    5.50%           5.84%    6.20%          10.65%   11.49%

Table 3. Percent speedup for hammock predication with static hammock selection methods (left hand side) and perfect hammock branch execution (right hand side). Perfect prediction and fetch for simple hammock branches represents an upper bound model for hammock predication.

4 Results

We have used the SimpleScalar tools [2] to build a pipeline-level simulator of our proposed dynamic predication architecture. The baseline machine architecture is a superscalar, out-of-order execution, in-order commit processor with architectural parameters as detailed in Table 1. The architecture can fetch, issue, and commit up to 8 instructions each cycle and has a 128-entry instruction window, a 64-entry load/store queue, and a 16 instruction fetch queue. The architected misprediction penalty is 16 cycles.

For our performance evaluation, we use the SPECint95 benchmark suite with scaled-down input data sets to reduce simulation time. All benchmarks run to completion. Table 2 shows the benchmark characteristics and their performance.

4.1 Branch Misprediction Contributions

Tyson and Mahlke [23, 13] found that 30% of the dynamic branches could be removed by predication. We have measured the effect on performance if these branches were removed. In Figure 4, we show the relative contribution of the following subclasses of conditional branches to the misprediction rate of all conditional branches:

- Simple hammocks are branches that start a hammock with a single basic block in the then- and else-path, i.e., there are no nested control instructions inside the hammock.

- Complex hammocks are branches that start the remaining hammocks, i.e., the hammocks with arbitrary embedded control-flow like function calls, loops, and nested hammocks.

- Non-hammock forward branches are all other forward branches that do not start hammocks.

- Non-hammock backward are all backward branches.

[Figure 4. Conditional branch mispredictions. A bar chart showing, per benchmark and on average, the percentage of all conditional branch mispredictions contributed by simple hammocks, complex hammocks, non-hammock forward branches, and non-hammock backward branches.]

Our dynamic predication architecture can predicate the subclass of simple hammocks. On average, approximately 11% of all mispredictions of conditional branches fall into this class.

To assess the performance impact of these mispredicted simple hammock branches, we simulated the performance of our benchmarks when simple hammock branches are executed perfectly. Perfect execution of hammock branches in this context has two components: avoiding misprediction penalties by perfect prediction, and avoiding instruction fetch-block breaks by perfect instruction fetch across the branch. Table 3 shows the results for perfect prediction and for perfect prediction with perfect fetch. Perfect fetch can accrue a small additional performance gain over perfect prediction.

Perfect prediction and fetch is useful as an upper bound model for hammock predication. Predicated hammocks cannot be mispredicted, since both paths are executed. Also, fetches across hammock forks (former branches) do not break the fetch-block, since these fetches are always consecutive.

The table also shows the speedup for perfectly executed complex hammock branches. Note that for some benchmarks, like compress, there is little difference, since the fraction of mispredicted complex hammocks is small. Other benchmarks, like go, have a significantly larger portion of mispredicted complex hammock branches. However, our proposed dynamic predication architecture does not predicate complex hammocks, since it would require a much larger hardware investment than predicating only simple hammocks. As Table 2 shows, complex hammocks in go can reach a maximal depth of 10 nested hammocks and a size of 562 instructions, far beyond what could be efficiently predicated.

Note that this perfect branch execution model is sometimes slightly outperformed by our real dynamic predication models. This is an artifact of our perfect model, which does not include the effects of eliminating then-path terminal unconditional branches. Another uncertainty comes from our perfect branch prediction model, which introduces minor timing perturbations in some cache accesses. Also, mispredicted branches can in rare cases slightly increase performance by prefetching instructions which will be used in the near future. A case of this anomaly can be seen, e.g., with compress, where perfect prediction for simple hammocks slightly outperforms perfect prediction of simple and complex hammocks.

4.2 Static Hammock Selection Methods

In the previous section, we introduced an upper bound model for hammock predication. In the following subsections, we show the actual performance of our dynamic predication architecture with several hammock selection methods.

4.2.1 Size-based Hammock Selection

The simplest static selection method is based only on the hammock size, i.e., the number of instructions between the fork and the join. Table 3 shows the performance of size-based hammock selection with varying hammock sizes. Each hammock with a size up to the threshold size is marked for predication; larger hammocks are executed with speculation instead. The intention behind this heuristic is that small hammocks are efficient to predicate, since the whole hammock is fetched in one or two cycles. It would take the same amount of time to speculate across this hammock, with the potential for mis-speculation.

The results in Table 3 show that the performance increases up to a hammock size of 16 instructions and stays almost constant for larger sizes. Even with this simple selection mechanism, the geometric mean of 5.27% speedup is already close to the 6.2% speedup of perfect branch prediction and fetch for simple hammock branches. Only two benchmarks, vortex and jpeg, fall significantly short of their perfect execution performance. Lisp, on the other hand, slightly outperforms our perfect model due to the reasons outlined above.

Table 2 gives some insight into why this simple hammock selection technique works surprisingly well. The rows labeled "Simple Hammock Predication" correspond to the performance of size-based hammock selection with a threshold size of 32 instructions. The table shows that, although up to 32 instructions are allowed in a hammock, the average hammock size is only 4 instructions, or half a fetch-block. Also, on average, fewer than 2 cmoves are injected to reconcile data dependencies. Due to these cmoves and predicated wrong-path instructions, the number of retired instructions for dynamic predication increases on average by about 5%.
It Branch Predictor Update is also interesting to note that for all but one benchmark (jpeg), the history number of correct-path instructions in hammocks is larger than the none history & counters Perfect Branches 6.20% 5.54% 5.02% number of wrong-path instructions. This program feature helps to Size-Based 5.27% 5.13% 4.55% limit the resource requirements of wrong-path predicated instruc- Profile-Based / Basic 5.33% 5.44% 4.79% tions, which increases the performance of predicated execution. Profile-Based / Enhanced 5.50% 5.54% 4.95% The overall performance improvement of the different bench- Saturating Counters 4.88% 4.91% 1.96% marks, as seen is Table 3, is roughly connected to the percent- Miss-Distance Counters 5.10% 4.97% 3.36% age of conditional branches that can be predicated (see Table 2). The benchmarks with the largest fraction of predicated branches, m88ksim and compress, show the highest performance improve- Table 4. Overall percent speedup for different hammock selec- tion methods and branch predictor update strategies for ham- ments. The benchmarks with the smallest fraction of predicated mock branches. branches, lisp, gcc, and vortex, show the lowest improvements.
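In code, the size-based heuristic of Section 4.2.1 reduces to a single comparison per hammock branch. The following Python sketch is illustrative only; the function and threshold names are ours, not part of the paper's simulator.

```python
# Size-based hammock selection (illustrative sketch): a hammock is
# marked for dynamic predication iff the number of instructions
# between its fork and join is at most a fixed threshold; larger
# hammocks are left to ordinary branch speculation.
def select_by_size(hammock_sizes, threshold=16):
    """hammock_sizes: {branch_pc: instructions between fork and join}"""
    return {pc: size <= threshold for pc, size in hammock_sizes.items()}

marks = select_by_size({0x1000: 4, 0x1040: 12, 0x1080: 40})
# the 4- and 12-instruction hammocks are predicated; the 40-instruction one is not
```

With the best-performing threshold of 16 instructions reported above, every hammock that fits in one or two fetch blocks would be marked.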

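To make the cmove injection discussed above concrete, the following behavioral sketch (ours, not the paper's hardware description) shows a simple double-sided hammock and the reconciliation an injected cmove performs at the join.

```python
# Behavioral model of dynamically predicating the simple hammock
# "if (a < b) x = a; else x = b;": the hammock branch condition
# becomes a predicate, both paths execute, and an injected cmove
# at the join selects the surviving value of x.
def cmove(pred, then_val, else_val):
    # conditional move: keep the then-path result iff the predicate holds
    return then_val if pred else else_val

def predicated_hammock(a, b):
    p = a < b            # former hammock branch, now a predicate
    x_then = a           # then-path result, guarded by p
    x_else = b           # else-path result, guarded by not p
    return cmove(p, x_then, x_else)  # injected cmove at the join
```

Both paths contribute a version of `x`; only the cmove's selected value survives, which is why no misprediction is possible for a predicated hammock.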
4.2.2 Profile-based Hammock Selection

To enhance the quality of hammock selection, we have also considered profile-based hammock selection. We assume the binary modification tool has access to branch prediction profiles. These profiles contain information about correct predictions and mispredictions for each branch, and can be gathered with tools like ProfileMe [5]. We capture the predication and speculation effects by comparing their wrong-path instruction costs:

- ThenPathIncorrect = number of times the else-path should have been executed
- ElsePathIncorrect = number of times the then-path should have been executed
- PredicationCost = ThenPathIncorrect * ThenPathSize + ElsePathIncorrect * ElsePathSize
- SpeculationCost = Mispredictions * MispredictPenalty * IssueWidth
- PredicationCostRatio = PredicationCost / SpeculationCost

The parameters ThenPathIncorrect and ElsePathIncorrect for each branch are extracted from the profile. This cost model estimates the number of instructions from the incorrect path that would enter the pipeline for predication or speculation. It does not capture secondary effects such as cmove injection and data-dependence-related stalls. With this method, a branch is marked as a hammock branch if its predication cost ratio is below a threshold.

Table 3, column Profile-Based/Basic, shows the results for self-profiling and the best threshold of 5. The optimal threshold is above 1 since the hidden, not-modeled costs of speculation are higher than the hidden costs of predication. Comparing profile- and size-based hammock selection shows that the performance increases for jpeg, vortex, and perl, whereas it decreases for compress and lisp, resulting in only a small average performance gain for profile-based selection. The performance losses of two benchmarks can be attributed to the cost metric used for profile-based hammock selection. Our basic cost metric only models costs related to perfect prediction, not the costs related to fetch-block breaks. To alleviate this problem, we have modified the hammock selection algorithm.

Speculating across a double-sided hammock always results in a fetch-block break, either at the hammock branch to jump to the else-path, or at the terminal then-path branch to jump across the else-path. To avoid performance losses due to this fetch-block break, it would be advantageous to predicate very small double-sided hammocks, even if they are predicted with high accuracy. Our enhanced profile-based selection algorithm therefore always predicates double-sided hammocks with then-path sizes up to 4 instructions (half a fetch block); other hammocks are predicated based on the basic predication cost model presented above. The results in Table 3 show that the enhanced selection algorithm performs the best hammock selections of all algorithms studied so far. It reaches a speedup of up to 13% for m88ksim, and an average of 5.5% over all benchmarks.

4.3 Branch Predictor Influence

Hammock branches no longer depend on the branch predictor, since instruction fetch always follows the fall-through path and fetches contiguously until the join is reached. This property can be used to reduce the interference for other branches in the branch predictor.

We have looked at three branch predictor update strategies that differ in the amount of interaction between predicated hammock branches and the branch predictor: (1) hammock branches do not update the branch predictor at all, (2) they only update the global history, and (3) they update history and counters. Table 4 shows the results of this comparison for the different hammock selection methods. For perfect branch execution, (1) outperforms the other two approaches, and we have shown the data for this case in Table 3.

Approach (2) eliminates destructive interference in the counters due to hammock branches, while still allowing other (non-hammock) branches to benefit from history correlation to hammock branches. Our results reveal that the performance of (1) and (2) shows only minor differences. Some benchmarks, like go and vortex, show a slight benefit from history correlation to hammock branches, whereas other benchmarks, like m88ksim and lisp, show a negative interference of hammock branch history with other branches. Overall, history update has a negligible effect on performance. In the data presented earlier, we have assumed no history update, since it reduces the complexity of the micro-architecture.

Approach (3) consistently displays the lowest performance improvements, since it does not benefit from reduced branch predictor interference. As seen for perfect hammock branch execution in Table 4, eliminating the interference in the branch predictor contributes approximately 1% to overall speedup.

4.4 Dynamic Hammock Selection Methods

We have also considered dynamic hammock selection methods that use a branch confidence estimator [8, 10] to dynamically determine if a hammock should be predicated. The two designs studied are "Saturating Counters" and "Miss-Distance Counters".

Saturating counters is a simple method that uses the state of the branch predictor counters to derive the confidence estimation. If both indexed counters in the McFarling branch predictor are saturated in the same direction, the branch prediction is of high confidence; otherwise it is low confidence. This scheme is called "Both-Strong" in [8]. It does not require any additional storage for the confidence estimator, since it uses the branch predictor counters.

The second scheme is more expensive and keeps a separate table of miss-distance counters (MDC), which count the number of correct predictions since the last misprediction for each index. The MDC is incremented for each correct prediction, and reset to zero for each incorrect prediction. If the MDC value is below a threshold, the prediction is low confidence; otherwise it is high confidence. As suggested in [8], we have adapted the indexing mechanism of the MDC table to the McFarling predictor and keep a separate MDC table for each of the two component predictors. The McFarling meta-predictor selects between the two.

The overall performance for dynamic hammock selection can be seen at the bottom of Table 4. For MDC, we show the best performance, which was reached for an MDC threshold of 7. As expected, the more expensive MDC outperforms the cheaper saturating-counter method. However, dynamic hammock selection does not perform as well as static hammock selection. We believe the reason for this lies in the information that the branch confidence estimator uses to derive its decisions. Confidence estimation only considers the merit of correct branch predictions, not the advantage of contiguous fetch blocks. It therefore prevents well-predictable hammocks from being predicated. However, it is advantageous to predicate small hammocks even if they are predicted with high accuracy, since predication allows for more efficient instruction fetch. This is supported by the fact that simple size-based hammock selection already performs very well. Therefore, we believe that dynamic confidence estimation is not necessary for dynamic predication.

5 Contributions

In this paper we have concentrated on developing the micro-architectural mechanisms needed to implement dynamic hammock predication. We developed an architectural mechanism called Dynamic Predication that dynamically predicates and executes both paths after a hammock branch. We describe a context tag architecture that allows multiple paths of execution to coexist on a single-threaded architecture, and modifications to the rename table that provide contexts for the before-fork, then, and else paths for proper register renaming in the presence of multiple-path execution.

Dynamic predication provides robust speedups of up to 13% over an existing architecture that already has aggressive support for out-of-order speculative execution. On average, it can harness about 90% of the performance improvement that could be achieved through perfect execution of hammock branches.

Moreover, dynamic predication demonstrates consistent performance improvements with only modest hardware costs: replication of renamer hardware and additional control logic. Dynamic predication exposes predication to the compiler without requiring conditional moves and speculative instruction support in the instruction set architecture. It is an effective measure to hide branch misprediction latency while not creating unnecessary barriers to dynamic scheduler performance.

Acknowledgements

We would like to thank Intel MRL for supporting Artur Klauser during the summer of 1997, Digital Equipment Corporation for an equipment grant that provided the simulation cycles, a grant from Hewlett-Packard, and the anonymous referees for providing helpful comments. This work was partially supported by NSF grants No. CCR-9401689 and No. MIP-9706286.

References

[1] D. P. Bhandarkar. Alpha Implementation and Architecture. Digital Press, 1996.
[2] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The SimpleScalar Tool Set. Technical Report TR#1308, University of Wisconsin, July 1996.
[3] P.-Y. Chang, E. Hao, Y. Patt, and P. Chang. Using Predicated Execution to Improve the Performance of a Dynamically Scheduled Machine with Speculative Execution. In Intl. Conf. on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June 1995.
[4] D. Draper et al. Circuit Techniques in a 266-MHz MMX-Enabled Processor. IEEE Journal of Solid-State Circuits, 32(11), Nov. 1997.
[5] J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. In 30th Annual Intl. Symp. on Microarchitecture, Dec. 1997.
[6] K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. Memory-System Design Considerations for Dynamically-Scheduled Processors. In 24th Annual Intl. Symp. on Computer Architecture, June 1997.
[7] J. Ferrante, K. Ottenstein, and J. Warren. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems, 9(3):319-349, July 1987.
[8] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun. Confidence Estimation for Speculation Control. In 25th Annual Intl. Symp. on Computer Architecture, Barcelona, Spain, June 1998.
[9] T. Heil and J. Smith. Selective Dual Path Execution. University of Wisconsin-Madison, Nov. 1996. http://www.ece.wisc.edu/~jes/papers/isca.sdpe.ps.
[10] E. Jacobsen, E. Rotenberg, and J. E. Smith. Assigning Confidence to Conditional Branch Predictions. In 29th Annual Intl. Symp. on Microarchitecture, pages 142-152, Paris, France, Dec. 1996.
[11] G. Kane. PA-RISC 2.0 Architecture. Prentice Hall PTR, 1996.
[12] A. Klauser, A. Paithankar, and D. Grunwald. Selective Eager Execution on the PolyPath Architecture. In 25th Annual Intl. Symp. on Computer Architecture, Barcelona, Spain, June 1998.
[13] S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, and W.-m. W. Hwu. Characterizing the Impact of Predicated Execution on Branch Prediction. In 27th Annual Intl. Symp. on Microarchitecture, San Jose, CA, Dec. 1994.
[14] S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu. A Comparison of Full and Partial Predicated Execution Support for ILP Processors. In 22nd Annual Intl. Symp. on Computer Architecture, pages 138-149, June 1995.
[15] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In 25th Annual Intl. Symp. on Microarchitecture, pages 45-54, Dec. 1992.
[16] D. N. Pnevmatikatos and G. S. Sohi. Guarded Execution and Branch Prediction in Dynamic ILP Processors. In 21st Annual Intl. Symp. on Computer Architecture, pages 120-129, June 1994.
[17] B. R. Rau. Dynamically Scheduled VLIW Processors. In 26th Annual Intl. Symp. on Microarchitecture, Austin, TX, Dec. 1993.
[18] R. Rau, D. Yen, W. Yen, and R. Towle. The Cydra 5 Departmental Supercomputer. IEEE Computer, 22(1):12-35, Jan. 1989.
[19] E. Sprangle, R. Chappell, M. Alsup, and Y. Patt. The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference. In 24th Annual Intl. Symp. on Computer Architecture, pages 284-291, May 1997.
[20] E. Sprangle and Y. Patt. Facilitating Superscalar Processing via a Combined Static/Dynamic Register Renaming Scheme. In 27th Annual Intl. Symp. on Microarchitecture, San Jose, CA, Dec. 1994.
[21] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In 23rd Annual Intl. Symp. on Computer Architecture, May 1996.
[22] G. Tyson, K. Lick, and M. Farrens. Limited Dual Path Execution. Technical Report CSE-TR 346-97, University of Michigan, 1997.
[23] G. S. Tyson. The Effects of Predicated Execution on Branch Prediction. In 27th Annual Intl. Symp. on Microarchitecture, pages 196-206, San Jose, CA, Dec. 1994.
[24] A. K. Uht, V. Sindagi, and K. Hall. Disjoint Eager Execution: An Optimal Form of Speculative Execution. In 28th Annual Intl. Symp. on Microarchitecture, pages 313-325, Dec. 1995.