Software Bubbles: Using Predication to Compensate for Aliasing in Software Pipelines

Software Bubbles: Using Predication to Compensate for Aliasing in Software Pipelines£ Benjamin Goldberg and Emily Crutcher Department of Computer Science New York University [email protected], [email protected] Chad Huneycutt and Krishna Palem College of Computing and Department of Electrical and Computer Engineering Georgia Institute of Technology [email protected], [email protected] Abstract dependences and those that can be analyzed statically as static dependences. This paper describes a technique for utilizing predica- The program fragment in figure 1(b) has the same ef- tion to support software pipelining on EPIC architectures fect in limiting the compiler’s ability to perform software in the presence of dynamic memory aliasing. The essential pipelining as the one in figure 1(a). In this example, the idea is that the compiler generates an optimistic software- arrays a and b may overlap and the distance (in words) be- pipelined schedule that assumes there is no memory alias- tween a[0] and b[0] corresponds to the value k in fig- ing. The operations in the pipeline kernel are predicated, ure 1(a). however, so that if memory aliasing is detected by a run-time One solution is to generate several different versions of check, the predicate registers are set to disable the iterations a loop, with different degrees of software pipelining. Dur- that are so tightly overlapped as to violate the memory de- ing execution, a test is made to determine the dependence pendences. We refer to these disabled kernel operations as distance (e.g. k in figure 1(a)) and a branch to the appro- software bubbles. priately pipelined loop is performed. The drawbacks of this approach include possible code explosion due to the multiple versions of each loop as well as the cost of the branch 1 Introduction itself. This paper presents a technique, called software bub- Software pipelining and other methods for parallelizing bling, for supporting software pipelining in the presence of loops rely on the compiler’s ability to find loop-carried de- dynamic dependences without generating multiple versions pendences. Often the presence or absence of these depen- of the loop. This approach is aimed at EPIC architectures, dences cannot be determined statically. Consider the frag- such as Intel’s IA-64 and HP Laboratories’ HPL-PD, that ment in figure 1(a), where the value of k cannot be deter- support both instruction-level parallelism and predication. mined at compile time. In such situations, the compiler can- The notation we’ll use in this paper for predication, i.e. not rule out the possibility that k=1, which would create a the ability to disable the execution of an operation based on loop carried dependence with a dependence distance of 1. a one-bit predicate register p,is That is, the value written to a[i] in one iteration would be p: operation a[i-k] read as in the next iteration. Pipelining this loop where operation is performed only if the value of the pred- a[i] would provide little benefit, since the store of in one icate register p is 1. In the HPL-PD, thirty-two predicate a[i-k] iteration must complete before the load of in the registers can collectively be read or modified as a single 32- next iteration. bit register, and in the IA-64 the same is true with 64 pred- For the purposes of this paper, we will refer to depen- icate registers. We assume a similar capability here, and dences that cannot be analyzed at compile time as dynamic refer to the aggregate register as PR. We’ll use the syntax £ This work has been supported by the Hewlett-Packard Corporation and p[i] to refer to the ith rotating predicate register (if rotat- the DARPA Data Intensive Systems program. ing predicate registers are available) and pi to refer to the void copy(int a[],int b[]) for(i=k;i<n;i++) { for(int i=0,i<n;i++) a[i] = a[i-k]; a[i] = b[i]; } (a) (b) Figure 1: Program fragments exhibiting dynamic dependences ith non-rotating predicate register. We will also use a simple instruction set rather than the The idea behind software bubbling is that the compiler, HPL-PD or IA-64 ISA. Operations appearing on the same when encountering a loop that has dynamic dependences line are assumed to be within the same VLIW instruction. but can otherwise be pipelined, generates a single version We continue to use the code fragment in figure 1(a) as of the pipeline that is constrained only by the static depen- the driving example in this paper. We’ll assume that the dences and resource constraints known to the compiler. The assembly code generated for the body of the loop is operations within the pipeline kernel, however, are predi- ; q points to a[i-k] cated in such a way that if dynamic dependences arise, the L1: ; s points to a[i] predicate registers are set in a pattern that causes some of the r = load q ; load a[i-k] into r kernel operations to be disabled each time the kernel is exe- store s,r ; store r into a[i] cuted. With this disabling of operations, the effective over- q = add q,4 ; increment q lap of the iterations in the pipeline is reduced sufficiently s = add s,4 ; increment s brd L1 to satisfy the dynamic dependences. We refer to the disabled operations as software bubbles, drawing an analogy where the branch operation, brd, decrements and tests a with bubbles in processor pipelines. In figure 2, we see an dedicated loop counter register, LC. If rotating registers are example of a software pipeline with and without bubbling. supported, the brd operation also decrements the rotating The crossed out boxes in figure 2(b) correspond to iterations register base. that have been disabled using predication. A possible pipelining of this loop is illustrated in fig- ¾ ure 2(a), where add½ and add refer to the first and second add operations in the body of the loop, respectively. 2 Predication in Software Pipelining In this pipelining, the kernel of the pipeline consists of a single VLIW instruction containing the operations in each For clarity of the presentation, we will make the following horizontal gray box of figure 2(a), namely1. ½ simplifying assumptions (except where stated otherwise): add¾ add store load The overlaying of iterations in figure 2(a) assumes that the ¯ In our examples, all operations have a latency of 1 (i.e. store the result is available in the next cycle). Although this memory location written by the operation in one load assumption is unrealistic, it has no impact on the tech- iteration is not read by the operation in the next iter- store nical results of this paper. It simply makes our exam- ation. If there were such a dependence between the load ples more compact. This assumption was not made in and the , then the two operations could not occur in our experiments. the same VLIW instruction and the pipelining in figure 2(a) would be incorrect2. ¯ Rotating registers, including rotating predicate regis- In our driving example, figure 1(a), the value of the vari- ters, are available. As with normal software pipeling, able k directly determines the dependence distance of the rotating registers are a convenience that reduce code loop-carried dependence. Rather than assume the worst, i.e. size but do not fundamentally change the pipelining or that k will be 1, a compiler performing software bubbling bubbling methods. generates the most optimistic pipeline, but one in which each kernel operation is predicated. Figure 2(b) illustrates ¯ The 32-bit aggregate predication register, PR,isused the execution of software-bubbled code in the case where k only to set the predication pattern for bubbling. The predicate registers used for other purposes (condition- 1We have intentionally oversimplified this discussion by leaving out the PR operands of the operations in the kernel. We address the issues of operands als, etc.) are assumed not to be part of . in subsequent sections. 2On the IA-64, unlike HPL-PD, stores and loads to the same address ¯ There is only one dynamic dependence in a loop. This can occur in the same cycle. This increases the possible overlapping of a assumption is relaxed in section 10. pipeline by one cycle, but doesn’t change the reasoning in this paper Load Load Store Load StoreLoad Store Load Add Store Load Add 1 1 Add Store Load Add Add Store Load 2 Add 1 2 1 Add Add 2 Add 1Store Load 2 Add 1 Store Load Add2 Add Add2 Add 1 Store 1 Store Add 2 Add 2 Add 1 Add 1 Add 2 Add 2 (a) (b) Figure 2: Software pipeline (a) without bubbling and (b) with bubbling has indeed been determined at run time to be 1. The opera- We describe simple bubbling over the next several sections. tions in the crossed-out boxes are disabled by clearing their Section 8 commences the description of generalized bub- predicate registers, thus each load operation doesn’t occur bling. Much of what is described for simple bubbling also until the cycle after the previous store, as desired. If the applies to generalized bubbling. store operation had a latency greater than one, additional iterations would have had to be disabled. It is helpful to view the crossing out of the boxes in fig- 4 The Kernel in Simple Bubbling ure 2(b) as simply pushing iterations down and to the right Upon examination of figure 2(b), where every other itera- in the pipeline.

Software Bubbles: Using Predication to Compensate for Aliasing in Software Pipelines

Sustaining Moore's Law Through Inexactness

An Efficient Memory Operations Optimization Technique for Vector Loops on Itanium 2 Processors

School of Electrical and Computer Engineering 2000-2001 Annual Report

Flexible Error Handling for Embedded Real-Time Systems – Operating System and Run-Time Aspects –

Lecture Notes in Computer Science 1742 Edited by G

Hardware and Software for Approximate Computing

Overcoming the Power Wall by Exploiting Application Inexactness and Emerging COTS Architectural Features: Trading Precision for Improving Application Quality 1

GRADUATE STUDY in COMPUTER SCIENCE Rice University Graduate

An Optimum Inexact Design for an Energy Efficient Hearing

Exploiting Errors for Efficiency: a Survey from Circuits to Algorithms

Error-Efficient Computing Systems

Microarchitecture-Aware Physical Planning for Deep Submicron Technology Mongkol Ekpanyapong