Optimizing Memory Accesses For Spatial Computation

Mihai Budiu and Seth C. Goldstein
Computer Science Department
Carnegie Mellon University
{mihaib,seth}@cs.cmu.edu

Abstract

In this paper we present the internal representation and optimizations used by the CASH compiler for improving the memory parallelism of pointer-based programs. CASH uses an SSA-based representation for memory, which compactly summarizes both control-flow and dependence information.

In CASH, memory optimization is a four-step process: (1) first, an initial, relatively coarse, representation of memory dependences is built; (2) next, unnecessary memory dependences are removed using dependence tests; (3) third, redundant memory operations are removed; (4) finally, parallelism is increased by pipelining memory accesses in loops. While the first three steps are very general, the loop pipelining transformations are particularly applicable to spatial computation, which is the primary target of CASH.

The redundant memory removal optimizations presented are: load/store hoisting (subsuming partial redundancy elimination and common-subexpression elimination), load-after-store removal, store-before-store removal (dead-store removal) and loop-invariant load motion.

One of our loop pipelining transformations is a new form of loop parallelization, called loop decoupling. This transformation separates independent memory accesses within a loop body into several independent loops, which are allowed to dynamically slip with respect to each other. A new computational primitive, a token generator, is used to dynamically control the amount of slip, allowing maximum freedom while guaranteeing that no memory dependences are violated.

1 Introduction

One of the main bottlenecks to increasing the performance of programs is that many compiler optimizations break down in the face of memory references. For example, while SSA is an IR that is widely recognized as enabling many efficient and powerful optimizations, it cannot be easily applied in the presence of pointers. In this paper we present an intermediate representation, Pegasus, that enables efficient and powerful optimizations in the presence of pointers. Pegasus also increases available parallelism by supporting aggressive predication without giving up the features of SSA.

One of Pegasus' main design goals is to support spatial computation. Spatial computation refers to the direct execution of high-level language programs in hardware. Each operation in the program is implemented as a hardware operator. Data flows from one operation to another along wires that connect the operations.

Pegasus is a natural intermediate representation for spatial computation because its semantics are similar to that of an asynchronous circuit. Pegasus is an executable IR which unifies predication, static single assignment, may-dependences, and data-flow semantics. The most powerful aspect of Pegasus is that it allows the essential information in a program to be represented directly. Thus, many optimizations are reduced to a local rewriting of the graph. This makes Pegasus a particularly scalable representation, yielding efficient compilation even when compiling entire programs to circuits.

In this paper we describe optimizations which increase memory parallelism, eliminate redundant memory references, and increase the pipeline parallelism found in loops. We show how Pegasus' explicit representation of both control and data dependences allows these optimizations to be concisely expressed.

2 Example

We motivate our memory representation with the following example:

    void f(unsigned *p, unsigned a[], int i) {
        if (p) a[i] += *p;
        else   a[i] = 1;
        a[i] <<= a[i+1];
    }

This program uses a[i] as a temporary variable. We have compiled this program at the highest optimization level using seven different compilers:

• gcc 3.2 for Intel P4, -O3,
• Sun WorkShop 6 update 2 for Sparc, -xO5,
• DEC CC V5.6-075 for Alpha, -O4,
• MIPSpro Compiler Version 7.3.1.2m for SGI, -O4,
• SGI ORC version 1.0.0 for Itanium, -O4,
• IBM AIX cc version 3.6.6.0, -O3,
• CASH.

Only CASH and the AIX compiler removed all the useless memory accesses made for the intermediate result stored in a[i] (two stores and one load). The other compilers retain the intermediate redundant stores to a[i], which are immediately overwritten by the last store in the function. The CASH result is even more surprising in light of the simplicity of the analysis it performs: the optimizations consist only of reachability computations in a DAG and localized code transformations (term-rewriting). The point of this comparison is not to show the superiority of CASH; rather, we want to point out that optimizations which are complicated enough to perform in a traditional representation (so that most compilers forgo them) are very easy to express, implement and reason about when using a better program representation.

The strength of CASH originates from Pegasus, its internal program representation. The optimization occurs in three steps, detailed in Figure 1. (We describe Pegasus in Section 3 and details of these optimizations in Section 5; the figure elides the address computation.)

Figure 1A depicts a slightly simplified form of the program representation before any optimizations. Each operation is represented by a node in a graph. Edges represent value flow between nodes. Pegasus uses predication [20]; throughout this paper, dotted lines represent predicate values, while dashed lines represent may-dependences for memory access operations; these lines carry token values. Each memory operation has a predicate input, indicating whether it ought to be executed, and a token input, indicating that the side-effects of the prior dependent operations have occurred. At the beginning of the compilation process the token dependences track the original program control-flow. The nodes denoted by "V" represent operations that combine multiple input tokens into a single token for each output; these originally represent joins in the program control-flow. The "*" node is the initial token input, indicating that side-effects in the previous part of the program have completed.

CASH first proves that a[i] and a[i+1] access disjoint memory regions, and thus commute (nodes 3/5, 2/5 and 5/6). By removing the token edges between these memory operations the program is transformed as in Figure 1B. Notice that a new combine operator is inserted at the end of the program; its execution indicates that all prior program side-effects have occurred.

In Figure 1B, the load from a[i] labeled 4 immediately follows the two possible stores (2 and 3). No complicated dataflow or control-flow analysis is required to deduce this fact: it is indicated by the tokens flowing directly from the stores to the load. As such, the load can be removed and replaced with the value of the executed store. This replacement is shown in Figure 1C, where the load has been replaced by a multiplexor 7, drawn as a trapezoid. The multiplexor is controlled by the predicates of the two stores to a[i], 2 and 3: the store that executes (i.e., has a "true" predicate at run-time) is the one that will forward the stored value through the multiplexor.

As a result of the load elimination, in Figure 1C the store 6 to a[i] immediately follows (and post-dominates) the other two stores, 2 and 3, to the same address; in consequence, 2 and 3 are useless and can be completely removed as dead code. Again, the post-dominance test does not involve any control-flow analysis: it follows immediately from the representation, because (1) the stores immediately follow each other, as indicated by the tokens connecting them; (2) they occur at the same address; and (3) each predicate of an earlier store implies the predicate of the latter one, which is constant "true" (shown as 1), indicating the post-dominance. The post-dominance is determined by only elementary boolean manipulations. The transformation from C to D is accomplished in two steps: (a) the predicates of the prior stores are "and"-ed with the negation of the predicate of the latter store (i.e., the prior stores should occur only if the latter one doesn't overwrite them) and (b) stores with a constant "false" predicate are completely removed from the program as dead code.
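For reference, the net effect of these transformations can be written at the source level roughly as follows (our own paraphrase of the result, keeping the intermediate value in a register; this is not actual compiler output):

    void f_optimized(unsigned *p, unsigned a[], int i) {
        /* the temporary lives in a register instead of a[i] */
        unsigned t = p ? a[i] + *p : 1;   /* a[i] is loaded only on the taken path */
        a[i] = t << a[i+1];               /* the single store that survives        */
    }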
2.1 Related Work

The explicit representation of memory dependences between program operations has been suggested numerous times in the literature; for example in Pingali's Dependence Flow Graph [21], or Steensgaard's adaptation to Value-Dependence Graphs [24]. As Steensgaard has observed, this type of representation is actually a generalization of SSA designed to handle value flow through memory. Other researchers have explored the use of SSA for handling memory dependences, e.g., [7, 8, 14, 5, 6, 13].

The original contributions of this work are:

• We apply the memory access optimization techniques in the context of a spatial compiler, which translates C programs to hardware circuits.
• We show how to use predication and SSA together for memory code hoisting (subsuming partial redundancy elimination and global common-subexpression elimination for memory operations), removing loads after stores, dead store removal, and loop-invariant code motion for memory operations; our algorithms rely on boolean manipulation of the controlling predicates (Section 5).
• We describe a representation and series of optimizations which expose pipeline parallelism in the memory accesses within loops (Section 6).
• We present (in Section 6.3) loop decoupling, a new loop parallelization algorithm, which decomposes a single loop into several independent loops and uses a novel computational primitive, a token generator, to dynamically bound the slip between the newly created independent loops, preserving memory dependences at run-time.

Figure 1: Program transformations [dotted lines represent predicates, dashed lines represent tokens]: (A) removal of two extraneous dependences between a[i] and a[i+1]; (B) the load from a[i] bypasses the directly stored value(s); (C) the store to a[i] kills prior stores to the same address; (D) final result. This figure is automatically generated by the CASH compiler using the Graphviz [18] graph visualization tool.

3 Pegasus

In this section we describe Pegasus, our intermediate representation; we emphasize the representation of memory operations. Details about Pegasus, including a complete algorithm for building it from a control-flow graph representation, a formal semantics, and detailed descriptions of numerous scalar optimizations, can be found in [1, 2].

3.1 From Control-Flow to Data-Flow

CASH essentially transforms the initial C program into a pure dataflow representation. All C control-flow constructs are transformed into equivalent dataflow operations, as described below. The resulting dataflow representation has semantics similar to that of an asynchronous circuit: data producers handshake with data consumers on data availability and data consumption. This explicit dataflow representation naturally exposes loop-level parallelism in the form of loop pipelining [22]. The representation is a directed graph where nodes are operations and edges indicate value flow. A node may fan out the data value it produces to multiple nodes; the fanout is represented as multiple graph edges, one for each destination.

Predication and speculation. In order to uncover more instruction-level parallelism, and to reduce the impact of control-flow dependences, aggressive predication and speculation are performed. We leverage the techniques used in compilers for predicated execution machines [17]: we collect multiple basic blocks into one hyperblock (a contiguous portion of the program control-flow graph, having a single entry point and possibly multiple exits); each hyperblock is transformed into straight-line code through the use of the predicated static single-assignment (PSSA) form [4]. All instructions without side-effects are aggressively predicate-promoted. Currently only static information is used to produce the hyperblocks (i.e., profiling is not used).

A difference between the algorithms we describe here and truly global ones is that those described in this paper are applied only within the scope of a single hyperblock.

Multiplexors. When multiple definitions of a value reach a join point in the control-flow graph, they must be merged together. This is done by using multiplexors, which correspond directly to the phi functions in the SSA representation. The mux data inputs are the reaching definitions. We use decoded muxes, which select one of n reaching definitions of a variable; such a mux has 2n inputs: one data and one predicate input for each definition (see for example Figure 1C). When a predicate evaluates to "true", the mux selects the corresponding input. The mux predicates correspond to the path predicates in PSSA.

Merge and eta operators. After hyperblock construction, the only remaining control-flow constructs are inter-hyperblock transfers, including loops. The next compilation phase stitches the hyperblocks together into a dataflow graph representing the whole procedure by creating dataflow edges connecting each hyperblock to its successors. Each variable live at the end of a hyperblock gives rise to an eta node. Eta nodes were originally introduced in the Gated Single Assignment form [19]. Eta nodes have two inputs (a value and a predicate) and one output. When the predicate evaluates to "true," the input value is moved to the output; when the predicate evaluates to "false," the input value and the predicate are simply consumed, generating no output. A hyperblock with multiple predecessors receives control from one of several different points; such join points are represented by merge nodes.
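The run-time behavior of these operators can be summarized by a small C model (an illustrative sketch with our own names and types; it is not CASH code):

    #include <assert.h>
    #include <stdbool.h>

    /* Decoded mux: n (predicate, value) pairs; exactly one predicate is
       expected to be true at run-time, and its value is forwarded. */
    static int mux(int n, const bool pred[], const int val[]) {
        for (int i = 0; i < n; i++)
            if (pred[i]) return val[i];
        assert(0 && "exactly one mux predicate should be true");
        return 0;
    }

    /* Eta: forwards the value when the predicate is true; otherwise it
       consumes its inputs and produces no output. */
    static bool eta(bool pred, int value, int *out) {
        if (pred) { *out = value; return true; }  /* value moved to the output   */
        return false;                             /* inputs consumed, no output  */
    }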

Figure 2 illustrates the representation of a program comprising multiple hyperblocks, including a loop. The eta nodes (shown as triangles pointing down) in hyperblock 1 will steer data to either hyperblock 2 or 3, depending on the test k!=0. Note that the etas going to hyperblock 2 are controlled by this predicate, while the eta going to hyperblock 3 is controlled by the complement. There are merge nodes (shown as triangles pointing up) in hyperblocks 2 and 3. The ones in hyperblock 2 accept data either from hyperblock 1 or from the back-edges in hyperblock 2 itself. The arrows going up are back-edges denoting the flow of data along the while loop. The merge node in hyperblock 3 can accept control either from hyperblock 1 or from hyperblock 2.

    int fib(int k) {
        int a=0, b=1;
        while (k) {
            int tmp = a;
            a = b;
            b = b + tmp;
            k--;
        }
        return a;
    }

Figure 2: (A) Iterative C program computing the k-th Fibonacci number and (B) its Pegasus representation, comprised of 3 hyperblocks. The dotted lines represent predicate values.

Side-Effects. Operations with side-effects are parameterized with a predicate input, which indicates whether the operation should take place. If the predicate is false, the operation is not executed and it generates a token instantaneously (loads and procedure calls generate an arbitrary result when their controlling predicate is false).

3.2 Tokens

Not all program data dependences are explicit: operations with side-effects can interfere with each other through memory. To ensure correctness, the compiler adds edges between operations whose side-effects may not commute. Such edges do not carry data values; instead, they carry an explicit synchronization token. Operations with side-effects wait for the presence of a token before executing; on completion, these instructions generate an output token to enable execution of their dependents. The token is really a data type with a single value, requiring 0 bits of information; in hardware it can be implemented with just the "data ready" signal. At run-time, once a memory operation has started execution and its effect is guaranteed to occur, the operation generates tokens for its dependents. Note that the token can be generated before memory has been updated.

Operations with memory side-effects (loads, stores, calls and returns) all have a token input. When a side-effect operation depends on multiple other operations (e.g. a write operation following a series of reads), it must collect one token from each of them. For this purpose a combine operator is used; a combine has multiple token inputs and a single token output; the output is generated after it receives all its inputs. Figure 3a shows a token combine operator, depicted by "V"; dotted lines indicate token flow. From now on, to reduce clutter, we elide combines where this creates no ambiguity, depicting Figure 3a as in Figure 3b.

Token edges explicitly encode data flow through memory. In fact, the token network can be interpreted as an SSA form encoding of the memory values, where combine operations are similar to phi-functions. The tokens encode true-, output- and anti-dependences, and they are "may" dependences.

Figure 3: (a) "Token combine" operations emit a token only on receipt of tokens on each input. (b) Reduced representation of the same circuit, eliding the token combine.

3.3 Synchronization-Insertion Algorithm

During the construction of the Pegasus graph, token edges are inserted using the following algorithm:

• A flow-insensitive analysis determines local scalar values whose address is never taken; these are assigned to registers. In Figure 2 the values a, b and k are scalars. All other data values (non-scalars, globals or scalars whose address is taken) are manipulated by load and store operations through pointers. These are the subject of this paper.
• Each memory access operation has an associated read/write set [10] (also called "tags" in [16] or M-lists in [6]): the set contains the memory locations that the operation may access.
• Memory instructions are created by traversing the control-flow graph in program order.
• Instruction i receives tokens from all instructions j such that there is a control-flow path from j to i, i does not commute with j (i.e., i and j are not both memory reads), and the read/write sets of i and j overlap.

Figure 4 shows a sample program and the constructed representation. Notice that the two memory reads (of b[i] and *p) have not been sequentialized, since read operations always commute.

    extern int a[], b[];
    void g(int *p, int i)
    {
        b[i+2] = i & 0xf;
        a[i] = b[i] + *p;
    }

Figure 4: A program and the schematic implementation of its memory synchronization, assuming all memory instructions can access any memory word (i.e. with no pointer analysis).

3.4 Transitive reduction of the token graph

The graph formed by token edges is maintained in a transitively reduced form by the compiler throughout the optimization phases. This is required for the correctness of some of the optimizations described below. The presence of a token edge between two memory operations in a transitively reduced graph implies the following properties:

• the two memory operations may access the same memory location, and
• there is no intervening memory operation affecting that location.
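The last rule of the synchronization-insertion algorithm amounts to a simple pairwise test; the following C fragment is a minimal, illustrative model (the types, names and bitmask encoding of read/write sets are ours, not CASH's):

    #include <stdbool.h>

    typedef struct {
        bool     is_write;   /* store/call vs. load             */
        unsigned rw_set;     /* bitmask over abstract locations */
    } MemOp;

    /* op j must pass a token to op i (where j precedes i on some
       control-flow path) when the two operations do not commute and
       may touch the same locations. */
    static bool needs_token(const MemOp *i, const MemOp *j) {
        bool both_reads  = !i->is_write && !j->is_write;   /* reads commute  */
        bool may_overlap = (i->rw_set & j->rw_set) != 0;   /* sets intersect */
        return !both_reads && may_overlap;
    }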

4 Increasing Memory Parallelism

CASH performs a series of optimizations with the goal of increasing the available memory parallelism and reducing the number of redundant memory operations. The basic idea is to remove token edges between memory operations that are actually independent.

4.1 Removing dead memory operations

As mentioned, memory operations are predicated and should execute only when the run-time value of the predicate is "true". In particular, a side-effect operation having a constant "false" predicate can be completely removed by the compiler from the program; its token input is connected to its token output. Such constant false predicates can arise due to control-flow optimizations (such as if statements with constant tests) or due to the redundant memory optimizations described below.

4.2 Immutable objects

Accesses made through pointers to constants (such as constant strings or pointers declared const) can be optimized: if the pointed value is statically known, the pointer access can be removed completely; otherwise, the access does not require any token, since it does not need to be serialized with any other operation. The access also does not need to generate a token.

4.3 Removing unnecessary token edges

CASH processes token edges by considering pairs of instructions directly synchronized (i.e., one produces a token consumed by the other) and tries to prove that they will never simultaneously access the same memory address. If this proof succeeds, the token between these instructions is removed. The transitive closure of the token graph (minus the removed edge) must be preserved, so new tokens are inserted for this purpose. The basic algorithm step is illustrated in Figure 5. (Notice that, although the number of explicit synchronization tokens may increase, the total number of synchronized pairs in the transitive closure of the token graph is always reduced as a result of token removal.)

CASH uses several heuristics to prove that some addresses can never be equal: (1) symbolic computation is used to check whether two address expressions are always different (as shown in Figure 1A-B); (2) within loops, induction variable analysis is used to prove that induction variables with the same induction step and different starting values are always different; and (3) pointer analysis, via the read/write sets, is used to detect instructions having disjoint read/write sets (see Figure 6 for an example).

Figure 5: Removing unnecessary tokens when heuristics can prove that p and q do not alias.

Figure 6: Implementation of the sample program in Figure 4 when using read/write set information. Since arrays a and b are disjoint, there is no need for a token edge between accesses to their elements.
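The first two heuristics of Section 4.3 boil down to comparisons over a simple affine model of addresses; the sketch below uses our own encoding (CASH's symbolic machinery is richer than this):

    #include <stdbool.h>

    /* Address modeled as base + step*i + offset, where i is a loop
       induction variable (step == 0 for loop-invariant addresses). */
    typedef struct { const void *base; long step, offset; } Addr;

    /* Heuristics (1) and (2): two addresses with the same symbolic part
       (base and step) but different constant offsets, e.g. a[i] and
       a[i+1], can never be equal in the same iteration. */
    static bool never_equal(Addr x, Addr y) {
        return x.base == y.base && x.step == y.step && x.offset != y.offset;
    }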
5 Removing redundant memory accesses

In this section we describe several optimizations which take advantage of the SSA-like nature of our representation for memory in order to discover useless memory accesses. The explicit token representation makes the following optimizations easy to describe and implement for memory operations: load and store merging (subsuming partial redundancy elimination and global common-subexpression elimination), dead store removal, load-after-store removal and loop-invariant code motion. Token edges simultaneously represent control-flow and dependence information, which makes it easy for the compiler to focus its optimizations on interfering memory operations. The first three transformations described are implemented simply as term-rewriting rules, much as depicted in the associated figures.

Optimizations involving memory operations must properly handle the controlling predicates: an optimization should not create cycles in the program representation. For example, when merging two loads, a cycle may be created if the predicate of one of the loads is a function of the result of the other load. Testing for the cycle-free condition is easily accomplished with a reachability computation in the Pegasus DAG which ignores the back-edges. By caching the results of the reachability computation for a batch of optimizations, its amortized cost remains linear.

Notice that the optimizations described in this section all assume aligned memory accesses. All are applicable only when the optimized memory operations access the same address and manipulate the same amount of data (i.e., they do not handle the case of a store of a word followed by a load of a byte).

5.1 Merging equivalent memory operations

We first present an optimization merging distinct but equivalent memory operations. It generalizes global common-subexpression elimination, partial redundancy elimination and code hoisting for memory accesses, by merging multiple accesses to the same memory address into one single operation. How this optimization operates for loads is shown in Figure 7; it is applicable to stores as well. If the two predicates of the loads are the same node, this optimization becomes common-subexpression elimination for memory operations.

Figure 7: Load operations from the same address and having the same token source can be coalesced; the predicate of the resulting load is the disjunction of the original predicates.

5.2 Redundant store removal

Figure 8 depicts the store-before-store code transformation (used in the example in Figure 1C). We ensure that at run time the first store is executed only if the second one is not, since the second one will overwrite its result. If the predicate of the prior store implies the predicate of the following store (which happens if the second post-dominates the first), the first predicate becomes constant "false". If the boolean manipulation performed by the compiler can prove this fact, the first store can be eliminated using the rule from Section 4.1. Notice that, even if the predicate is false but the boolean manipulation cannot simplify it, the store operation nevertheless does not occur at run-time.

Figure 8: Redundant store-after-store removal: a store immediately preceding another store to the same address should execute only if the second one does not. Since the two stores can then never occur at the same time, the token edge connecting them can be safely removed.

5.3 Load after store removal

In Figure 9 we show how loads following a store can directly bypass the stored value (Figure 1B shows an instance of this transformation). This transformation ensures that if either of the stores is executed (i.e., its predicate is "true"), then the load is not executed; instead, the stored value is directly forwarded. The load is executed only when none of the stores takes place. If the stores collectively dominate the load (as formally defined by Gupta in [11]), the latter is completely removed, since the corresponding multiplexor predicate becomes constant "false".

Figure 9: Removal of loads following stores to the same address. The stored value is directly forwarded if any of the stores occurs; otherwise the load has to be performed.

5.4 Loop-invariant load motion

A scalar computation is loop invariant if: (1) it is constant, (2) its value circulates around the loop unchanged through a merge-eta loop, or (3) it has no side-effect and all its inputs are loop-invariant. Loop-invariant code motion for memory operations does not require any special handling: the same algorithm which handles loop-invariant code motion for scalars works equally well. The requirement for a memory operation to be loop-invariant is that all its inputs are loop-invariant, including the predicate and the token. Then such an operation (together with its loop-invariant inputs) can be lifted to a newly created loop-header hyperblock, preceding the loop. The next section describes how loops handle tokens.

Notice that loop-invariant stores are not detected using this scheme, since they generate a fresh token in each iteration; thus, their token input is not loop invariant.
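The predicate manipulations used by the rewriting rules of Sections 5.1-5.3 can be summarized with ordinary boolean operations (an illustrative sketch in C; p1 and p2 denote store predicates and pk a load predicate, following the figures):

    #include <stdbool.h>

    /* 5.1: two accesses to the same address with the same token source are
       merged; the surviving operation executes if either original did.     */
    static bool merged_pred(bool p1, bool p2)        { return p1 || p2; }

    /* 5.2: a store followed by a store to the same address executes only if
       the later store does not; if p1 implies p2 this simplifies to "false"
       and the first store becomes dead code.                               */
    static bool earlier_store_pred(bool p1, bool p2) { return p1 && !p2; }

    /* 5.3: a load following stores to the same address is performed only
       when no store forwarded a value to it through the multiplexor.       */
    static bool load_pred(bool pk, bool p1, bool p2) { return pk && !(p1 || p2); }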

6 Pipelining loops with fine-grained synchronization

The algorithms presented in this paper, and in particular in this section, are motivated by our intended target: direct hardware implementation of the program. Pipelining loops is especially important to fully realize the performance of dataflow machines synthesized directly in hardware. The optimizations presented in this section increase the opportunities for pipelining by removing dependences between memory operations from different loop iterations and by creating structures which allow them to execute in parallel.

The key to increasing the amount of pipeline parallelism available is to correctly handle memory synchronization across loop iterations. We do this by carrying fine-grained synchronization information across loop iterations using several merge-eta token loops. Figure 10 illustrates schematically how pipelining is enabled by fine-grained synchronization. Figure 10(a) is a schematic code segment which reads a source array, computes on the data and writes the data into a distinct destination array. Figure 10(b) shows a traditional implementation which executes the memory operations in program order: the reads and writes are interleaved. Meanwhile, Figure 10(c) shows how separately synchronizing the accesses to the source and destination arrays "splits" the loop into two separate loops which can advance at different rates. The "producer" loop can be several iterations ahead of the "consumer" loop, filling the computation pipeline.

    for (;;) {
        data = *s++;
        res  = f(data);
        *d++ = res;
    }

Figure 10: Fine-grained memory synchronization facilitates loop pipelining.

    extern int a[], b[];
    void g(int* p)
    { int i;
      for (i=0; i < 10; i++) {
          b[i+1] = i & 0xf;
          a[i]   = b[i] + *p;
      }
    }

Figure 11: Implementation of memory synchronization in loops.

    for (i=0; i < N; i++) {
        b[i] = a[i] + a[i+2];
    }

Figure 12: Sample code and naive implementation of a loop containing read-only accesses to array a.

Figure 11 shows an example loop and its representation (we show only the memory operations and their synchronization). Notice that each merge-eta circuit is associated with a particular set of memory locations. There are circuits for a and b and one, labeled U, which represents all anonymous memory objects. The merge node tagged with a represents all accesses made to a in the program before the current loop iteration, while the eta represents the accesses following the current loop iteration.

The circuit was generated using the following algorithm:

• The read-write sets of all pointers in the loop body are analyzed; the objects in these sets are grouped into equivalence classes: two objects are equivalent if they appear together in all read-write sets. For the example in Figure 11, the read-write sets are as follows:

      a -> {a[]}           (pointer a points to array a[])
      b -> {b[]}
      p -> {a[], b[], U}   (p can point to anything)

  The objects in the read-write sets are a[], b[] and U. In this example each object is in its own equivalence class.
• A merge-eta cycle is created for each equivalence class.
• The merge and eta are treated as writes to all objects in the equivalence class.
• The loop body is synthesized as straight-line code, as if preceded by all merges and succeeded by all etas.

An example program and implementation of the token-insertion algorithm for loops is given in Figure 12; applying the read-only optimization results in Figure 13.

6.1 Read-only accesses

As mentioned above, the token-insertion algorithm for building loops treats the merge and eta operations as writes to the corresponding object(s) when building the token edges. Treating them as writes ensures that all operations accessing the object in the i-th iteration will complete before the (i+1)-th iteration can start, preventing pipelining. Sometimes this amount of synchronization is excessive. In particular, we can optimize the loop when all operations are reads: we can then safely overlap different loop iterations by pipelining.

If a memory object accessed in a loop does not appear in any of the write-sets, the loop is optimized by splitting it into three parallel loops, enabling fast access to all read-only objects: (1) one loop containing the accesses to all the written objects, (2) a generator loop, which generates tokens to enable the read operations for all the loop iterations, allowing operations from multiple iterations to issue simultaneously, and (3) a collector loop, which collects the tokens from the read operations in all iterations; this ensures that the loop terminates only when all reads in all iterations have occurred.

Figure 13: Optimized implementation of the read-only loop from Figure 12.

6.2 Using address monotonicity

There is no need to synchronize writes to different addresses. For example, all writes to b in Figure 12 are to different addresses. The compiler attempts to discover such writes, where the addresses vary strictly monotonically from one iteration to the next, using an extended form of induction variable analysis as in Wolfe's [25]. When such a case is found, as in Figure 13, the loop is transformed as described above for read-only accesses, causing multiple loop iterations to be initiated in pipelined fashion. The result after the optimization of the read-write loop of b from Figure 13 is shown in Figure 14.
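The same affine address information also yields the fixed dependence distance exploited by the next subsection. A minimal sketch, under the assumption that both accesses are affine in the loop counter (addr = base + step*i + c):

    #include <stdbool.h>

    typedef struct { long step, c; } AffineAddr;

    /* Returns true and sets *dist when the two accesses to the same base can
       only collide at a fixed iteration distance, e.g. 3 for a[i+3] vs a[i];
       returns false when the steps differ or the accesses never meet. */
    static bool fixed_distance(AffineAddr a, AffineAddr b, long *dist) {
        if (a.step != b.step || a.step == 0) return false;  /* not comparable   */
        if ((a.c - b.c) % a.step != 0)       return false;  /* never equal      */
        *dist = (a.c - b.c) / a.step;
        return true;
    }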

Figure 14: The loop in Figure 13 after simplifying the b loop using address monotonicity.

6.3 Loop Decoupling: Using Dependence Distances

The most sophisticated pipeline-enabling optimization in our compiler is a novel form of loop parallelization which is particularly effective for spatial compilation. This transformation, loop decoupling, is remotely related to loop skewing and loop splitting. Analogously to the optimization for read-only values, it "vertically" slices a loop into multiple independent loops, which are allowed to slip with respect to each other (i.e., one can get many iterations ahead of the other). The maximum slip is derived from the dependence distance, and ensures that the loops slipping ahead do not violate any program dependence. To dynamically control the amount of slip a new operator is used, the token generator.

We illustrate its effect on the circuit in Figure 15, obtained after all previously described optimizations are applied to the shown program. Dependence analysis indicates that the two memory accesses are at a fixed distance of 3 iterations. The two accesses are separated into two independent loops by creating a token for each of them. The a[i+3] loop can execute as quickly as possible. To preserve correctness, the loop for a[i] can only execute 3 iterations ahead of the a[i+3] loop, because otherwise it would violate the dependences. In order to dynamically bound the slip between these two loops, a token generator operation which can generate 3 tokens is inserted between the two loops. The resulting structure is depicted in Figure 16. The token generator can instantly generate 3 tokens; subsequently it generates additional tokens upon receipt of tokens at its input.

    for (i=0; i < N-3; i++) {
        a[i] = ....
        .... = a[i+3];
    }

Figure 15: Program and its implementation before loop decoupling.

Figure 16: The program from Figure 15 after loop decoupling. We have also depicted p, the loop-controlling predicate.

More precisely, a token generator tk(n) has two inputs, a predicate and a token, and one output, a token. Here n is a positive integer, the dependence distance between the two accesses. The token generator maintains a counter, initialized with n. The predicate input is the loop-controlling predicate. On receipt of a "true" predicate, the generator decrements the counter and, if the counter is positive, emits a token. On receipt of a token, it increments the counter. On receipt of a "false" predicate (indicating completion of the loop), the counter is reset to n (3 in our example).

The a[i] loop must receive tokens from the a[i+3] loop in order to make progress. It may receive the first three tokens before any iteration of the latter has completed, directly from the token generator, so it can slip up to three iterations ahead. Afterwards, it will receive a new token enabling it to make progress only after the a[i+3] loop completes one iteration. In contrast, the a[i+3] loop can slip an arbitrary number of iterations ahead (the token generator stores the extra tokens in its internal counter).

Notice that after loop decoupling each of the resulting loops contains only monotone induction variables, so it can be further optimized as described in Section 6.2. The final result is shown in Figure 17.

Figure 17: Program after loop decoupling and monotone induction variable optimization.

The idea of explicitly representing dependence distances in a program can be found in the paper by Johnson et al., "An executable representation of distance" [12]. While that paper gives a general scheme for representing dependence distances in nested loops as functions of multiple loop indices, it does not discuss how the resulting representation can be used to optimize the program. Our scheme can also be generalized to deal with nested loops, although our current implementation only handles innermost loops. The incarnation of the dependence distance in the "token generator" operator, used to dynamically control the slip between decoupled loops, is a new idea which, we believe, is indeed an executable representation of distance.
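The behavior of the token generator described above can be modeled by a small state machine (an illustrative C model of tk(n); the hardware operator is of course not implemented this way):

    typedef struct { int count, n; } TokenGen;      /* models tk(n) */

    static void tk_init(TokenGen *g, int n)  { g->count = n; g->n = n; }
    static void tk_token(TokenGen *g)        { g->count++; }   /* token received */

    /* Loop-controlling predicate arrives; returns 1 if a token is emitted.
       The leading loop can thus run at most n iterations ahead. */
    static int tk_predicate(TokenGen *g, int pred) {
        if (!pred) { g->count = g->n; return 0; }   /* loop finished: reset   */
        g->count--;                                  /* consume one credit     */
        return g->count >= 0;                        /* emit while credits remain */
    }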
    Optimization                                         LOC
    Useless dependence removal                           160
    Immutable loads                                       70
    Dead-code elimination (incl. memory op)               66
    Load-after-load and store-after-store removal        153
    Redundant load and store removal (PRE)                94
    Transitive reduction of token edges                   61
    Loop-invariant code discovery (scalar and memory)     74
    Loop decoupling + monotone loops                     310

Table 1: Lines of C++ code, including comments and white-space, implementing the optimizations described in this paper.

    Benchmark      Funcs   Lines   Cov%   Pragmas   Cov%
    adpcm_e            1      93    100         6    100
    adpcm_d            1      70    100         5    100
    gsm_e             76    1667     97        20     43
    gsm_d             75    1644     96        15     71
    epic_e            32     629     12        11     84
    epic_d            13     529     77         3    100
    mpeg2_e           94    4766     96         1     72
    mpeg2_d          112    4942     99         1     20
    jpeg_e           108    3268     30         7     40
    jpeg_d           100    3203     82         8     69
    pegwit_e          95    1556     94         0      -
    pegwit_d          95    1556    100         1     30
    g721_e            15     386     97         0      -
    g721_d            20     463     96         0      -
    mesa             225    9892     61         6     33
    099.go           371   12259     76         4      1
    124.m88ksim      104    2379     70        17     28
    129.compress      17     590     98         4      9
    130.li            35     300     60         2      5
    132.ijpeg        120    3404     70         0      -
    134.perl         220    8505     65         0      -
    147.vortex       241    7075     60         0      -
    Total           2170   69176    n/a       111    n/a

Table 2: Statistics of the program fragments compiled and number of pragma statements introduced. Numeric columns are: total number of functions compiled, total source-level lines in these functions and dynamic program coverage of these functions; number of pragma independent statements inserted and percent of the program running time in the modified functions.

7 Results

In this section we evaluate the CASH compiler on programs from the Mediabench [15] and SpecInt95 [23] benchmark suites. We present results for all programs which successfully went through our infrastructure.

7.1 Compilation process measurements

One measure of Pegasus' efficiency is the amount of code it takes to implement each of the optimizations, which, as shown in Table 1, is quite small.

Limitations in our current simulation infrastructure prevented us from successfully evaluating all functions in the programs. The left side of Table 2 shows the amount of code we have compiled and evaluated from each benchmark, measured according to three different criteria: number of functions, source-code lines and run-time (as measured on a SimpleScalar [3] 4-way out-of-order processor). All measurements reported in this paper are only for these selected functions.

CASH is notably slower than gcc (version 3.2 running -O3), up to 30 times for perl, on average being about 10 times slower. About half the time spent in CASH is spent on the optimizations described in this paper and the following scalar optimizations: global constant propagation, dead-code elimination, loop-invariant code motion, partial redundancy elimination, global common-subexpression elimination, unreachable code elimination, re-association, algebraic simplifications and intra-procedural pointer analysis. The remaining time is spent doing parsing, I/O, pointer analysis and constructing the hyperblocks.

Approximating inter-procedural pointer analysis. In a language like C, which permits the liberal use of pointers, a pointer analysis made only at the procedure level must be exceedingly conservative. In order to approximate a more precise whole-program analysis with little effort, we resorted to manual annotation of the program. #pragma independent p q guarantees to the compiler that, in the current scope, pointers p and q never point to the same address. This pragma is somewhat similar to the proposed C99 restrict keyword, but much more intuitive to use. We use a simple, intraprocedural version of connection analysis [9] to propagate the independence information through pointer expressions. If two pointer references are declared independent, a token is not necessary to synchronize them. We have profiled the programs and inserted annotations in the most important functions. A pragma is most often used to indicate that the pointer arguments to a function do not alias each other. Only a handful of pragmas were inserted into the programs, as shown in the right-hand side of Table 2. For a few programs these pragmas are extremely effective in aiding optimization.
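A typical annotation looks as follows (an illustrative made-up function; only the pragma syntax is taken from the text above):

    void scale_into(int n, int *src, int *dst) {
    #pragma independent src dst      /* src and dst never point to the same address */
        for (int i = 0; i < n; i++)
            dst[i] = 2 * src[i];
    }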

Figure 18: Static and dynamic reduction of memory traffic, in percent. The baseline is obtained with CASH and only scalar optimizations (all memory optimizations turned off). The lines represent static counts, while the bars are dynamic counts.

Figure 19: Program speed-up over scalar-only optimizations when varying compiler optimizations and memory bandwidth. The first two bars compare different optimizations for a dual-ported memory system. The last two bars show the effect of increasing the memory bandwidth for the highest level of optimizations.

7.2 Static measurements

The space complexity of the internal representation when doing no memory-related optimizations is asymptotically the same as a classical SSA representation. There was some concern that the fine-grained representation of tokens would blow up quadratically. However, independent of which memory optimizations were turned on or off, the size of the IR never varied by more than 3%.

The number of memory operations varies more substantially; up to 8% of the static stores and up to 28% of the loads are removed. The line graphs in Figure 18 quantify these effects for all the programs.

Notice that we cannot directly compare these metrics with compilers such as gcc, since in hardware CASH can allocate an unbounded number of registers, while gcc must sometimes spill registers to memory, increasing memory traffic.

7.3 Dynamic measurements

We evaluated the effectiveness of our optimizations on a coarse hardware simulator. We assume that the entire program can be translated to hardware. Each operation has the same latency as in a PISA-architecture SimpleScalar simulator. All memory operations inject requests into a load-store queue with a finite number of ports and a finite size. We have evaluated several memory systems, ranging from perfect memory to a realistic memory system with two levels of cache. Spatial computation does not need to issue instructions, so there is no instruction cache. The L1 cache has a 2-cycle hit latency and 8 KB, while the L2 cache has an 8-cycle hit latency and 256 KB. Memory latency is 72 cycles, with 4 cycles between consecutive words. The memory is dual-ported. The data TLB has 64 pages with a 30-cycle TLB miss cost.

There are three ways in which our algorithms can increase the performance of a program:

• By reducing the number of memory operations executed;
• By overlapping many memory operations (increasing traffic bandwidth) through loop pipelining;
• By creating more compact schedules and issuing the memory operations faster.

The bars in Figure 18 show that the compiler reduces the dynamic amount of memory references for some of the programs. This arises from the algorithms described in Section 5. However, the latter two ways of increasing performance appear to be more important, as one can see by comparing Figure 18 to Figure 19. By comparing the rightmost three bars of Figure 19 we can also see that although performance improves with an increase in memory bandwidth, even small amounts of bandwidth can be utilized quite effectively by the compiler.

In addition to these overall statistics, we compiled each of the benchmarks with different sets of optimizations and simulated them with different memory systems. We found that the programs benefited most from using pointer analysis to reduce token edges during construction (Section 3.3), eliminating synchronization edges by disambiguating memory addresses (Section 4.3), and increasing pipelining from induction variable analysis (Section 6.2). This set of optimizations is shown by the first bar, labeled "Medium," in Figure 19. In all but a few cases, the independence pragmas were helpful in augmenting the pointer analysis. The read-only optimizations in Section 6.1 were almost always not very profitable. Loop decoupling (Section 6.3) was also not very useful, as it was applicable in only 28 loops from all the programs. We expect that it would be more applicable to Fortran-type loops. We found that the result of applying the optimizations together was more powerful than simply the product of their individual effects.

8 Conclusions

In this paper we have shown how powerful, sophisticated memory optimizations can be simply implemented in our intermediate representation, Pegasus. The power of Pegasus is a direct result of its ability to explicitly represent all the dependences in a program: scalar dependences, may-dependences through memory, and dependence distances around loops. This combines well with Pegasus' use of a predicated SSA-like representation. We have shown how this representation can be used to increase memory parallelism, remove redundant memory operations, and increase the amount of pipeline parallelism available. Pegasus, and the optimizations presented here, are motivated by our desire to compile high-level language programs directly into hardware. However, many of the techniques used in CASH are applicable to traditional and parallel compilation.

Acknowledgements

We thank Girish Venkataramani, Dan Vogel and David Koes for discussions and help with pragma insertion, scripting and comments on previous versions of this manuscript. The comments of the anonymous reviewers helped enormously with the presentation and pinpointed important errors in it. This research is funded in part by the National Science Foundation under Grant No. CCR-9876248 and by Darpa under contract #MDA972-01-03-0005.

References

[1] M. Budiu and S. C. Goldstein. In Proceedings of the 12th International Conference on Field Programmable Logic and Applications, Montpellier (La Grande-Motte), France, September 2002.
[2] M. Budiu and S. C. Goldstein. Pegasus: An efficient intermediate representation. Technical Report CMU-CS-02-107, Carnegie Mellon University, May 2002.
[3] D. Burger and T. M. Austin. In Computer Architecture News, volume 25(3), pages 13-25. ACM SIGARCH, June 1997.
[4] L. Carter, B. Simon, B. Calder, L. Carter, and J. Ferrante. International Journal of Parallel Programming, special issue, 28(6), 2000.
[5] F. Chow, R. Lo, S.-M. Liu, S. Chan, and M. Streich. In Proc. of 6th Int'l Conf. on Compiler Construction, pages 253-257, April 1996.
[6] K. D. Cooper and L. Xu. In Workshop on Memory Systems Performance (MSP '02), Berlin, Germany, June 2002.
[7] R. Cytron and R. Gershbein. In Proceedings of the conference on Programming Language Design and Implementation (PLDI), pages 36-45. ACM Press, 1993.
[8] D. M. Gallagher. Memory Disambiguation to Facilitate Instruction-Level Parallelism Compilation. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[9] R. Ghiya and L. J. Hendren. International Journal of Parallel Programming, 24(6):547-578, 1996.
[10] R. Ghiya and L. J. Hendren. In Proceedings of the 25th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 121-133, San Diego, California, January 1998.
[11] R. Gupta. In ACM SIGPLAN-SIGACT 19th Symposium on Principles of Programming Languages, pages 246-257, Albuquerque, New Mexico, January 1992.
[12] R. Johnson, W. Li, and K. Pingali. Languages and Compilers for Parallel Computers, 4, 1991.
[13] R. Kennedy, S. Chan, S.-M. Liu, R. Lo, P. Tu, and F. Chow. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(3):627-676, May 1999.
[14] C. Lapkowski and L. J. Hendren. In Proceedings of the 1998 International Conference on Compiler Construction, volume 1383 of LNCS, pages 128-143. Springer, March 1998.
[15] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. In Micro-30, 30th annual ACM/IEEE international symposium on Microarchitecture, pages 330-335, 1997.
[16] J. Lu and K. D. Cooper. In Proceedings of the 1997 ACM SIGPLAN conference on Programming language design and implementation, pages 308-319. ACM Press, 1997.
[17] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. In Proceedings of the 25th International Symposium on Microarchitecture, pages 45-54, December 1992.
[18] An open graph visualization system and its applications to software engineering. Software Practice and Experience, 1(5), 1999. http://www.research.att.com/sw/tools/graphviz.
[19] K. J. Ottenstein, R. A. Ballance, and A. B. Maccabe. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI) 1990, pages 257-271, 1990.
[20] J. C. H. Park and M. S. Schlansker. On predicated execution. Technical Report HPL-91-58, Hewlett-Packard Laboratories, May 30, 1991.
[21] K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill. In Proceedings of Principles of Programming Languages (POPL), volume 18, 1991.
[22] J. Silc, B. Robic, and T. Ungerer. Parallel and Distributed Computing Practices, 1(1):3-30, 1998.
[23] Standard Performance Evaluation Corp. SPEC CPU95 Benchmark Suite, 1995.
[24] B. Steensgaard. In Proceedings of the ACM SIGPLAN Workshop on Intermediate Representations, pages 62-70, 1995.
[25] M. Wolfe. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), volume 27(7), pages 162-174, New York, NY, 1992. ACM Press.