Software Pipelining for the Pegasus IR

Cody Hartwig Elie Krevat [email protected] [email protected]

Abstract

Modern processors, especially VLIW processors, often have the ability to execute multiple instructions simultaneously. Taking advantage of this capability is crucial for high performance software applications. Software pipelining is a technique designed to increase the level of parallelism in loops. We propose a new approach to software pipelining based on direct manipulations of control flow graphs in Pegasus: an intermediate representation used by the CASH compiler. In this paper, we describe the design and implementation of our software pipelining algorithm. Additionally, we provide a detailed analysis of the metrics and heuristics used by our algorithm in the context of a simple code example.

1 Introduction

Modern VLIW architectures can schedule multiple instructions at once, but they are constrained by data and control dependencies that limit the opportunity for parallel execution. True data dependencies occur when an instruction depends on the result of a previous instruction. Other data dependencies occur when two operations write to the same variable, or when an input variable to an instruction is written by a later instruction. Control dependencies occur when predicated instructions are conditionally executed. Compilers use parallelization techniques to work around these dependencies and exploit as much instruction level parallelism as possible from a given program.
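To make these categories concrete, consider the following small C-style fragment (an illustrative example of ours, not taken from the paper's benchmarks):

    // Illustrative only: the three kinds of data dependencies.
    void dependence_example(int a, int b, int c, int d, int *out) {
        int x = a + b;   // (1)
        int y = x * 2;   // (2) true (read-after-write) dependency on (1)
        x = c - d;       // (3) output (write-after-write) dependency on (1),
                         //     anti (write-after-read) dependency on (2)
        out[0] = y;
        out[1] = x;
    }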
Software pipelining is a highly effective technique to increase the level of available parallelism in the body of a loop by restructuring the code to overlap operations from different iterations. By overlapping iterations, there are more instructions available for scheduling and better opportunities to schedule instructions in parallel. Since the code in a loop may be executed many times over, even a small improvement in instruction level parallelism can lead to a significant performance improvement.

Software pipelining in general has been the source of much research, and we cover a brief classification and survey of the most popular techniques in Section 2. Our approach differs from previous research because we apply our algorithms in the context of Pegasus: an intermediate representation used by the CASH compiler [4, 5]. The CASH compiler translates programs written in C into implementations of hardware components. Pegasus was designed to support spatial computation, so operations in a program correspond to actual hardware operations, and a Pegasus graph models both the data flow and control flow of a program. In Pegasus, basic blocks in the control flow graph are combined into hyperblocks that represent units of speculative work. So while previous approaches to software pipelining use a loop body of instructions, our approach makes use of hyperblocks, operators, and a representation that reveals dependencies in a control flow graph. By implementing this approach in Pegasus and not in the generated assembly code, we abstract away lower-level resource constraints that are handled in the later stages of compilation.

To implement software pipelining in Pegasus, we propose a localized and iterative approach that pipelines operations one at a time. Our approach computes operation outputs for future loop iterations in the current iteration. Pipelining an operation consists of moving that operation from the hyperblock of a loop body into the hyperblock's pre-header; the data-flow values from before and after executing that operation are then fed into the loop hyperblock. Each loop iteration uses the value of the operation already computed, either in the pre-header or during a previous iteration, and computes the operation value for a future iteration. This approach is analogous to preparing temporary variables of future iterations to make the loop body schedule more efficient.
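As a source-level analogy (a sketch of ours, since the actual transformation rewrites Pegasus hyperblocks rather than C source), moving a load one iteration back looks roughly like this; the temporaries tmp and next are hypothetical names:

    // Original loop: every iteration waits for its own load of b[i].
    void scale(int *a, const int *b, int n) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1;
    }

    // Hand-pipelined sketch: the pre-header performs the load used by the
    // first iteration, and each iteration pre-loads the value for the next.
    void scale_pipelined(int *a, const int *b, int n) {
        if (n <= 0) return;
        int tmp = b[0];              // pre-header copy of the load
        for (int i = 0; i < n; i++) {
            int next = b[i + 1];     // load for the next iteration (speculative
                                     // on the last iteration; see Section 7)
            a[i] = tmp + 1;          // the use no longer waits on its own load
            tmp = next;
        }
    }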

An operation is a candidate to be pipelined if it matches a number of possible patterns, described fully in Section 4. Pattern matching in Pegasus is a simple local decision, since patterns depend only on the type of operation and the source of its inputs. In our current implementation, because we do not create an epilogue, operations must also be side-effect free (e.g., loads may be pipelined but not stores). Our approach also chooses operations to pipeline that are on the most expensive paths from the beginning to the end of a hyperblock. This heuristic for choosing the next operation to pipeline tends to decouple the more expensive operations from longer path dependencies, so after software pipelining more operations are scheduled in parallel.

While the potential benefit of software pipelining is substantial, possible negative side effects are increased register pressure and wasted speculative operations. The increase in register pressure can come from computing instructions from multiple iterations at once, and can result in register spilling. Wasted speculative operations can occur when extra instructions are computed to prepare a very tightly pipelined loop that has a control flow which executes the loop only a very few times or not at all. If not handled correctly, these side effects can eliminate the benefit of software pipelining, and even do more harm than good. Since scheduling with resource constraints is a well known NP-hard problem [7], heuristics are generally used to avoid the worst of these situations, and a feedback approach between the different stages of the compiler can provide better hints as to the most effective strategies. For example, a less aggressive software pipelining strategy should be implemented in response to register spilling. We do not explicitly implement such a feedback loop, but this is an area for future work that is fully compatible with our approach.

2 Related Work

Many algorithms exist to perform software pipelining, and using a classification developed by Allan et al. [3], these algorithms generally perform either kernel recognition or modulo scheduling. Percolation scheduling [10] is an additional approach with a more localized decision process that doesn't fit exactly into either of the previous classifications, although its concepts of primitive transformations are also used in Aiken's Perfect Pipelining kernel recognition algorithm [2].

Kernel recognition techniques assume the schedule for loop iterations is fixed and unroll the loop some number n of times, choosing a value of n that reveals enough instructions to improve the instruction level parallelism without creating too much code size expansion. A pattern recognition stage then identifies a repeating kernel from the unrolled loop that can be scheduled efficiently. A well known example of this technique is Aiken and Nicolau's Perfect Pipelining [1, 2].

Alternatively, modulo scheduling techniques focus on creating a schedule from one iteration of a loop that can be repeated without violating any resource and precedence constraints. A minimum initiation interval is calculated as the minimum number of instructions required to separate repeated iterations of the schedule. If the scheduler fails to find a schedule with the minimum initiation interval, it will increment this interval and repeat the same process. Examples of this technique include Lam's hierarchical reduction method, which handles conditional statements on VLIW machines [9], and Rau's Iterative Modulo Scheduling [13, 15]. Rau also discusses how register pressure and allocation strategies are affected by his approach, but specifically avoids the problem of what to do when there are not enough available registers [14].

Percolation scheduling applies many atomic program transformations to a parallel execution control-flow graph based on a number of guidance rules and heuristics (including information from data-dependency analysis). The nodes in a parallel execution graph contain many operations, and operations are moved between nodes if there are no dependency constraints [6, 10, 11, 12].

At a basic level, Percolation Scheduling may appear similar to our approach; however, the transformations of Percolation Scheduling are actually very different because they change the order of independent operations. Since the Pegasus graph encapsulates both data-flow and control-flow information, the ordering of a series of operators in a Pegasus hyperblock must always be respected, since operations that appear later in the ordering depend on the results of earlier operations. The parallel execution graph used in Percolation Scheduling does not have these desirable dependence properties built into the graphical structure. Also, Percolation Scheduling produces code explosions by visiting nodes on every global control path between moves, while our pattern matching algorithm makes use of localized decisions.
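For contrast with our localized, pattern-driven approach, the modulo scheduling process summarized earlier in this section can be sketched as the following driver loop (a simplified illustration of ours with hypothetical helper functions; resource and dependence modeling are omitted):

    #include <optional>

    struct Loop {};       // stand-in for a candidate loop
    struct Schedule {};   // stand-in for a modulo schedule

    int computeMinII(const Loop &loop);                                 // hypothetical
    std::optional<Schedule> tryScheduleAtII(const Loop &loop, int ii);  // hypothetical

    // Iterative modulo scheduling: start at the minimum initiation interval
    // and increase it until a schedule satisfying all resource and
    // precedence constraints is found.
    Schedule moduloSchedule(const Loop &loop) {
        for (int ii = computeMinII(loop); ; ++ii) {
            if (std::optional<Schedule> s = tryScheduleAtII(loop, ii))
                return *s;
        }
    }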

3 Approach

We propose implementing software pipelining through direct manipulation of Pegasus graphs. The primary goal of this approach is to reduce data dependencies between operations in an effort to increase the opportunity for instruction level parallelism.

At a high level, we implement software pipelining by moving operations between iterations of a loop. For example, if a value is loaded from memory in a loop body, we can move that load to the previous loop iteration. In this way, the loaded value is available immediately at the beginning of new iterations, and uses of it are not required to wait for a load delay. Meanwhile, the current iteration will execute the load that will be used by the next iteration. This effectively decouples the dependency between the load and its uses. This method also requires adding an instance of the operation to the loop preheader. The output of this operation is used in the first loop iteration. It's important to note that, since the loop preheader will always execute under our implementation, we only move operations that are side-effect free.

Moving operations between loop iterations requires that we address several challenges. First, we need an algorithm that can correctly move operations between iterations of a loop. Second, we must choose a set of operations that we can pipeline without introducing incorrectness. Third, we must choose a priority for each instruction that can be pipelined. Lastly, we need a heuristic to decide when we have achieved the optimal pipelined graph. In other words, once an operation is successfully moved, we need a way to determine if we should move another.

To pipeline a specific Pegasus circuit, we apply an iterative algorithm that moves one operation per iteration. The first step of this algorithm is to mark all the operations that are possible to move. As stated earlier, in order for an operation to be a candidate, it must be side-effect free. Additionally, in order to simplify the algorithm, an operation must have only constants and mus as its parents in the graph. Once the possible operations are marked, we use a heuristic to compute the cost of the most expensive path through each operation to the end of the loop body. Then we choose the operation with the most expensive path to pipeline. This method allows us to reduce the cost of the most expensive path through a loop body. In other words, it reduces the data dependencies for the loop body.

Figure 1: Patterns recognized by software pipelining include side-effect free operations with inputs that are mus or constants.

4 Design

In this section we describe the specific algorithm we have designed to implement software pipelining in Pegasus, as well as show a simple example execution of this algorithm.

4.1 The Algorithm

We have designed an iterative algorithm. For each iteration of the algorithm, we make a list of operations available for pipelining, choose a specific operation from these, and move the selected operation. We then evaluate whether another iteration should be executed.

Available Operations The first step of our algorithm is to find operations that are candidates to be pipelined. In order to be a candidate, an operation must meet several requirements. As described later, the selected operation will be moved one iteration back. This implies that the operation in question will execute at least one extra time. Therefore, the operation chosen must be side-effect free. Additionally, our algorithm requires that only operations at the beginning of a loop can be moved. Since Pegasus is a graph representation, these requirements are recognized through a pattern matching scheme. In order to be eligible, an operation must be either an arithmetic operation, a load, or a cast. The operation must have only mus or constants as inputs. Additionally, loads must have a constant predicate input. Figure 1 shows these patterns in a Pegasus representation.

Choosing an operation Once we have a list of possible operations to pipeline, we must choose the optimal operation to move next. Since our goal is to decrease data dependencies, we choose an operation based on this information. For each operation that can be moved, we calculate the most expensive path between the operation and an eta, and we pipeline the operation with the most expensive such path. This method reduces the length of the most expensive path and thus is likely to reduce the dependencies of the loop body.
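A sketch of this priority computation over the same kind of simplified node type (illustrative only; the real implementation also accounts for token edges and would memoize path costs):

    #include <algorithm>
    #include <vector>

    struct Op {
        int cost = 1;                     // estimated cost of this operation
        std::vector<const Op *> succs;    // data-flow successors inside the
                                          // hyperblock; etas have none
    };

    // Cost of the most expensive path from 'op' down to an eta. The
    // hyperblock body is acyclic (loop-carried values go through mu/eta
    // pairs), so plain recursion terminates.
    int pathCost(const Op *op) {
        int best = 0;
        for (const Op *s : op->succs)
            best = std::max(best, pathCost(s));
        return op->cost + best;
    }

    // Choose the candidate that lies on the most expensive path.
    const Op *chooseNext(const std::vector<const Op *> &candidates) {
        const Op *best = nullptr;
        int bestCost = -1;
        for (const Op *c : candidates) {
            int cost = pathCost(c);
            if (cost > bestCost) { bestCost = cost; best = c; }
        }
        return best;
    }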
Moving an operation Moving an operation is a general process that follows a series of steps. These steps are shown in Figure 2.

The first step is to add the operation to the loop preheader. This added operation is responsible for feeding the first iteration of the loop. We create a new eta in the preheader for the output of this value. This eta is matched to a new mu/eta pair in the loop body. These represent the temporary variable introduced.

Next, we alter the uses of the selected operation to take their value from the temporary variable introduced in Step 1.

Next, the output of the selected operation is modified to feed the temporary variable created. This represents performing the operation that will be used in a future iteration.

Finally, we change the inputs of the selected operation to the value they would have in the next iteration. This is done by simply connecting them directly before the etas that feed the original inputs.

This method describes how operations with 2 inputs and one output are moved. It should be noted that loads are operations with 3 inputs and 2 outputs. These operations are handled identically, except they introduce 2 temporary variables: one for the output value and one for the output token.

How many operations to move Once an operation is moved, we need to decide whether to move another operation or to terminate the software pipelining process. There are many heuristics that can be applied here. The important concern is to balance the benefit of software pipelining with its cost. Moving operations decreases the dependencies between operations in any given iteration of a loop body. This allows greater instruction level parallelism in the generated schedules. However, this is not without cost. Moving operations requires the introduction of temporary variables to carry values between iterations. These temporaries will increase register pressure and, at an extreme, could introduce register spill. This can eventually result in schedules that are actually worse. Therefore, we try to choose a heuristic that maximizes the potential for parallelism without introducing register spill.

Our currently implemented algorithm uses two metrics for this calculation. First, we calculate the length of the most expensive path through the current loop body. The most expensive path is defined as the longest dependency chain through the loop, where each link of the chain is weighted by the cost of the operations it's attached to. When this most expensive chain falls below a certain limit, we stop the pipelining algorithm. Second, we keep track of how many operations we have moved. The more operations that are moved, the more temporary variables will be required to keep track of these values. Therefore, we limit the maximum number of operations we move in order to prevent increased register spill.

4.2 Example

This example demonstrates a concrete execution of our algorithm. Consider the following code:

    int i = 0;
    char a[100];
    while (i < 100) {
        a[i] = 2 * a[i];
        i++;
    }

Intuitively, we can see in this example that the fundamental operations are a load, a store, a multiply, and two additions. In addition, we can see a dependency chain between an add, a load, a multiply, and a store. This means that all these operations will be forced to execute in series. This is more easily seen in the Pegasus graph shown in Figure 3(a). When our algorithm analyzes this graph, it will find two operations that are available to move (these operations are shaded in the graph). In order to choose which operation to move, we examine the cost of the paths they are on. We will calculate that the cost of the operation summing 'a' and 'i' is 14, while the cost of the other operation is 1. Therefore, we will select the first addition to move. We then move this operation using the algorithm described above. The resulting graph is shown in Figure 3(b).

The algorithm will now decide whether to continue and move another operation or to terminate. After observing that the most costly path is still quite long, and that only one operation has been moved thus far, the algorithm will choose to move another operation. This time the choice is between the two shaded operations in Figure 3(b). Clearly the load is on the longest path. At this point, the load is moved. Figure 3(c) shows the resulting graph.
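At the source level, the graph of Figure 3(c) corresponds roughly to the hand-pipelined loop below (an analogy of ours rather than actual compiler output; next_addr and next_val stand for the temporary mu/eta values introduced by the two moves):

    int i = 0;
    char a[100];
    char *next_addr = &a[0];      // pre-header: address for the first iteration
    char next_val = *next_addr;   // pre-header: load for the first iteration
    while (i < 100) {
        char *addr = next_addr;
        char val  = next_val;
        next_addr = &a[i + 1];    // address computation for the next iteration
        next_val  = *next_addr;   // speculative pre-load for the next iteration
                                  // (reads one element past the array on the
                                  //  final iteration; see Section 7)
        *addr = 2 * val;          // the store no longer waits on its own load
        i++;
    }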

Figure 2: Steps to move operation ‘op’. (a) Original graph; (b) Step 1; (c) Step 2; (d) Step 3.

Figure 3: A simple example from Section 4.2 to demonstrate the process of moving an arithmetic operation and a load. (a) Original graph; (b) Iteration 1; (c) Iteration 2.

At this point we can see that the load and store are no longer dependent on each other and can therefore be scheduled in parallel. In reality, our algorithm would continue to pipeline in this case, but this example ends here.

5 Implementation

We have implemented our software pipelining algorithm in the c2dil framework. Currently, the software pipelining stage runs before other optimizations and only runs once. In other words, although other optimizations are iterated in alternation, software pipelining will not be repeated once it completes the first time. This is not a restriction of our method, but it has made it easier to measure and analyze our results. We have implemented the software pipelining functionality entirely in the files ssaopt.cc, procedure.h, procedure.cc, and c2dil.cc. Our algorithm is implemented in approximately 1000 lines of C++ code.

To facilitate debugging and inspection of the optimizations, we create two dot files: before sp shows the graph immediately before software pipelining is performed, and after sp shows the graph immediately after software pipelining is performed. Additionally, output is printed to standard output reflecting which operations are available for pipelining at each step, as well as the estimated path cost for each operation and the operation chosen to move for that iteration.

We found that although it is rather difficult to determine correctness by comparing the first graph to the last graph, it's quite simple to verify the correctness of a graph by comparing it to a graph that only differs by one moved operation. In this way, the correctness of a series of graphs can be verified in a sort of pseudo-inductive method.

We have added an option to c2dil called nosp which will disable software pipelining for the given run. In this case, the software pipelining function is still called, but returns before any changes are made.

6 Evaluation

We evaluated our software pipelining algorithm for Pegasus by compiling a few small sample programs to demonstrate the potential benefit of our approach. Programs that have dependent loads and stores are generally prime candidates for achieving performance improvements via software-pipelining techniques.

Figure 4: Evaluation results for repeated software pipelining iterations on a moving average function. Panels: (a) loop body schedule cost (cycles); (b) loop preheader cost (cycles); (c) average ILP for the loop body (ratio of instructions to cycles); (d) maximum path cost through the Pegasus graph (estimated cycles). All are plotted against the number of SP iterations.

Figure 5: Approximate function costs for n = 99 iterations of the moving average function, calculated after each successive pipelining stage. Schedule cost (cycles) is plotted against the number of SP iterations.

Therefore, we present one such particular example in this section: a method to calculate a moving average of a set of data points.

    void moving_avg(int *a)
    {
        int i = 1;
        while (i < 100) {
            int t1 = a[i-1];
            int t2 = a[i];
            a[i] = (t1+t2)/2;
            i++;
        }
    }

Figure 6: Sample code to compute a moving average of a 100 element array.

The code for our moving average function is shown in Figure 6. This function averages each element of an array with the previous element, and the result is written back in place to update the original value with the averaged value. In its unpipelined form, the store at line 7 has a true data dependency on the result of the two loads immediately before it. Additionally, there is a loop dependency between every value except the first one in the array (a[i]) and the value written just before it in the previous iteration (a[i-1]).

The goal of our software pipelining approach is to make available the right set of instructions to schedule between dependencies so that the actual loop body can perform more work in parallel. In order to evaluate our approach we analyze the Pegasus graph and the final assembly code produced after each incremental pipeline, where operations are chosen for pipelining according to the heuristics described in Section 7. Specifically, we focus on four metrics:

1. The cost of the loop body, as measured by the number of VLIW instructions.

2. The cost of the loop pre-header, as measured by the number of VLIW instructions.

3. The average instruction-level parallelism (ILP), as measured by the average number of operations in every VLIW instruction of the loop body (sketched below).

4. The most expensive path of the loop, discovered by a depth-first search starting at each mu in a hyperblock and ending at the etas.
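As a small illustration of metric 3 (over a deliberately simplified representation of the final schedule; the real measurement is taken from the assembly emitted by the scheduler):

    #include <vector>

    // Each entry is the number of operations issued by one VLIW instruction
    // of the loop body. Metrics 1 and 2 are simply the number of entries for
    // the loop body and the pre-header, respectively.
    double averageILP(const std::vector<int> &opsPerInstruction) {
        if (opsPerInstruction.empty())
            return 0.0;
        int total = 0;
        for (int ops : opsPerInstruction)
            total += ops;
        return static_cast<double>(total) / opsPerInstruction.size();
    }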
We expect that a good final schedule will have a smaller loop body cost since the loop body pattern will be more efficient. Some of these costs will be shifted into the loop pre-header, but since the pre-header is only executed once, compared to the loop body which is executed a possibly large number of times, we can tolerate a fair amount of pre-header code expansion and still achieve better performance.

Figure 4(a) and Figure 4(b) graph the loop body and pre-header costs after each operation is pipelined. The data appears to have a good amount of variability, but the smooth Bezier trend lines show a pattern of generally increasing pre-header costs offset by decreasing loop body costs. Through inspection we see that pipelining either 3, 5, or 11 operations leads to relatively better loop body performance than other schedules. A feedback approach with these metrics would prefer to stop pipelining after 3 iterations of our algorithm, since at about the same efficiency of the loop body we prefer smaller pre-header costs and fewer pipelining stages to limit the amount of register pressure.

The average instruction-level parallelism demonstrates how much work is actually being performed in parallel, so higher levels of parallelism generally correspond with more efficient code. Perhaps more importantly, the average ILP also reveals how much room there is for improvement, since there are a limited number of execution units that can possibly be scheduled in parallel. It is the job of the scheduler to convert a higher ILP into an actual savings in loop body instructions.

Figure 4(c) graphs the average instruction-level parallelism of the loop body, and once again identifies iterations 3, 5, and 11 as good stopping points. As expected, the positive trend line supports the notion that software pipelining tends to improve the number of operations scheduled in parallel.

The most-expensive path metric is used as a heuristic in our approach because it can reveal many glaring dependencies between operations, which makes the operations on the most expensive path better candidates for software pipelining. This metric can also be calculated locally during the software pipelining stage without requiring a feedback loop from other stages of the compiler.

Figure 4(d) graphs the most expensive paths remaining in the loop hyperblock after each iteration of our pipelining algorithm. We can see a steady decline in the most expensive path lengths, which levels off at a cost of 6, which is equivalent to a graph with a load that is the only operation between an eta and a mu.

We calculate the approximate runtime cost of a loop function as follows:

    Cost(fnc) = Cost(Preheader) + n * Cost(LoopBody)

In this equation, n is equal to the number of iterations of the loop, and the costs of the pre-header and loop body are the two metrics measured earlier, corresponding to the number of VLIW instructions in the final assembly code.

In the moving average example, the initial function cost is 2186 VLIW instructions, calculated for the value n = 99. Figure 5 graphs the resulting cost of the function after every iteration of our software pipelining algorithm. As we discovered earlier through inspection, 3 iterations yield the best performance, with an approximate cost of 1796 VLIW instructions. This represents an approximate improvement of 18% over the non-pipelined version, and demonstrates the effectiveness of a feedback loop between the results of the scheduler and the aggressiveness of the software pipelining stage.
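For concreteness, these totals are consistent with, for example, a pre-header of about 8 VLIW instructions and a loop body of 22 before pipelining, and a pre-header of about 14 with a loop body of 18 after three moves; this breakdown is our reconstruction from the ranges in Figure 4 rather than a separately reported measurement:

    Cost(original)       = 8  + 99 * 22 = 2186 VLIW instructions
    Cost(after 3 moves)  = 14 + 99 * 18 = 1796 VLIW instructions
    Improvement          = (2186 - 1796) / 2186 ≈ 18%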

The complete Pegasus graphs for the moving average function are available online, showing the state of the graph before any software pipelining and after 3 and 11 iterations of our algorithm [8]. Software pipelining can make the data flow of these graphs much more complex to follow, as is apparent in the actual Pegasus graph output.

7 Discussion

While designing our software pipelining algorithm, we have discovered several areas for improvement. These improvements center around the heuristics used at different stages of the pipelining process. In this section, we discuss several of these heuristics as well as possible improvements. We also discuss limitations of our current implementation.

Optimization Heuristics There are two main heuristics used in our algorithm. Since we use an iterative approach that moves one operation per iteration, it becomes necessary to prioritize instructions available for pipelining and to determine when further pipelining will no longer be helpful.

In our current implementation, we choose which operation to pipeline by the length of data paths through the Pegasus circuit. This approach has worked well in our experiments. It's easy to understand and simple to implement. However, it's not without flaw. Its primary shortcoming is that it doesn't take advantage of any extra information in the system. For example, once we finish pipelining and start register allocation, we will have spill information available. Therefore, an improved approach might use this information about spill to reorder operation priority in future pipelining attempts.

The second challenge of our algorithm is to decide when the optimal amount of pipelining has been done. Our experiments show that at a certain point pipelining will decrease performance significantly. Therefore, our algorithm needs to find the optimal point. The most obvious solution, which we have discussed in Section 6, is to iterate pipelining, moving more operations each time to find a trend and choose the optimal pipeline. However, this approach is highly dependent on the previously discussed heuristics for choosing operation priority. Since operations are currently always chosen deterministically in the same order, we may miss orderings that would improve the schedule greatly.

In both of these cases, as with most compiler optimizations, a feedback approach is extremely useful. Software pipelining should run and then allow scheduling. The register allocation and scheduling attempt should feed information back to the pipelining stage to more intelligently select operations and the number of iterations in the next attempt. As scheduling attempts progress, this method might converge a bit closer to the optimal schedule than our current approach.

Software pipelining with no epilogue Traditional software pipelining adds both a prologue and an epilogue to loop bodies. In these cases, the prologue is responsible for filling the pipeline, the loop body executes on a full pipeline, and the epilogue is responsible for emptying the pipeline. However, in our implementation, we add only a prologue to loops. By copying operations into the prologue, we calculate values in iterations before they are used. This is a convenient solution, but there are two costs that are worth mentioning. First, we can only pipeline side-effect free operations. This is because operations that are pipelined will be executed for at least one extra iteration of the loop. Ordinarily the epilogue would contain a version of the loop body without these operations in an effort to avoid this phenomenon. This also implies that operations such as loads could access values in unintended memory locations when pre-loading for the next iteration during the last loop iterations.

Therefore, this operation requires a memory model that doesn't cause an exception when this occurs. Second, software pipelining without an epilogue is slightly less efficient. In traditional software pipelining, the epilogue would be a copy of the loop with all the operations in the prologue removed. However, in our approach, the last iterations of the loop can be thought of as repeating the operations in the prologue. Therefore, compared to a software pipelining strategy with an epilogue, we incur an additional cost for each operation that we add to the preheader. In practice, this has very little impact because the loop body itself will dominate execution. We feel that this cost is justified by the simplification it allows in the code.

Limitations In our current implementation, we have used several test cases to verify that our algorithm operates correctly. In each of these cases we have analyzed the output graphs to see that the pipelining operation has proceeded as we expect. In all cases, we believe the graphs to be correct. However, in some cases, the scheduler emits ASM that doesn't correctly reflect the graph. In these cases we have seen various errors, such as PC corruption or invalid output. These errors have prevented us from using additional data points in Figure 5, and they are specifically responsible for the lack of ASM metric data points at iterations 2, 9, and 10. In discussions with Tim and Mahim, we have concluded the problem is most likely in the scheduler. Unfortunately, under the time constraints of this semester, we were unable to further investigate the scheduler.

8 Conclusion

We have proposed and implemented a software pipelining algorithm to be used directly on the Pegasus intermediate representation. Our approach to software pipelining recognizes most regular patterns of operations as pipelineable via our iterative pattern matching algorithm. We have also tested this algorithm on a variety of test cases and presented the analysis of one particular moving average example in this paper. In our analysis, we showed that our algorithm offers an approximate 18% improvement over the optimizations that currently exist in c2dil.

We conclude that software pipelining in Pegasus is a viable and worthwhile optimization. While we have identified several opportunities for future improvements in Section 7, we believe that our algorithm provides a significant performance improvement with our current heuristics. By using the metrics evaluated in Section 6, extending our algorithm to implement a more extensive feedback-based approach has the potential for even better performance.

9 Acknowledgements

We would like to thank Seth Goldstein, Tim Callahan, and Mahim Mishra for all of their help in developing our iterative software pipelining approach and learning the intricacies of Pegasus and the c2dil framework.

References

[1] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the ACM SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 308–317, June 1988.

[2] A. Aiken and A. Nicolau. Perfect pipelining: A new loop parallelization technique. In European Symposium on Programming, pages 221–235, 1988.

[3] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367–432, 1995.

[4] M. Budiu and S. Goldstein. Optimizing memory accesses for spatial computation, 2003.

[5] M. Budiu and S. C. Goldstein. Pegasus: An efficient intermediate representation. Technical Report CMU-CS-02-107, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213, USA, April 2002.

[6] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In ICS '89: Proceedings of the 3rd International Conference on Supercomputing, pages 154–163, New York, NY, USA, 1989. ACM Press.

[7] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[8] C. Hartwig and E. Krevat. Pegasus graphs for pipelined moving average example. http://www.cs.cmu.edu/~chartwig/15-745.

[9] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In PLDI '88: Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, pages 318–328, New York, NY, USA, 1988. ACM Press.

[10] A. Nicolau. Percolation scheduling: A parallel compilation technique. Technical report, Ithaca, NY, USA, 1985.

[11] A. Nicolau. Uniform parallelism exploitation in ordinary programs. In ICPP, pages 614–618, 1985.

[12] A. Nicolau and S. Novack. Trailblazing: A hierarchical approach to percolation scheduling. In Proceedings of the 1993 International Conference on Parallel Processing, volume II - Software, pages II-120–II-124, Boca Raton, FL, 1993. CRC Press.

[13] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th International Symposium on Microarchitecture (MICRO-27), pages 63–74, December 1994.

[14] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In PLDI '92: Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, pages 283–299, New York, NY, USA, 1992. ACM Press.

[15] B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schema for modulo scheduled loops. In MICRO 25: Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 158–169, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.
