Single-Dimension Software Pipelining for Multi-Dimensional Loops

Hongbo Rong†, Zhizhong Tang‡, R. Govindarajan§, Alban Douillet†, Guang R. Gao†

† Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA ({rong,douillet,ggao}@capsl.udel.edu)
‡ Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China ([email protected])
§ Supercomputer Edn. & Res. Centre and Computer Science & Automation, Indian Institute of Science, Bangalore 560 012, India ([email protected])

Abstract

Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. In this paper, we propose a three-step approach, called Single-dimension Software Pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level.

The first step identifies the most profitable loop level for software pipelining in terms of initiation rate or data reuse potential. The second step simplifies the multi-dimensional data-dependence graph (DDG) into a 1-dimensional DDG and constructs a 1-dimensional schedule for the selected loop level. The third step derives a simple mapping function which specifies the schedule time for the operations of the multi-dimensional loop, based on the 1-dimensional schedule. We prove that the SSP method is correct and at least as efficient as other modulo scheduling methods.

We establish the feasibility and correctness of our approach by implementing it on the IA-64 architecture. Experimental results on a small number of loops show significant performance improvements over existing modulo scheduling methods that software pipeline a loop nest from the innermost loop.

Proceedings of the International Symposium on Code Generation and Optimization (CGO 2004) 0-7695-2102-9/04 $20.00 © 2004 IEEE

1. Introduction

Loop nests are rich in coarse-grain and fine-grain parallelism, and substantial progress has been made in exploiting the former [7, 10, 11, 13, 18, 6]. With the advent of ILP (Instruction-Level Parallelism) processors like VLIW, superscalar, and EPIC [1], and the fast growth in hardware resources, another important challenge is to exploit fine-grain parallelism in the multi-dimensional iteration space [14, 23, 20].

Software pipelining [1, 2, 14, 16, 17, 20, 23, 24, 25, 30] is an effective way to extract ILP from loops. While numerous algorithms have been proposed for single loops or the innermost loops of loop nests [1, 3, 16, 17, 19, 24, 25], only a few address software pipelining of loop nests [17, 20, 30]. These methods share the essential idea of scheduling each loop level successively, starting with the innermost one.

In hierarchical modulo scheduling [17], an inner loop is first modulo scheduled and is considered as an atomic operation of its outer loop. The process is repeated until all loop levels are scheduled, or available resources are used up, or dependences disallow further parallelization. The inefficiency due to the filling and draining (prolog and epilog parts) of the software-pipelined schedule for the inner loops is addressed in [20, 30].

In this paper, we refer to the above approach as modulo scheduling from the innermost loop, or innermost-loop-centric modulo scheduling. The innermost-loop-centric approach naturally extends the single-loop scheduling method to the multi-dimensional domain, but has two major shortcomings: (1) it commits itself to the innermost loop first, without considering how much parallelism the other levels have to offer; software pipelining another loop level first might result in higher parallelism. (2) It cannot exploit the data reuse potential in the outer loops [8, 9]. Other software pipelining approaches for loop nests [14, 23] do not consider resource constraints.

Lastly, there has been other interesting work that combines loop transformations (e.g., unimodular transformations) with software pipelining [32, 8]. However, in these methods, software pipelining is still limited to the innermost loop of the transformed loop nest.

In this paper, we present a unique framework for resource-constrained software pipelining for a class of loop nests. Our approach can software pipeline a loop nest at an arbitrary loop level. It extends the traditional innermost-loop software pipelining scheme to allow software pipelining to be applied to the most "beneficial" level in a loop nest, in order to better exploit parallelism and data reuse potential in the whole iteration space and match it to the available machine resources.

The problem addressed in this paper is formulated as follows: Given a loop nest composed of n loops L1, L2, ..., Ln, identify the most profitable loop level Lx (1 <= x <= n) and software pipeline it. Software pipelining Lx means that the consecutive iterations of Lx will be overlapped at run-time. The other loops will not be software-pipelined. In this paper, we only discuss how to parallelize the selected loop Lx. Its outer loops, if any, remain intact in our approach and can be parallelized later.

The above problem can be broken down into two subproblems: how to predict the benefits of software pipelining a given loop level, and how to software pipeline the loop nest at the chosen loop level. Our solution consists of three steps:

1. Loop selection: This step searches for the most profitable loop level in the loop nest. Here profitability can be measured in terms of initiation rate, or data reuse potential, or both.

2. Dependence simplification and 1-D schedule construction: The multi-dimensional DDG of the loop nest is reduced to a 1-dimensional (1-D) DDG for the selected loop. Based on the 1-D DDG and the resource constraints, a modulo schedule, referred to as the 1-D schedule, is constructed for the operations in the loop nest. No matter how many inner loops the selected loop level has, it is scheduled as if it were a single loop.

3. Final schedule computation: Based on the resulting 1-D schedule, this step derives a simple mapping function which specifies the schedule time of the operations of the multi-dimensional loop.

Since the problem of multi-dimensional scheduling is reduced to 1-dimensional scheduling and mapping, we refer to our approach as Single-dimension Software Pipelining (SSP).

An equally important problem is how to generate compact code for the constructed SSP schedule. The code generation problem involves issues related to register assignment through software and hardware renaming (to handle live range overlapping), controlling the filling and draining of the pipelines, and limiting the code size increase. Details are published in [27].

Like many software-pipelining methods, we assume dynamic support for kernel-only code (the kernel of the 1-D schedule) [28]. In the absence of such support, our method can still produce efficient code, although with an increase in code size. We target ILP uniprocessors with support for predication [1, 5]. There is no constraint on the function units: they may or may not be pipelined. There is no constraint on the latencies of operations, either.

We have developed a stand-alone implementation of our method for the IA-64 architecture. The resulting code is run on an IA-64 Itanium machine and the actual execution time is measured. Our initial experimental results on a few loops reveal that the proposed approach achieves significant performance benefits over existing methods. Furthermore, we observe that the SSP approach is beneficial even in the presence of loop transformations, such as loop interchange, loop tiling, and unroll-and-jam.

We remark that the experimental results presented here are preliminary in nature, and are included mainly to give a feel for the potential benefits of our method. The purpose of this paper is to introduce the SSP concept.

Our method shows several advantages:

Global foresight: Instead of focusing only on the innermost loop, every loop level is examined and the most profitable one is chosen. Furthermore, any criterion can be used to judge the "profitability" in this loop selection step. This flexibility opens a new prospect to combine software pipelining with any other optimality criterion beyond the ILP degree, which is often chosen as the main objective for software pipelining. In this paper, we consider not only parallelism, but also cache effects, which have not been considered by most traditional software pipelining methods [1, 3, 14, 16, 17, 19, 20, 23, 24, 25, 30].

Simplicity: The method retains the simplicity of the classical modulo scheduling of single loops. The scheduling is based on a simplified 1-dimensional DDG and is done only once, irrespective of the depth of the loop nest. This is another essential difference from previous approaches. Also, the traditional modulo scheduling of single loops is subsumed as a special case.

Efficiency: Under identical conditions, our schedule provably achieves the shortest computation time that could be achieved by the traditional innermost-loop-centric approach. (We differentiate between the terms "execution time" and "computation time" of a schedule: the latter refers to the estimated execution time of the schedule, while the former refers to the actual execution time measured by running the schedule on a real machine.) Yet, since we search the entire loop nest and choose the most profitable loop level, the execution time may be even shorter. Besides, we consider the reuse vector space to achieve potentially higher

exploitation of data locality, which, in turn, further reduces the actual execution time of the schedule.

The method presented here can be applied to both imperfect and perfect loop nests [26, 27, 29]. However, for simplicity reasons, we restrict ourselves to perfect loop nests in this paper.

This paper is organized as follows. Section 2 introduces the basic concepts and briefly reviews modulo scheduling. Then we motivate our study by a simple example in Section 3. Section 4 discusses our method in detail. We prove the correctness and the efficiency of our approach in Section 5. Experimental results and performance comparison with other methods are reported in Section 6. A discussion on related work and concluding remarks are presented in Sections 7 and 8.

2. Basic Concepts

An n-deep perfect loop nest is composed of loops L1, L2, ..., Ln, respectively, from the outermost to the innermost level. Each loop Lx (1 <= x <= n) has an index variable ix and an index bound Nx. The index is normalized to change from 0 to Nx - 1 with unit step. The loop body is assumed to have no branches; branches, if any, are converted to linear code by if-conversion [5].

The iteration space of the loop nest is a finite convex polyhedron [31, 7]. A node in the iteration space is called an iteration point, and is identified by the index vector I = (i1, i2, ..., in). The instance of any operation o in this iteration point is denoted by o(I). An Lx iteration is one execution of the Lx loop. Thus the Lx loop has a total of N1 * N2 * ... * Nx iterations. One such iteration is also an iteration point if Lx is the innermost loop.

We use (o1, o2, λ, d) to represent a data dependence from operation o1 to operation o2 in the loop nest, where o1 and o2 are called the source and sink of the dependence, respectively; λ is the dependence latency; and d = ⟨d1, d2, ..., dn⟩ is the distance vector, where d1 is the distance at the outermost level, and dn at the innermost.

The sign of a vector is that of its first non-zero element, either positive or negative. If all elements are 0, the vector is a null or zero vector.

Software pipelining [1, 2, 8, 14, 16, 17, 20, 24, 25, 30] exposes instruction-level parallelism by overlapping successive iterations of a loop. Modulo scheduling (MS) [16, 17, 24, 25] for single loops is an important and probably the most commonly used approach of software pipelining. A detailed introduction can be found in [4].

In modulo scheduling, instances of an operation from successive iterations are scheduled with an Initiation Interval (II) of T cycles. The schedule length of a (modulo) schedule is defined as the length or execution time of a single iteration. Let the schedule length of each iteration be l cycles. Then each iteration is composed of S = ⌈l / T⌉ stages, with each stage taking T cycles.

The schedule consists of three phases: filling the pipeline, repetitively executing the kernel, and draining the pipeline.

Example: Fig. 1(a) shows an example loop. Assume there are two dependences, a → b and b → c. Fig. 1(b) shows a modulo schedule for the loop with initiation interval T and S stages.

Figure 1. Modulo scheduling of a single loop: (a) an example loop; (b) modulo schedule

3. Motivation and Overview

In this section, we motivate our method with the help of a simple 2-deep perfect loop nest. Subsequently, we illustrate our scheduling approach using the same example throughout the paper.

3.1. A Motivating Example

Fig. 2 shows a perfect loop nest in C language and its data dependence graph (DDG). (This loop nest certainly could be parallelized in a number of other ways, too. We only use it for illustration purposes.) To facilitate understanding, and without loss of generality, in this example we assume that each statement is an operation. In the DDG, each node represents an operation and an edge represents a dependence labeled with the distance vector.

The inner loop has no parallelism due to the dependence cycle a → b → a at this level. Thus modulo scheduling of the innermost loop cannot find any parallelism for this example. The innermost-loop-centric software pipelining approach exposes extra parallelism, based on the modulo schedule, by overlapping the filling and draining of the pipelines between successive outer loop iterations. Since modulo scheduling cannot find any parallelism, there is no filling or draining and therefore no overlapping. Thus, innermost-loop-centric software pipelining cannot find any parallelism, either.
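The modulo scheduling quantities reviewed in Section 2 (initiation interval T, schedule length l, and stage count S = ⌈l/T⌉) can be made concrete with a small sketch. The following Python code is our illustration, not the paper's algorithm: the three-operation loop body, its latencies, and the greedy placement over a modulo reservation table are all assumptions for exposition.

```python
from math import ceil

def modulo_schedule(ops, deps, num_fus, T):
    """Place each operation of one loop iteration at the earliest cycle that
    respects the intra-iteration dependences and the modulo reservation
    table (at most num_fus operations per slot cycle % T)."""
    mrt = {}    # slot -> number of functional units already taken
    time = {}   # op -> issue cycle within a single iteration
    for op, _lat in ops:
        ready = max((time[s] + lam for s, d, lam in deps if d == op and s in time),
                    default=0)
        t = ready
        while mrt.get(t % T, 0) >= num_fus:   # find a free modulo slot
            t += 1
        mrt[t % T] = mrt.get(t % T, 0) + 1
        time[op] = t
    lat = dict(ops)
    l = max(time[o] + lat[o] for o, _ in ops)  # schedule length of one iteration
    S = ceil(l / T)                            # number of stages
    return time, l, S

# Toy loop body a -> b -> c (latencies 1, 2, 1) on one pipelined unit.
# With three operations and one unit, the resource bound forces T = 3.
time, l, S = modulo_schedule([("a", 1), ("b", 2), ("c", 1)],
                             [("a", "b", 1), ("b", "c", 2)], num_fus=1, T=3)
print(time, l, S)  # -> {'a': 0, 'b': 1, 'c': 5} 6 2
```

Successive iterations then initiate every T = 3 cycles, and each iteration spans S = 2 stages, so the steady-state kernel overlaps two consecutive iterations.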

L1: for (i1 = 0; i1 < N1; i1++) {
L2:   for (i2 = 0; i2 < N2; i2++) {
a:      U[i1][i2] = V[i1][i2-1] + U[i1-1][i2];
b:      V[i1][i2] = U[i1][i2];
      }
    }

Figure 2. A loop nest and its DDG (edges: a → b with distance ⟨0,0⟩; b → a with distance ⟨0,1⟩; a → a with distance ⟨1,0⟩)

One may argue that a loop interchange transformation before software pipelining will solve this problem. Unfortunately, that will destroy the data reuse in the original loop nest: for large arrays, each iteration point will introduce 3 cache misses, for accessing U[i1][i2], V[i1][i2-1], and U[i1-1][i2].
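The claim that the inner loop offers no parallelism can be checked numerically. The sketch below is ours, not the paper's: it encodes the DDG of Fig. 2, borrows the operation latencies of 1 cycle for a and 2 cycles for b that Section 3.2.2 assumes, and evaluates the recurrence-based lower bound on the initiation interval of the innermost loop.

```python
from math import ceil

# DDG of the loop nest in Fig. 2, one entry per dependence:
# (source, sink, latency, distance vector <d1, d2>)
ddg = [
    ("a", "b", 1, (0, 0)),   # b reads U[i1][i2] written by a
    ("b", "a", 2, (0, 1)),   # a reads V written by b one i2-iteration earlier
    ("a", "a", 1, (1, 0)),   # a reads U written one i1-iteration earlier
]

# At the innermost level L2, only dependences with d1 == 0 matter.
# They form the cycle a -> b -> a, with total latency 3 over i2-distance 1.
cycle = [e for e in ddg if e[3][0] == 0]
lat = sum(e[2] for e in cycle)       # 1 + 2 = 3
dist = sum(e[3][1] for e in cycle)   # 0 + 1 = 1
rec_mii = ceil(lat / dist)
print(rec_mii)  # -> 3
```

Since a single iteration of the inner loop body itself takes 3 cycles, a recurrence-constrained initiation interval of 3 means successive i2-iterations cannot overlap at all, which is exactly the situation described above.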

3.2. Overview and Illustration of Our Approach

The above example shows the limitation of traditional software pipelining: it cannot see the whole loop nest, so as to better exploit parallelism. Nor can it exploit the data reuse potential of the outer loops. This raises the question: why not select a better loop to software pipeline, not necessarily the innermost one? This question brings the challenging problem of software pipelining of a loop nest. The challenge comes from two aspects: how to handle resource constraints, and how to handle the multi-dimensional dependences.

3.2.1. Which Loop to Software Pipeline? Parallelism is surely one of the main concerns. On the other hand, cache effects are also important and govern the actual execution time of the schedule. However, it is very hard to consider cache effects in traditional software pipelining, mainly due to the fact that it focuses on the innermost loop. Provided that an arbitrary loop level in a loop nest can be software pipelined, we can search for the most profitable level, measured by parallelism, or cache effects, or both. Any other objective can also be used as a criterion.

3.2.2. How to Software Pipeline the Selected Loop? Suppose we have chosen a loop level; for simplicity, say L1. We allocate the iteration points to a series of slices, and software pipeline each slice. Although any software pipelining method can be used, in this paper we focus on modulo scheduling.

Figure 3. Illustration: (a) software pipelined slices; (b) the final schedule after cutting and pushing down the slices

For any i1 (0 <= i1 <= N1 - 1), iteration point (i1, 0, ..., 0) is allocated to the first slice, and the next point, in the lexicographic order, to the second slice, etc. If there were no dependences and no constraints on the resources, all L1 iterations could be executed in parallel, while the iteration points within each of them are executed sequentially. However, due to dependences at the L1 loop level, we may execute the iterations of L1 in an overlapped (software pipelined) manner. Furthermore, due to the limited resources, we may not execute all points in a slice simultaneously. Therefore we cut the slices into groups, with each group containing S iterations of the L1 loop. Then we push down (i.e., delay the execution of) the next group until resources are available.

Example: We illustrate the above ideas with the loop nest example in Fig. 2. Let us assume operations a and b have a latency of 1 and 2 cycles, respectively. Further assume that we have two functional units, and both are pipelined and can perform any of the operations. For any i1, point (i1, 0) is allocated to the first slice, and point (i1, 1) to the second, etc. Fig. 3(a) shows the allocation. Here each slice is modulo scheduled so that successive iteration points within the slice initiate at an interval of T = 1 cycle. (Although the example in Fig. 3(a) is very simple, the principles apply to general cases as well. In general, a slice can simply be any possible modulo schedule.) The schedule length l equals 3, and therefore there are S = ⌈l / T⌉ = 3 stages.

Although the resource constraints are respected within each modulo scheduled slice, resource conflicts (between slices) arise when successive slices are issued greedily. To remove the conflicts, we cut the slices into groups, with each group having S iterations of the L1 loop. There are two groups in this schedule. Each group, except the first one, is pushed down by a number of cycles relative to its previous group. The delay is designed to ensure that repeating patterns definitely appear. This leads to the final schedule that maps each instance of an operation to its schedule time, as shown in Fig. 3(b). Note that not only are dependence and resource constraints respected, but the parallelism degree exploited in a modulo scheduled slice is still preserved, and the resources are fully used. Lastly, it is straightforward to rewrite the final schedule in a more compact form (Fig. 4).

Figure 4. Rewritten loops

How to Handle Resources? Resource constraints are enforced at two levels: at the slice level, when we modulo schedule the slices, and at the inter-slice level, when we push down slices appropriately.

How to Handle Dependences? A major obstacle to software pipelining of loop nests is how to handle n-dimensional distance vectors. The above approach, however, easily solves this problem. A key observation is that if a dependence is respected before pushing down the groups, it will also be respected after that, because pushing down can only increase the time distance between the source and sink operations of the dependence. Therefore we only need to consider the dependences necessary to obtain the schedule before the push down.

There are two kinds of legal dependences: one is across two slices, and the other is within a slice, as shown in Fig. 5, where each parallelogram represents a slice, and each dot an iteration point. Although not shown in the picture, each slice is software pipelined.

Figure 5. Dependences

Due to the way the iteration points are allocated, a dependence across two slices has a distance vector ⟨d1, d2, ..., dn⟩ where ⟨d2, ..., dn⟩ is a positive vector. Such a dependence is naturally resolved because the two slices are executed sequentially.

A dependence within a slice has a distance vector ⟨d1, d2, ..., dn⟩ where ⟨d2, ..., dn⟩ is a zero vector. Such a dependence has to be considered during software pipelining.

The two kinds of dependences are named positive and zero dependences, respectively. Note that a dependence from a slice to a previous slice is illegal. It is called a negative dependence. Negative dependences can be changed into zero or positive dependences using loop skewing, as will be explained in Section 4.

In summary, we only need to consider zero dependences to construct the software-pipelined schedule.
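The cutting-and-pushing-down scheme described above can be replayed mechanically. The following sketch is our illustration, not the paper's code generator: it assumes the 1-D schedule values derived for this nest in Section 4 (T = 1, S = 3, with a issued first and b one cycle later in each iteration point), trip counts N1 = 6 and N2 = 2 chosen for the demo, and a push-down delay of (N2 - 1) * S * T cycles per group, which is our reading of Fig. 3(b). It then verifies that the resulting final schedule never needs more than the two functional units and respects every dependence of Fig. 2.

```python
from collections import Counter

T, S = 1, 3        # initiation interval and stage count of the 1-D schedule
N1, N2 = 6, 2      # trip counts (assumed for the demo)
sigma = {"a": lambda i1: i1 * T, "b": lambda i1: i1 * T + 1}

def start(op, i1, i2):
    """Assumed final schedule time: 1-D schedule time, plus S*T per earlier
    iteration point of the same L1 iteration, plus the push-down delay of
    group floor(i1 / S). The delay value is our assumption, not the paper's."""
    delay_per_group = (N2 - 1) * S * T
    return sigma[op](i1) + S * T * i2 + (i1 // S) * delay_per_group

issue = Counter(start(op, i1, i2)
                for op in "ab" for i1 in range(N1) for i2 in range(N2))

# Resource check: never more than the two functional units per cycle.
assert max(issue.values()) <= 2
# Dependence checks: a->b <0,0> lat 1, b->a <0,1> lat 2, a->a <1,0> lat 1.
for i1 in range(N1):
    for i2 in range(N2):
        assert start("b", i1, i2) >= start("a", i1, i2) + 1
        if i2 + 1 < N2:
            assert start("a", i1, i2 + 1) >= start("b", i1, i2) + 2
        if i1 + 1 < N1:
            assert start("a", i1 + 1, i2) >= start("a", i1, i2) + 1
print("all dependence and resource constraints hold")
```

Note how the positive dependence b → a ⟨0,1⟩ holds without ever being examined during scheduling: the next iteration point of the same L1 iteration starts S * T cycles later, which is the essence of the key observation above.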

4. Solution

In this section, we first classify dependences, and then formalize our approach into 3 steps: (1) loop selection, (2) dependence simplification and 1-D schedule construction, and (3) final schedule computation.

Let d = ⟨d1, d2, ..., dn⟩ be the distance vector of a dependence. We say that this dependence is effective at loop level Lx (1 <= x <= n) iff ⟨d1, ..., d(x-1)⟩ = 0 and ⟨dx, ..., dn⟩ >= 0, where 0 is the null vector with the appropriate length. By effective, we mean that such a dependence must be considered when we software pipeline Lx. All effective dependences at Lx compose the effective DDG at Lx.

According to the definition, if a dependence is effective at Lx, we have ⟨dx, d(x+1), ..., dn⟩ >= 0 (and of course the first element dx >= 0). We classify the dependence by the sign of the sub-distance-vector ⟨d(x+1), ..., dn⟩ when x < n. If this sub-vector is a zero, positive, or negative vector, the dependence is classified as a zero, positive, or negative dependence at Lx, respectively. When x = n, we classify it as a zero dependence at Lx for uniformity.

Example: Fig. 5 illustrates a positive dependence at L1 with a distance vector ⟨0, 1⟩, and a zero dependence at L1 with a distance vector ⟨1, 0⟩.

We must point out that the above classification is complete: an effective dependence is in one and only one of the three classes. Especially, the dependences are classified according to the sign of the sub-distance-vector, not that of the whole distance vector. For example, a dependence in a 3-deep loop nest with a distance vector of ⟨1, -1, 0⟩ is a negative dependence at L1, because the sub-vector ⟨-1, 0⟩ is negative, even though the whole distance vector is positive.

We classify only effective dependences since, in the following sections, our discussion relates only to them. Although the dependence classification depends on the loop level, we will not mention the loop level when the context is clear.

As explained in Section 3.2.2, we assume that when we consider software pipelining a loop level Lx, all effective dependences at this level are either zero or positive. Negative dependences cannot be handled directly. The loop nest must be transformed to make them zero or positive. (It is always feasible to transform a negative dependence into a zero or positive dependence by loop skewing. However, after that, the iteration space becomes non-rectangular. Although we restrict ourselves to rectangular iteration spaces in this paper, the first two steps of SSP are still applicable to non-rectangular cases, without any change, since scheduling considers only the DDG and the hardware resources; it considers nothing about the shape of the iteration space. For the third step of SSP, in our SSP code generator for the IA-64 architecture used in the experiments, we used predicate registers to dynamically form a non-rectangular iteration space at runtime, although the static code produced looks rectangular. The algorithms are presented neither here nor in the companion paper [27], due to space limitations.)

Moreover, only zero dependences need to be considered. Positive dependences can be ignored, as they will be naturally honored in the final schedule. Lastly, only the dependence distance at Lx is useful for software pipelining. Thus we can reduce the effective DDG to have only zero dependences with 1-dimensional distance vectors. We refer to the resulting DDG as the simplified DDG. The definition is as follows: the simplified DDG at Lx is composed of all the zero dependences at Lx; the dependence arcs are annotated with the dependence distance at Lx.

Example: Fig. 6(a) shows the effective DDG at L1 for the loop nest depicted in Fig. 2. There are two zero dependences in this DDG: a → a and a → b. Associating the dependence distances at L1 with the arcs, we get the simplified DDG shown in Fig. 6(b).

Figure 6. Dependence Simplification: (a) effective DDG at L1; (b) simplified DDG at L1

4.1. Loop Selection

In this paper, our objective is to generate the most efficient software-pipelined schedule possible for a loop nest. Thus it is desirable to select the loop level with a higher initiation rate (higher parallelism), or a better data reuse potential (better cache effects), or both. (The specific decision is not made here, since that is implementation-dependent.) In this section, we address the essential problem of evaluating these two criteria.

4.1.1. Initiation Rate Initiation rate, which is the inverse of the initiation interval, specifies the number of iteration points issued per cycle. Hence we choose the loop level Lx that has the maximum initiation rate, or minimum initiation interval. The initiation interval T_Lx at loop level x can be estimated as:

T_Lx = max(RecMII_Lx, ResMII)    (1)

where RecMII_Lx and ResMII are the minimum initiation intervals determined, respectively, by recurrences in the simplified DDG at Lx, and by the available hardware resources.

RecMII_Lx = max over cycles C of ⌈λ(C) / d(C)⌉    (2)

where C is a cycle in the simplified DDG, λ(C) is the sum of the dependence latencies along cycle C, and d(C) is the sum of the dependence distances along C. For pipelined function units [4, 15],

ResMII = max over resource types r of ⌈(total operations that use resource type r) / (total resources of type r)⌉    (3)

We omit the details of the ResMII calculation for non-pipelined function units. Readers are referred to [15]. (The reader will find in Section 4.2.2 that our scheduling method has extra sequential constraints added to modulo scheduling. They affect only the schedule length of the 1-D schedule, but not the initiation interval; in the worst case, we can always increase the schedule length to satisfy these constraints. Thus they will not affect the RecMII or ResMII values.)

In addition to the initiation rate, we also look at the trip count of each loop level. In particular, the trip count should not be less than S, the number of stages in the schedule. Otherwise, this loop should not be chosen. The reason is that the slices are cut into groups, where each group has S iterations of the Lx loop. The trip count Nx is then expected to be divisible by S. Otherwise, the last group will have fewer Lx iterations, resulting in a lower utilization of resources in that group. However, when Nx > S, it is always possible to apply loop peeling to avoid this situation. Although S is unknown at loop selection time, it is generally small, because the limited resources in a uniprocessor cannot support too many stages at the same time. As a guideline, a small estimated value can be set for S.

4.1.2. Data Reuse When we software pipeline a loop level, the data reuse potential can be measured by the average number of memory accesses per iteration point. The fewer the accesses, the greater the reuse potential. Without loss of generality, let us consider loop L1.

In our approach, software pipelining results in S iterations of the L1 loop running in a group, which is composed of a series of slices. Select the first S successive slices in the first group. They include the following set of iteration points: {(i1, 0, ..., 0, in) | 0 <= i1 < S and 0 <= in < S}, which is an S x S square in the iteration space. This is a typical situation in our method, because L1 iterations are executed in parallel, and the index of the innermost loop changes more frequently than the other indexes. Therefore we can estimate the memory accesses of the whole loop nest by those of the iteration points in this set. This set can be abstracted as a localized vector space L = span{(1, 0, ..., 0), (0, ..., 0, 1)}. Now the problem is very similar to that discussed in Wolf and Lam [31]. Below we briefly describe the application of their method in this situation.

For a uniformly generated set in this localized space, let R_ST and R_SS be the self-temporal and self-spatial reuse vector spaces, respectively. And let gT and gS be the number of group-temporal and group-spatial equivalence classes. (These symbols are consistent with those in [31], and their values can be computed by using the method in [31].) Then, for this uniformly generated set, the number of memory accesses per iteration point is [31]:

(gS + (gT - gS) / l) / (l^e * S^dim(R_SS ∩ L))    (4)

where l is the cache line size, and e = 1 if R_ST ⊂ R_SS, and e = 0 otherwise.

The total memory accesses per iteration point are the sum of the accesses for each uniformly generated set.

4.2. Dependence Simplification and 1-D Schedule Construction

Our method software pipelines only the selected loop level. Enclosing outer loops, if any, are left as they are. Therefore, without loss of generality, we consider L1 as the selected level. We simplify the dependences to be single-dimensional first, and then schedule the operations.

4.2.1. Dependence Simplification As mentioned already, given the effective DDG at L1, we can simplify the dependences to obtain a simplified DDG, which consists of only zero dependences with 1-dimensional distance vectors. To make sure positive dependences are respected, during this simplification process we also compute λmax(o) for each operation o, where:

λmax(o) = max over every positive dependence (o, o', λ, d) of λ    (5)

The value of λmax(o) is equal to 0 if there is no positive dependence with o as the source operation. The value will be used when scheduling operations, so that positive dependences are naturally honored in the final schedule.

Example: For the loop nest in Fig. 2, its effective and simplified DDGs at L1 are shown in Fig. 6(a) and Fig. 6(b). λmax(a) = 0, because a is not the source of any positive dependence, and λmax(b) = 2, because of the positive dependence b → a with a 2-cycle latency.
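The classification and simplification just described are mechanical. The sketch below is our illustration, not the paper's implementation: it classifies each effective dependence at L1 by the sign of its sub-distance-vector, keeps the zero dependences annotated with their L1 distances as the simplified DDG, and records λmax per Equation (5). It assumes negative dependences have already been removed by skewing, and reuses the tuple notation (source, sink, latency, distance vector) from Section 2.

```python
def vec_sign(v):
    """Sign of a vector: that of its first non-zero element; 0 for the null vector."""
    for c in v:
        if c:
            return 1 if c > 0 else -1
    return 0

def simplify(effective_ddg, x=1):
    """Split the effective DDG at level Lx into positive and zero dependences,
    keep only the zero dependences (annotated with their distance at Lx),
    and record lambda_max for each source of a positive dependence."""
    simplified, lam_max = [], {}
    for src, dst, lat, d in effective_ddg:
        if vec_sign(d[x:]) > 0:                    # positive dependence
            lam_max[src] = max(lam_max.get(src, 0), lat)
        else:                                      # zero dependence
            simplified.append((src, dst, lat, d[x - 1]))
    return simplified, lam_max

# Effective DDG at L1 of the Fig. 2 nest.
ddg = [("a", "b", 1, (0, 0)), ("b", "a", 2, (0, 1)), ("a", "a", 1, (1, 0))]
simplified, lam_max = simplify(ddg)
print(simplified)                                # -> [('a', 'b', 1, 0), ('a', 'a', 1, 1)]
print(lam_max.get("a", 0), lam_max.get("b", 0))  # -> 0 2
```

The result matches the running example: the simplified DDG at L1 keeps a → b (distance 0) and a → a (distance 1), while λmax(a) = 0 and λmax(b) = 2.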

Proceedings of the International Symposium on Code Generation and Optimization (CGO 2004) 0-7695-2102-9/04 $20.00 © 2004 IEEE

4.2.2. 1-D Schedule Construction

Based solely on the simplified DDG and the hardware resource constraints, we construct a 1-D schedule. Since the DDG is 1-dimensional, L1 is treated, from the viewpoint of scheduling, as a single loop, as if it had no inner loops. Any modulo scheduling method can be applied to L1 to obtain a 1-D schedule.

Let T be the initiation interval of the generated schedule, and S be the number of stages. We refer to the schedule as a 1-D schedule for the loop level L1. Let the schedule time of any operation instance o(i1) be σ(o, i1). The 1-D schedule must satisfy the following properties:

1. Modulo property:

   σ(o, i1 + 1) = σ(o, i1) + T    (6)

2. Dependence constraints:

   σ(o2, i1 + k) ≥ σ(o1, i1) + δ    (7)

   for every dependence arc e = (o1, o2) with latency δ and distance k in the simplified DDG.

3. Resource constraints: during the same cycle, no hardware resource is allocated to more than one operation.

4. Sequential constraints:

   σ(o, i1) − i1·T ≤ σ_max(o) for any operation o, if n > 1    (8)

   where σ_max(o) is the latest issue time, within the S·T-cycle span of one iteration point, that still honors all positive dependences.

The first three constraints are exactly the same as those of classical modulo scheduling [4, 15]. We have added the sequential constraints to enforce sequential order between successive iteration points in the same L1 iteration. This ensures that all positive dependences are honored at runtime.

Example. Fig.7 shows the 1-D schedule constructed for the example loop nest in Fig.2. The 1-D schedule is based on the simplified DDG in Fig.6(b). As mentioned earlier, we have assumed two homogeneous functional units, and execution latencies of 1 and 2 cycles for operations a and b, respectively. The schedule has an initiation interval of 1 cycle (T = 1) and 3 stages (S = 3). Also, σ(a, i1) = i1·T and σ(b, i1) = i1·T + 1.

Figure 7. The 1-D schedule

4.3. Final Schedule Computation

As explained in Section 3.2.2, we first allocate the iteration points in the loop nest to slices: for any i1, iteration point (i1, 0, ..., 0) is allocated to the first slice, (i1, 0, ..., 1) to the second slice, and so on. Then we software pipeline each slice by applying the 1-D schedule to it. If the successive slices are greedily issued without considering resource constraints across the slices (that is, two adjacent slices are put together without any "hole" between them, which can be done because the slices have the same shape), then we obtain a schedule like that in Fig.3(a). Note that, within each slice, the resource constraints are honored during the construction of the 1-D schedule. Now, to enforce resource constraints across slices, we cut the slices into groups, with each group having S number of L1 iterations. Each group, except the first one, is delayed by a given number of cycles, as shown in Fig.3(b).

Using the above procedure, a final schedule can be constructed for the loop nest. Such a schedule can be precisely defined by the following mapping function. For any operation o and iteration point I = (i1, ..., in), the schedule time f(o, I) is given by

   f(o, I) = σ(o, i1) + (Σ_{x=2}^{n} i_x · Π_{y=x+1}^{n} N_y) · S·T + ⌊i1/S⌋ · (Π_{x=2}^{n} N_x − 1) · S·T    (9)

where N_{n+1} = 1.

Let us briefly explain how the above equation is derived. First, consider the greedy schedule before cutting and pushing down the groups. For this schedule, the schedule time of o(I) is equal to that of o(i1, 0, ..., 0) plus the time elapsed between the schedule times of o(i1, 0, ..., 0) and o(I). Since o(i1, 0, ..., 0) is in the first slice, its schedule time is simply σ(o, i1), the mapping function of the 1-D schedule. Note that this corresponds to the first term of the right-hand side (RHS) of Equation (9).

Next, between iteration points (i1, 0, ..., 0) and (i1, i2, ..., in), there are i2·(N3·N4·…·Nn) + i3·(N4·…·Nn) + … + in iteration points. These points execute sequentially, and each of them takes S·T cycles. Thus, the time elapsed between the schedule times of o(i1, 0, ..., 0) and o(I) is equal to

   (Σ_{x=2}^{n} i_x · Π_{y=x+1}^{n} N_y) · S·T    (10)

Note that this corresponds to the second term of the RHS of Equation (9).

Next, we discuss the effect of grouping and pushing down the slices. Iteration point I is located in group ⌊i1/S⌋.


Each group in a slice is delayed by a multiple of w cycles, where w is the delay between two successive groups. For the example in Fig.3(b), with the 2-deep loop nest, we see that w = (N2 − 1)·S·T. In general, for an n-deep loop nest, w = (Π_{x=2}^{n} N_x − 1)·S·T, since one iteration of L1 contains Π_{x=2}^{n} N_x iteration points in total. Therefore, the group where o(I) is located is pushed down by

   ⌊i1/S⌋ · (Π_{x=2}^{n} N_x − 1) · S·T    (11)

cycles. This is precisely the third term in Equation (9).

Example. To illustrate the mapping function for the final schedule, consider the 2-deep loop nest in Fig.2. From the 1-D schedule in Fig.7, we know that S = 3, T = 1, and σ(a, i1) = i1·T. For any operation instance a(i1, i2), we have the final schedule

   f(a, (i1, i2)) = i1 + 3·i2 + 3·⌊i1/3⌋·(N2 − 1)

For instance, when N2 = 2, we have f(a, (i1, i2)) = i1 + 3·i2 + 3·⌊i1/3⌋, as shown in Fig.3(b).
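To make the mapping function concrete, the following Python sketch implements Equation (9) as reconstructed above and checks it against the closed form of this example. The function and variable names are ours, and the outermost trip count N1 = 6 is assumed only for illustration.

```python
from math import prod

def final_schedule_time(sigma, i, N, S, T):
    """Schedule time f(o, I) of Equation (9) for iteration point
    I = (i[0], ..., i[n-1]) in an n-deep loop nest with trip counts N.
    sigma(i1) is the 1-D schedule time of the operation in outermost
    iteration i1; S and T are the stage count and initiation interval
    of the 1-D schedule."""
    n = len(i)
    # Second term: sequential offset of I within its outermost iteration,
    # S*T cycles per preceding iteration point.
    seq = sum(i[x] * prod(N[x + 1:]) for x in range(1, n)) * S * T
    # Third term: group floor(i1/S) is pushed down by
    # (prod(N[1:]) - 1) * S * T cycles per group.
    push = (i[0] // S) * (prod(N[1:]) - 1) * S * T
    return sigma(i[0]) + seq + push

# Running example (Fig.2): 2-deep nest, T = 1, S = 3, sigma(a, i1) = i1*T,
# N2 = 2, and an assumed outermost trip count N1 = 6.
T, S, N1, N2 = 1, 3, 6, 2
f = lambda i1, i2: final_schedule_time(lambda x: x * T, (i1, i2), (N1, N2), S, T)

# Matches the closed form i1 + 3*i2 + 3*floor(i1/3)*(N2 - 1) ...
assert all(f(i1, i2) == i1 + 3 * i2 + 3 * (i1 // 3) * (N2 - 1)
           for i1 in range(N1) for i2 in range(N2))
# ... and issues exactly one iteration point every T cycles, with no hole.
assert sorted(f(i1, i2) for i1 in range(N1) for i2 in range(N2)) == list(range(N1 * N2))
```

The second assertion previews the intuition used later for Theorem 2: under this mapping, iteration points occupy consecutive T-cycle issue slots.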

5. Analysis

In this section, we establish the correctness and efficiency of the SSP approach, and its relationship with MS.

5.1. Correctness and Efficiency

Theorem 1 The final schedule defined in Equation (9) respects all the dependences in the effective DDG and the resource constraints.

Section 3.2.2 has described the intuition behind this theorem. The complete proof is documented in [29].

Next, we demonstrate the efficiency of the SSP approach over other innermost-loop-centric software pipelining methods from the viewpoint of the computation time of the constructed schedule. In particular, we compare our approach with modulo scheduling of the innermost loop (MS), and with modulo scheduling of the innermost loop combined with overlapping of the filling and draining parts of the outer loops, referred to in this paper as extended modulo scheduling (xMS) [20]. Let us define the computation time as the schedule time of the last operation instance, plus 1.

Theorem 2 For an n-deep loop nest, suppose that MS, xMS, and SSP find the same initiation interval T and stage number S. Furthermore, suppose that the SSP method chooses the outermost loop L1, which has a trip count N1. If N1 is divisible by S, then the computation time of the final schedule of SSP is no bigger than that of the MS or xMS schedule.

The complete proof is also documented in [29]. Intuitively, this theorem holds because the final schedule produced by SSP always issues one iteration point every T cycles, without any hole, as can be seen from the example in Fig.3(b).

The above theorem assumes that N1 is divisible by S. If not, since S ≤ N1 (according to the discussion in Section 4.1.1) and S is typically very small, we can always peel off some iterations to make it divisible. In this way, we can assure at least the same performance as that of MS or xMS. Since we search globally for the loop with the smallest initiation interval T, which dominates the computation time of a (final) schedule [29], our computation time can be even better. Furthermore, since we use an accurate estimation model for data reuse to exploit data locality, the execution time of our schedule will be lower than that of the schedules generated by the MS or xMS methods.

5.2. Relation with the Classical Modulo Scheduling of a Single Loop

If the loop nest is a single loop (n = 1), our sequential constraints in formula (8) are redundant. The other constraints are exactly the same as those of classical modulo scheduling, and the final schedule is f(o, I) = σ(o, i1). In this sense, classical MS is subsumed as a special case of our method.

The time complexity of the SSP method is similar to that of MS. It is bounded by two polynomial functions of u, where u is the number of operations [29]. The specific complexity depends on the loop selection criteria and the modulo scheduling method used, and is usually limited to the lower of the two bounds; the higher bound is an extreme case that occurs when n (the number of loops) is close to u, which does not happen frequently in the real world.

6. Experiments

We briefly introduce and analyze our experiments, conducted on the IA-64 architecture. Detailed performance curves, cache miss measurements, and their analysis are reported in the technical memo [29].

6.1. Experimental Setup

Our experimental platform is an IA-64 Itanium workstation with a 733MHz processor, 2GB of main memory, and 16KB/96KB/2MB L1/L2/L3 caches. The method was implemented as a stand-alone module. Our implementation addresses a number of implementation difficulties, including compile-time register renaming, filling/draining control, predicated code generation, rotating register usage, and code size reduction. The interested reader is referred to our companion paper [27] for details. The compilation time was negligible.

Huff's modulo scheduling method [16] was used to implement the 1-D scheduler. Using the generated schedule, the code generator produces two versions of the same program: one is performance-optimized, and the other is code-size-optimized [27]. We refer to them as SSP and CS-SSP, respectively.

For comparisons, we chose important loop kernels from scientific applications: matrix multiply (MM), LU decomposition (LU), SOR relaxation (SOR), and modified 2-D hydrodynamics (HD) from the Livermore loops [21]. We compare the performance of the SSP method against the modulo scheduling method [16] and an optimized version of it where the prologs and epilogs of consecutive iterations overlap [20]. We refer to them as MS and xMS, respectively. We also applied three loop transformations to test each method in a different context for MM: loop interchange (6 different versions), loop tiling, and unroll-and-jam.

To compare and analyze performance, we measured the actual execution time of the schedules from each method on the Itanium machine, and also measured the cache misses using the IA-64 performance monitoring tool, Pfmon.

Method                        MS     xMS    SSP    CS-SSP
MM-ijk                        1.0    1.0    2.5    2.6
MM-jik                        1.0    1.0    3.9    4.0
MM-ikj                        2.5    2.7    5.4    5.5
MM-jki                        2.5    2.5    5.2    5.3
MM-kij                        2.5    2.5    3.0    3.0
MM-kji                        2.0    2.3    3.3    3.3
HD                            1.6    1.6    2.0    2.0
SOR                           1.2    1.2    2.5    2.5
LU                            2.4    2.4    2.8    3.0
MM-jki with loop tiling       7.7    8.5    10.4   10.7
MM-jki with unroll-and-jam    11.6   11.9   14.4   12.7

Table 1. Average speedups

6.2. Performance Analysis

In all our experiments, the SSP method always outperforms MS and xMS, even when loop interchange, loop tiling, or unroll-and-jam is applied beforehand. Being able to choose a more profitable loop level clearly proved to be an advantage over MS and xMS.

Loop selection is important to exploit maximal parallelism. For MM-jik, MM-ijk, and SOR, a recurrence cycle in the innermost loop prevents a better overlapping of iterations. However, by software-pipelining the outermost loop, SSP and CS-SSP achieve a higher speedup, e.g., 1.5 times faster than the MS and xMS schedules on average for MM-ijk.

Loop selection is also important to reduce memory accesses and improve the real performance of software pipelined schedules. For MM-ikj, MM-jki, MM-kij, MM-kji, HD, and LU, the cache miss measurements show that SSP schedules result in fewer L2 cache misses, and fewer or the same L3 cache misses, although there is not much difference in the parallelism exploited by each schedule. SSP was thus able to get around the limited data reuse opportunities of the innermost loop. As a consequence, the real performance of the SSP and CS-SSP schedules is higher, e.g., 1 time faster than the MS and xMS schedules on average for MM-ikj.

At the level of low-level details, the advantage of SSP scheduling and code generation shows clearly in the tiled and register-tiled MM. Although the loops tested are perfect in the high-level language, they become imperfect at the low level. After tiling or register-tiling, the whole loop nest becomes deeper (from 3 to 5), and it thus becomes more important to offset the overhead of running the operations outside the innermost loop. In the tiled or register-tiled loop, these operations are executed more frequently, since the inner loops have small trip counts; effective scheduling of these operations is therefore very important. Although not introduced in this paper, SSP considers scheduling these operations at the very beginning, when building the 1-D DDG. The result is a compact schedule, with all hardware resources fully exploited. In comparison, xMS and MS mainly care about the efficiency of running the innermost loop operations, and their software pipelined kernel includes only such operations. Other operations are inserted or executed as normal in the schedule, depending on the resources, dependences, and validity concerns. Our experiments indicate that the bundle densities of SSP and CS-SSP are more than 10% higher than that of xMS [27]. This, in turn, leads to around 20% performance improvement.

7. Related Work

Most software pipelining algorithms [1, 3, 4, 16, 24, 25] focus on the innermost loop, and do not consider cache effects. The most commonly used method, modulo scheduling [1, 4, 16, 24], is a special case of our method.

A common extension of modulo scheduling from single loops to loop nests, including hierarchical reduction [17], OLP [20], and pipelining-dovetailing [30], is to apply modulo scheduling hierarchically in order to exploit the parallelism between the draining and filling phases. The modulo reservation table and the dependence graph need to be reconstructed before scheduling each level. In comparison, our method considers cache effects and performs scheduling

only once, independent of the depth of the loop nest. The draining and filling phases are naturally overlapped without any special treatment.

Next, we address the question: what is the difference between our method and one that interchanges the selected loop with the innermost loop, and then software pipelines the new innermost loop with MS? First, it may not always be possible to interchange the two loops. For example, if a dependence in a 3-deep loop nest has a distance vector such as ⟨1, 0, −1⟩ and our method selects the outermost loop, it is not legal to interchange this loop with the innermost loop. Second, even if the loops are interchangeable, the resulting schedules have different runtime behavior due to different data reuse patterns; and for the interchanged loop nest, the best choice still has to be made by global searching. This is best explained by our experiments on matrix multiply in ijk order and, after interchange, in kji order: MS does improve performance after interchange, but our method performs even better. Third, in some situations, interchange may be a bad choice, as discussed in Section 3.1. Lastly, loop interchange can be beneficial to SSP as well, as can be seen from the experiments on the different versions of MM.

Loop tiling [31] aims to maximize cache locality instead of parallelism, and loop unrolling [6] duplicates the body of the innermost loop to increase instruction-level parallelism. Both methods are complementary to SSP.

Hyperplane scheduling [18] is generally used in the context of large array-like hardware structures (such as systolic arrays and SIMD arrays), and does not consider resource constraints. There has been an interesting recent approach that enforces resource constraints in hyperplane scheduling by projecting the n-D iteration space onto an (n−1)-D virtual processor array, and then partitioning the virtual processors among the given set of physical processors [12]. This method targets parallel processor arrays, and considers neither low-level resources (like the functional units within a single processor) nor cache effects. A subsequent software pipelining phase may need to be applied to each physical processor in order to further exploit instruction-level parallelism from the iterations allocated to the same processor.

Other hyperplane-based methods [10, 11, 14, 23] formulate the scheduling of loop nests as linear (often integer linear) programming problems. Optimal solutions to integer programming have exponential time complexity in the worst case when using the Simplex algorithm or branch-and-bound methods [7]. Furthermore, they consider neither resource constraints nor cache effects.

Unimodular and non-unimodular transformations [7, 13] mainly care about coarse-grain parallelism or the communication cost between processors.

The fine-grain wavefront transformation [2] combines loop quantization and perfect pipelining [3] to explore coarse- and fine-grain parallelism simultaneously. It is based on outer loop unrolling and repetitive pattern recognition. There are two difficult problems involved here, namely, determining the optimal unrolling degree and identifying functionally equivalent nodes [3].

Unroll-and-jam [9, 6] has been applied to improve the performance of software pipelined loops [8]. The outer loop is unrolled, but it is still the innermost loop that

is software-pipelined. In other words, the RecMII still strongly depends on the recurrences in the innermost loop, though it is reduced by the unroll factor. In contrast, the RecMII of an SSP schedule depends on the recurrences in the chosen loop level. Thus, by choosing an appropriate loop level in the loop nest, the SSP method can achieve a lower initiation interval that might not be attainable via unroll-and-jam followed by MS.

Unroll-and-squash first applies unroll-and-jam to a nested loop, and then "squashes" the jammed innermost loop to generate software pipelined code [22]. SSP differs from unroll-and-squash in the following ways: (1) the unroll-and-squash method presented in [22] appears to be limited to 2-deep loop nests; (2) it does not overlap the epilog and prolog, and requires optimizations such as xMS to achieve this; and (3) it decides on the unroll factor first, and then software pipelines the innermost loop. An in-depth comparison of SSP with unroll-and-jam and unroll-and-squash is left for future work.

8. Conclusion

We have introduced a unique 3-step approach to software pipeline a loop nest at an arbitrary level. This approach reduces the challenging problem of n-dimensional software pipelining to the simpler problem of 1-dimensional software pipelining. Our method, referred to as Single-dimension Software Pipelining (SSP), provides the freedom to search for and schedule the most profitable level, where profitability can be measured in terms of parallelism exploited, data reuse potential, or any other criteria.

We demonstrated the correctness and efficiency of our method. The schedule generated by our method was shown to naturally achieve the shortest execution time compared to traditional innermost-loop-centric modulo scheduling methods, due to better data reuse, instruction- and loop-level parallelism, and/or code generation.

Future work includes the study of further aspects of SSP, such as interaction with other loop nest transformations and affordable hardware support for kernel-only code generation.

Acknowledgments

We are grateful to Prof. Bogong Su, Dr. Hongbo Yang, and the anonymous reviewers of the previous versions of this paper for their patient reviews and valuable advice.

References

[1] Intel IA-64 Architecture Software Developer's Manual, Vol. 1: IA-64 Application Architecture. Intel Corp., 2001.
[2] A. Aiken and A. Nicolau. Fine-grain parallelization and the wavefront method. Languages and Compilers for Parallel Computing, MIT Press, Cambridge, MA, pages 1–16, 1990.
[3] A. Aiken, A. Nicolau, and S. Novack. Resource-constrained software pipelining. IEEE Transactions on Parallel and Distributed Systems, 6(12):1248–1270, Dec. 1995.
[4] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Computing Surveys, 27(3):367–432, September 1995.
[5] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conf. Record of the Tenth Annual ACM Symp. on Principles of Programming Languages, pages 177–189, Austin, Texas, January 1983. ACM SIGACT and SIGPLAN.
[6] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann, San Francisco, 2002.
[7] U. Banerjee. Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic, Boston, 1993.
[8] S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proc. 29th Annual Hawaii Intl. Conf. on System Sciences, pages 183–192, 1996.
[9] S. Carr and K. Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Trans. on Prog. Lang. and Systems, 16(6):1768–1810, Nov. 1994.
[10] A. Darte and Y. Robert. Constructive methods for scheduling uniform loop nests. IEEE Transactions on Parallel and Distributed Systems, 5(8):814–822, Aug. 1994.
[11] A. Darte, Y. Robert, and F. Vivien. Scheduling and Automatic Parallelization. Birkhäuser, Boston, 2000. 280 p.
[12] A. Darte, R. Schreiber, B. R. Rau, and F. Vivien. Constructing and exploiting linear schedules with prescribed parallelism. ACM Trans. Des. Autom. Electron. Syst., 7(1):159–172, 2002.
[13] P. Feautrier. Automatic parallelization in the polytope model. Lecture Notes in Computer Science, 1132:79–103, 1996.
[14] G. R. Gao, Q. Ning, and V. Van Dongen. Software pipelining for nested loops. ACAPS Tech Memo 53, School of Computer Science, McGill Univ., Montréal, Québec, May 1993.
[15] R. Govindarajan, E. R. Altman, and G. R. Gao. A framework for resource-constrained rate-optimal software pipelining. IEEE Transactions on Parallel and Distributed Systems, 7(11):1133–1149, November 1996.
[16] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the ACM SIGPLAN '93 Conf. on Prog. Lang. Design and Implementation, pages 258–267, Albuquerque, New Mexico, June 23–25, 1993. SIGPLAN Notices, 28(6), June 1993.
[17] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318–328, Atlanta, Georgia, June 22–24, 1988. SIGPLAN Notices, 23(7), July 1988.
[18] L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, February 1974.
[19] S.-M. Moon and K. Ebcioğlu. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems, 19(6):853–898, Nov. 1997.
[20] K. Muthukumar and G. Doshi. Software pipelining of nested loops. Lecture Notes in Computer Science, 2027:165–??, 2001.
[21] T. Peters. Livermore loops coded in C. http://www.netlib.org/benchmark/livermorec.
[22] D. Petkov, R. Harr, and S. Amarasinghe. Efficient pipelining of nested loops: unroll-and-squash. In 16th Intl. Parallel and Distributed Processing Symposium (IPDPS '02), Fort Lauderdale, FL, Apr. 2002. IEEE.
[23] J. Ramanujam. Optimal software pipelining of nested loops. In Proc. of the 8th Intl. Parallel Processing Symp., pages 335–342, Cancún, Mexico, April 1994. IEEE.
[24] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63–74, San Jose, California, November 30–December 2, 1994.
[25] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing, 7:9–50, May 1993.
[26] H. Rong. Software Pipelining of Nested Loops. PhD thesis, Tsinghua University, Beijing, China, 2001.
[27] H. Rong, A. Douillet, R. Govindarajan, and G. R. Gao. Code generation for single-dimension software pipelining of multi-dimensional loops. In Proc. of the 2004 Intl. Symp. on Code Generation and Optimization (CGO), March 2004.
[28] H. Rong and Z. Tang. Hardware controlled shifts and rotations supporting software pipelining of loop nests. China Patent #00133535.9, November 2000.
[29] H. Rong, Z. Tang, A. Douillet, R. Govindarajan, and G. R. Gao. Single-dimension software pipelining for multi-dimensional loops. CAPSL Technical Memo 49, Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, September 2003. ftp://ftp.capsl.udel.edu/pub/doc/memos/memo049.ps.gz.
[30] J. Wang and G. R. Gao. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops. In Proc. of the 6th Intl. Conf. on Compiler Construction, CC '96, volume 1060 of Lecture Notes in Computer Science, pages 1–17, Linköping, Sweden, April 1996.
[31] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proc. of the ACM SIGPLAN '91 Conf. on Prog. Lang. Design and Implementation, pages 30–44, Toronto, June 26–28, 1991. SIGPLAN Notices, 26(6), June 1991.
[32] M. E. Wolf, D. E. Maydan, and D.-K. Chen. Combining loop transformations considering caches and scheduling. In Proc. of the 29th Annual Intl. Symp. on Microarchitecture (MICRO 29), pages 274–286, Paris, December 2–4, 1996.
