Software Pipelining: An Effective Scheduling Technique for VLIW Machines

Monica Lam

Department of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213

Abstract

This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

This paper extends previous results of software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained.

The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.

(The research was supported in part by the Defense Advanced Research Projects Agency (DOD), monitored by the Space and Naval Warfare Systems Command under Contract N00039-87-C-0251, and in part by the Office of Naval Research under Contracts N00014-87-K-0385 and N00014-87-K-0533.)

1. Introduction

A VLIW (very long instruction word) machine [5, 11] is similar to a horizontally microcoded machine in that the data path consists of multiple, possibly pipelined, functional units, each of which can be independently controlled through dedicated fields in a "very long" instruction. The distinctive feature of VLIW architectures is that these long instructions are the machine instructions. There is no additional layer of interpretation in which machine instructions are expanded into micro-instructions. While complex resource or field conflicts often exist between functionally independent operations in a horizontal microcode engine, a VLIW machine generally has an orthogonal instruction set and a higher degree of parallelism. The key to generating efficient code for a VLIW machine is global code compaction, that is, the compaction of code across basic blocks [12]. In fact, the VLIW architecture was developed from the study of the global code compaction technique, trace scheduling [10].

The thesis of this paper is that software pipelining [24, 25, 30] is a viable alternative technique for scheduling VLIW processors. In software pipelining, iterations of a loop in a source program are continuously initiated at constant intervals without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of their computation, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. The advantage of software pipelining is that optimal performance can be achieved with compact object code.

A drawback of software pipelining is its complexity: the problem of finding an optimal schedule is NP-complete.
(This can be shown by transforming the resource constrained scheduling problem [14] to the software pipelining problem.) There have been two approaches in response to the complexity of this problem: (1) change the architecture, and thus the characteristics of the constraints, so that the problem becomes tractable, and (2) use heuristics. The first approach is used in the polycyclic [25] and Cydrome's Cydra architecture: a specialized crossbar is used to make optimizing loops without data dependencies between iterations tractable. However, this hardware feature is expensive; and, when inter-iteration dependency is present in a loop, exhaustive search on the strongly connected components of the data flow graph is still necessary [16]. The second approach is used in the FPS-164 compiler [30]. Software pipelining is applied to a restricted set of loops, namely those containing a single Fortran statement. In other words, at most one inter-iteration data dependency relationship can be present in the flow graph. The results were that near-optimal results can be obtained cheaply, without the specialized hardware.

This paper shows that software pipelining is a practical, efficient, and general technique for scheduling the parallelism in a VLIW machine. We have extended previous results of software pipelining in two ways. First, we show that near-optimal results can often be obtained for loops containing both intra- and inter-iteration data dependency, using software heuristics. We have improved the scheduling heuristics and introduced a new optimization called modulo variable expansion. The latter implements part of the functionality of the specialized hardware proposed in the polycyclic machine, thus allowing us to achieve similar performance.
Second, this paper proposes a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. Scheduling techniques previously defined for basic blocks can then be applied across basic blocks. The significance of hierarchical reduction is threefold: First, conditional statements no longer constitute a barrier to code motion, so code in innermost loops containing conditional statements can be compacted. Second, and more importantly, software pipelining can be applied to arbitrarily complex loops, including those containing conditional statements. Third, hierarchical reduction diminishes the penalty of short loops: scalar code can be scheduled with the prolog and epilog of a pipelined loop, and we can even software pipeline the second level loop as well. The overall result is that a consistent speed up is obtained whenever parallelism is available across loop iterations.

Software pipelining, as addressed here, is the problem of scheduling the operations within an iteration, such that the iterations can be pipelined to yield optimal throughput. Software pipelining has also been studied in other contexts. The software pipelining algorithms proposed by Su et al. [27, 28], and by Aiken and Nicolau [1], assume that the schedules for the iterations are given and cannot be changed. Ebcioglu proposed a software pipelining algorithm to generate code for a hypothetical machine with infinitely many hardware resources [7]. Lastly, Weiss and Smith compared the results of using loop unrolling and software pipelining to generate scalar code for the Cray-1S architecture [31]. However, their software pipelining algorithm only overlaps the computation from at most two iterations. The unfavorable results obtained for software pipelining can be attributed to the particular algorithm rather than to the software pipelining approach.

The techniques described in this paper have been validated by the implementation of a compiler for the Warp machine. Warp [4] is a high-performance, programmable systolic array developed by Carnegie Mellon and General Electric, our industrial partner. The Warp array is a linear array of VLIW processors, each capable of a peak computation rate of 10 million floating-point operations per second (10 MFLOPS). A Warp array typically consists of ten processors, or cells, and thus has an aggregate bandwidth of 100 MFLOPS.

Each Warp cell has its own sequencer and program memory. Its data path consists of a floating-point multiplier, a floating-point adder, an integer ALU, three register files (one for each arithmetic unit), a 512-word queue for each of the two inter-cell data communication channels, and a 32 Kword data memory. All these components are connected through a crossbar, and can be programmed to operate concurrently via wide instructions of over 200 bits. The multiplier and adder are both 5-stage pipelined; together with the 2-cycle delay through the register file, multiplications and additions take 7 cycles to complete.

The machine is programmed using a language called W2. In W2, conventional Pascal-like control constructs are used to specify the cell programs, and asynchronous communication primitives are used to specify inter-cell communication. The Warp machine and the W2 compiler have been used extensively for about two years, in many applications such as low-level vision for robot vehicle navigation, image and signal processing, and scientific computing [2, 3]. Our previous papers presented an overview of the compiler and described an array level optimization that supports efficient fine-grain parallelism among cells [15, 20]. This paper describes the scheduling techniques used to generate code for the parallel and pipelined functional units in each cell.

This paper consists of three parts. Part I describes the software pipelining algorithm for loops containing straight-line loop bodies, focusing on the extensions and improvements. Part II describes the hierarchical reduction approach, and shows how software pipelining can be applied to all loops. Part III contains an evaluation and a comparison with the trace scheduling technique.

2. Simple loops

The concept of software pipelining can be illustrated by the following example. Suppose we wish to add a constant to a vector of data. Assuming that the addition is one-stage pipelined, the most compact sequence of instructions for a single iteration is:

    1    Read
    2    Add
    3
    4    Write

Different iterations can proceed in parallel to take advantage of the parallelism in the data path. In this example, an iteration can be initiated every cycle, and this optimal throughput can be obtained with the following piece of code:

    1       Read
    2       Add     Read
    3               Add     Read
    4   L:  Write   Add     Read    CJump L
    5       Write   Add
    6       Write
    7       Write

Instructions 1 to 3 are called the prolog: a new iteration is initiated every instruction cycle and executes concurrently with all previously initiated iterations. The steady state is reached in cycle 4, and this state is repeated until all iterations have been initiated. In the steady state, four iterations are in progress at the same time, with one iteration starting up and one finishing off every cycle. (The operation CJump L branches back to label L unless all iterations have been initiated.) On leaving the steady state, the iterations currently in progress are completed in the epilog, instructions 5 through 7. The software pipelined loop in this example executes at the optimal throughput rate of one iteration per instruction cycle, which is four times the speed of the original program. The potential gain of the technique is even greater for data paths with higher degrees of pipelining and parallelism. In the case of the Warp cell, software pipelining speeds up this loop by nine times.

2.1. Problem statement

Software pipelining is unique in that the pipeline stages in the functional units of the data path are not emptied at iteration boundaries; the pipelines are filled and drained only on entering and exiting the loop. The significance is that optimal throughput is possible with this approach.

The objective of software pipelining is to minimize the interval at which iterations are initiated; this initiation interval [25] determines the throughput for the loop. The basic units of scheduling are minimally indivisible sequences of micro-instructions. In the example above, since the result of the addition must be written precisely two cycles after the computation is initiated, the add and the write operations are grouped as one indivisible sequence. While the sequence is indivisible, it can overlap with the execution of other sequences. The minimally indivisible sequences that make up the computation of an iteration are modeled as nodes in a graph. Data dependencies between these sequences are mapped onto precedence constraints between the corresponding nodes; associated with each node is a resource reservation table indicating the resources used in each time step of the sequence. To ensure a compact steady state, two more constraints are imposed: the initiation interval between all consecutive iterations must be the same, and the schedule for each individual iteration must be identical. In other words, the problem is to schedule the operations within an iteration, such that the same schedule can be pipelined with the shortest, constant initiation interval.

The scheduling constraints in software pipelining are defined in terms of the initiation interval:

1. Resource constraints. If iterations in a software pipelined loop are initiated every s cycles, then every s th instruction in the schedule of an iteration is executed simultaneously, each from a different iteration. The total resource requirement of all instructions that are s cycles apart thus cannot exceed the available resources. A modulo resource reservation table can be used to represent the resource usage of the steady state by mapping the resource usage at time t to that at time t mod s in the modulo resource reservation table.

2. Precedence constraints. Consider the following example:

    FOR i := 1 TO 100 DO
    BEGIN
        a := a + 1.0;
    END

The value of a must first be accessed before the store operation, and the store operation must complete before the data is accessed in the second iteration. We model the dependency relationship by giving each edge in the graph two attributes: a minimum iteration difference and a delay. When we say that the minimum iteration difference on an edge (u, v) is p and the delay is d, we mean that node v must execute d cycles after node u from the p th previous iteration. Let σ: V -> N be the schedule function of a node; then

    σ(v) - (σ(u) - s*p) >= d,    that is,    σ(v) - σ(u) >= d - s*p,

where s is the initiation interval. Since a node cannot depend on a value from a future iteration, the minimum iteration difference is always nonnegative. The iteration difference for an intra-iteration dependency is 0, meaning that node v must follow node u in the same iteration. As illustrated by the example, inter-iteration data dependencies may introduce cycles into the precedence constraint graph.
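Both constraints can be checked mechanically. The following sketch (illustrative Python, not the Warp compiler's code; the node and table representations are assumptions) tests whether placing a node at a candidate time respects the modulo resource reservation table and the inequality σ(v) - σ(u) >= d - s*p for its already scheduled predecessors:

    def resources_ok(node_uses, time, s, modulo_table, limits):
        # node_uses maps an offset within the node's reservation table to the
        # resources used at that offset; limits gives the units of each
        # resource available per instruction.
        for offset, resources in node_uses.items():
            slot = (time + offset) % s          # steady-state slot of this use
            for r in resources:
                if modulo_table[slot].get(r, 0) + 1 > limits[r]:
                    return False
        return True

    def precedence_ok(v, time, s, schedule, in_edges):
        # Every incoming edge (u, d, p) requires sigma(v) - sigma(u) >= d - s*p.
        return all(time - schedule[u] >= d - s * p
                   for (u, d, p) in in_edges[v] if u in schedule)

    def place(node_uses, time, s, modulo_table):
        # Commit the node's resource usage to the modulo reservation table.
        for offset, resources in node_uses.items():
            slot = (time + offset) % s
            for r in resources:
                modulo_table[slot][r] = modulo_table[slot].get(r, 0) + 1

Here modulo_table is a list of s dictionaries, one per slot of the steady state, so usage at time t is accumulated at slot t mod s, exactly as described above.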

2.2. Scheduling algorithm

The definition of the scheduling constraints in terms of the initiation interval makes finding even an approximate solution to this NP-complete problem difficult. Since computing the minimum initiation interval is itself NP-complete, one approach is to first schedule the code using heuristics, and then determine the initiation interval permitted by the schedule. However, since the scheduling constraints are defined in terms of the initiation interval, if the initiation interval is not known at scheduling time, the schedule produced is unlikely to permit a good initiation interval.

To resolve this circularity, the FPS compiler uses an iterative approach [30]: first establish a lower and an upper bound on the initiation interval, then use binary search to find the smallest initiation interval for which a schedule can be found. (The length of a locally compacted iteration can serve as an upper bound; the calculation of a lower bound is described below.) We also use an iterative approach, but we use a linear search instead. The rationale is as follows: although the probability that a schedule can be found generally increases with the value of the initiation interval, schedulability is not monotonic [21]. Especially since empirical results show that, in the case of Warp, a schedule meeting the lower bound can often be found, sequential search is preferred.

A lower bound on the initiation interval can be calculated from the scheduling constraints as follows:

1. Resource constraints. If an iteration is initiated every s cycles, then the total number of resource units available in s cycles must at least cover the resource requirement of one iteration. Therefore, the bound on the initiation interval due to resource considerations is the maximum, over all resources, of the ratio between the total number of times the resource is used and the number of units available per instruction.

2. Precedence constraints. Cycles in the precedence constraints impose delays between operations from different iterations that are represented by the same node in the graph. The initiation interval must be large enough for such delays to be observed. We define the delay and minimum iteration difference of a path to be the sum of the delays and minimum iteration differences of the edges in the path, respectively. Let s be the initiation interval, and c be a cycle in the graph. Since σ(v) - σ(u) >= d(e) - s*p(e) for each edge e, summing around the cycle gives

    d(c) - s*p(c) <= 0.

We note that if p(c) = 0, then d(c) is necessarily less than 0 by the definition of a legal computation. Therefore, the bound on the initiation interval due to precedence considerations is

    max over all cycles c with p(c) != 0 of the ceiling of d(c)/p(c).
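The bounds and the linear search can be summarized in a short sketch (illustrative Python; try_schedule stands for the modulo list scheduling attempt described in the next two subsections and is assumed to return None on failure):

    import math

    def resource_lower_bound(total_uses, limits):
        # Maximum over all resources of ceil(total uses / units per instruction).
        return max(math.ceil(total_uses[r] / limits[r]) for r in limits)

    def precedence_lower_bound(cycles):
        # cycles is a list of (d(c), p(c)) pairs for the cycles of the graph;
        # only cycles with p(c) != 0 constrain the initiation interval.
        return max((math.ceil(d / p) for (d, p) in cycles if p > 0), default=1)

    def pipeline(nodes, total_uses, limits, cycles, try_schedule, upper_bound):
        lower = max(resource_lower_bound(total_uses, limits),
                    precedence_lower_bound(cycles))
        for s in range(lower, upper_bound + 1):   # linear, not binary, search
            schedule = try_schedule(nodes, s)
            if schedule is not None:
                return s, schedule
        return None                               # leave the loop unpipelined

The length of the locally compacted iteration serves as upper_bound, matching the choice made in the compiler.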

2.2.1. Scheduling acyclic graphs

The algorithm we use to schedule acyclic graphs for a target initiation interval is the same as that used in the FPS compiler, which itself is derived from the list scheduling algorithm used in basic block scheduling [9]. List scheduling is a non-backtracking algorithm: nodes are scheduled in a topological ordering, and are placed in the earliest possible time slot that satisfies all scheduling constraints with the partial schedule constructed so far. The software pipelining algorithm differs from list scheduling in that the modulo resource reservation table is used in determining whether there is a resource conflict. Also, by the definition of modulo resource usage, if we cannot schedule a node in s consecutive time slots due to resource conflicts, it will not fit in any slot with the current schedule. When this happens, the attempt to find a schedule for the given initiation interval is aborted and the scheduling process is repeated with a greater interval value.

2.2.2. Scheduling cyclic graphs

In adapting the scheduling algorithm for acyclic graphs to cyclic graphs, we face the following difficulties: a topological sort of the nodes does not exist in a cyclic graph; precedence constraints with nodes already scheduled cannot be satisfied by examining only those edges incident on those nodes; and the maximum height of a node, used as the priority function in list scheduling, is ill-defined. Experimentation with different schemes helped identify two desirable properties that a non-backtracking algorithm should have:

1. Partial schedules constructed at each point of the scheduling process should not violate any of the precedence constraints in the graph. In other words, were there no resource conflicts with the remaining nodes, each partial schedule should be a partial solution.

2. The heuristics must be sensitive to the initiation interval. An increased initiation interval value relaxes the scheduling constraints, and the scheduling algorithm must take advantage of this opportunity. It would be futile if the scheduling algorithm simply retried the same schedule that failed.

These properties are exhibited by the heuristics in the scheduling algorithm for acyclic graphs. A scheduling algorithm for cyclic graphs that satisfies these properties is presented below.

The following preprocessing step is first performed: find the strongly connected components in the graph [29], and compute the closure of the precedence constraints in each connected component by solving the all-points longest path problem for each component [6, 13]. This information is used in the iterative scheduling step. To avoid the cost of recomputing this information for each value of the initiation interval, we compute it only once in the preprocessing step, using a symbolic value to stand for the initiation interval [21].

As in the case of acyclic graphs, the main scheduling step is iterative. For each target initiation interval, the connected components are first scheduled individually. The original graph is then reduced by representing each connected component as a single vertex: the resource usage of the vertex represents the aggregate resource usage of its components, and edges connecting nodes from different connected components are represented by edges between the corresponding vertices. This reduced graph is acyclic, and the acyclic graph scheduling algorithm can then be applied.

The scheduling algorithm for connected components also follows the framework of list scheduling. The nodes in a connected component are scheduled in a topological ordering obtained by considering only the intra-iteration edges in the graph. By the definition of a connected component, assigning a schedule to a node limits the schedule of all other nodes in the component, both from below and from above. We define the precedence constrained range of a node, for a given partial schedule, as the legal time range in which the node can be scheduled without violating the precedence constraints of the graph. As each node is scheduled, we use the precomputed longest path information to update the precedence constrained range of each remaining node, substituting the actual value of the initiation interval for the symbolic one. A node is scheduled in the earliest possible time slot within its constrained range. If a node cannot be scheduled within its precedence constrained range, the scheduling attempt is considered unsuccessful. This algorithm possesses the first property described above: precedence constraints are satisfied by all partial schedules.

As nodes are scheduled in a topological ordering of the intra-iteration edges, the precedence constrained range of a node is bounded from above only by inter-iteration edges. As the initiation interval increases, so does the upper bound of the range in which a node is scheduled. Together with the strategy of scheduling a node as early as possible, the range in which a node can be scheduled increases as the initiation interval increases, and so does the likelihood of success. The scheduling problem approaches that of an acyclic graph as the value of the initiation interval increases. Therefore, the approach also satisfies the second property: the algorithm takes advantage of increased initiation interval values.

2.3. Modulo variable expansion

The idea of modulo variable expansion can be illustrated by the following code fragment, where a value is written into a register and used two cycles later:

    Def(R1)
    Op
    Use(R1)

If the same register is used by all iterations, then the write operation of an iteration cannot execute before the read operation in the preceding iteration. Therefore, the optimal throughput is limited to one iteration every two cycles. This code can be sped up by using different registers in alternating iterations:

        Def(R1)
        Op        Def(R2)
    L:  Use(R1)   Op        Def(R1)
        Use(R2)   Op        Def(R2)   CJump L
        Use(R1)   Op
        Use(R2)

We call this optimization of allocating multiple registers to a variable in the loop modulo variable expansion. This optimization is a variation of the variable expansion technique used in vectorizing compilers [18]. The variable expansion transformation identifies those variables that are redefined at the beginning of every iteration of a loop, and expands the variable into a higher dimension variable, so that each iteration can refer to a different location. Consequently, the uses of the variable in different iterations are independent, and the loop can be vectorized. Modulo variable expansion takes advantage of the flexibility of VLIW machines in scalar computation, and reduces the number of locations allocated to a variable by reusing the same location in non-overlapping iterations. The small set of values can even reside in register files, cutting down on both the memory traffic and the latency of the computation.

Without modulo variable expansion, the length of the steady state of a pipelined loop is simply the initiation interval. When modulo variable expansion is applied, code sequences for consecutive iterations differ in the registers used, thus lengthening the steady state. If there are n repeating code sequences, the steady state needs to be unrolled n times.

The algorithm of modulo variable expansion is as follows. First, we identify those variables that are redefined at the beginning of every iteration. Next, we pretend that every iteration of the loop has a dedicated register location for each qualified variable, and remove all inter-iteration precedence constraints between operations on these variables. Scheduling then proceeds as normal. The resulting schedule is then used to determine the actual number of registers that must be allocated to each variable. The lifetime of a register variable is defined as the duration between the first assignment into the variable and its last use. If the lifetime of a variable is l, and an iteration is initiated every s cycles, then at least the ceiling of l/s values must be kept alive concurrently, in that many locations.

If each variable v_i is allocated its minimum number of locations, q_i, the degree of unrolling is given by the lowest common multiple of the q_i. Even for small values of q_i, the least common multiple can be quite large and can lead to an intolerable increase in code size. The code size can be reduced by trading off register space. We observe that the minimum degree of unrolling, u, needed to implement the same schedule is simply the maximum of the q_i. This minimum degree of unrolling can be achieved by setting the number of registers allocated to variable v_i to be the smallest factor of u that is no smaller than q_i, i.e.,

    min n, where n >= q_i and u mod n = 0.

The increase in register space is much more tolerable than the increase in code size of the first scheme for a machine like Warp.
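A sketch of this register allocation and unrolling computation (illustrative Python; the lifetimes are assumed to have been measured on the completed schedule):

    import math

    def modulo_variable_expansion(lifetimes, s):
        # lifetimes: {variable: lifetime in cycles}, s: initiation interval.
        q = {v: max(1, math.ceil(l / s)) for v, l in lifetimes.items()}
        u = max(q.values())                       # minimum degree of unrolling
        # Smallest factor of u that is no smaller than q_i, for each variable.
        locations = {v: min(n for n in range(qi, u + 1) if u % n == 0)
                     for v, qi in q.items()}
        return u, locations

For instance, with q values of 2 and 3, the least-common-multiple rule would unroll the steady state 6 times, while this rule unrolls it only 3 times at the cost of giving the first variable 3 register locations instead of 2.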

Since we cannot determine the number of registers allocated to each variable until all uses of registers have been scheduled, we cannot determine whether the register requirement of a partial schedule can be satisfied. Moreover, once given a schedule, it is very difficult to reduce its register requirement. Indivisible micro-operation sequences make it hard to insert code in a software pipelined loop to spill excess register data into memory.

In practice, we can assume that the target machine has a large number of registers; otherwise, the resulting data memory bottleneck would render the use of any global compaction technique meaningless. The Warp machine has two 31-word register files for the floating-point units, and one 64-word register file for the ALU. Empirical results show that they are large enough for almost all the user programs developed [21]. Register shortage is a problem for a small fraction of the programs; however, these programs invariably have loops that contain a large number of independent operations per iteration. In other words, these programs are amenable to other, simpler scheduling techniques that only exploit parallelism within an iteration. Thus, when register allocation becomes a problem, software pipelining is not as crucial. The best approach is therefore to use software pipelining aggressively, by assuming that there are enough registers. When we run out of registers, we then resort to simple techniques that serialize the execution of loop iterations. Simpler scheduling techniques are more amenable to register spilling techniques.

2.4. Code size

The code size increase due to software pipelining is reasonable considering the speed up that can be achieved. If the number of iterations is known at compile time, the code size of a pipelined loop is within three times the code size for one iteration of the loop [21]. If the number of iterations is not known at compile time, then additional code must be generated to handle the cases when there are so few iterations in the loop that the steady state is never reached, and when there are no more iterations to initiate in the middle of an unrolled steady state.

To handle these cases, we generate two loops: a pipelined version to execute most of the iterations, and an unpipelined version to handle the rest. Let k be the number of iterations started in the prolog of the pipelined loop, u be the degree of unrolling, and n be the number of iterations to be executed. Before the loop is executed, the values of n and k are compared. If n < k, then all n iterations are executed using the unpipelined code. Otherwise, we execute (n - k) mod u iterations using the unpipelined code, and the rest on the pipelined loop. At most about l/s iterations are executed in the unpipelined mode, where l is the length of an iteration and s is the initiation interval. Using this scheme, the total code size is at most four times the size of the unpipelined loop.
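A sketch of the dispatch between the two versions (illustrative Python; run_pipelined and run_unpipelined are stand-ins for the two emitted code sequences, not compiler-generated names):

    def run_loop(n, k, u, run_pipelined, run_unpipelined):
        # n: iterations requested, k: iterations started in the prolog,
        # u: degree of unrolling of the steady state.
        if n < k:
            run_unpipelined(n)          # too few iterations to fill the pipeline
        else:
            rest = (n - k) % u          # what the unrolled steady state cannot absorb
            run_unpipelined(rest)
            run_pipelined(n - rest)     # prolog + unrolled steady state + epilog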

A more important metric than the total code length is the length of the innermost loop. An increase in code length by a factor of four typically does not pose a problem for the machine storage system. However, in machines with instruction buffers and caches, it is most important that the steady state of a pipelined loop fits into the buffers or caches. Although software pipelining increases the total code size, the steady state of the loop is typically much shorter than the length of an unpipelined loop. Thus, we can conclude that the increase in code size due to software pipelining is not an issue.

3. Hierarchical reduction

The motivation for the hierarchical reduction technique is to make software pipelining applicable to all innermost loops, including those containing conditional statements. The proposed approach schedules the program hierarchically, starting with the innermost control constructs. As each construct is scheduled, the entire construct is reduced to a simple node representing all the scheduling constraints of its components with respect to other constructs. This node can then be scheduled just like a simple node within the surrounding control construct. The scheduling process is complete when the entire program is reduced to a single node.

The hierarchical reduction technique is derived from the scheduling scheme previously proposed by Wood [32]. In Wood's approach, scheduled constructs are modeled as black boxes taking unit time. Operations outside the construct can move around it but cannot execute concurrently with it. Here, the resource utilization and precedence constraints of the reduced construct are visible, permitting it to be scheduled in parallel with other operations. This is essential for software pipelining loops with conditional statements effectively.

3.1. Conditional statements

The procedure for scheduling conditional statements is as follows. The THEN and ELSE branches of a conditional statement are first scheduled independently. The entire conditional statement is then reduced to a single node whose scheduling constraints represent the union of the scheduling constraints of the two branches. The length of the new node is the maximum of that of the two branches; the value of each entry in the resource reservation table is the maximum of the corresponding entries in the tables of the two branches. Precedence constraints between operations inside the branches and those outside must now be replaced by constraints between the node representing the entire construct and those outside. The attributes of the constraints remain the same.

The node representing the conditional construct can be treated like any other simple node within the surrounding construct. A schedule that satisfies the union of the constraints of both branches must also satisfy those of either branch. At code emission time, two sets of code, corresponding to the two branches, are generated. Any code scheduled in parallel with the conditional statement is duplicated in both branches. Although the two branches are padded to the same length at code scheduling time, it is not necessary that the lengths of the emitted code for the two branches be identical. If a machine instruction does not contain any operations for a particular branch, the instruction can simply be omitted for that branch. The simple representation of conditional statements as straight-line sequences in the scheduling process makes it easy to overlap conditional statements with any other control constructs.

The above strategy is optimized for handling short conditional statements in innermost loops executing on highly parallel hardware. The assumption is that there are more unused than used resources in an unpipelined schedule, and that it is more profitable to satisfy the union of the scheduling constraints of both branches all the time, so as not to reduce the opportunity for parallelism among operations outside the conditional statement. For those cases that violate this assumption, we can simply mark all resources in the node representing the conditional statement as used. By omitting empty machine instructions at code emission time, the short conditional branch will remain short. Although this scheme disallows overlap between the conditional statement and all other operations, all other forms of code motion around the construct can still take place.
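The reduction step can be sketched as follows (illustrative Python; each branch's reservation table is assumed to be a list of per-cycle dictionaries mapping a resource to the units it uses):

    def reduce_conditional(then_table, else_table):
        # The reduced node is as long as the longer branch, and each entry is
        # the maximum of the corresponding entries of the two branches, i.e.
        # the union of the branches' scheduling constraints.
        length = max(len(then_table), len(else_table))
        merged = []
        for t in range(length):
            a = then_table[t] if t < len(then_table) else {}
            b = else_table[t] if t < len(else_table) else {}
            merged.append({r: max(a.get(r, 0), b.get(r, 0))
                           for r in set(a) | set(b)})
        return merged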
3.2. Loops

The prolog and epilog of a software pipelined loop can be overlapped with other operations outside the loop. This optimization can again be achieved by reducing a looping construct to a simple node that represents the resource and precedence constraints for the entire loop; only one iteration of the steady state is represented. The steady state of the loop, however, should not be overlapped with other operations. To prevent this, all resources in the steady state are marked as consumed.

3.3. Global code motions

Once conditional statements and loops are represented as straight-line sequences of code, scheduling techniques formerly applicable only to basic blocks, such as software pipelining and list scheduling, can be applied to compound control constructs. Global code motions automatically take place as the enclosing construct is scheduled according to the objective of the scheduling technique.

The significance of hierarchical reduction is that it permits a consistent performance improvement to be obtained for all programs, not just those programs that have long innermost loops with straight-line loop bodies. More precisely, hierarchical reduction has three major effects:

1. The most important benefit of hierarchical reduction is that software pipelining can be applied to loops with conditional statements. This allows software pipelining to be applied to all innermost loops. Overlapping different iterations in a loop is an important source of parallelism.

2. Hierarchical reduction is also important in compacting long loop bodies containing conditional statements. In these long loop bodies, there is often much parallelism to be exploited within an iteration. Operations outside a conditional statement can move around and into the branches of the statement. Even branches of different conditional statements can be overlapped.

3. Hierarchical reduction also minimizes the penalty of short vectors, or loops with small numbers of iterations. The prolog and epilog of a loop can be overlapped with scalar operations outside the loop; the epilog of a loop can be overlapped with the prolog of the next loop; and lastly, software pipelining can be applied even to an outer loop.

4. Evaluation

To provide an overall picture of the compiler's performance, we first give statistics collected from a large sample of user programs. These performance figures show the effect of the software pipelining and hierarchical reduction techniques on complete programs. To provide more detailed information on the performance of software pipelining, we also include the performance of the Livermore loops on a single Warp cell.

4.1. Performance of users' programs

The compiler for the Warp machine has been in use for about two years, and a large number of programs in robot navigation, low-level vision, and signal processing and scientific computing have been developed [2, 3]. Of these, a sample of 72 programs has been collected and analyzed [21]. Table 4-1 lists the performance of some representative programs, and the performance of all 72 programs is graphically presented in Figure 4-1.

    Task (all images are 512x512)             Time (ms)   MFLOPS
    100x100 matrix multiplication                           79.4
    512x512 complex FFT (1 dimension)                       71.9
    3x3 convolution                                         65.7
    Hough transform                                         42.2
    Local selective averaging                    406        39.2
    Shortest path, Warshall's algorithm          104        24.3
      (350 nodes, 10 iterations)
    Roberts operator                             192        15.2

    Table 4-1: Performance on Warp array

All application programs in the experiment have compile-time loop bounds, and their execution speeds were determined statically by assuming that half of the data dependent branches in conditional statements were taken. All the programs in the sample are homogeneous code, that is, the same cell program is executed by all cells. Except for a short setup time at the beginning, these programs never stall on input or output. Therefore, the computation rate for each cell is simply one-tenth of the reported rate for the array.

    Figure 4-1: Performance of 72 users' programs

To study the significance of software pipelining and hierarchical reduction, we compare the performance obtained against that obtained by only compacting individual basic blocks. The speed up is shown in Figure 4-2. The average factor of increase in speed is three. Programs are classified according to whether they contain conditional statements; 42 of the 72 programs contain conditional statements. We observe that programs containing conditional statements are sped up more. The reason is that conditional statements break up the computation into small basic blocks, making code motions across basic blocks even more important.

    Figure 4-2: Speed up over locally compacted code

In this sample of programs, 75% of all the loops are scheduled with an initiation interval matching the theoretical lower bound, and 93% of the loops containing no conditional statements or connected components are pipelined perfectly. The presence of conditional statements and connected components makes calculating a tight lower bound for the initiation interval and finding an optimal schedule difficult. In particular, branches of a conditional statement are first compacted as much as possible, with no regard to the initiation interval of the loop. Software pipelining is then applied to the node representing the conditional statement, treating its operations as indivisible. This approach minimizes the length of the branches to avoid code explosion, but increases the minimum initiation interval of the loop. Of the 25% of the loops for which the achieved initiation interval is greater than the lower bound, the average efficiency is 75% [21].

4.2. Livermore loops

The performance of the Livermore loops [23] on a single Warp cell is presented in Table 4-2. The Fortran programs were translated manually into the W2 syntax. (The control constructs of W2 are similar to those of Pascal.) The translation was straightforward except for kernels 15 and 16, which required the code to be completely restructured. The INVERSE and SQRT functions expanded into 7 and 19 floating-point operations, respectively. The EXP function in loop 22 expanded into a calculation containing 19 conditional statements. The large number of conditional statements made the loop not pipelinable. In fact, the scheduler did not even attempt to pipeline this loop, because the length of the loop (331 instructions) was beyond the threshold that it uses to decide whether pipelining is feasible. Loops 16 and 20 were also not pipelined, because the calculated lower bounds on the initiation interval were within 99% of the length of the unpipelined loop.

    Table 4-2: Performance of Livermore loops
    (columns: Kernel, MFLOPS, Efficiency relative to the lower bound, Speed up;
     kernels marked * used compiler directives to disambiguate array references,
     ** merged multiple loops into one,
     *** expanded the EXP function into 19 conditional statements)

The MFLOPS rates given in the second column are for single-precision floating-point arithmetic. The third column contains lower bound figures on the efficiency of the software pipelining technique. As they were obtained by dividing the lower bound on the initiation interval by the achieved interval value, they represent a lower bound on the achieved efficiency. If a kernel contains multiple loops, the figure given is the mean calculated by weighting each loop by its execution time. The speed up factors in the fourth column are the ratios of the execution times between an unpipelined and a pipelined kernel. All branches of conditional statements were assumed to be taken half the time.

The performance of the Livermore loops is consistent with that of the users' programs. Except for kernel 22, which has an extraordinary amount of conditional branching due to the particular EXP library function, near-optimal, and often optimal, code is obtained. The MFLOPS rates achieved for the different loops, however, vary greatly. This is due to the difference in the available parallelism within the iterations of a loop.

There are two major factors that determine the maximum achievable MFLOPS rate: data dependency and the critical resource bottleneck.

1. Data dependency. Consider the following loop:

    FOR i := 0 TO n DO
    BEGIN
        a := a*b + c;
    END;

Because of the data dependency, each multiplication and addition must be performed serially. As additions and multiplications are seven-stage pipelined, the maximum computation rate achievable by the machine for this loop is only 0.7 MFLOPS (a worked version of this bound is sketched after this list).

2. Critical resource bottleneck. A program that does not contain any multiplication operations can at most sustain a 5 MFLOPS execution rate on a Warp cell, since the multiplier is idle all the time. In general, the MFLOPS rate achievable for a particular program is limited by the ratio of the total number of additions and multiplications in the computation to the use count of the most heavily used resource.

Inter-iteration data dependencies, or recurrences, do not necessarily mean that the code is serialized. This is one important advantage that VLIW architectures have over vector machines. As long as there are other operations that can execute in parallel with the serial computation, a high computation rate can still be obtained.
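The 0.7 MFLOPS figure quoted above is just the precedence lower bound of Section 2.2 applied to this recurrence. The sketch below assumes a 5 MHz instruction rate per Warp cell, which is implied by the 10 MFLOPS peak of one adder plus one multiplier; the 7-cycle latencies are those given in the introduction.

    import math

    # Recurrence a := a*b + c: the multiply feeds the add, and the add feeds
    # the multiply of the next iteration, so the single cycle in the graph has
    # delay d(c) = 7 + 7 = 14 and iteration difference p(c) = 1.
    d, p = 7 + 7, 1
    min_interval = math.ceil(d / p)        # at best one iteration every 14 cycles

    flops_per_iteration = 2                # one multiply and one add
    instruction_rate = 5e6                 # assumed 5 MHz instruction rate per cell

    mflops = flops_per_iteration * instruction_rate / min_interval / 1e6
    print(round(mflops, 2))                # about 0.71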

5. Comparison with trace scheduling

The primary idea in trace scheduling [10] is to optimize the more frequently executed traces. The procedure is as follows: first, identify the most likely execution trace, then compact the instructions in the trace as if they belonged to one big basic block. The large block size means that there is plenty of opportunity to find independent activities that can be executed in parallel. The second step is to add compensation code at the entrances and exits of the trace to restore the semantics of the original program for other traces. This process is then repeated until all traces whose probabilities of execution are above some threshold are scheduled. A straightforward scheduling technique is used for the rest of the traces.

The strength of trace scheduling is that all operations for the entire trace are scheduled together, and all legal code motions are permitted. In fact, all other forms of optimization, such as common subexpression elimination across all operations in the trace, can be performed. On the other hand, major execution traces must exist for this scheduling technique to succeed. In trace scheduling, the more frequently executed traces are scheduled first. The code motions performed optimize the more frequently executed traces, at the expense of the less frequently executed ones. This may be a problem with data dependent conditional statements. Also, one major criticism of trace scheduling is the possibility of exponential code explosion [17, 19, 22, 26].

The major difference between our approach and trace scheduling is that we retain the control structure of the computation. By retaining information on the control structure of the program, we can exploit the semantics of the different control constructs better, and control the code motion and hence the code explosion. Another difference is that our scheduling algorithm is designed for block-structured constructs, whereas trace scheduling does not have similar restrictions. The following compares the two techniques in scheduling loop branching and conditional branching separately.

5.1. Loop branches

Trace scheduling is applied only to the body of a loop; that is, a major trace does not extend beyond the loop body boundary. To get enough parallelism in the trace, trace scheduling relies primarily on source code unrolling. At the end of each iteration in the original source is an exit out of the loop; the major trace is constructed by assuming that the exits off the loop are not taken. If the number of iterations is known at compile time, then all but one exit off the loop are removed.

Software pipelining is more attractive than source code unrolling for two reasons. First, software pipelining offers the possibility of achieving optimal throughput. In unrolling, filling and draining the hardware pipelines at the beginning and the end of each unrolled iteration make optimal performance impossible. The second reason, a practical concern, is perhaps more important. In trace scheduling, the performance almost always improves as more iterations are unrolled. The degree of unrolling for a particular application often requires experimentation. As the degree of unrolling increases, so do the problem size and the final code size.

In software pipelining, the object code is sometimes unrolled at code emission time to implement modulo variable expansion; therefore, the compilation time is unaffected. Furthermore, unlike source unrolling, there is an optimal degree of unrolling for each schedule, and it can easily be determined once the schedule is complete.

5.2. Conditional statements

In the case of data dependent conditional statements, the premise that there is a most frequently executed trace is questionable. While it is easy to predict the outcome of a conditional branch at the end of an iteration in a loop, outcomes for all other branches are difficult to predict.

The generality of trace scheduling makes code explosion difficult to control. Some global code motions require operations scheduled in the main trace to be duplicated in the less frequently executed traces. Since basic block boundaries are not visible when compacting a trace, code motions that require large amounts of copying, and may not even be significant in reducing the execution time, may be introduced. Ellis showed that exponential code explosion can occur by reordering conditional statements that are data independent of each other [8]. Massive loop unrolling has a tendency to increase the number of possibly data independent conditional statements. Code explosion can be controlled by inserting additional constraints between branching operations. For example, Su et al. suggested restricting the motions of operations that are not on the critical path of the trace [26].

In our approach to scheduling conditional statements, the objective is to minimize the effect of conditional statements on the parallel execution of other constructs. By modeling the conditional statement as one unit, we can software pipeline all innermost loops. The resources taken to execute the conditional statement may be as much as the sum of both branches. However, the amount of wasted cycles is bounded by the operations within the conditional statement.

6. Concluding remarks

This paper shows that near-optimal, and sometimes optimal, code can be generated for VLIW machines. We use a combination of two techniques: (1) software pipelining, a specialized scheduling technique for iterative constructs, and (2) hierarchical reduction, a simple, unified approach that allows multiple basic blocks to be manipulated like operations within a basic block. While software pipelining is the main reason for the speed up in programs, hierarchical reduction makes it possible to attain consistently good results even on programs containing conditional statements in innermost loops and innermost loops with small numbers of iterations. Our experience with the Warp compiler is that the generated code is comparable to, if not better than, handcrafted microcode.

The Warp processors do not have any specialized hardware support for software pipelining. Therefore, the results reported here are likely to apply to other data paths with similar degrees of parallelism. But what kind of performance can be obtained if we scale up the degree of parallelism and pipelining in the architecture? We observe that the limiting factor in the performance of Warp is the available parallelism among iterations in a loop. For those loops whose iterations are independent, scaling up the hardware is likely to give a similar factor of increase in performance. However, the speed of all other loops is limited by the cycle length in their precedence constraint graphs. The control of all functional units by a central sequencer makes it difficult for VLIW architectures to exploit forms of parallelism other than the parallelism within a loop. This suggests that there is a limit to the scalability of the VLIW architecture. Further experimentation is necessary to determine this limit.


Acknowledgments

The research reported in this paper is part of my Ph.D. thesis. I especially want to thank my thesis advisor, H. T. Kung, for his advice in the past years. I would like to thank all the members of the Warp project, and in particular Thomas Gross for his effort in the W2 compiler. I also want to thank Jon Webb, C. H. Chang and P. S. Tseng for their help in obtaining the performance numbers.

References

1. Aiken, A. and Nicolau, A. Perfect Pipelining: A New Loop Parallelization Technique. Cornell University, Oct. 1987.
2. Annaratone, M., Bitz, F., Clune, E., Kung, H. T., Maulik, P., Ribas, H., Tseng, P., and Webb, J. Applications of Warp. Proc. Compcon Spring 87, San Francisco, Feb. 1987, pp. 272-275.
3. Annaratone, M., Bitz, F., Deutch, J., Hamey, L., Kung, H. T., Maulik, P. C., Tseng, P., and Webb, J. A. Applications Experience on Warp. Proc. 1987 National Computer Conference, AFIPS, Chicago, June 1987, pp. 149-158.
4. Annaratone, M., Arnould, E., Gross, T., Kung, H. T., Lam, M., Menzilcioglu, O. and Webb, J. A. "The Warp Computer: Architecture, Implementation and Performance". IEEE Transactions on Computers C-36, 12 (December 1987).
5. Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. A VLIW Architecture for a Trace Scheduling Compiler. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1987, pp. 180-192.
6. Dantzig, G. B., Blattner, W. O. and Rao, M. R. All Shortest Routes from a Fixed Origin in a Graph. Theory of Graphs, Rome, July 1967, pp. 85-90.
7. Ebcioglu, K. A Compilation Technique for Software Pipelining of Loops with Conditional Jumps. Proc. 20th Annual Workshop on Microprogramming, Dec. 1987.
8. Ellis, J. R. Bulldog: A Compiler for VLIW Architectures. Ph.D. Th., Yale University, 1985.
9. Fisher, J. A. The Optimization of Horizontal Microcode Within and Beyond Basic Blocks: An Application of Processor Scheduling with Resources. Ph.D. Th., New York Univ., Oct. 1979.
10. Fisher, J. A. "Trace Scheduling: A Technique for Global Microcode Compaction". IEEE Trans. on Computers C-30, 7 (July 1981), 478-490.
11. Fisher, J. A., Ellis, J. R., Ruttenberg, J. C. and Nicolau, A. Parallel Processing: A Smart Compiler and a Dumb Machine. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, Montreal, Canada, June 1984, pp. 37-47.
12. Fisher, J. A., Landskov, D. and Shriver, B. D. Microcode Compaction: Looking Backward and Looking Forward. Proc. 1981 National Computer Conference, 1981.
13. Floyd, R. W. "Algorithm 97: Shortest Path". Comm. ACM 5, 6 (1962), 345.
14. Garey, M. R. and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
15. Gross, T. and Lam, M. Compilation for a High-performance Systolic Array. Proc. ACM SIGPLAN '86 Symposium on Compiler Construction, June 1986, pp. 27-38.
16. Hsu, P. Highly Concurrent Scalar Processing. Ph.D. Th., University of Illinois at Urbana-Champaign, 1986.
17. Isoda, S., Kobayashi, Y., and Ishida, T. "Global Compaction of Horizontal Microprograms Based on the Generalized Data Dependency Graph". IEEE Trans. on Computers C-32, 10 (October 1983), 922-933.
18. Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B. and Wolfe, M. Dependence Graphs and Compiler Optimizations. Proc. ACM Symposium on Principles of Programming Languages, January 1981, pp. 207-218.
19. Lah, J. and Atkin, E. Tree Compaction of Microprograms. Proc. 16th Annual Workshop on Microprogramming, Oct. 1982, pp. 23-33.
20. Lam, M. Compiler Optimizations for Asynchronous Systolic Array Programs. Proc. Fifteenth Annual ACM Symposium on Principles of Programming Languages, Jan. 1988.
21. Lam, M. A Systolic Array Optimizing Compiler. Ph.D. Th., Carnegie Mellon University, May 1987.
22. Linn, J. L. SRDAG Compaction - A Generalization of Trace Scheduling to Increase the Use of Global Context Information. Proc. 16th Annual Workshop on Microprogramming, 1983, pp. 11-22.
23. McMahon, F. H. Lawrence Livermore National Laboratory FORTRAN Kernels: MFLOPS.
24. Patel, J. H. and Davidson, E. S. Improving the Throughput of a Pipeline by Insertion of Delays. Proc. 3rd Annual Symposium on Computer Architecture, Jan. 1976, pp. 159-164.
25. Rau, B. R. and Glaeser, C. D. Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance Scientific Computing. Proc. 14th Annual Workshop on Microprogramming, Oct. 1981, pp. 183-198.
26. Su, B., Ding, S. and Jin, L. An Improvement of Trace Scheduling for Global Microcode Compaction. Proc. 17th Annual Workshop on Microprogramming, Dec. 1984, pp. 78-85.
27. Su, B., Ding, S., Wang, J. and Xia, J. GURPR - A Method for Global Software Pipelining. Proc. 20th Annual Workshop on Microprogramming, Dec. 1987, pp. 88-96.
28. Su, B., Ding, S. and Xia, J. URPR - An Extension of URCR for Software Pipelining. Proc. 19th Annual Workshop on Microprogramming, Oct. 1986, pp. 104-108.
29. Tarjan, R. E. "Depth first search and linear graph algorithms". SIAM J. Computing 1, 2 (1972), 146-160.
30. Touzeau, R. F. A Fortran Compiler for the FPS-164 Scientific Computer. Proc. ACM SIGPLAN '84 Symp. on Compiler Construction, June 1984, pp. 48-57.
31. Weiss, S. and Smith, J. E. A Study of Scalar Compilation Techniques for Pipelined Supercomputers. Proc. Second Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1987, pp. 105-109.
32. Wood, G. Global Optimization of Microprograms Through Modular Control Constructs. Proc. 12th Annual Workshop on Microprogramming, 1979, pp. 1-6.
