Software Pipelining for Transport-Triggered Architectures

Jan Hoogerbrugge    Henk Corporaal    Hans Mulder

Delft University of Technology, Department of Electrical Engineering, Section Computer Architecture and Digital Systems

Abstract

This paper discusses software pipelining for a new class of architectures that we call transport-triggered. These architectures reduce the interconnection requirements between function units. They also exhibit code scheduling possibilities which are not available in traditional operation-triggered architectures. In addition, the scheduling freedom is extended by the use of so-called hybrid-pipelined function units.

In order to exploit this freedom, existing scheduling techniques need to be extended. We present a software pipelining technique, based on Lam's algorithm, which exploits the potential of transport-triggered architectures.

Performance results are presented for several benchmark loops. Depending on the available transport capacity, MFLOP rates may increase significantly as compared to scheduling without the extra degrees of freedom.

1 Introduction

Software pipelining is a well-known technique for optimizing loops for superpipelined and VLIW-like architectures. In software pipelining the next iteration of a loop starts before the current one ends; it aims at minimizing the initiation interval of successive iterations. Several algorithms for automatic software pipelining exist [1,2,3,4]. They differ in complexity and flexibility.

This paper describes and investigates software pipelining for a class of architectures which we call transport-triggered architectures. In operation-triggered architectures, instructions specify operations and require these operations to trigger the transport of source and destination operands. In transport-triggered architectures, instructions specify the transport of data only and, therefore, require the transport to trigger the operations.

All recent RISC, VLIW, and superpipelined architectures can be classified as operation-triggered. The class of MOVE architectures presented in [5] and certain micro-architectures (e.g., [6]) can be classified as transport-triggered.

As stated in [5], transport-triggered MOVE architectures have extra instruction scheduling degrees of freedom. This paper investigates if and how those extra degrees influence the software pipelining iteration initiation interval. It therefore adapts the existing algorithms for software pipelining as developed by Lam [2]. It is shown that transport-triggering may lead to a significant reduction of the iteration initiation interval and therefore to an increase of the MIPS and/or MFLOPS rate.

The remainder of this paper starts with an introduction of the MOVE class of architectures; it clarifies the idea of transport-triggered architectures. Section 3 formulates the software pipelining problem and its algorithmic solution for transport-triggered architectures. Section 4 describes the architecture characteristics and benchmarks used for the measurements. In order to research the influence of the extra scheduling freedom, the algorithm has been applied to the benchmarks under different scheduling disciplines. Section 5 compares and analyses the measurements. Finally, section 6 gives several conclusions and indicates further research to be done.

2 The MOVE class of architectures

MOVE architectures can be viewed as a number of function units (FUs) connected by some kind of transport network (see figure 1). By changing the type and number of FUs and the connectivity and capacity of the transport network, a whole range of architectures can be implemented; MOVE architectures are therefore ideal for being tailored to application-specific needs. Currently we are implementing a single-chip MOVE prototype architecture.

Figure 1: The MOVE architecture.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0-89791-460-0/91/0011/0074 $1.50

In the next subsections we first go into some more details of the MOVE concept, explaining how to program this class of architectures, and indicate some of its advantages (for a complete qualitative analysis see [5]). Second, the particular implementation of pipelined FUs is discussed; this implementation extends the code scheduling freedom.

2.1 The MOVE Concept

MOVE architectures are programmed by specifying the necessary transports of data values in the transport network. There is only one type of instruction: the move from FU to FU. In general, FUs are connected to the transport network by way of three types of registers: operand, trigger, and result registers. All moves are thus from register to register. General-purpose registers (GPRs) can be viewed as monadic-identity FUs. The function of the three register types is:

Operand registers: These act as normal general-purpose registers, except that when an operation is triggered (see below) on the accompanying FU, their contents are used as operands for the operation.

Trigger registers: A move to a trigger register causes the start of an operation on the accompanying FU. The moved value is used as an operand of the operation. If more operands are needed, they are taken from the operand registers of the FU. To handle different operations on a single FU, the trigger register is mapped into different logical address locations.

Result registers: Most FUs are fully pipelined, so an operation can be started in every cycle. When a result of an operation leaves the pipeline, it is placed in the result register of the FU, from where it can be moved to other FUs.

All FUs contain at least one trigger type of register. Practical MOVE architectures handle multiple moves in parallel; in this sense they resemble VLIW architectures.

    ra → io, rb → add    ; trigger addition
    ...                  ; space for other moves
    rc → sa, ir → sd     ; trigger store

Figure 2: The MOVE code for an addition followed by a store.

In order to make the MOVE concept more clear, an example is given in figure 2. The example shows the addition of GPRs ra and rb and a store of the result at memory address rc. The code starts with moving ra and rb to the operand and trigger registers of the integer unit (registers io and add). The trigger register is accessed at the location that causes an addition on the integer unit. The operand move(s) must always be done before or in the same cycle as the trigger move. When the result of the addition leaves the pipeline it is available in the result register of the integer unit; in our example the latency of the integer unit is two, so the result move can take place two cycles after the trigger move. The result move passes the result directly to the trigger register of the memory unit. At the same time the address of the store is placed in the operand register of the memory unit. Thus a store is triggered with the data to be stored, and the address of the store is supplied via the operand register. Since a store has no result, no result move is needed. On average, between 1.5 and 2 moves are necessary per operation.

Flow control operations are also done by moves. An absolute jump is simply writing the target address into the program counter, which is visible in the register address space. Relative jumps are provided by writing a displacement to the program counter displacement register. Both jumps have delay slots before instructions at the target address are executed.

Conditional execution is supported by guards. Each move can be guarded. A guarded move g : ra → rb only takes place when the guard g evaluates to true. In the prototype MOVE architecture, g can be a boolean expression of two boolean values produced by the compare FU.

Advantages of the MOVE concept

The MOVE concept has several advantages, both for implementation in VLSI and for generating high-quality code. It will be clear that splitting traditional operations into more fundamental transport operations offers extra opportunities for generating high-quality code. The benefits for the scheduler in relation to software pipelining will become clear in the remainder of this paper. The most important implementation advantages are:

• Efficient transport capacity. On average we need far less than 3 data transports per operation; this means that with an equal amount of metal on a chip (transport busses require a large amount of chip area!) we can keep more function units busy.

• Very short cycle time. The cycle time of MOVE architectures is lower-bounded by the register-to-register transport time, which can be very short.

• Flexible architecture. It is very easy to tailor MOVE architectures to specific applications by changing the transport capacity (e.g. the number of busses), the type of FUs, and the number of FUs.

• Simplicity. Despite their flexibility, MOVE architectures are very easy to design. They are therefore suitable for automatic generation within a silicon environment.

2.2 FU Pipelines

Most FUs within MOVE architectures are implemented using a so-called hybrid pipelining mechanism. Figure 3 shows a typical integer pipeline with a latency of 3 clock cycles.

Current high-performance architectures generally implement pipelines where either intermediate pipeline stages always continue on each clock cycle to the next stage (e.g. [7,8]), or intermediate stages continue only in case a new operation is issued (e.g. [9]). In hybrid pipelines, intermediate values continue to the next stage on each clock cycle, as far as they do not overwrite results from previous operations. This offers extra scheduling freedom; results may be moved from the FUs at any time after triggering the operation¹. Hybrid pipelines also reduce register usage by using intermediate stages as temporary storage.

¹Even before the result has been calculated; MOVE architectures implement a locking mechanism which avoids the insertion of no-op moves.
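The hybrid pipelining rule — values advance one stage per clock cycle, but never overwrite a result that has not yet been moved away — can be made concrete with a small model. This is an illustrative sketch, not the prototype's implementation; the class name, the stage representation, and the trigger/read_result interface are invented for the example.

```python
class HybridPipeline:
    """Toy model of a hybrid-pipelined FU (illustrative only).

    Values advance one stage per clock cycle, but never overwrite a
    result that is still waiting in a later stage. The last slot acts
    as the FU's result register, which must be read (moved away) before
    the value behind it can advance into it.
    """

    def __init__(self, stages):
        self.slots = [None] * stages  # slots[-1] is the result register

    def clock(self):
        # Advance from the back so each value moves at most one stage,
        # and only if the next stage is free (no overwriting).
        for i in range(len(self.slots) - 2, -1, -1):
            if self.slots[i] is not None and self.slots[i + 1] is None:
                self.slots[i + 1] = self.slots[i]
                self.slots[i] = None

    def trigger(self, value):
        # A trigger move starts an operation; its (future) result enters
        # the pipeline at stage 0.
        assert self.slots[0] is None, "pipeline stage occupied"
        self.slots[0] = value

    def read_result(self):
        # A result move empties the result register.
        value, self.slots[-1] = self.slots[-1], None
        return value
```

With a 3-stage pipeline, a value triggered in cycle t reaches the result register two clocks later; if it is not moved away, a following value stalls in the middle stage instead of overwriting it — exactly the freedom (and the locking behaviour) described above.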

Figure 3: The pipeline of the integer unit. (Alternating combinatorial logic stages and registers, connected to the transport network.)

3 Software Pipelining

Software pipelining is scheduling a loop in such a way that a next iteration starts before preceding iterations have finished. First, the conditions for a correct schedule and the basic algorithm for finding the optimal schedule are presented. The algorithm is based on the work of Lam [2]. Because MOVE architectures are transport-triggered and make the pipelines visible in the architecture, certain provisions have to be made to guarantee correct pipeline usage. These provisions are described next.

3.1 Problem Formulation and Basic Algorithm

The instructions from the loop that needs to be software pipelined are represented as nodes in a directed graph G = (V, E). The nodes represent instructions, or in our case moves. It is also possible that a node represents a collection of scheduled nodes; in that case we speak of hierarchical reduction (see [2]). Every node is supplied with a description of its resource needs. Between the nodes are arcs that describe precedence relations. Each arc (u, v) has two labels, d(u, v) and p(u, v). The meaning of these two labels is that v needs to be scheduled at least d(u, v) cycles after u of the p(u, v)-th preceding iteration, that is:

    σ(v) ≥ σ(u) + d(u, v) − s·p(u, v),

where σ(u) is the cycle in which u is triggered, and s is the iteration initiation interval, i.e. the time between the initiations of two successive iterations.

The problem is finding a schedule σ with the lowest possible s that satisfies the following two constraints:

1. Precedence constraints: ∀(u, v) ∈ E : σ(v) ≥ σ(u) + d(u, v) − s·p(u, v).

2. Resource constraints: at every moment no more resources are used than there are available.

These two constraints give two lower bounds on the initiation interval.

The algorithm for finding a schedule for a given initiation interval is a variant of list scheduling; its basic form is given in figure 4.

    ScheduleNodes(V, E, s)
    begin
        v := AnElement(V)
        S := {v}
        σ(v) := 0
        return ScheduleNextNodes(V, E, S, s)
    end

    ScheduleNextNodes(V, E, S, s)
    begin
        if V = S then return success
        v := AnElement(V − S)
        l := LowerBound(V, E, S, v, s)
        u := UpperBound(V, E, S, v, s)
        S := S ∪ {v}
        for σ(v) := l to u do
            if NoResourceConflicts(S, s) then
                if ScheduleNextNodes(V, E, S, s) = success then
                    return success
        return fail
    end

Figure 4: The basic algorithm for finding a schedule.

The differences with list scheduling are:

• If a resource is used at cycle t, it is also used at t + ks, where k is an integer number.

• In list scheduling a node may have a lower bound on σ (or an upper bound in case of bottom-up scheduling). However, in software pipelining a node may also have an upper bound, since G is cyclic. Each node has to be scheduled between its lower and upper bound.

• It is not always possible to schedule each node between its bounds, due to resource constraints.

The algorithm starts with scheduling a node at cycle zero. Next, a second node is taken and scheduled at a place between its lower and upper bound where enough resources are available. The remaining nodes are scheduled in a similar fashion. The two bounds of a node v are determined by the following two functions:

    LowerBound(V, E, S, v, s) = max{ σ(u) + Σ_{e∈l} (d(e) − s·p(e)) | l is a path from u ∈ S to v }

    UpperBound(V, E, S, v, s) = min{ σ(u) − Σ_{e∈l} (d(e) − s·p(e)) | l is a path from v to u ∈ S },

where S is the set of scheduled nodes.

In contrast to Lam, we backtrack (to a certain level) if a node could not be placed, trying one of the earlier placed nodes at a later cycle. We found that without backtracking, schedules for transport-triggered architectures are too often non-optimal. Current research aims at improving the heuristics in order to avoid backtracking.
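The search of figure 4 can be rendered in executable form. The sketch below is a simplification under stated assumptions: a single resource class (move busses) with one slot per move, bounds derived from already-scheduled direct neighbours rather than from all paths, and a finite search window in place of true upper bounds; all names are ours, not the paper's.

```python
def schedule(nodes, arcs, s, n_buses):
    """Try to find a modulo schedule sigma with initiation interval s.

    nodes   -- list of node names; nodes[0] is fixed at cycle 0
    arcs    -- dict (u, v) -> (d, p): v must start at least d cycles
               after u of the p-th preceding iteration, i.e.
               sigma[v] >= sigma[u] + d - s*p
    n_buses -- bus slots available in each of the s modulo cycles
    """
    sigma = {nodes[0]: 0}

    def bounds(v):
        # Bounds from scheduled direct neighbours only (simplification);
        # a finite window 4*s stands in for the path-based upper bound.
        lo, hi = 0, 4 * s
        for (u, w), (d, p) in arcs.items():
            if w == v and u in sigma:
                lo = max(lo, sigma[u] + d - s * p)
            if u == v and w in sigma:
                hi = min(hi, sigma[w] - d + s * p)
        return lo, hi

    def no_conflict():
        # A move placed at cycle t occupies a bus at t + k*s for all k,
        # so resource usage is counted modulo s.
        used = {}
        for t in sigma.values():
            used[t % s] = used.get(t % s, 0) + 1
        return all(c <= n_buses for c in used.values())

    def extend(remaining):
        if not remaining:
            return True
        v, rest = remaining[0], remaining[1:]
        lo, hi = bounds(v)
        for t in range(lo, hi + 1):  # backtracking over placements
            sigma[v] = t
            if no_conflict() and extend(rest):
                return True
            del sigma[v]
        return False

    return sigma if extend(nodes[1:]) else None
```

For a three-move recurrence a → b → c → a with unit delays, where the back arc spans one iteration, the cycle length Σd = 3 over Σp = 1 forces s ≥ 3: the search succeeds at s = 3 and fails at s = 2.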

The function AnElement() in Lam's algorithm does not take a random element but uses a heuristic: the element that is selected is the one from the ready list with the lowest upper bound. The ready list is the following subset of V − S:

    ReadyList(V, E, S) = …

We have experimented with several other heuristics, and we have found a better one that uses the same ready list but another selection function. The selection function chooses the node in the ready list with the highest priority. This priority is determined by:

    Priority(v) = α·r(v)/s − β·(u(v) − l(v)),

where r(v) is an indication of the scarceness of the resources that v uses, u(v) and l(v) are the bounds of v, and α and β are weight factors that should be determined with some experimentation. With this heuristic we also consider resource constraints. Since resource conflicts become less frequent as s becomes larger, r(v) is divided by s. With this heuristic we achieved an improvement of more than 5% (in the average s/s_min) in comparison with Lam's heuristic when we do not apply backtracking.

When a graph is not strongly connected (there is no path between every pair of nodes), one or both bounds may not exist. To prevent this we add arcs to E that have a low d − s·p. By doing this we can guarantee that all nodes have finite lower and upper bounds. Due to the low d − s·p we do not severely limit our scheduling freedom.

In Lam's algorithm, software pipelining takes place in two steps. In the first step all strongly connected components (SCCs) are scheduled, and in the second step the scheduled SCCs are reduced to nodes that form an acyclic graph. This graph is scheduled by a modified list scheduling algorithm. We did not use this two-step method because when an SCC is scheduled, it does not make use of the information in previously scheduled SCCs. By using more information it is possible to make better schedules. The drawback of our method is the increased scheduling time. In practice there is not much difference, since many loops contain only a single SCC.

3.2 Handling Pipelines

In transport-triggered architectures, traditional instructions are split into their transport components. This, combined with the hybrid pipeline mechanism of MOVE architectures, means that the resource management of FUs has to be changed: it is no longer possible to see a pipeline as a single resource.

Figure 5 shows how we model an Add a, b, c operation for a 3-stage integer FU (as shown in figure 3). In the MOVE architecture, b → io is a move from GPR b to the integer operand register io. Besides using one move slot in the instruction stream, and using a transport bus during one cycle, this move also claims the io register for one or more cycles. The io register is released one cycle after the move c → add to the trigger register. In turn, this move claims the internal stage of the integer pipeline. Thus nodes can use, claim, or release resources.

Figure 5: Precedence constraints between moves. Each precedence constraint is labeled with its d and p.

Pipelines have properties that have to be modeled in precedence and/or resource constraints. The two properties are (1) pipelines have a FIFO structure, and (2) pipelines have a finite storage capacity. We model these properties by introducing internal moves, and by treating pipeline stages as resources. Internal moves are moves from an operand register, a trigger register or a pipeline stage to a result register or a pipeline stage.

All four registers inside the unit are resources that need to be claimed and afterwards released. The precedence constraints between the moves for an addition are shown in figure 5, and the resources used, claimed and released by each node are summarized in table 1. For software pipelining we add the indicated backward arcs. They can be seen as inter-iteration WaR-dependencies, whereas a forward edge can be seen as an intra-iteration RaW-dependency. We have added these backward edges to model some resource constraints by precedence constraints. These precedence constraints are used by the heuristic in AnElement(); thus these extra arcs are not used for preventing incorrect schedules but for a better performing heuristic.

    node            resources
    b → io          use move bus, claim operand register
    c → add         use move bus, claim trigger register
    internal move   release operand and trigger register, claim internal stage
    internal move   release internal stage, claim result register
    ir → a          use move bus, release result register

Table 1: The resources.

4 Benchmark results

The purpose of our measurements is to research the influence of transport-triggering and its extra scheduling degrees of freedom on the initiation interval for software pipelined loops. The next subsections first define the architecture used for experimentation, second the different scheduling disciplines, and finally the measurement results for a couple of benchmarks.

4.1 Architecture Parameters

The MOVE architecture has parameters that are relatively easy to adapt to the requirements of specific applications, e.g. transport capacity, number and type of FUs, latency of FUs, number of guards, and the number of GPRs. For the measurements in this paper we assume FU parameters as listed in table 2.

Table 2: The functionality of our experimental model. (The table lists, per FU type, its registers, latency and operations: load/store and load immediate (16/32 bit), integer addition/subtraction (add, sub), fp multiplication (fmul), fp addition/subtraction (fadd, fsub), logical (and, or, xor, not), shift (sll, srl, …), fast compare (geq, gne)*, slow compare (gt, ge, hi, …), and jump/branch (pc, pcd)†.)

*Not a result register, but a boolean used in the evaluation of a guard.
†A branch delay is one cycle longer than a jump delay.

With a latency of n cycles, we mean that if an operation is triggered in cycle t, the result can be read in cycle t + n. We assume 1 FU of each type, and a sufficient number of registers. The registers are 32 bit, and combined for both integer and floating point values. All registers are fully interconnected by a transport network containing 2 or more busses. Each move instruction uses 1 bus during 1 cycle. Three immediate formats are supported: 6, 16 and 32 bit. Short immediates of 6 bit are part of the regular move instructions (which are sized 16 bit in the MOVE prototype); longer immediates have to be read from the instruction registers (ir1 and ir2), which means that they occupy 1 or 2 move instruction slots.

4.2 Scheduling Disciplines

In order to study the effect of the extra scheduling freedom of the MOVE architecture, we introduce six scheduling disciplines with increasing degrees of freedom. The first two disciplines are derived from traditional VLIWs; that means that every traditional instruction has been replaced by its corresponding moves, without applying possible extra optimizations which are introduced by this replacement. The remaining disciplines are transport-triggered models. They differ in the allowed scheduling freedom. This allows us to determine exactly where the extra speedup has to be attributed to. The disciplines are:

VLIW: Every instruction is projected on a fixed move pattern. In this pattern, operand and trigger moves are in the same cycle, and a result move has to take place exactly a fixed number of cycles later (depending on the latency of the FU).

VLIW-bp: This is VLIW with software bypassing. Software bypassing is overlapping result and trigger (or operand) moves of RaW-dependent operations. E.g. the result move ir → r0 can take place in the same cycle as the trigger move r0 → add when the source field of the trigger move is changed into ir. Software bypassing is the counterpart of hardware bypassing, which has the same goal, namely reducing the latency between RaW-dependent operations.

OTR-1: Extra dead code removal. Splitting instructions into their transport components offers extra optimization opportunities. Software bypassing introduces dead code: many results which go directly to an operand or trigger register do not have to be moved to GPRs. The removal of this dead code reduces both transport and GPR requirements.

OTR-2: Loop invariant code motion. Operand moves which are invariant within the body of the loop may be placed in the loop prologue. This optimization is unique for transport-triggered architectures. To make this optimization more applicable, we always place immediates of commutative operations in the operand register.

TR: In contrast to the OTR-1/2 models, where the operand, trigger and result moves are scheduled at fixed relative positions (like in the VLIW models), the TR model allows all operand moves to be scheduled earlier than their corresponding trigger moves. Trigger and result move are still at a fixed distance.

FREE: This model allows full scheduling freedom of both the operand and result moves. It makes use of the capabilities of the hybrid-pipelined FUs, which allow result moves to be scheduled at any time after the trigger move. Remark that for software pipelining it does not make sense to schedule a result move earlier than the latency of a FU would permit, except for code density.

Since we listed the models in order of increasing scheduling freedom, the following relation holds for the software pipeline initiation interval s:

    s(VLIW) ≥ s(VLIW-bp) ≥ s(OTR-1) ≥ s(OTR-2) ≥ s(TR) ≥ s(FREE).

4.3 Benchmarks

We use six loops for studying the effect of the different scheduling disciplines. Four of them are kernels from the Livermore loops²; the fifth loop is the SAXPY loop; the sixth is a loop that increments each element of a floating point vector by one (see figure 6). Because of its size, the latter loop is included for the analysis purposes presented in section 5. We shall call this loop 'vector-increment'.

Table 3 shows the initiation intervals of the six loops, for the six scheduling disciplines with full backtracking. The number of move busses b varies from 2 to 5. In addition to the measured initiation interval s, the minimum initiation interval s_min (determined by the resource and precedence constraints) is shown between parentheses. It will be clear that for small b, s_min is determined by the available transport capacity, while for large b the number of FUs or the precedence constraints bound s_min.

Table 3: The measured initiation intervals. (The minimum initiation interval s_min is shown between parentheses.)

²We assume single precision floating point calculations.
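The observation that for small b the transport capacity determines s_min can be made concrete: with m moves per iteration and b busses, iterations cannot initiate faster than ⌈m/b⌉ cycles apart, and each fully pipelined FU contributes a similar bound. A minimal sketch follows; the function name and argument shapes are our assumptions, not the paper's tool.

```python
from math import ceil

def resource_bound(n_moves, n_buses, fu_ops=None):
    """Lower bound on the initiation interval from resource constraints.

    n_moves -- data-transport moves needed per loop iteration
    n_buses -- move busses, each carrying one move per cycle
    fu_ops  -- optional {fu_name: (ops_per_iteration, n_units)}; a fully
               pipelined FU accepts one trigger per unit per cycle

    bound = max over resources of ceil(uses per iteration / capacity)
    """
    bound = ceil(n_moves / n_buses)
    for ops, units in (fu_ops or {}).values():
        bound = max(bound, ceil(ops / units))
    return bound
```

E.g. a loop needing 10 moves per iteration on 2 busses cannot beat s = 5, which matches the trend that adding busses lowers s_min only until the FU count or the precedence constraints take over.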

5 Analysis of the results

As shown in figure 7, depending on the transport capacity, the MFLOP rates may increase more than 50% for a sample of Livermore benchmarks. There are several independent factors contributing to this increase. In order to analyze these factors we look in detail at the vector-increment loop as shown in figure 6.

    for r3 := r10 to r11 do
    begin
        rz := memory[r3]
        rz := rz + 1
        memory[r3] := rz
    end

Figure 6: The vector-increment example.

Figure 7: Flops/cycle for different scheduling disciplines and transport capacities.

To make the results more clear, a graphic summary of the table is shown in figure 7 (for b ranging from 2 to 8). This graph displays how the harmonic mean number of floating point operations (that is, equal to the arithmetic mean of the number of FLOPs/cycle) increases as the number of move busses grows from two to eight. The graph is in correspondence with the inequality of the last subsection.

Figure 8: Data dependency graph for the vector-increment loop: (a) operation-triggered graph; (b) transport-triggered graph. (Nodes include the add, load, fpadd, and store; each arc is labeled with its d and p.)

Figure 8 shows the data dependency graph for this loop: (a) for an operation-triggered architecture and (b) for a transport-triggered one. In these graphs it is assumed that the index addition is done after the store instruction. In figuring out the dependencies, one should carefully consider all RaW and WaR hazards.

    b  | kernel 3 | kernel 5 | kernel 11 | kernel 12 | SAXPY  | vec-inc
    2  | 7 (7)    | 10 (9)   | 7 (6)     | 7 (6)     | 13 (7) | 5 (4)
    3  | 5 (5)    | 9 (6)    | 5 (4)     | 5 (4)     | 7 (5)  | 4 (4)
    4  | 5 (4)    | 5 (5)    | 5 (3)     | 5 (3)     | 5 (4)  | 4 (4)
    5  | 5 (4)    | 6 (5)    | 3 (3)     | 3 (3)     | 5 (3)  | 4 (4)

Table 4: The measured initiation intervals without backtracking for the FREE scheduling discipline.

To indicate the effect of backtracking, we included table 4, which contains the results for the FREE discipline without any backtracking. The effect of backtracking on our benchmarks for the FREE discipline is that the average s/s_min decreases from 1.27 to 1.06. When we use Lam's heuristics for our benchmarks, the average s/s_min would be 1.32.

In going from the VLIW discipline to the FREE discipline, the interval s for 2 move busses reduces from 12 to 5 cycles. The corresponding code schedules for VLIW and FREE are displayed in figure 9. We distinguish four important factors contributing to this drastic cycle reduction: (1) software bypassing, (2) dead code removal, (3) common subexpression elimination (CSE), and (4) code motion. They are discussed next.

Figure 9: The code for the vector-increment loop for 2 different disciplines: (a) the VLIW discipline; (b) the FREE discipline.

Software Bypassing. The difference between VLIW and VLIW-bp is the result of software bypassing. If there exists a RaW-dependency between a result move and an operand or trigger move, then without software bypassing the usage must be after the result move, so the d-label of the arc that describes this dependency is 1. However, with software bypassing usage and result move may overlap, so the d-label becomes 0; see e.g. in figure 8 the moves far → rz and rz → sd. From the precedence constraints it follows that

    s ≥ ⌈ Σ_{e∈C} d(e) / Σ_{e∈C} p(e) ⌉ for every cycle C in the graph.

The effect of software bypassing is a reduction of the 'length' (Σ d_i / Σ p_i) of the cycles in the graph. This means a lower s_min due to precedence constraints. In RISCs this bypassing is also available, however not in software but in hardware. For large VLIWs this hardware would be very complex. Transport-triggered architectures, however, do not need this hardware; they obtain the same result in software.

Removing Dead Code. As a result of software bypassing, dead code may result. E.g. in replacing far → rz and rz → sd by far → sd, it may happen that rz is no longer live. This occurs frequently when a result move produces a temporary result that is used immediately by a few other operations. In each case a result move and a register usage are saved. In operation-triggered architectures there is no possibility to prevent a value from being written back to the register file when all its usages go through the bypassing hardware. In OTR-1 dead code is removed. In the schedule of figure 9b we see 2 examples of this optimization: the far → r3 and ld → rz moves have been removed.

Common Subexpression Elimination. If source operands do not change in two successive calculations (using the same FU), the second operand move may be removed. In our example this results in 3 moves being placed out of the loop (loop invariant code motion is a special case of CSE), saving 3 move slots in the steady state (these moves are 1 → fao, 1 → io and r11 → gcof). The effect of this optimization is measured in the difference between OTR-1 and OTR-2. As shown in figure 7 the savings are significant (more than 20% in case of a small transport capacity). To make this optimization more frequently applicable, constant operands (immediates in general) of commutative operations are supplied via the operand register.

Code Motion. In going from OTR-2 to the FREE scheduling discipline we further relax the scheduling constraints for the operand and result moves. These moves no longer have to be put at fixed distances from the trigger move (as is the case in operation-triggered architectures). As shown in figure 7 there is an important gain by decoupling the operand move from the trigger move. Decoupling operand and trigger moves can even lower the minimal s due to precedence constraints. This can be explained by examining the graphs of vector-increment in figure 8. The cycle that gives a lower bound for s runs from the integer addition, via the load, fp addition, and store, back to the integer addition. In the top graph this cycle gives a lower bound of 7/1 = 7 cycles. However, for transport-triggered architectures with decoupled operand moves this bound is 8/2 = 4 cycles. Thus decoupling causes a lower minimum s, and therefore a better s. Besides relaxing the precedence constraints, decoupling of operand and trigger moves reduces the resource requirements of both the transport network and the GPRs; it may eliminate extra moves to temporary registers, because results may stay longer in the pipeline, or may be put directly into operand registers instead of GPRs.

Remark that precedence constraints can also be relaxed by using register renaming or a second pointer. However, this introduces extra register and transport requirements (extra moves), which may increase s_min due to the resource constraints.

As will be clear from looking at the absolute MFLOP values in figure 7, the function units are underutilized. This means that we have to apply some combination of loop unrolling and register renaming in order to further reduce precedence constraints and to keep the FUs more busy. This is a topic of further research. Anyhow, the amount of required renaming is lower for transport-triggered architectures.
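The precedence-constraint bound used above (7/1 = 7 cycles for the operation-triggered graph, 8/2 = 4 cycles for the decoupled one) is the maximum over all dependence cycles of ⌈Σd/Σp⌉. A minimal sketch, with the cycles supplied explicitly rather than enumerated from the graph; the (d, p) values in the usage note are illustrative stand-ins for figure 8's labels:

```python
from math import ceil

def recurrence_bound(cycles):
    """Lower bound on the initiation interval s from precedence constraints.

    cycles -- iterable of dependence cycles, each a list of (d, p) arc
              labels; every cycle C enforces s >= ceil(sum d / sum p).
    """
    return max(ceil(sum(d for d, _ in c) / sum(p for _, p in c))
               for c in cycles)
```

A cycle with total delay 7 spanning one iteration gives 7; stretching a total delay of 8 over two iterations — as decoupling the operand moves does for vector-increment — gives 4.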

6 Conclusions and Research

In this paper we presented a method for software pipelining on transport-triggered architectures and compared the results with operation-triggered architectures. Measured in MFLOPS, we observed speedups up to 50%, depending on the transport capacity of the transport network connecting the function units and general-purpose registers. The main factors contributing to this speedup are:

• Scheduling transports instead of operations gives extra scheduling freedom and allows for extra optimizations on the transport level, like operand and trigger move code motion, common subexpression elimination and loop-invariant code motion.

• On average the number of transport operations per operation is far less than 3. This means that if 3 busses are available per operation, their usage is rather low. In transport-triggered architectures we have full control over bus usage. This means that, given a certain transport capacity, we can keep more function units busy.

• Modeling pipelines with internal nodes is an effective method for modeling hybrid-pipelined function units. It allows for decoupling operand and result moves from the trigger move, leading to reduced resource requirements on transport and general-purpose registers, and also to reduced precedence constraints.

• Bypassing is important for reducing precedence constraints. Transport-triggered architectures can realize bypassing in software; they do not need the bypassing hardware, which can become complex for large VLIWs.

Based on our experiments so far, we are rather optimistic about the use of transport-triggered architectures. We would like to extend our research on software pipelining to larger architectures with more inhomogeneous transport networks; e.g., it is obvious that between integer and floating point units we do not need a large connectivity.

Currently we are developing a prototype transport-triggered MOVE architecture. A retargetable basic block scheduling compiler (based on GCC) and a simulator are already operational. The functionality, number of function units, and transport capacity can easily be changed. At present we are trying to integrate the software pipelining tool, as described in this paper, into this general compiler framework.

Research is going on to find better heuristics that eliminate the need for backtracking. Another important research topic is the inclusion of register renaming and loop unrolling in order to lower precedence constraints and to keep function units busy.

References

[1] Alexander Aiken and Alexandru Nicolau. Optimal Loop Parallelization. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 308–317, Atlanta, Georgia, June 1988.

[2] Monica Lam. A Systolic Array Optimizing Compiler. The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, Norwell, Massachusetts, 1989.

[3] Kemal Ebcioğlu and Toshio Nakatani. A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture. In Proceedings of the Second Workshop on Programming Languages and Compilers for Parallel Computing, University of Illinois at Urbana-Champaign, 1989.

[4] B. Su and J. Wang. Loop-carried dependence and the general URPR software pipelining approach. In Proceedings of HICSS-24, Vol. 2, pages 366–372, January 1991.

[5] Henk Corporaal and Hans (J. M.) Mulder. MOVE: a framework for high-performance processor design. In Supercomputing '91, Albuquerque, November 1991.

[6] Elliott I. Organick and James A. Hinds. Interpreting Machines: Architecture and Programming of the B1700/B1800 Series. Operating and Programming Systems Series, North-Holland, 1978.

[7] B. R. Rau et al. The Cydra 5 departmental supercomputer: design philosophies, decisions, and trade-offs. IEEE Computer, January 1989.

[8] Trace technical summary. Multiflow Computer Inc., June 1987.

[9] i860 64-bit Microprocessor Programmer's Reference Manual. Intel, 1989.
