Efficient Place and Route for Pipeline Reconfigurable Architectures
Srihari Cadambi* and Seth Copen Goldstein†
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
*[email protected]   †seth@cs.cmu.edu

Abstract

In this paper, we present a fast and efficient compilation methodology for pipeline reconfigurable architectures. Our compiler back-end is much faster than conventional CAD tools, and fairly efficient. We represent pipeline reconfigurable architectures by a generalized VLIW-like model. The complex architectural constraints are effectively expressed in terms of a single graph parameter: the routing path length (RPL). Compiling to our model using RPL, we demonstrate fast compilation times and show speedups of between 10x and 200x on a pipeline reconfigurable architecture when compared to an UltraSparc-II.

1. Introduction

Current workloads for computing devices are rapidly changing to include real-time media processing and compute-intensive programs. These new workloads are dataflow-dominated and characterized by regular computations on large sets of small data elements, limited I/O and lots of computation. As a result, many of them seriously underutilize the large datapaths and instruction bandwidths of conventional processors.

These problems are being addressed through processor changes [12, 6, 18]. However, reconfigurable computing offers a fundamentally different way to solve these issues. Reconfigurable architectures have the capability to configure connections between programmable logic elements, registers and memory in order to construct a highly-parallel implementation of the processing kernel at run-time. This feature makes them very attractive, since a specific high-speed circuit for a given instance of an application can be generated at compile- or even run-time.

In this paper, we present a generalized compilation scheme for a class of reconfigurable architectures that exhibit pipeline reconfiguration. The target pipeline reconfigurable architecture is represented by a simple VLIW-style model with a list of architectural constraints. We heuristically represent all constraints by a single graph parameter, the routing path length (RPL) of the graph, which we then attempt to minimize. Our compilation scheme is a combination of high-level synthesis heuristics and a fast, deterministic place-and-route algorithm made possible by a compiler-friendly architecture. The scheme may be used to design a compiler for any pipeline reconfigurable architecture, as well as any architecture that may be represented by our model. It is very fast and fairly efficient.

The remainder of the paper is organized as follows: Section 2 describes the concept of pipeline reconfiguration and its benefits. We briefly describe PipeRench, an instance of pipeline reconfigurable architectures, and the VLIW-style architectural model for the compiler. Section 3 describes the compiler along with our place-and-route algorithm. Section 4 presents results that establish a correlation between the RPL and our objective function. We also show the effects of different priority function parameters on our results, and report speedup numbers over an UltraSparc-II obtained by using our compiler on the PipeRench reconfigurable architecture. Section 5 discusses some related work and we conclude in Section 6.

2. Pipeline Reconfigurable Architectures

In this section, we review pipeline reconfiguration [14], a technique in which a large physical design is implemented on smaller sized hardware through rapid reconfiguration. This technique allows the compiler to target an unbounded amount of hardware. In addition, the performance of a design improves in proportion to the amount of hardware allocated to that design. Thus, as more transistors become available, the same hardware designs achieve higher levels of performance.

Pipeline reconfiguration is a method of virtualizing pipelined hardware applications by breaking the configuration into pieces that correspond to pipeline stages in the application. These configurations are then loaded, one per cycle, onto the interconnected network of programmable processing elements (PEs) collectively known as the fabric [5].

Since the configuration of stages happens concurrently with the execution of other stages, there is no loss in performance due to reconfiguration. As the pipeline is filling with data, stages of the computation are configured one step ahead of the data [5].

We now present the main characteristics of PipeRench, an instance of the class of pipeline reconfigurable architectures. We then present our architectural model that describes this class of architectures for the compiler.

2.1. PipeRench: an instance of Pipeline Reconfigurable Architectures

PipeRench is composed of pipeline stages called stripes. The pipeline stages available in the hardware are referred to as physical stripes, while the pipeline stages that the application is compiled to are referred to as virtual stripes. The compiler-generated virtual stripes are mapped to the physical stripes at run-time. Each physical stripe is composed of N processing elements (PEs). In turn, each PE is composed of B identically configured look-up tables (LUTs), P B-bit registers in a register file, and some control logic. The PEs have 2 data inputs. Each stripe has an associated inter-stripe interconnect used to route values to the next stripe and also to route values to other PEs in the same stripe. An additional interconnect, the register-file interconnect, allows the values of all the registers to be transferred to the registers of the PE in the same column of the next stripe. Figure 1 shows the PEs, their register files and the interconnection network in PipeRench.

Figure 1. PipeRench architecture showing the N PEs in each stripe and their register files. The interconnect switches B-bit values. Unless overwritten by their PE, the register files constitute a pipelined bus.
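To make the run-time mapping of virtual stripes onto physical stripes concrete, the following sketch models one plausible virtualization loop: exactly one physical stripe is reconfigured each cycle while the remaining stripes keep executing. The names (run_virtualized, execute_stage) are assumptions for illustration and do not correspond to PipeRench's actual configuration interface.

```python
# A minimal sketch of pipeline reconfiguration (hardware virtualization).
# The names used here are assumptions for illustration, not PipeRench's
# actual configuration interface.

def execute_stage(stripe_config):
    """Placeholder: execute one pipeline stage on the data currently in flight."""
    pass

def run_virtualized(virtual_stripes, num_physical, num_cycles):
    # Configuration currently held by each physical stripe (None = not yet configured).
    physical = [None] * num_physical
    for cycle in range(num_cycles):
        v = cycle % len(virtual_stripes)    # next virtual stripe to load
        p = cycle % num_physical            # physical stripe it overwrites
        physical[p] = virtual_stripes[v]    # configure exactly one stripe this cycle...
        for i, cfg in enumerate(physical):  # ...while every other configured stripe executes,
            if i != p and cfg is not None:  # so no cycles are lost to reconfiguration
                execute_stage(cfg)
```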
2.2. VLIW-style Architectural Model

Despite hardware virtualization, three important constraints are still imposed by such a pipeline reconfigurable architecture on the compiler: (i) a limited number of PEs per stripe, (ii) a limited number of registers per PE, and (iii) a limited number of read and write ports in each register file. These restrict the number of values that may overlap (i.e., live ranges of variables) on the register file of a single PE, make routing difficult, and impose a constraint on operator placement.

We can model this as a distributed register file VLIW-like architecture with N B-bit wide functional units (FUs, or PEs in our case). The configuration for each stripe is analogous to a VLIW instruction. Each instruction is comprised of N sub-instructions, one for each FU. A sub-instruction determines the operation of the FU, the sources of its inputs, and the register allocated for its outputs. Each FU has one register file with P registers, R read ports and W write ports. The inputs to a FU may come from other functional units depending on the interconnect. Thus, by generalizing this model and varying the parameters N, B, P, R and W, our work may be extended to compilers that map dataflow graphs to other similar architectures.
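As a concrete but hypothetical rendering of this model, the sketch below encodes the parameters N, B, P, R and W and a simple per-stripe constraint check. The field names, the SubInstruction structure and the example parameter values are assumptions made for illustration; the paper itself defines only the parameters and the constraints they impose.

```python
# A sketch of the VLIW-style architectural model. Field names and the example
# parameter values are assumptions; the paper defines only N, B, P, R and W.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ArchModel:
    N: int  # functional units (PEs) per stripe
    B: int  # datapath width of each FU, in bits
    P: int  # registers in each FU's register file
    R: int  # read ports per register file
    W: int  # write ports per register file

@dataclass
class SubInstruction:
    opcode: str             # operation performed by the FU
    src_regs: List[int]     # registers read from this FU's register file
    dst_reg: Optional[int]  # register allocated for the output, if any

def fits_stripe(arch: ArchModel, subs: List[Optional[SubInstruction]]) -> bool:
    """Check one stripe (one VLIW-style instruction) against the per-stripe constraints."""
    if len(subs) > arch.N:                      # limited PEs per stripe
        return False
    for sub in subs:
        if sub is None:
            continue
        if len(sub.src_regs) > arch.R:          # limited read ports
            return False
        writes = 0 if sub.dst_reg is None else 1
        if writes > arch.W:                     # limited write ports
            return False
        regs = sub.src_regs + ([sub.dst_reg] if sub.dst_reg is not None else [])
        if any(r >= arch.P for r in regs):      # limited registers per PE
            return False
    return True

# Hypothetical instantiation; these numbers are illustrative, not taken from the paper.
arch = ArchModel(N=16, B=8, P=8, R=1, W=1)
```

Retargeting the same check to another architecture describable by the model only requires changing the constructor arguments.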
2.3. Definitions

Before we present the problem and our solution, we define a few terms. A dataflow graph is used to represent the input program. It consists of nodes, which represent operations¹. Nodes are interconnected by directed edges, referred to as wires. A wire has a single source node but can have multiple destination nodes (fanout). Each wire carries a value, represented by multiple bits.

Nodes may be routing nodes or non-routing nodes. A routing node does not consume any functional resources, while a non-routing node has to be allocated PEs. An example of a routing node is a "bit-select" (marked [1,1] in Figure 2).

The routing path length of a single bit in a given wire w is the total number of time-steps between w and its nearest non-routing source node. The routing path length of a node is the sum of the routing path lengths of all of the bits of its source wires. (See Figure 2.)

¹We use the terms "Node" and "Operation" interchangeably in this paper.

Figure 2. Illustration of RPL. The right source bit wire of the - operator has its nearest non-routing source node in time-step 0, resulting in an RPL of 3. The left source bit of the - operator has its nearest non-routing source node in time-step 1, resulting in an RPL of 2. The RPL of the - node over its bit wire sources is therefore 2 + 3 = 5. Each arc represents one bit of a wire.

Figure 3. Effect of minimizing routing path length (RPL) on the maximum number of registers needed. The target architecture has 1 PE per stripe (N=1, P=4, R=1, W=1). Schedule 1 prefers to place the high fan-in node A first, while Schedule 2 postpones A. The total RPL and maximum number of registers in Schedule 1 are lower: total RPL = 4 with 2 registers needed, versus total RPL = 6 with 4 registers needed for Schedule 2.

2.4. The Problem and Solution

The problem that we solve is to map a given dataflow graph to a pipeline reconfigurable architecture defined by the model in Section 2.2. Specifically, for each operation we have to first assign a stripe. Within that stripe, we have to allocate PE resources to perform the operation. The effect of the schedule on RPL and register usage is shown in Figure 3. We now present the following solution:

• We take advantage of hardware virtualization and assume an infinite amount of hardware in one dimension, that is, unlimited stripes. This allows us to gainfully adopt a greedy, deterministic, linear-time place-and-route algorithm in order to get a lot of speed. The algorithm never backtracks or removes an operation that was already placed (a simplified sketch of such a placement loop follows below).
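The sketch below illustrates the flavor of such a loop under assumed names: operations are visited in topological order and each is assigned to the earliest stripe that respects its data dependences and still has a free PE, never revisiting an earlier decision. The authors' compiler additionally orders nodes with a priority function aimed at minimizing RPL (Section 3); that ordering and the routing step are omitted here.

```python
# A simplified sketch of a greedy, non-backtracking placement loop over virtual
# stripes. This is an illustration under assumed names, not the authors'
# algorithm: their priority function and routing step are described in Section 3.

def greedy_place(nodes, sources, pes_per_stripe):
    """nodes: operation ids in topological order.
    sources: dict mapping a node to the list of nodes producing its inputs."""
    stripe_of = {}   # node -> assigned stripe
    used = []        # number of PEs consumed in each stripe so far

    for node in nodes:                         # each node is placed exactly once
        srcs = sources.get(node, [])
        # Earliest legal stripe: one past the latest-placed source (data flows forward).
        stripe = max((stripe_of[s] + 1 for s in srcs), default=0)
        # Skip full stripes; virtualization means new stripes can always be opened.
        while stripe < len(used) and used[stripe] >= pes_per_stripe:
            stripe += 1
        while len(used) <= stripe:
            used.append(0)
        stripe_of[node] = stripe
        used[stripe] += 1
        # Every extra stripe between a node and its sources adds routing time-steps,
        # i.e. RPL; placing each node as early as possible keeps that contribution low.
    return stripe_of

# Example: a three-operation chain a -> b -> c with 1 PE per stripe,
# as in the target architecture of Figure 3.
print(greedy_place(["a", "b", "c"], {"b": ["a"], "c": ["b"]}, pes_per_stripe=1))
# prints {'a': 0, 'b': 1, 'c': 2}
```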