Reconfigurable Computing: A Survey of Systems and Software

KATHERINE COMPTON Northwestern University

AND

SCOTT HAUCK University of Washington

Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey, we explore the hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which reuse the configurable hardware during program execution.

Categories and Subject Descriptors: A.1 [Introductory and Survey]; B.6.1 [Logic Design]: Design Style—logic arrays; B.6.3 [Logic Design]: Design Aids; B.7.1 [Integrated Circuits]: Types and Design Styles—gate arrays

General Terms: Design, Performance

Additional Key Words and Phrases: Automatic design, field-programmable, FPGA, manual design, reconfigurable architectures, reconfigurable computing, reconfigurable systems

1. INTRODUCTION

There are two primary methods in conventional computing for the execution of algorithms. The first is to use hardwired technology, either an Application Specific Integrated Circuit (ASIC) or a group of individual components forming a board-level solution, to perform the operations in hardware. ASICs are designed specifically to perform a given computation, and thus they are very fast and efficient when executing the exact computation for which they were designed. However, the circuit cannot be altered after fabrication. This forces a redesign and refabrication of the chip if any part of its circuit requires modification. This is an expensive process, especially when one considers the difficulties in replacing ASICs in a large number of deployed systems. Board-level circuits are also somewhat inflexible, frequently requiring a board redesign and replacement in the event of changes to the application.

This research was supported in part by Motorola, Inc., DARPA, and NSF. K. Compton was supported by an NSF fellowship. S. Hauck was supported in part by an NSF CAREER award and a Sloan Research Fellowship. Authors' addresses: K. Compton, Department of Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208-3118; e-mail: [email protected]; S. Hauck, Department of Electrical Engineering, The University of Washington, Box 352500, Seattle, WA 98195; e-mail: [email protected].

The second method is to use software-programmed processors—a far more flexible solution. Processors execute a set of instructions to perform a computation. By changing the software instructions, the functionality of the system is altered without changing the hardware. However, the downside of this flexibility is that the performance can suffer, if not in clock speed then in work rate, and is far below that of an ASIC. The processor must read each instruction from memory, decode its meaning, and only then execute it. This results in a high execution overhead for each individual operation. Additionally, the set of instructions that may be used by a program is determined at the fabrication time of the processor. Any other operations that are to be implemented must be built out of existing instructions.

Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks, and using the configurable routing to connect the blocks together to form the necessary circuit.

FPGAs and reconfigurable computing have been shown to accelerate a variety of applications. Data encryption, for example, is able to leverage both parallelism and fine-grained data manipulation. An implementation of the Serpent Block Cipher in the Virtex XCV1000 FPGA shows a throughput increase by a factor of over 18 compared to a Pentium Pro PC running at 200 MHz [Elbirt and Paar 2000]. Additionally, a reconfigurable computing implementation of sieving for factoring large numbers (useful in breaking encryption schemes) was accelerated by a factor of 28 over a 200-MHz UltraSparc workstation [Kim and Mangione-Smith 2000]. The Garp architecture shows a comparable speed-up for DES [Hauser and Wawrzynek 1997], as does an FPGA implementation of an elliptic curve cryptography application [Leung et al. 2000].

Other recent applications that have been shown to exhibit significant speed-ups using reconfigurable hardware include: automatic target recognition [Rencher and Hutchings 1997], string pattern matching [Weinhardt and Luk 1999], Golomb Ruler Derivation [Dollas et al. 1998; Sotiriades et al. 2000], transitive closure of dynamic graphs [Huelsbergen 2000], Boolean satisfiability [Zhong et al. 1998], data compression [Huang et al. 2000], and genetic algorithms for the travelling salesman problem [Graham and Nelson 1996].

In order to achieve these performance benefits, yet support a wide range of applications, reconfigurable systems are usually formed with a combination of reconfigurable logic and a general-purpose microprocessor. The processor performs the operations that cannot be done efficiently in the reconfigurable logic, such as data-dependent control and possibly memory accesses, while the computational cores are mapped to the reconfigurable hardware. This reconfigurable logic can be composed of either commercial FPGAs or custom configurable hardware.

Compilation environments for reconfigurable hardware range from tools to assist a programmer in performing a hand mapping of a circuit to the hardware, to complete automated systems that take a circuit description in a high-level language to a configuration for a reconfigurable system. The design process involves first partitioning a program into sections to be implemented on hardware, and those which are to be implemented in software on the host processor. The computations destined for the reconfigurable hardware are synthesized into a gate level or register transfer level circuit description. This circuit is mapped onto the logic blocks within the reconfigurable hardware during the technology mapping phase. These mapped blocks are then placed into the specific physical blocks within the hardware, and the pieces of the circuit are connected using the reconfigurable routing. After compilation, the circuit is ready for configuration onto the hardware at run-time. These steps, when performed using an automatic compilation system, require very little effort on the part of the programmer to utilize the reconfigurable hardware. However, performing some or all of these operations by hand can result in a more highly optimized circuit for performance-critical applications.

Since FPGAs must pay an area penalty because of their reconfigurability, device capacity can sometimes be a concern. Systems that are configured only at power-up are able to accelerate only as much of the program as will fit within the programmable structures. Additional areas of a program might be accelerated by reusing the reconfigurable hardware during program execution. This process is known as run-time reconfiguration (RTR). While this style of computing has the benefit of allowing for the acceleration of a greater portion of an application, it also introduces the overhead of configuration, which limits the amount of acceleration possible. Because configuration can take milliseconds or longer, rapid and efficient configuration is a critical issue. Methods such as configuration compression and the partial reuse of already programmed configurations can be used to reduce this overhead.

This article presents a survey of current research in hardware and software systems for reconfigurable computing, as well as techniques that specifically target run-time reconfigurability. We lead off this discussion by examining the technology required for reconfigurable computing, followed by a more in-depth examination of the various hardware structures used in reconfigurable systems. Next, we look at the software required for compilation of algorithms to configurable computers, and the trade-offs between hand-mapping and automatic compilation. Finally, we discuss run-time reconfigurable systems, which further utilize the intrinsic flexibility of configurable computing platforms by optimizing the hardware not only for different applications, but for different operations within a single application as well.

This survey does not seek to cover every technique and research project in the area of reconfigurable computing. Instead, it hopes to serve as an introduction to this rapidly evolving field, bringing interested readers quickly up to speed on developments from the last half-decade. Those interested in further background can find coverage of older techniques and systems elsewhere [Rose et al. 1993; Hauck and Agarwal 1996; Vuillemin et al. 1996; Mangione-Smith et al. 1997; Hauck 1998b].

2. TECHNOLOGY

Reconfigurable computing as a concept has been in existence for quite some time [Estrin et al. 1963]. Even general-purpose processors use some of the same basic ideas, such as reusing computational components for independent computations, and using multiplexers to control the routing between these components. However, the term reconfigurable computing, as it is used in current research (and within this survey), refers to systems incorporating some form of hardware programmability—customizing how the hardware is used using a number of physical control points. These control points can then be changed periodically in order to execute different applications using the same hardware.


The recent advances in reconfigurable computing are for the most part derived from the technologies developed for FPGAs in the mid-1980s. FPGAs were originally created to serve as a hybrid device between PALs and Mask-Programmable Gate Arrays (MPGAs). Like PALs, FPGAs are fully electrically programmable, meaning that the physical design costs are amortized over multiple application circuit implementations, and the hardware can be customized nearly instantaneously. Like MPGAs, they can implement very complex computations on a single chip, with devices currently in production containing the equivalent of over a million gates. Because of these features, FPGAs had been primarily viewed as glue-logic replacement and rapid-prototyping vehicles. However, as we show throughout this article, the flexibility, capacity, and performance of these devices has opened up completely new avenues in high-performance computation, forming the basis of reconfigurable computing.

Fig. 1. A programming bit for SRAM-based FPGAs [Xilinx 1994] (left) and a programmable routing connection (right).

Most current FPGAs and reconfigurable devices are SRAM-programmable (Figure 1 left), meaning that SRAM bits are connected to the configuration points in the FPGA, and programming the SRAM bits configures the FPGA. (The term "SRAM" is technically incorrect for many FPGA architectures, given that the configuration memory may or may not support random access; in fact, the configuration memory tends to be continually read in order to perform its function. However, it is the generally accepted term in the field and correctly conveys the concept of static volatile memory using an easily understandable label.) Thus, these chips can be programmed and reprogrammed about as easily as a standard static RAM. In fact, one research project, the PAM project [Vuillemin et al. 1996], considers a group of one or more FPGAs to be a RAM unit that performs computation between the memory write (sending the configuration information and input data) and memory read (reading the results of the computation). This leads some to use the term Programmable Active Memory or PAM.

One example of how the SRAM configuration points can be used is to control routing within a reconfigurable device [Chow et al. 1999a]. To configure the routing on an FPGA, typically a passgate structure is employed (see Figure 1 right). Here the programming bit will turn on a routing connection when it is configured with a true value, allowing a signal to flow from one wire to another, and will disconnect these resources when the bit is set to false. With a proper interconnection of these elements, which may include millions of routing choice points within a single device, a rich routing fabric can be created.

Fig. 2. D flip-flop with optional bypass (left) and a 3-input LUT (right).

Another example of how these configuration bits may be used is to control multiplexers, which will choose between the output of different logic resources within the array. For example, to provide optional stateholding elements, a D flip-flop (DFF) may be included with a multiplexer selecting whether to forward the latched or unlatched signal value (see Figure 2 left). Thus, for systems that require stateholding, the programming bits controlling the multiplexer would be configured to select the DFF output, while systems that do not need this function would choose the bypass route that sends the input directly to the output. Similar structures can choose between other on-chip functionalities, such as fixed-logic computation elements, memories, carry chains, or other functions.

Finally, the configuration bits may be used as control signals for a computational unit or as the basis for computation itself. As a control signal, a configuration bit may determine whether an ALU performs an addition, subtraction, or other logic computations. On the other hand, with a structure such as a lookup table (LUT), the configuration bits themselves form the result of the computation (see Figure 2 right). These elements are essentially small memories provided for computing arbitrary logic functions. LUTs can compute any function of N inputs (where N is the number of control signals for the LUT's multiplexer) by programming the 2^N programming bits with the truth table of the desired function. Thus, if all programming bits except the one corresponding to the input pattern 111 were set to zero, a 3-input LUT would act as a 3-input AND gate, while programming it with all ones except in 111 would compute a NAND.
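The following C sketch models these three uses of configuration bits in software: a programming bit gating a routing connection, a programming bit selecting between a D flip-flop and its bypass, and a 3-input LUT whose eight programming bits form a truth table. It is purely illustrative; the data layout and function names are ours, and a real device realizes these elements as SRAM cells, pass transistors, and multiplexers rather than as C functions.

```c
#include <stdio.h>
#include <stdint.h>

/* One programming bit controlling a routing pass gate: a true bit
   connects the two wires, a false bit leaves them disconnected
   (modeled here as driving 0). */
int routing_connection(int programming_bit, int signal) {
    return programming_bit ? signal : 0;
}

/* Optional stateholding: a programming bit selects either the D
   flip-flop output or the unlatched (bypass) value. */
typedef struct { int dff; int select_dff; } Cell;

int cell_output(Cell *c, int input, int clock_edge) {
    if (clock_edge) c->dff = input;           /* latch on the clock edge */
    return c->select_dff ? c->dff : input;    /* bypass when not needed  */
}

/* A 3-input LUT: the 2^3 = 8 programming bits hold the truth table,
   and the inputs select which bit reaches the output. */
int lut3(uint8_t truth_table, int a, int b, int c) {
    int index = (a << 2) | (b << 1) | c;
    return (truth_table >> index) & 1;
}

int main(void) {
    uint8_t and3  = 1u << 7;             /* only pattern 111 true: AND  */
    uint8_t nand3 = (uint8_t)~(1u << 7); /* all true except 111: NAND   */
    Cell c = { 0, 1 };                   /* configured to use the DFF   */

    printf("AND(1,1,1)=%d  AND(1,0,1)=%d\n",
           lut3(and3, 1, 1, 1), lut3(and3, 1, 0, 1));
    printf("NAND(1,1,1)=%d NAND(1,0,1)=%d\n",
           lut3(nand3, 1, 1, 1), lut3(nand3, 1, 0, 1));
    printf("routed: %d  registered: %d\n",
           routing_connection(1, 1), cell_output(&c, 1, 1));
    return 0;
}
```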


3. HARDWARE

Reconfigurable computing systems use FPGAs or other programmable hardware to accelerate algorithm execution by mapping compute-intensive calculations to the reconfigurable substrate. These hardware resources are frequently coupled with a general-purpose microprocessor that is responsible for controlling the reconfigurable logic and executing program code that cannot be efficiently accelerated. In very closely coupled systems, the reconfigurability lies within customizable functional units on the regular datapath of the microprocessor. On the other hand, a reconfigurable computing system can be as loosely coupled as a networked stand-alone unit. Most reconfigurable systems are categorized somewhere between these two extremes, frequently with the reconfigurable hardware acting as a coprocessor to a host microprocessor. The programmable array itself can be comprised of one or more commercially available FPGAs, or can be a custom device designed specifically for reconfigurable computing.

The design of the actual computation blocks within the reconfigurable hardware varies from system to system. Each unit of computation, or cell, can be as simple as a 3-input lookup table (LUT), or as complex as a 4-bit ALU. This difference in block size is commonly referred to as the granularity of the logic block, where the 3-bit LUT is an example of a very fine-grained computational element, and a 4-bit ALU is an example of a quite coarse-grained unit. The finer-grained blocks are useful for bit-level manipulations, while the coarse-grained blocks are better optimized for standard datapath applications. Some architectures employ different sizes or types of blocks within a single reconfigurable array in order to efficiently support different types of computation. For example, memory is frequently embedded within the reconfigurable hardware to provide temporary data storage, forming a heterogeneous structure composed of both logic blocks and memory blocks [Ebeling et al. 1996; Altera 1998; Lucent 1998; Marshall et al. 1999; Xilinx 1999].


The routing between the logic blocks within the reconfigurable hardware is also of great importance. Routing contributes significantly to the overall area of the reconfigurable hardware. Yet, when the percentage of logic blocks used in an FPGA becomes very high, automatic routing tools frequently have difficulty achieving the necessary connections between the blocks. Good routing structures are therefore essential to ensure that a design can be successfully placed and routed onto the reconfigurable hardware.

Once a circuit has been programmed onto the reconfigurable hardware, it is ready to be used by the host processor during program execution. The run-time operation of a reconfigurable system occurs in two distinct phases: configuration and execution. The programming of the reconfigurable hardware is under the control of the host processor. This host processor directs a stream of configuration data to the reconfigurable hardware, and this configuration data is used to define the actual operation of the hardware. Configurations can be loaded solely at start-up of a program, or periodically during runtime, depending on the design of the system. More concepts involved in run-time reconfiguration (the dynamic reconfiguration of devices during computation execution) are discussed in a later section.

The actual execution model of the reconfigurable hardware varies from system to system. For example, the NAPA system [Rupp et al. 1998] by default suspends the execution of the host processor during execution on the reconfigurable hardware. However, simultaneous computation can occur with the use of fork-and-join primitives, similar to multiprocessor programming. REMARC [Miyamori and Olukotun 1998] is a reconfigurable system that uses a pipelined set of execution phases within the reconfigurable hardware. These pipeline stages overlap with the pipeline stages of the host processor, allowing for simultaneous execution. In the Chimaera system [Hauck et al. 1997], the reconfigurable hardware is constantly executing based upon the input values held in a subset of the host processor's registers. A call to the Chimaera unit is in actuality only a fetch of the result value. This value is stable and valid after the correct input values have been written to the registers and have filtered through the computation.

In the next sections, we consider in greater depth the hardware issues in reconfigurable computing, including both logic and routing. To support the computation demands of reconfigurable computing, we consider the logic block architectures of these devices, including possibly the integration of heterogeneous logic resources within a device. Heterogeneity also extends between chips, where one of the most important concerns is the coupling of the reconfigurable logic with standard, general-purpose processors. However, reconfigurable devices are more than just logic devices; the routing resources are at least as important as logic resources, and thus we consider interconnect structures, including 1D-oriented devices that are beginning to appear.

3.1. Coupling

Frequently, reconfigurable hardware is coupled with a traditional microprocessor. Programmable logic tends to be inefficient at implementing certain types of operations, such as variable-length loops and branch control. In order to run an application in a reconfigurable computing system most efficiently, the areas of the program that cannot be easily mapped to the reconfigurable logic are executed on a host microprocessor. Meanwhile, the areas with a high density of computation that can benefit from implementation in hardware are mapped to the reconfigurable logic. For the systems that use a microprocessor in conjunction with reconfigurable logic, there are several ways in which these two computation structures may be coupled, as Figure 3 shows.

Fig. 3. Different levels of coupling in a reconfigurable system. Reconfigurable logic is shaded.

First, reconfigurable hardware can be used solely to provide reconfigurable functional units within a host processor [Razdan and Smith 1994; Hauck et al. 1997]. This allows for a traditional programming environment with the addition of custom instructions that may change over time. Here, the reconfigurable units execute as functional units on the main microprocessor datapath, with registers used to hold the input and output operands.
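As a rough, hypothetical illustration of this functional-unit model, the sketch below contrasts a small bit-manipulation kernel executed as ordinary instructions with the same kernel invoked through an imagined rfu_op() intrinsic. The intrinsic and its calling convention are not the actual interface of Chimaera or any other system; they only stand in for a custom instruction whose operands sit in processor registers and whose "call" amounts to fetching the already-computed result.

```c
#include <stdint.h>
#include <stdio.h>

/* Software version of a small bit-manipulation kernel: several
   dependent instructions on a conventional processor. */
static uint32_t kernel_sw(uint32_t a, uint32_t b) {
    uint32_t t = (a & 0x0F0F0F0Fu) | (b & 0xF0F0F0F0u);
    return (t << 1) ^ (t >> 3);
}

/* Hypothetical reconfigurable-functional-unit intrinsic.  In a real
   system this would be a single custom instruction: the operands are
   written to (or already sit in) processor registers, the configured
   logic computes continuously, and the "call" merely fetches the
   result register.  Here it is modeled by the same C function. */
static inline uint32_t rfu_op(uint32_t a, uint32_t b) {
    return kernel_sw(a, b);   /* stands in for the mapped circuit */
}

int main(void) {
    uint32_t x = 0x12345678u, y = 0x9ABCDEF0u;
    printf("software: %08x\n", (unsigned)kernel_sw(x, y));
    printf("rfu:      %08x\n", (unsigned)rfu_op(x, y)); /* one "instruction" */
    return 0;
}
```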


Second, a reconfigurable unit may be used as a coprocessor [Wittig and Chow 1996; Hauser and Wawrzynek 1997; Miyamori and Olukotun 1998; Rupp et al. 1998; Chameleon 2000]. A coprocessor is, in general, larger than a functional unit, and is able to perform computations without the constant supervision of the host processor. Instead, the processor initializes the reconfigurable hardware and either sends the necessary data to the logic, or provides information on where this data might be found in memory. The reconfigurable unit performs the actual computations independently of the main processor, and returns the results after completion. This type of coupling allows the reconfigurable logic to operate for a large number of cycles without intervention from the host processor, and generally permits the host processor and the reconfigurable logic to execute simultaneously. This reduces the overhead incurred by the use of the reconfigurable logic, compared to a reconfigurable functional unit that must communicate with the host processor each time a reconfigurable "instruction" is used. One idea that is somewhat of a hybrid between the first and second coupling methods is the use of programmable hardware within a configurable cache [Kim et al. 2000]. In this situation, the reconfigurable logic is embedded into the data cache. This cache can then be used as either a regular cache or as an additional computing resource depending on the target application.

Third, an attached reconfigurable processing unit [Vuillemin et al. 1996; Annapolis 1998; Laufer et al. 1999] behaves as if it is an additional processor in a multiprocessor system or an additional compute engine accessed semi-frequently through external I/O. The host processor's data cache is not visible to the attached reconfigurable processing unit. There is, therefore, a higher delay in communication between the host processor and the reconfigurable hardware, such as when communicating configuration information, input data, and results. This communication is performed through specialized primitives similar to multiprocessor systems. However, this type of reconfigurable hardware does allow for a great deal of computation independence, by shifting large chunks of a computation over to the reconfigurable hardware.

Finally, the most loosely coupled form of reconfigurable hardware is that of an external stand-alone processing unit [Quickturn 1999a, 1999b]. This type of reconfigurable hardware communicates infrequently with a host processor (if present). This model is similar to that of networked workstations, where processing may occur for very long periods of time without a great deal of communication. In the case of the Quickturn systems, however, this hardware is geared more towards emulation than reconfigurable computing.
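The sketch below gives a flavor of the host-side interaction these more loosely coupled styles imply: configure once, hand off a block of data, let the logic run unsupervised, and collect the results. The unit_* functions are hypothetical, and the "unit" is simulated with ordinary arrays so the example runs anywhere; no real vendor driver API is being described.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical host-side interface for a loosely coupled reconfigurable
   unit (coprocessor or attached processing unit).  The unit is simulated
   in software here so that the sketch is self-contained and runnable. */

static int16_t unit_in[1024], unit_out[1024];   /* unit-local memories */
static int     unit_busy;

static void unit_configure(const void *bitstream, size_t len) {
    (void)bitstream; (void)len;  /* a real driver would stream the bitstream */
}
static void unit_send(const int16_t *src, size_t n) {
    memcpy(unit_in, src, n * sizeof *src);
}
static void unit_start(size_t n) {
    /* Stand-in for the configured datapath: out[i] = 2*in[i] + 1. */
    for (size_t i = 0; i < n; i++)
        unit_out[i] = (int16_t)(2 * unit_in[i] + 1);
    unit_busy = 0;
}
static int  unit_done(void) { return !unit_busy; }
static void unit_receive(int16_t *dst, size_t n) {
    memcpy(dst, unit_out, n * sizeof *dst);
}

int main(void) {
    int16_t in[4] = { 1, -2, 3, -4 }, out[4];

    unit_configure(NULL, 0);      /* one-time configuration                 */
    unit_send(in, 4);             /* hand off a whole block of input data   */
    unit_busy = 1;
    unit_start(4);                /* unit runs for many cycles on its own   */
    while (!unit_done())          /* host is free to do unrelated work here */
        ;
    unit_receive(out, 4);         /* collect the results                    */

    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

The communication cost is paid once per block of work rather than once per operation, which is the essential difference from the functional-unit style sketched earlier.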

Each of these styles has distinct benefits and drawbacks. The tighter the integration of the reconfigurable hardware, the more frequently it can be used within an application or set of applications due to a lower communication overhead. However, the hardware is unable to operate for significant portions of time without intervention from a host processor, and the amount of reconfigurable logic available is often quite limited. The more loosely coupled styles allow for greater parallelism in program execution, but suffer from higher communications overhead. In applications that require a great deal of communication, this can reduce or remove any acceleration benefits gained through this type of reconfigurable hardware.

3.2. Traditional FPGAs

Before discussing the detailed architecture design of reconfigurable devices in general, we will first describe the logic and routing of FPGAs. These concepts apply directly to reconfigurable systems using commercial FPGAs, such as PAM [Vuillemin et al. 1996] and Splash 2 [Arnold et al. 1992; Buell et al. 1996], and many also extend to architectures designed specifically for reconfigurable computing. Hardware concepts applying specifically to architectures designed for reconfigurable computing, as well as variations on the generic FPGA description provided here, are discussed following this section. More detailed surveys of FPGA architectures themselves can be found elsewhere [Brown et al. 1992a; Rose et al. 1993].

Since the introduction of FPGAs in the mid-1980s, there have been many different investigations into what computation element(s) should be built into the array [Rose et al. 1993]. One could consider FPGAs that were created with PAL-like product term arrays, or multiplexer-based functionality, or even basic fixed functions such as simple NAND and XOR gates. In fact, many such architectures have been built. However, it seems to be fairly well established that the best function block for a standard FPGA, a device whose primary role is the implementation of random digital logic, is the one found in the first devices deployed—the lookup table (Figure 2 right). As described in the previous section, an N-input LUT is basically a memory that, when programmed appropriately, can compute any function of up to N inputs. This flexibility, with relatively simple routing requirements (each input need only be routed to a single multiplexer control input), turns out to be very powerful for logic implementation. Although it is less area-efficient than fixed logic blocks, such as a standard NAND gate, the truth is that most current FPGAs use less than 10% of their chip area for logic, devoting the majority of the silicon real estate to routing resources.

Fig. 4. A basic logic block, with a 4-input LUT, carry chain, and a D-type flip-flop with bypass.

The typical FPGA has a logic block with one or more 4-input LUT(s), optional D flip-flops (DFF), and some form of fast carry logic (Figure 4). The LUTs allow any function to be implemented, providing generic logic. The flip-flop can be used for pipelining, registers, stateholding functions for finite state machines, or any other situation where clocking is required. Note that the flip-flops will typically include programmable set/reset lines and clock signals, which may come from global signals routed on special resources, or could be routed via the standard interconnect structures from some other input or logic block. The fast carry logic is a special resource provided in the cell to speed up carry-based computations, such as addition, parity, wide AND operations, and other functions. These resources will bypass the general routing structure, connecting instead directly between neighbors in the same column. Since there are very few routing choices in the carry chain, and thus less delay on the computation, the inclusion of these resources can significantly speed up carry-based computations.
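A minimal software model of such a carry chain is shown below: each cell produces a sum bit and hands its carry directly to its neighbor, which in an FPGA would travel over the dedicated carry wiring rather than the general routing. The code illustrates the principle only and does not correspond to any particular device's carry logic.

```c
#include <stdio.h>
#include <stdint.h>

/* One logic cell's contribution to an addition: a sum bit out, and a
   carry passed straight to the neighboring cell. */
static void cell_add(int a, int b, int carry_in, int *sum, int *carry_out) {
    *sum       = a ^ b ^ carry_in;
    *carry_out = (a & b) | (carry_in & (a ^ b));
}

/* An N-bit adder built from a column of such cells; the carry hops
   along the chain instead of through the routing fabric. */
static uint32_t ripple_add(uint32_t a, uint32_t b, int n_bits) {
    uint32_t result = 0;
    int carry = 0;
    for (int i = 0; i < n_bits; i++) {
        int s;
        cell_add((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
        result |= (uint32_t)s << i;
    }
    return result;
}

int main(void) {
    printf("200 + 100 = %u (mod 256)\n", (unsigned)ripple_add(200, 100, 8));
    return 0;
}
```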


Just as there has been a great deal of experimentation in FPGA logic block architectures, there has been equally as much investigation into interconnect structures. As logic blocks have basically standardized on LUT-based structures, routing resources have become primarily island-style, with logic surrounded by general routing channels.

Fig. 5. A generic island-style FPGA routing architecture.

Most FPGA architectures organize their routing structures as a relatively smooth sea of routing resources, allowing fast and efficient communication along the rows and columns of logic blocks. As shown in Figure 5, the logic blocks are embedded in a general routing structure, with input and output signals attaching to the routing fabric through connection blocks. The connection blocks provide programmable multiplexers, selecting which of the signals in the given routing channel will be connected to the logic block's terminals. These blocks also connect shorter local wires to longer-distance routing resources. Signals flow from the logic block into the connection block, and then along longer wires within the routing channels. At the switchboxes, there are connections between the horizontal and vertical routing resources to allow signals to change their routing direction. Once the signal has traversed through routing resources and intervening switchboxes, it arrives at the destination logic block through one of its local connection blocks. In this manner, relatively arbitrary interconnections can be achieved between the logic blocks in the system.

Within a given routing channel, there may be a number of different lengths of routing resources. Some local interconnections may only move between adjacent logic blocks (carry chains are a good example of this), providing high-speed local interconnect. Medium length lines may run the width of several logic blocks, providing for some longer distance interconnect. Finally, longlines that run the entire chip width or height may provide for more global signals. Also, many architectures contain special "global lines" that provide high-speed, and often low-skew, connections to all of the logic blocks in the array. These are primarily used for clocks, resets, and other truly global signals.

While the routing architecture of an FPGA is typically quite complex—the connection blocks and switchboxes surrounding a single logic block typically have thousands of programming points—they are designed to be able to support fairly arbitrary interconnection patterns. Most users ignore the exact details of these architectures and allow the automatic physical design tools to choose appropriate resources to use in order to achieve a given interconnect pattern.

3.3. Logic Block Granularity

Most reconfigurable hardware is based upon a set of computation structures that are repeated to form an array. These structures, commonly called logic blocks or cells, vary in complexity from a very small and simple block that can calculate a function of only three inputs, to a structure that is essentially a 4-bit ALU. Some of these block types are configurable, in that the actual operation is determined by a set of loaded configuration data. Other blocks are fixed structures, and the configurability lies in the connections between them. The size and complexity of the basic computing blocks is referred to as the block's granularity.


Fig. 6. The functional unit from a Xilinx 6200 cell [Xilinx 1996].

An example of a very fine-grained logic block can be found in the Xilinx 6200 series of FPGAs [Xilinx 1996]. The functional unit from one of these cells, as shown in Figure 6, can implement any two-input function and some three-input functions. However, although this type of architecture is useful for very fine-grained bit manipulation, it can be too fine-grained to efficiently implement many types of circuits, such as multipliers. Similarly, finite state machines are frequently too complex to easily map to a reasonable number of very fine-grained logic blocks. However, finite state machines are also too dependent upon single bit values to be efficiently implemented in a very coarse-grained architecture. This type of circuit is more suited to an architecture that provides more connections and computational power per logic block, yet still provides sufficient capability for bit-level manipulation.

The logic cell in the Altera FLEX 10K architecture [Altera 1998] is a fine-grained structure that is somewhat coarser than the 6200. This architecture mainly consists of a single 4-input LUT with a flip-flop. Additionally, there is specialized carry-chain circuitry that helps to accelerate addition, parity, and other operations that use a carry chain. These types of logic blocks are useful for fine-grained bit-level manipulation of data, as can frequently be found in encryption and image processing applications. Also, because the cells are fine-grained, computation structures of arbitrary bit widths can be created. This can be useful for implementing datapath circuits that are based on data widths not implemented on the host processor (5 bit multiply, 18 bit addition, etc.). Reconfigurable hardware can not only take advantage of small bit widths, but also large data widths. When a program uses bit widths in excess of what is normally available in a host processor, the processor must perform the computations using a number of extra steps in order to handle the full data width. A fine-grained architecture would be able to implement the full bit width in a single step, without the fetching, decoding, and execution of additional instructions, as long as enough logic cells are available.

A number of reconfigurable systems use a granularity of logic block that we categorize as medium-grained [Xilinx 1994; Hauser and Wawrzynek 1997; Haynes and Cheung 1998; Lucent 1998; Marshall et al. 1999]. For example, Garp [Hauser and Wawrzynek 1997] is designed to perform a number of different operations on up to four 2-bit inputs. Another medium-grained structure was designed specifically to be embedded inside of a general-purpose FPGA to implement multipliers of a configurable bit width [Haynes and Cheung 1998]. The logic block used in the multiplier FPGA is capable of implementing a 4 x 4 multiplication, or can be cascaded into larger structures. The CHESS architecture [Marshall et al. 1999] also operates on 4-bit values, with each of its cells acting as a 4-bit ALU. Medium-grained logic blocks may be used to implement datapath circuits of varying bit widths, similar to the fine-grained structures. However, with the ability to perform more complex operations on a greater number of inputs, this type of structure can be used efficiently to implement a wider variety of operations.
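To make the bit-width argument concrete, the sketch below emulates a 48-bit addition the way a 16-bit processor would have to, in three dependent add-with-carry steps, whereas reconfigurable logic of suitable granularity could simply instantiate a 48-bit adder and perform the operation in a single step. The 48-bit and 16-bit widths are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* A 48-bit value held as three 16-bit words, w[0] being the low word,
   as a 16-bit processor would have to store it. */
typedef struct { uint16_t w[3]; } uint48;

static uint48 add48(uint48 a, uint48 b) {
    uint48 r;
    uint32_t carry = 0;
    for (int i = 0; i < 3; i++) {           /* three adds instead of one */
        uint32_t t = (uint32_t)a.w[i] + b.w[i] + carry;
        r.w[i] = (uint16_t)t;
        carry  = t >> 16;                   /* explicit carry propagation */
    }
    return r;
}

int main(void) {
    uint48 a = { { 0xFFFF, 0xFFFF, 0x0000 } };  /* 0x0000FFFFFFFF */
    uint48 b = { { 0x0001, 0x0000, 0x0000 } };  /* 1              */
    uint48 s = add48(a, b);
    printf("%04x %04x %04x\n",                   /* expect 0001 0000 0000 */
           (unsigned)s.w[2], (unsigned)s.w[1], (unsigned)s.w[0]);
    return 0;
}
```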


Fig. 7. One cell in the RaPiD-I reconfigurable architecture [Ebeling et al. 1996]. The registers, RAM, ALUs, and multiplier all operate on 16-bit values. The multiplier outputs a 32-bit result, split into the high 16 bits and the low 16 bits. All routing lines shown are 16-bit wide busses. The short parallel lines on the busses represent configurable bus connectors.

Very coarse-grained architectures are primarily intended for the implementation of word-width datapath circuits. Because the logic blocks used are optimized for large computations, they will perform these operations much more quickly (and consume less chip area) than a set of smaller cells connected to form the same type of structure. However, because their composition is static, they are unable to leverage optimizations in the size of operands. For example, the RaPiD architecture [Ebeling et al. 1996], shown in Figure 7, as well as the Chameleon architecture [Chameleon 2000], are examples of this very coarse-grained type of design. Each of these architectures is composed of word-sized adders, multipliers, and registers. If only three 1-bit values are required, then the use of these architectures suffers an unnecessary area and speed overhead, as all of the bits in the full word size are computed. However, these coarse-grained architectures can be much more efficient than fine-grained architectures for implementing functions closer to their basic word size.

An alternate form of a coarse-grained system is one in which the logic blocks are actually very small processors, potentially each with its own instruction memory and/or data values. The REMARC architecture [Miyamori and Olukotun 1998] is composed of an 8 x 8 array of 16-bit processors. Each of these processors uses its own instruction memory in conjunction with a global program counter. This style of architecture closely resembles a single-chip multiprocessor, although with much simpler component processors because the system is intended to be coupled with a host processor. The RAW project [Moritz et al. 1998] is a further example of a reconfigurable architecture based on a multiprocessor design.

The granularity of the FPGA also has a potential effect on the reconfiguration time of the device. This is an important issue for run-time reconfiguration, which is discussed in further depth in a later section. A fine-grained array has many configuration points to perform very small computations, and thus requires more data bits during configuration.

3.4. Heterogeneous Arrays

In order to provide greater performance or flexibility in computation, some reconfigurable systems provide a heterogeneous structure, where the capabilities of the logic cells are not the same throughout the system. One use of heterogeneity in reconfigurable systems is to provide multiplier function blocks embedded within the reconfigurable hardware [Haynes and Cheung 1998; Chameleon 2000; Xilinx 2001]. Because multiplication is one of the more difficult computations to implement efficiently in a traditional FPGA structure, the custom multiplication hardware embedded within a reconfigurable array allows a system to perform even that function well.
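The sketch below models a generic word-width cell of this coarse-grained kind: a 16-bit functional unit whose operation is selected by a small configuration field rather than built from per-bit lookup tables. The operation encoding is invented for illustration and does not reproduce RaPiD, REMARC, Chameleon, or any embedded multiplier block.

```c
#include <stdio.h>
#include <stdint.h>

/* A hypothetical word-width cell: the "configuration" is just the
   choice of operation, applied to full 16-bit operands. */
typedef enum { OP_ADD, OP_SUB, OP_MUL_LO, OP_AND } CellOp;

typedef struct { CellOp op; } CoarseCell;

static uint16_t cell_eval(const CoarseCell *c, uint16_t a, uint16_t b) {
    switch (c->op) {
    case OP_ADD:    return (uint16_t)(a + b);
    case OP_SUB:    return (uint16_t)(a - b);
    case OP_MUL_LO: return (uint16_t)(a * b);  /* low half of the product */
    case OP_AND:    return (uint16_t)(a & b);
    }
    return 0;
}

int main(void) {
    CoarseCell adder = { OP_ADD }, mult = { OP_MUL_LO };
    /* Full 16-bit operations in one cell; a 1-bit operand would still
       occupy (and pay for) the whole 16-bit datapath. */
    printf("add: %d  mul: %d\n",
           (int)cell_eval(&adder, 1200, 345),
           (int)cell_eval(&mult, 250, 200));
    return 0;
}
```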


Another use of heterogeneous structures is to provide embedded memory blocks scattered throughout the reconfigurable hardware. This allows storage of frequently used data and variables, and allows for quick access to these values due to the proximity of the memory to the logic blocks that access it. Memory structures embedded into the reconfigurable fabric come in two forms. The first is simply the use of available LUTs as RAM structures, as can be done in the Xilinx 4000 series [Xilinx 1994] and Virtex [Xilinx 1999] FPGAs. Although making these very small blocks into a larger RAM structure introduces overhead to the memory system, it does provide local, variable width memory structures.

Some architectures include dedicated memory blocks within their array, such as the Xilinx Virtex series [Xilinx 1999, 2001] and Altera [Altera 1998] FPGAs, as well as the CS2000 RCP (reconfigurable communications processor) device from Chameleon Systems, Inc. [Chameleon 2000]. These memory blocks have greater performance in large sizes than similar-sized structures built from many small LUTs. While these structures are somewhat less flexible than the LUT-based memories, they can also provide some customization. For example, the Altera FLEX 10K FPGA [Altera 1998] provides embedded memories that have a limited total number of wires, but allow a trade-off between the number of address lines and the data bit width.

When embedded memories are not used for data storage by a particular configuration, the area that they occupy does not necessarily have to be wasted. By using the address lines of the memory as function inputs and the values stored in the memory as function outputs, logical expressions of a large number of inputs can be emulated [Altera 1998; Cong and Xu 1998; Wilton 1998; Heile and Leaver 1999]. In fact, because there may be more than one value output from the memory on a read operation, the memory structure may be able to perform multiple different computations (one for each bit of data output), provided that all necessary inputs appear on the address lines. In this manner, the embedded RAM behaves the same as a very large LUT. Therefore, embedded memory allows a programmer or a synthesis tool to perform a trade-off between logic and memory usage in order to achieve higher area efficiency.
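The idea is easy to mimic in software, as the sketch below does with a hypothetical 1K x 4 memory block: the ten address lines serve as function inputs and each of the four data bits implements a separate Boolean function, all evaluated by a single read. The particular functions programmed into the table are arbitrary examples.

```c
#include <stdio.h>
#include <stdint.h>

#define ADDR_BITS 10
#define ENTRIES   (1 << ADDR_BITS)

static uint8_t ram[ENTRIES];   /* 4 data bits used per entry */

static int popcount(int v) { int n = 0; while (v) { n += v & 1; v >>= 1; } return n; }

/* "Configuration time": fill the memory with the truth tables of four
   different 10-input functions, one per data bit. */
static void program_ram(void) {
    for (int a = 0; a < ENTRIES; a++) {
        int parity   = popcount(a) & 1;                  /* bit 0 */
        int majority = popcount(a) > ADDR_BITS / 2;      /* bit 1 */
        int in_range = (a >= 100 && a < 200);            /* bit 2 */
        int all_ones = (a == ENTRIES - 1);               /* bit 3: wide AND */
        ram[a] = (uint8_t)(parity | (majority << 1) | (in_range << 2) | (all_ones << 3));
    }
}

int main(void) {
    program_ram();
    int inputs = 0x3FF;                  /* all ten inputs high           */
    uint8_t out = ram[inputs];           /* one read evaluates all four   */
    printf("parity=%d majority=%d in_range=%d all_ones=%d\n",
           out & 1, (out >> 1) & 1, (out >> 2) & 1, (out >> 3) & 1);
    return 0;
}
```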


Furthermore, a few of the commercial FPGA companies have announced plans to include entire microprocessors as embedded structures within their FPGAs. Altera has demonstrated a preliminary ARM9-based Excalibur device, which combines reconfigurable hardware with an embedded ARM9 processor core [Altera 2001]. Meanwhile, Xilinx is working with IBM to include a PowerPC processor core within the Virtex-II FPGA [Xilinx 2000]. By contrast, Adaptive Silicon's focus is to provide reconfigurable logic cores to customers for embedding in their own system-on-a-chip (SoC) devices [Adaptive 2001].

3.5. Routing Resources

Interconnect resources are provided in a reconfigurable architecture to connect together the device's programmable logic elements. These resources are usually configurable, where the path of a signal is determined at compile or run-time rather than fabrication time. This flexible interconnect between logic blocks or computational elements allows for a wide variety of circuit structures, each with their own interconnect requirements, to be mapped to the reconfigurable hardware. For example, the routing for FPGAs is generally island-style, with logic surrounded by routing channels, which contain several wires, potentially of varying lengths. Within this type of routing architecture, however, there are still variations. Some of these differences include the ratio of wires to logic in the system, how long each of the wires should be, and whether they should be connected in a segmented or hierarchical manner.

A step in the design of efficient routing structures for FPGAs and reconfigurable systems therefore involves examining the logic vs. routing area trade-off within reconfigurable architectures. One group has argued that the interconnect should constitute a much higher proportion of area in order to allow for successful routing under high-logic utilization conditions [Takahara et al. 1998]. However, for FPGAs, high-LUT utilization may not necessarily be the most desirable situation, but rather efficient routing usage may be of more importance [DeHon 1999]. This is because the routing resources occupy a much larger part of the area of an FPGA than the logic resources, and therefore the most area efficient designs will be those that optimize their use of the routing resources rather than the logic resources. The amount of required routing does not grow linearly with the amount of logic present; therefore, larger devices require even greater amounts of routing per logic block than small ones [Trimberger et al. 1997b].

Fig. 8. Segmented (left) and hierarchical (right) routing structures. The white boxes are logic blocks, while the dark boxes are connection switches.

There are two primary methods to provide both local and global routing resources, as shown in Figure 8. The first is the use of segmented routing [Betz and Rose 1999; Chow et al. 1999a]. In segmented routing, short wires accommodate local communications traffic. These short wires can be connected together using switchboxes to emulate longer wires. Frequently, segmented routing structures also contain longer wires to allow signals to travel efficiently over long distances without passing through a great number of switches. Hierarchical routing [Aggarwal and Lewis 1994; Lai and Wang 1997; Tsu et al. 1999] is the second method to provide both local and global communication. Routing within a group (or cluster) of logic blocks is at the local level, only connecting within that cluster. At the boundaries of these clusters, however, longer wires connect the different clusters together. This is potentially repeated at a number of levels. The idea behind the use of hierarchical structures is that, provided a good placement has been made onto the hardware, most communication should be local and only a limited amount of communication will traverse long distances. Therefore, the wiring is designed to fit this model, with a greater number of local routing wires in a cluster than distance routing wires between clusters.

Because routing can occupy a large part of the area of a reconfigurable device, the type of routing used must be carefully considered. If the wires available are much longer than what is required to route a signal, the excess wire length is wasted. On the other hand, if the wires available are much shorter than necessary, the signal must pass through switchboxes that connect the short wires together into a longer wire, or through levels of the routing hierarchy. This induces additional delay and slows the overall operation of the circuit. Furthermore, the switchbox circuitry occupies area that might be better used for additional logic or wires.
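The following back-of-the-envelope model illustrates this trade-off for a single connection of length L spanned by segments of length S. The per-span and per-switch delay constants are invented for illustration and are not measurements of any device.

```c
#include <stdio.h>

/* First-order model: a connection of length L (in logic-block spans)
   routed with segments of length S uses ceil(L/S) segments joined by
   switches.  Longer segments mean fewer slow switch crossings but more
   wasted wire. */
static void estimate(int L, int S) {
    int segments = (L + S - 1) / S;          /* segments needed          */
    int switches = segments - 1;             /* programmable crossings   */
    int wasted   = segments * S - L;         /* unused wire length       */
    double delay = segments * S * 0.1        /* 0.1 units per wire span  */
                 + switches * 1.0;           /* 1.0 units per switch     */
    printf("L=%2d S=%2d: %d segments, %d switches, %2d spans wasted, delay=%.1f\n",
           L, S, segments, switches, wasted, delay);
}

int main(void) {
    estimate(10, 1);    /* many short wires, many switch delays     */
    estimate(10, 4);    /* a compromise                             */
    estimate(10, 16);   /* one long wire, mostly wasted length      */
    return 0;
}
```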


There are a few alternatives to the island-style of routing resources. Systems such as RaPiD [Ebeling et al. 1996] use segmented bus-based routing, where signals are full word-sized in width. This is most common in the one-dimensional type of architecture, as discussed in the next section.

Fig. 9. A traditional two-dimensional island-style routing structure (left) and a one-dimensional routing structure (right). The white boxes represent logic elements.

3.6. One-Dimensional Structures

Most current FPGAs are of the two-dimensional variety, as shown in Figure 9. This allows for a great deal of flexibility, as any signal can be routed on a nearly arbitrary path. However, providing this level of routing flexibility requires a great deal of routing area. It also complicates the placement and routing software, as the software must consider a very large number of possibilities.

One solution is to use a more one-dimensional style of architecture, also depicted in Figure 9. Here, placement is restricted along one axis. With a more limited set of choices, the placement can be performed much more quickly. Routing is also simplified, because it is generally along a single dimension as well, with the other dimension generally only used for calculations requiring a shift operation.

One drawback of the one-dimensional routing is that if there are not enough routing resources in a particular area of a mapped circuit, routing that circuit becomes actually more difficult than on a two-dimensional array that provides more alternatives. A number of different reconfigurable systems have been designed in this manner. Both Garp [Hauser and Wawrzynek 1997] and Chimaera [Hauck et al. 1997] are structures that provide cells that compute a small number of bit positions, and a row of these cells together computes the full data word. A row can only be used by a single configuration, making these designs one dimensional. In this manner, each configuration occupies some number of complete rows. Although multiple narrow-width computations can fit within a single row, these structures are optimized for word-based computations that occupy the entire row. The NAPA architecture [Rupp et al. 1998] is similar, with a full column of cells acting as the atomic unit for a configuration, as is PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

In some systems, the computation blocks in a one-dimensional structure operate on word-width values instead of single bits. Therefore, busses are routed instead of individual values. This also decreases the time required for routing, as the bits of a bus can be considered together rather than as separate routes. As shown previously in Figure 7, RaPiD [Ebeling et al. 1996] is basically a one-dimensional design that only includes word-width processing elements. The different computation units are organized in a single dimension along the horizontal axis. The general flow of information follows this layout, with the major routing busses also laid out in a horizontal manner. Additionally, all routing is of word-sized values, and therefore all routing is of busses, not individual wires. A few vertical resources are included in the architecture to allow signals to transfer between busses, or to travel from a bus to a computation node. However, the majority of the routing in this architecture is one-dimensional.


Fig. 10. Mesh (left) and partial crossbar (right) interconnect topologies for multi-FPGA systems.

3.7. Multi-FPGA Systems

Reconfigurable systems that are composed of multiple FPGA chips interconnected on a single processing board have additional hardware concerns over single-chip systems. In particular, there is a need for an efficient connection scheme between the chips, as well as to external memory and the system bus. This is to provide for circuits that are too large to fit within a single FPGA, but may be partitioned over the multiple FPGAs available. A number of different interconnection schemes have been explored [Butts and Batcheller 1991; Hauck et al. 1998a; Hauck 1998; Khalid 1999] including meshes and crossbars, as shown in Figure 10. A mesh connects the nearest-neighbors in the array of FPGA chips. This allows for efficient communication between the neighbors, but may require that some signals pass through an FPGA simply to create a connection between non-neighbors. Although this can be done, and is quite possible, it uses valuable I/O resources on the FPGA that forms the routing bridge. One system that uses a mesh topology with additional board-level column and row busses is the P1 system developed within the PAM project [Vuillemin et al. 1996]. This architecture uses a central array of 16 commercial FPGAs with connections to nearest-neighbors. However, four 16-bit row busses and four 16-bit column busses run the length of the array and facilitate communication between non-neighbor FPGAs.

A crossbar attempts to remove this problem by using special routing-only chips to connect each FPGA potentially to any other FPGA. The inter-chip delays are more uniform, given that a signal travels the exact same "distance" to get from one FPGA to another, regardless of where those FPGAs are located. However, a crossbar interconnect does not scale easily with an increase in the number of FPGAs. The crossbar pattern of the chips is fixed at fabrication of the multi-FPGA board. Variants on these two basic topologies attempt to remove some of the problems encountered in mesh and crossbar topologies [Arnold et al. 1992; Varghese et al. 1993; Buell et al. 1996; Vuillemin et al. 1996; Lewis et al. 1997; Khalid and Rose 1998]. One of these variants can be found in the Splash 2 system [Arnold et al. 1992; Buell et al. 1996]. The predecessor, Splash 1, used a linear systolic communication method. This type of connection was found to work quite well for a variety of applications. However, this highly constrained communication model made some types of computations difficult or even impossible. Therefore, Splash 2 was designed to include not only the linear connections of Splash 1 that were found to be useful for many applications, but also a crossbar network to allow any FPGA to communicate with any other FPGA on the same board.

For multi-FPGA systems, because of the need for efficient communication between the FPGAs, determining the inter-chip routing topology is a very important step in the design process. More details on multi-FPGA system architectures can be found elsewhere [Hauck 1998b; Khalid 1999].

3.8. Hardware Summary

The design of reconfigurable hardware varies wildly from system to system. The reconfigurable logic may be used as a configurable functional unit, or may be a multi-FPGA stand-alone unit. Within the reconfigurable logic itself, the complexity of the core computational units, or logic blocks, varies from very simple to extremely complex, some implementing a 4-bit ALU or even a 16 x 16 multiplication. These blocks are not required to be uniform throughout the array, as the use of different types of blocks can add high-performance functionality in the case of specialized computation circuitry, or expanded storage in the case of embedded memory blocks. Routing resources also offer a variety of choices, primarily in amount, length, and organization of the wires. Systems have been developed that fit into many different points within this design space, and no true "best" system has yet been agreed upon.

4. SOFTWARE

Although reconfigurable hardware has been shown to have significant performance benefits for some applications, it may be ignored by application programmers unless they are able to easily incorporate its use into their systems. This requires a software design environment that aids in the creation of configurations for the reconfigurable hardware. This software can range from a software assist in manual circuit creation to a complete automated circuit design system. Manual circuit description is a powerful method for the creation of high-quality circuit designs. However, it requires a great deal of background knowledge of the particular reconfigurable system employed, as well as a significant amount of design time. On the other end of the spectrum, an automatic compilation system provides a quick and easy way to program for reconfigurable systems. It therefore makes the use of reconfigurable hardware more accessible to general application programmers, but quality may suffer.

Fig. 11. Three possible design flows for algorithm implementation on a reconfigurable system. Grey stages indicate manual effort on the part of the designer, while white stages are done automatically. The dotted lines represent paths to improve the resulting circuit. It should be noted that the middle design cycle is only one of the possible compromises between automatic and manual design.

Both for manual and automatic circuit creation, the design process proceeds through a number of distinct phases, as indicated in Figure 11. Circuit specification is the process of describing the functions that are to be placed on the reconfigurable hardware. This can be done as simply as by writing a program in C that represents the functionality of the algorithm to be implemented in hardware. On the other hand, this can also be as complex as specifying the inputs, outputs, and operation of each basic building block in the reconfigurable system. Between these two methods is the specification of the circuit using generic complex components, such as adders and multipliers, which will be mapped to the actual hardware later in the design process.

For descriptions in a high-level language (HLL), such as C/C++ or Java, or ones using complex building blocks, this code must be compiled into a netlist of gate-level components. For the HLL implementations, this involves generating computational components to perform the arithmetic and logic operations within the program, and separate structures to handle the program control, such as loop iterations and branching operations. Given a structural description, either generated from a HLL or specified by the user, each complex structure is replaced with a network of the basic gates that perform that function.
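The sketch below is an example of the kind of C-level specification such a system might start from; the kernel is our own and is chosen only because it exhibits both kinds of structure, arithmetic that would become datapath components and a loop with branches that would become control logic.

```c
#include <stdint.h>
#include <stdio.h>

#define TAPS 4

/* A small C kernel of the sort an automatic compilation system might
   accept as a circuit specification. */
int32_t fir_sample(const int16_t x[TAPS], const int16_t h[TAPS]) {
    int32_t acc = 0;
    for (int i = 0; i < TAPS; i++) {    /* loop control -> state machine   */
        acc += (int32_t)x[i] * h[i];    /* multiply-accumulate -> datapath */
    }
    if (acc > 32767)  acc = 32767;      /* branches -> control logic       */
    if (acc < -32768) acc = -32768;
    return acc;
}

int main(void) {
    int16_t x[TAPS] = { 100, 200, 300, 400 };
    int16_t h[TAPS] = {   1,   2,   3,   4 };
    printf("%d\n", (int)fir_sample(x, h));   /* 100+400+900+1600 = 3000 */
    return 0;
}
```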

ACM Computing Surveys, Vol. 34, No. 2, June 2002. Reconfigurable Computing 187

ping stage may also consider using these memories as logic units when they are not being used for data storage. The memories act as very large LUTs, where the number of inputs is equal to the number of address lines. In order to use these memories as Fig. 12. A wide function implemented with multiple logic, the mapping software must analyze LUTs. how much of the memory blocks are actu- ally used as storage in a given mapping. It generating computational components to must then determine which are available perform the arithmetic and logic opera- in order to implement logic, and what part tions within the program, and separate or parts of the circuit are best mapped to structures to handle the program control, the memory [Cong and Xu 1998; Wilton such as loop iterations and branching op- 1998]. erations. Given a structural description, After the circuit has been mapped, the either generated from a HLL or specified resulting blocks must be placed onto the by the user, each complex structure is re- reconfigurable hardware. Each of these placed with a network of the basic gates blocks is assigned to a specific location that perform that function. within the hardware, hopefully close to Once a detailed gate- or element-level the other logic blocks with which it com- description of the circuit has been created, municates. As FPGA capacities increase, these structures must be translated to the the placement phase of circuit mapping actual logic elements of the reconfigurable becomes more and more time consuming. hardware. This stage is known as tech- Floorplanning is a technique that can nology mapping, and is dependent upon be used to alleviate some of this cost. the exact target architecture. For a LUT- A floorplanning algorithm first partitions based architecture, this stage partitions the logic cells into clusters, where cells the circuit into a number of small subfunc- with a large amount of communication tions, each of which can be mapped to a are grouped together. These clusters are single LUT [Brown et al. 1992a; Abouzeid then placed as units onto regions of the et al. 1993; Sangiovanni-Vincentelli et al. reconfigurable hardware. Once this global 1993; Hwang et al. 1994; Chang et al. placement is complete, the actual place- 1996; Hauck and Agarwal 1996; Yi and ment algorithm performs detailed place- Jhon 1996; Chowdhary and Hayes 1997; ment of the individual logic blocks within Lin et al. 1997; Cong and Wu 1998; Pan the boundaries assigned to the cluster and Lin 1998; Togawa et al. 1998; Cong [Sankar and Rose 1999]. et al. 1999]. Some architectures, such as The use of a floorplanning tool is par- the Xilinx 4000 series [Xilinx 1994], con- ticularly helpful for situations where the tain multiple LUTs per logic cell. These circuit structure being mapped is of a dat- LUTs can be used either separately to gen- apath type. Large computational compo- erate small functions, or together to gen- nents or macros that are found in datapath erate some wider-input functions [Inuani circuits are frequently composed of highly and Saul 1997; Cong and Hwang 1998]. regular logic. These structures are placed By taking advantage of multiple LUTs and as entire units, and their component cells the internal routing within a single logic are restricted to the floorplanned location cell, functions with more inputs than can [Shi and Bhatia 1997; Emmert and Bhatia be implemented using a single LUT can 1999]. 
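As a concrete illustration of this LUT-based mapping (independent of any particular tool or device), the C sketch below models a 4-input LUT as a 16-entry truth table and builds a 5-input function from two such LUTs plus a select input, in the spirit of Figure 12. The function chosen and all names are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* A 4-input LUT is just a 16-entry truth table: bit i of 'config'
 * holds the output for input pattern i.                           */
static int lut4(uint16_t config, int a, int b, int c, int d) {
    int index = (d << 3) | (c << 2) | (b << 1) | a;
    return (config >> index) & 1;
}

/* A 5-input function f(a,b,c,d,e) mapped onto two 4-LUTs: one LUT
 * evaluates the e=0 cofactor, the other the e=1 cofactor, and e
 * selects between them (a Shannon decomposition).                 */
static int wide_function(uint16_t cfg_e0, uint16_t cfg_e1,
                         int a, int b, int c, int d, int e) {
    return e ? lut4(cfg_e1, a, b, c, d) : lut4(cfg_e0, a, b, c, d);
}

int main(void) {
    /* Example: f = (a AND b) XOR (c OR d) XOR e.  The two 16-bit
     * configurations are the truth tables with e fixed to 0 and 1. */
    uint16_t cfg_e0 = 0, cfg_e1 = 0;
    for (int i = 0; i < 16; i++) {
        int a = i & 1, b = (i >> 1) & 1, c = (i >> 2) & 1, d = (i >> 3) & 1;
        int base = (a & b) ^ (c | d);
        cfg_e0 |= (uint16_t)((base ^ 0) << i);
        cfg_e1 |= (uint16_t)((base ^ 1) << i);
    }
    printf("f(1,1,0,0,1) = %d\n", wide_function(cfg_e0, cfg_e1, 1, 1, 0, 0, 1));
    return 0;
}
```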
For reconfigurable structures that include embedded memory blocks, the mapping stage may also consider using these memories as logic units when they are not being used for data storage. The memories act as very large LUTs, where the number of inputs is equal to the number of address lines. In order to use these memories as logic, the mapping software must analyze how much of the memory blocks are actually used as storage in a given mapping. It must then determine which are available in order to implement logic, and what part or parts of the circuit are best mapped to the memory [Cong and Xu 1998; Wilton 1998].

After the circuit has been mapped, the resulting blocks must be placed onto the reconfigurable hardware. Each of these blocks is assigned to a specific location within the hardware, hopefully close to the other logic blocks with which it communicates. As FPGA capacities increase, the placement phase of circuit mapping becomes more and more time consuming. Floorplanning is a technique that can be used to alleviate some of this cost. A floorplanning algorithm first partitions the logic cells into clusters, where cells with a large amount of communication are grouped together. These clusters are then placed as units onto regions of the reconfigurable hardware. Once this global placement is complete, the actual placement algorithm performs detailed placement of the individual logic blocks within the boundaries assigned to the cluster [Sankar and Rose 1999].

The use of a floorplanning tool is particularly helpful for situations where the circuit structure being mapped is of a datapath type. Large computational components or macros that are found in datapath circuits are frequently composed of highly regular logic. These structures are placed as entire units, and their component cells are restricted to the floorplanned location [Shi and Bhatia 1997; Emmert and Bhatia 1999]. This encourages the placer to find a very regular placement of these logic cells, resulting in a higher performance layout of the circuit. Another technique for the mapping and placement of datapath elements is to perform both of these steps simultaneously [Callahan et al. 1998]. This method also exploits the regularity of the datapath elements to generate mappings and placements quickly and efficiently.


Floorplanning is also important when dealing with hierarchically structured reconfigurable designs. In these architectures, the available resources have been grouped by the logic or routing hierarchy of the hardware. Because performance is best when routing lengths are minimized, the cells to be placed should be grouped such that cells that require a great deal of communication or which are on a critical path are placed together within a logic cluster on the hardware [Krupnova et al. 1997; Senouci et al. 1998].

After floorplanning, the individual logic blocks are placed into specific logic cells. One algorithm that is commonly used is the simulated annealing technique [Shahookar and Mazumder 1991; Betz and Rose 1997; Sankar and Rose 1999]. This method takes an initial placement of the system, which can be generated (pseudo-)randomly, and performs a series of "moves" on that layout. A move is simply the changing of the location of a single logic cell, or the exchanging of locations of two logic cells. These moves are attempted one at a time using random target locations. If a move improves the layout, then the layout is changed to reflect that move. If a move is considered to be undesirable, then it is only accepted a small percentage of the time. Accepting a few "bad" moves helps to avoid any local minima in the placement space. Other algorithms exist that are not based on random movements [Gehring and Ludwig 1996], although these search a smaller area of the placement space for a solution, and therefore may be unable to find a solution that meets performance requirements if a design uses a high percentage of the reconfigurable resources.
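The annealing loop just described can be sketched in a few dozen lines of C. The cost function, move set, and cooling schedule below are simplified stand-ins (a toy chain netlist scored by Manhattan wirelength), not the internals of any particular placement tool.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NCELLS 64
#define GRID   16

typedef struct { int x, y; } Loc;

/* Toy netlist: cell i is connected to cell i+1.  The cost is total
 * Manhattan wirelength; a real placer models real nets, congestion
 * and timing.                                                        */
static double placement_cost(const Loc p[NCELLS]) {
    double cost = 0;
    for (int i = 0; i + 1 < NCELLS; i++)
        cost += abs(p[i].x - p[i+1].x) + abs(p[i].y - p[i+1].y);
    return cost;
}

static void anneal(Loc place[NCELLS]) {
    double temp = 10.0;                      /* initial temperature    */
    double cost = placement_cost(place);
    while (temp > 0.01) {
        for (int m = 0; m < 2000; m++) {
            int c = rand() % NCELLS;         /* pick a cell to move    */
            Loc old = place[c];
            place[c].x = rand() % GRID;      /* random target location */
            place[c].y = rand() % GRID;
            double delta = placement_cost(place) - cost;
            /* Always accept improvements; accept a few "bad" moves with
             * a probability that shrinks as the temperature drops, so
             * the search can climb out of local minima.               */
            if (delta <= 0 || (double)rand() / RAND_MAX < exp(-delta / temp))
                cost += delta;
            else
                place[c] = old;              /* undo the rejected move */
        }
        temp *= 0.9;                         /* cooling schedule       */
    }
}

int main(void) {
    Loc place[NCELLS];
    for (int i = 0; i < NCELLS; i++) { place[i].x = rand() % GRID; place[i].y = rand() % GRID; }
    printf("initial cost: %.0f\n", placement_cost(place));
    anneal(place);
    printf("final cost:   %.0f\n", placement_cost(place));
    return 0;
}
```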
Finally, the different reconfigurable components comprising the application circuit are connected during the routing stage. Particular signals are assigned to specific portions of the routing resources of the reconfigurable hardware. This can become difficult if the placement causes many connected components to be placed far from one another, as the signals that travel long distances use more routing resources than those that travel shorter ones. A good placement is therefore essential to the routing process. One of the challenges in routing for FPGAs and reconfigurable systems is that the available routing resources are limited. In general hardware design, the goal is to minimize the number of routing tracks used in a channel between rows of computation units, but the channels can be made as wide as necessary. In reconfigurable systems, however, the number of available routing tracks is determined at fabrication time, and therefore the routing software must perform within these boundaries. Thus, FPGA routing concentrates on minimizing congestion within the available tracks [Brown et al. 1992b; McMurchie and Ebeling 1995; Alexander and Robins 1996; Chan and Schlag 1997; Lee and Wu 1997; Thakur et al. 1997; Wu and Marek-Sadowska 1997; Swartz et al. 1998; Nam et al. 1999]. Because routing is one of the more time-intensive portions of the design cycle, it can be helpful to determine if a placed circuit can be routed before actually performing the routing step. This quickly informs the designer if changes need to be made to the layout or a larger reconfigurable structure is required [Wood and Rutenbar 1997; Swartz et al. 1998].

Each of the design phases mentioned above may be implemented either manually or automatically using compiler tools. The operation of some of these individual steps is described in greater depth in the following sections.

4.1. Hardware-Software Partitioning

For systems that include both reconfigurable hardware and a traditional microprocessor, the program must first be partitioned into sections to be executed on the reconfigurable hardware and sections to be executed in software on the microprocessor. In general, complex control sequences such as variable-length loops are more efficiently implemented in software, while fixed datapath operations may be more efficiently executed in hardware.

Most compilers presented for reconfigurable systems generate only the hardware configuration for the system, rather than both hardware and software. In some cases, this is because the reconfigurable hardware may not be coupled with a host processor, so only a hardware configuration is necessary. For cases where reconfigurable hardware does operate alongside a host microprocessor, some systems currently require that the hardware compilation be performed separately from the software compilation, and special functions are called from within the software in order to configure and control the reconfigurable hardware. However, this requires effort on the part of the designer to identify the sections that should be mapped to hardware, and to translate these into special hardware functions. In order to make the use of the reconfigurable hardware transparent to the designer, the partitioning and programming of the hardware should occur simultaneously in a single programming environment.

For compilers that manage both the hardware and software aspects of application design, the hardware/software partitioning can be performed either manually, or automatically by the compiler itself. When the partitioning is performed by the programmer, compiler directives are used to mark sections of program code for hardware compilation. The NAPA C language [Gokhale and Stone 1998] provides pragma statements to allow a programmer to specify whether a section of code is to be executed in software on the Fixed Instruction Processor (FIP), or in hardware on the Adaptive Logic Processor (ALP). Cardoso and Neto [1999] present another compiler that requires the user to specify (using information gained through the use of profiling tools) which areas of code to map to the reconfigurable hardware.

Alternately, the hardware/software partitioning can be done automatically [Chichkov and Almeida 1997; Kress et al. 1997; Callahan et al. 2000; Li et al. 2000a]. In this case, the compiler will use cost functions based upon the amount of acceleration gained through the execution of a code fragment in hardware to determine whether the cost of configuration is overcome by the benefits of hardware execution.
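A minimal sketch of such a cost-function-based decision is shown below, assuming per-fragment profile estimates of software time, hardware time, configuration time, and invocation count; the structure fields and numbers are purely illustrative rather than drawn from any of the cited compilers.

```c
#include <stdio.h>

/* Profile data for one candidate code fragment (all times in cycles).
 * The fields and the simple cost model are illustrative; real
 * partitioners use richer estimates of area, bandwidth and timing.  */
typedef struct {
    const char *name;
    double sw_time;       /* software execution time per invocation   */
    double hw_time;       /* hardware execution time per invocation   */
    double config_time;   /* time to load the configuration           */
    long   invocations;   /* how often the fragment runs              */
} Fragment;

/* Map the fragment to hardware only if the total hardware cost,
 * including configuration overhead, beats software execution.       */
static int map_to_hardware(const Fragment *f) {
    double sw_total = f->sw_time * f->invocations;
    double hw_total = f->hw_time * f->invocations + f->config_time;
    return hw_total < sw_total;
}

int main(void) {
    Fragment loops[] = {
        { "fir_filter", 900.0, 40.0, 2.0e6, 100000 },
        { "init_table", 500.0, 30.0, 2.0e6,     10 },
    };
    for (int i = 0; i < 2; i++)
        printf("%s -> %s\n", loops[i].name,
               map_to_hardware(&loops[i]) ? "hardware" : "software");
    return 0;
}
```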
4.2. Circuit Specification

In order to use the reconfigurable hardware, designers must somehow be able to specify the operation of their custom circuits. Before high-level compilation tools are developed for a specific reconfigurable system, this is done through hand mapping of the circuit, where the designer specifies the operation of the components in the configurable system directly. Here, the designers utilize the basic building blocks of the reconfigurable system to create the desired circuit. This style of circuit specification is primarily useful only when a software front-end for circuit design is unavailable, or for the design of small circuits or circuits with very high performance requirements. This is due to the great amount of time involved in manual circuit creation. However, for circuits that can be reasonably hand mapped, this provides potentially the smallest and fastest implementation.

Because not all designers can be intimately familiar with every reconfigurable architecture, some design tools abstract the specifics of the target architecture. Creating a circuit using a structural design language involves describing a circuit using building blocks such as gates, flip-flops and latches [Bellows and Hutchings 1998; Gehring and Ludwig 1998; Hutchings et al. 1999]. The compiler then maps these modules to one or more basic components of the architecture of the reconfigurable system. Structural VHDL is one example of this type of programming, and commercial tools are available for compiling from this language into vendor-specific FPGAs [Synplicity 1999].

However, these two methods require that the designer possess either an intimate knowledge of the targeted reconfigurable hardware, or at least a working knowledge of the concepts involved in hardware design.

In order to allow a greater number of software developers to take advantage of reconfigurable computing, tools that allow for behavioral circuit descriptions are being developed. These systems trade some area and performance quality for greater flexibility and ease of use.

Behavioral circuit design is similar to software design because the designer indicates the steps a hardware subsystem must go through in order to perform the desired computation rather than the actual composition of the circuit. These behavioral descriptions can be either in a generic hardware description language such as VHDL or Verilog, or a general-purpose high-level language such as C/C++ or Java. The eventual goal of this type of compilation is to allow users to write programs in commonly used languages that compile equally well, without modification, to both a traditional software executable and to an executable which leverages reconfigurable hardware. Working in this direction, Transmogrifier C [Galloway 1995] allows a subset of the C language to be used to describe hardware circuits. While multiplication, division, pointers, arrays, and a few other C language specifics are not supported, this system provides a behavioral method of circuit description using a primitive form of the C language. Similarly, the C++ programming environment used for the P1 system [Vuillemin et al. 1996] provides a hybrid method of description, using a combination of behavioral and structural design. Synopsys' CoCentric compiler [Synopsys 2000], which can be targeted to the Xilinx Virtex series of FPGA, uses SystemC to provide for behavioral compilation of C/C++ with the assistance of a set of additional hardware-defining classes. Other compilers, such as Nimble [Li et al. 2000a] and the Garp compiler [Callahan et al. 2000], are fully behavioral C compilers, handling the full set of the ANSI C language.

Although behavioral description, and HLL description in particular, provides a convenient method for the programming of reconfigurable systems, it does suffer from the drawback that it tends to produce larger and slower designs than those generated by a structural description or hand-mapping. Behavioral descriptions can leave many aspects of the circuit unspecified. For example, a compiler that encounters a while loop must generate complicated control structures in order to allow for an unspecified number of iterations. Also, in many HLL implementations, optimizations based upon the bit width of operands cannot be performed. The compiler is generally unaware of any application-specific limitations on the operand size; it only sees the programmer's choice of data format in the program. Problems such as these might be solved through additional programmer effort to replace while loops whenever possible with for loops, and to use compiler directives to indicate exact sizes of operands [Galloway 1995; Gokhale and Stone 1998]. This method of hardware design falls between structural description and behavioral description in complexity, because although the programmers do not need to know a great deal about hardware design, they are required to follow additional guidelines that are not required for software-only implementations.
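The following C fragment illustrates these guidelines side by side: a data-dependent while loop that forces the compiler to build general control logic, and an equivalent computation written with a compile-time trip count and explicitly narrow operand types. Real width-declaration directives are tool-specific; fixed-width integer types merely stand in for them here.

```c
#include <stdint.h>
#include <stdio.h>

#define N 64   /* trip count fixed at compile time */

/* Behavioral style that is harder to map: the trip count 'len' is only
 * known at run-time, so the hardware compiler must build control logic
 * for an unspecified number of iterations, and 'int' says nothing
 * about how wide the datapath really needs to be.                     */
static int count_below_while(const int *data, int len, int threshold) {
    int i = 0, count = 0;
    while (i < len) {
        if (data[i] < threshold)
            count++;
        i++;
    }
    return count;
}

/* Hardware-friendlier style: a fixed trip count that can be fully
 * unrolled, and explicitly narrow operands (12-bit samples held in
 * uint16_t, an 8-bit counter).                                        */
static uint8_t count_below_for(const uint16_t samples[N], uint16_t threshold) {
    uint8_t count = 0;
    for (int i = 0; i < N; i++)
        if (samples[i] < threshold)
            count++;
    return count;
}

int main(void) {
    int      a[N];
    uint16_t b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = (uint16_t)i; }
    printf("%d %u\n", count_below_while(a, N, 10), count_below_for(b, 10));
    return 0;
}
```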

4.3. Circuit Libraries

The use of circuit or macro libraries can greatly simplify and speed the design process. By predesigning commonly used structures such as adders, multipliers, and counters, circuit creation for configurable systems becomes largely the assembly of high-level components, and only application-specific structures require detailed design. The actual architecture of the reconfigurable device can be abstracted, provided only library components are used, as these low-level details will already have been encapsulated within the library structures. Although the users of the circuit library may not know the intricacies of the destination architecture, they are still able to make use of architecture-specific optimizations, such as specialized carry chains. This is because designers very familiar with the details of the target architecture create the components within a circuit library. They can take advantage of architecture specifics when creating the modules to make these components faster and smaller than a designer unfamiliar with the architecture likely would. An added benefit of the architecture abstraction is that the use of library components can also facilitate design migration from one architecture to another, because designers are not required to learn a new architecture, but only to indicate the new target for the library components. However, this does require that a circuit library contain implementations for more than one architecture.

One method for using library components is to simply instantiate them within an HDL design [Xilinx 1997; Altera 1999]. However, circuit libraries can also be used in general language compilers by comparing the dataflow graph of the application to the dataflow graphs of the library macros [Cadambi and Goldstein 1999]. If a dataflow representation of a macro matches a portion of the application graph, the corresponding macro is used for that part of the configuration.

Another benefit of circuit design with library macros is that of fast compilation. Because the library structures may have been premapped, preplaced, and prerouted (at least within the macro boundaries), the actual compile time is reduced to the time required to place the library components and route between them. For example, fast configuration was one of the main motivations for the creation of libraries for circuit design in the DISC reconfigurable image processing system [Hutchings 1997].

4.4. Circuit Generators

Circuit generators fulfill a role similar to circuit libraries, in that they provide optimized high-level structures for use within larger applications. Again, designers are not required to understand the low-level details of particular architectures. However, circuit generators create semicustomized high-level structures automatically at compile time, as opposed to circuit libraries that only provide static structures. For example, a circuit generator can create an adder structure of the exact bit width required by the designer, whereas a circuit library is likely to contain a limited number of adder structures, none of which may be of the correct size. Circuit generators are therefore more flexible than circuit libraries because of the customization allowed.

Some circuit generators, such as MacGen [Yasar et al. 1996], are executed at the command line using custom description files to generate physical design layout data files. Newer circuit generators, however, are functions or methods called from high-level language programs. PAM-Blox [Mencer et al. 1998], for example, is a set of circuit generators executed in C++ that generate structures for use with the PCI Pamette reconfigurable processing board. The circuit generator presented by Chu et al. [1998] contains a number of Java classes to allow a programmer to generate arbitrarily sized arithmetic and logical components for a circuit. Although the examples presented in that paper were mapped to a Xilinx 4000 series FPGA, the generator uses architecture specific libraries for module generation. The target architecture can therefore be changed through the use of a different design library. The Carry Look-Ahead circuit generator described by Stohmann and Barke [1996] is also retargetable, because it maps to an FPGA logic cell architecture defined by the user.

One drawback of the circuit generators is that they depend on a regular logic and routing structure. Hierarchical routing structures (such as those present in the Xilinx 6200 series [Xilinx 1996]) and specialized heterogeneous logic blocks are frequently not accounted for. Therefore, some optimized features of a particular architecture may be unused. For these cases, a circuit macro from a library may provide a more highly optimized structure than one created with a circuit generator, provided that the library macro fits the needs of the application.

4.5. Partial Evaluation

Functions that are to be implemented on the reconfigurable array should occupy as little area as possible, so as to maximize the number of functions that can be mapped to the hardware. This, combined with the minimization of the delay incurred by each circuit, increases the overall acceleration of the application. Partial evaluation is the process of reducing hardware requirements for a circuit structure through optimization based upon known static inputs. Specifically, if an input is known to be constant, that value can potentially be propagated through one or more gates in the structure at compile time, and only the portions of a circuit that depend on time-varying inputs need to be mapped to the reconfigurable structure.

One example of the usefulness of this operation is that of constant coefficient multipliers. If one input to a multiplier is constant, a multiplier object can be reduced from a general-purpose multiplier to a set of additions with static-length shifts between them corresponding to the locations of 1s in the binary constant. This type of reduction leads to a lower area requirement for the circuit, and potentially higher performance due to fewer gate delays encountered on the critical path. Partial evaluation can also be performed in conjunction with circuit generation, where the constants passed to the generator function are used to simplify the created hardware circuit [Wang and Lewis 1997; Chu et al. 1998]. Other examples of this type of optimization for specific algorithms include the partial evaluation of DES encryption circuits [Leonard and Mangione-Smith 1997], and the partial evaluation of constant multipliers and fixed polynomial division circuits [Payne 1997].
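As a worked example of the constant-coefficient reduction described above, the sketch below replaces a general multiply by the constant 10 (binary 1010) with two fixed shifts and one addition, and also shows the generic shift-and-add expansion for an arbitrary constant.

```c
#include <stdint.h>
#include <stdio.h>

/* General-purpose multiply: in hardware this costs a full multiplier. */
static uint32_t mult_general(uint32_t x, uint32_t c) { return x * c; }

/* Partially evaluated for the constant c = 10 = 0b1010: one adder and
 * two fixed shifts (shifts are essentially free in hardware -- just
 * wiring), one term per 1 bit of the constant.                        */
static uint32_t mult_by_10(uint32_t x) { return (x << 3) + (x << 1); }

/* The same specialization derived automatically from any constant:
 * add a shifted copy of x for every 1 bit in c.                       */
static uint32_t mult_shift_add(uint32_t x, uint32_t c) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; bit++)
        if ((c >> bit) & 1)
            result += x << bit;
    return result;
}

int main(void) {
    uint32_t x = 123;
    printf("%u %u %u\n", mult_general(x, 10), mult_by_10(x), mult_shift_add(x, 10));
    return 0;
}
```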
4.6. Memory Allocation

As with traditional software programs, it may be necessary in reconfigurable computing to allocate memories to hold variables and other data. Off-chip memories may be added to the reconfigurable system. Alternately, if a reconfigurable system includes memory blocks embedded into the reconfigurable logic, these may be used, provided that the storage requirements do not surpass the available embedded memory. If multiple off-chip memories are available to a reconfigurable system, variables used in parallel should be placed into different memory structures, such that they can be accessed simultaneously [Gokhale and Stone 1999]. When smaller embedded memory units are used, larger memories can be created from the smaller ones. However, in this case, it is desirable to ensure that each smaller memory is close to the computation that most requires its contents [Babb et al. 1999]. As mentioned earlier, the small embedded memories that are not allocated for data storage may be used to perform logic functions.

4.7. Parallelization

One of the benefits of reconfigurable computing is the ability to execute multiple operations in parallel. In cases where circuits are specified using a structural hardware description language, the user specifies all structures and timing, and therefore either implicitly or explicitly specifies any parallel operation. However, for behavioral and HLL descriptions, there are two methods to incorporate parallelism: manual parallelization through special instructions or compiler directives, and automatic parallelization by the compiler.

To manually incorporate parallelism within an application, the programmer can specifically mark sections of code that should run as parallel threads, and use similar operations to those used in traditional parallel compilers [Cronquist et al. 1998; Gokhale and Stone 1998]. For example, a signal/wait technique can be used to perform synchronization of the different threads of the computation. The RaPiD-B language [Cronquist et al. 1998] is one that uses this methodology.
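The signal/wait style of synchronization can be illustrated with a generic C sketch using POSIX threads; this is not RaPiD-B or NAPA C syntax, just the underlying producer/consumer handshake expressed in portable C.

```c
#include <pthread.h>
#include <stdio.h>

/* A minimal signal/wait pair built on a condition variable.           */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data_valid = 0;
static int shared_value;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;            /* e.g., result of one parallel stage */
    data_valid = 1;
    pthread_cond_signal(&ready);  /* "signal": the value is available   */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!data_valid)
        pthread_cond_wait(&ready, &lock);  /* "wait" for the signal     */
    printf("consumer read %d\n", shared_value);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, consumer, NULL);
    pthread_create(&t2, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```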


Although the NAPA C compiler [Gokhale and Stone 1998] requires programmers to mark the areas of code for executing the host processor and the reconfigurable hardware in parallel, it also detects and exploits fine-grained parallelism within computations destined for the reconfigurable hardware.

Automatic parallelization of inner loops is another common technique in reconfigurable hardware compilers to attempt to maximize the use of the reconfigurable hardware. The compiler will select the innermost loop level to be completely unrolled for parallel execution in hardware, potentially creating a heavily pipelined structure [Cronquist et al. 1998; Weinhardt and Luk 1999]. For these cases, outer loops may not have multiple iterations executing simultaneously. Any loop reordering to improve the parallelism of the circuit must be done by the programmer. However, some compiler systems have taken this procedure a step further and focus on the parallelization of all loops within the program, not just the inner loops [Wang and Lewis 1997; Budiu and Goldstein 1999]. This type of compiler generates a control flow graph based upon the entire program source code. Loop unrolling is used in order to increase the available parallelism, and the graph is then used to schedule parallel operations in the hardware.

4.8. Multi-FPGA System Software

When reconfigurable systems use more than one FPGA to form the complete reconfigurable hardware, there are additional compilation issues to deal with [Hauck and Agarwal 1996]. The design must first be partitioned into the different FPGA chips [Hauck 1995; Acock and Dimond 1997; Vahid 1997; Brasen and Saucier 1998; Khalid 1999]. This is generally done by placing each highly connected portion of a circuit into a single chip. Multi-FPGA systems have a limited number of I/O pins that connect the chips together, and therefore their use must be minimized in the overall circuit mapping. Also, by minimizing the amount of routing required between the FPGAs, the number of paths with a high (inter-chip) delay is reduced, and the circuit may have an overall higher performance. Similarly, those sections of the circuit that require a short delay time must be placed upon the same chip. Global placement then determines which of the actual FPGAs in the multi-FPGA system will contain each of the partitions.

After the circuit has been partitioned into the different FPGA chips, the connections between the chips must be routed [Mak and Wong 1997; Ejnioui and Ranganathan 1999]. A global routing algorithm determines at a high level the connections between the FPGA chips. It first selects a region of output pins on the source FPGA for a given signal, and determines which (if any) routing switches or additional FPGAs the signal must pass through to get to the destination FPGA. Detailed routing and pin assignment [Slimane-Kade et al. 1994; Hauck and Borriello 1997; Mak and Wong 1997; Ejnioui and Ranganathan 1999] are then used to assign signals to traces on an existing multi-FPGA board, or to create traces for a multi-FPGA board that is to be created specifically to implement the given circuit.

Because multi-FPGA systems use inter-chip connections to allow the circuit partitions to communicate, they frequently require a higher proportion of I/O resources vs. logic in each chip than is normally required in single-FPGA use. For this reason, some research has focused on methods to allow pins of the FPGAs to be reused for multiple signals. This procedure is referred to as Virtual Wires [Babb et al. 1993; Agarwal 1995; Selvidge et al. 1995], and allows for a flexible trade-off between logic and I/O within a given multi-FPGA system. Signals are multiplexed onto a single wire by using multiple virtual clock cycles, one per multiplexed signal, within a user clock cycle, thus pipelining the communication. In this manner, the I/O requirements of a circuit can be reduced, while the logic requirements (because of the added circuitry used for the multiplexing) are increased.
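A behavioral sketch of this time-multiplexing idea is shown below: four logical inter-FPGA signals share one physical pin, one per virtual clock cycle within a user clock cycle. It models only the scheduling concept, not an actual Virtual Wires implementation.

```c
#include <stdio.h>

#define N_SIGNALS 4   /* logical inter-FPGA signals sharing one pin */

/* One user clock cycle is split into N_SIGNALS virtual clock cycles.
 * The sender multiplexes one logical signal onto the shared pin per
 * virtual cycle; the receiver demultiplexes into shadow registers.   */
static int physical_pin;

static void sender(const int logical_out[N_SIGNALS], int vcycle) {
    physical_pin = logical_out[vcycle];          /* drive one signal   */
}

static void receiver(int shadow_regs[N_SIGNALS], int vcycle) {
    shadow_regs[vcycle] = physical_pin;          /* latch it           */
}

int main(void) {
    int outputs[N_SIGNALS] = { 1, 0, 1, 1 };     /* values on FPGA A   */
    int inputs[N_SIGNALS]  = { 0, 0, 0, 0 };     /* copies on FPGA B   */

    /* One user clock cycle: all logical signals cross on a single wire. */
    for (int vcycle = 0; vcycle < N_SIGNALS; vcycle++) {
        sender(outputs, vcycle);
        receiver(inputs, vcycle);
    }
    for (int i = 0; i < N_SIGNALS; i++)
        printf("signal %d = %d\n", i, inputs[i]);
    return 0;
}
```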


4.9. Design Testing

After compilation, an application needs to be tested for correct operation before deployment. For hardware configurations that have been generated from behavioral descriptions, this is similar to the debugging of a software application. However, structurally and manually created circuits must be simulated and debugged with techniques based upon those from the design of general hardware circuits. For these structures, simulation and debugging are critical not only to ensure proper circuit operation, but also to prevent possible incorrect connections from causing a short within the circuit, which can damage the reconfigurable hardware.

There are several different methods of observing the behavior of a configuration during simulation. The contents of memory structures within the design can be viewed, modified, or saved. This allows on-the-fly customization of the simulated execution environment of the reconfigurable hardware, as well as a method for examining the computation results. The input and output values of circuit structures and substructures can also be viewed either on a generated schematic drawing or with a traditional waveform output. By examining these values, the operation of the circuit can be verified for correctness, and conflicts on individual wires can be seen. A number of simulation and debugging software systems have been developed that use some or all of these techniques [Arnold et al. 1992; Buell et al. 1996; Gehring and Ludwig 1996; Lysaght and Stockwood 1996; Bellows and Hutchings 1998; Hutchings et al. 1999; McKay and Singh 1999; Vasilko and Cabanis 1999].

4.10. Software Summary

Reconfigurable hardware systems require software compilation tools to allow programmers to harness the benefits of reconfigurable computing. On one end of the spectrum, circuits for reconfigurable systems can be designed manually, leveraging all application-specific and architecture-specific optimizations available to generate a high-performance application. However, this requires a great deal of time and effort on the part of the designer. At the opposite end of the spectrum is fully automatic compilation of a high-level language. Using the automatic tools, a software programmer can transparently utilize the reconfigurable hardware without the need for direct intervention. The circuits created using this method, while quickly and easily created, are generally larger and slower than manually created versions. The actual tools available for compilation onto reconfigurable systems fall at various points within this range, where many are partially automated but require some amount of manual aid. Circuit designers for reconfigurable systems therefore face a trade-off between the ease of design and the quality of the final layout.

5. RUN-TIME RECONFIGURATION

Frequently, the areas of a program that can be accelerated through the use of reconfigurable hardware are too numerous or complex to be loaded simultaneously onto the available hardware. For these cases, it is beneficial to be able to swap different configurations in and out of the reconfigurable hardware as they are needed during program execution (Figure 13). This concept is known as run-time reconfiguration (RTR).

Run-time reconfiguration is based upon the concept of virtual hardware, which is similar to virtual memory. Here, the physical hardware is much smaller than the sum of the resources required by each of the configurations. Therefore, instead of reducing the number of configurations that are mapped, we instead swap them in and out of the actual hardware as they are needed. Because run-time reconfiguration allows more sections of an application to be mapped into hardware than can be fit in a non-run-time reconfigurable system, a greater portion of the program can be accelerated. This provides potential for an overall improvement in performance.


Fig. 13. Applications which are too large to entirely fit on the reconfigurable hardware can be partitioned into two or more smaller configurations that can occupy the hardware at different times.

During a single program's execution, configurations are swapped in and out of the reconfigurable hardware. Some of these configurations will likely require access to the results of other configurations. Configurations that are active at different periods in time therefore must be provided with a method to communicate with one another. Primarily, this can be done through the use of registers [Ebeling et al. 1996; Cadambi et al. 1998; Rupp et al. 1998; Scalera and Vazquez 1998], the contents of which can remain intact between reconfigurations. This allows one configuration to store a value, and a later configuration to read back that value for use in further computations. An alternative for reconfigurable systems that do not include state-holding devices is to write the result back to registers or memory external to the reconfigurable array, which is then read back by successive configurations [Hauck et al. 1997].

There are a few different configuration memory styles that can be used with reconfigurable systems. A single context device is a serially programmed chip that requires a complete reconfiguration in order to change any of the programming bits. A multicontext device has multiple layers of programming bits, each of which can be active at a different point in time. Devices that can be selectively programmed without a complete reconfiguration are called partially reconfigurable. These different types of configuration memory are described in more detail later. An advantage of the multicontext FPGA over a single context architecture is that it allows for an extremely fast context switch (on the order of nanoseconds), whereas the single context may take milliseconds or more to reprogram. The partially reconfigurable architecture is also more suited to run-time reconfiguration than the single context, because small areas of the array can be modified without requiring that the entire logic array be reprogrammed.

For all of these run-time reconfigurable architectures, there are also a number of compilation issues that are not encountered in systems that only configure at the beginning of an application. For example, run-time reconfigurable systems are able to optimize based on values that are known only at run-time. Furthermore, compilers must consider the run-time reconfigurability when generating the different circuit mappings, not only to be aware of the increase in time-multiplexed capacity, but also to schedule reconfigurations so as to minimize the overhead that they incur. These software issues, as well as an overview of methods to perform fast configuration, will be explored in the sections that follow.

5.1. Reconfigurable Models

Traditional FPGA structures have been single context, only allowing one full-chip configuration to be loaded at a time. However, designers of reconfigurable systems have found this style of configuration to be too limiting or slow to efficiently implement run-time reconfiguration. The following discussion defines the single context device, and further considers newer FPGA designs (multicontext and partially reconfigurable), along with their impact on run-time reconfiguration.


Fig. 14. The different basic models of reconfigurable computing: single context, multicontext, and partially reconfigurable. Each of these designs is shown performing a reconfiguration.

5.1.1. Single Context. Current single context FPGAs are programmed using a serial stream of configuration information. Because only sequential access is supported, any change to a configuration on this type of FPGA requires a complete reprogramming of the entire chip. Although this does simplify the reconfiguration hardware, it does incur a high overhead when only a small part of the configuration memory needs to be changed. Many commercial FPGAs are of this style, including the Xilinx 4000 series [Xilinx 1994], the Altera Flex10K series [Altera 1998], and Lucent's Orca series [Lucent 1998]. This type of FPGA is therefore more suited for applications that can benefit from reconfigurable computing without run-time reconfiguration. A single context FPGA is depicted in Figure 14.

In order to implement run-time reconfiguration onto a single context FPGA, the configurations must be grouped into contexts, and each full context is swapped in and out of the FPGA as needed. Because each of these swap operations involves reconfiguring the entire FPGA, a good partitioning of the configurations between contexts is essential in order to minimize the total reconfiguration delay. If all the configurations used within a certain time period are present in the same context, no reconfiguration will be necessary. However, if a number of successive configurations are each partitioned into different contexts, several reconfigurations will be needed, slowing the operation of the run-time reconfigurable system.

5.1.2. Multicontext. A multicontext FPGA includes multiple memory bits for each programming bit location [DeHon 1996; Trimberger et al. 1997a; Scalera and Vazquez 1998; Chameleon 2000]. These memory bits can be thought of as multiple planes of configuration information, as shown in Figure 14. One plane of configuration information can be active at a given moment, but the device can quickly switch between different planes, or contexts, of already-programmed configurations. In this manner, the multicontext device can be considered a multiplexed set of single context devices, which requires that a context be fully reprogrammed to perform any modification. This system does allow for the background loading of a context, where one plane is active and in execution while an inactive plane is in the process of being programmed. Figure 15 shows a multicontext memory bit, as used in [Trimberger et al. 1997a]. A commercial product that uses this technique is the CS2000 RCP series from Chameleon, Inc [Chameleon 2000]. This device provides two separate planes of programming information. At any given time, one of these planes is controlling current execution on the reconfigurable fabric, and the other plane is available for background loading of the next needed configuration.

Fig. 15. A four-bit multicontexted programming bit [Trimberger et al. 1997a]. P0-P3 are the stored programming bits, while C0-C3 are the chip-wide control lines that select the context to program or activate.
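The multicontexted programming bit of Figure 15 can be modeled behaviorally as a small array of stored bits plus an active-context select, as in the hedged C sketch below; the function names are illustrative, not a real device interface.

```c
#include <stdio.h>

#define NUM_CONTEXTS 4

/* Behavioral model of one multicontexted programming bit: four stored
 * bits, one of which drives the logic at a time, while the inactive
 * planes can be written in the background.                            */
typedef struct {
    int stored[NUM_CONTEXTS];   /* P0..P3: one bit per context          */
    int active;                 /* which context currently drives logic */
} ConfigBit;

static void program_bit(ConfigBit *b, int context, int value) {
    b->stored[context] = value;        /* background load of one plane   */
}

static void switch_context(ConfigBit *b, int context) {
    b->active = context;               /* fast (nanosecond-scale) switch */
}

static int read_bit(const ConfigBit *b) {
    return b->stored[b->active];       /* value seen by the logic array  */
}

int main(void) {
    ConfigBit bit = { {0, 0, 0, 0}, 0 };
    program_bit(&bit, 1, 1);           /* load context 1 while 0 executes */
    printf("active=%d value=%d\n", bit.active, read_bit(&bit));
    switch_context(&bit, 1);
    printf("active=%d value=%d\n", bit.active, read_bit(&bit));
    return 0;
}
```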


Fast switching between contexts makes the grouping of the configurations into contexts slightly less critical, because if a configuration is on a different context than the one that is currently active, it can be activated within an order of nanoseconds, as opposed to milliseconds or longer. However, it is likely that the number of contexts within a given program is larger than the number of contexts available in the hardware. In this case, the partitioning again becomes important to ensure that configurations occurring in close temporal proximity are in a set of contexts that are loaded into the multicontext device at the same time. More aspects involving temporal partitioning for single- and multicontext devices will be discussed in the section on compilers for run-time reconfigurable systems.

5.1.3. Partially Reconfigurable. In some cases, configurations do not occupy the full reconfigurable hardware, or only a part of a configuration requires modification. In both of these situations, a partial reconfiguration of the array is required, rather than the full reconfiguration required by a single- or multicontext device. In a partially reconfigurable FPGA, the underlying programming bit layer operates like a RAM device. Using addresses to specify the target location of the configuration data allows for selective reconfiguration of the array. Frequently, the undisturbed portions of the array may continue execution, allowing the overlap of computation with reconfiguration. This has the benefit of potentially hiding some of the reconfiguration latency.

When configurations do not require the entire area available within the array, a number of different configurations may be loaded into unused areas of the hardware at different times. Since only part of the array is reconfigured at a given point in time, the entire array does not require reprogramming. Additionally, some applications require the updating of only a portion of a mapped circuit, while the rest should remain intact, as shown in Figure 14. For example, in a filtering operation in signal processing, a set of constant values that change slowly over time may be reinitialized to a new value, yet the overall computation in the circuit remains static. Using this selective reconfiguration can greatly reduce the amount of configuration data that must be transferred to the FPGA. Several run-time reconfigurable systems are based upon a partially reconfigurable design, including Chimaera [Hauck et al. 1997], PipeRench [Cadambi et al. 1998; Goldstein et al. 2000], NAPA [Rupp et al. 1998], and the Xilinx 6200 and Virtex FPGAs [Xilinx 1996, 1999].

Unfortunately, since address information must be supplied with configuration data, the total amount of information transferred to the reconfigurable hardware may be greater than what is required with a single context design. This makes a full reconfiguration of the entire array slower than the single context version. However, a partially reconfigurable design is intended for applications in which the size of the configurations is small enough that more than one can fit on the available hardware simultaneously. Plus, as we discuss in subsequent sections, a number of fast configuration methods have been explored for partially reconfigurable systems in order to help reduce the configuration data traffic requirements.


Fig. 16. A timeline of the configuration and reconfiguration of pipeline stages on a pipeline reconfigurable FPGA. This example shows three physical pipeline stages implementing five virtual pipeline stages [Cadambi et al. 1998].

5.1.4. Pipeline Reconfigurable. A modification of the partially reconfigurable FPGA design is one in which the partial reconfiguration occurs in increments of pipeline stages. This style of reconfigurable hardware is called pipeline reconfigurable, or sometimes a striped FPGA [Luk et al. 1997b; Cadambi et al. 1998; Deshpande and Somani 1999; Goldstein et al. 2000]. Each stage is configured as a whole. This is primarily used in datapath-style computations, where more pipeline stages are used than can fit simultaneously on available hardware. Figure 16 shows an example of a pipeline reconfigurable array implementing more pipeline stages than can fit on the available hardware. In a pipeline-reconfigurable FPGA, there are two primary execution possibilities. Either the number of hardware pipeline stages available is greater than or equal to the number of pipeline stages of the designed circuit (virtual pipeline stages), or the number of virtual pipeline stages will exceed the number of hardware pipeline stages. The first case is straightforward: the circuit is simply mapped to the array, and some hardware stages may go unused. The second case is more complex and is the one that requires run-time reconfiguration. The pipeline stages are configured one by one, from the start of the pipeline, through the end of the available hardware stages (steps 1, 2, and 3 in Figure 16). After each stage is programmed, it begins computation. In this manner, the configuration of a stage is exactly one step ahead of the flow of data. Once the hardware pipeline has been completely filled, reuse of the hardware pipeline stages begins. Configuration of the next virtual stage begins at the first pipeline location in the hardware (step 4), overwriting the first virtual pipeline stage. The reconfiguration of the hardware pipeline stages continues until the last virtual pipeline stage has been programmed (step 7), at which point the first stage of the virtual pipeline is again configured onto the hardware for the next data set. These structures also allow for the overlap of configuration and execution, as one pipeline stage is configured while the others are executing. Therefore, N-1 data values are processed each time the virtual pipeline is fully traversed on an N-stage hardware system.

5.2. Run-Time Partial Evaluation

One of the advantages that a run-time reconfigurable device has over a system that is only programmed at the beginning of an application is the ability to perform hardware optimizations based upon values determined at run-time. Partial evaluation was already discussed in this article in reference to compilation optimizations for general reconfigurable systems. Run-time partial evaluation allows for the further exploitation of "constants" because the configurations can be modified based not only on completely static values, but also those that change slowly over time [Burns et al. 1997; Luk et al. 1997a; Payne 1997; Wirthlin and Hutchings 1997; Chu et al. 1998; McKay and Singh 1999]. This gives reconfigurable circuits the potential to achieve an even higher performance than an ASIC, which must retain generality in these situations.

The circuit in the reconfigurable system can be customized to the application at a given time, rather than to the application as a category. For example, where an ASIC may have to include a generic multiplier, a reconfigurable system could instantiate a constant coefficient multiplier that changes over time. Additionally, partial evaluation can be used in encryption systems [Leonard and Mangione-Smith 1997]. A key-specific reconfigurable encrypter or decrypter is optimized for the particular key being used, but retains the ability to use more than one key over the lifetime of the hardware (unlike a key-specialized ASIC) or during actual run-time.

Although partial evaluation can be used to reduce the overall area requirements of a circuit by removing potentially extraneous hardware within the implementation, occasionally it is preferable to reserve sufficient area for the largest case, and have all mappings occupy that area. This allows the partially evaluated portion of a given configuration to be reconfigured, while leaving the remainder of the circuit intact. For example, if a constant coefficient multiplier within a larger configuration requires that the constant be changed, only the area occupied by the multiplier requires reconfiguration. This is true even if the new constant coefficient multiplier is a larger structure than the previous one, because the reserved area for it is based upon the largest possibility [McKay and Singh 1999]. Although partial evaluation does not minimize the area occupied by the circuit in this case, the speed of configuration is improved by making the multiplier a modular replaceable component. Additionally, this method retains the speed benefits of partial reconfiguration because it still minimizes the logic and routing actually used to implement the structure.
5.3. Compilation and Configuration Scheduling

For some reconfigurable systems, a configuration requires programming the reconfigurable hardware only at the start of its execution. On the other hand, in a run-time reconfigurable system, the circuits loaded on the hardware change over time. If the user must specify by hand the loading and execution of the circuits in the reconfigurable hardware, then the compilers must include methods to indicate these operations. JHDL [Bellows and Hutchings 1998; Hutchings et al. 1999] is one such compiler. It provides for the instantiation of configurations through the use of Java constructors, and the removal of the circuits from the hardware by using a destructor on the circuit objects. This allows the programmer to indicate exactly the loading pattern of the configurations.

Alternately, the compiler can automate the use of the run-time reconfigurable hardware. For a single context or multicontext device, configurations must be temporally partitioned into a number of different full contexts of configuration information. This involves determining which configurations are likely to be used near in time to one another, and which configurations are able to fit together onto the reconfigurable hardware. Ideally, the number of reconfigurations that are to be performed is minimized. By reducing the number of reconfigurations, the proportion of time spent in reconfiguration (compared to the time spent in useful computation) is reduced.

The problem of forming and scheduling single- and multiconfiguration contexts for use in single context or multicontext FPGA designs has been discussed by a number of groups [Chang and Marek-Sadowska 1998; Trimberger 1998; Liu and Wong 1999; Purna and Bhatia 1999; Li et al. 2000a]. In particular, a single circuit that is too large to fit within the reconfigurable hardware may be partitioned over time to form a sequential set of configurations. This involves examining the control flow graph of the circuit and dividing the circuit into distinct computation nodes. The nodes can then be grouped together within contexts, based upon their proximity to one another within the flow control graph. If possible, those configurations that are used in quick succession will be placed within the same group. These groups are finally mapped into full contexts, to be loaded into the reconfigurable hardware at run-time. Nimble [Li et al. 2000a] is one of the compilers that perform this type of operation. This compiler focuses on mapping core loops within C code to reconfigurable hardware. Hardware models for the candidate loops that will fit within the reconfigurable hardware are first extracted from the C application. Then these loops are grouped into individual configurations using a partitioning method in order to encourage the hardware loops that are used in close temporal proximity to be mapped to the same configuration, reducing configuration overhead.

For partially reconfigurable designs, the compiler must determine a good placement in order to prevent configurations that are used together in close temporal proximity from occupying the same resources. Again, through minimizing the number of reconfigurations, the overall performance of the system is increased, as configuration is a slow process [Li et al. 2000b]. An alternative approach, which allows the final placement of a configuration to be determined at run-time, is also discussed within the Fast Configuration section of this article.

5.4. Fast Configuration

Because run-time reconfigurable systems involve reconfiguration during program execution, the reconfiguration must be done as efficiently and as quickly as possible. This is in order to ensure that the overhead of the reconfiguration does not eclipse the benefit gained by hardware acceleration. Stalling execution of either the host processor or the reconfigurable hardware because of configuration is clearly undesirable. In the DISC II system, from 25% [Wirthlin and Hutchings 1996] to 71% [Wirthlin and Hutchings 1995] of execution time is spent in reconfiguration, while in the UCLA ATR work this figure can rise to over 98.5% [Mangione-Smith 1999]. If the delays caused by reconfiguration are reduced, performance can be greatly increased. Therefore, fast configuration is an important area of research for run-time reconfigurable systems.

There are a number of different tactics for reducing the configuration overhead. First, loading of the configurations can be timed such that the configuration overlaps as much as possible with the execution of instructions by the host processor. Second, compression techniques can be introduced to decrease the amount of configuration data that must be transferred to the system. Third, specialized hardware can be used to adjust the physical location of configurations at run-time based on where the free area on the hardware is located at any given time. Finally, the actual process of transferring the data from the host processor to the reconfigurable hardware can be modified to include a configuration cache, which would provide a faster reconfiguration.

5.4.1. Configuration Prefetching. Performance is improved when the actual configuration of the hardware is overlapped with computations performed by the host processor, because programming the reconfigurable hardware requires from milliseconds to seconds to accomplish. Overlapping configuration and execution prevents the host processor from stalling while it is waiting for the configuration to finish, and hides the configuration time from the program execution. Configuration prefetching [Hauck 1998a] attempts to leverage this overlap by determining when to initiate reconfiguration of the hardware in order to maximize overlap with useful computation on the host processor. It also seeks to minimize the chance that a configuration will be prefetched falsely, overwriting the configuration that is actually used next.
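The sketch below illustrates the prefetching idea with a hypothetical, non-blocking configuration-load interface: the load is started early, the host continues useful work, and synchronization happens only when the hardware is actually invoked. The API names are invented for illustration and do not correspond to any real system's interface.

```c
#include <stdio.h>

/* Hypothetical runtime interface; real systems expose different calls. */
static int loaded_config = -1;

static void start_config_load(int config_id) {   /* non-blocking          */
    printf("begin loading configuration %d in background\n", config_id);
}
static void wait_config_ready(int config_id) {   /* block only if needed  */
    loaded_config = config_id;
}
static void run_on_hardware(int config_id) {
    printf("executing configuration %d\n", config_id);
}

int main(void) {
    int next_kernel = 7;

    /* Prefetch: start reconfiguration as early as the compiler (or a
     * prediction scheme) believes this kernel will be needed...          */
    start_config_load(next_kernel);

    /* ...and keep doing useful work on the host processor while the
     * configuration streams into the array.                              */
    long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i;

    /* Only synchronize right before the hardware is actually used.       */
    wait_config_ready(next_kernel);
    run_on_hardware(next_kernel);
    printf("host work result: %ld (config %d)\n", sum, loaded_config);
    return 0;
}
```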
5.4.2. Configuration Compression. Unfortunately, there will always be cases in which the configuration overheads cannot be successfully hidden using a prefetching technique. This can occur when a conditional branch occurs immediately before the use of a configuration, potentially making a 100% correct prefetch prediction impossible, or when multiple configurations or contexts must be loaded in quick succession. In these cases, the delay incurred is minimized when the amount of data transferred from the host processor to the reconfigurable array is minimized. Configuration compression can be used to compact this configuration information [Hauck et al. 1998b; Hauck and Wilson 1999; Li and Hauck 1999; Dandalis and Prasanna 2001].

One form of configuration compression has already been implemented in a commercial system. The Xilinx 6200 series of FPGA [Xilinx 1996] contains wildcarding hardware, which provides a method to program multiple logic cells with a single address and data value. This is accomplished by setting a special register to indicate which of the address bits should behave as "don't-care" values, resolving to multiple addresses for configuration. For example, suppose two configuration addresses, 00010 and 00110, are both to be programmed with the same value. By setting the wildcard register to 00100, the address value sent is interpreted as 00X10 and both these locations are programmed using either of the two addresses above in a single operation. Hauck et al. [1998b] discuss the benefits of this hardware, while Li and Hauck [1999] cover a potential extension to the concept, where "don't care" values in the configuration stream can be used to allow areas with similar but not identical configuration data values to also be programmed simultaneously.
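The wildcarding scheme can be illustrated in software: a single write, together with a wildcard mask marking don't-care address bits, programs every matching configuration address. The sketch below reproduces the 00010/00110 example from the text; the data structures are a simplification, not a model of the actual 6200 hardware.

```c
#include <stdio.h>
#include <stdint.h>

#define ADDR_BITS 5
#define NUM_CELLS (1u << ADDR_BITS)

static uint8_t config_mem[NUM_CELLS];

/* One wildcarded write: every address that matches 'addr' on the bits
 * NOT set in 'wildcard' is programmed with 'value'.  With addr=00010
 * and wildcard=00100 this hits both 00010 and 00110.                 */
static void wildcard_write(uint32_t addr, uint32_t wildcard, uint8_t value) {
    for (uint32_t a = 0; a < NUM_CELLS; a++)
        if ((a & ~wildcard) == (addr & ~wildcard))
            config_mem[a] = value;
}

int main(void) {
    wildcard_write(0x02 /* 00010 */, 0x04 /* 00100 */, 0xAB);
    printf("cell 00010 = %02X, cell 00110 = %02X\n",
           config_mem[0x02], config_mem[0x06]);
    return 0;
}
```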
Within partially reconfigurable systems, there is added potential to effectively compress the amount of data sent to the reconfigurable hardware. A configuration can possibly reuse configuration information already present on the array, such that only the areas differing in configuration values must be reprogrammed. Therefore, configuration time can be reduced through the identification of these common components and the calculation of the incremental configurations that must be loaded [Luk et al. 1997a; Shirazi et al. 1998].

Alternatively, similar operations can be grouped together to form a single configuration that contains extra control circuitry in order to implement the various functions within the group [Kastrup et al. 1999]. By creating larger configurations out of groups of smaller configurations, the configuration overhead of partial reconfiguration is reduced because more operations can be present on chip simultaneously. However, there are some area and execution penalties imposed by this method, creating a trade-off between reduced reconfiguration overhead and faster execution with a smaller area.

5.4.3. Relocation and Defragmentation in Partially Reconfigurable Systems. Partially reconfigurable systems have the advantage over single context systems in that they allow a new configuration to be written to the programmable logic while the configurations not occupying that same area remain intact and available for future use. Because these configurations will not have to be reconfigured onto the array, and because the programming of a single configuration can require the transfer of far less configuration data than the programming of an entire context, a partially reconfigurable system can incur less configuration overhead than a single context FPGA.

However, inefficiencies can arise if two partial configurations are supposed to be located at overlapping physical locations on the FPGA. If these configurations are repeatedly used one after another, they must be swapped in and out of the array each time. This type of conflict could negate much of the benefit achieved by partially reconfigurable systems. A better solution to this problem is to allow the final placement of the configurations to occur at run-time, allowing for run-time relocation of those configurations [Li et al. 2000b; Compton et al. 2002].

Using relocation, a new configuration may be placed onto the reconfigurable array where it will cause minimum conflict with other needed configurations already present on the hardware. A number of different systems support run-time relocation, including Chimaera [Hauck et al. 1997], Garp [Hauser and Wawrzynek 1997], and PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].
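As a rough illustration of run-time relocation, the sketch below chooses a placement offset for a relocatable configuration on a one-dimensional, row-based array; the greedy policy and the configuration names are invented for the example and are not the placement algorithms of the systems cited above.

# A minimal relocation sketch for a row-based array: pick the starting row for a
# new configuration that overlaps the fewest rows already occupied by resident
# configurations, preferring a completely free region when one exists.

def choose_offset(occupied, array_rows, config_rows):
    """occupied maps configuration name -> (start_row, row_count); returns a start row."""
    def overlap(start):
        rows = set(range(start, start + config_rows))
        return sum(len(rows & set(range(s, s + n))) for s, n in occupied.values())
    candidates = range(array_rows - config_rows + 1)
    return min(candidates, key=overlap)     # min() keeps the first offset with least conflict

if __name__ == "__main__":
    resident = {"fir_filter": (0, 8), "crc32": (20, 4)}     # hypothetical resident configurations
    print(choose_offset(resident, array_rows=32, config_rows=6))   # prints 8, a conflict-free offset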

Even with relocation, partially reconfigurable hardware can still suffer from some placement conflicts that could be avoided by using an additional hardware optimization. Over time, as a partially reconfigurable device loads and unloads configurations, the location of the unoccupied area on the array is likely to become fragmented, similar to what occurs in memory systems when RAM is allocated and deallocated. There may be enough empty area on the device to hold an incoming configuration, but it may be distributed throughout the array. A configuration normally requires a contiguous region of the chip, so it would have to overwrite a portion of a valid configuration in order to be placed onto the reconfigurable hardware. A system that incorporates the ability to perform defragmentation of the reconfigurable array, however, would be able to consolidate the unused area by moving valid configurations to new locations [Diessel and El Gindy 1997; Compton et al. 2002]. This area can then be used by incoming configurations, potentially without overwriting any of the moved configurations.
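The sketch below illustrates defragmentation as a simple compaction pass over the same kind of row-based array; it is only a sketch under that assumption, and it ignores the cost of moving configurations and the need to preserve any state they hold.

# Illustrative defragmentation by compaction: slide resident configurations
# toward row 0 in order of their current position, coalescing the free space
# into one contiguous region at the end of the array.

def compact(occupied):
    """Return (moves, first_free_row) for configurations given as name -> (start, length)."""
    moves = []
    next_free = 0
    for name, (start, length) in sorted(occupied.items(), key=lambda kv: kv[1][0]):
        if start != next_free:
            moves.append((name, start, next_free))    # (configuration, old row, new row)
        next_free += length
    return moves, next_free

if __name__ == "__main__":
    resident = {"fir_filter": (4, 8), "crc32": (20, 4), "huffman": (27, 5)}   # fragmented layout
    moves, free_from = compact(resident)
    print(moves)       # [('fir_filter', 4, 0), ('crc32', 20, 8), ('huffman', 27, 12)]
    print(free_from)   # 17: rows 17 and above are now one contiguous free block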
5.4.4. Configuration Caching. Because a great deal of the delay caused by configuration is due to the distance between the host processor and the reconfigurable hardware, as well as the reading of the configuration data from a file or main memory, a configuration cache can potentially reduce the costs of reconfiguration [Deshpande et al. 1999; Li et al. 2000b]. By storing the configurations in fast memory near to the reconfigurable array, the data transfer during reconfiguration is accelerated, and the overall time required is reduced. Additionally, a special configuration cache can allow for specialized direct output to the reconfigurable hardware [Compton et al. 2000]. This output can leverage the close proximity of the cache by providing high-bandwidth communications that would facilitate wide parallel loading of the configuration data, further reducing configuration times.
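A minimal sketch of configuration caching appears below, assuming an LRU replacement policy and a capacity measured in whole configurations; both assumptions are simplifications made for the example rather than properties of the cited designs.

# Configuration cache sketch: recently used configurations are kept in a fast
# memory near the array so that reloading them avoids the slow path back to the
# host. Capacity is counted in configurations and replacement is LRU, purely
# for simplicity of illustration.

from collections import OrderedDict

class ConfigCache:
    def __init__(self, capacity, fetch_from_host):
        self.capacity = capacity                   # configurations held near the array
        self.fetch_from_host = fetch_from_host     # slow path: read the bitstream from host memory
        self.entries = OrderedDict()               # name -> configuration data, in LRU order

    def load(self, name):
        """Return the configuration data and whether the slow host path was needed."""
        if name in self.entries:
            self.entries.move_to_end(name)         # hit: refresh its LRU position
            return self.entries[name], "hit"
        data = self.fetch_from_host(name)          # miss: pay the full transfer cost
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)       # evict the least recently used configuration
        return data, "miss"

if __name__ == "__main__":
    cache = ConfigCache(capacity=2, fetch_from_host=lambda name: f"<bitstream for {name}>")
    for name in ["fir", "crc", "fir", "huffman", "crc"]:
        _, outcome = cache.load(name)
        print(name, outcome)     # fir miss, crc miss, fir hit, huffman miss, crc miss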

5.5. Potential Problems with RTR

Partial reconfiguration involves selectively programming portions of the reconfigurable array. However, in many architectures, there are some routing resources that traverse long distances and may traverse areas allocated to different configurations. Care must be taken such that different configurations do not attempt to drive these wires simultaneously, as multiple drivers on a wire can potentially damage the hardware. Therefore, systems such as the Xilinx 6200 [Xilinx 1996] and Chimaera [Hauck et al. 1997] have specially designed routing resources that prevent multiple drivers. LEGO [Chow et al. 1999b] includes an additional control signal preventing conflicts during the span of time between startup and actual programming of the hardware.

An additional difficulty in using run-time reconfigurable systems occurs when the host processor runs multiple threads or processes. These threads or processes may each have their own sets of configurations that are to be mapped to the reconfigurable hardware. Issues such as the correct use of memory protection and virtual memory must be considered during memory accesses by the reconfigurable hardware [Chien and Byun 1999; Jacob and Chow 1999; Jean et al. 1999]. Another problem can occur when one thread or process configures the hardware, which is then reconfigured by a different thread or process. Threads and processes must be prevented from incorrectly calling hardware functions that no longer appear on the reconfigurable hardware. This requires that the state of the reconfigurable hardware be set to "dirty" on a main processor context switch, or re-loaded with the correct configuration context.
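The sketch below shows one way the "dirty" policy just described might be realized in the runtime software; the class and method names are hypothetical and do not correspond to an existing API.

# Sketch of the "dirty on context switch" policy: before a process invokes a
# hardware function, the runtime verifies that the configuration currently
# loaded on the array belongs to that process, reloading it when necessary.

class ReconfigurableUnit:
    def __init__(self):
        self.loaded_owner = None       # process that owns the currently loaded configuration
        self.loaded_config = None
        self.dirty = True              # nothing valid has been loaded yet

    def context_switch(self):
        """Called by the scheduler; the array contents can no longer be trusted."""
        self.dirty = True

    def call_hw_function(self, pid, config, program_array):
        """Reload the array if a switch or another process invalidated the configuration."""
        if self.dirty or self.loaded_owner != pid or self.loaded_config != config:
            program_array(config)      # pay the reconfiguration cost
            self.loaded_owner, self.loaded_config = pid, config
            self.dirty = False
        return f"pid {pid} executed {config}"

if __name__ == "__main__":
    unit = ReconfigurableUnit()
    program = lambda cfg: print("  reconfiguring array with", cfg)
    print(unit.call_hw_function(1, "fir", program))   # reloads: nothing valid yet
    print(unit.call_hw_function(1, "fir", program))   # reuses the loaded configuration
    unit.context_switch()
    print(unit.call_hw_function(2, "crc", program))   # reloads for the new process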

Partially reconfigurable systems must also protect against inter-process or inter-thread conflicts within the array. Even if each application has ensured that its own configurations can safely coexist, a combination of configurations from different applications re-introduces the possibility of inadvertently causing an electrical short within the reconfigurable hardware. This particular issue can be solved through the use of an architecture that does not have "bad" configurations, such as the 6200 series [Xilinx 1996] and Chimaera [Hauck et al. 1997]. The potential for this type of conflict also introduces the possibility of extremely destructive configurations that can destroy the system's underlying hardware.

5.6. Run-Time Reconfiguration Summary

We have discussed the benefits of using run-time reconfiguration to increase the gains achieved through reconfigurable computing. Different configurations may be used at different phases of a program's execution, customizing the hardware not only for the application, but also for the different stages of the application. Run-time reconfiguration also allows configurations larger than the available reconfigurable hardware to be implemented, as these circuits can be split into several smaller ones that are used in succession. Because of the delays associated with configuration, this style of computing requires that reconfiguration be performed in a very efficient manner. Multicontext and partially reconfigurable FPGAs are both designed to improve the time required for reconfiguration. Hardware optimizations, such as wildcarding, run-time relocation, and defragmentation, further decrease configuration overhead in a partially reconfigurable design. Software techniques to enable fast configuration, including prefetching and incremental configuration calculation, were also discussed.

6. CONCLUSION

Reconfigurable computing is becoming an important part of research in computer architectures and software systems. By placing the computationally intense portions of an application onto the reconfigurable hardware, that application can be greatly accelerated. This is because reconfigurable computing combines many of the benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be changed over the lifetime of the system or even the lifetime of the application. Similar to an ASIC, reconfigurable systems provide a method to map circuits into hardware. Reconfigurable systems therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute cycle of traditional microprocessors, as well as possibly exploiting a greater degree of parallelism.

Reconfigurable hardware systems come in many forms, from a configurable functional unit integrated directly into a CPU, to a reconfigurable coprocessor coupled with a host microprocessor, to a multi-FPGA stand-alone unit. The level of coupling, granularity of computation structures, and form of routing resources are all key points in the design of reconfigurable systems. The use of heterogeneous structures can also greatly add to the overall performance of the final design.

Compilation tools for reconfigurable systems range from simple tools that aid in the manual design and placement of circuits, to fully automatic design suites that use program code written in a high-level language to generate circuits and the controlling software. The variety of tools available allows designers to choose between manual and automatic circuit creation for any or all of the design steps. Although automatic tools greatly simplify the design process, manual creation is still important for performance-driven applications. Circuit libraries and circuit generators are additional software tools that enable designers to quickly create efficient designs. These tools attempt to aid the designer in gaining the benefits of manual design without entirely sacrificing the ease of automatic circuit creation.

Finally, run-time reconfiguration provides a method to accelerate a greater portion of a given application by allowing the configuration of the hardware to change over time. Apart from the benefits of added capacity through the use of virtual hardware, run-time reconfiguration also allows for circuits to be optimized based on run-time conditions. In this manner, performance of a reconfigurable system can approach or even surpass that of an ASIC. Reconfigurable computing systems have shown the ability to accelerate program execution greatly, providing a high-performance alternative to software-only implementations.

However, no one hardware design has emerged as the clear pinnacle of reconfigurable design. Although general-purpose FPGA structures have standardized into LUT-based architectures, groups designing hardware for reconfigurable computing are currently also exploring the use of heterogeneous structures and word-width computational elements. Those designing compiler systems face the task of improving automatic design tools to the point where they may achieve mappings comparable to manual design for even high-performance applications. Within both of these research categories lies the additional topic of run-time reconfiguration. While some work has been done in this field as well, research must continue in order to be able to perform faster and more efficient reconfiguration. Further study into each of these topics is necessary in order to harness the full potential of reconfigurable computing.

REFERENCES

ABOUZEID, P., BABBA, P., DE PAULET, M. C., AND SAUCIER, G. 1993. Input-driven partitioning methods and application to synthesis on table-lookup-based FPGA's. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 12, 7, 913–925.
ACOCK, S. J. B. AND DIMOND, K. R. 1997. Automatic mapping of algorithms onto multiple FPGA-SRAM modules. Field-Programmable Logic and Applications, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Lecture Notes in Computer Science, vol. 1304, Springer-Verlag, Berlin, Germany, 255–264.
ADAPTIVE SILICON, INC. 2001. MSA 2500 Programmable Logic Cores. Adaptive Silicon, Inc., Los Gatos, CA.
AGARWAL, A. 1995. VirtualWires: A Technology for Massive Multi-FPGA Systems. Available online at http://www.ikos.com/products/virtualwires.ps.
AGGARWAL, A. AND LEWIS, D. 1994. Routing architectures for hierarchical field programmable gate arrays. In Proceedings of the IEEE International Conference on Computer Design, 475–478.
ALEXANDER, M. J. AND ROBINS, G. 1996. New performance-driven FPGA routing algorithms. IEEE Trans. CAD Integ. Circ. Syst. 15, 12, 1505–1517.
ALTERA CORPORATION. 1998. Data Book. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 1999. Altera MegaCore Functions. Available online at http://www.altera.com/html/tools/megacore.html. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 2001. Press Release: Altera Unveils First Complete System-on-a-Programmable-Chip Solution at Embedded Systems Conference. Altera Corporation, San Jose, CA.
ANNAPOLIS MICROSYSTEMS, INC. 1998. Wildfire Reference Manual. Annapolis Microsystems, Inc., Annapolis, MD.
ARNOLD, J. M., BUELL, D. A., AND DAVIS, E. G. 1992. Splash 2. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, 316–324.
BABB, J., RINARD, M., MORITZ, C. A., LEE, W., FRANK, M., BARUA, R., AND AMARASINGHE, S. 1999. Parallelizing applications into silicon. IEEE Symposium on Field-Programmable Custom Computing Machines, 70–80.
BABB, J., TESSIER, R., AND AGARWAL, A. 1993. Virtual wires: Overcoming pin limitations in FPGA-based logic emulators. In IEEE Workshop on FPGAs for Custom Computing Machines, 142–151.
BELLOWS, P. AND HUTCHINGS, B. 1998. JHDL—An HDL for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–184.
BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 213–222.
BETZ, V. AND ROSE, J. 1999. FPGA routing architecture: Segmentation and buffering to optimize speed and density. ACM/SIGDA International Symposium on FPGAs, 59–68.
BRASEN, D. R. AND SAUCIER, G. 1998. Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. CAD Integ. Circ. Syst. 17, 7, 592–600.
BROWN, S. D., FRANCIS, R. J., ROSE, J., AND VRANESIC, Z. G. 1992a. Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, MA.
BROWN, S., ROSE, J., AND VRANESIC, Z. G. 1992b. A detailed router for field-programmable gate arrays. IEEE Trans. Comput. Aid. Des. 11, 5, 620–628.
BUDIU, M. AND GOLDSTEIN, S. C. 1999. Fast compilation for pipelined reconfigurable fabrics. ACM/SIGDA International Symposium on FPGAs, 195–205.
BUELL, D., ARNOLD, S. M., AND KLEINFELDER, W. J. 1996. SPLASH 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, Los Alamitos, CA.


BURNS, J., DONLIN, A., HOGG, J., SINGH,S.,AND CHOW, P., SEO,S.O.,ROSE, J., CHUNG, K., PAEZ´ -MONZON´ , DE WIT, M. 1997. A dynamic reconfigu- G., AND RAHARDJA, I. 1999b. The design of an ration run-time system. IEEE Symposium SRAM-based field-programmable Gate Array— on Field-Programmable Custom Computing Part II: Circuit Design and Layout. IEEE Trans. Machines, 66–75. VLSI Syst. 7, 3, 321–330. BUTTS,M.AND BATCHELLER, J. 1991. Method of us- CHOWDHARY,A.AND HAYES, J. P. 1997. General ing electronically reconfigurable logic circuits. modeling and technology-mapping technique for US Patent 5,036,473. LUT-based FPGAs. ACM/SIGDA International CADAMBI,S.AND GOLDSTEIN, S. C. 1999. CPR: A Symposium on FPGAs, 43–49. configuration profiling tool. IEEE Symposium CHU, M., WEAVER, N., SULIMMA, K., DEHON, A., AND on Field-Programmable Custom Computing WAWRZYNEK, J. 1998. Object oriented circuit- Machines, 104–113. generators in Java. IEEE Symposium on Field- CADAMBI, S., WEENER, J., GOLDSTEIN,S.C.,SCHMIT, H., Programmable Custom Computing Machines, AND THOMAS, D. E. 1998. Managing pipeline- 158–166. reconfigurable FPGAs. ACM/SIGDA Interna- COMPTON, K., COOLEY, J., KNOL,S.,AND HAUCK, tional Symposium on FPGAs, 55–64. S. 2000. Configuration relocation and defrag- CALLAHAN,T.J.,CHONG, P., DEHON, A., AND WAWRZYNEK, mentation for FPGAs, Northwestern Univer- J. 1998. Fast Module Mapping and Placement sity Technical Report, Available online at http:// for in FPGAs. ACM/SIGDA Interna- www.ece.nwu.edu/∼kati/publications.html. tional Symposium on FPGAs, 123–132. COMPTON, K., LI, Z., COOLEY, J., KNOL,S.,AND HAUCK, CALLAHAN,T.J.,HAUSER,J.R.,AND WAWRZYNEK,J. S. 2002. Configuration relocation and defrag- 2000. The Garp architecture and C compiler. mentation for run-time reconfigurable comput- IEEE Comput. 3, 4, 62–69. ing. IEEE Trans. VLSI Syst., to appear. CARDOSO,J.M.P.AND NETO, H. C. 1999. Macro- CONG,J.AND HWANG, Y. Y. 1998. Boolean match- based hardware compilation of JavaTM byte- ing for complex PLBs in LUT-based FPGAs with codes into a dynamic reconfigurable computing application to architecture evaluation. ACM/ system. IEEE Symposium on Field-Programm- SIGDA International Symposium on FPGAs, able Custom Computing Machines, 2–11. 27–34. CHAMELEON SYSTEMS,INC. 2000. CS2000 Advance CONG,J.AND WU, C. 1998. An efficient algorithm Product Specification. Chameleon Systems, Inc., for performance-optimal FPGA technology map- San Jose, CA. ping with retiming. IEEE Trans. CAD Integr. Circ. Syst. 17, 9, 738–748. CHAN,P.K.AND SCHLAG, M. D. F. 1997. Accel- eration of an FPGA router. IEEE Symposium CONG, J., WU,C.,AND DING, Y. 1999. Cut ranking on Field-Programmable Custom Computing and pruning enabling a general and efficient Machines, 175–181. FPGA mapping solution. ACM/SIGDA Interna- tional Symposium on FPGAs, 29–35. CHANG,D.AND MAREK-SADOWSKA, M. 1998. Parti- tioning sequential circuits on dynamically recon- CONG,J.AND XU, S. 1998. Technology mapping figurable FPGAs. ACM/SIGDA International for FPGAs with embedded memory blocks. Symposium on FPGAs, 161–167. ACM/SIGDA International Symposium on FPGAs, 179–188. CHANG,S.C.,MAREK-SADOWSKA, M., AND HWANG,T.T. 1996. Technology mapping for TLU FPGA’s CRONQUIST,D.C.,FRANKLIN, P., BERG,S.G., based on decomposition of binary decision AND EBELING, C. 1998. Specifying and com- diagrams. IEEE Trans. CAD Integ. Circ. Syst. 15, piling applications for RaPiD. IEEE Sympo- 10, 1226–1248. sium on Field-Programmable Custom Comput- ing Machines, 116–125. CHICHKOV,A.V.AND ALMEIDA, C. B. 1997. 
An hard- ware/software partitioning algorithm for cus- DANDALIS,A.AND PRASANNA, V. K. 2001. Configura- tom computing machines. Lecture Notes in Com- tion compression for FPGA-based embedded sys- puter Science 1304—Field-Programmable Logic tems. ACM/SIGDA International Symposium and Applications. W. Luk, P. Y. K. Cheung, on Field-Programmable Gate Arrays, 173–182. and M. Glesner, Eds. Springer-Verlag, Berlin, DEHON, A. 1996. DPGA Utilization and Applica- Germany, 274–283. tion. ACM/SIGDA International Symposium on CHIEN,A.A.AND BYUN, J. H. 1999. Safe and pro- FPGAs, 115–121. tected execution for the morph/AMRM recon- DEHON, A. 1999. Balancing interconnect and com- figurable processor. IEEE Symposium on Field- putation in a reconfigurable computing array (or, Programmable Custom Computing Machines, why you don’t really want 100% LUT utiliza- 209–221. tion). ACM/SIGDA International Symposium CHOW, P., SEO,S.O.,ROSE, J., CHUNG, K., PAEZ´ -MONZON´ , on FPGAs, 69–78. G., AND RAHARDJA, I. 1999a. The design of an DESHPANDE, D., SOMANI,A.K.,AND TYAGI,A. SRAM-based field-programmable Gate Array— 1999. Configuration caching vs data caching Part I: Architecture. IEEE Trans. VLSI Syst. 7, for striped FPGAs. ACM/SIGDA International 2, 191–197. Symposium on FPGAs, 206–214.


DIESSEL,O.AND EL GINDY, H. 1997. Run-time com- GRAHAM,P.AND NELSON, B. 1996. Genetic algo- paction of FPGA designs. Lecture Notes in rithms in software and in hardware—A per- Computer Science 1304—Field-Programmable formance analysis of workstations and custom Logic and Applications. W. Luk, P. Y. K. computing machine implementations. IEEE Cheung, M. Glesner, Eds. Springer-Verlag, Symposium on FPGAs for Custom Computing Berlin, Germany, 131–140. Machines, 216–225. DOLLAS, A., SOTIRIADES, E., AND EMMANOUELIDES,A. HAUCK, S. 1995. Multi-FPGA systems. Ph.D. dis- 1998. Architecture and design of GE1, A FCCM sertation, Univ. Washington, Dept. of C.S.&E. for golomb ruler derivation. IEEE Sympo- HAUCK, S. 1998a. Configuration prefetch for sin- sium on Field-Programmable Custom Comput- gle context reconfigurable . ACM/ ing Machines, 48–56. SIGDA International Symposium on FPGAs, EBELING, C., CRONQUIST,D.C.,AND FRANKLIN,P. 65–74. 1996. RaPiD—Reconfigurable pipelined dat- HAUCK, S. 1998b. The roles of FPGAs in repro- apath. Lecture Notes in Computer Science grammable systems. Proc. IEEE 86, 4, 615–638. 1142—Field-Programmable Logic: Smart Appli- HAUCK,S.AND AGARWAL A. 1996. Software tech- cations, New Paradigms and Compilers.R.W. nologies for reconfigurable systems. Dept. of Hartenstein, M. Glesner, Eds. Springer-Verlag, ECE Technical Report, Northwestern Univ. Berlin, Germany, 126–135. Available online at http://www.ee.washington. EJNIOUI,A.AND RANGANATHAN, N. 1999. Multi- edu/faculty/hauck/publications.html. terminal net routing for partial crossbar-based HAUCK,S.AND BORRIELLO, G. 1997. Pin assignment multi-FPGA systems. ACM/SIGDA Interna- for multi-FPGA systems. IEEE Trans. Comput. tional Symposium on FPGAs, 176–184. Aid. Desi. Integ. Circ. Syst. 16, 9, 956–964. ELBIRT,A.J.AND PAAR, C. 2000. An FPGA im- HAUCK, S., BORRIELLO,G.,AND EBELING, C. 1998a. plementation and performance evaluation of Mesh routing topologies for multi-FPGA sys- the serpent block cipher. ACM/SIGDA Interna- tems. IEEE Trans. VLSI Syst. 6, 3, 400–408. tional Symposium on FPGAs, 33–40. HAUCK, S., FRY,T.W.,HOSLER,M.M.,AND KAO,J.P. EMMERT,J.M.AND BHATIA, D. 1999. A methodology 1997. The Chimaera reconfigurable functional for fast FPGA floorplanning. ACM/SIGDA In- unit. IEEE Symposium on Field-Programmable ternational Symposium on FPGAs, 47–56. Custom Computing Machines, 87–96. ESTRIN, G., BUSSEL, B., TURN, R., AND BIBB, J. 1963. HAUCK, S., LI, Z., AND SCHWABE, E. 1998b. Configu- Parallel processing in a restructurable com- ration compression for the Xilinx XC6200 FPGA. puter system. IEEE Trans. Elect. Comput. 747– IEEE Symposium on Field-Programmable Cus- 755. tom Computing Machines, 138–146. GALLOWAY, D. 1995. The transmogrifier C hard- HAUCK,S.AND WILSON, W. D. 1999. Runlength ware description language and compiler for compression techniques for FPGA configura- FPGAs. IEEE Symposium on FPGAs for Custom tions. Dept. of ECE Technical Report, North- Computing Machines, 136–144. western Univ. Available online at http://www. GEHRING,S.AND LUDWIG, S. 1996. The trianus sys- ee.washington.edu/faculty/hauck/publications. tem and its application to custom computing. html. Lecture Notes in Computer Science 1142—Field- HAUSER,J.R.AND WAWRZYNEK, J. 1997. Garp: A Programmable Logic: Smart Applications, New MIPS processor with a reconfigurable coproces- Paradigms and Compilers. R. W. Hartenstein sor. IEEE Symposium on Field-Programmable and M. Glesner, Eds. Springer-Verlag, Berlin, Custom Computing Machines, 12–21. Germany, 176–184. 
HAYNES,S.D.AND CHEUNG, P. Y. K. 1998. A re- GEHRING,S.W.AND LUDWIG, S. H. M. 1998. Fast configurable multiplier array for video image integrated tools for circuit design with FPGAs. processing tasks, suitable for embedding in an ACM/SIGDA International Symposium on FPGA structure. IEEE Symposium on Field- FPGAs, 133–139. Programmable Custom Computing Machines, GOKHALE,M.B.AND STONE, J. M. 1998. NAPA C: 226–234. Compiling for a hybrid RISC/FPGA architec- HEILE,F.AND LEAVER, A. 1999. Hybrid product ture. IEEE Symposium on Field-Programmable term and LUT based architectures using embed- Custom Computing Machines, 126–135. ded memory blocks. ACM/SIGDA International GOKHALE,M.B.AND STONE, J. M. 1999. Automatic Symposium on FPGAs, 13–16. allocation of arrays to memories in FPGA proces- HUANG,W.J.,SAXENA,N.,AND MCCLUSKEY,E.J. sors with multiple memory banks. IEEE Sympo- 2000. A reliable LZ data compressor on sium on Field-Programmable Custom Comput- reconfigurable coprocessors. IEEE Symposium ing Machines, 63–69. on Field-Programmable Custom Computing GOLDSTEIN,S.C.,SCHMIT, H., BUDIU, M., CADAMBI, Machines, 249–258. S., MOE, M., AND TAYLOR, R. 2000. PipeRench: HUELSBERGEN, L. 2000. A representation for dy- A Reconfigurable Architecture and Compiler, namic graphs in reconfigurable hardware IEEE Computer, vol. 33, No. 4. and its application to fundamental graph


algorithms. ACM/SIGDA International Sympo- KRUPNOVA, H., RABEDAORO,C.,AND SAUCIER, G. 1997. sium on FPGAs, 105–115. Synthesis and floorplanning for large hierarchi- HUTCHINGS, B. L. 1997. Exploiting reconfig- cal FPGAs. ACM/SIGDA International Sympo- urability through domain-specific systems. sium on FPGAs, 105–111. Lecture Notes in Computer Science 1304— LAI,Y.T.AND WANG, P. T. 1997. Hierarchical in- Field-Programmable Logic and Applications. terconnection structures for field programmable W. Luk, P. Y. K. Cheung, and M. Glesner, gate arrays. IEEE Trans. VLSI Syst. 5, 2, 186– Eds. Springer-Verlag, Berlin, Germany, 193– 196. 202. LAUFER, R., TAYLOR,R.R.,AND SCHMIT, H. 1999. HUTCHINGS, B., BELLOWS, P., HAWKINS, J., HEMMERT, PCI-PipeRench and the SwordAPI: A system for S., NELSON,B.,AND RYTTING, M. 1999. A CAD stream-based reconfigurable computing. IEEE suite for high-performance FPGA design. IEEE Symposium on Field-Programmable Custom Symposium on Field-Programmable Custom Computing Machines, 200–208. Computing Machines, 12–24. LEE,Y.S.AND WU, A. C. H. 1997. A performance HWANG,T.T.,OWENS, R. M., IRWIN,M.J.,AND and routability-driven router for FPGA’s consid- WANG, K. H. 1994. Logic synthesis for field- ering path delays. IEEE Trans. CAD Integ. Circ. programmable gate arrays. IEEE Trans. Com- Syst. 16, 2, 179–185. put. Aid. Des. Integ. Circ. Syst. 13, 10, 1280– LEONARD,J.AND MANGIONE-SMITH, W. H. 1997. A 1287. case study of partially evaluated hardware cir- INUANI,M.K.AND SAUL, J. 1997. Technology map- cuits: Key-specific DES. Lecture Notes in Com- ping of heterogeneous LUT-based FPGAs. Lec- puter Science 1304—Field-Programmable Logic ture Notes in Computer Science 1304—Field- and Applications. W. Luk, P. Y. K. Cheung, Programmable Logic and Applications. W. Luk, and M. Glesner, Eds. Springer-Verlag, Berlin, P. Y. K. Cheung, and M. Glesner, Eds. Springer- Germany, 151–160. Verlag, Berlin, Germany, 223–234. LEUNG, K. H., MA,K.W.,WONG,W.K.,AND LEONG, JACOB,J.A.AND CHOW, P. 1999. Memory interfacing P. H. W. 2000. FPGA Implementation of a mi- and instruction specification for reconfigurable crocoded elliptic curve cryptographic processor. processors. ACM/SIGDA International Sympo- IEEE Symposium on Field-Programmable Cus- sium on Field-Programmable Gate Arrays, 145– tom Computing Machines, 68–76. 154. LEWIS, D. M., GALLOWAY,D.R.,VAN I ERSSEL, M., ROSE, JEAN,J.S.N.,TOMKO, K., YAVAGAL, V., SHAH,J., J., AND CHOW, P. 1997. The Transmogrifier-2: AND COOK R. 1999. Dynamic reconfiguration A 1 million gate rapid prototyping system. to support concurrent applications. IEEE Trans. ACM/SIGDA International Symposium on Comput. 48, 6, 591–602. FPGAs, 53–61. KASTRUP, B., BINK, A., AND HOOGERBRUGGE, J. 1999. LI, Y., CALLAHAN, T., DARNELL, E., HARR, R., KURKURE, ConCISe: A compiler-driven CPLD-based in- U., AND STOCKWOOD, J. 2000a. Hardware- struction set accelerator. IEEE Symposium software co-design of embedded reconfigurable on Field-Programmable Custom Computing architectures. Design Automation Conference, Machines, 92–101. 507–512. KHALID, M. A. S. 1999. Routing architecture and LI, Z., COMPTON, K., AND HAUCK, S. 2000b. Config- layout synthesis for multi-FPGA systems. Ph.D. uration caching for FPGAs. IEEE Symposium dissertation, Dept. of ECE, Univ. Toronto. on Field-Programmable Custom Computing Machines, 22–36. KHALID,M.A.S.AND ROSE, J. 1998. A hybrid complete-graph partial-crossbar routing archi- LI,Z.AND HAUCK, S. 1999. Don’t care discovery for tecture for multi-FPGA systems. 
ACM/SIGDA FPGA configuration compression. ACM/SIGDA International Symposium on FPGAs, 45–54. International Symposium on FPGAs, 91–98. KIM,H.J.AND MANGIONE-SMITH, W. H. 2000. Fac- LIN, X., DAGLESS, E., AND LU, A. 1997. Technol- toring large numbers with programmable hard- ogy mapping of LUT based FPGAs for delay ware. ACM/SIGDA International Symposium optimisation. Lecture Notes in Computer Sci- on FPGAs, 41–48. ence 1304—Field-Programmable Logic and Ap- plications. W. Luk, P. Y. K. Cheung, and M. KIM,H.S.,SOMANI,A.K.,AND TYAGI, A. 2000. A Glesner, Eds. Springer-Verlag, Berlin, Germany, reconfigurable multi-function computing cache 245–254. architecture. ACM/SIGDA International Sym- posium on FPGAs, 85–94. LIU,H.AND WONG, D. F. 1999. Circuit partitioning for dynamically reconfigurable FPGAs. ACM/ KRESS, R., HARTENSTEIN,R.W.,AND NAGELDINGER,U. SIGDA International Symposium on FPGAs, 1997. An operating system for custom comput- 187–194. ing machines based on the paradigm. Lecture Notes in Computer Science 1304—Field- LUCENT TECHNOLOGIES,INC. 1998. FPGA Data Programmable Logic and Applications. W. Luk, Book. Lucent Technologies, Inc., Allentown, PA. P. Y. K. Cheung, and M. Glesner, Eds. Springer- LUK, W., SHIRAZI,N.,AND CHEUNG, P. Y. K. 1997a. Verlag, Berlin, Germany, 304–313. Compilation tools for run-time reconfigurable


designs. IEEE Symposium on Field-Programm- FPGAs. ACM/SIGDA International Symposium able Custom Computing Machines, 56–65. on FPGAs, 35–42. LUK, W., SHIRAZI, N., GUO, S. R., AND CHEUNG,P. PAYNE, R. 1997. Run-time parameterised circuits Y. K. 1997b. Pipeline morphing and virtual for the Xilinx XC6200. Lecture Notes in Com- pipelines. Lecture Notes in Computer Science puter Science 1304—Field-Programmable Logic 1304—Field-Programmable Logic and Applica- and Applications. W. Luk, P. Y. K. Cheung, tions. W. Luk, P. Y. K. Cheung, and M. Glesner, and M. Glesner, Eds. Springer-Verlag, Berlin, Eds. Springer-Verlag, Berlin, Germany, 111– Germany, 161–172. 120. PURNA,K.M.G.AND BHATIA, D. 1999. Temporal LYSAGHT,P.AND STOCKWOOD, J. 1996. A simulation partitioning and scheduling data flow graphs for tool for dynamically reconfigurable field pro- reconfigurable computers. IEEE Trans. Comput. grammable gate arrays. IEEE Trans. VLSI Syst. 48, 6, 579–590. 4, 3, 381–390. QUICKTURN,ACADENCE COMPANY. 1999a. System MAK,W.K.AND WONG, D. F. 1997. Board- RealizerTM. Available online at http://www. level multi net routing for FPGA-based logic quickturn . com / products / systemrealizer . htm. emulation. ACM Trans. Des. Automat. Elect. Quickturn, A Cadence Company, San Jose, CA. Syst. 2, 2, 151–167. QUICKTURN,ACADENCE COMPANY. 1999b. MANGIONE-SMITH, W. H. 1999. ATR from UCLA. MercuryTM Design Verification System Technol- Personal Commun. ogy Backgrounder. Available online at http:// MANGIONE-SMITH, W. H., HUTCHINGS, B., ANDREWS,D., www.quickturn.com/products/mercury backgro- DEHON, A., EBELING, C., HARTENSTEIN, R., MENCER, under.htm. Quickturn, A Cadence Company, O., MORRIS, J., PALEM, K., PRASANNA,V.K.,AND San Jose, CA, 1999. SPAANENBURG, H. A. E. 1997. Seeking solu- RAZDAN,R.AND SMITH, M. D. 1994. A high- tions in configurable computing. IEEE Comput. performance microarchitecture with hardware- 30, 12, 38–43. programmable functional units. International MARSHALL, A., STANSFIELD, T.,KOSTARNOV, I., VUILLEMIN, Symposium on Microarchitecture, 172–180. J., AND HUTCHINGS, B. 1999. A reconfigurable RENCHER,M.AND HUTCHINGS, B. L. 1997. Auto- arithmetic array for multimedia applications. mated target recognition on SPLASH2. IEEE ACM/SIGDA International Symposium on Symposium on Field-Programmable Custom FPGAs, 135–143. Computing Machines, 192–200. MCKAY,N.AND SINGH, S. 1999. Debugging tech- ROSE, J., EL GAMAL, A., AND SANGIOVANNI-VINCENTELLI, niques for dynamically reconfigurable hard- A. 1993. Architecture of field-programmable ware. IEEE Symposium on Field-Programmable gate arrays. Proc. IEEE 81, 7, 1013–1029. Custom Computing Machines, 114–122. RUPP, C. R., LANDGUTH, M., GARVERICK, T., GOMERSALL, MCMURCHIE,L.AND EBELING, C. 1995. Pathfinder: E., HOLT, H., ARNOLD,J.M.,AND GOKHALE,M. A negotiation-based performance-driven router 1998. The NAPA adaptive processing architec- for FPGAs. ACM/SIGDA International Sympo- ture. IEEE Symposium on Field-Programmable sium on FPGAs, 111–117. Custom Computing Machines, 28–37. MENCER, O., MORF, M., AND FLYNN, M. J. 1998. PAM- SANGIOVANNI-VINCENTELLI, A., EL GAMAL, A., AND ROSE, blox: High performance FPGA design for adap- J. 1993. Synthesis methods for field pro- tive computing. IEEE Symposium on Field- grammable gate arrays. Proc. IEEE 81, 7, 1057– Programmable Custom Computing Machines, 1083. 167–174. SANKAR,Y.AND ROSE, J. 1999. Trading quality MIYAMORI,T.AND OLUKOTUN, K. 1998. A quanti- for compile time: Ultra-fast placement for tative analysis of reconfigurable coprocessors FPGAs. 
ACM/SIGDA International Symposium for multimedia applications. IEEE Symposium on FPGAs, 157–166. on Field-Programmable Custom Computing SCALERA,S.M.AND VAZQUEZ, J. R. 1998. The Machines, 2–11. design and implementation of a context MORITZ, C. A., YEUNG,D.,AND AGARWAL, A. 1998. switching FPGA. IEEE Symposium on Field- Exploring optimal cost performance designs Programmable Custom Computing Machines, for Raw microprocessors. IEEE Symposium 78–85. on Field-Programmable Custom Computing SELVIDGE, C., AGARWAL, A., DAHL, M., AND BABB J. Machines, 12–27. 1995. TIERS: Topology IndependEnt Pipelined NAM, G. J., SAKALLAH, K. A., AND RUTENBAR, Routing and Scheduling for VirtualWireTM R. A. 1999. Satisfiability-based layout re- Compilation. ACM/SIGDA International Sym- visited: detailed routing of complex FPGAs posium on Field-Programmable Gate Arrays, via search-based boolean SAT. ACM/SIDGA 25–31. International Symposium on FPGAs, 167– SENOUCI, S. A., AMOURA, A., KRUPNOVA, H., AND SAUCIER, 175. G. 1998. Timing driven floorplanning on pro- PAN,P.AND LIN, C. C. 1998. A new retiming-based grammable hierarchical targets. ACM/SIGDA technology mapping algorithm for LUT-based International Symposium on FPGAs, 85–92.


SHAHOOKAR,K.AND MAZUMDER, P. 1991. VLSI cell capacity FPGA. ACM/SIGDA International placement techniques. ACM Comput. Surv. 23, Symposium on FPGAs, 3–9. 2, 145–220. TSU, W., MACY, K., JOSHI, A., HUANG, R., WALKER,N., SHI,J.AND BHATIA, D. 1997. Performance driven TUNG, T., ROWHANI, O., GEORGE, V., WAWRZYNEK, floorplanning for FPGA based designs. J., AND DEHON, A. 1999. HSRA: High-speed, ACM/SIGDA International Symposium on hierarchical synchronous reconfigurable ar- FPGAs, 112–118. ray. ACM/SIGDA International Symposium on SHIRAZI, N., LUK,W.,AND CHEUNG, P. Y. K. 1998. FPGAs, 125–134. Automating production of run-time reconfig- VAHID, F. 1997. I/O and performance tradeoffs urable designs. IEEE Symposium on Field- with the FunctionBus during multi-FPGA parti- Programmable Custom Computing Machines, tioning. ACM/SIGDA International Symposium 147–156. on FPGAs, 27–34. SLIMANE-KADI, M., BRASEN,D.,AND SAUCIER, G. 1994. VARGHESE, J., BUTTS, M., AND BATCHELLER, J. 1993. A fast-FPGA prototyping system that uses An efficient logic emulation system. IEEE Trans. inexpensive high-performance FPIC. ACM/ VLSI Syst. 1, 2, 171–174. SIGDA Workshop on Field-Programmable Gate VASILKO,M.AND CABANIS, D. 1999. Improving sim- Arrays. ulation accuracy in design methodologies for dy- SOTIRIADES, E., DOLLAS, A., AND ATHANAS, P. 2000. namically reconfigurable logic systems. IEEE Hardware-software codesign and parallel imple- Sympos. Field-Prog. Cust. Comput. Mach. 123– mentation of a Golomb ruler derivation engine. 133. IEEE Symposium on Field-Programmable Cus- VUILLEMIN, J., BERTIN, P., RONCIN, D., SHAND, M., tom Computing Machines, 227–235. TOUATI, H., AND BOUCARD, P. 1996. Pro- STOHMANN,J.AND BARKE, E. 1996. An universal grammable active memories: Reconfigurable CLA adder generator for SRAM-based FPGAs. systems come of age. IEEE Trans. VLSI Syst. 4, Lecture Notes in Computer Science 1142—Field- 1, 56–69. Programmable Logic: Smart Applications, New WANG,Q.AND LEWIS, D. M. 1997. Automated field- Paradigms and Compilers. R. W. Hartenstein programmable compute accelerator design using and M. Glesner, Eds. Springer-Verlag, Berlin, partial evaluation. IEEE Symposium on Field- Germany, 44–54. Programmable Custom Computing Machines, SWARTZ,J.S.,BETZ,V.,AND ROSE, J. 1998. A 145–154. fast routability-driven router for FPGAs. ACM/ WEINHARDT,M.AND LUK, W. 1999. Pipeline vector- SIGDA International Symposium on FPGAs, ization for reconfigurable systems. IEEE Sympo- 140–149. sium on Field-Programmable Custom Comput- SYNOPSYS,INC. 2000. CoCentric System C Com- ing Machines, 52–62. piler. Synopsys, Inc., Mountain View, CA. WILTON, S. J. E. 1998. SMAP: Heterogeneous tech- SYNPLICITY,INC. 1999. Synplify User Guide Release nology mapping for area reduction in FPGAs 5.1. Synplicity, Inc., Sunnyvale, CA. with embedded memory arrays. ACM/SIGDA TAKAHARA, A., MIYAZAKI, T.,MUROOKA, T.,KATAYAMA, M., International Symposium on FPGAs, 171–178. HAYASHI, K., TSUTSUI, A., ICHIMORI,T.,AND FUKAMI, WIRTHLIN,M.J.AND HUTCHINGS, B. L. 1995. A dy- K. 1998. More wires and fewer LUTs: A namic instruction set computer. IEEE Sym- design methodology for FPGAs. ACM/SIGDA posium on FPGAs for Custom Computing International Symposium on FPGAs, 12–19. Machines, 99–107. THAKUR, S., CHANG,Y.W.,WONG,D.F.,AND WIRTHLIN,M.J.AND HUTCHINGS, B. L. 1996. Se- MUTHUKRISHNAN, S. 1997. Algorithms for an quencing run-time reconfigured hardware with FPGA switch module routing problem with ap- software. ACM/SIGDA International Sympo- plication to global routing. IEEE Trans. 
CAD sium on FPGAs, 122–128. Integ. Circ. Syst. 16, 1, 32–46. WIRTHLIN,M.J.AND HUTCHINGS, B. L. 1997. Improv- TOGAWA, N., YANAGISAWA, M., AND OHTSUKI, T. 1998. ing functional density through run-time con- Maple-OPT: A performance-oriented simultane- stant propagation. ACM/SIGDA International ous technology mapping, placement, and global Symposium on FPGAs, 86–92. gouting algorithm for FPGA’s. IEEE Trans. CAD WITTIG,R.D.AND CHOW, P. 1996. OneChip: An Integ. Circ. Syst. 17, 9, 803–818. FPGA processor with reconfigurable logic. IEEE TRIMBERGER, S. 1998. Scheduling designs into a Symposium on FPGAs for Custom Computing time-multiplexed FPGA. ACM/SIGDA Interna- Machines, 126–135. tional Symposium on FPGAs, 153–160. WOOD,R.G.AND RUTENBAR, R. A. 1997. FPGA TRIMBERGER, S., CARBERRY, D., JOHNSON, A., AND routing and routability estimation via Boolean WONG, J. 1997a. A time-multiplexed FPGA. satisfiability. ACM/SIGDA International Sym- IEEE Symposium on Field-Programmable Cus- posium on FPGAs, 119–125. tom Computing Machines, 22–28. WU,Y.L.AND MAREK-SADOWSKA, M. 1997. Routing TRIMBERGER, S., DUONG, K., AND CONN, B. 1997b. for array-type FPGA’s. IEEE Trans. CAD Integ. Architecture issues and solutions for a high- Circ. Syst. 16, 5, 506–518.


XILINX,INC. 1994. The Programmable Logic Data macro generator. Lecture Notes in Computer Sci- Book. Xilinx, Inc., San Jose, CA. ence 1142—Field-Programmable Logic: Smart XILINX,INC. 1996. XC6200: Advance Product Spec- Applications, New Paradigms and Compil- ification. Xilinx, Inc., San Jose, CA. ers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 307– XILINX,INC. 1997. LogiBLOX: Product Specifica- 326. tion. Xilinx, Inc., San Jose, CA. TM YI,K.AND JHON, C. S. 1996. A new FPGA tech- XILINX,INC. 1999. Virtex 2.5 V Field Pro- nology mapping approach by cluster merging. grammable Gate Arrays: Advance Product Spec- Lecture Notes in Computer Science 1142—Field- ification. Xilinx, Inc., San Jose, CA. Programmable Logic: Smart Applications, New XILINX,INC. 2000. Press Release: IBM and Xilinx Paradigms and Compilers. R. W. Hartenstein Team to Create New Generation of Integrated and M. Glesner, Eds. Springer-Verlag, Berlin, Circuits. Xilinx, Inc., San Jose, CA. Germany, 366-370. XILINX,INC. 2001. Virtex-II 1.5V Field Pro- ZHONG, P., MARTINOSI, M., ASHAR,P.,AND MALIK,S. grammable Gate Arrays: Advance Product 1998. Accelerating Boolean satisfiability with Specification. Xilinx, Inc., San Jose, CA. configurable hardware. IEEE Symposium YASAR, G., DEVINS, J., TSYRKINA, Y., STADTLANDER, on Field-Programmable Custom Computing G., AND MILLHAM, E. 1996. Growable FPGA Machines, 186–195.

Received May 2000; revised October 2001 and January 2002; accepted February 2002
