Page 1 of 17 Computers & Digital Techniques

Reconfigurable Computing: Architectures and Design Methods

Journal: IEE Proc. Computers & Digital Techniques

Manuscript ID: CDT-2004-5086.R1

Manuscript Type: Research Paper

Date Submitted by the Author: n/a

Keyword:


Reconfigurable Computing: Architectures and Design Methods

T.J. Todman∗, G.A. Constantinides∗∗, S.J.E. Wilton∗∗∗, O. Mencer∗, W. Luk∗ and P.Y.K. Cheung∗∗
∗Department of Computing, Imperial College London, UK
∗∗Department of Electrical and Electronic Engineering, Imperial College London, UK
∗∗∗Department of Electrical and Computer Engineering, University of British Columbia, Canada

Abstract— Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. Our paper includes recent advances in reconfigurable architectures, such as the Stratix II and Virtex 4 FPGA devices. We identify major trends in general-purpose and special-purpose design methods. It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.

I. INTRODUCTION

Reconfigurable computing is rapidly establishing itself as a major discipline that covers various subjects of learning, including both computing science and electronic engineering. Reconfigurable computing involves the use of reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), for computing purposes. Reconfigurable computing is also known as configurable computing or custom computing, since many of the design techniques can be seen as customising a computational fabric for specific applications [76].

Reconfigurable computing systems often have impressive performance. Consider, as an example, the point multiplication operation in Elliptic Curve cryptography. For a key size of 270 bits, it has been reported [120] that a point multiplication can be computed in 0.36 ms with a reconfigurable computing design implemented in an XC2V6000 FPGA at 66 MHz. In contrast, an optimised software implementation requires 196.71 ms on a dual-Xeon computer at 2.6 GHz; so the reconfigurable computing design is more than 540 times faster, while its clock speed is almost 40 times slower than the Xeon processors. This example illustrates a hardware design implemented on a reconfigurable computing platform. We regard such implementations as a subset of reconfigurable computing, which in general can involve the use of runtime reconfiguration and soft processors.

Is this speed advantage of reconfigurable computing over traditional microprocessors a one-off or a sustainable trend? Recent research suggests that it is a trend rather than a one-off for a wide variety of applications: from image processing [51] to floating-point operations [124].

Sheer speed, while important, is not the only strength of reconfigurable computing. Another compelling advantage is reduced energy and power consumption. In a reconfigurable system, the circuitry is optimized for the application, such that the power consumption will tend to be much lower than that for a general-purpose processor. A recent study [113] reports that moving critical software loops to reconfigurable hardware results in average energy savings of 35% to 70% with an average speedup of 3 to 7 times, depending on the particular device used.

Other advantages of reconfigurable computing include a reduction in size and component count (and hence cost), improved time-to-market, and improved flexibility and upgradability. These advantages are especially important for embedded applications. Indeed, there is evidence [125] that embedded systems developers show a growing interest in reconfigurable computing systems, especially with the introduction of soft cores which can contain one or more instruction processors [7], [43], [74], [103], [104], [136].

In this paper, we present a survey of modern reconfigurable system architectures and design methods. Although we also provide background information on notable aspects of older technologies, our focus is on the most recent architectures and design methods, as well as the trends that will drive each of these areas in the near future. In other words, we intend to complement other survey papers [16], [27], [77], [101], [121] by:

1) providing an up-to-date survey of material that appears after the publication of the papers mentioned above;
2) identifying explicitly the main trends in architectures and design methods for reconfigurable computing;
3) examining reconfigurable computing from a perspective different from existing surveys, for instance classifying design methods as special-purpose and general-purpose;
4) offering various direct comparisons of technology options according to a selected set of metrics from different perspectives.

The rest of the paper is organised as follows. Section 2 contains background material that motivates the reconfigurable computing approach. Section 3 describes the structure of reconfigurable fabrics, showing how various researchers and vendors have developed fabrics that can efficiently accelerate time-critical portions of applications. Section 4 then covers recent advances in the development of design methods that map applications to these fabrics, and distinguishes between those which employ special-purpose and general-purpose optimization methods. Finally, Section 5 concludes the paper and


summarises the main trends in architectures, design methods and applications of reconfigurable computing.

II. BACKGROUND

Many of today's compute-intensive applications require more processing power than ever before. Applications such as streaming video, image recognition and processing, and highly interactive services are placing new demands on the computation units that implement these applications. At the same time, the power consumption targets, the acceptable packaging and manufacturing costs, and the time-to-market requirements of these computation units are all decreasing rapidly, especially in the embedded hand-held devices market. Meeting these performance requirements under the power, cost, and time-to-market constraints is becoming increasingly challenging.

In the following, we describe three ways of supporting such processing requirements: high-performance microprocessors, application-specific integrated circuits, and reconfigurable computing systems.

High-performance microprocessors provide an off-the-shelf means of addressing the processing requirements described earlier. Unfortunately, for many applications a single processor, even an expensive state-of-the-art processor, is not fast enough. In addition, the power consumption (100 watts or more) and cost (possibly thousands of dollars) of state-of-the-art processors place them out of reach for many embedded applications. Even if microprocessors continue to follow Moore's Law, so that their density doubles every 18 months, they may still be unable to keep up with the requirements of some of the most aggressive embedded applications.

Application-specific integrated circuits (ASICs) provide another means of addressing these processing requirements. Unlike a software implementation, an ASIC implementation provides a natural mechanism for implementing the large amount of parallelism found in many of these applications. In addition, an ASIC circuit does not need to suffer from the serial (and often slow and power-hungry) instruction fetch, decode, and execute cycle that is at the heart of all microprocessors. Furthermore, ASICs consume less power than reconfigurable devices. Finally, an ASIC can contain just the right mix of functional units for a particular application; in contrast, an off-the-shelf microprocessor contains a fixed set of functional units which must be selected to satisfy a wide variety of applications.

Despite the advantages of ASICs, they are often infeasible or uneconomical for many embedded systems. This is primarily due to two factors: the cost of producing an ASIC, often dominated by the mask cost (up to $1 million [100]), and the time needed to develop a custom integrated circuit can both be unacceptable. Only for the very highest-volume applications would the improved performance and lower per-unit price warrant the high non-recurring engineering (NRE) cost of designing an ASIC.

A third means of providing this processing power is a reconfigurable computing system. A reconfigurable computing system typically contains one or more processors and a reconfigurable fabric upon which custom functional units can be built. The processor(s) executes sequential and non-critical code, while code that can be efficiently mapped to hardware can be "executed" by processing units that have been mapped to the reconfigurable fabric. Like a custom integrated circuit, the functions that have been mapped to the reconfigurable fabric can take advantage of the parallelism achievable in a hardware implementation. Also like an ASIC, the embedded system designer can produce the right mix of functional and storage units in the reconfigurable fabric, providing a computing structure that matches the application. Unlike an ASIC, however, a new fabric need not be designed for each application. A given fabric can implement a wide variety of functional units. This means that a reconfigurable computing system can be built out of off-the-shelf components, significantly reducing the long design time inherent in an ASIC implementation. Also unlike an ASIC, the functional units implemented in the reconfigurable fabric can change over time. This means that as the environment or usage of the embedded system changes, the mix of functional units can adapt to better match the new environment. The reconfigurable fabric in a handheld device, for instance, might implement large matrix multiply operations when the device is used in one mode, and large signal processing functions when the device is used in another mode.

Typically, not all of the embedded system functionality needs to be implemented by the reconfigurable fabric. Only those parts of the computation that are time-critical and contain a high degree of parallelism need to be mapped to the reconfigurable fabric, while the remainder of the computation can be implemented by a standard instruction processor. The interface between the processor and the fabric, as well as the interface between the memory and the fabric, are therefore of the utmost importance. Modern reconfigurable devices are large enough to implement instruction processors within the programmable fabric itself: soft processors. These can be general purpose, or customised to a particular application; Application Specific Instruction Processors and Flexible Instruction Processors are two such approaches. Section IV-C part 2 deals with soft processors in more detail.

Other devices show some of the flexibility of reconfigurable computers. Examples include Graphics Processor Units and Application Specific Array Processors. These devices perform well on their intended application, but cannot run more general computations, unlike reconfigurable computers and microprocessors.

Despite the compelling promise of reconfigurable computing, it has limitations of which designers should be aware. For instance, the flexible routing at the bit level tends to produce large silicon area and performance overheads when compared with ASIC technology. Hence, for large-volume production of designs in applications without the need for field upgrade, ASIC technology or gate array technology can still deliver higher-performance designs at lower unit cost than reconfigurable computing technology. However, since FPGA technology tracks advances in memory technology and


has demonstrated impressive advances in the last few years, many are confident that the current rapid progress in FPGA speed, capacity and capability will continue, together with the reduction in price.

It should be noted that the development of reconfigurable systems is still a maturing field. There are a number of challenges in developing a reconfigurable system. We describe three such challenges below.

First, the structure of the reconfigurable fabric and the interfaces between the fabric, processor(s), and memory must be very efficient. Some reconfigurable computing systems use a standard Field-Programmable Gate Array [3], [6], [67], [87], [94], [135] as a reconfigurable fabric, while others adopt custom-designed fabrics [26], [41], [42], [50], [54], [79], [82], [86], [99], [106], [108], [118].

Another challenge is the development of computer-aided design and compilation tools that map an application to a reconfigurable computing system. This involves determining which parts of the application should be mapped to the fabric and which should be mapped to the processor; determining when and how often the reconfigurable fabric should be reconfigured, which changes the functional units implemented in the fabric; as well as the specification of algorithms for efficient mappings to the reconfigurable system.

In this paper, we provide a survey of reconfigurable computing, focusing our discussion on both the issues described above. In the next section, we provide a survey of various architectures that are found useful for reconfigurable computing; material on design methods will follow.

III. ARCHITECTURES

We shall first describe system-level architectures for reconfigurable computing. We then present various flavours of reconfigurable fabric. Finally, we identify and summarise the main trends.

A. System-level architectures

A reconfigurable system typically consists of one or more processors, one or more reconfigurable fabrics, and one or more memories. Reconfigurable systems are often classified according to the degree of coupling between the reconfigurable fabric and the CPU. Compton and Hauck [27] present the four classifications shown in Figure 1(a-d).

[Fig. 1. Five classes of reconfigurable systems (the first four are adapted from [27]): (a) external stand-alone processing unit; (b) attached processing unit; (c) co-processor; (d) reconfigurable functional unit; (e) processor embedded in a reconfigurable fabric.]

In Figure 1(a), the reconfigurable fabric is in the form of one or more stand-alone devices. The existing input and output mechanisms of the processor are used to communicate with the reconfigurable fabric. In this configuration, the data transfer between the fabric and the processor is relatively slow, so this architecture only makes sense for applications in which a significant amount of processing can be done by the fabric without processor intervention. Emulation systems often take on this sort of architecture [18], [85].

Figure 1(b) and Figure 1(c) show two intermediate structures. In both cases, the cost of communication is lower than that of the architecture in Figure 1(a). Architectures of these types are described in [8], [50], [54], [68], [99], [108], [126], [132].


TABLE I
SUMMARY OF SYSTEM ARCHITECTURES.

Class                                             | Example                    | CPU to memory bandwidth | Shared memory size | Fine or Coarse Grained | Example application
(a) External stand-alone processing unit          | RC2000 [23]                | 528 MB/s                | 152 MB             | Fine                   | Video processing
(b)/(c) Attached processing unit / co-processor   | Pilchard [73]              | 1064 MB/s               | 20 Kbytes          | Fine                   | DES encryption
(b)/(c) Attached processing unit / co-processor   | MorphoSys [108]            | 800 MB/s                | 2048 bytes         | Coarse                 | Video compression
(d) Reconfigurable functional unit                | Chess [79]                 | 6400 MB/s               | 12288 bytes        | Coarse                 | Video processing
(e) Processor embedded in a reconfigurable fabric | Xilinx Virtex II Pro [135] | 1600 MB/s               | 1172 KB            | Fine                   | Video compression

Next, Figure 1(d) shows an architecture in which the processor and the fabric are very tightly coupled; in this case, the reconfigurable fabric is part of the processor itself, perhaps forming a reconfigurable sub-unit that allows for the creation of custom instructions. Examples of this sort of architecture have been described in [79], [86], [96], [118].

Figure 1(e) shows a fifth organization. In this case, the processor is embedded in the programmable fabric. The processor can either be a "hard" core [5], [134], or can be a "soft" core which is implemented using the resources of the programmable fabric itself [7], [43], [74], [103], [104], [136].

A summary of the above organizations can be found in Table I. Note that the bandwidth is the theoretical maximum available to the CPU: for example, in Chess [79], we assume that each block RAM is being accessed at its maximum rate. Organization (a) is by far the most common, and accounts for all commercial reconfigurable platforms.

B. Reconfigurable fabric

The heart of any reconfigurable system is the reconfigurable fabric. The reconfigurable fabric consists of a set of reconfigurable functional units, a reconfigurable interconnect, and a flexible interface to connect the fabric to the rest of the system. In this section, we review each of these components, and show how they have been used in both commercial and academic reconfigurable systems.

A common theme runs through this entire section: in each component of the fabric, there is a tradeoff between flexibility and efficiency. A highly flexible fabric is typically much larger and much slower than a less flexible fabric. On the other hand, a more flexible fabric is better able to adapt to the application requirements. In the following discussions, we will see how this tradeoff has influenced the design of every part of every reconfigurable system. A summary of the main features of various architectures can be found in Table II.

[Fig. 2. Fine-grained reconfigurable functional units: (a) three-input lookup table; (b) cluster of lookup tables.]

1) Reconfigurable functional units: Reconfigurable functional units can be classified as either coarse-grained or fine-grained. A fine-grained functional unit can typically implement a single function on a single bit, or a small number of bits. The most common kind of fine-grained functional units are the small lookup tables that are used to implement the bulk of the logic in a commercial field-programmable gate array. A coarse-grained functional unit, on the other hand, is typically much larger, and may consist of arithmetic and logic units (ALUs) and possibly even a significant amount of storage. In this section, we describe the two types of functional units in more detail.


TABLE II
COMPARISON OF RECONFIGURABLE FABRICS AND DEVICES.

Fabric or Device           | Fine or Coarse Grained | Base Logic Component                                    | Routing Architecture                           | Embedded Memory                              | Special Features
ProASIC+ [3]               | Fine                   | 3-input block                                           | Horizontal and vertical tracks                 | 256x9 bit blocks                             | Flash-based
Altera Excalibur [5]       | Fine                   | 4-input lookup tables                                   | Horizontal and vertical tracks                 | 2 Kbit memory blocks                         | ARMv4T embedded processor
Altera Stratix II [6]      | Fine/Coarse            | 8-input Adaptive Logic Module                           | Horizontal and vertical tracks                 | 512 bit, 4 Kbit and 512 Kbit blocks          | DSP blocks
Garp [54]                  | Fine                   | Logic or arithmetic functions on four 2-bit input words | 2-bit buses in horizontal and vertical columns | External to fabric                           | -
Xilinx Virtex II Pro [134] | Fine                   | 4-input lookup tables                                   | Horizontal and vertical tracks                 | 18 Kbit blocks                               | Embedded multipliers, PowerPC 405 processor
Xilinx Virtex II [135]     | Fine                   | 4-input lookup tables                                   | Horizontal and vertical tracks                 | 18 Kbit blocks                               | Embedded multipliers
DReAM [11]                 | Coarse                 | 8-bit ALUs                                              | 16-bit local and global buses                  | Two 16x8 dual-port memories                  | Targets mobile applications
Elixent D-Fabrix [42]      | Coarse                 | 4-bit ALUs                                              | 4-bit buses                                    | 256x8 memory blocks                          | -
HP Chess [79]              | Coarse                 | 4-bit ALUs                                              | 4-bit buses                                    | 256x8 bit memories                           | -
IMEC ADRES [82]            | Coarse                 | 32-bit ALUs                                             | 32-bit buses                                   | Small register files in each logic component | -
Matrix [86]                | Coarse                 | 8-bit ALUs                                              | Hierarchical 8-bit buses                       | 256x8 bit memories                           | -
MorphoSys [108]            | Coarse                 | ALU, multiplier and shift units                         | Buses                                          | External to fabric                           | -
Piperench [50]             | Coarse                 | 8-bit ALUs                                              | 8-bit buses                                    | External to fabric                           | Functional units arranged in 'stripes'
RaPiD [41]                 | Coarse                 | ALUs                                                    | Buses                                          | Embedded memory blocks                       | -
Silicon Hive Avispa [106]  | Coarse                 | ALUs, shifters, accumulators and multipliers            | Buses                                          | Five embedded memories                       | -

Many reconfigurable systems use commercial FPGAs as a reconfigurable fabric. These commercial FPGAs contain many three to six input lookup tables, each of which can be thought of as a very fine-grained functional unit. Figure 2(a) illustrates a lookup table; by shifting in the correct pattern of bits, this functional unit can implement any single function of up to three inputs – the extension to lookup tables with larger numbers of inputs is clear. Typically, lookup tables are combined into clusters, as shown in Figure 2(b). Figure 3 shows clusters in two popular FPGA families. Figure 3(a) shows a cluster in the Altera Stratix device; Altera calls these clusters "Logic Array Blocks" [6]. Figure 3(b) shows a cluster in the Xilinx architecture [135]; Xilinx calls these clusters "Configurable Logic Blocks" (CLBs). In the Altera diagram, each block labeled "LE" is a lookup table, while in the Xilinx diagram, each "slice" contains two lookup tables. Other commercial FPGAs are described in [3], [67], [87], [94].

Reconfigurable fabrics containing lookup tables are very flexible, and can be used to implement any digital circuit. However, compared to the coarse-grained structures described in the next subsection, these fine-grained structures incur significantly more area, delay, and power overhead. Recognizing that these fabrics are often used for arithmetic purposes, FPGA companies have added features such as carry-chains and cascade-chains to reduce the overhead when implementing common arithmetic and logic functions. Figure 4 shows how the carry and cascade chains, as well as the ability to break a 4-input lookup table into four two-input lookup tables, can be exploited to efficiently implement carry-select adders [6]. The multiplexers and the exclusive-or gate in Figure 4 are included as part of each logic array block, and need not be implemented using other lookup tables.

The example in Figure 4 shows how the efficiency of commercial FPGAs can be improved by adding architectural support for common functions. We can go much further than this, though, and embed significantly larger, but far less flexible, reconfigurable functional units. There are two kinds of devices that contain coarse-grained functional units. First, modern FPGAs, which are primarily composed of fine-grained functional units, are increasingly being enhanced by the inclusion of larger blocks. As an example, the Xilinx Virtex device contains embedded 18-bit by 18-bit multiplier units [135]. When implementing algorithms requiring a large amount of multiplication, these embedded blocks can significantly improve the density, speed and power of the device. On the other hand, for algorithms which do not perform multiplication, these blocks are rarely useful. The Altera Stratix devices contain a larger, but more flexible, embedded block called a DSP block, shown in Figure 5 [6]. Each of these blocks can perform accumulate functions as well as multiply operations. The comparison between the two devices clearly illustrates the flexibility and overhead tradeoff: the Altera DSP block may be more flexible than the Xilinx multiplier; however, it consumes more chip area and runs somewhat slower.


[Fig. 3. Commercial logic block architectures: (a) Altera Logic Array Block [6]; (b) Xilinx Configurable Logic Block [135].]

[Fig. 4. Implementing a carry-select adder in an Altera Stratix device [6]. 'LUT' denotes 'Lookup Table'.]

[Fig. 5. Altera DSP Block [6].]

The commercial FPGAs described above contain both fine-grained and coarse-grained blocks. There are also devices which contain only coarse-grained blocks [26], [41], [50], [79], [82], [108]. An example of a coarse-grained architecture is the ADRES architecture, which is shown in Figure 6 [82]. Each reconfigurable functional unit in this device contains a 32-bit ALU which can be configured to implement one of several functions including addition, multiplication, and logic functions, with two small register files. Clearly, such a functional unit is far less flexible than the fine-grained functional units described earlier; however, if the application requires functions which match the capabilities of the ALU, these functions can be very efficiently implemented in this architecture.

2) Reconfigurable interconnects: Regardless of whether a device contains fine-grained functional units, coarse-grained functional units, or a mixture of the two, the functional units need to be connected in a flexible way. Again, there is a tradeoff between the flexibility of the interconnect (and hence the reconfigurable fabric) and the speed, area, and power-efficiency of the architecture.

[Fig. 6. ADRES reconfigurable functional unit [82].]

[Fig. 7. Routing architectures: (a) fine-grained; (b) coarse-grained.]

As before, reconfigurable interconnect architectures can be classified as fine-grained or coarse-grained. The distinction is based on the granularity with which wires are switched. This is illustrated in Figure 7, which shows a flexible interconnect between two buses. In the fine-grained architecture in Figure 7(a), each wire can be switched independently, while in Figure 7(b), the entire bus is switched as a unit. The fine-grained routing architecture in Figure 7(a) is more flexible, since not every bit needs to be routed in the same way; however, the coarse-grained architecture in Figure 7(b) contains far fewer programming bits, and hence suffers much less overhead.

Fine-grained routing architectures are usually found in commercial FPGAs. In these devices, the functional units are typically arranged in a grid pattern, and they are connected using horizontal and vertical channels. Significant research has been performed on the optimization of the topology of this interconnect [13], [72]. Coarse-grained routing architectures are commonly used in devices containing coarse-grained functional units. Figure 8 shows two examples of coarse-grained routing architectures: (a) the Totem reconfigurable system [26]; and (b) the Silicon Hive reconfigurable system [106], which is less flexible but faster and smaller.

[Fig. 8. Example coarse-grained routing architectures: (a) Totem coarse-grained routing architecture [26]; (b) Silicon Hive coarse-grained routing architecture [106].]

3) Emerging directions: Several emerging directions are covered in the following. These directions include low-power techniques, asynchronous architectures, and molecular microelectronics.

• Low-power techniques. Early work explores the use of low-swing circuit techniques to reduce the power consumption in a hierarchical interconnect for a low-energy FPGA [47]. Recent work involves: (a) activity reduction in power-aware design tools, with energy savings of 23% [66]; (b) leakage current reduction methods such as gate biasing and multiple supply-voltage integration, with up to two times leakage power reduction [95]; and (c) dual supply-voltage methods with the lower voltage assigned to non-critical paths, resulting in an average power reduction of 60% [46].

• Asynchronous architectures. There is an emerging interest in asynchronous FPGA architectures. An asynchronous version of Piperench [50] is estimated to improve performance by 80%, at the expense of a significant increase in configurable storage and wire count [60]. Other efforts in this direction include fine-grained asynchronous pipelines [119], quasi delay-insensitive architectures [133], and globally-asynchronous locally-synchronous techniques [98].

• Molecular microelectronics. In the long term, molecular techniques offer a promising opportunity for increasing the capacity and performance of reconfigurable computing architectures [17]. Current work is focused on developing Programmable Logic Arrays based on molecular-scale nano-wires [38], [130].

C. Architectures: main trends

The following summarises the main trends in architectures for reconfigurable computing.

1) Coarse-grained fabrics: As reconfigurable fabrics are migrated to more advanced technologies, the cost (in terms of both speed and power) of the interconnect part of a reconfigurable fabric is growing. Designers are responding to this by increasing the granularity of their logic units, thereby reducing the amount of interconnect needed. In the Stratix II device, Altera moved away from simple 4-input lookup tables, and used a more complex logic block which can implement functions of up to 7 inputs. We should expect to see a slow migration to more complex logic blocks, even in stand-alone FPGAs.

2) Heterogeneous functions: As devices are migrated to more advanced technologies, the number of transistors that can be devoted to the reconfigurable logic fabric increases. This provides new opportunities to embed complex non-programmable (or semi-programmable) functions, creating heterogeneous architectures with both general-purpose logic resources and fixed-function embedded blocks. Modern Xilinx parts have embedded 18-bit by 18-bit multipliers, while modern Altera parts have embedded DSP units which can perform a variety of multiply/accumulate functions. Again, we should expect to see a migration to more heterogeneous architectures in the near future.

3) Soft cores: The use of "soft" cores, particularly for instruction processors, is increasing. A "soft" core is one in which the vendor provides a synthesisable version of the function, and the user implements the function using the reconfigurable fabric. Although this is less area- and speed-efficient than a hard embedded core, the flexibility and the ease of integrating these soft cores make them attractive. The extra overhead becomes less of a hindrance as the number of transistors devoted to the reconfigurable fabric increases. Altera and Xilinx both provide numerous soft cores, including soft instruction processors such as NIOS [7] and MicroBlaze [136]. Soft instruction processors have also been developed by a number of researchers, ranging from customisable JVM and MIPS processors [103] to ones specialised for machine learning [43] and data encryption [74].

IV. DESIGN METHODS

Hardware compilers for high-level descriptions are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This section looks at high-level design methods from two perspectives: special-purpose design and general-purpose design. Low-level design methods and tools, covering topics such as technology mapping, floorplanning, and place and route, are beyond the scope of this paper – interested readers are referred to [27].

A. General-purpose design

This section describes design methods and tools based on a general-purpose programming language such as C, possibly adapted to facilitate hardware development. Of course, traditional hardware description languages like VHDL and Verilog are widely available, especially on commercial reconfigurable platforms.

A number of compilers from C to hardware have been developed; some of the significant ones are reviewed here. These range from compilers which only target hardware to those which target complete hardware/software systems; some also partition designs into hardware and software components.

We can classify different design methods into two approaches: the annotation and constraint-driven approach, and the source-directed compilation approach. The first approach preserves the source programs in C or C++ as much as possible, and makes use of annotations and constraint files to drive the compilation process. The second approach modifies the source language to let the designer specify, for instance, the amount of parallelism or the size of variables.

1) Annotation and constraint-driven approach: The systems mentioned below employ annotations in the source code and constraint files to control the optimisation process. Their strength is that usually only minor changes are needed to produce a compilable program from a software description – no extensive re-structuring is required. Five representative methods are SPC [128], Streams-C [49], Sea Cucumber [59], SPARK [53] and Catapult-C [80].

SPC [128] combines vectorisation, loop transformations and retiming with automatic memory allocation to improve performance. SPC accelerates C loop nests with data dependency restrictions, compiling them into pipelines. Based on the SUIF framework [131], this approach uses loop transformations, and can take advantage of run-time reconfiguration and memory access optimisation. Similar methods have been advocated by other researchers [55], [102].

Streams-C [49] compiles a C program to synthesisable VHDL. Streams-C exploits coarse-grained parallelism in stream-based computations; low-level optimisations such as pipelining are performed automatically by the compiler.

Sea Cucumber [59] compiles Java programs to hardware using a similar scheme to Handel-C, which we detail in the next section. Unlike Handel-C, no language extensions are needed; like Streams-C, users must call a library, in this case based on Communicating Sequential Processes (CSP [56]). Multiple circuit implementations of the library primitives enable tradeoffs.

SPARK [53] is a high-level synthesis framework targeting multimedia and image processing. It compiles C code with the following steps: (a) list scheduling based on speculative code motions and loop transformations, (b) resource binding


pass with minimisation of interconnect, (c) finite state machine controller generation for the scheduled datapath, (d) code generation producing synthesisable register-transfer level VHDL. Logic synthesis tools then synthesise the output.

Catapult-C synthesises Register Transfer Level (RTL) descriptions from unannotated C++, using characterisations of the target technology from RTL synthesis tools [80]. Users can set constraints to explore the design space, controlling loop pipelining and resource sharing.

2) Source-directed compilation approach: A different approach adapts the source language to enable explicit description of parallelism, communication and other customisable hardware resources such as variable size. Examples of design methods following this approach include ASC [84], Handel-C [22], Haydn-C [37] and Bach-C [137].

ASC [84] adopts C++ custom types and operators to provide a C++ programming interface on the algorithm, architecture, arithmetic and gate levels. This enables the user to program on the desired level for each part of the application. Semi-automated design space exploration further increases design productivity, while supporting the optimisation process on all available levels of abstraction. The object-oriented design enables efficient code-reuse, and includes an integrated arithmetic unit generation library [83]. A floating-point library [75] provides over 200 different floating-point units, each with custom bitwidths for mantissa and exponent.

Handel-C [22] extends a subset of C to support flexible-width variables, signals, parallel blocks, bit-manipulation operations, and channel communication. A distinctive feature is that the timing of the compiled circuit is fixed at one cycle per C statement. This allows Handel-C programmers to schedule hardware resources manually. Handel-C compiles to a one-hot state machine using a token-passing scheme developed by Page and Luk [92]; each assignment of the program maps to exactly one control flip-flop in the state machine. These control flip-flops capture the flow of control (represented by the token) in the program: if the control flip-flop corresponding to a particular statement is active, then control has passed to that statement, and the circuitry compiled for that statement is activated. When the statement has finished execution, it passes the token to the next statement in the program.

Haydn-C [37] extends Handel-C for component-based design. Like Handel-C, it supports description of parallel blocks, bit-manipulation operations, and channel communication. The principal innovation of Haydn-C is a framework of optional annotations to enable users to describe design constraints, and to direct source-level transformations such as scheduling and resource allocation. There are automated transformations, so that a single high-level design can be used to produce many implementations with different trade-offs. This approach has been evaluated using various case studies, including FIR filters, fractal generators, and morphological operators. The fastest morphological erosion design is 129 times faster and 3.4 times larger than the smallest design.

Bach-C [137] is similar to Handel-C but has an untimed semantics, only synchronising between parallel threads on synchronous communications between them, possibly giving greater scope for optimisation. It also allows asynchronous communications, but otherwise resembles Handel-C, using the same basic one-hot compilation scheme.

Table III summarises the various compilers discussed in this section, showing their approach, source and target languages, target architecture and some example applications. Note that the compilers discussed are not necessarily restricted to the architectures reported; some can usually be ported to a different architecture by using a different library of hardware primitives.

B. Special-purpose design

Within the wide variety of problems to which reconfigurable computing can be applied, there are many specific problem domains which deserve special consideration. The motivation is to exploit domain-specific properties: (a) to describe the computation, such as using MATLAB for digital signal processing, and (b) to optimise the implementation, such as using the word-length optimisation techniques described later.

We shall begin with an overview of digital signal processing and relevant tools which target reconfigurable implementations. We then describe the word-length optimisation problem, the solution to which promises rich rewards; an example of such a solution will be covered. Finally we summarise other domain-specific design methods which have been proposed for video and image processing and networking.

1) Digital signal processing: One of the most successful applications for reconfigurable computing is Real-time Digital Signal Processing (DSP). This is illustrated by the inclusion of hardware support for DSP in the latest FPGA devices, such as the embedded DSP blocks in Altera Stratix II chips [6]. DSP problems tend to share the following properties: design latency is usually less of an issue than design throughput, algorithms tend to be numerically intensive but have very simple control structures, controlled numerical error is acceptable, and standard metrics, such as signal-to-noise ratio, exist for measuring numerical precision quality.

DSP algorithm design is often initially performed directly in a graphical programming environment such as Mathworks' MATLAB Simulink [107]. Simulink is widely used within the DSP community, and has recently been incorporated into the Xilinx System Generator [58] and Altera DSP Builder [4] design flows. Design approaches such as this are based on the idea of data-flow graphs (DFGs) [69].

Tools working with this form of description vary in the level of user intervention required to specify the numerical properties of the implementation. For example, in the Xilinx System Generator flow [58], it is necessary to specify the number of bits used to represent each signal, the scaling of each signal (namely the binary point location), and whether to use saturating or wrap-around arithmetic [30].

Ideally, these implementation details could be automated. Beyond a standard DFG-based algorithm description, only one piece of information should be required: a lower bound on the output signal-to-quantization-noise ratio acceptable to the user. Such a design tool would thus represent a truly 'behavioural'


TABLE III
SUMMARY OF GENERAL-PURPOSE HARDWARE COMPILERS.

System | Approach | Source Language | Target Language | Target Architecture | Example Applications
Streams-C [49] | Annotation / constraint-driven | C + library | RTL VHDL | Xilinx FPGA | Image contrast enhancement, pulsar detection [45]
Sea Cucumber [59] | Annotation / constraint-driven | Java + library | EDIF | Xilinx FPGA | None given
SPARK [53] | Annotation / constraint-driven | C | RTL VHDL | LSI, Altera FPGAs | MPEG-1 predictor, image tiling
SPC [128] | Annotation / constraint-driven | C | EDIF | Xilinx FPGAs | String pattern matching, image skeletonisation
ASC [84] | Source-directed compilation | C++ using class library | EDIF | Xilinx FPGAs | Wavelet compression, encryption
Handel-C [22] | Source-directed compilation | Extended C | Structural VHDL, Verilog, EDIF | Actel, Altera, Xilinx FPGAs | Image processing, polygon rendering [114]
Haydn-C [37] | Source-directed compilation | Extended C (Handel-C) | Extended C | Xilinx FPGAs | FIR filter, image erosion
Bach-C [137] | Source-directed compilation | Extended C | Behavioural and RTL VHDL | LSI FPGAs | Viterbi decoders, image processing

synthesis route, exposing to the DSP engineer only those aspects of design naturally expressed in the DSP application domain.

2) The word-length optimization problem: Unlike microprocessor-based implementations, where the word-length is defined a-priori by the hard-wired architecture of the processor, reconfigurable computing based on FPGAs allows the size of each variable to be customised to produce the best trade-offs in numerical accuracy, design size, speed, and power consumption. The use of such custom data representation for optimising designs is one of the main strengths of reconfigurable computing.

Given this flexibility, it is desirable to automate the process of finding a good custom data representation. The most important implementation decision to automate is the selection of an appropriate word-length and scaling for each signal [32] in a DSP system.

It has been argued that, often, the most efficient hardware implementation of an algorithm is one in which a wide variety of finite-precision representations of different sizes are used for different internal variables [28]. The accuracy observable at the outputs of a DSP system is a function of the word-lengths used to represent all intermediate variables in the algorithm. However, accuracy is less sensitive to some variables than to others, as is implementation area. It is demonstrated in [32] that by considering error and area information in a structured way, using analytical and semi-analytical noise models, it is possible to achieve highly efficient DSP implementations.

In [35] it has been demonstrated that the problem of word-length optimization is NP-hard, even for systems with special mathematical properties that simplify the problem from a practical perspective [31]. There are, however, several published approaches to word-length optimization. These can be classified as heuristics offering an area / signal quality tradeoff [28], [65], [127], approaches that make some simplifying assumptions on error properties [20], [65], or optimal approaches that can be applied to algorithms with particular mathematical properties [29].

Some published approaches to the word-length optimization problem use an analytic approach to scaling and/or error estimation [89], [111], [127], some use simulation [20], [65], and some use a hybrid of the two [25]. The advantage of analytic techniques is that they do not require representative simulation stimulus, and can be faster; however, they tend to be more pessimistic. There is little analytical work on supporting data-flow graphs containing cycles, although in [111] finite loop bounds are supported, while [31] supports cyclic data-flow when the nodes are of a restricted set of types, extended to a semi-analytic technique with fewer restrictions in [34].

Some published approaches use worst-case instantaneous error as a measure of signal quality [20], [89], [127], whereas some use signal-to-noise ratio [28], [65].

The remainder of this section reviews in some detail particular research approaches in the field.

The Bitwise Project [111] proposes propagation of integer variable ranges backwards and forwards through data-flow graphs. The focus is on removing unwanted most-significant bits (MSBs). Results from integration in a synthesis flow indicate that area savings of between 15% and 86%, combined with speed increases of up to 65%, can be achieved compared to using 32-bit integers for all variables.

The MATCH Project [89] also uses range propagation through data-flow graphs, except variables with a fractional component are allowed. All signals in the model of [89] must have equal fractional precision; the authors propose an analytic worst-case error model in order to estimate the required number of fractional bits. Area reductions of 80% combined with speed increases of 20% are reported when


compared to a uniform 32-bit representation.

Wadekar and Parker [127] have also proposed a methodology for word-length optimization. Like [89], this technique also allows controlled worst-case error at system outputs; however, each intermediate variable is allowed to take a word-length appropriate to the sensitivity of the output errors to quantization errors on that particular variable. Results indicate area reductions of between 15% and 40% over the optimum uniform word-length implementation.

Kum and Sung [65] and Cantin et al. [20] have proposed several word-length optimization techniques to trade off system area against system error. These techniques are heuristics based on bit-true simulation of the design under various internal word-lengths.

In Bitsize [1], [2], Abdul Gaffar et al. propose a hybrid method based on the mathematical technique known as automatic differentiation to perform bitwidth optimisation. In this technique, the gradients of outputs with respect to the internal variables are calculated and then used to determine the sensitivities of the outputs to the precision of the internal variables. The results show that it is possible to achieve an area reduction of 20% for floating-point designs, and 30% for fixed-point designs, when given an output error specification of 0.75% against a reference design.

A useful survey of algorithmic procedures for word-length determination has been provided by Cantin et al. [21]. In this work, existing heuristics are classified under various categories. However, the 'exhaustive' and 'branch-and-bound' procedures described in [21] do not necessarily capture the optimum solution to the word-length determination problem, due to non-convexity in the constraint space: it is actually possible to have a lower error at a system output by reducing the word-length at an internal node [33]. Such an effect is modeled in the MILP approach proposed in [29].

A comparative summary of existing optimization systems is provided in Table IV. Each system is classified according to the several defining features described below.
• Is the word-length and scaling selection performed through analytic or simulation-based means?
• Can the system support algorithms exhibiting cyclic data flow (such as infinite impulse response filters)?
• What mechanisms are supported for Most Significant Bit (MSB) optimizations? (Such as ignoring MSBs that are known to contain no useful information, a technique determined by the scaling approach used.)
• What mechanisms are supported for Least Significant Bit (LSB) optimizations? These involve the monitoring of word-length growth. In addition, for those systems which support error trade-offs, further optimizations include the quantization (truncation or rounding) of unwanted LSBs.
• Does the system allow the user to trade off numerical accuracy for a more efficient implementation?

3) An example optimization flow: One possible design flow for word-length optimization, used in the Right-Size system [34], is illustrated in Figure 9 for Xilinx FPGAs. The inputs to this system are a specification of the system behaviour (e.g. using Simulink), a specification of the acceptable signal-to-noise ratio at each output, and a set of representative input signals. From these inputs, the tool automatically generates a synthesisable structural description of the architecture and a bit-true behavioural VHDL testbench, together with a set of expected outputs for the provided set of representative inputs. Also generated is a makefile which can be used to automate the post-Right-Size synthesis process.

[Fig. 9. Design flow for the Right-Size tool [34]. The shaded portions are FPGA vendor-specific.]

Application of Right-Size to various adaptive filters implemented in a Xilinx Virtex FPGA has resulted in area reduction of up to 80%, power reduction of up to 98%, and speed-up of up to 36% over common alternative design methods without word-length optimisation.

4) Other design methods: Besides signal processing, video and image processing is another area that can benefit from special-purpose design methods. Three examples will be given to provide a flavour of this approach. First, the CHAMPION system [90] maps designs captured in the Cantata graphical programming environment to multiple reconfigurable computing platforms. Second, the IGOL framework [122] provides a layered architecture for facilitating hardware plug-ins to be incorporated in various applications in the Microsoft Windows operating system, such as Premiere, Winamp, VirtualDub and DirectShow. Third, the SA-C compiler [15] maps a high-level single-assignment language specialised for image processing into hardware, using various optimisation methods including loop unrolling, array value propagation, loop-carried array elimination, and multi-dimensional stripmining.

Recent work indicates that another application area that can benefit from special-purpose techniques is networking. Two examples will be given. First, a framework has been developed to compile descriptions in the network policy language Ponder [36] into reconfigurable hardware implementations [70]. Second, it is shown [63] how descriptions in the Click networking language can produce efficient reconfigurable designs.
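The word-length trade-off underlying the DSP methods above can be illustrated in miniature. The following Python sketch (illustrative only: the function names are ours, and no surveyed tool works exactly this way) uniformly quantizes a reference signal at increasing fractional word-lengths until a user-specified signal-to-quantization-noise ratio is met, mimicking the simulation-based heuristics discussed earlier.

```python
import math

def quantize(x, frac_bits):
    """Round x onto a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def snr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    sig = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return float("inf")
    return 10 * math.log10(sig / noise)

def min_frac_bits(signal, target_snr_db, max_bits=32):
    """Smallest fractional word-length whose bit-true simulation meets the SNR target."""
    for bits in range(1, max_bits + 1):
        q = [quantize(x, bits) for x in signal]
        if snr_db(signal, q) >= target_snr_db:
            return bits
    return max_bits

# Example: a sampled sine wave as the representative input, 60 dB quality target.
signal = [math.sin(2 * math.pi * k / 64) for k in range(64)]
bits = min_frac_bits(signal, 60.0)
```

Analytic approaches replace the inner simulation loop with a noise model, avoiding the need for representative input stimulus at the cost of some pessimism.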


TABLE IV
A COMPARISON OF WORDLENGTH AND SCALING OPTIMIZATION SYSTEMS AND METHODS.

System | Analytic / Simulation | Cyclic data flow? | MSB optimization | LSB optimization | Error trade-off | Comments
Benedetti [12] | analytic | none | through interval arithmetic | through 'multi-interval' approach | no | error can be pessimistic
Stephenson [111], [112] | analytic | for finite loop bounds | through forward and backward range propagation | none | no | error less pessimistic than [12] due to backwards propagation
Nayak [89] | analytic | not supported for error analysis | through forward and backward range propagation | through fixing number of fractional bits for all variables | user-specified or inferred absolute bounds on error | fractional parts have equal wordlength
Wadekar [127] | analytic | none | through forward range propagation | through genetic algorithm search for suitable wordlengths | user-specified absolute bounds | uses Taylor series at limiting values to determine error propagation
Keding [62], [129] | hybrid | with user intervention | through user-annotations and forward range propagation | through user-annotations and forward wordlength propagation | not automated | possible truncation error
Cmar [25] | hybrid for scaling; simulation for error only | with user intervention | through combined simulation and forward range propagation | wordlength bounded through hybrid fixed or floating simulation | not automated | less pessimistic than [12] due to input error propagation
Kum [64], [65], [116], [117] | simulation (hybrid for multiply-accumulate in [64], [65]) | yes | through measurement of signal mean and standard deviation | through heuristics based on simulation results | user-specified bounds and metric | long simulation time possible
Constantinides [32] | analytic | yes | through tight analytic bounds on signal range and automatic design of saturation arithmetic | through heuristics based on an analytic noise model | user-specified bounds on noise power and spectrum | only applicable to linear time-invariant systems
Constantinides [34] | hybrid | yes | through simulation | through heuristics based on a hybrid noise model | user-specified bounds on noise power and spectrum | only applicable to differentiable non-linear systems
Abdul Gaffar [1], [2] | hybrid | with user intervention | through simulation-based range propagation | through automatic differentiation based dynamic analysis | user-specified bounds and metric | covers both fixed-point and floating-point
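Several of the systems compared above infer MSB requirements by propagating value ranges forwards through the data-flow graph using interval arithmetic. The following Python sketch illustrates that idea in miniature, under the simplifying assumption of two's-complement signed values (the function names and the example graph are our own; real tools also propagate ranges backwards and handle many more operators):

```python
import math

def add_range(a, b):
    """Interval of x + y for x in interval a and y in interval b."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Interval of x * y for x in interval a and y in interval b."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def int_bits(interval):
    """Integer (MSB-side) bits needed for a signed value within the interval."""
    m = max(abs(interval[0]), abs(interval[1]))
    if m == 0:
        return 1
    return 1 + math.ceil(math.log2(m + 1))  # sign bit + magnitude bits

# Propagate ranges through y = (x1 + x2) * x3, given known input ranges.
x1, x2, x3 = (-1.0, 1.0), (0.0, 3.0), (-2.0, 2.0)
s = add_range(x1, x2)   # intermediate sum
y = mul_range(s, x3)    # output
```

Backward propagation, as in [111], can tighten these ranges further by also taking into account how each value is used downstream.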

C. Other design methods

In the following, we describe various other design methods in brief.

1) Run-time customisation: Many aspects of run-time reconfiguration have been explored [27], including the use of directives in high-level descriptions [71]. Effective run-time customisation hinges on appropriate design-time preparation for such customisation. To illustrate this point, consider a run-time customisable system that supports partial reconfiguration: one part of the system continues to be operational, while another part is being reconfigured. As FPGAs get larger, partial reconfiguration is becoming increasingly important as a means of reducing reconfiguration time. To support partial reconfiguration, appropriate circuits must be built at fabrication time as part of the FPGA fabric. Then at compile time, an initial configuration bitstream and incremental bitstreams have to be produced, together with run-time customisation facilities which can be executed, for instance, on a microprocessor serving as part of the run-time system [105]. Run-time customisation facilities can include support for condition monitoring, design optimisation and reconfiguration control.

Opportunities for run-time design optimisation include: (a) run-time constant propagation [39], which produces a smaller circuit with higher performance by treating run-time data as constant, and optimising the circuit principally by Boolean algebra; (b) library-based compilation – the DISC compiler [24] makes use of a library of precompiled logic modules which can be loaded into reconfigurable resources by the procedure call mechanism; (c) exploiting information about program branch probabilities [115] – the idea is to promote utilisation by dedicating more resources to branches which execute more frequently. A hardware compiler has been developed to produce a collection of designs, each optimised for a particular branch probability; the best can be selected at run time by incorporating observed branch probability information from a queueing network performance model.

2) Soft instruction processors: FPGA technology can now support one or more soft instruction processors implemented using reconfigurable resources on a single chip; proprietary instruction processors, like MicroBlaze and Nios, are now available from FPGA vendors. Often such processors support customisation of resources and custom instructions. Custom instructions have two main benefits. First, they reduce the time for instruction fetch and decode, provided that each custom instruction replaces several regular instructions. Second, additional resources can be assigned to a custom instruction


to improve performance. Bit-width optimisation, described in program to extract parallelism for an efficient result, which Section IV-B, can also be applied to customise instruction pro- is necessary if compilation from languages such as C is to cessors at compile time. A challenge of customising instruction compete with traditional methods for designing hardware. processors is that the tools for producing and analysing instruc- One example is the work of Babb et al. [10], targeting tions also need to be customised. For instance, the flexible custom, fixed-logic implementation while also applicable to instruction processor framework [103] has been developed reconfigurable hardware. The compiler uses the SUIF infras- to automate the steps in customising an instruction processor tructure to do several analyses to find what computations affect and the corresponding tools. Other researchers have proposed exactly what data, as far as possible. A tiled architecture is similar approaches [61]. synthesised, where all computation is kept as local as possible Instruction processors can also run declarative langauges. to one tile. More recently, Ziegler et al. [138] have used For instance, a scalable architecture [43], consisting of mul- loop transformations in mapping loop nests onto a pipeline tiple processors based on the Warren Abstract Machine, has spanning several FPGAs. A further effort is given by the Garp been developed to support the execution of the Progol system project [19]. [88], based on the declarative language Prolog. Its effective- ness has been demonstrated using the mutagenesis data set D. Emerging directions containing 12000 facts about chemical compounds. 1) Verification: As designs are becoming more complex, 3) Multi-FPGA compilation: Peterson et al. have developed techniques for verifying their correctness are becoming in- a C compiler which compiles to multi-FPGA systems [93]. creasingly important. 
Four approaches are described: (1) The The available FPGAs and other units are specified in a library InterSim framework [97] provides a means of combining file, allowing portability. The compiler can generate designs software simulation and hardware prototyping. (2) The Lava using speculative and lazy execution to improve performance system [14] can convert designs into a form suitable for input and ultimately they aim to partition a single program be- to a model checker; a number of FPGA design libraries have tween host and reconfigurable resource (hardware/software been verified in this way [109]. (3) The Ruby language [52] codesign). Duncan et al. have developed a system with similar supports correctness-preserving transformations, and a wide capabilities [40]. This is also retargetable, using hierarchical variety of hardware designs have been produced. Fourth, the architecture descriptions. It synthesises a VLIW architecture Pebble [78] hardware design language has been formally that can be partitioned across multiple FPGAs. Both methods specified [81], so that provably-correct design tools can be can split designs across several FPGAs, and are retargetable developed. via hardware description libraries. Other C-like languages that 2) Customisable hardware compilation: Recent work [123] have been developed include MoPL-3, a C extension support- explains how customisable frameworks for hardware compi- ing data procedural compilation for the Xputer architecture lation can enable rapid design exploration, and reusable and which comprises an array of reconfigurable ALUs [9], and extensible hardware optimisation. The framework compiles spC, a systolic parallel C variant for the Enable++ board [57]. 
a parallel imperative language like Handel-C, and supports 4) Hardware/software codesign: Several research groups multiple levels of design abstraction, transformational devel- have studied the problem of compiling C code to both hard- opment, optimisation by compiler passes, and metalanguage ware and software. The Garp compiler [19] is intended to facilities. The approach has been used in producing designs accelerate plain C, with no annotations to help the compiler, for applications such as signal and image processing, with making it more widely applicable. The work targets one different trade-offs in performance and resource usage. architecture only: the Garp chip, which integrates a RISC core and reconfigurable logic. This compiler also uses the E. Design methods: main trends SUIF framework. The compiler uses a technique first devel- We summarise the main trends in design methods for oped for VLIW architectures called hyperblock scheduling, reconfigurable computing below. which optimises for instruction-level parallelism across several 1) Special-purpose design: As explained earlier, special- common paths, at the expense of rarer paths. Infeasible or purpose design methods and tools enable both high-level rare paths are implemented on the processor with the more design as well as domain-specific optimisation. Existing meth- common, easily parallelisable paths synthesised into logic for ods, such as those compiling MATLAB Simulink descrip- the reconfigurable resource. Similarly, the NAPA C compiler tions into reconfigurable computing implementations [1], [4], targets the NAPA architecture [48], which also integrates [34], [58], [89], [91], allow application developers without a RISC processor reconfigurable logic. This compiler can electronic design experience to produce efficient hardware also work on plain C code but the programmer can add C implementations quickly and effectively. 
This is an area that pragmas to indicate large-scale parallelism and the bit-widths would assume further importance in future. of variables to the code. The compiler can synthesise pipelines 2) Low-power design: Several hardware compilers aim to from loops. minimise the power consumption of their generated designs. 5) Annotation-free compilation: Some researchers aim to Examples include special-purpose design methods such as compile a sequential program, without any annotations, into Right-Size [34] and PyGen [91], and general-purpose meth- efficient hardware design. This requires analysis of the source ods that target loops for configurable hardware implementa-

IEE Proceedings Review Copy Only Page 15 of 17 Computers & Digital Techniques

tion [113]. These design methods, when combined with low-power architectures [46] and power-aware low-level tools [66], can provide significant reductions in power consumption.

3) High-level transformations: Many hardware design methods [15], [53], [128] involve high-level transformations: loop unrolling, loop restructuring and static single assignment are three examples. The development of powerful transformations for design optimisation will continue for both special-purpose and general-purpose designs.

V. SUMMARY

This paper surveys two aspects of reconfigurable computing: architectures and design methods. The main trends in architectures are coarse-grained fabrics, heterogeneous functions, and soft cores. The main trends in design methods are special-purpose design methods, low-power techniques, and high-level transformations. We wonder what a survey paper on reconfigurable computing, written in 2015, will cover.

ACKNOWLEDGMENTS

Our thanks to Ray Cheung and Sherif Yusuf for their support in preparing this paper. The support of Celoxica, Xilinx and UK EPSRC (Grant numbers GR/R 31409, GR/R 55931, GR/N 66599) is gratefully acknowledged.

REFERENCES

[1] A. Abdul Gaffar, O. Mencer, W. Luk, P.Y.K. Cheung and N. Shirazi, “Floating-point bitwidth analysis via automatic differentiation”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002.
[2] A. Abdul Gaffar, O. Mencer, W. Luk and P.Y.K. Cheung, “Unifying bit-width optimisation for fixed-point and floating-point designs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004.
[3] Actel Corp., ProASIC Plus Family Flash FPGAs, v3.5, April 2004.
[4] Altera Corp., DSP Builder User Guide, Version 2.1.3 rev. 1, July 2003.
[5] Altera Corp., Excalibur Device Overview, May 2002.
[6] Altera Corp., Stratix II Device Handbook, February 2004.
[7] Altera Corp., Nios II Processor Reference Handbook, May 2004.
[8] Annapolis Microsystems, Inc., Wildfire Reference Manual, 1998.
[9] A. Ast, J. Becker, R. Hartenstein, R. Kress, H. Reinig and K. Schmidt, “Data-procedural languages for FPL-based machines”, Field-Programmable Logic and Applications, LNCS 849, Springer, 1994.
[10] J. Babb, M. Rinard, C. Andras Moritz, W. Lee, M. Frank, R. Barua and S. Amarasinghe, “Parallelizing applications into silicon”, Proc. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1999.
[11] J. Becker and M. Glesner, “A parallel dynamically reconfigurable architecture designed for flexible application-tailored hardware/software systems in future mobile communication”, The Journal of Supercomputing, Vol. 19, No. 1, 2001, pp. 105–127.
[12] A. Benedetti and B. Perona, “Bit-width optimization for configurable DSP’s by multi-interval analysis”, Proc. 34th Asilomar Conference on Signals, Systems and Computers, 2000.
[13] V. Betz, J. Rose and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, February 1999.
[14] P. Bjesse, K. Claessen, M. Sheeran and S. Singh, “Lava: hardware design in Haskell”, Proc. ACM Int. Conf. on Functional Programming, ACM Press, 1998.
[15] W. Bohm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker and W. Najjar, “Mapping a single assignment programming language to reconfigurable systems”, The Journal of Supercomputing, Vol. 21, 2002, pp. 117–130.
[16] K. Bondalapati and V.K. Prasanna, “Reconfigurable computing systems”, Proc. IEEE, Vol. 90, No. 7, pp. 1201–1217, July 2002.
[17] M. Butts, A. DeHon and S. Goldstein, “Molecular electronics: devices, systems and tools for gigagate, gigabit chips”, Proc. Int. Conference on Computer-Aided Design, IEEE, 2002.
[18] Cadence Inc., Palladium Datasheet, 2004.
[19] T. Callahan and J. Wawrzynek, “Instruction-level parallelism for reconfigurable computing”, Field-Programmable Logic and Applications, LNCS 1482, Springer, 1998.
[20] M.-A. Cantin, Y. Savaria and P. Lavoie, “An automatic word length determination method”, Proc. IEEE International Symposium on Circuits and Systems, pp. V-53–V-56, 2001.
[21] M.-A. Cantin, Y. Savaria and P. Lavoie, “A comparison of automatic word length optimization procedures”, Proc. IEEE Int. Symp. on Circuits and Systems, 2002.
[22] Celoxica, Handel-C Language Reference Manual for DK2.0, Document RM-1003-4.0, 2003.
[23] Celoxica, RC2000 Development and Evaluation Board Data Sheet, version 1.1, 2004.
[24] D. Clark and B. Hutchings, “The DISC programming environment”, Proc. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
[25] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde and I. Bolsens, “A methodology and design environment for DSP ASIC fixed point refinement”, Proc. Design Automation and Test in Europe, 1999.
[26] K. Compton and S. Hauck, “Totem: custom reconfigurable array generation”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001.
[27] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software”, ACM Computing Surveys, Vol. 34, No. 2, pp. 171–210, June 2002.
[28] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “The multiple wordlength paradigm”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001.
[29] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Optimum wordlength allocation”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002.
[30] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Optimum and heuristic synthesis of multiple wordlength architectures”, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 22, No. 10, pp. 1432–1442, October 2003.
[31] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Synthesis of saturation arithmetic architectures”, ACM Trans. on Design Automation of Electronic Systems, Vol. 8, No. 3, July 2003.
[32] G.A. Constantinides, P.Y.K. Cheung and W. Luk, Synthesis and Optimization of DSP Algorithms, Kluwer Academic, Dordrecht, 2004.
[33] G.A. Constantinides, High Level Synthesis and Word Length Optimization of Digital Signal Processing Systems, PhD thesis, Imperial College London, 2001.
[34] G.A. Constantinides, “Perturbation analysis for word-length optimization”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[35] G.A. Constantinides and G.J. Woeginger, “The complexity of multiple wordlength assignment”, Applied Mathematics Letters, Vol. 15, No. 2, pp. 137–140, February 2002.
[36] N. Damianou, N. Dulay, E. Lupu and M. Sloman, “The Ponder policy specification language”, Proc. Workshop on Policies for Distributed Systems and Networks, LNCS 1995, Springer, 2001.
[37] J.G. De Figueiredo Coutinho and W. Luk, “Source-directed transformations for hardware compilation”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003.
[38] A. DeHon and M.J. Wilson, “Nanowire-based sublithographic programmable logic arrays”, Proc. Int. Symp. on FPGAs, ACM Press, 2004.
[39] A. Derbyshire and W. Luk, “Compiling run-time parametrisable designs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002.
[40] A. Duncan, D. Hendry and P. Gray, “An overview of the COBRA-ABS high-level synthesis system for multi-FPGA systems”, Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1998.
[41] C. Ebeling, D. Cronquist and P. Franklin, “RaPiD – reconfigurable pipelined datapath”, Field-Programmable Logic and Applications, LNCS 1142, Springer, 1996.
[42] Elixent Corporation, DFA 1000 Accelerator Datasheet, 2003.
[43] A. Fidjeland, W. Luk and S. Muggleton, “Scalable acceleration of inductive logic programs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002.

[44] R. Franklin, D. Carver and B.L. Hutchings, “Assisting network intrusion detection with reconfigurable hardware”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002.
[45] J. Frigo, D. Palmer, M. Gokhale and M. Popkin-Paine, “Gamma-Ray pulsar detection using reconfigurable computing hardware”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[46] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M.J. Irwin and T. Tuan, “A dual-VDD low power FPGA architecture”, Field-Programmable Logic and Applications, LNCS Series, Springer, 2004.
[47] V. George, H. Zhang and J. Rabaey, “The design of a low energy FPGA”, Proc. Int. Symp. on Low Power Electronics and Design, 1999.
[48] M. Gokhale and J. Stone, “NAPA C: compiling for a hybrid RISC/FPGA architecture”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1998.
[49] M. Gokhale, J.M. Stone, J. Arnold and M. Kalinowski, “Stream-oriented FPGA computing in the Streams-C high level language”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2000.
[50] S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe and R. Taylor, “PipeRench: a reconfigurable architecture and compiler”, IEEE Computer, Vol. 33, No. 4, 2000, pp. 70–77.
[51] Z. Guo, W. Najjar, F. Vahid and K. Vissers, “A quantitative analysis of the speedup factors of FPGAs over processors”, Proc. Int. Symp. on FPGAs, ACM Press, 2004.
[52] S. Guo and W. Luk, “An integrated system for developing regular array design”, Journal of Systems Architecture, Vol. 47, 2001.
[53] S. Gupta, N.D. Dutt, R.K. Gupta and A. Nicolau, “SPARK: a high-level synthesis framework for applying parallelizing compiler transformations”, Proc. International Conference on VLSI Design, January 2003.
[54] J.R. Hauser and J. Wawrzynek, “Garp: a MIPS processor with a reconfigurable coprocessor”, IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1997.
[55] T. Harriss, R. Walke, B. Kienhuis and E. Deprettere, “Compilation from Matlab to process networks realized in FPGA”, Design Automation of Embedded Systems, Vol. 7, No. 4, 2002.
[56] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985.
[57] H. Högl, A. Kugel, J. Ludvig, R. Männer, K. Noffz and R. Zoz, “Enable++: a second-generation FPGA processor”, IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1995.
[58] J. Hwang, B. Milne, N. Shirazi and J.D. Stroomer, “System level tools for DSP in FPGAs”, Field-Programmable Logic and Applications, LNCS 2147, Springer, 2001.
[59] P.A. Jackson, B.L. Hutchings and J.L. Tripp, “Simulation and synthesis of CSP-based interprocess communication”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[60] H. Kagotani and H. Schmit, “Asynchronous PipeRench: architecture and performance evaluations”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[61] V. Kathail et al., “PICO: automatically designing custom computers”, Computer, Vol. 35, No. 9, Sept. 2002.
[62] H. Keding, M. Willems, M. Coors and H. Meyr, “FRIDGE: a fixed-point design and simulation environment”, Proc. Design Automation and Test in Europe, 1998.
[63] C. Kulkarni, G. Brebner and G. Schelle, “Mapping a domain specific language to a platform FPGA”, Proc. Design Automation Conference, 2004.
[64] K. Kum and W. Sung, “Word-length optimization for high-level synthesis of digital signal processing systems”, Proc. IEEE Int. Workshop on Signal Processing Systems, 1998.
[65] K.-I. Kum and W. Sung, “Combined word-length optimization and high-level synthesis of digital signal processing systems”, IEEE Trans. Computer Aided Design, Vol. 20, No. 8, pp. 921–930, August 2001.
[66] J. Lamoureux and S.J.E. Wilton, “On the interaction between power-aware FPGA CAD algorithms”, Int. Conf. on Computer-Aided Design, IEEE, 2003.
[67] Lattice Semiconductor Corp., ispXPGA Family, January 2004.
[68] R. Laufer, R. Taylor and H. Schmit, “PCI-PipeRench and the SwordAPI: a system for stream-based reconfigurable computing”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1999.
[69] E.A. Lee and D.G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing”, IEEE Trans. Computers, January 1987.
[70] T.K. Lee, S. Yusuf, W. Luk, M. Sloman, E. Lupu and N. Dulay, “Compiling policy descriptions into reconfigurable firewall processors”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[71] T.K. Lee, A. Derbyshire, W. Luk and P.Y.K. Cheung, “High-level language extensions for run-time reconfigurable systems”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003.
[72] G. Lemieux and D. Lewis, Design of Interconnect Networks for Programmable Logic, Kluwer Academic Publishers, Feb. 2004.
[73] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong and K. Lee, “Pilchard – a reconfigurable computing platform with memory slot interface”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001.
[74] P.H.W. Leong and K.H. Leung, “A microcoded Elliptic Curve Processor using FPGA technology”, IEEE Transactions on Very Large Scale Integration Systems, Vol. 10, No. 5, 2002, pp. 550–559.
[75] J. Liang, R. Tessier and O. Mencer, “Floating point unit generation and evaluation for FPGAs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.
[76] W. Luk, “Customising processors: design-time and run-time opportunities”, Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133, Springer, 2004.
[77] W. Luk, P.Y.K. Cheung and N. Shirazi, “Configurable computing”, Electrical Engineer’s Handbook, W.K. Chen (ed.), Academic Press, 2004.
[78] W. Luk and S.W. McKeever, “Pebble: a language for parametrised and reconfigurable hardware design”, Field-Programmable Logic and Applications, LNCS 1482, Springer, 1998.
[79] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin and B. Hutchings, “A reconfigurable arithmetic array for multimedia applications”, ACM/SIGDA International Symposium on FPGAs, Feb 1999, pp. 135–143.
[80] S. McCloud, Catapult C Synthesis-based Design Flow: Speeding Implementation and Increasing Flexibility, White Paper, 2004.
[81] S.W. McKeever, W. Luk and A. Derbyshire, “Compiling hardware descriptions with relative placement information for parametrised libraries”, Formal Methods in Computer-Aided Design, LNCS 2517, Springer, 2002.
[82] B. Mei, S. Vernalde, D. Verkest, H. De Man and R. Lauwereins, “ADRES: an architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003.
[83] O. Mencer, “PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002.
[84] O. Mencer, D.J. Pearce, L.W. Howes and W. Luk, “Design space exploration with A Stream Compiler”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003.
[85] Mentor Graphics, Vstation Pro: High Performance System Verification, 2003.
[86] E. Mirsky and A. DeHon, “MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1996.
[87] K. Morris, “Virtex 4: Xilinx details its next generation”, FPGA and Programmable Logic Journal, June 2004.
[88] S.H. Muggleton, “Inverse entailment and Progol”, New Generation Computing, Vol. 13, 1995.
[89] A. Nayak, M. Haldar, A. Choudhary and P. Banerjee, “Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs”, Proc. Design Automation and Test in Europe, 2001.
[90] S. Ong, N. Kerkiz, B. Srijanto, C. Tan, M. Langston, D. Newport and D. Bouldin, “Automatic mapping of multiple applications to multiple adaptive computing systems”, Proc. Int. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001.
[91] J. Ou and V. Prasanna, “PyGen: a MATLAB/Simulink based tool for synthesizing parameterized and energy efficient designs using FPGAs”, Proc. Int. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004.

[92] I. Page and W. Luk, “Compiling occam into FPGAs”, FPGAs, Abingdon EE&CS Books, 1991.
[93] J. Peterson, B. O’Connor and P. Athanas, “Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures”, Int. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
[94] Quicklogic Corp., Eclipse-II Family Datasheet, January 2004.
[95] A. Rahman and V. Polavarapuv, “Evaluation of low-leakage design techniques for Field Programmable Gate Arrays”, Proc. Int. Symp. on Field-Programmable Gate Arrays, ACM Press, 2004.
[96] R. Razdan and M.D. Smith, “A high performance microarchitecture with hardware programmable functional units”, International Symposium on Microarchitecture, 1994, pp. 172–180.
[97] T. Rissa, W. Luk and P.Y.K. Cheung, “Automated combination of simulation and hardware prototyping”, Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms, CSREA Press, 2004.
[98] A. Royal and P.Y.K. Cheung, “Globally asynchronous locally synchronous FPGA architectures”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003.
[99] C.R. Rupp, M. Landguth, T. Garverick, E. Gomersall, H. Holt, J. Arnold and M. Gokhale, “The NAPA adaptive processing architecture”, IEEE Symposium on Field-Programmable Custom Computing Machines, May 1998, pp. 28–37.
[100] T. Saxe and B. Faith, “Less is More with FPGAs”, EE Times, September 13, 2004, http://www.eetimes.com/showArticle.jhtml?articleID=47203801.
[101] P. Schaumont, I. Verbauwhede, K. Keutzer and M. Sarrafzadeh, “A quick safari through the reconfiguration jungle”, Proc. Design Automation Conf., ACM Press, 2001.
[102] R. Schreiber et al., “PICO-NPA: high-level synthesis of nonprogrammable hardware accelerators”, Journal of VLSI Signal Processing Systems, Vol. 31, No. 2, June 2002.
[103] S.P. Seng, W. Luk and P.Y.K. Cheung, “Flexible instruction processors”, Proc. Int. Conf. on Compilers, Architecture and Synthesis for Embedded Systems, ACM Press, 2000.
[104] S.P. Seng, W. Luk and P.Y.K. Cheung, “Run-time adaptive flexible instruction processors”, Field-Programmable Logic and Applications, LNCS 2438, Springer, 2002.
[105] N. Shirazi, W. Luk and P.Y.K. Cheung, “Framework and tools for run-time reconfigurable designs”, IEE Proc.-Comput. Digit. Tech., May 2000.
[106] Silicon Hive, “Avispa Block Accelerator”, Product Brief, 2003.
[107] Simulink, http://www.mathworks.com
[108] H. Singh, M.-H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh and E. Chaves, “MorphoSys: an integrated reconfigurable system for data-parallel and compute intensive applications”, IEEE Trans. on Computers, Vol. 49, No. 5, May 2000, pp. 465–481.
[109] S. Singh and C.J. Lillieroth, “Formal verification of reconfigurable cores”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1999.
[110] I. Sourdis and D. Pnevmatikatos, “Fast, large-scale string matching for a 10Gbps FPGA-based network intrusion detection system”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003.
[111] M. Stephenson, J. Babb and S. Amarasinghe, “Bitwidth analysis with application to silicon compilation”, Proc. SIGPLAN Programming Language Design and Implementation, June 2000.
[112] M.W. Stephenson, Bitwise: Optimizing Bitwidths Using Data-Range Propagation, Master’s Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, May 2000.
[113] G. Stitt, F. Vahid and S. Nematbakhsh, “Energy savings and speedups from partitioning critical software loops to hardware in embedded systems”, ACM Trans. on Embedded Computing Systems, Vol. 3, No. 1, Feb. 2004, pp. 218–232.
[114] H. Styles and W. Luk, “Customising graphics applications: techniques and programming interface”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2000.
[115] H. Styles and W. Luk, “Branch optimisation techniques for hardware compilation”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003.
[116] W. Sung and K. Kum, “Word-length determination and scaling software for a signal flow block diagram”, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1994.
[117] W. Sung and K. Kum, “Simulation-based word-length optimization method for fixed-point digital signal processing systems”, IEEE Trans. Signal Processing, Vol. 43, No. 12, December 1995, pp. 3087–3090.
[118] M. Taylor et al., “The RAW microprocessor: a computational fabric for software circuits and general purpose programs”, IEEE Micro, Vol. 22, No. 2, March/April 2002, pp. 25–35.
[119] J. Teifel and R. Manohar, “Programmable asynchronous pipeline arrays”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003.
[120] N. Telle, C.C. Cheung and W. Luk, “Customising hardware designs for Elliptic Curve Cryptography”, Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133, Springer, 2004.
[121] R. Tessier and W. Burleson, “Reconfigurable computing and digital signal processing: a survey”, Journal of VLSI Signal Processing, Vol. 28, May/June 2001, pp. 7–27.
[122] D. Thomas and W. Luk, “A framework for development and distribution of …”, Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications, Proc. SPIE, Vol. 4867, 2002.
[123] T. Todman, J.G.F. Coutinho and W. Luk, “Customisable hardware compilation”, Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms, CSREA Press, 2004.
[124] K. Underwood, “FPGAs vs. CPUs: trends in peak floating-point performance”, Proc. Int. Symp. on FPGAs, ACM Press, 2004.
[125] L. Vereen, “Soft FPGA cores attract embedded developers”, Embedded Systems Programming, 23 April 2004, http://www.embedded.com/showArticle.jhtml?articleID=19200183.
[126] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati and P. Boucard, “Programmable Active Memories: reconfigurable systems come of age”, IEEE Transactions on VLSI Systems, Vol. 4, No. 1, March 1996, pp. 56–69.
[127] S.A. Wadekar and A.C. Parker, “Accuracy sensitive word-length selection for algorithm optimization”, Proc. International Conference on Computer Design, 1998.
[128] M. Weinhardt and W. Luk, “Pipeline vectorization”, IEEE Trans. on Computer-Aided Design, Vol. 20, No. 2, 2001.
[129] M. Willems, V. Bürsgens, H. Keding, T. Grötker and M. Meyer, “System-level fixed-point design based on an interpolative approach”, Proc. 34th Design Automation Conference, June 1997.
[130] R.S. Williams and P.J. Kuekes, “Molecular nanoelectronics”, Proc. Int. Symp. on Circuits and Systems, IEEE, 2000.
[131] R.P. Wilson, R.S. French, C.S. Wilson, S.P. Amarasinghe, J.M. Anderson, S.W.K. Tjiang, S.-W. Liao, C.-W. Tseng, M.W. Hall, M.S. Lam and J.L. Hennessy, “SUIF: an infrastructure for research on parallelizing and optimizing compilers”, ACM SIGPLAN Notices, Vol. 29, No. 12, December 1994.
[132] R.D. Wittig and P. Chow, “OneChip: an FPGA processor with reconfigurable logic”, IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
[133] C.G. Wong, A.J. Martin and P. Thomas, “An architecture for asynchronous FPGAs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003.
[134] Xilinx, Inc., PowerPC 405 Processor Block Reference Guide, October 2003.
[135] Xilinx, Inc., Virtex II Datasheet, June 2004.
[136] Xilinx, Inc., MicroBlaze Processor Reference Guide, June 2004.
[137] A. Yamada, K. Nishida, R. Sakurai, A. Kay, T. Nomura and T. Kambe, “Hardware synthesis with the Bach system”, Proc. ISCAS, IEEE, 1999.
[138] H. Ziegler, B. So, M. Hall and P. Diniz, “Coarse-grain pipelining on multiple-FPGA architectures”, IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE, 2002, pp. 77–88.
