Reconfigurable Computing: Architectures and Design Methods

T.J. Todman∗, G.A. Constantinides∗∗, S.J.E. Wilton∗∗∗, O. Mencer∗, W. Luk∗ and P.Y.K. Cheung∗∗
∗ Department of Computing, Imperial College London, UK
∗∗ Department of Electrical and Electronic Engineering, Imperial College London, UK
∗∗∗ Department of Electrical and Computer Engineering, University of British Columbia, Canada

Abstract

Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. Our chapter includes recent advances in reconfigurable architectures, such as the Stratix II and Virtex 4 FPGA devices. We identify major trends in general-purpose and special-purpose design methods.

I. INTRODUCTION

Reconfigurable computing is rapidly establishing itself as a major discipline that covers various subjects of learning, including both computing science and electronic engineering. Reconfigurable computing involves the use of reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), for computing purposes. Reconfigurable computing is also known as configurable computing or custom computing, since many of the design techniques can be seen as customising a computational fabric for specific applications [102].

Reconfigurable computing systems often have impressive performance. Consider, as an example, the point multiplication operation in Elliptic Curve cryptography. For a key size of 270 bits, it has been reported [172] that a point multiplication can be computed in 0.36 ms with a reconfigurable computing design implemented in an XC2V6000 FPGA at 66 MHz. In contrast, an optimised software implementation requires 196.71 ms on a dual-Xeon computer at 2.6 GHz; the reconfigurable computing design is thus more than 540 times faster, while its clock speed is almost 40 times slower than that of the Xeon processors. This example illustrates a hardware design implemented on a reconfigurable computing platform. We regard such implementations as a subset of reconfigurable computing, which in general can also involve the use of runtime reconfiguration and soft processors.

Reconfigurable computing involves devices that can be reconfigured: their circuits can be changed after they are manufactured. This means that rather than using a single circuit for many applications, like a microprocessor, specific circuits can be generated for specific applications. How can circuits be changed after manufacture? Typically, reconfigurable devices use memory whose state switches logical elements (for example flip-flops and function generators) and the wiring between them.
The state of all these memory bits is known as the configuration of the device, and determines its function (for example, image processor or network firewall). In section 3, we survey different styles of reconfigurable logical elements and wiring. New circuits for new applications can be uploaded to the reconfigurable device by writing to the configuration memory. An example of a reconfigurable device is the Xilinx Virtex 4 [116]. In this device, the configuration memory controls logical elements (which include flip-flops, function generators, multiplexors and memories) and wiring, arranged in a hierarchical scheme. Designing circuits for reconfigurable devices is akin to designing application-specific integrated circuits, with the additional possibility that the design can change, perhaps in response to data received. Design methods can be general-purpose (for example, using the C programming language) or special-purpose (for example, using domain-specific tools such as MATLAB). We review such design methods in section 4. There are many commercial tools which support reconfigurable computing, including:
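The idea that a device's function is determined entirely by the contents of its configuration memory can be illustrated with a minimal sketch (a behavioural Python model, not any vendor's actual bitstream format): a 3-input lookup table is simply eight memory bits indexed by its inputs, so rewriting those bits "reconfigures" the logic without any physical change.

```python
# Minimal behavioural model of a 3-input lookup table (LUT). Its
# "configuration memory" is 8 bits, one per input combination; changing
# those bits changes the implemented function -- no physical rewiring.

def make_lut(config_bits):
    """config_bits[i] is the output bit for the input combination i."""
    assert len(config_bits) == 8
    def lut(a, b, c):
        index = (a << 2) | (b << 1) | c   # the inputs select one memory bit
        return config_bits[index]
    return lut

# Configure the same "hardware" first as a 3-input AND, then as XOR.
and_cfg = [1 if i == 0b111 else 0 for i in range(8)]
xor_cfg = [bin(i).count("1") % 2 for i in range(8)]

lut = make_lut(and_cfg)
print(lut(1, 1, 1), lut(1, 0, 1))   # prints: 1 0

lut = make_lut(xor_cfg)             # "reconfiguration": new memory contents
print(lut(1, 0, 1), lut(1, 1, 1))   # prints: 0 1
```

Any function of three inputs is just another choice of the eight bits, which is why lookup tables are so flexible a building block.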

• Xilinx’s ISE [197], an example of a reconfigurable device vendor tool, which generates configurations for Xilinx’s families of reconfigurable hardware from inputs such as hardware description languages.
• Celoxica’s DK [31] design suite, which allows descriptions based on the C programming language to be translated to configurations for reconfigurable hardware in the form of hardware description languages.
• Synplicity Synplify Pro [168], which allows reconfigurable designs in hardware description languages (perhaps the output of Celoxica’s DK) to be optimised and converted into a netlist, which reconfigurable device vendor tools (such as Xilinx’s ISE) can then convert into a configuration.

Is this speed advantage of reconfigurable computing over traditional microprocessors a one-off, or a sustainable trend? Recent research suggests that it is a trend rather than a one-off, for a wide variety of applications: from image processing [65] to floating-point operations [179].

Sheer speed, while important, is not the only strength of reconfigurable computing. Another compelling advantage is reduced energy and power consumption. In a reconfigurable system, the circuitry is optimized for the application, so the power consumption will tend to be much lower than that of a general-purpose processor. A recent study [162] reports that moving critical software loops to reconfigurable hardware results in average energy savings of 35% to 70%, with an average speedup of 3 to 7 times, depending on the particular device used. Other advantages of reconfigurable computing include a reduction in size and component count (and hence cost), improved time-to-market, and improved flexibility and upgradability. These advantages are especially important for embedded applications. Indeed, there is evidence [180] that embedded systems developers show a growing interest in reconfigurable computing systems, especially with the introduction of soft cores which can contain one or more instruction processors [7], [55], [100], [149], [150], [196].

In this chapter, we present a survey of modern reconfigurable system architectures and design methods. Although we also provide background information on notable aspects of older technologies, our focus is on the most recent architectures and design methods, as well as the trends that will drive each of these areas in the near future. In other words, we intend to complement other surveys [21], [36], [103], [146], [173] by:

1) providing an up-to-date survey of material that appears after the publication of the papers mentioned above;
2) identifying explicitly the main trends in architectures and design methods for reconfigurable computing;
3) examining reconfigurable computing from a perspective different from existing surveys, for instance classifying design methods as special-purpose and general-purpose;
4) offering various direct comparisons of technology options according to a selected set of metrics from different perspectives.

The rest of the chapter is organised as follows. Section 2 contains background material that motivates the reconfigurable computing approach. Section 3 describes the structure of reconfigurable fabrics, showing how various researchers and vendors have developed fabrics that can efficiently accelerate time-critical portions of applications. Section 4 then covers recent advances in the development of design methods that map applications to these fabrics, and distinguishes between those which employ special-purpose and general-purpose optimization methods. Finally, Section 5 concludes and summarises the main trends in architectures, design methods and applications of reconfigurable computing.

II. BACKGROUND

Many of today’s compute-intensive applications require more processing power than ever before. Applications such as streaming video, image recognition and processing, and highly interactive services are placing new demands on the computation units that implement them. At the same time, the power consumption targets, the acceptable packaging and manufacturing costs, and the time-to-market requirements of these computation units are all decreasing rapidly, especially in the embedded hand-held devices market. Meeting these performance requirements under the power, cost, and time-to-market constraints is becoming increasingly challenging. In the following, we describe three ways of supporting such processing requirements: high-performance microprocessors, application-specific integrated circuits, and reconfigurable computing systems.

High-performance microprocessors provide an off-the-shelf means of addressing the processing requirements described earlier. Unfortunately, for many applications a single processor, even an expensive state-of-the-art processor, is not fast enough. In addition, the power consumption (100 watts or more) and cost (possibly thousands of dollars) of state-of-the-art processors place them out of reach for many embedded applications. Even if microprocessors continue to follow Moore’s Law, so that their density doubles every 18 months, they may still be unable to keep up with the requirements of some of the most aggressive embedded applications.

Application-specific integrated circuits (ASICs) provide another means of addressing these processing requirements. Unlike a software implementation, an ASIC implementation provides a natural mechanism for implementing the large amount of parallelism found in many of these applications. In addition, an ASIC circuit does not need to suffer from the serial (and often slow and power-hungry) instruction fetch, decode, and execute cycle that is at the heart of all microprocessors.
Furthermore, ASICs consume less power than reconfigurable devices. Finally, an ASIC can contain just the right mix of functional units for a particular application; in contrast, an off-the-shelf microprocessor contains a fixed set of functional units which must be selected to satisfy a wide variety of applications.

Despite the advantages of ASICs, they are often infeasible or uneconomical for many embedded systems. This is primarily due to two factors: the cost of producing an ASIC, often dominated by the cost of the masks (up to $1 million [144]), and the time to develop a custom integrated circuit, can both be unacceptable. Only in the very highest-volume applications would the improved performance and lower per-unit price warrant the high non-recurring engineering (NRE) cost of designing an ASIC.

A third means of providing this processing power is a reconfigurable computing system. A reconfigurable computing system typically contains one or more processors and a reconfigurable fabric upon which custom functional units can be built. The processor(s) executes sequential and non-critical code, while code that can be efficiently mapped to hardware can be “executed” by processing units that have been mapped to the reconfigurable fabric. Like a custom integrated circuit, the functions that have been mapped to the reconfigurable fabric can take advantage of the parallelism achievable in a hardware implementation. Also like an ASIC, the embedded system designer can produce the right mix of functional and storage units in the reconfigurable fabric, providing a computing structure that matches the application.

Unlike an ASIC, however, a new fabric need not be designed for each application. A given fabric can implement a wide variety of functional units. This means that a reconfigurable computing system can be built out of off-the-shelf components, significantly reducing the long design-time inherent in an ASIC implementation. Also unlike an ASIC, the functional units implemented in the reconfigurable fabric can change over time. This means that as the environment or usage of the embedded system changes, the mix of functional units can adapt to better match the new environment. The reconfigurable fabric in a handheld device, for instance, might implement large matrix multiply operations when the device is used in one mode, and large signal processing functions when the device is used in another mode.
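The volume argument above can be made concrete with a simple break-even model. All of the numbers below are hypothetical, chosen only to show the shape of the tradeoff between an ASIC's high NRE cost and a reconfigurable part's higher unit cost:

```python
# Illustrative break-even between an ASIC (high NRE, low unit cost) and a
# reconfigurable off-the-shelf part (no NRE, higher unit cost).
# All figures are hypothetical, chosen only to show the shape of the tradeoff.

def total_cost(nre, unit_cost, volume):
    return nre + unit_cost * volume

NRE_ASIC, UNIT_ASIC = 1_000_000, 5.0   # e.g. a mask set [144] plus cheap dies
NRE_FPGA, UNIT_FPGA = 0, 30.0          # off-the-shelf device, pricier per unit

# Volume beyond which the ASIC becomes cheaper overall:
breakeven = NRE_ASIC / (UNIT_FPGA - UNIT_ASIC)
print(int(breakeven))   # prints: 40000
```

Below the crossover volume the reconfigurable part wins on total cost despite its higher per-unit price, which is the economic argument the text makes for low- and medium-volume embedded systems.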

Typically, not all of the embedded system functionality needs to be implemented by the reconfigurable fabric. Only those parts of the computation that are time-critical and contain a high degree of parallelism need to be mapped to the reconfigurable fabric, while the remainder of the computation can be implemented by a standard instruction processor. The interface between the processor and the fabric, as well as the interface between the memory and the fabric, are therefore of the utmost importance. Modern reconfigurable devices are large enough to implement instruction processors within the programmable fabric itself: soft processors. These can be general purpose, or customised to a particular application; Application Specific Instruction Processors and Flexible Instruction Processors are two such approaches. Section IV C part 2 deals with soft processors in more detail.

Other devices show some of the flexibility of reconfigurable computers. Examples include Graphics Processor Units and Application Specific Array Processors. These devices perform well on their intended application, but cannot run more general computations, unlike reconfigurable computers and microprocessors.

Despite the compelling promise of reconfigurable computing, it has limitations of which designers should be aware. For instance, the flexible bit-level routing tends to produce a large silicon area and performance overhead compared with ASIC technology. Hence, for large-volume production of designs in applications without the need for field upgrades, ASIC technology or gate array technology can still deliver higher-performance designs at lower unit cost than reconfigurable computing technology. However, since FPGA technology tracks advances in memory technology and has demonstrated impressive advances in the last few years, many are confident that the current rapid progress in FPGA speed, capacity and capability will continue, together with the reduction in price.

It should be noted that the development of reconfigurable systems is still a maturing field. There are a number of challenges in developing a reconfigurable system. We describe two such challenges below.

First, the structure of the reconfigurable fabric and the interfaces between the fabric, processor(s), and memory must be very efficient. Some reconfigurable computing systems use a standard Field-Programmable Gate Array [3], [6], [91], [116], [135], [195] as a reconfigurable fabric, while others adopt custom-designed fabrics [35], [52], [53], [64], [72], [106], [110], [114], [142], [152], [155], [170].

Another challenge is the development of computer-aided design and compilation tools that map an application to a reconfigurable computing system. This involves determining which parts of the application should be mapped to the fabric and which to the processor; determining when and how often the reconfigurable fabric should be reconfigured, which changes the functional units implemented in the fabric; and specifying algorithms for efficient mappings to the reconfigurable system.

In this chapter, we provide a survey of reconfigurable computing, focusing our discussion on both of the issues described above. In the next section, we provide a survey of various architectures that have been found useful for reconfigurable computing; material on design methods will follow.

III. ARCHITECTURES

We shall first describe system-level architectures for reconfigurable computing. We then present various flavours of reconfigurable fabric. Finally we identify and summarise the main trends.

A. System-level architectures

A reconfigurable system typically consists of one or more processors, one or more reconfigurable fabrics, and one or more memories. Reconfigurable systems are often classified according to the degree of coupling between the reconfigurable fabric and the CPU. Compton and Hauck [36] present the four classifications shown in Figure 1(a-d). In Figure 1(a), the reconfigurable fabric is in the form of one or more stand-alone devices. The existing input and output mechanisms of the processor are used to communicate with the reconfigurable fabric. In this configuration, the data transfer between the fabric and the processor is relatively slow, so this architecture only makes sense for applications in which a significant amount of processing can be done by the fabric without processor intervention. Emulation systems often take on this sort of architecture [24], [113].

TABLE I

SUMMARY OF SYSTEM ARCHITECTURES.

(a) External stand-alone processing unit
    Example: RC2000 [30]; CPU-to-memory bandwidth: 528 MB/s; shared memory: 152 MB; fine grained; application: video processing.
(b)/(c) Attached processing unit / co-processor
    Example: Pilchard [99]; CPU-to-memory bandwidth: 1064 MB/s; shared memory: 20 Kbytes; fine grained; application: DES encryption.
    Example: Morphosys [155]; CPU-to-memory bandwidth: 800 MB/s; shared memory: 2048 bytes; coarse grained; application: video compression.
(d) Reconfigurable functional unit
    Example: Chess [106]; CPU-to-memory bandwidth: 6400 MB/s; shared memory: 12288 bytes; coarse grained; application: video processing.
(e) Processor embedded in a reconfigurable fabric
    Example: Xilinx Virtex II Pro [195]; CPU-to-memory bandwidth: 1600 MB/s; shared memory: 1172 KB; fine grained; application: video compression.

Figure 1(b) and Figure 1(c) show two intermediate structures. In both cases, the cost of communication is lower than that of the architecture in Figure 1(a). Architectures of these types are described in [8], [64], [72], [92], [142], [155], [181], [191]. Next, Figure 1(d) shows an architecture in which the processor and the fabric are very tightly coupled; in this case, the reconfigurable fabric is part of the processor itself, perhaps forming a reconfigurable sub-unit that allows for the creation of custom instructions. Examples of this sort of architecture have been described in [106], [114], [138], [170]. Figure 1(e) shows a fifth organization. In this case, the processor is embedded in the programmable fabric. The processor can either be a “hard” core [5], [194], or can be a “soft” core which is implemented using the resources of the programmable fabric itself [7], [55], [100], [149], [150], [196].

A summary of the above organizations can be found in Table I. Note that the bandwidth is the theoretical maximum available to the CPU: for example, in Chess [106], we assume that each block RAM is being accessed at its maximum rate. Organization (a) is by far the most common, and accounts for all commercial reconfigurable platforms.

Fig. 1. Five classes of reconfigurable systems (the first four are adapted from [36]): (a) external stand-alone processing unit; (b) attached processing unit; (c) co-processor; (d) reconfigurable functional unit; (e) processor embedded in a reconfigurable fabric.

Fig. 2. Fine-grained reconfigurable functional units: (a) three-input lookup table; (b) cluster of lookup tables.

B. Reconfigurable fabric

The heart of any reconfigurable system is the reconfigurable fabric. The reconfigurable fabric consists of a set of reconfigurable functional units, a reconfigurable interconnect, and a flexible interface to connect the fabric to the rest of the system. In this section, we review each of these components, and show how they have been used in both commercial and academic reconfigurable systems. A common theme runs through this entire section: in each component of the fabric, there is a tradeoff between flexibility and efficiency. A highly flexible fabric is typically much larger and much slower than a less flexible fabric. On the other hand, a more flexible fabric is better able to adapt to the application requirements. In the following discussions, we will see how this tradeoff has influenced the design of every part of every reconfigurable system. A summary of the main features of various architectures can be found in Table II.

1) Reconfigurable functional units: Reconfigurable functional units can be classified as either coarse-grained or fine-grained. A fine-grained functional unit can typically implement a single function on a single bit, or on a small number of bits. The most common kind of fine-grained functional unit is the small lookup table that is used to implement the bulk of the logic in a commercial field-programmable gate array. A coarse-grained functional unit, on the other hand, is typically much larger, and may consist of arithmetic and logic units (ALUs) and possibly even a significant amount of storage. In this section, we describe the two types of functional units in more detail.
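To make the fine-grained/coarse-grained contrast concrete, a coarse-grained functional unit can be caricatured as follows. This is a hypothetical Python model with an invented 2-bit opcode encoding, not any real device's configuration format: a 32-bit ALU whose entire configuration is a couple of bits selecting the operation, plus a small local register file.

```python
# Caricature of a coarse-grained reconfigurable cell: a 32-bit ALU whose
# function is chosen by a 2-bit configuration word, plus a small local
# register file. Far fewer configuration bits than a sea of LUTs, but
# only the listed operations are possible.

MASK32 = 0xFFFFFFFF

ALU_OPS = {                             # invented opcode encoding
    0: lambda a, b: (a + b) & MASK32,   # add
    1: lambda a, b: (a - b) & MASK32,   # subtract
    2: lambda a, b: (a * b) & MASK32,   # multiply
    3: lambda a, b: a & b,              # bitwise and
}

class CoarseCell:
    def __init__(self, opcode):
        self.op = ALU_OPS[opcode]       # "configuration": 2 bits pick the op
        self.regfile = [0] * 4          # small local register file

    def step(self, a, b):
        self.regfile[0] = self.op(a, b) # result latched locally
        return self.regfile[0]

cell = CoarseCell(opcode=2)             # configured as a multiplier
print(cell.step(3, 100000))             # prints: 300000
```

Configuring this cell takes two bits; building the same 32-bit multiplier out of lookup tables would take thousands of configuration bits, which is exactly the flexibility/efficiency tradeoff discussed in the text.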

TABLE II

COMPARISON OF RECONFIGURABLE FABRICS AND DEVICES.

• ProASIC+ [3]: fine grained; base logic component: 3-input block; routing: horizontal and vertical tracks; embedded memory: 256x9 bit blocks; special features: Flash-based configuration.
• Altera Excalibur [5]: fine grained; 4-input lookup tables; horizontal and vertical tracks; 2 Kbit memory blocks; embedded ARMv4T processor.
• Altera Stratix II [6]: fine/coarse grained; 8-input adaptive logic module; horizontal and vertical tracks; 512-bit, 4-Kbit and 512-Kbit blocks; DSP blocks.
• Garp [72]: fine grained; logic or arithmetic functions on four 2-bit input words; 2-bit buses in horizontal and vertical columns; memory external to fabric.
• Xilinx Virtex II Pro [194]: fine grained; 4-input lookup tables; horizontal and vertical tracks; 18-Kbit blocks; embedded multipliers and PowerPC 405 processor.
• Xilinx Virtex II [195]: fine grained; 4-input lookup tables; horizontal and vertical tracks; 18-Kbit blocks; embedded multipliers.
• DReAM [12]: coarse grained; 8-bit ALUs; 16-bit local and global buses; two 16x8 dual-port memories; targets mobile applications.
• Elixent D-fabrix [53]: coarse grained; 4-bit ALUs; 4-bit buses; 256x8 memory blocks.
• HP Chess [106]: coarse grained; 4-bit ALUs; 4-bit buses; 256x8 bit memories.
• IMEC ADRES [110]: coarse grained; 32-bit ALUs; 32-bit buses; small register files in each logic component.
• Matrix [114]: coarse grained; 8-bit ALUs; hierarchical 8-bit buses; 256x8 bit memories.
• MorphoSys [155]: coarse grained; ALU, multiplier and shift units; buses; memory external to fabric.
• Piperench [64]: coarse grained; 8-bit ALUs; 8-bit buses; memory external to fabric; functional units arranged in ‘stripes’.
• RaPiD [52]: coarse grained; ALUs; buses; embedded memory blocks.
• Silicon Hive Avispa [152]: coarse grained; ALUs, shifters, accumulators and multipliers; buses; five embedded memories.

Many reconfigurable systems use commercial FPGAs as a reconfigurable fabric. These commercial FPGAs contain many three- to six-input lookup tables, each of which can be thought of as a very fine-grained functional unit. Figure 2(a) illustrates a lookup table; by shifting in the correct pattern of bits, this functional unit can implement any single function of up to three inputs – the extension to lookup tables with larger numbers of inputs is clear. Typically, lookup tables are combined into clusters, as shown in Figure 2(b). Figure 3 shows clusters in two popular FPGA families. Figure 3(a) shows a cluster in the Altera Stratix device; Altera calls these clusters “Logic Array Blocks” [6]. Figure 3(b) shows a cluster in the Xilinx architecture [195]; Xilinx calls these clusters “Configurable Logic Blocks” (CLBs). In the Altera diagram, each block labeled “LE” is a lookup table, while in the Xilinx diagram, each “slice” contains two lookup tables. Other commercial FPGAs are described in [3], [91], [116], [135].

Fig. 3. Commercial logic block architectures: (a) Altera Logic Array Block [6]; (b) Xilinx Configurable Logic Block [195].

Reconfigurable fabrics containing lookup tables are very flexible, and can be used to implement any digital circuit. However, compared to the coarse-grained structures described in the next subsection, these fine-grained structures have significantly more area, delay, and power overhead. Recognizing that these fabrics are often used for arithmetic purposes, FPGA companies have added features such as carry-chains and cascade-chains to reduce the overhead when implementing common arithmetic and logic functions. Figure 4 shows how the carry and cascade chains, as well as the ability to break a 4-input lookup table into four two-input lookup tables, can be exploited to efficiently implement carry-select adders [6]. The multiplexers and the exclusive-or gate in Figure 4 are included as part of each logic array block, and need not be implemented using other lookup tables.

Fig. 4. Implementing a carry-select adder in an Altera Stratix device [6]. ‘LUT’ denotes ‘Lookup Table’.

The example in Figure 4 shows how the efficiency of commercial FPGAs can be improved by adding architectural support for common functions. We can go much further than this, though, and embed significantly larger, but far less flexible, reconfigurable functional units. There are two kinds of devices that contain coarse-grained functional units. The first kind is modern FPGAs, which are primarily composed of fine-grained functional units but are increasingly enhanced with larger embedded blocks. As an example, the Xilinx Virtex II device contains embedded 18-bit by 18-bit multiplier units [195]. When implementing algorithms requiring a large amount of multiplication, these embedded blocks can significantly improve the density, speed and power of the device. On the other hand, for algorithms which do not perform multiplication, these blocks are rarely useful. The Altera Stratix devices contain a larger, but more flexible, embedded block called a DSP block, shown in Figure 5 [6]. Each of these blocks can perform accumulate functions as well as multiply operations.
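The carry-select structure of Figure 4 can also be sketched behaviourally (a software model of the scheme only, not of the Stratix circuit itself): each block computes its sum twice, once for each possible incoming carry, and the true carry simply steers a multiplexer.

```python
# Behavioural model of a carry-select adder: each 4-bit block computes its
# result twice, assuming an incoming carry of 0 and of 1; the true carry
# then selects between the precomputed answers, acting as a multiplexer.

def block_add(x, y, cin, width=4):
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width   # (sum bits, carry-out)

def carry_select_add(x, y, width=16, block=4):
    result, carry = 0, 0
    for lo in range(0, width, block):
        xs = (x >> lo) & ((1 << block) - 1)
        ys = (y >> lo) & ((1 << block) - 1)
        s0, c0 = block_add(xs, ys, 0, block)   # speculative: carry-in = 0
        s1, c1 = block_add(xs, ys, 1, block)   # speculative: carry-in = 1
        s, carry = (s1, c1) if carry else (s0, c0)   # the carry "multiplexer"
        result |= s << lo
    return result                              # final carry-out discarded

print(carry_select_add(40000, 25000))   # prints: 65000
```

In hardware the two speculative sums are computed in parallel, so the carry signal only traverses one multiplexer per block rather than rippling through every bit of the adder.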
The comparison between the two devices clearly illustrates the flexibility and overhead tradeoff: the Altera DSP block may be more flexible than the Xilinx multiplier, but it consumes more chip area and runs somewhat slower.

Fig. 5. Altera DSP Block [6].

The commercial FPGAs described above contain both fine-grained and coarse-grained blocks. There are also devices which contain only coarse-grained blocks [35], [52], [64], [106], [110], [155]. An example of a coarse-grained architecture is the ADRES architecture, shown in Figure 6 [110]. Each reconfigurable functional unit in this device contains a 32-bit ALU, which can be configured to implement one of several functions including addition, multiplication, and logic functions, together with two small register files. Clearly, such a functional unit is far less flexible than the fine-grained functional units described earlier; however, if the application requires functions which match the capabilities of the ALU, these functions can be implemented very efficiently in this architecture.

Fig. 6. ADRES Reconfigurable Functional Unit [110].

2) Reconfigurable interconnects: Regardless of whether a device contains fine-grained functional units, coarse-grained functional units, or a mixture of the two, the functional units need to be connected in a flexible way. Again, there is a tradeoff between the flexibility of the interconnect (and hence of the reconfigurable fabric) and the speed, area, and power-efficiency of the architecture.

Fig. 7. Routing architectures: (a) fine-grained; (b) coarse-grained.

As before, reconfigurable interconnect architectures can be classified as fine-grained or coarse-grained. The distinction is based on the granularity with which wires are switched. This is illustrated in Figure 7, which shows a flexible interconnect between two buses. In the fine-grained architecture in Figure 7(a), each wire can be switched independently, while in Figure 7(b), the entire bus is switched as a unit. The fine-grained routing architecture in Figure 7(a) is more flexible, since not every bit needs to be routed in the same way; however, the coarse-grained architecture in Figure 7(b) contains far fewer programming bits, and hence suffers much less overhead.

Fine-grained routing architectures are usually found in commercial FPGAs. In these devices, the functional units are typically arranged in a grid pattern, and they are connected using horizontal and vertical channels. Significant research has been performed on the optimization of the topology of this interconnect [16], [97]. Coarse-grained routing architectures are commonly used in devices containing coarse-grained functional units. Figure 8 shows two examples of coarse-grained routing architectures: (a) the Totem reconfigurable system [35]; (b) the Silicon Hive reconfigurable system [152], which is less flexible but faster and smaller.

Fig. 8. Example coarse-grained routing architectures: (a) Totem [35]; (b) Silicon Hive [152].

3) Emerging directions: Several emerging directions will be covered in the following. These directions include low-power techniques, asynchronous architectures, and molecular microelectronics.
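The programming-bit overhead behind the fine-grained/coarse-grained routing tradeoff can be put in rough numbers (illustrative switch counts only, not figures for any real device): switching every wire of a bus independently costs one bit per wire per crossing, while switching the bus as a unit costs a single bit per crossing.

```python
# Programming bits needed at bus crossing points in a routing fabric:
# fine-grained routing switches each wire independently (bus_width bits
# per crossing), coarse-grained routing switches the whole bus with one.

def config_bits(bus_width, crossings, fine_grained):
    bits_per_crossing = bus_width if fine_grained else 1
    return crossings * bits_per_crossing

w, crossings = 32, 1000   # illustrative fabric: 1000 bus crossing points
fine   = config_bits(w, crossings, fine_grained=True)
coarse = config_bits(w, crossings, fine_grained=False)
print(fine, coarse, fine // coarse)   # prints: 32000 1000 32
```

Real fabrics mix both styles; the point is only that bus-level switching scales its configuration cost with the number of crossings, not with crossings times bus width.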

• Low-power techniques. Early work explores the use of low-swing circuit techniques to reduce the power consumption in a hierarchical interconnect for a low-energy FPGA [60]. Recent work involves: (a) activity reduction in power-aware design tools, with energy savings of 23% [90]; (b) leakage current reduction methods such as gate biasing and multiple supply-voltage integration, with up to two times leakage power reduction [136]; and (c) dual supply-voltage methods with the lower voltage assigned to non-critical paths, resulting in an average power reduction of 60% [59].
• Asynchronous architectures. There is an emerging interest in asynchronous FPGA architectures. An asynchronous version of Piperench [64] is estimated to improve performance by 80%, at the expense of a significant increase in configurable storage and wire count [84]. Other efforts in this direction include fine-grained asynchronous pipelines [171], quasi delay-insensitive architectures [192], and globally-asynchronous locally-synchronous techniques [141].
• Molecular microelectronics. In the long term, molecular techniques offer a promising opportunity for increasing the capacity and performance of reconfigurable computing architectures [23]. Current work is focused on developing Programmable Logic Arrays based on molecular-scale nano-wires [48], [189].

C. Architectures: main trends

The following summarises the main trends in architectures for reconfigurable computing.

1) Coarse-grained fabrics: As reconfigurable fabrics are migrated to more advanced technologies, the cost (in terms of both speed and power) of the interconnect part of a reconfigurable fabric is growing. Designers are responding to this by increasing the granularity of their logic units, thereby reducing the amount of interconnect needed. In the Stratix II device, Altera moved away from simple 4-input lookup tables, and used a more complex logic block which can implement functions of up to 7 inputs. We should expect to see a slow migration to more complex logic blocks, even in stand-alone FPGAs.

2) Heterogeneous functions: As devices are migrated to more advanced technologies, the number of transistors that can be devoted to the reconfigurable logic fabric increases. This provides new opportunities to embed complex non-programmable (or semi-programmable) functions, creating heterogeneous architectures with both general-purpose logic resources and fixed-function embedded blocks. Modern Xilinx parts have embedded 18 by 18 bit multipliers, while modern Altera parts have embedded DSP units which can perform a variety of multiply/accumulate functions. Again, we should expect to see a migration to more heterogeneous architectures in the near future. 3) Soft cores: The use of “soft” cores, particularly for instruction processors, is increasing. A “soft” core is one in which the vendor provides a synthesisable version of the function, and the user implements the function using the reconfigurable fabric. Although this is less area- and speed-efficient than a hard embedded core, the flexibility and the ease of integrating these soft cores makes them attractive. The extra overhead becomes less of a hindrance as the number of transistors devoted to the reconfigurable fabric increases. Altera and Xilinx both provide numerous soft cores, including soft instruction processors such as Nios [7] and MicroBlaze [196]. Soft instruction processors have also been developed by a number of researchers, ranging from customisable JVM and MIPS processors [149] to ones specialised for machine learning [55] and data encryption [100].

IV. DESIGN METHODS

Hardware compilers for high-level descriptions are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This section looks at high-level design methods from two perspectives: special-purpose design and general-purpose design. Low-level design methods and tools, covering topics such as technology mapping, floorplanning, and place and route, are beyond the scope of this chapter – interested readers are referred to [36].

A. General-purpose design

This section describes design methods and tools based on a general-purpose programming language such as C, possibly adapted to facilitate hardware development. Of course, traditional hardware description languages like VHDL and Verilog are widely available, especially on commercial reconfigurable platforms. A number of compilers from C to hardware have been developed. Some of the significant ones are reviewed here. These range from compilers which only target hardware, to those which target complete hardware/software systems; some also partition designs into hardware and software components. We can classify different design methods into two approaches: the annotation and constraint-driven approach, and the source-directed compilation approach. The first approach preserves the source programs in C or C++ as much as possible and makes use of annotations and constraint files to drive the compilation process. The second approach modifies the source language to let the designer specify, for instance, the amount of parallelism or the size of variables. 1) Annotation and constraint-driven approach: The systems mentioned below employ annotations in the source code and constraint files to control the optimisation process. Their strength is that usually only minor changes are needed to produce a compilable program from a software description – no extensive re-structuring is required. Five representative methods are SPC [183], Streams-C [62], Sea Cucumber [81], SPARK [67] and Catapult-C [108]. SPC [183] combines vectorisation, loop transformations and retiming with automatic memory allocation to improve performance. SPC accelerates C loop nests with data dependency restrictions, compiling them into pipelines. Based on the SUIF framework [190], this approach uses loop transformations, and can take advantage of run-time reconfiguration and memory access optimisation. Similar methods have been advocated by other researchers [73], [147].
Streams-C [62] compiles a C program to synthesisable VHDL. Streams-C exploits coarse-grained parallelism in stream-based computations; low-level optimisations such as pipelining are performed automatically by the compiler. Sea Cucumber [81] compiles Java programs to hardware using a similar scheme to Handel-C, which we detail in the next section. Unlike Handel-C, no language extensions are needed; like Streams-C, users must call a library, in this case based on Communicating Sequential Processes (CSP [76]). Multiple circuit implementations of the library primitives enable tradeoffs. SPARK [67] is a high-level synthesis framework targeting multimedia and image processing. It compiles C code with the following steps: (a) list scheduling based on speculative code motions and loop transformations, (b) a resource binding pass with minimisation of interconnect, (c) finite state machine controller generation for the scheduled design, (d) code generation producing synthesisable register-transfer level VHDL. Logic synthesis tools then synthesise the output. Catapult C synthesises Register Transfer Level (RTL) descriptions from unannotated C++, using characterisations of the target technology from RTL synthesis tools [108]. Users can set constraints to explore the design space, controlling loop pipelining and resource sharing. 2) Source-directed compilation approach: A different approach adapts the source language to enable explicit description of parallelism, communication and other customisable hardware resources such as variable size. Examples of design methods following this approach include ASC [112], Handel-C [29], Haydn-C [47] and Bach-C [198]. ASC [112] adopts C++ custom types and operators to provide a C++ programming interface on the algorithm, architecture, arithmetic and gate levels. This enables the user to program on the desired level for each part of the application.
Semi-automated design space exploration further increases design productivity, while supporting the optimisation process on all available levels of abstraction. The object-oriented design enables efficient code-reuse, and includes an integrated arithmetic unit generation library [111]. A floating-point library [101] provides over 200 different floating point units, each with custom bitwidths for mantissa and exponent. Handel-C [29] extends a subset of C to support flexible width variables, signals, parallel blocks, bit-manipulation operations, and channel communication. A distinctive feature is that timing of the compiled circuit is fixed at one cycle per C statement. This allows Handel-C programmers to schedule hardware resources manually. Handel-C compiles to a one-hot state machine using a token-passing scheme developed by Page and Luk [128]; each assignment of the program maps to exactly one control flip-flop in the state machine. These control flip-flops capture the flow of control (represented by the token) in the program: if the control flip-flop corresponding to a particular statement is active, then control has passed to that statement, and the circuitry compiled for that statement is activated. When the statement has finished execution, it passes the token to the next statement in the program. Haydn-C [47] extends Handel-C for component-based design. Like Handel-C, it supports description of parallel blocks, bit-manipulation operations, and channel communication. The principal innovation of Haydn-C is a framework of optional annotations to enable users to describe design constraints, and to direct source-level transformations such as scheduling and resource allocation. There are automated transformations so that a single high-level design can be used to produce many implementations with different trade-offs. This approach has been evaluated using various case studies, including FIR filters, fractal generators, and morphological operators. 
The fastest morphological erosion design is 129 times faster and 3.4 times larger than the smallest design. Bach-C [198] is similar to Handel-C but has an untimed semantics, only synchronising between parallel threads on synchronous communications between them, possibly giving greater scope for optimisation. It also allows asynchronous communications but otherwise resembles Handel-C, using the same basic one-hot compilation scheme.
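The one-hot, token-passing control scheme shared by Handel-C and Bach-C can be mimicked in a few lines of Python. This is a toy model, not actual compiler output, and the names are invented: each statement owns one control flip-flop, exactly one flip-flop is active per cycle, and the token advances one statement per clock, matching the one-cycle-per-statement timing model.

```python
def run_one_hot(statements):
    """Toy model of one-hot token passing: 'statements' is a list of
    callables taking an environment dict; one executes per clock cycle."""
    flops = [False] * len(statements)
    flops[0] = True                      # token starts at the first statement
    env = {}
    cycles = 0
    while any(flops):
        idx = flops.index(True)          # the statement holding the token
        statements[idx](env)             # its circuitry is active this cycle
        flops[idx] = False               # pass the token on the clock edge
        if idx + 1 < len(statements):
            flops[idx + 1] = True
        cycles += 1
    env["cycles"] = cycles
    return env

# a three-statement straight-line program: runs in exactly three cycles
prog = [lambda e: e.update(x=1),
        lambda e: e.update(y=e["x"] + 2),
        lambda e: e.update(z=e["y"] * 3)]
```

Branches and loops in the real scheme simply route the token to a different control flip-flop, but the invariant is the same: the set of active flip-flops marks where control currently resides.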

TABLE III

SUMMARY OF GENERAL-PURPOSE HARDWARE COMPILERS.

System | Approach | Source Language | Target Language | Target Architecture | Example Applications
Streams-C [62] | Annotation/Constraint-driven | C + library | RTL VHDL | Xilinx FPGA | Image contrast enhancement, pulsar detection [58]
Sea Cucumber [81] | Annotation/Constraint-driven | Java + library | EDIF | Xilinx FPGA | none given
SPARK [67] | Annotation/Constraint-driven | C | RTL VHDL | LSI, Altera FPGAs | MPEG-1 predictor, image tiling
SPC [183] | Annotation/Constraint-driven | C | EDIF | Xilinx FPGAs | String pattern matching, image skeletonisation
ASC [112] | Source-directed Compilation | C++ using class library | EDIF | Xilinx FPGAs | Wavelet compression, encryption
Handel-C [29] | Source-directed Compilation | Extended C | Structural VHDL, Verilog, EDIF | Actel, Altera, Xilinx FPGAs | Image processing, polygon rendering [163]
Haydn-C [47] | Source-directed Compilation | Extended C (Handel-C) | Extended C (Handel-C) | Xilinx FPGAs | FIR filter, image erosion
Bach-C [198] | Source-directed Compilation | Extended C | Behavioural and RTL VHDL | LSI FPGAs | Viterbi decoders, image processing

Table III summarises the compilers discussed in this section, showing their approach, source and target languages, target architecture and some example applications. Note that the compilers discussed are not necessarily restricted to the architectures reported; many can be ported to a different architecture by using a different library of hardware primitives.

B. Special-purpose design

Within the wide variety of problems to which reconfigurable computing can be applied, there are many specific problem domains which deserve special consideration. The motivation is to exploit domain-specific properties: (a) to describe the computation, such as using MATLAB for digital signal processing, and (b) to optimise the implementation, such as using word-length optimisation techniques described later. We shall begin with an overview of digital signal processing and relevant tools which target reconfigurable implementations. We then describe the word-length optimisation problem, the solution to which promises rich rewards; an example of such a solution will be covered. Finally we summarise other domain-specific design methods which have been proposed for video and image processing and networking. 1) Digital signal processing: One of the most successful applications for reconfigurable computing is Real-time Digital Signal Processing (DSP). This is illustrated by the inclusion of hardware support for DSP in the latest FPGA devices, such as the embedded DSP blocks in Altera Stratix II chips [6]. DSP problems tend to share the following properties: design latency is usually less of an issue than design throughput, algorithms tend to be numerically intensive but have very simple control structures, controlled numerical error is acceptable, and standard metrics, such as signal-to-noise ratio, exist for measuring numerical precision quality.

DSP algorithm design is often initially performed directly in a graphical programming environment such as Mathworks’ MATLAB Simulink [154]. Simulink is widely used within the DSP community, and has recently been incorporated into the Xilinx System Generator [80] and Altera DSP Builder [4] design flows. Design approaches such as this are based on the idea of data-flow graphs (DFGs) [94]. Tools working with this form of description vary in the level of user intervention required to specify the numerical properties of the implementation. For example, in the Xilinx System Generator flow [80], it is necessary to specify the number of bits used to represent each signal, the scaling of each signal (namely the binary point location), and whether to use saturating or wrap-around arithmetic [39]. Ideally, these implementation details could be automated. Beyond a standard DFG-based algorithm description, only one piece of information should be required: a lower bound on the output signal-to-quantization-noise ratio acceptable to the user. Such a design tool would thus represent a truly ‘behavioural’ synthesis route, exposing to the DSP engineer only those aspects of design naturally expressed in the DSP application domain. 2) The word-length optimization problem: Unlike microprocessor-based implementations, where the word-length is defined a priori by the hard-wired architecture of the processor, reconfigurable computing based on FPGAs allows the size of each variable to be customised to produce the best trade-offs in numerical accuracy, design size, speed, and power consumption. The use of such custom data representation for optimising designs is one of the main strengths of reconfigurable computing. Given this flexibility, it is desirable to automate the process of finding a good custom data representation. The most important implementation decision to automate is the selection of an appropriate word-length and scaling for each signal [41] in a DSP system.
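The implementation details listed above — word-length, binary point location, and saturation versus wrap-around — can be illustrated with a small fixed-point quantiser. This is an illustrative sketch, not the System Generator API; the function name and parameters are invented.

```python
def quantize(x, word_len, frac_bits, saturate=True):
    """Round x onto a signed fixed-point grid with 'word_len' total bits,
    'frac_bits' of which sit to the right of the binary point."""
    scaled = int(round(x * (1 << frac_bits)))
    lo = -(1 << (word_len - 1))
    hi = (1 << (word_len - 1)) - 1
    if saturate:                          # clamp at the representable range
        scaled = max(lo, min(hi, scaled))
    else:                                 # wrap-around: two's-complement overflow
        scaled &= (1 << word_len) - 1
        if scaled > hi:
            scaled -= 1 << word_len
    return scaled / (1 << frac_bits)
```

The same out-of-range value behaves very differently under the two overflow policies (saturation pins it to the largest representable value, wrap-around flips its sign), which is why this choice must be specified per signal.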
It has been argued that, often, the most efficient hardware implementation of an algorithm is one in which a wide variety of finite precision representations of different sizes are used for different internal variables [37]. The accuracy observable at the outputs of a DSP system is a function of the word-lengths used to represent all intermediate variables in the algorithm. However, accuracy is less sensitive to some variables than to others, as is implementation area. It is demonstrated in [41] that by considering error and area information in a structured way using analytical and semi-analytical noise models, it is possible to achieve highly efficient DSP implementations. In [44] it has been demonstrated that the problem of word-length optimization is NP-hard, even for systems with special mathematical properties that simplify the problem from a practical perspective [40]. There are, however, several published approaches to word-length optimization. These can be classified as heuristics offering an area / signal quality tradeoff [37], [89], [182], approaches that make some simplifying assumptions on error properties [26], [89], or optimal approaches that can be applied to algorithms with particular mathematical properties [38]. Some published approaches to the word-length optimization problem use an analytic approach to scaling and/or error estimation [123], [160], [182], some use simulation [26], [89], and some use a hybrid of the two [34].
The advantage of analytic techniques is that they do not require representative simulation stimuli and can be faster; however, they tend to be more pessimistic. There is little analytical work on supporting data-flow graphs containing cycles, although in [160] finite loop bounds are supported, while [40] supports cyclic data-flow when the nodes are of a restricted set of types, extended to a semi-analytic technique with fewer restrictions in [43].
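Analytic scaling estimation of the kind cited above typically amounts to interval arithmetic over the data-flow graph: value ranges are propagated forward through each operator, and each signal keeps only the integer (MSB-side) bits its range actually needs. The sketch below uses hypothetical helper names and is not any of the published algorithms.

```python
def add_range(a, b):
    """Forward-propagate (lo, hi) ranges through an adder."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Forward-propagate (lo, hi) ranges through a multiplier."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def int_bits(r):
    """Two's-complement integer bits needed to cover range r."""
    lo, hi = r
    mag = max(hi.bit_length(), (-lo - 1).bit_length() if lo < 0 else 0)
    return mag + 1                        # +1 for the sign bit
```

Every MSB the range analysis proves redundant removes a column of logic from each downstream operator, which is where the large area savings reported by these tools come from.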

Some published approaches use worst-case instantaneous error as a measure of signal quality [26], [123], [182], whereas some use signal-to-noise ratio [37], [89]. The remainder of this section reviews in some detail particular research approaches in the field. The Bitwise Project [160] proposes propagation of integer variable ranges backwards and forwards through data-flow graphs. The focus is on removing unwanted most-significant bits (MSBs). Results from integration in a synthesis flow indicate that area savings of between 15% and 86%, combined with speed increases of up to 65%, can be achieved compared to using 32-bit integers for all variables. The MATCH Project [123] also uses range propagation through data-flow graphs, except that variables with a fractional component are allowed. All signals in the model of [123] must have equal fractional precision; the authors propose an analytic worst-case error model in order to estimate the required number of fractional bits. Area reductions of 80%, combined with speed increases of 20%, are reported when compared to a uniform 32-bit representation. Wadekar and Parker [182] have also proposed a methodology for word-length optimization. Like [123], this technique also allows controlled worst-case error at system outputs; however, each intermediate variable is allowed to take a word-length appropriate to the sensitivity of the output errors to quantization errors on that particular variable. Results indicate area reductions of between 15% and 40% over the optimum uniform word-length implementation. Kum and Sung [89] and Cantin et al. [26] have proposed several word-length optimization techniques to trade off system area against system error. These techniques are heuristics based on bit-true simulation of the design under various internal word-lengths. In Bitsize [1], [2], Abdul Gaffar et al. propose a hybrid method based on the mathematical technique known as automatic differentiation to perform bitwidth optimisation.
In this technique, the gradients of outputs with respect to the internal variables are calculated and then used to determine the sensitivities of the outputs to the precision of the internal variables. The results show that it is possible to achieve an area reduction of 20% for floating-point designs, and 30% for fixed-point designs, when given an output error specification of 0.75% against a reference design. A useful survey of algorithmic procedures for word-length determination has been provided by Cantin et al. [27]. In this work, existing heuristics are classified under various categories. However, the ‘exhaustive’ and ‘branch-and-bound’ procedures described in [27] do not necessarily capture the optimum solution to the word-length determination problem, due to non-convexity in the constraint space: it is actually possible to have a lower error at a system output by reducing the word-length at an internal node [42]. Such an effect is modeled in the MILP approach proposed in [38]. A comparative summary of existing optimization systems is provided in Table IV. Each system is classified according to the several defining features described below.

• Is the word-length and scaling selection performed through analytic or simulation-based means?
• Can the system support algorithms exhibiting cyclic data flow (such as infinite impulse response filters)?
• What mechanisms are supported for Most Significant Bit (MSB) optimizations? (Such as ignoring MSBs that are known to contain no useful information, a technique determined by the scaling approach used.)
• What mechanisms are supported for Least Significant Bit (LSB) optimizations? These involve the monitoring of word-length growth. In addition, for those systems which support error tradeoffs, further optimizations include the quantization (truncation or rounding) of unwanted LSBs.
• Does the system allow the user to trade off numerical accuracy for a more efficient implementation?
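A minimal version of the simulation-based, error-tradeoff heuristics surveyed above can be written as a greedy descent: start every variable wide, then repeatedly trim the word-length whose reduction costs the least error, until no trim fits the error budget. This is an illustrative sketch, not any published system; `simulate_error` stands in for a bit-true simulation of the design, and `toy_error` is an invented stand-in error model.

```python
def greedy_wordlengths(variables, simulate_error, max_error, start=32, floor=2):
    """Greedy word-length trimming under a simulated error budget."""
    wl = {v: start for v in variables}
    improved = True
    while improved:
        improved = False
        best = None
        for v in variables:               # try shaving one bit off each variable
            if wl[v] <= floor:
                continue
            trial = dict(wl, **{v: wl[v] - 1})
            err = simulate_error(trial)   # bit-true simulation of the design
            if err <= max_error and (best is None or err < best[1]):
                best = (v, err)
        if best:
            wl[best[0]] -= 1
            improved = True
    return wl

def toy_error(wl, weights={"a": 1.0, "b": 4.0}):
    """Toy stand-in for bit-true simulation: each variable contributes
    quantisation noise that shrinks quadratically with its word-length."""
    return sum(w * 2.0 ** (-2 * wl[v]) for v, w in weights.items())
```

The result is a local optimum: the budget is met, but shaving one more bit off any variable would exceed it, which is precisely the non-convexity caveat raised in the text.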


Fig. 9. Design flow for the Right-Size tool [43]. The shaded portions are FPGA vendor-specific.

3) An example optimization flow: One possible design flow for word-length optimization, used in the Right-Size system [43], is illustrated in Figure 9 for Xilinx FPGAs. The inputs to this system are a specification of the system behaviour (e.g. using Simulink), a specification of the acceptable signal-to-noise ratio at each output, and a set of representative input signals. From these inputs, the tool automatically generates a synthesisable structural description of the architecture and a bit-true behavioural VHDL testbench, together with a set of expected outputs for the provided set of representative inputs. Also generated is a makefile which can be used to automate the post-Right-Size synthesis process. Application of Right-Size to various adaptive filters implemented in a Xilinx Virtex FPGA has resulted in area reductions of up to 80%, power reductions of up to 98%, and speed-ups of up to 36% over common alternative design methods without word-length optimisation. 4) Other design methods: Besides signal processing, video and image processing is another area that can benefit from special-purpose design methods. Three examples will be given to provide a flavour of this approach. First, the CHAMPION system [124] maps designs captured in the Cantata graphical programming environment to multiple reconfigurable computing platforms. Second, the IGOL framework [174] provides a layered architecture for facilitating hardware plug-ins to be incorporated in various applications in the Microsoft Windows operating system, such as Premiere, Winamp, VirtualDub and DirectShow. Third, the SA-C compiler [19] maps a high-level single-assignment language specialised for image processing into hardware, using various optimisation methods including loop unrolling, array value propagation, loop-carried array elimination, and multi-dimensional stripmining. Recent work indicates that another application area that can benefit from special-purpose techniques is networking.
Two examples will be given. First, a framework has been developed to compile designs described in the network policy language Ponder [45] into reconfigurable hardware implementations [95]. Second, it has been shown [87] how descriptions in the Click networking language can produce efficient reconfigurable designs.

C. Other design methods

In the following, we describe various design methods in brief. 1) Run-time customisation: Many aspects of run-time reconfiguration have been explored [36], including the use of directives in high-level descriptions [96]. Effective run-time customisation hinges on appropriate design-time preparation for such customisation. To illustrate this point, consider a run-time customisable system that supports partial reconfiguration: one part of the system continues to be operational, while another part is being reconfigured. As FPGAs get larger, partial reconfiguration is becoming increasingly important as a means of reducing reconfiguration time. To support partial reconfiguration, appropriate circuits must be built at fabrication time as part of the FPGA fabric. Then at compile time, an initial configuration bitstream and incremental bitstreams have to be produced, together with run-time customisation facilities which can be executed, for instance, on a microprocessor serving as part of the run-time system [151]. Run-time customisation facilities can include support for condition monitoring, design optimisation and reconfiguration control. Opportunities for run-time design optimisation include: (a) run-time constant propagation [49], which produces a smaller circuit with higher performance by treating run-time data as constants, and optimising the circuit principally by boolean algebra; (b) library-based compilation – the DISC compiler [33] makes use of a library of precompiled logic modules which can be loaded into reconfigurable resources by the procedure call mechanism; (c) exploiting information about program branch probabilities [165]; the idea is to promote utilisation by dedicating more resources to branches which execute more frequently.
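The effect of run-time constant propagation can be illustrated in software: once one operand is known, a general multiplier collapses into a fixed sum of shifts. This is a software analogy in Python, not the actual tool of [49], and the function name is invented.

```python
def make_const_multiplier(c):
    """Specialise multiplication by the run-time constant c: x*c becomes
    a fixed sum of shifts, one per set bit of c (a shift-and-add network),
    mimicking how constant folding shrinks a general multiplier circuit."""
    shifts = [i for i in range(c.bit_length()) if (c >> i) & 1]
    def multiply(x):
        return sum(x << i for i in shifts)
    return multiply

mul_by_10 = make_const_multiplier(10)    # 10 = 0b1010 -> two shift-adds
```

In hardware the pay-off is analogous but larger: the rows of the multiplier corresponding to zero bits of the constant vanish entirely, giving a smaller and faster circuit, at the cost of reconfiguring whenever the "constant" changes.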
A hardware compiler has been developed to produce a collection of designs, each optimised for a particular branch probability; the best can be selected at run time by incorporating observed branch probability information from a queueing network performance model. 2) Soft instruction processors: FPGA technology can now support one or more soft instruction processors implemented using reconfigurable resources on a single chip; proprietary instruction processors, like MicroBlaze and Nios, are now available from FPGA vendors. Often such processors support customisation of resources and custom instructions. Custom instructions have two main benefits. First, they reduce the time for instruction fetch and decode, provided that each custom instruction replaces several regular instructions. Second, additional resources can be assigned to a custom instruction to improve performance. Bit-width optimisation, described in Section IV-B, can also be applied to customise instruction processors at compile time. A challenge of customising instruction processors is that the tools for producing and analysing instructions also need to be customised. For instance, the flexible instruction processor framework [149] has been developed to automate the steps in customising an instruction processor and the corresponding tools. Other researchers have proposed similar approaches [85]. Instruction processors can also run declarative languages. For instance, a scalable architecture [55], consisting of multiple processors based on the Warren Abstract Machine, has been developed to support the execution of the Progol system [118], based on the declarative language Prolog. Its effectiveness has been demonstrated using the mutagenesis data set containing 12000 facts about chemical compounds. 3) Multi-FPGA compilation: Peterson et al. have developed a C compiler which compiles to multi-FPGA systems [132]. The available FPGAs and other units are specified in a library file, allowing portability.
The compiler can generate designs using speculative and lazy execution to improve performance; ultimately the authors aim to partition a single program between host and reconfigurable resource (hardware/software codesign). Duncan et al. have developed a system with similar capabilities [51]. This is also retargetable, using hierarchical architecture descriptions. It synthesises a VLIW architecture that can be partitioned across multiple FPGAs. Both methods can split designs across several FPGAs, and are retargetable via hardware description libraries. Other C-like languages that have been developed include MoPL-3, a C extension supporting data-procedural compilation for the Xputer architecture, which comprises an array of reconfigurable ALUs [9], and spC, a systolic parallel C variant for the Enable++ board [78]. 4) Hardware/software codesign: Several research groups have studied the problem of compiling C code to both hardware and software. The Garp compiler [25] is intended to accelerate plain C, with no annotations to help the compiler, making it more widely applicable. The work targets one architecture only: the Garp chip, which integrates a RISC core and reconfigurable logic. This compiler also uses the SUIF framework. The compiler uses a technique first developed for VLIW architectures called hyperblock scheduling, which optimises for instruction-level parallelism across several common paths, at the expense of rarer paths. Infeasible or rare paths are implemented on the processor, with the more common, easily parallelisable paths synthesised into logic for the reconfigurable resource. Similarly, the NAPA C compiler targets the NAPA architecture [61], which also integrates a RISC processor with reconfigurable logic. This compiler can also work on plain C code, but the programmer can add C pragmas to indicate large-scale parallelism and the bit-widths of variables. The compiler can synthesise pipelines from loops.
5) Annotation-free compilation: Some researchers aim to compile a sequential program, without any annotations, into an efficient hardware design. This requires analysis of the source program to extract parallelism, which is necessary if compilation from languages such as C is to compete with traditional methods for designing hardware. One example is the work of Babb et al. [11], which targets custom fixed-logic implementations but is also applicable to reconfigurable hardware. The compiler uses the SUIF infrastructure to perform several analyses to determine, as far as possible, exactly which computations affect which data. A tiled architecture is synthesised, where all computation is kept as local as possible to one tile. More recently, Ziegler et al. [199] have used loop transformations in mapping loop nests onto a pipeline spanning several FPGAs. A further effort is the Garp project [25].

D. Emerging directions

1) Verification: As designs become more complex, techniques for verifying their correctness are becoming increasingly important. Four approaches are described: (1) the InterSim framework [139] provides a means of combining software simulation and hardware prototyping; (2) the Lava system [17] can convert designs into a form suitable for input to a model checker, and a number of FPGA design libraries have been verified in this way [156]; (3) the Ruby language [66] supports correctness-preserving transformations, and a wide variety of hardware designs have been produced with it; (4) the Pebble [104] hardware design language has been formally specified [109], so that provably-correct design tools can be developed. 2) Customisable hardware compilation: Recent work [177] explains how customisable frameworks for hardware compilation can enable rapid design exploration, and reusable and extensible hardware optimisation. The framework compiles a parallel imperative language like Handel-C, and supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and metalanguage facilities. The approach has been used to produce designs for applications such as signal and image processing, with different trade-offs in performance and resource usage.

E. Design methods: main trends

We summarise the main trends in design methods for reconfigurable computing below. 1) Special-purpose design: As explained earlier, special-purpose design methods and tools enable both high-level design and domain-specific optimisation. Existing methods, such as those compiling MATLAB Simulink descriptions into reconfigurable computing implementations [1], [4], [43], [80], [123], [127], allow application developers without electronic design experience to produce efficient hardware implementations quickly and effectively. This area will assume further importance in the future.

2) Low-power design: Several hardware compilers aim to minimise the power consumption of their generated designs. Examples include special-purpose design methods such as Right-Size [43] and PyGen [127], and general-purpose methods that target loops for configurable hardware implementation [162]. These design methods, when combined with low-power architectures [59] and power-aware low-level tools [90], can provide significant reduction in power consumption. 3) High-level transformations: Many hardware design methods [19], [67], [183] involve high-level transformations: loop unrolling, loop restructuring and static single assignment are three examples. The development of powerful transformations for design optimisation will continue for both special-purpose and general-purpose designs.

V. SUMMARY

This chapter surveys two aspects of reconfigurable computing: architectures and design methods. The main trends in architectures are coarse-grained fabrics, heterogeneous functions, and soft cores. The main trends in design methods are special-purpose design methods, low-power techniques, and high-level transformations. We wonder what a survey of reconfigurable computing, written in 2015, will cover.

ACKNOWLEDGMENTS Our thanks to Ray Cheung and Sherif Yusuf for their support in preparing this chapter. The support of Celoxica, Xilinx and UK EPSRC (Grant numbers GR/R 31409, GR/R 55931 and GR/N 66599) is gratefully acknowledged.

REFERENCES

[1] A. Abdul Gaffar, O. Mencer, W. Luk, P.Y.K. Cheung and N. Shirazi, “Floating-point bitwidth analysis via automatic differentiation”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002. [2] A. Abdul Gaffar, O. Mencer, W. Luk, and P.Y.K. Cheung, “Unifying bit-width optimisation for fixed-point and floating-point designs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004. [3] Actel Corp., ProASIC Plus Family Flash FPGAs, v3.5, April 2004. [4] Altera Corp., DSP Builder User Guide, Version 2.1.3 rev.1, July 2003. [5] Altera Corp., Excalibur Device Overview, May 2002. [6] Altera Corp., Stratix II Device Handbook, February 2004. [7] Altera Corp., Nios II Processor Reference Handbook, May 2004. [8] Annapolis Microsystems, Inc., Wildfire Reference Manual, 1998. [9] A. Ast, J. Becker, R. Hartenstein, R. Kress, H. Reinig and K. Schmidt, “Data-procedural languages for FPL-based machines”, Field-Programmable Logic and Applications, LNCS 849, Springer, 1994. [10] P.M. Athanas and A.L. Abbott, “Real-time image processing on a custom computing platform”, IEEE Computer, Vol. 28, pp. 16–24, 1995. [11] J. Babb, M. Rinard, C. Andras Moritz, W. Lee, M. Frank, R. Barua and S. Amarasinghe, “Parallelizing applications into silicon”, Proc. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1999. [12] J. Becker and M. Glesner, “A parallel dynamically reconfigurable architecture designed for flexible application-tailored hardware/software systems in future mobile communication”, The Journal of Supercomputing, Vol. 19, No. 1, 2001, pp. 105–127. [13] A. Benedetti and B. Perona, “Bit-width optimization for configurable DSP’s by multi-interval analysis”, Proc. 34th Asilomar Conference on Signals, Systems and Computers, 2000. [14] N.W. Bergmann and Y.Y. Chung, “Video compression with custom computers”, IEEE Transactions on Consumer Electronics, Vol. 43, Aug. 1997, pp. 925–933. [15] G. Berry and M.
Kishinevsky, “Hardware Esterel Language Extension Proposal”, http://www.esterel-technologies.com/v2/solutions/iso_album/hardest.pdf. [16] V. Betz, J. Rose and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, February 1999. [17] P. Bjesse, K. Claessen, M. Sheeran and S. Singh, “Lava: hardware design in Haskell”, Proc. ACM Int. Conf. on Functional Programming, ACM Press, 1998. [18] T. Blum and C. Paar, “High-radix Montgomery modular exponentiation on reconfigurable hardware”, IEEE Transactions on Computers, Vol. 50, No. 7, pp. 759–764, 2001.

[19] W. Böhm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker and W. Najjar, “Mapping a single assignment programming language to reconfigurable systems”, The Journal of Supercomputing, Vol. 21, 2002, pp. 117–130. [20] K. Bondalapati and V.K. Prasanna, “Dynamic precision management for loop computations on reconfigurable architectures”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1999. [21] K. Bondalapati and V.K. Prasanna, “Reconfigurable computing systems”, Proc. IEEE, Vol. 90, No. 7, pp. 1201-1217, July 2002. [22] V.M. Bove and J. Watlington, “Cheops: a reconfigurable data-flow system for video processing”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, 1995, pp. 140–149. [23] M. Butts, A. DeHon and S. Goldstein, “Molecular electronics: devices, systems and tools for gigagate, gigabit chips”, Proc. Int. Conference on Computer-Aided Design, IEEE, 2002. [24] Cadence Design Systems, Inc., Palladium Datasheet, 2004. [25] T. Callahan and J. Wawrzynek, “Instruction-Level parallelism for reconfigurable computing”, Field-Programmable Logic and Applications, LNCS 1482, Springer, 1998. [26] M.-A. Cantin, Y. Savaria and P. Lavoie, “An automatic word length determination method”, Proc. IEEE International Symposium on Circuits and Systems, pp. V-53–V-56, 2001. [27] M.-A. Cantin, Y. Savaria and P. Lavoie, “A comparison of automatic word length optimization procedures”, Proc. IEEE Int. Symp. on Circuits and Systems, 2002. [28] A.J. Cantle, M. Devlin, E. Lord and R. Chamberlain, “High frame rate, low latency hardware-in-the-loop image generation: an illustration of the Particle Method and DIME”, Proc. Aerosense Conference, SPIE, 2004. [29] Celoxica, Handel-C Language Reference Manual for DK2.0, Document RM-1003-4.0, 2003. [30] Celoxica, RC2000 Development and evaluation board data sheet. [31] Celoxica, DK Design Suite, http://www.celoxica.com/products/tools/dk.asp [32] C.R. Clark and D.E.
Schimmel, “A pattern-matching co-processor for network intrusion detection systems”, Proc. Int. Conf. on Field Programmable Technology, IEEE, 2003. [33] D. Clark and B. Hutchings, “The DISC programming environment”, Proc. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996. [34] R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde and I. Bolsens, “A methodology and design environment for DSP ASIC fixed point refinement”, Proc. Design Automation and Test in Europe, 1999. [35] K. Compton and S. Hauck, “Totem: Custom reconfigurable array generation”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [36] K. Compton and S. Hauck, “Reconfigurable computing: a survey of systems and software”, ACM Computing Surveys, Vol. 34, No. 2, pp. 171-210, June 2002. [37] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “The multiple wordlength paradigm”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [38] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Optimum wordlength allocation”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002. [39] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Optimum and heuristic synthesis of multiple wordlength architectures”, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 22, No. 10, pp. 1432-1442, October 2003. [40] G.A. Constantinides, P.Y.K. Cheung and W. Luk, “Synthesis of saturation arithmetic architectures”, ACM Trans. on Design Automation of Electronic Systems, Vol. 8, No. 3, July 2003. [41] G.A. Constantinides, P.Y.K. Cheung and W. Luk, Synthesis and Optimization of DSP Algorithms, Kluwer Academic, Dordrecht, 2004. [42] G.A. Constantinides, High Level Synthesis and Word Length Optimization of Digital Signal Processing Systems, PhD thesis, Imperial College London, 2001. [43] G.A. Constantinides, “Perturbation analysis for word-length optimization”, Proc.
Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003. [44] G.A. Constantinides and G.J. Woeginger, “The complexity of multiple wordlength assignment”, Applied Mathematics Letters, Vol. 15, No. 2, pp. 137-140, February 2002. [45] N. Damianou, N. Dulay, E. Lupu and M. Sloman, “The Ponder policy specification language”, Proc. Workshop on Policies for Distributed Systems and Networks, LNCS 1995, Springer, 2001. [46] M. Dao, T. Cook, D. Silver and P. D’Urbano, “Acceleration of template-based ray casting for volume visualization using FPGAs”, Proc. IEEE Symposium on Field-Programmable Custom Computing Machines, 1995.

[47] J.G. De Figueiredo Coutinho and W. Luk, “Source-directed transformations for hardware compilation”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003. [48] A. DeHon and M.J. Wilson, “Nanowire-based sublithographic programmable logic arrays”, Proc. Int. Symp. on FPGAs, ACM Press, 2004. [49] A. Derbyshire and W. Luk, “Compiling run-time parametrisable designs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002. [50] J. Ditmar, K. Torkelsson and A. Jantsch, “A dynamically reconfigurable FPGA-based content addressable memory for internet protocol characterization”, Field Programmable Logic and Applications, LNCS 1896, Springer, 2000. [51] A. Duncan, D. Hendry and P. Gray, “An overview of the COBRA-ABS high-level synthesis system for multi-FPGA systems”, Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1998. [52] C. Ebeling, D. Cronquist and P. Franklin, “RaPiD – reconfigurable pipelined datapath”, Field-Programmable Logic and Applications, LNCS 1142, Springer, 1996. [53] Elixent Corporation, DFA 1000 Accelerator Datasheet, 2003. [54] J. Fender and J. Rose, “A high-speed ray tracing engine built on a field-programmable system”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003. [55] A. Fidjeland, W. Luk and S. Muggleton, “Scalable acceleration of inductive logic programs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002. [56] R. Franklin, D. Carver and B.L. Hutchings, “Assisting network intrusion detection with reconfigurable hardware”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002. [57] C. Fraser and D. Hanson, A Retargetable C Compiler: Design and Implementation, Benjamin/Cummings Pub. Co, 1995. [58] J. Frigo, D. Palmer, M. Gokhale and M. Popkin-Paine, “Gamma-Ray pulsar detection using reconfigurable computing hardware”, Proc. Symp. on Field Programmable Custom Computing Machines, IEEE Computer Society Press, 2003.

[59] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin and T. Tuan, “A dual-VDD low power FPGA architecture”, Field Programmable Logic and Applications, LNCS Series, Springer, 2004. [60] V. George, H. Zhang, and J. Rabaey, “The design of a low energy FPGA”, Proc. Int. Symp. on Low Power Electronics and Design, 1999. [61] M. Gokhale and J. Stone, “NAPA C: compiling for a hybrid RISC/FPGA architecture”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1998. [62] M. Gokhale, J.M. Stone, J. Arnold and M. Kalinowski, “Stream-oriented FPGA computing in the Streams-C high level language”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2000. [63] M. Gokhale, D. Dubois, A. Dubois, M. Boorman, S. Poole and V. Hogsett, “Granidt: towards gigabit rate network intrusion detection technology”, Field Programmable Logic and Applications, LNCS 2438, Springer, 2002. [64] S.C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe and R. Taylor, “PipeRench: a reconfigurable architecture and compiler”, IEEE Computer, Vol. 33, No. 4, 2000, pp. 70-77. [65] Z. Guo, W. Najjar, F. Vahid and K. Vissers, “A quantitative analysis of the speedup factors of FPGAs over processors”, Proc. Int. Symp. on FPGAs, ACM Press, 2004. [66] S. Guo and W. Luk, “An integrated system for developing regular array design”, Journal of Systems Architecture, Vol. 47, 2001. [67] S. Gupta, N.D. Dutt, R.K. Gupta and A. Nicolau, “SPARK: a high-level synthesis framework for applying parallelizing compiler transformations”, Proc. International Conference on VLSI Design, January 2003. [68] N. Gura, S.C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy and D. Stebila, “An end-to-end systems approach to Elliptic Curve Cryptography”, Proc. Cryptographic Hardware and Embedded Systems, LNCS 2523, 2002. [69] P. Haglund, O. Mencer, W. Luk and B. Tai, “PyHDL: hardware scripting with Python”, Proc. Int. Conf.
on Engineering of Reconfigurable Systems and Algorithms, 2003. [70] D. Hankerson, A. Menezes and S. Vanstone, Guide to Elliptic Curve Cryptography, Springer, 2004. [71] P. Hanrahan, “Using caching and breadth-first search to speed up ray-tracing”, Proc. Graphics Interface ’86, May 1986. [72] J.R. Hauser and J. Wawrzynek, “Garp: a MIPS processor with a reconfigurable coprocessor”, IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1997. [73] T. Harriss, R. Walke, B. Kienhuis and E. Deprettere, “Compilation from Matlab to process networks realized in FPGA”, Design Automation of Embedded Systems, Vol. 7, No. 4, 2002. [74] S.D. Haynes, J. Stone, P.Y.K. Cheung and W. Luk, “Video image processing with the SONIC architecture”, IEEE Computer, Vol. 33, 2000, pp. 50–57. [75] S.D. Haynes, H.G. Epsom, R.J. Cooper and P.L. McAlpine, “UltraSONIC: a reconfigurable architecture for video image processing”, Field Programmable Logic and Applications, LNCS 2438, Springer, 2002. [76] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985.

[77] A. Hodjat and I. Verbauwhede, “A 21.54 Gbit/s fully pipelined AES processor on FPGA”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004. [78] H. Högl, A. Kugel, J. Ludvig, R. Männer, K. Noffz and R. Zoz, “Enable++: a second-generation FPGA processor”, IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1995. [79] G. Holzmann, “The model checker SPIN”, IEEE Transactions on Software Engineering, Vol. 23, No. 5, May 1997, pp. 279–295. [80] J. Hwang, B. Milne, N. Shirazi and J.D. Stroomer, “System level tools for DSP in FPGAs”, Field-Programmable Logic and Applications, LNCS 2147, Springer, 2001. [81] P.A. Jackson, B.L. Hutchings and J.L. Tripp, “Simulation and synthesis of CSP-based interprocess communication”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003. [82] P.B. James-Roxby and D.J. Downs, “An efficient content addressable memory implementation using dynamic routing”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [83] J. Jiang, W. Luk and D. Rueckert, “FPGA-based computation of free-form deformations in medical image registration”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003. [84] H. Kagotani and H. Schmit, “Asynchronous PipeRench: architecture and performance evaluations”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003. [85] V. Kathail et al., “PICO: automatically designing custom computers”, Computer, Vol. 35, No. 9, Sept. 2002. [86] H. Keding, M. Willems, M. Coors and H. Meyr, “FRIDGE: A fixed-point design and simulation environment”, Proc. Design Automation and Test in Europe, 1998. [87] C. Kulkarni, G. Brebner and G. Schelle, “Mapping a domain specific language to a platform FPGA”, Proc. Design Automation Conference, 2004. [88] K. Kum and W.
Sung, “Word-length optimization for high-level synthesis of digital signal processing systems”, Proc. IEEE Int. Workshop on Signal Processing Systems, 1998. [89] K.-I. Kum and W. Sung, “Combined word-length optimization and high-level synthesis of digital signal processing systems”, IEEE Trans. Computer Aided Design, Vol. 20, No. 8, pp. 921–930, August 2001. [90] J. Lamoureux and S.J.E. Wilton, “On the interaction between power-aware FPGA CAD algorithms”, Int. Conf. on Computer-Aided Design, IEEE, 2003. [91] Lattice Semiconductor Corp., ispXPGA Family, January 2004. [92] R. Laufer, R. Taylor and H. Schmit, “PCI-PipeRench and the SwordAPI: a system for stream-based reconfigurable computing”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1999. [93] D.U. Lee, W. Luk, C. Wang, C. Jones, M. Smith and J. Villasenor, “A flexible hardware encoder for Low-Density Parity-Check Codes”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004. [94] E.A. Lee and D.G. Messerschmitt, “Static scheduling of synchronous data flow programs for digital signal processing”, IEEE Trans. Computers, January 1987. [95] T.K. Lee, S. Yusuf, W. Luk, M. Sloman, E. Lupu and N. Dulay, “Compiling policy descriptions into reconfigurable firewall processors”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003. [96] T.K. Lee, A. Derbyshire, W. Luk and P.Y.K. Cheung, “High-level language extensions for run-time reconfigurable systems”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003. [97] G. Lemieux and D. Lewis, Design of Interconnect Networks for Programmable Logic, Kluwer Academic Publishers, Feb. 2004. [98] J. Leonard and W.H. Mangione-Smith, “A case study of partially evaluated hardware circuits: key-specific DES”, Field Programmable Logic and Applications, LNCS 1304, Springer, 1997. [99] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong and K.
Lee, “Pilchard – A Reconfigurable Computing Platform with Memory Slot Interface”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [100] P.H.W. Leong and K.H. Leung, “A microcoded Elliptic Curve Processor using FPGA technology”, IEEE Transactions on Very Large Scale Integration Systems, Vol. 10, No. 5, 2002, pp. 550-559. [101] J. Liang, R. Tessier and O. Mencer, “Floating point unit generation and evaluation for FPGAs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2003. [102] W. Luk, “Customising processors: design-time and run-time opportunities”, Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133, Springer, 2004. [103] W. Luk, P.Y.K. Cheung and N. Shirazi, “Configurable computing”, Electrical Engineer’s Handbook, W.K. Chen (ed.), Academic Press, 2004. [104] W. Luk and S.W. McKeever, “Pebble: a language for parametrised and reconfigurable hardware design”, Field-Programmable Logic and Applications, LNCS 1482, Springer, 1998. [105] S. Mahlke et al., “Bitwidth cognizant architecture synthesis of custom hardware accelerators”, IEEE Trans. on Computer-Aided Design, Vol. 20, No. 11, November 2001.

[106] A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin and B. Hutchings, “A reconfigurable arithmetic array for multimedia applications”, ACM/SIGDA International Symposium on FPGAs, Feb 1999, pp. 135-143. [107] A. Mazzeo, L. Romano and G.P. Saggese, “FPGA-based implementation of a serial RSA processor”, Proc. Design, Automation and Test in Europe Conf., IEEE, 2003. [108] S. McCloud, Catapult C Synthesis-based design flow: speeding implementation and increasing flexibility, White Paper, Mentor Graphics, 2004. [109] S.W. McKeever, W. Luk and A. Derbyshire, “Compiling hardware descriptions with relative placement information for parametrised libraries”, Formal Methods in Computer-Aided Design, LNCS 2517, Springer, 2002. [110] B. Mei, S. Vernalde, D. Verkest, H. De Man and R. Lauwereins, “ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003. [111] O. Mencer, “PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002. [112] O. Mencer, D.J. Pearce, L.W. Howes and W. Luk, “Design space exploration with A Stream Compiler”, Proc. Int. Conf. on Field Programmable Technology, IEEE, 2003. [113] Mentor Graphics, Vstation Pro: High Performance System Verification, 2003. [114] E. Mirsky and A. DeHon, “MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1996. [115] P. Montgomery, “Modular multiplication without trial division”, Math. of Computation, Vol. 44, 1985, pp. 519–521. [116] K. Morris, “Virtex 4: Xilinx details its next generation”, FPGA and Programmable Logic Journal, June 2004. [117] H. Muller and J.
Winckler, “Distributed image synthesis with breadth-first ray tracing and the Ray-Z-buffer”, Data Structures and Efficient Algorithms - Final Report on the DFG Special Initiative, LNCS 594, Springer, 1992. [118] S.H. Muggleton, “Inverse entailment and Progol”, New Generation Computing, Vol. 13, 1995. [119] K. Nakamaru and Y. Ohno, “Breadth-first ray tracing using uniform spatial subdivision”, IEEE Transactions on Visualization and Computer Graphics, Vol. 3, No. 4, IEEE, 1997. [120] Nallatech, Ballynuey 2: Full-length PCI Card DIME Motherboard with four DIME slots, Nallatech, NT190-0045, 2002. [121] National Bureau of Standards, Announcing the Data Encryption Standard, Technical Report FIPS Publication 46, US Commerce Department, National Bureau of Standards, January 1977. [122] National Institute of Standards and Technology, Advanced Encryption Standard, http://csrc.nist.gov/publication/drafts/dfips-AES.pdf [123] A. Nayak, M. Haldar, A. Choudhary and P. Banerjee, “Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs”, Proc. Design Automation and Test in Europe, 2001. [124] S. Ong, N. Kerkiz, B. Srijanto, C. Tan, M. Langston, D. Newport and D. Bouldin, “Automatic mapping of multiple applications to multiple adaptive computing systems”, Proc. Int. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [125] G. Orlando and C. Paar, “A high performance reconfigurable Elliptic Curve processor for GF(2^m)”, Proc. Workshop on Cryptographic Hardware and Embedded Systems, LNCS 1965, 2000. [126] G. Orlando and C. Paar, “A scalable GF(p) Elliptic Curve processor architecture for programmable hardware”, Proc. Workshop on Cryptographic Hardware and Embedded Systems, LNCS 2162, 2001. [127] J. Ou and V. Prasanna, “PyGen: a MATLAB/Simulink based tool for synthesizing parameterized and energy efficient designs using FPGAs”, Proc. Int. Symp.
on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2004. [128] I. Page and W. Luk, “Compiling occam into FPGAs”, FPGAs, Abingdon EE&CS Books, 1991. [129] S. Parker, W. Martin, P. Sloan, P. Shirley, B. Smits and C. Hansen, “Interactive ray tracing”, Proc. 1999 Symposium on Interactive 3D Graphics, ACM Press, April 1999. [130] V. Pasham and S. Trimberger, “High-speed DES and Triple DES encryptor/decryptor”, Xilinx Application note: XAPP270, 2001. http://www.xilinx.com/bvdocs/appnotes/xapp270.pdf [131] C. Patterson, “High performance DES encryption in Virtex FPGAs using JBits”, Proc. Int. Conf. on Field-Programmable Custom Computing Machines, IEEE, pp. 113–121, 2000. [132] J. Peterson, B. O’Connor and P. Athanas, “Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures”, Int. Symp. on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996. [133] D. Plunkett and M. Bailey, “The vectorization of a ray-tracing algorithm for increased speed”, IEEE Computer Graphics and Applications, Vol. 5, No. 8, 1985. [134] G.M. Quenot, I.C. Kraljic, J. Serot and B. Zavidovique, “A reconfigurable compute engine for real-time vision automata prototyping”, Proc. IEEE Workshop on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1994.

[135] QuickLogic Corp., Eclipse-II Family Datasheet, January 2004. [136] A. Rahman and V. Polavarapuv, “Evaluation of low-leakage design techniques for Field Programmable Gate Arrays”, Proc. Int. Symp. on Field-Programmable Gate Arrays, ACM Press, 2004. [137] S. Rathman and G. Slavenburg, “Processing the new world of interactive media”, IEEE Signal Processing Magazine, Vol. 15, Issue 2, March 1998, pp. 108–117. [138] R. Razdan and M.D. Smith, “A high performance microarchitecture with hardware programmable functional units”, International Symposium on Microarchitecture, 1994, pp. 172-180. [139] T. Rissa, W. Luk and P.Y.K. Cheung, “Automated combination of simulation and hardware prototyping”, Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms, CSREA Press, 2004. [140] R.L. Rivest, A. Shamir and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems”, Communications of the ACM, Vol. 21, No. 2, 1978, pp. 120-126. [141] A. Royal and P.Y.K. Cheung, “Globally asynchronous locally synchronous FPGA architectures”, Field Programmable Logic and Applications, LNCS 2778, Springer, 2003. [142] C.R. Rupp, M. Landguth, T. Garverick, E. Gomersall, H. Holt, J. Arnold and M. Gokhale, “The NAPA adaptive processing architecture”, IEEE Symposium on Field-Programmable Custom Computing Machines, May 1998, pp. 28-37. [143] G.P. Saggese, A. Mazzeo, N. Mazzocca and A.G.M. Strollo, “An FPGA-based performance analysis of the unrolling, tiling, and pipelining of the AES Algorithm”, Field-Programmable Logic and Applications, LNCS 2778, 2003. [144] T. Saxe and B. Faith, “Less is More with FPGAs”, EE Times, September 13, 2004, http://www.eetimes.com/showArticle.jhtml?articleID=47203801 [145] A. Satoh and K. Takano, “A scalable dual-field Elliptic Curve Cryptographic Processor”, IEEE Transactions on Computers, Vol. 52, No. 4, 2003, pp. 449-460. [146] P. Schaumont, I. Verbauwhede, K. Keutzer and M.
Sarrafzadeh, “A quick safari through the reconfiguration jungle”, Proc. Design Automation Conf., ACM Press, 2001. [147] R. Schreiber et al., “PICO-NPA: High-level synthesis of nonprogrammable hardware accelerators”, Journal of VLSI Signal Processing Systems, Vol. 31, No. 2, June 2002. [148] P. Sedcole, P.Y.K. Cheung, G.A. Constantinides and W. Luk, “A reconfigurable platform for real-time embedded video image processing”, Field Programmable Logic and Applications, LNCS 2778, Springer, 2003. [149] S.P. Seng, W. Luk and P.Y.K. Cheung, “Flexible instruction processors”, Proc. Int. Conf. on Compilers, Arch. and Syn. for Embedded Systems, ACM Press, 2000. [150] S.P. Seng, W. Luk and P.Y.K. Cheung, “Run-time adaptive flexible instruction processors”, Field-Programmable Logic and Applications, LNCS 2438, Springer, 2002. [151] N. Shirazi, W. Luk and P.Y.K. Cheung, “Framework and tools for run-time reconfigurable designs”, IEE Proc.-Comput. Digit. Tech., May 2000. [152] Silicon Hive, “Avispa Block Accelerator”, Product Brief, 2003. [153] M. Sima, S.D. Cotofana, S. Vassiliadis, J.T.J. van Eijndhoven and K.A. Vissers, “Pel reconstruction on FPGA-augmented TriMedia”, IEEE Trans. on VLSI, Vol. 12, No. 6, June 2004, pp. 622-635. [154] Simulink, http://www.mathworks.com [155] H. Singh, M-H Lee, G. Lu, F. Kurdahi, N. Bagherzadeh and E. Chaves, “MorphoSys: An integrated reconfigurable system for data-parallel and compute intensive applications”, IEEE Trans. on Computers, Vol. 49, No. 5, May 2000, pp. 465-481. [156] S. Singh and C.J. Lillieroth, “Formal verification of reconfigurable cores”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 1999. [157] I. Sourdis and D. Pnevmatikatos, “Fast, large-scale string matching for a 10Gbps FPGA-based network intrusion system”, Field Programmable Logic and Applications, LNCS 2778, Springer, 2003. [158] Spin, http://spinroot.com/spin/whatispin.html. [159] F.X. Standaert, G. Rouvroy, J.J.
Quisquater and J.D. Legat, “Efficient implementation of Rijndael encryption in reconfigurable hardware: improvements and design tradeoffs”, Proc. Cryptographic Hardware and Embedded Systems, LNCS 2779, 2003. [160] M. Stephenson, J. Babb and S. Amarasinghe, “Bitwidth analysis with application to silicon compilation”, Proc. SIGPLAN Programming Language Design and Implementation, June 2000. [161] M.W. Stephenson, Bitwise: Optimizing bitwidths using data-range propagation, Master’s Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, May 2000. [162] G. Stitt, F. Vahid and S. Nematbakhsh, “Energy savings and speedups from partitioning critical software loops to hardware in embedded systems”, ACM Trans. on Embedded Computing Systems, Vol. 3, No. 1, Feb. 2004, pp. 218–232.

[163] H. Styles and W. Luk, “Customising graphics applications: techniques and programming interface”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2000. [164] H. Styles and W. Luk, “Accelerating radiosity calculations using reconfigurable platforms”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2002. [165] H. Styles and W. Luk, “Branch optimisation techniques for hardware compilation”, Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003. [166] W. Sung and K. Kum, “Word-length determination and scaling software for a signal flow block diagram”, Proc. IEEE Int. Conf. on Acoustics Speech and Signal Processing, 1994. [167] W. Sung and K. Kum, “Simulation-based word-length optimization method for fixed-point digital signal processing systems”, IEEE Trans. Signal Processing, Vol. 43, No. 12, December 1995, pp. 3087–3090. [168] Synplicity, Synplify Pro, http://www.synplicity.com/products/synplifypro/index.html [169] S.H. Tang, K.S. Tsui and P.H.W. Leong, “Modular exponentiation using parallel multipliers,” Proc. Int. Conf. on Field Programmable Technology, IEEE, 2003. [170] M. Taylor et al., “The RAW microprocessor: a computational fabric for software circuits and general purpose programs”, IEEE Micro, Vol. 22, No. 2, March/April 2002, pp. 25–35. [171] J. Teifel and R. Manohar, “Programmable asynchronous pipeline arrays”, Field Programmable Logic and Applications, LNCS 2778, Springer, 2003. [172] N. Telle, C.C. Cheung and W. Luk, “Customising hardware designs for Elliptic Curve Cryptography”, Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133, Springer, 2004. [173] R. Tessier and W. Burleson, “Reconfigurable computing and digital signal processing: a survey”, Journal of VLSI Signal Processing, Vol. 28, May/June 2001, pp. 7–27. [174] D. Thomas and W.
Luk, “A framework for development and distribution of ”, Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications, Proc. SPIE, Vol. 4867, 2002. [175] T. Todman and W. Luk, “Reconfigurable designs for ray tracing”, Proc. Symp. on Field-Programmable Custom Computing Machines, IEEE Computer Society Press, 2001. [176] T. Todman and W. Luk, “Combining imperative and declarative hardware descriptions”, Proc. 36th Hawaii Int. Conf. on System Sciences, IEEE, 2003. [177] T. Todman, J.G.F. Coutinho and W. Luk, “Customisable hardware compilation”, Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms, CSREA Press, 2004. [178] M.M. Uddin, Y. Cao and H. Yasuura, “An accelerated datapath width optimization scheme for area reduction of embedded systems”, Proc. Int. Symp. on Systems Synthesis, ACM Press, 2002. [179] K. Underwood, “FPGAs vs. CPUs: trends in peak floating-point performance”, Proc. Int. Symp. on FPGAs, ACM Press, 2004. [180] L. Vereen, “Soft FPGA cores attract embedded developers”, Embedded Systems Programming, 23 April 2004, http://www.embedded.com/showArticle.jhtml?articleID=19200183. [181] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati and P. Boucard, “Programmable Active Memories: Reconfigurable Systems come of age”, IEEE Transactions on VLSI Systems, Vol. 4, No. 1, March 1996, pp. 56-69. [182] S.A. Wadekar and A.C. Parker, “Accuracy sensitive word-length selection for algorithm optimization”, Proc. International Conference on Computer Design, 1998. [183] M. Weinhardt and W. Luk, “Pipeline vectorization”, IEEE Trans. on Computer-Aided Design, Vol. 20, No. 2, 2001. [184] A. Wenban, J.W. O’Leary and G.M. Brown, “Codesign of communication protocols”, IEEE Computer, Vol. 26, No. 12, pp. 46-52, December 1993. [185] A. Wenban and G. Brown, “A software development system for FPGA-based data acquisition systems”, IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
[186] D.C. Wilcox, L.G. Pierson, P.J. Robertson, E.L. Witzke and K. Gass, “A DES ASIC suitable for network encryption at 10 Gbps and beyond”, Proc. Cryptographic Hardware and Embedded Systems, LNCS 1717, 1999. [187] M. Willems, V. Bürsgens, H. Keding, T. Grötker and M. Meyer, “System-level fixed-point design based on an interpolative approach”, Proc. 34th Design Automation Conference, June 1997. [188] J.A. Williams, A.S. Dawood and S.J. Visser, “FPGA-based cloud detection for real-time onboard remote sensing”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2002. [189] R.S. Williams and P.J. Kuekes, “Molecular ”, Proc. Int. Symp. on Circuits and Systems, IEEE, 2000. [190] R.P. Wilson, R.S. French, C.S. Wilson, S.P. Amarasinghe, J.M. Anderson, S.W.K. Tjiang, S.-W. Liao, C.-W. Tseng, M.W. Hall, M.S. Lam and J.L. Hennessy, “SUIF: an infrastructure for research on parallelizing and optimizing compilers”, ACM SIGPLAN Notices, Vol. 29, No. 12, December 1994. [191] R.D. Wittig and P. Chow, “OneChip: an FPGA processor with reconfigurable logic”, IEEE Symposium on FPGAs for Custom Computing Machines, 1996.

[192] C.G. Wong, A.J. Martin and P. Thomas, “An architecture for asynchronous FPGAs”, Proc. Int. Conf. on Field-Programmable Technology, IEEE, 2003.
[193] R. Woods, D. Trainor and J.P. Heron, “Applying an XC6200 to real-time image processing”, IEEE Design and Test of Computers, Vol. 15, Jan.–March 1998, pp. 30-38.
[194] Xilinx, Inc., PowerPC 405 Processor Block Reference Guide, October 2003.
[195] Xilinx, Inc., Virtex II Datasheet, June 2004.
[196] Xilinx, Inc., MicroBlaze Processor Reference Guide, June 2004.
[197] Xilinx, Inc., Logic Design Products, http://www.xilinx.com/products/design_resources/design_tool/grouping/logic_design_prod.htm
[198] A. Yamada, K. Nishida, R. Sakurai, A. Kay, T. Nomura and T. Kambe, “Hardware synthesis with the Bach system”, Proc. ISCAS, IEEE, 1999.
[199] H. Ziegler, B. So, M. Hall and P. Diniz, “Coarse-grain pipelining on multiple-FPGA architectures”, IEEE Symposium on Field-Programmable Custom Computing Machines, IEEE, 2002, pp. 77-88.

TABLE IV

A COMPARISON OF WORDLENGTH AND SCALING OPTIMIZATION SYSTEMS AND METHODS.

System | Analytic / Simulation | Cyclic Data Flow? | MSB-optimization | LSB-optimization | Error Trade-off | Comments
Benedetti [13] | analytic | none | through interval arithmetic | through ‘multi-interval’ approach | no error | can be pessimistic
Stephenson [160], [161] | analytic | for finite loop bounds | through forward and backward range propagation | none | no error | less pessimistic than [13] due to backwards propagation
Nayak [123] | analytic | not supported for error analysis | through forward and backward range propagation | through fixing number of fractional bits for all variables | user-specified or inferred absolute bounds on error | fractional parts have equal wordlength
Wadekar [182] | analytic | none | through forward range propagation | through genetic algorithm search for suitable wordlengths | user-specified absolute bounds | uses Taylor series at limiting values to determine error propagation
Keding [86], [187] | hybrid | with user intervention | through user-annotations and forward range propagation | through user-annotations and forward wordlength propagation | not automated | possible truncation error
Cmar [34] | hybrid for scaling, simulation for error only | with user intervention | through combined simulation and forward range propagation | wordlength bounded through hybrid fixed or floating simulation | not automated | less pessimistic than [13] due to input error propagation
Kum [88], [89], [166], [167] | simulation (hybrid for multiply-accumulate in [88], [89]) | yes | through measurement of variable signals’ mean and standard deviation | through heuristics based on simulation results | user-specified bounds and metric | long simulation time possible
Constantinides [41] | analytic | yes | through tight analytic bounds on signal range and automatic design of saturation arithmetic | through heuristics based on an analytic noise model | user-specified bounds on noise power and spectrum | only applicable to linear time-invariant systems
Constantinides [43] | hybrid | yes | through simulation | through heuristics based on a hybrid noise model | user-specified bounds on noise power and spectrum | only applicable to differentiable non-linear systems
Abdul Gaffar [1], [2] | hybrid | with user intervention | through simulation-based range propagation | through automatic differentiation based dynamic analysis | user-specified bounds and metric | covers both fixed-point and floating-point
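The two optimization problems the table compares can be made concrete with a small sketch. The following illustrative Python fragment (not code from any of the surveyed tools; the `Interval` class and helper names are our own) shows forward range propagation with interval arithmetic, the basis of MSB-optimization in analytic approaches such as [13], together with the worst-case rounding error implied by a chosen number of fractional (LSB) bits.

```python
from math import ceil, log2

class Interval:
    """Closed interval [lo, hi] used for worst-case range analysis."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of intervals: endpoints add directly.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Product range is bounded by the four endpoint products.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

def integer_bits(iv):
    """Bits needed left of the binary point for a signed value in iv."""
    m = max(abs(iv.lo), abs(iv.hi))
    return ceil(log2(m + 1)) + 1  # +1 for the sign bit

def lsb_error_bound(frac_bits):
    """Worst-case error of rounding to frac_bits fractional bits."""
    return 2.0 ** (-(frac_bits + 1))

# Example: y = a*x + b with a in [-2, 2], x in [-1, 1], b in [0, 3].
y = Interval(-2, 2) * Interval(-1, 1) + Interval(0, 3)
# Propagated range of y is [-2, 5], so 4 integer bits (incl. sign) suffice;
# rounding y to 8 fractional bits costs at most 2^-9 per operation.
```

Note how the interval result can be pessimistic, as the table records for [13]: the bound [-2, 5] ignores any correlation between `a` and `x`, which is precisely what the less conservative backward-propagation and simulation-based schemes in the table try to exploit.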