
Prolegomena
Department Editors: Kevin Rudd and Kevin Skadron

Are Field-Programmable Gate Arrays Ready for the Mainstream?

Greg Stitt, University of Florida

For more than a decade, researchers have shown that field-programmable gate arrays (FPGAs) can accelerate a wide variety of software, in some cases by several orders of magnitude compared to state-of-the-art microprocessors.1-3 Despite this performance advantage, FPGA accelerators have not yet been accepted by mainstream application designers and instead remain a niche technology used mainly in embedded systems.

By contrast, GPUs have quickly gained wide popularity. Although GPUs often provide better performance, especially for floating-point-intensive applications, FPGAs commonly have significant advantages for bit-level and integer operations.4,5 In addition, FPGAs often have better energy efficiency,6 in some cases by as much as 10 times.5 A motivating example is the FPGA-based Novo-G system,7 which provides performance for computational biology applications that's comparable to the Roadrunner and Jaguar supercomputers, while consuming only 8 kW compared to several megawatts.

With the importance of energy efficiency growing rapidly, these surprising results raise the following question: if FPGAs offer such significant advantages over alternative devices, why are they still a niche technology? This column will attempt to answer this question by discussing the major barriers that have prevented more widespread FPGA usage, in addition to surveying research and necessary innovations that could potentially eliminate these barriers.

FPGA architectures

To understand the barriers to mainstream usage, we must first understand how FPGA architectures differ from multicore and GPU architectures. The most fundamental difference is that multicores and GPUs provide general-purpose or specialized processors to execute parallel threads, whereas FPGA architectures implement digital circuits by providing numerous lookup tables implemented in small RAMs. Lookup tables implement combinational logic by storing the corresponding truth table and using the logic inputs as the address into the lookup table. Figure 1 shows a simple example of a 32-bit adder decomposed into 32 full adder circuits, which synthesis tools might map onto 32 lookup tables. Similarly, FPGAs enable sequential circuits by providing flip-flops along lookup table outputs. By providing hundreds of thousands of lookup tables and flip-flops, FPGAs can implement massively parallel circuits.

Modern FPGAs also commonly have coarser-grained resources such as configurable logic blocks (CLBs), multipliers, digital signal processors, and on-chip RAMs. Some embedded-systems-specialized FPGAs, such as Xilinx's Virtex-5Q FXT (http://www.xilinx.com/products/silicon-devices/fpga/virtex-5q/fxt.htm), even contain microprocessor cores. In addition, for FPGAs without these microprocessors, designers can use the lookup tables to implement numerous soft processors—such as Xilinx's MicroBlaze Soft Processor (http://www.xilinx.com/tools/microblaze.htm) and Altera's Nios II Processor (http://www.altera.com/devices/processor/nios2/ni2-index.html)—in some cases even more than 1,000.8 Although such processors make an FPGA look similar to a GPU or multicore, soft processors are an order of magnitude slower than most dedicated processors, which has resulted in niche usage (for example, embedded systems and multicore emulation8). For mainstream computing, soft processors would likely be used as controllers for custom pipelined circuits in other parts of the FPGA.

To combine these resources into a larger circuit, FPGAs provide reconfigurable interconnect similar to Figure 2. In between each row and column of resources (such as lookup tables and CLBs), FPGAs contain numerous routing tracks, which are wires that carry signals across the chip.

Published by the IEEE Computer Society. 0272-1732/11/$26.00 © 2011 IEEE.

Connection boxes provide programmable connections between resource I/O and routing tracks. Similarly, switch boxes provide programmable connections between routing tracks where each row and column intersect, which allows tools to route a signal to any destination on the device. Such an architecture is commonly called an island-style fabric, where computational resources are analogous to islands in a sea of interconnect.

Figure 1. Field-programmable gate arrays (FPGAs) perform computation by implementing the corresponding logic (for example, a 32-bit adder) using lookup tables (LUTs). For this example, synthesis tools divide the 32-bit adder into smaller circuits (for example, 32 full adders) and then map those circuits onto lookup tables with the same or larger numbers of inputs and outputs.

Figure 2. A typical FPGA architecture. FPGAs contain numerous routing tracks, which carry signals across the chip between computational resources (for example, configurable logic blocks [CLBs], multipliers [Mult], and digital signal processors [DSPs]). Connection boxes provide programmable connections between resource I/O and routing tracks, while switch boxes provide connections between routing tracks only.

To implement an application on this architecture, FPGA tools typically take a register-transfer-level (RTL) circuit and perform technology mapping to convert the circuit into device resources (for example, converting an adder to lookup tables in Figure 1), followed by placement to map each technology-mapped component onto physical resources, and finally routing to program the interconnect to implement all connections in the circuit. When using processors, either soft or hard, device vendors provide tool frameworks, such as Xilinx's EDK (Embedded Development Kit) and Altera's SOPC (System on a Programmable Chip) Builder, that provide software compilers in addition to codesign tools for integrating the processors with custom circuits in other parts of the FPGA. For FPGAs in PCI Express accelerator boards, board vendors provide a communication API for transferring data to and from software code running on a host processor.

Design methodologies and programming models

To use an FPGA for accelerating software, designers typically start by profiling software, identifying the most computationally intensive regions, and then implementing those regions as circuits in the FPGA.

To implement a region of code on the FPGA, there are several possible programming models. The most common model is to manually convert the code into a semantically equivalent RTL circuit, which designers typically specify using hardware-description languages (HDLs) such as VHSIC Hardware Description Language (VHDL) or Verilog. Figure 3a shows an example of a high-level loop and a corresponding pipelined data path in Figure 3b that fully unrolls the inner loop, resulting in 64 multiplications and 63 additions (that is, one iteration of the outer loop) each cycle. Other parts of the circuit (such as control, buffers, and memory) are not shown.

As you would expect, designing RTL circuits is time consuming. For the example in Figure 3b, designers must specify the entire structure of the data path and must also define control for reading inputs from memories into buffers, stalling the data path when buffers are full or empty, writing outputs to memory, and so on. For a typical FPGA board, a complete RTL implementation of the several lines of C code in Figure 3a can require more than 1,000 lines of VHDL. Such complexity leads to the frequent question: if compilers can extract parallelism from high-level code for multicores and GPUs, why can't they do the same for FPGAs? For more than a decade, researchers have worked toward this goal with high-level synthesis tools.9 Commercial high-level synthesis tools are becoming increasingly common (for example, AutoPilot, Catapult C, Impulse C), and open-source tools have also begun to appear.10,11



    void filter(float a[], float b[], float c[]) {
        for (int i = 0; i < OUTPUT_SIZE; i++) {
            c[i] = 0;
            for (int j = 0; j < 64; j++) {
                c[i] += a[i+j] * b[j];
            }
        }
    }

Figure 3. An example of high-level code (a) converted to a pipelined register-transfer-level data path that performs 64 multiplications and 63 additions every cycle (b). This data path fully unrolls the inner loop so that all operations can be pipelined, allowing for a single iteration of the outer loop to be executed every cycle.

Barriers to mainstream usage

FPGAs have numerous limitations, but the main barriers to mainstream usage are amenability, cost, and productivity.

Amenability

One significant barrier to mainstream FPGA usage is that not all applications are amenable to FPGA implementation. One reason is that common software constructs (such as recursion, virtual functions, and pointer-based data structures) generally don't have efficient circuit implementations. For competitive performance, FPGA applications also generally require significant parallelism owing to clock speeds that are commonly 10 to 20 times slower than CPUs and GPUs.

Such limitations aren't unique to FPGAs. GPUs have similar problems with these same higher-level software constructs and also require a significant amount of parallelism. One key difference between FPGA and GPU amenability is that high-performance FPGA applications generally require deep pipelines, such as the example in Figure 3b, where inputs and outputs continually stream through each cycle. FPGA applications must often execute hundreds or thousands of operations each cycle to outweigh slow clock frequencies. Although such parallelism can be achieved in many ways, pipelining is common because of area efficiency. For the example in Figure 3b, an alternative approach using thread-level parallelism could eliminate the pipeline registers and replicate the circuit multiple times to execute multiple iterations in separate threads. However, each thread would require an additional 64 multipliers and 63 adders, and would significantly slow down the FPGA clock by removing the registers, while also greatly increasing memory bandwidth requirements, making it difficult to reach the level of parallelism needed to account for the even-slower clock. Although this thread-based approach could save area by sharing resources or by using soft processors, these implementations would reduce each thread's throughput, in turn requiring even more parallelism. Alternatively, pipelining lets a designer generate one output every cycle with smaller area requirements and with clock frequencies that can be several times faster than nonpipelined circuits. Furthermore, pipelining and thread-level parallelism aren't mutually exclusive; in many cases, high-performance FPGA implementations will replicate pipelines to generate multiple outputs each cycle. Such replication is often referred to as "wide" parallelism, in contrast to the "deep" parallelism achieved by pipelining. GPUs, by contrast, exploit thread-level parallelism by scheduling hundreds of threads simultaneously onto specialized microprocessors.

The lack of amenability for mainstream applications has also limited FPGA usage by inhibiting the emergence of a "killer app" supporting high sales volumes that subsidize development of new features and provide economies of scale to lower prices. Whereas GPUs became widespread owing to computer graphics' popularity in desktops, laptops, and gaming consoles, FPGAs evolved from a much smaller market of designers requiring glue logic or application-specific integrated circuit (ASIC) prototyping. Having already achieved widespread usage, GPUs are in an attractive position for use in general-purpose acceleration, whereas FPGAs are still considered niche devices. For similar reasons, the discovery of such a killer app for FPGAs would likely increase general-purpose usage, but designers would still likely resist FPGA usage because of the other barriers.

Cost

Another significant reason for mainstream FPGA resistance is high device costs. New FPGAs generally cost more than $10,000, whereas new GPUs cost several hundred to several thousand dollars. FPGA prices do fall significantly over time, and there are much cheaper, smaller FPGAs, but a designer wanting to use the newest, largest FPGA should expect a price of at least $10,000. Even worse, this price is only for the FPGA device; board vendors, IP vendors, and tools can significantly add to the total cost.

Productivity

Although device costs are a deterrent, prohibitive application design costs resulting from low productivity12 are the most significant barrier to more widespread usage. Many FPGA designers believe the main reason FPGAs have not been more widely accepted is application design complexity that far exceeds that of other platforms. This complexity has resulted in poor productivity that has raised development costs and caused resistance to FPGAs, which in turn has prevented economy-of-scale benefits and led to significantly higher device costs.

While high-end FPGAs will likely never be as cheap as GPUs, solving the productivity problems could narrow the gap significantly.

There are numerous reasons for low FPGA productivity, but the largest problem is the requirement for digital-design expertise and knowledge of low-level device details. Because FPGAs were originally intended for implementing RTL circuits, designers have traditionally created FPGA applications using HDLs as discussed earlier, which is a time-consuming process requiring skills that aren't common to mainstream designers. For example, pipelining is generally difficult to specify because of the need to create numerous interconnections of resources with specific timing requirements. Design reuse can sometimes reduce complexity, but in general, RTL designs require at least an order of magnitude more code than traditional high-level languages. GPU applications would also require additional CUDA or OpenCL code to parallelize a function and move data back and forth to the GPU, but most designers consider GPU code to be far less complex than RTL code.

In addition to requiring device expertise, FPGA applications typically require more effort than other devices to improve amenability. For example, GPUs are generally efficient for floating-point operations, which allow many designers to ignore precision issues. FPGAs are capable of handling floating-point operations, but often achieve order-of-magnitude improvements when using fixed-point, integer, or bit-level operations. Therefore, to achieve an FPGA's full potential, designers might have to perform numerical analysis to determine an appropriate precision, and then implement a circuit with that precision, which can significantly increase design time.

Although high-level synthesis tools help eliminate the need for creating RTL circuits, several limitations have prevented such tools from becoming widely accepted. First, many of these tools use specialized high-level languages. Although effective for FPGA experts, these specialized languages restrict common coding constructs and styles, which mainstream application designers have resisted. High-level synthesis tools also target widespread languages such as ANSI C, but such tools are highly sensitive to coding style and constructs in order to effectively extract parallelism. These tools can force designers into a particular coding style that shares many disadvantages of specialized languages, or alternatively results in inefficient circuits for code not following strict guidelines.13

An increasingly significant productivity bottleneck is lengthy compilation times due to the complexity of placement and routing. Although such times have always been a nuisance to FPGA designers, it's now common for FPGA compilation to take many hours or even days.14 This lengthy process creates a major design bottleneck that prevents mainstream usage by forcing designers to adopt methodologies that conflict with common software design practices based on rapid compilation. For example, mainstream designers often debug applications by rapidly making changes, recompiling, and re-executing, which clearly isn't feasible when compilation requires hours. Furthermore, lengthy compile times prevent advances in runtime optimization, such as the dynamic compilation features of OpenCL, which designers increasingly use for GPU and multicore applications. OpenCL and similar tools could potentially make FPGAs transparent, letting designers program applications in the same way for different devices. However, stalling an application for several hours to compile one FPGA kernel will likely not be accepted. For high-level synthesis tools for other parallel languages, even with precompiled FPGA kernels, designers will likely prefer other devices due to the ability to make rapid changes.

A lack of portability is an additional productivity barrier. Unlike desktop microprocessors, which achieve portability with a common instruction set, FPGA applications are specific to one product, forcing designers to recompile for other products. Even worse, because FPGAs have different numbers and types of resources, designers targeting a different product must often make significant changes to an RTL circuit. Design tools also suffer from a lack of portability. For example, high-level synthesis tools often support only a particular type of device or a specific board, which limits their usefulness. This lack of portability contrasts with software compilers, which can map the same application code onto significantly different instruction sets.

Finally, debugging an FPGA application is significantly more complex than debugging software. Designers generally debug FPGA applications by simulating the RTL code or technology-mapped circuit and analyzing hundreds of waveforms. This requires significant effort and is further complicated by lengthy compilation. Even when using high-level synthesis, designers often debug using waveform simulations because high-level synthesis tools typically provide inaccurate thread-based simulations.15 Even worse, simulations often differ from actual on-chip behavior,15 which forces designers to trace waveforms of on-chip behavior using tools such as SignalTap or ChipScope.

Needed improvements

To enable mainstream use of FPGAs, researchers must achieve progress on several key problems.

High-level synthesis will continue to be a critical enabling technology for mainstream usage. Unfortunately, limited success after decades of studies suggests that such tools may never expand beyond their current niche usage. However, the recent GPU trend could actually help FPGAs with this problem by providing standardized high-level languages that omit problematic constructs. Several examples for GPUs are OpenCL, CUDA, and Brook, which FPGA researchers are now investigating for high-level synthesis. If successful, mainstream designers could potentially target any accelerator device using similar high-level code.



Reducing FPGA compilation times is also critical to enabling mainstream usage. Even in cases where FPGAs can significantly outperform GPUs, and even where device cost isn't an issue, some designers are still choosing GPUs because of the improved productivity from higher-level code and rapid compilation. Surprisingly, FPGA compilation times have largely been ignored, likely owing to the common usage of FPGAs to implement custom circuits, for which designers are used to such times. Recent work has shown promising results, with compilation speedups from 50 times16 to 500 times.14 The current limitation of these approaches is area and performance overhead, which will need to be reduced before widespread usage is feasible.

Portability and debugging also need significant improvements, which could be partially achieved with the discussed high-level synthesis innovations. However, one problem facing the FPGA community is the lack of commonality among FPGA boards, which has led to various board architectures. Because these different boards are specialized toward targeted domains, the variations make it harder for designers to create a portable application. For example, a GiDEL PROCStar board has four FPGAs, with three external memories per FPGA. A Nallatech H101PCIXM board has one FPGA with five external memories. The XtremeData XD1000 has a single FPGA with a single external memory. Furthermore, some boards interface with a microprocessor over PCI Express, while other systems use in-socket FPGAs in place of microprocessors. Such variety also complicates high-level synthesis tools, which generally assume a logical architecture that vendors port to a specific board via a platform-support package. Because of the number of FPGA boards and the small size of tool vendors, vendors can't provide platform-support packages for all boards. Even worse, platform-support packages often are incapable of using all board features for similar reasons. For example, a platform-support package might limit memory accesses to a single memory, even when multiple memories are available. Considering that memory bandwidth is often a bottleneck for FPGA applications, these board variations cause significant performance differences between high-level synthesis and custom RTL designs.

When considering how many problems must be solved to enable mainstream FPGA usage, many application designers are pessimistic about the future of FPGAs as mainstream software accelerators. However, FPGA virtualization can potentially address most of these problems. Virtual FPGA platforms, analogous to virtual machines for high-level languages such as Java, provide a set of virtual coarse-grained resources (such as floating-point cores and fast Fourier transforms) that hide the physical platform's fine-grained complexity while also enabling transparent sharing of those resources for different applications or different threads of the same application.

To use a virtual FPGA platform, a designer specifies his or her application independently of a specific architecture. During compilation, virtualization tools analyze the application and identify a customized set of virtual resources based on the application requirements. The virtualization tools then either select an appropriate virtual platform from a library or generate a new platform specifically for the application. By customizing a platform to the application, virtualization significantly reduces compilation times by hiding the hundreds of thousands of fine-grained lookup tables on a physical FPGA, in addition to widely varying memory architectures. Recent work has shown that such virtualization provides compilation times that are more than 500 times faster than FPGA vendor tools.14

By compiling applications to a virtual FPGA platform, they become portable across any physical platform that can implement the virtual platform. Such portability reduces design complexity and lets designers move applications to newer, more powerful physical platforms with potentially no effort. Similarly, virtual FPGAs enable third-party tools to target any physical platform by providing a platform-support package for one configurable virtual platform.

Another advantage of virtualization is the ability to mimic architectures used in increasingly common programming frameworks such as OpenCL and CUDA. When combined with high-level synthesis from the corresponding languages,17,18 FPGA virtualization lets designers use the same design flow as GPUs and multicore CPUs. In fact, OpenCL's dynamic compilation features could potentially hide FPGAs from mainstream designers, letting designers simply specify parallel kernels that the OpenCL framework would simultaneously schedule onto GPUs, multicores, and virtual FPGA platforms.

The main disadvantage of FPGA virtualization is the potential for significant performance and area overhead caused by implementing a virtual FPGA on top of a physical FPGA with significantly different resources. Preliminary results show a modest performance overhead of less than 10 percent. However, the area requirements of some virtual FPGAs can waste up to 30 percent of the resources on a physical FPGA. Reducing this overhead is an active area of research, and the overhead is predicted to be reduced by 90 percent with improved compilation and virtual architectures.

Are FPGAs ready for the mainstream? For designers expecting design flows similar to those of GPUs and multicores, the answer is likely no. For designers with digital-design experience who are willing to accept lower productivity, FPGAs can already provide significant advantages for certain domains. If the called-for innovations are all realized, FPGAs will be ready to take over a larger share of the accelerator market. In fact, FPGA virtualization has the potential to make application design for FPGAs indistinguishable from that for GPUs.

Considering the increasing importance of power and energy efficiency, usage of FPGAs by mainstream designers is an exciting possibility that could significantly advance the state of the art in many domains.

References
1. S. Craven and P. Athanas, "Examining the Viability of FPGA Supercomputing," EURASIP J. Embedded Systems, vol. 2007, no. 1, 2007, pp. 13-20.
2. A. DeHon, "The Density Advantage of Configurable Computing," Computer, vol. 33, no. 4, 2000, pp. 41-49.
3. Z. Guo et al., "A Quantitative Analysis of the Speedup Factors of FPGAs over Processors," Proc. ACM/SIGDA 12th Int'l Symp. Field Programmable Gate Arrays (FPGA 04), ACM Press, 2004, pp. 162-170.
4. S. Che et al., "Accelerating Compute-Intensive Applications with GPUs and FPGAs," Proc. IEEE Symp. Application-Specific Processors (SASP 08), IEEE Press, 2008, pp. 101-107.
5. J. Williams et al., "Characterization of Fixed and Reconfigurable Multi-core Devices for Application Acceleration," ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 4, 2010, pp. 1-29.
6. G. Stitt and F. Vahid, "Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic," IEEE Design & Test of Computers, vol. 19, no. 6, 2002, pp. 36-43.
7. A. George, H. Lam, and G. Stitt, "Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing," Computing in Science & Engineering, vol. 13, no. 1, 2011, pp. 82-86.
8. D. Burke et al., "RAMP Blue: Implementation of a Many-Core 1008 Processor System," Proc. Reconfigurable Systems Summer Institute (RSSI 08), 2008; www.rssi2008.org/proceedings/papers/presentations/19_Burke.pdf.
9. G. Martin and G. Smith, "High-Level Synthesis: Past, Present, and Future," IEEE Design & Test of Computers, vol. 26, no. 4, 2009, pp. 18-25.
10. A. Canis et al., "LegUp: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems," Proc. 19th ACM/SIGDA Int'l Symp. Field Programmable Gate Arrays (FPGA 11), ACM Press, 2011, pp. 33-36.
11. J. Villarreal et al., "Designing Modular Hardware Accelerators in C with ROCCC 2.0," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 10), IEEE Press, 2010, pp. 127-134.
12. B.E. Nelson et al., "Design Productivity for Configurable Computing," Proc. Int'l Conf. Eng. Reconfigurable Systems and Algorithms (ERSA 08), 2008, pp. 57-66.
13. S. Sirowy, G. Stitt, and F. Vahid, "C is for Circuits: Capturing FPGA Circuits as Sequential Code for Portability," Proc. 16th Int'l ACM/SIGDA Symp. Field Programmable Gate Arrays (FPGA 08), ACM Press, 2008, pp. 117-126.
14. J. Coole and G. Stitt, "Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing," Proc. IEEE/ACM/IFIP Int'l Conf. Hardware/Software Codesign and System Synthesis (CODES/ISSS 10), IEEE Press, 2010, pp. 13-22.
15. J. Curreri, G. Stitt, and A. George, "High-Level Synthesis of In-Circuit Assertions for Verification, Debugging, and Timing Analysis," Int'l J. Reconfigurable Computing, vol. 2011, 2011, doi:10.1155/2011/406857.
16. C. Lavin et al., "HMFlow: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 11), IEEE Press, 2011, pp. 117-124.
17. M. Owaida et al., "Synthesis of Platform Architectures from OpenCL Programs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 11), IEEE Press, 2011, pp. 186-193.
18. A. Papakonstantinou et al., "Multilevel Granularity Parallelism Synthesis on FPGAs," Proc. IEEE Symp. Field-Programmable Custom Computing Machines (FCCM 11), IEEE Press, 2011, pp. 178-185.

Greg Stitt is an assistant professor in the Electrical and Computer Engineering Department at the University of Florida. His research interests include reconfigurable computing, design automation, and embedded systems. Stitt has a PhD in computer science from the University of California, Riverside.

Direct questions or comments about this column to Greg Stitt, gstitt@ece.ufl.edu.

