Are Field-Programmable Gate Arrays Ready for the Mainstream?
Prolegomena
Department Editors: Kevin Rudd and Kevin Skadron

GREG STITT
University of Florida

Published by the IEEE Computer Society, November/December 2011. 0272-1732/11/$26.00 © 2011 IEEE

For more than a decade, researchers have shown that field-programmable gate arrays (FPGAs) can accelerate a wide variety of software, in some cases by several orders of magnitude compared to state-of-the-art microprocessors [1-3]. Despite this performance advantage, FPGA accelerators have not yet been accepted by mainstream application designers and instead remain a niche technology used mainly in embedded systems.

By contrast, GPUs have quickly gained wide popularity. Although GPUs often provide better performance, especially for floating-point-intensive applications, FPGAs commonly have significant advantages for bit-level and integer operations [4,5]. In addition, FPGAs often have better energy efficiency [6], in some cases by as much as 10 times [5]. A motivating example is the FPGA-based Novo-G system [7], which provides performance for computational biology applications that’s comparable to the Roadrunner and Jaguar supercomputers, while only consuming 8 kW compared to several megawatts.

With the importance of energy efficiency growing rapidly, these surprising results raise the following question: if FPGAs offer such significant advantages over alternative devices, why are they still a niche technology? This column will attempt to answer this question by discussing the major barriers that have prevented more widespread FPGA usage, in addition to surveying research and necessary innovations that could potentially eliminate these barriers.

FPGA architectures

To understand the barriers to mainstream usage, we must first understand how FPGA architectures differ from multicore and GPU architectures. The most fundamental difference is that multicores and GPUs provide general-purpose or specialized processors to execute parallel threads, whereas FPGA architectures implement digital circuits by providing numerous lookup tables implemented in small RAMs. Lookup tables implement combinational logic by storing the corresponding truth table and using the logic inputs as the address into the lookup table. Figure 1 shows a simple example of a 32-bit adder decomposed into 32 full adder circuits, which synthesis tools might map onto 32 lookup tables. Similarly, FPGAs enable sequential circuits by providing flip-flops along lookup table outputs. By providing hundreds of thousands of lookup tables and flip-flops, FPGAs can implement massively parallel circuits.

Figure 1. Field-programmable gate arrays (FPGAs) perform computation by implementing the corresponding logic (for example, a 32-bit adder) using lookup tables (LUTs). For this example, synthesis tools divide the 32-bit adder into smaller circuits (for example, 32 full adders) and then map those circuits onto lookup tables with the same or larger numbers of inputs and outputs.

Modern FPGAs also commonly have coarser-grained resources such as configurable logic blocks (CLBs), multipliers, digital signal processors, and on-chip RAMs. Some embedded-systems-specialized FPGAs, such as Xilinx’s Virtex-5Q FXT (http://www.xilinx.com/products/silicon-devices/fpga/virtex-5q/fxt.htm), even contain microprocessor cores. In addition, for FPGAs without these microprocessors, designers can use the lookup tables to implement numerous soft processors, such as Xilinx’s MicroBlaze Soft Processor (http://www.xilinx.com/tools/microblaze.htm) and Altera’s Nios II Processor (http://www.altera.com/devices/processor/nios2/ni2-index.html), in some cases even more than 1,000 [8]. Although such processors make an FPGA look similar to a GPU or multicore, soft processors are an order of magnitude slower than most dedicated processors, which has resulted in niche usage (for example, embedded systems and multicore emulation [8]). For mainstream computing, soft processors would likely be used as controllers for custom pipelined circuits in other parts of the FPGA.

To combine these resources into a larger circuit, FPGAs provide reconfigurable interconnect similar to Figure 2. In between each row and column of resources (such as lookup tables and CLBs), FPGAs contain numerous routing tracks, which are wires that carry signals across the chip. Connection boxes provide programmable connections between resource I/O and routing tracks. Similarly, switch boxes provide programmable connections between routing tracks where each row and column intersect, which allows for tools to route a signal to any destination on the device. Such an architecture is commonly called an island-style fabric, where computational resources are analogous to islands in a sea of interconnect.

Figure 2. A typical FPGA architecture. FPGAs contain numerous routing tracks, which carry signals across the chip between computational resources (for example, configurable logic blocks [CLBs], multipliers [Mult], and digital signal processors [DSPs]). Connection boxes provide programmable connections between resource I/O and routing tracks, while switch boxes provide connections between routing tracks only.

To implement an application on this architecture, FPGA tools typically take a register-transfer-level (RTL) circuit and perform technology mapping to convert the circuit into device resources (for example, converting an adder to lookup tables in Figure 1), followed by placement to map each technology-mapped component onto physical resources, and finally routing to program the interconnect to implement all connections in the circuit. When using processors, either soft or hard, device vendors provide tool frameworks such as Xilinx’s EDK (Embedded Development Kit) and Altera’s SOPC (System on a Programmable Chip) Builder that provide software compilers in addition to codesign tools for integrating the processors with custom circuits in other parts of the FPGA. For FPGAs in PCI Express accelerator boards, board vendors provide a communication API for transferring data to and from software code running on a host processor.

Design methodologies and programming models

To use an FPGA for accelerating software, designers typically start by profiling software, identifying the most computationally intensive regions, and then implementing those regions as circuits in the FPGA.

To implement a region of code on the FPGA, there are several possible programming models. The most common model is to manually convert the code into a semantically equivalent RTL circuit, which designers typically specify using hardware-description languages (HDLs) such as VHSIC Hardware Description Language (VHDL) or Verilog. Figure 3a shows an example of a high-level loop and a corresponding pipelined data path in Figure 3b that fully unrolls the inner loop, resulting in 64 multiplications and 63 additions (that is, one iteration of the outer loop) each cycle. Other parts of the circuit (such as control, buffers, and memory) are not shown.

As you would expect, designing RTL circuits is time consuming. For the example in Figure 3b, designers must specify the entire structure of the data path and must also define control for reading inputs from memories into buffers, stalling the data path when buffers are full or empty, writing outputs to memory, and so on. For a typical FPGA board, a complete RTL implementation of the several lines of C code in Figure 3a can require more than 1,000 lines of VHDL.

Such complexity leads to the frequent question: if compilers can extract parallelism from high-level code for multicores and GPUs, why can’t they do the same for FPGAs? For more than a decade, researchers have worked toward this goal with high-level synthesis tools [9]. Commercial high-level synthesis tools are becoming increasingly common (for example, AutoPilot, Catapult C, Impulse C),

    void filter(float a[], float b[], float c[]) {
        for (int i = 0; i < OUTPUT_SIZE; i++) {
            c[i] = 0;
            for (int j = 0; j < 64; j++) {
                c[i] += a[i+j] * b[j];
            }
        }
    }

(a)

FPGA usage by inhibiting the emergence of a “killer app” supporting high sales volumes that subsidize development of new features and provide economies of scale to lower prices. Whereas GPUs became widespread owing to computer graphics’ popularity in desktops, laptops, and gaming consoles, FPGAs evolved

Such limitations aren’t unique to FPGAs. GPUs have similar problems with these same higher-level software constructs and also require a significant amount of parallelism. One key difference between FPGA and GPU amenability is that high-performance FPGA applications generally require deep pipelines, such as the example in Figure