FPGA Programming for the Masses

practice DOI:10.1145/2436256.2436271 used to implement sequential logic. Article development led by queue.acm.org Apart from the homogeneous array of logic cells, an FPGA also contains discrete components such as BRAMs The programmability of FPGAs must improve (block RAMs), digital signal process- if they are to be part of mainstream computing. ing (DSP) slices, processor cores, and various communication cores (for ex- BY DavID F. BACON, RODRIC RabbaH, AND SUNIL SHUkla ample, Ethernet MAC and PCIe). BRAMs, which are specialized memory structures distributed throughout the FPGA fabric in columns, are of par- ticular importance. Each BRAM can FPGA hold up to 36Kbits of data. BRAMs can be used in various form factors and can be cascaded to form a larger logical memory structure. Because of the dis- Programming tributed organization of BRAMs, they can provide terabytes of bandwidth for memory bandwidth-intensive applications. Xilinx and Altera dominate the PLD market, collectively holding more for the Masses than an 85% share. FPGAs were long considered low- volume, low-density ASIC replace- ments. Following Moore’s Law, however, FPGAs are getting denser and faster. Modern-day FPGAs can have up to two million logic cells, 68Mbits WHEN LOOKING AT how hardware influences of BRAM, more than 3,000 DSP slices, computing performance, we have general-purpose and up to 96 transceivers for imple- menting multigigabit communication processors (GPPs) on one end of the spectrum and channels.23 The latest FPGA families application-specific integrated circuits (ASICs) on the from Xilinx and Altera are more like a system-on-chip (SoC), mixing dual- other. Processors are highly programmable but often core ARM processors with program- inefficient in terms of power and performance. ASICs mable logic on the same fabric. Cou- implement a dedicated and fixed function and provide pled with higher device density and performance, FPGAs are quickly re- the best power and performance characteristics, placing ASICs and application-specific but any functional change requires a complete (and standard products (ASSPs) for imple- extremely expensive) re-spinning of the circuits. menting fixed function logic. Analysts expect the programmable integrated Fortunately, several architectures exist between circuit (IC) market to reach the $10 bil- these two extremes. Programmable logic devices lion mark by 2016.20 The most perplexing fact is how an (PLDs) are one such example, providing the best of FPGA running at a clock frequency that both worlds. They are closer to the hardware and can is an order of magnitude lower than be reprogrammed. CPUs and GPUs (graphics processing units) is able to outperform them. In The most prominent example of a PLD is a field several classes of applications, espe- programmable gate array (FPGA). It consists of cially floating-point-based ones, GPU performance is either slightly better look-up tables (LUTs), which are used to implement or very close to that of an FPGA. When combinational logic; and flip-flops (FFs), which are it comes to power efficiency (perfor- 56 COMMUNICATIONS OF THE ACM | APRIL 2013 | VOL. 56 | NO. 4 mance per watt), however, both CPUs implement the data and control path, PCIe-attached FPGAs, to low-end em- and GPUs significantly lag behind thereby getting rid of the fetch and bedded systems that integrate embed- FPGAs, as shown in the available lit- decode pipeline. The distributed on- ded processors directly into the FPGA erature comparing the performance of chip memory provides much-needed fabric or on the same chip. CPUs, GPUs, and FPGAs for different bandwidth to satisfy the demands of classes of applications.6,18,19 concurrent logic. The inherent fine- Programming Challenges The contrast in performance be- grained architecture of FPGAs is very Despite the advantages offered by FPGAs tween processors and FPGAs lies in well suited for exploiting various and their rapid growth, use of FPGA the architecture itself. Processors rely forms of parallelism present in the technology is restricted to a narrow on the Von Neumann paradigm where application, ranging from bit-level segment of hardware programmers. an application is compiled and stored to task-level parallelism. Apart from The larger community of software pro- in instruction and data memory. They the conventional reconfiguration ca- grammers has stayed away from this typically work on an instruction and pability where the entire FPGA fabric technology, largely because of the chal- data fetch-decode-execute-store pipe- is programmed with an image before lenges experienced by beginners trying line. This means both instructions execution, FPGAs are also capable of to learn and use FPGAs. and data have to be fetched from an undergoing partial dynamic reconfig- Abstraction. FPGAs are predomi- external memory into the processor uration. This means part of the FPGA nantly programmed using hardware pipeline. Although caches are used to can be loaded with a new image while description languages (HDLs) such as alleviate the cost of expensive fetch op- the rest of the FPGA is functional. Verilog and VHDL. These languages, erations from external memory, each This is similar to the concept of pag- which date back to the 1980s and have K cache miss incurs a severe penalty. ing and virtual memory in the proces- seen few revisions, are very low level in C The bandwidth between processor and sor taxonomy. terms of the abstraction offered to the memory is often the critical factor in Various kinds of FPGA-based sys- user. A hardware designer thinks about HUTTERSTO S / determining the overall performance. tems are available today. They range the design in terms of low-level build- ONES The phenomenon is also known as from heterogeneous systems target- ing blocks such as gates, registers, and J “hitting the memory wall.”21 ed at high-performance computing multiplexors. VHDL and Verilog are TIEN E FPGAs have programmable logic that tightly couple FPGAs with con- well suited for describing a design at H BY H BY P cells that could be used to implement ventional CPUs (for example, Convey that level of abstraction. Writing an ap- an arbitrary logic function both spa- Computers), to midrange commercial- plication at a behavioral level and leav- PHOTOGRA tially and temporally. FPGA designs off-the-shelf workstations that use ing its destiny in the hands of a synthe- APRIL 2013 | VOL. 56 | NO. 4 | COMMUNICATIONS OF THE ACM 57 practice sis tool is commonly considered a bad tronic design automation) tools. It is Then the generated netlist is connect- design practice. largely up to the EDA tool vendors to ed in a structural manner. Figure 2 This is in complete contrast with decide which language features they shows the top-level wrapper consisting the software programming languages, want to support. of structural instantiation of leaf-level which have evolved over the past 60 While support for data types such as modules; namely, a double-precision years. Programming for CPUs enjoys char, int, long, float, and double is in- floating-point multiplier (fp mult) and the benefits of well-established ISAs tegral to all software languages, VHDL adder (fp add). (For brevity, the leaf- (instruction set architectures) and ad- and Verilog have long supported only level modules are not shown here.) vanced compilers that offer a much bit, array of bits, and integer. A recent Verification. Design and verification simpler programming experience. revision to VHDL standard (IEEE 1076- go hand in hand in the hardware world. Object-oriented and polymorphic pro- 2008) introduced fixed-point and float- Verification is easily the largest chunk gramming concepts, as well as auto- ing-point types, but most, if not all, of the overall design-cycle time. It re- matic memory management (garbage synthesis tools do not support these quires a TB (test bench) to be created collection) are no longer seen just as abstractions yet. either in HDL or by using a high-level desirable features but as necessities. A Figure 1 is a C program that con- language such as C, C++, or SystemC higher level of abstraction drastically verts Celsius to Fahrenheit. The lan- that is connected to the design under increases a programmer’s productivity guage comes with standard libraries test (DUT) using the FLI (foreign lan- and reduces the likelihood of bugs, re- for a number of useful functions. To guage interface) extension provided sulting in a faster time to market. implement the same program in VHDL by VHDL and Verilog. The DUT and TB Synthesizability. Another language- or Verilog for FPGA implementation, are simulated using event-based simu- related issue is that only a narrow sub- the user must generate a netlist (pre- lators (for example, Mentor Graphics set of VHDL and Verilog is synthesiz- compiled code) using FPGA vendor- ModelSim and Cadence Incisive Uni- able. To make matters worse, there specific IP (intellectual property) core- fied Simulator). Each signal transac- is no standardization of the features generator tools (for example, Coregen tion is recorded and displayed as a supported by different EDA (elec- for Xilinx and MegaWizard for Altera). waveform. Waveform-based debugging works for small designs but can quickly Figure 1. Celsius to Fahrenheit conversion in C. become tedious as design size grows. Also, the simulation speed decreases 1 #include <stdio.h> very quickly with the increasing design 2 #include <stdlib.h> size. In general, simulations are many 3 orders of magnitudes slower than the 4 double main (int argc, char* argv[]) { real design running in hardware. Of- 5 double celsius = atof(argv[1]); 6 return (9*celsius/5 + 32); ten, large-system designers have to 7 } turn to prototyping (emulation) using FPGAs to speed up design simulation. This is in complete contrast with the way software verification works.

FPGA Programming for the Masses

A Highly Configurable High-Level Synthesis Functional Pattern Library

Richard S. Uhler

Design Automation for Streaming Systems

Review of FPD's Languages, Compilers, Interpreters and Tools

Generating Configurable Hardware from Parallel Patterns

Generative Programming and Dsls in Performance Critical Systems

A Case for Generative Programming and Dsls in Performance Critical Systems

Sixth ACM SIGPLAN International Conference on Functional Programming (ICFP'01)

In-System FPGA Prototyping of an Itanium Microarchitecture

Cλash: Structural Descriptions of Synchronous Hardware Using Haskell

A Survey and Evaluation of FPGA High-Level Synthesis Tools

Pandora: Facilitating IP Development for Hardware Specialization