practice

doi:10.1145/2436256.2436271
Article development led by queue.acm.org

FPGA Programming for the Masses

The programmability of FPGAs must improve if they are to be part of mainstream computing.

By David F. Bacon, Rodric Rabbah, and Sunil Shukla

communications of the acm | April 2013 | vol. 56 | no. 4

When looking at how hardware influences computing performance, we have general-purpose processors (GPPs) on one end of the spectrum and application-specific integrated circuits (ASICs) on the other. Processors are highly programmable but often inefficient in terms of power and performance. ASICs implement a dedicated and fixed function and provide the best power and performance characteristics, but any functional change requires a complete (and extremely expensive) re-spinning of the circuits.

Fortunately, several architectures exist between these two extremes. Programmable logic devices (PLDs) are one such example, providing the best of both worlds. They are closer to the hardware and can be reprogrammed.

The most prominent example of a PLD is a field programmable gate array (FPGA). It consists of look-up tables (LUTs), which are used to implement combinational logic; and flip-flops (FFs), which are used to implement sequential logic. Apart from the homogeneous array of logic cells, an FPGA also contains discrete components such as BRAMs (block RAMs), digital signal processing (DSP) slices, processor cores, and various communication cores (for example, Ethernet MAC and PCIe).

BRAMs, which are specialized memory structures distributed throughout the FPGA fabric in columns, are of particular importance. Each BRAM can hold up to 36Kbits of data. BRAMs can be used in various form factors and can be cascaded to form a larger logical memory structure. Because BRAMs are distributed across the fabric and can be accessed concurrently, they can provide terabytes per second of aggregate bandwidth for memory bandwidth-intensive applications. Xilinx and Altera dominate the PLD market, collectively holding more than an 85% share.

FPGAs were long considered low-volume, low-density ASIC replacements. Following Moore's Law, however, FPGAs are getting denser and faster. Modern-day FPGAs can have up to two million logic cells, 68Mbits of BRAM, more than 3,000 DSP slices, and up to 96 transceivers for implementing multigigabit communication channels [23]. The latest FPGA families from Xilinx and Altera are more like a system-on-chip (SoC), mixing dual-core ARM processors with programmable logic on the same fabric. Coupled with higher device density and performance, FPGAs are quickly replacing ASICs and application-specific standard products (ASSPs) for implementing fixed function logic. Analysts expect the programmable integrated circuit (IC) market to reach the $10 billion mark by 2016 [20].

The most perplexing fact is how an FPGA running at a clock frequency that is an order of magnitude lower than CPUs and GPUs (graphics processing units) is able to outperform them. In several classes of applications, especially floating-point-based ones, GPU performance is either slightly better or very close to that of an FPGA. When it comes to power efficiency (performance per watt), however, both CPUs and GPUs significantly lag behind FPGAs, as shown in the available literature comparing the performance of CPUs, GPUs, and FPGAs for different classes of applications [6, 18, 19].

The contrast in performance between processors and FPGAs lies in the architecture itself. Processors rely on the von Neumann paradigm where an application is compiled and stored in instruction and data memory. They typically work on an instruction and data fetch-decode-execute-store pipeline. This means both instructions and data have to be fetched from an external memory into the processor pipeline. Although caches are used to alleviate the cost of expensive fetch operations from external memory, each cache miss incurs a severe penalty. The bandwidth between processor and memory is often the critical factor in determining the overall performance. The phenomenon is also known as "hitting the memory wall" [21].

FPGAs have programmable logic cells that can be used to implement an arbitrary logic function both spatially and temporally. FPGA designs implement the data and control path directly in logic, thereby getting rid of the fetch and decode pipeline. The distributed on-chip memory provides much-needed bandwidth to satisfy the demands of concurrent logic. The inherent fine-grained architecture of FPGAs is very well suited for exploiting various forms of parallelism present in the application, ranging from bit-level to task-level parallelism. Apart from the conventional reconfiguration capability where the entire FPGA fabric is programmed with an image before execution, FPGAs are also capable of undergoing partial dynamic reconfiguration. This means part of the FPGA can be loaded with a new image while the rest of the FPGA is functional. This is similar to the concept of paging and virtual memory in the processor taxonomy.

Various kinds of FPGA-based systems are available today. They range from heterogeneous systems targeted at high-performance computing that tightly couple FPGAs with conventional CPUs (for example, Convey Computers), to midrange commercial-off-the-shelf workstations that use PCIe-attached FPGAs, to low-end embedded systems that integrate embedded processors directly into the FPGA fabric or on the same chip.

Programming Challenges

Despite the advantages offered by FPGAs and their rapid growth, use of FPGA technology is restricted to a narrow segment of hardware programmers. The larger community of software programmers has stayed away from this technology, largely because of the challenges experienced by beginners trying to learn and use FPGAs.

Abstraction. FPGAs are predominantly programmed using hardware description languages (HDLs) such as Verilog and VHDL. These languages, which date back to the 1980s and have seen few revisions, are very low level in terms of the abstraction offered to the user. A hardware designer thinks about the design in terms of low-level building blocks such as gates, registers, and multiplexors. VHDL and Verilog are well suited for describing a design at that level of abstraction. Writing an application at a behavioral level and leaving its destiny in the hands of a synthesis tool is commonly considered a bad design practice.

This is in complete contrast with software programming languages, which have evolved over the past 60 years. Programming for CPUs enjoys the benefits of well-established ISAs (instruction set architectures) and advanced compilers that offer a much simpler programming experience. Object-oriented and polymorphic programming concepts, as well as automatic memory management (garbage collection), are no longer seen just as desirable features but as necessities. A higher level of abstraction drastically increases a programmer's productivity and reduces the likelihood of bugs, resulting in a faster time to market.

Synthesizability. Another language-related issue is that only a narrow subset of VHDL and Verilog is synthesizable. To make matters worse, there is no standardization of the features supported by different EDA (electronic design automation) tools. It is largely up to the EDA tool vendors to decide which language features they want to support.

While support for data types such as char, int, long, float, and double is integral to all software languages, VHDL and Verilog have long supported only bit, array of bits, and integer. A recent revision to the VHDL standard (IEEE 1076-2008) introduced fixed-point and floating-point types, but most, if not all, synthesis tools do not support these abstractions yet.

Figure 1 is a C program that converts Celsius to Fahrenheit. The language comes with standard libraries for a number of useful functions. To implement the same program in VHDL or Verilog for FPGA implementation, the user must generate a netlist (precompiled code) using FPGA vendor-specific IP (intellectual property) core-generator tools (for example, Coregen for Xilinx and MegaWizard for Altera). Then the generated netlist is connected in a structural manner. Figure 2 shows the top-level wrapper consisting of structural instantiation of leaf-level modules; namely, a double-precision floating-point multiplier (fp_mult) and adder (fp_add). (For brevity, the leaf-level modules are not shown here.)

Figure 1. Celsius to Fahrenheit conversion in C.

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        if (argc < 2) return 1;                /* expects the Celsius value as argv[1] */
        double celsius = atof(argv[1]);
        printf("%f\n", 9 * celsius / 5 + 32);  /* Fahrenheit */
        return 0;
    }

Figure 2. Celsius to Fahrenheit conversion in Verilog.

    module c2f (
        input         clk,
        input         rst,
        input  [63:0] celsius,
        input         input_valid,
        output [63:0] fahrenheit,
        output        result_valid);

        localparam double_9div5 = 64'h3FFCCCCCCCCCCCCD; // 1.8 (= 9/5)
        localparam double_32    = 64'h4040000000000000; // 32.0
        wire [63:0] temp;
        wire        temp_valid; // multiplier result valid; gates the adder

        // temp = 1.8 * celsius
        fp_mult mult_9div5_i (.clk(clk), .rst(rst),
            .in1(celsius), .in2(double_9div5), .in_valid(input_valid),
            .out(temp), .out_valid(temp_valid));

        // fahrenheit = temp + 32
        fp_add add_32_i (.clk(clk), .rst(rst),
            .in1(temp), .in2(double_32), .in_valid(temp_valid),
            .out(fahrenheit), .out_valid(result_valid));

    endmodule

Verification. Design and verification go hand in hand in the hardware world. Verification is easily the largest chunk of the overall design-cycle time. It requires a TB (test bench) to be created either in HDL or by using a high-level language such as C, C++, or SystemC that is connected to the design under test (DUT) using the FLI (foreign language interface) extension provided by VHDL and Verilog. The DUT and TB are simulated using event-based simulators (for example, Mentor Graphics ModelSim and Cadence Incisive Unified Simulator). Each signal transaction is recorded and displayed as a waveform. Waveform-based debugging works for small designs but can quickly become tedious as design size grows. Also, the simulation speed decreases very quickly with increasing design size. In general, simulations are many orders of magnitude slower than the real design running in hardware. Often, large-system designers have to turn to prototyping (emulation) using FPGAs to speed up design simulation. This is in complete contrast with the way software verification works. From simple printf statements to advanced static and dynamic analysis tools, a software programmer has numerous tools and techniques available to debug and verify programs in a much simpler yet powerful manner.

Design and tool flow. Apart from the language-related challenges, programming for FPGAs requires the use of complex and time-consuming EDA tools. Synthesis time (the time taken to generate the bitstreams used to program the FPGA from source code) can vary anywhere from a few minutes to a few days depending on the code size and complexity, target FPGA, synthesis tool options, and design constraints.

Figure 3 shows the design flow for Xilinx FPGAs [22]. After functional verification, the source code, consisting of VHDL, Verilog, and netlists, is fed to the synthesis tool. The synthesis phase compiles the code and produces a netlist (.ngc). The translate phase merges all the synthesized netlists and the physical and timing constraints to produce a native generic database (NGD) file. The map phase groups logical symbols from the netlists into physical components such as LUTs, FFs, BRAMs, DSP slices, and IOBs (input/output blocks), among others. The output is stored in native circuit description (NCD) format and contains information about switching delays. The map phase results in an error if the design exceeds the available resources or the user-specified timing constraints are violated by the switching delay itself. The PAR (place-and-route) phase performs the placement and routing of the mapped symbols on the actual FPGA device. For some FPGA devices, placement is done in the map phase. PAR stores the output in NCD format and notifies the user of timing errors if the total delay (switching + routing) exceeds the user-specified timing constraints.

Figure 3. Design flow for Xilinx FPGAs. [Flowchart: functional simulation of the source code (VHDL, Verilog, netlists); synthesis (xst); translate (ngdbuild), which also takes the physical and timing constraints; map (map); place-and-route (par); timing analysis (trce); bitstream generation (bitgen); FPGA programming (impact).]

Following PAR, timing analysis is performed to generate a detailed timing report consisting of all delays for different paths. Users can refer to the timing analysis report to determine which critical paths fail to meet the timing requirements and then fix them in the source code. Most synthesis errors mean the user must go back to the source code and repeat the whole process. Once the design meets the resource and timing constraints, a bitstream can be generated. The bitgen tool takes in a completely routed design and generates a configuration bitstream that can be used to program the FPGA. Finally, the design can be downloaded to the FPGA using the Impact tool.

The design flow described in Figure 3 varies from one FPGA vendor to another. Moreover, to make good use of the synthesis tool, the user must have good knowledge of the target FPGA architecture because several synthesis options are tied to the architecture. The way these options are tuned ultimately decides whether the design will meet resource and timing constraints.

Design portability is another big challenge because porting a design from one FPGA to another often requires re-creation of all the FPGA device-specific IP cores. Migrating designs from one FPGA vendor to another is an even bigger challenge.

High-Level Synthesis

High-level synthesis (HLS) refers to the conversion of an algorithmic description of an application into either a low-level register-transfer level (RTL) description or a digital circuit. Over the past decade or so, several languages and compiler frameworks were proposed to ease the pain of programming FPGAs. They all aim to raise the level of abstraction at which the user writes a program, which is then compiled down to VHDL or Verilog. Most of these frameworks can be classified into five categories: HDL-like languages; C-based frameworks; Compute Unified Device Architecture (CUDA)/Open Computing Language (OpenCL)-based frameworks; high-level language-based frameworks; and model-based frameworks.

HDL-like languages. Although only minute incremental changes have been made to VHDL and Verilog since they were proposed, SystemVerilog represents a big jump. It has some semantic similarity to Verilog, but it is more dissimilar than similar. SystemVerilog consists of two components: the synthesizable component that extends and adds several features to the Verilog-2005 standard; and the verification component that heavily uses an object-oriented model and is more akin to object-oriented languages such as Java than it is to Verilog. SystemVerilog substantially improves on the type system and parametrizability of Verilog. Unfortunately, EDA tools offer limited support for the language today: the Xilinx synthesis tool does not support SystemVerilog, and the Altera synthesis tool supports a small subset.

Another notable mention in this category is BSV (Bluespec SystemVerilog), which incorporates concepts such as modules, module instances, and module hierarchies just like Verilog and VHDL, and follows the SystemVerilog model of interfaces to communicate between modules. In Bluespec, module behavior is expressed using guarded atomic actions that are basically concurrent finite state machines (FSMs), a concept well known to hardware programmers. A program written in Bluespec is passed on to the Bluespec compiler, which generates an efficient RTL description. A 2011 article in acmqueue provides a comprehensive description of the language and compiler [14].

C-based frameworks. Most of the early work in the field of HLS was done on C-based languages such as C, C++, and SystemC. Most of these frameworks restrict the programmer to only a small subset of the parent language. Features such as pointers, recursive functions, and dynamic memory allocations are prohibited. Some of these frameworks require users to annotate the program extensively before compilation. A comprehensive list of C-based frameworks can be found in João M.P. Cardoso and Pedro C. Diniz's book, Compilation Techniques for Reconfigurable Architectures [7].

Interestingly, all the EDA vendors have developed or acquired tool suites to enable high-level synthesis. Their HLS frameworks (for example, Cadence C-to-Silicon Compiler, Synopsys Synphony C Compiler, Mentor Graphics Catapult C, and Xilinx Vivado), with the exception of the one proposed by Altera, are based on C-like languages. The following section looks at the Xilinx Vivado HLS tool.
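Before turning to Vivado, the restrictions that C-based frameworks impose can be made concrete with a short sketch. The function below is illustrative only and is not taken from any vendor's documentation: the names moving_sum, TAPS, and SAMPLES are hypothetical, and whether a particular tool accepts this exact code is tool-specific. It simply shows the style such a subset encourages: fixed-size arrays instead of dynamic allocation, loops with compile-time bounds instead of recursion, and array indexing instead of pointer arithmetic.

```c
#include <stddef.h>

#define TAPS    4   /* hypothetical window length, fixed at compile time */
#define SAMPLES 8   /* hypothetical buffer length, fixed at compile time */

/* Moving sum over the last TAPS inputs (illustrative sketch).
 * No malloc/free, no recursion, no pointer arithmetic, and every
 * loop has a trip count known at compile time. */
void moving_sum(const int in[SAMPLES], int out[SAMPLES]) {
    int window[TAPS] = {0};   /* small fixed-size buffer, like a shift register */

    for (size_t n = 0; n < SAMPLES; n++) {
        /* Shift the window by one position and insert the new sample. */
        for (size_t k = TAPS - 1; k > 0; k--) {
            window[k] = window[k - 1];
        }
        window[0] = in[n];

        /* Accumulate the current window contents. */
        int acc = 0;
        for (size_t k = 0; k < TAPS; k++) {
            acc += window[k];
        }
        out[n] = acc;
    }
}
```

Calling moving_sum with the input {1, 2, 3, 4, 5, 6, 7, 8} fills the output with the running four-sample sums, starting 1, 3, 6, 10. Because the bounds are static, a compiler can in principle unroll the inner loops and size the storage exactly, which is much harder when sizes are only known at run time.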


Xilinx Vivado (previously AutoPilot). In 2011 Xilinx acquired the AutoPilot framework [5, 8] developed by AutoESL and is now offering it as a part of the Vivado Design Suite. It supports compilation from a behavioral description written in C/C++/SystemC to RTL. The behavioral synthesis consists of four phases: compilation and elaboration; advanced code transformation; core behavioral and communication synthesis; and microarchitecture generation.

The behavioral code in C/C++/SystemC is parsed and compiled by a GNU Compiler Collection (GCC)-compatible compiler front end. Several hardware-oriented features are added to the front end, such as bit-width optimization so that each data type uses exactly the number of bits required to represent it. For SystemC designs, an elaboration phase extracts processes, ports, and other interconnect information, and it forms a synthesis data model.

The synthesis data model undergoes several compiler optimizations (for example, constant propagation, dead-code elimination, and bit-width analysis) in the advanced code transformation phase.

In the core behavioral and communication synthesis phase, Vivado takes into account the user-specified constraints (for example, frequency, area, throughput, latency, and physical location constraints) and the target FPGA device architecture to do scheduling and resource binding.

In the microarchitecture generation phase, the compiler backend produces VHDL/Verilog RTL together with the synthesis constraints, which can be passed as an input to the synthesis tool. The tool also generates an equivalent description in SystemC, which can be used for verification and equivalence checking of the generated RTL code.

CUDA/OpenCL-based frameworks. The popularity of GPUs for general-purpose computing soared with the release of the CUDA framework by Nvidia in 2006. This framework consists of language and runtime extensions to the C programming language and driver-level APIs for communication between the host and GPU device. With CUDA, researchers and developers can exploit massive SIMD (single instruction, multiple data)-style architectural parallelism for running computationally intensive tasks. In 2008, OpenCL was made public as an open standard for parallel programming on heterogeneous systems. OpenCL supports execution on targets such as CPUs and GPUs. To leverage the popularity of CUDA and OpenCL, a number of compiler frameworks have been proposed to convert CUDA [16] and OpenCL [3, 13, 15] code to VHDL/Verilog. One of these products is the Altera OpenCL framework.

Altera OpenCL-to-FPGA framework. Altera proposed a framework for synthesizing bitstreams from OpenCL [3]. Figure 4 shows the Altera OpenCL-to-FPGA framework for compiling an OpenCL program to Verilog for co-execution on a host and Altera FPGA. The framework consists of a kernel compiler, host library, and system integration component. The kernel compiler is based on the open source LLVM compiler infrastructure and synthesizes OpenCL kernel code into hardware. The host library provides APIs and bindings for establishing communication between the host part of the application running on the processor and the kernel part of the application running on the FPGA. The system integration component wraps the kernel code with memory controllers and a communication interface (such as PCIe).

Figure 4. Altera OpenCL-to-FPGA framework [3]. [Block diagram: kernel.cl passes through the kernel compiler (C-language front end, live-value analysis, CDFG generation, scheduling, RTL generator) to produce Verilog HDL, which goes to system integration (QSYS/Quartus); host.c is compiled by a C compiler against the ACL host library, with auto discovery, to produce program.exe.]

The user application consists of a host component written in C/C++ and a set of kernels written in OpenCL. The OpenCL kernels are compiled using the open source LLVM infrastructure. The parser completes the first step in the compilation, producing an intermediate representation (IR) consisting of basic blocks connected by control-flow edges.

A live variable analysis determines the input and output of each basic block. A basic block consists of three types of nodes: the merge node is responsible for aggregating data from previous nodes; the operational node represents load, store, and execute instructions; and the branch node selects one thread successor among many basic blocks.

After live variable analysis, the IR is optimized to generate a control-data flow graph (CDFG). RTL for each basic block is generated from the CDFG. An execution schedule is calculated for each node in the CDFG using the system of difference constraints (SDC) scheduling algorithm. The scheduler uses linear equations to schedule instructions while minimizing the cost function.

The hardware logic generated by the kernel compiler for the nodes in the CDFG is augmented with stall, valid, and data signals that implement the control edges of the CDFG. The resultant Verilog code is wrapped with the communication (for example, PCIe) and memory infrastructure to interact with the host and the off-chip memory, respectively. A template-based design methodology is used where the application-independent communication and memory infrastructure is locked down as a static part of the system and only the application-dependent kernel part is synthesized. This hierarchical, partition-based approach reduces synthesis time. The design is then synthesized to generate a bitstream.

Altera also provides a host library consisting of APIs to bind the host function calls to the OpenCL kernels implemented in the FPGA. The host-side code written in C/C++ is linked to these libraries, enabling runtime communication and synchronization between the application code running on the host processor and the OpenCL kernels running on the FPGA. The host library also includes the Auto-Discovery module, which allows the host program to query and detect the types of kernels running on the FPGA.

High-level language-based frameworks. Because of the popularity of modern programming languages, several recently proposed frameworks use high-level languages as starting points for generating hardware circuits. The languages in this category are highly abstract, usually object-oriented, and offer high-level features such as polymorphism and automatic memory management. Some of the noteworthy languages and frameworks are Esterel [11], Kiwi [12], Chisel [4], and IBM's Liquid Metal [1].

Liquid Metal. The Liquid Metal project at IBM aims to provide not only high-level synthesis for FPGAs [1], but also, more broadly, a single and unified language for programming heterogeneous architectures. The project offers a language called Lime [2], which can be used to program hardware (FPGAs), as well as software running on conventional CPUs and more exotic architectures such as GPUs [10]. The Lime language was founded on principles that eschew the complicated program analysis that plagues C-based frameworks, while also offering powerful programming models that espouse the benefits of functional and stream-oriented programming.

Notable features of the Lime language include the ability to express bit literals and computation at the granularity of individual bits, just as in HDLs, but with the power of high-level abstractions and object-oriented programming. These permit, for example, the definition of generic classes that are parameterized by bit width, polymorphic methods, and overloaded operators.

Lime is Java compatible and hence strongly typed. The language adds features to aid the compiler in deriving important properties for efficient behavioral synthesis into HDL. These include instantiation-based generics, bounded arrays, immutable types, and localized side effects that are guaranteed by the type-checker and exploited by the compiler for the purpose of generating efficient and pipelined circuits.

Another important feature of the Lime language is a task-based data-flow programming model that allows for program partitioning at task granularity, often resembling a block-level diagram of an architecture or FPGA circuit.

A task in Lime can be strongly isolated such that it cannot access mutable global program state (because doing so is prohibitively expensive in an FPGA), and no external references to the task exist, so that its state cannot be mutated except from within the task, if at all. Task graphs result from connecting tasks together (using a first-class connect operator), so that the output of one becomes the input to another.

Figure 5 shows the Liquid Metal compilation and runtime architecture. The front-end compiler first translates the Lime program into an intermediate representation that describes the task graph. Each of the compiler backends is a vertically integrated tool chain that not only compiles the code for the intended architecture, but also generates an executable artifact suitable for the architecture. For example, the Verilog backend, which synthesizes HDL from Lime, will also automatically invoke the EDA tools for the target FPGA to generate a bit file that can be loaded onto the FPGA. Each backend provides a set of exclusion rules that restrict the Lime language to a synthesizable subset. Some notable exclusions are the use of unbounded arrays and non-final classes. Dynamic memory allocation is restricted to final fields of objects, and recursive method calls may have no more than one argument. These rules are expected to change over time as the compilation technology matures.

Figure 5. Liquid Metal compilation and runtime architecture. [Layered diagram: a Lime task (the source shows the fragment "public static int doWork(int[[]] values) { int acc = 0; for (int i=0; i", truncated) enters the front end; the backend toolchains target Java, OpenCL, and Verilog; the generated code artifacts include bytecode; the runtime layer consists of the JVM and the Lime runtime; the hardware layer consists of CPU, GPU, and FPGA.]

Lime code is always executable on at least one architecture: the JVM (Java Virtual Machine). As such, functional verification of Lime programs may be carried out entirely in software using high-level debugging tools, including the Eclipse-based Lime IDE. Tasks that are mapped to the FPGA may co-execute with tasks mapped to other parts of the system (for example, GPU or JVM), and the communication throughout the system is orchestrated by the Lime runtime.

The language allows programmers to seamlessly integrate native HDL code into their Lime code via the Lime Native Interface. This allows the programmer to implement timing- and performance-critical components of the application in HDL and the rest of the application in Lime in order to meet performance specifications that may not be readily achievable through current behavioral synthesis technology.

Model-based frameworks provide an abstract way of designing complex control- and signal-processing systems. They improve design quality and accelerate design and verification tasks by using an executable specification. The executable specification facilitates hardware-software partitioning, verification, and rapid design iterations. NI (National Instruments) LabView [9] and Matlab HDL Coder [17] are two model-based frameworks. They are especially useful for designers with strong domain expertise but little experience with software/hardware programming languages.

NI LabView is a graphical programming and design language. Using functional blocks and interconnects, designers can create a graphical program that looks like their whiteboard designs. LabView provides an integrated environment that consists of a graphical editor for development, debugger, compilation framework, and middleware for execution on diverse targets such as desktop processors, PowerPC, ARM processors, and a number of FPGAs. With LabView, the user can start development and testing on one platform and then incrementally migrate to another platform.

Matlab HDL Coder. Matlab is a high-level language and interactive environment for numerical and scientific computation and system modeling. It provides a number of tools and function-rich libraries that can be used for quickly implementing algorithms and modeling systems for a wide range of application domains. Simulink, a part of the Matlab product suite, provides a block-diagram-based graphical environment for system simulation and model-based design. Simulink is basically a collection of libraries (toolboxes) for different application domains. HDL Coder is one such toolbox that generates synthesizable VHDL and Verilog code from Matlab functions and Simulink models. Users can model their systems in Simulink using library blocks and Matlab functions and then easily test the models against the functional specifications by creating a test bed using the same modular approach. After verification, users can generate bit-optimized and cycle-accurate synthesizable RTL code.

Future Directions

The era of frequency scaling in microprocessor design is largely believed to be over, and programmers are increasingly turning to heterogeneous architectures in search of better performance.

We have already started to see the integration of general-purpose processors and GPUs on the same die. The diversity is expected to increase in the future to include ASSPs, and even programmable logic (for example, FPGAs). The key is tighter integration of this diverse set of compute elements. The time is not far off when CPUs, GPUs, FPGAs, and other ASSPs will be integrated on the same chip. There are several standards and specifications for co-programming CPUs and GPUs, but FPGAs have been isolated so far, primarily because of the lack of support for device drivers, programming languages, and tools.

There is a serious need to improve the programming aspect of FPGAs if they are to join mainstream heterogeneous computing. Programming technology for FPGAs has lagged behind [...] quite expensive and prohibit thorough design-space exploration.

HLS presents a promising direction; however, even after a few decades of research in the area, FPGA programming practices continue to be fragmented and challenging, placing a high engineering burden on researchers and developers. Some problems arise because too many efforts are going on simultaneously without any standardization, yet none provides a comprehensive ecosystem to foster adoption and growth. Many of the ongoing endeavors are centered on C-based languages. Only time will tell if C is the right abstraction for the future, since it was designed for a sequential execution model rooted in von Neumann architectures, while existing design trends suggest increasingly heterogeneous and parallel architectures.

We also need to revisit the approach of compiling a high-level language to VHDL/Verilog and then using vendor synthesis tools to generate bitstreams. If FPGA vendors open up FPGA architecture details, third parties could develop new tools that compile a high-level description directly to a bitstream without going through the intermediate step of generating VHDL/Verilog. This is attractive because current synthesis times are too long to be acceptable in the mainstream.

Related articles on queue.acm.org

Computing without Processors
Satnam Singh
http://queue.acm.org/detail.cfm?id=2000516

Of Processors and Processing
Gene Frantz, Ray Simar
http://queue.acm.org/detail.cfm?id=984478

Abstraction in Hardware System Design
Rishiyur S. Nikhil
http://queue.acm.org/detail.cfm?id=2020861

References

1. Auerbach, J., Bacon, D.F., Burcea, I., Cheng, P., Fink, S.J., Rabbah, R. and Shukla, S. A compiler and runtime for heterogeneous computing. In Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference (2012), 271–276.
2. Auerbach, J., Bacon, D.F., Cheng, P. and Rabbah, R. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of [...]
3. [...] Conference on Field Programmable Logic and Applications (2012), 531–534.
4. Bachrach, J., Richards, B., Vo, H., Lee, Y., Waterman, A., Avižienis, R., Wawrzynek, J. and Asanovic, K. Chisel: Constructing hardware in a Scala-embedded language. In Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference (2012), 1212–1221.
5. Berkeley Design Technology. An independent evaluation of the AutoESL AutoPilot high-level synthesis tool. Technical Report, 2010.
6. Brodtkorb, A.R., Dyken, C., Hagen, T.R., Hjelmervik, J.M. and Storaasli, O.O. State-of-the-art in heterogeneous computing. Scientific Programming 18, 1 (2010).
7. Cardoso, J. and Diniz, P. Compilation Techniques for Reconfigurable Architectures. Springer, 2009.
8. Coussy, P. and Morawiec, A. High-level Synthesis: From Algorithm to Digital Circuit. Springer, 2008.
9. Dase, C., Falcon, J.S. and MacCleery, B. Motorcycle control prototyping using an FPGA-based embedded control system. IEEE Control Systems 26, 5 (2006), 17–21.
10. Dubach, C., Cheng, P., Rabbah, R., Bacon, D.F. and Fink, S.J. Compiling a high-level language for GPUs: (via language support for architectures and compilers). In 33rd SIGPLAN Symposium for Programming Design and Implementation (2012), 1–12.
11. Edwards, S.A. High-level synthesis from the synchronous language Esterel. In IEEE/ACM International Workshop on Logic & Synthesis (2002), 401–406.
12. Greaves, D. and Singh, S. Designing application-specific circuits with concurrent C# programs. In Proceedings of the 8th ACM/IEEE International Conference on Formal Methods and Models for Codesign (2010).
13. Jaaskelainen, P.O., de La Lama, C.S., Huerta, P. and Takala, J.H. OpenCL-based design methodology for application-specific processors. Embedded Computer Systems (2010), 223–230.
14. Nikhil, R.S. Abstraction in hardware system design. ACM Queue 9, 8 (2011); http://queue.acm.org/detail.cfm?id=2020861.
15. Owaida, M., Bellas, N., Daloukas, K. and Antonopoulos, C. Synthesis of platform architectures from OpenCL programs. In Field-programmable Custom Computing Machines (2012), 186–193.
16. Papakonstantinou, A., Karthik, G., Stratton, J.A., Chen, D., Cong, J. and Hwu, W.-M.W. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Application Specific Processors (2009), 35–42.
17. Sharma, S. and Chen, W. Using model-based design to accelerate FPGA development for automotive applications. The MathWorks, 2009.
18. Sirowy, S. and Forin, A. Where's the beef? Why FPGAs are so fast. Microsoft Research Technical Report MSR-TR-2008-130, 2008.
19. Thomas, D.B., Howes, L. and Luk, W. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (2009), 22–24.
20. WinterGreen Research Inc. Programmable logic IC market shares and forecasts, worldwide, 2010 to 2016. Technical report, 2010.
21. Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. SIGARCH Computer Architecture News 23, 1 (1995), 20–24.
22. Xilinx. Command line tools user guide. Technical Report UG628 (14.3), 2012.
23. Xilinx. 7 series FPGAs overview. Technical Report DS180 (1.13), 2012.

David Bacon is a research staff member at the IBM T.J. Watson Research Center. His research interests are in programming-language design and implementation, and he is currently working on the Liquid Metal project.

Rodric Rabbah is a research staff member at the IBM T.J. Watson Research Center. He is interested in programming languages, compilers, and architectures for heterogeneous computing. He has worked in these areas since the founding of the Liquid Metal project at IBM in 2007.

Sunil Shukla is a research staff member at the IBM T.J. Watson Research Center.
He is currently working on the the advancements in semiconductor the ACM International Conference on Object-oriented Liquid Metal project, which provides a single-language technology. As discussed here, HDLs Programming Systems Languages and Applications solution to programming heterogeneous architectures (2010), 89–108. (CPU, GPU, FPGA). are too low level. Significant design- 3. aydonat, U., Denisenko, D., Freeman, J., Kinsner, M., Neto, D., Wong, J., Yiannacouras, P. and Singh, cycle time must be spent in coding D.P. From OpenCL to high-performance hardware and verification. Design iterations are on FPGAs. In Proceedings of the 22nd International © 2013 ACM 0001-0782/13/04

april 2013 | vol. 56 | no. 4 | communications of the acm 63