
High-Level Synthesis

An Introduction to High-Level Synthesis

Philippe Coussy, Université de Bretagne-Sud, Lab-STICC
Michael Meredith, Forte Design Systems

Daniel D. Gajski, University of California, Irvine
Andres Takach

Editor's note:
High-level synthesis raises the design abstraction level and allows rapid generation of optimized RTL hardware for performance, area, and power requirements. This article gives an overview of state-of-the-art HLS techniques and tools.
Tim Cheng, Editor in Chief

THE GROWING CAPABILITIES of silicon technology and the increasing complexity of applications in recent decades have forced design methodologies and tools to move to higher abstraction levels. Raising the abstraction levels and accelerating automation of both the synthesis and the verification processes have for this reason always been key factors in the evolution of the design process, which in turn has allowed designers to explore the design space efficiently and rapidly.

In the software domain, for example, machine code (binary sequence) was once the only language that could be used to program a computer. In the 1950s, the concept of assembly language (and assembler) was introduced. Finally, high-level languages (HLLs) and associated compilation techniques were developed to improve software productivity. HLLs, which are platform independent, follow the rules of human language with a grammar, a syntax, and a semantics. They thus provide flexibility and portability by hiding details of the computer architecture. Assembly language is today used only in limited scenarios, primarily to optimize the critical parts of a program when there is an absolute need for speed, code compactness, or both. However, with the growing complexity of both modern system architectures and software applications, using HLLs and compilers clearly generates better overall results. No one today would even think of programming a complex software application solely by using an assembly language.

In the hardware domain, specification languages and design methodologies have evolved similarly.[1,2] For this reason, until the late 1960s, ICs were designed, optimized, and laid out by hand. Simulation at the gate level appeared in the early 1970s, and cycle-based simulation became available by 1979. Techniques introduced during the 1980s included place-and-route, schematic circuit capture, formal verification, and static timing analysis. Hardware description languages (HDLs), such as Verilog (1986) and VHDL (1987), have enabled wide adoption of simulation tools. These HDLs have also served as inputs to logic synthesis tools, leading to the definition of their synthesizable subsets. During the 1990s, the first generation of commercial high-level synthesis (HLS) tools became available.[3,4] Around the same time, research interest in hardware-software codesign (including estimation, exploration, partitioning, interfacing, communication, synthesis, and cosimulation) gained momentum.[5] The concept of IP core and platform-based design started to emerge.[6-8] In the 2000s, there has been a shift to an electronic system-level (ESL) paradigm that facilitates exploration, synthesis, and verification of complex SoCs.[9] This includes the introduction of languages with system-level abstractions, such as SystemC (http://www.systemc.org), SpecC (http://www.cecs.uci.edu/~specc), or SystemVerilog (IEEE 1800-2005; http://standards.ieee.org), and the introduction of transaction-level modeling (TLM). The ESL paradigm shift caused by the rise of system complexities, a multitude of components in a product (hundreds of processors in a car, for instance),

0740-7475/09/$26.00 © 2009 IEEE. Copublished by the IEEE CS and the IEEE CASS. IEEE Design & Test of Computers.


a multitude of versions of a chip (for better product differentiation), and an interdependency of component suppliers forced the market to focus on hardware and software productivity, dependability, interoperability, and reusability. In this context, processor customization and HLS have become necessary paths to efficient ESL design.[10] The new HLS flows, in addition to reducing the time for creating the hardware, also help reduce the time to verify it as well as facilitate other flows such as power analysis.

Raising the hardware design's abstraction level is essential to evaluating system-level exploration for architectural decisions such as hardware and software design, synthesis and verification, memory organization, and power management. HLS also enables reuse of the same high-level specification, targeted to accommodate a wide range of design constraints and ASIC or FPGA technologies.

Typically, a designer begins the specification of an application that is to be implemented as a custom processor, dedicated coprocessor, or any other custom hardware unit (such as an interrupt controller, bridge, arbiter, interface unit, or special function unit) with a high-level description capture of the desired functionality, using an HLL. This first step thus involves writing a functional specification (an untimed description) in which a function consumes all its input data simultaneously, performs all computations without any delay, and provides all its output data simultaneously. At this abstraction level, variables (structure and array) and data types (typically floating point and integer) are related neither to the hardware design domain (bits, bit vectors) nor to the embedded software. Realistic hardware implementation thus requires conversion of floating-point and integer data types into bit-accurate data types of specific length (not a standard byte or word size, as in software) with acceptable computation accuracy, while generating an optimized hardware architecture starting from this bit-accurate specification.

HLS tools transform an untimed (or partially timed) high-level specification into a fully timed implementation.[10-13] They automatically or semiautomatically generate a custom architecture to efficiently implement the specification. In addition to the memory banks and the communication interfaces, the generated architecture is described at the RTL and contains a data path (registers, multiplexers, functional units, and buses) and a controller, as required by the given specification and the design constraints.

Figure 1. High-level synthesis (HLS) design steps. (Flow shown: specification; compilation; formal model; allocation, scheduling, and binding, drawing on a component library; generation; RTL architecture; logic synthesis.)

Key concepts

Starting from the high-level description of an application, an RTL component library, and specific design constraints, an HLS tool executes the following tasks (see Figure 1):

1. compiles the specification,
2. allocates hardware resources (functional units, storage components, buses, and so on),
3. schedules the operations to clock cycles,
4. binds the operations to functional units,
5. binds variables to storage elements,
6. binds transfers to buses, and
7. generates the RTL architecture.

Tasks 2 through 6 are interdependent, and for a designer to achieve the optimal solution, they would ideally be optimized in conjunction. To handle real-world designs, however, the tasks are commonly executed in sequence to manage the computational complexity of synthesis. The particular order of some of the synthesis tasks, as well as a measure of how well the interdependencies are estimated and accounted for, significantly impacts the generated design's quality. More details are available elsewhere.[10-13]

Compilation and modeling

HLS always begins with the compilation of the functional specification. This first step transforms the input description into a formal representation.
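Tasks 2 through 4 of the list above (allocation, scheduling, and functional-unit binding) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the algorithm of any particular HLS tool: the graph `DFG`, the functions `list_schedule` and `latency`, and the unit names are hypothetical; every operation is assumed to take exactly one clock cycle, chaining is not modeled, and scheduling priority is simply the order in which operations are listed.

```python
# Hypothetical data flow graph for: y = (a * b) + (c * d) + e
# Each entry maps an operation to (operation type, predecessor operations).
DFG = {
    "t1": ("mul", []),            # t1 = a * b
    "t2": ("mul", []),            # t2 = c * d
    "t3": ("add", ["t1", "t2"]),  # t3 = t1 + t2
    "y":  ("add", ["t3"]),        # y  = t3 + e
}


def list_schedule(dfg, allocation):
    """Resource-constrained list scheduling (assumes single-cycle operations).

    allocation maps an operation type to the number of allocated functional
    units of that type. Returns {operation: (cycle, bound_unit_name)}.
    """
    schedule = {}
    cycle = 0
    while len(schedule) < len(dfg):
        free = dict(allocation)   # units still unused in the current cycle
        started = {}
        for op, (kind, preds) in dfg.items():
            if op in schedule:
                continue
            # An operation is ready once all predecessors finished earlier;
            # operations started in this same cycle do not count (no chaining).
            ready = all(p in schedule for p in preds)
            if ready and free.get(kind, 0) > 0:
                unit = f"{kind}{allocation[kind] - free[kind]}"  # binding
                free[kind] -= 1
                started[op] = (cycle, unit)
        if not started:
            raise ValueError("no progress: missing unit type or cyclic graph")
        schedule.update(started)
        cycle += 1
    return schedule


def latency(schedule):
    """Number of clock cycles used by the schedule."""
    return 1 + max(cycle for cycle, _ in schedule.values())


if __name__ == "__main__":
    for alloc in ({"mul": 1, "add": 1}, {"mul": 2, "add": 1}):
        result = list_schedule(DFG, alloc)
        print(alloc, result, "latency =", latency(result))
```

In this toy setting, allocating a second multiplier lets both multiplications start in the same cycle and shortens the schedule by one cycle, which hints at why the allocation, scheduling, and binding decisions are interdependent.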

July/August 2009 9


This first step traditionally includes several code optimizations such as dead-code elimination, false data dependency elimination, and constant folding and loop transformations.

The formal model produced by the compilation classically exhibits the data and control dependencies between the operations. Data dependencies can be easily represented with a data flow graph (DFG) in which every node represents an operation and the arcs between the nodes represent the input, output, and temporary variables.[12] A pure DFG models data dependencies only. In some cases, it is possible to get this model by removing the control dependencies of the initial specification from the model at compile time. To do so, loops are completely unrolled by converting to noniterative code blocks, and conditional assignments are resolved by creating multiplexed values. The resulting DFG explicitly exhibits all the intrinsic parallelism of the specification. However, the required transformations can lead to a large formal representation that requires considerable memory to be stored during synthesis. Moreover, this representation does not support loops with unbounded iteration count and nonstatic control statements such as goto. This limits the use of pure DFG representations to a few applications.

The DFG model has been extended by adding control dependencies: the control and data flow graph (CDFG).[12-15] A CDFG is a directed graph in which the edges represent the control flow. The nodes in a CDFG are commonly referred to as basic blocks and are defined as a straight-line sequence of statements that contain no branches or internal entrance or exit points. Edges can be conditional to represent if and switch constructs. A CDFG exhibits data dependencies inside basic blocks and captures the control flow between those basic blocks.

CDFGs are more expressive than DFGs because they can represent loops with unbounded iterations. However, the parallelism is explicit only within basic blocks, and additional analysis or transformations are required to expose parallelism that might exist between basic blocks. Such transformations include, for example, loop unrolling, loop pipelining, loop merging, and loop tiling. These techniques, by revealing the parallelism between loops and between loop iterations, are used to optimize the latency or the throughput and the size and number of memory accesses. These transformations can be realized automatically,[14] or they can be user-driven.[10] In addition to control dependencies, data dependencies between basic blocks can be added to the CDFG model as shown in the hierarchical task graph representation used in the SPARK tool.[14,16]

Allocation

Allocation defines the type and the number of hardware resources (for instance, functional units, storage, or connectivity components) needed to satisfy the design constraints. Depending on the HLS tool, some components may be added during scheduling and binding tasks. For example, the connectivity components (such as buses or point-to-point connections among components) can be added before or after binding and scheduling tasks. The components are selected from the RTL component library. It's important to select at least one component for each operation in the specification model. The library must also include component characteristics (such as area, delay, and power) and its metrics to be used by other synthesis tasks.

Scheduling

All operations required in the specification model must be scheduled into cycles. In other words, for each operation such as a = b op c, variables b and c must be read from their sources (either storage components or functional-unit components) and brought to the input of a functional unit that can execute operation op, and the result a must be brought to its destinations (storage or functional units). Depending on the functional component to which the operation is mapped, the operation can be scheduled within one clock cycle or scheduled over several cycles. Operations can be chained (the output of an operation directly feeds an input of another operation). Operations can be scheduled to execute in parallel provided there are no data dependencies between them and there are sufficient resources available at the same time.

Binding

Each variable that carries values across cycles must be bound to a storage unit. In addition, several variables with nonoverlapping or mutually exclusive lifetimes can be bound to the same storage units. Every operation in the specification model must be bound to one of the functional units capable of executing the operation. If there are several


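Figure 2. Typical architecture. (Shown: a controller with a state register (SR), next-state logic, and output logic, fed by control inputs and producing control outputs; a data path with an RF/scratch pad, ALU, MUL, memory, and Buses 1 through 3; control signals run from the controller to the data path, and status signals run from the data path to the controller.)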
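The controller and data path organization of Figure 2 can be mimicked in a few lines of software. The following Python sketch is only a behavioral illustration under stated assumptions: the three-entry register file, the two-operation ALU, the control-word encoding `(alu_op, src_a, src_b, dest)`, the zero-flag status signal, and all names (`CONTROL_WORDS`, `next_state_logic`, `run`) are hypothetical and not taken from any tool. The toy machine sums the integers n, n-1, ..., 1.

```python
# Data path functional unit: a hypothetical two-operation ALU.
ALU = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}

# Output logic as a table: present state -> control word
# (alu_op, source register A, source register B, destination register).
CONTROL_WORDS = {
    0: ("add", 0, 1, 0),  # RF[0] = RF[0] + RF[1]   (acc = acc + n)
    1: ("sub", 1, 2, 1),  # RF[1] = RF[1] - RF[2]   (n = n - 1)
}
DONE = 2  # terminal state


def next_state_logic(state, zero_flag):
    """Compute the next state from the present state and the status signal."""
    if state == 0:
        return 1
    # In state 1 the ALU has just decremented n: stop when it reached zero.
    return DONE if zero_flag else 0


def run(n):
    """Simulate the controller and data path; returns (sum, cycles used)."""
    if n < 1:
        raise ValueError("this toy example assumes n >= 1")
    rf = [0, n, 1]   # register file: accumulator, counter n, constant 1
    state = 0        # state register (SR)
    cycles = 0
    while state != DONE:
        op, a, b, dest = CONTROL_WORDS[state]       # control signals
        result = ALU[op](rf[a], rf[b])              # operands travel to the ALU
        rf[dest] = result                           # result written back
        zero_flag = result == 0                     # status signal
        state = next_state_logic(state, zero_flag)  # load SR for next cycle
        cycles += 1
    return rf[0], cycles


if __name__ == "__main__":
    print(run(4))
```

Each loop iteration stands for one clock cycle: the output logic selects a control word, the data path executes one register transfer, and the status signal feeds back into the next-state logic.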

units with such capability, the binding algorithm must optimize this selection. Storage and functional-unit binding also depend on connectivity binding, which requires that each transfer from component to component be bound to a connection unit such as a bus or a multiplexer (see, for example, http://www-labsticc.univ-ubs.fr/www-gaut/). Ideally, high-level synthesis estimates the connectivity delay and area as early as possible so that later HLS steps can better optimize the design. An alternative approach is to specify the complete architecture during allocation so that initial floorplanning results can be used during binding and scheduling (see http://www.cecs.uci.edu/~nisc).

Generation

Once decisions have been made in the preceding tasks of allocation, scheduling, and binding, the goal of the RTL architecture generation step is to apply all the design decisions made and generate an RTL model of the synthesized design.

Architecture. The RTL architecture is implemented by a set of register-transfer components. It usually includes a controller and a data path (see Figure 2). A data path consists of a set of storage elements (such as registers, register files, and memories), a set of functional units (such as ALUs, multipliers, shifters, and other custom functions), and interconnect elements (such as tristate drivers, multiplexers, and buses). All these register-transfer components can be allocated in different quantities and types and connected arbitrarily through buses. Each component can take one or more clock cycles to execute, can be pipelined, and can have input or output registers. In addition, the entire data path and controller can be pipelined in several stages.

Primary input and output ports of the design interface with the external world to transfer both data and control (used for interface protocol handshaking and synchronization). Data inputs and outputs are connected to the data path, and control inputs and outputs are connected to the controller. There are also control signals from the controller to the data path and status signals from the data path to the controller. However, some architectures may not have all the connectivity just described, and in general some of the controller functions may be implemented as part of the data path (for example, a counter plus other logic in the data path that generates control signals).

The controller is a finite state machine that orchestrates the flow of data in the data path by setting the values of control signals (also called control


word) such as the select inputs of functional units, registers, and multiplexers. The inputs to the controller may come from primary inputs (control inputs) or from the data path components such as comparators and so on (status signals). The controller consists of a state register (SR), next-state logic, and output logic. The SR stores the present state of the processor, which is equal to the present state of the finite-state machine (FSM) model describing the controller's operation. The next-state logic computes the next state to be loaded into the SR, whereas the output logic generates the control signals and the control outputs.

The controller of a simple dedicated coprocessor is classically implemented with hardwired logic gates. On the other hand, a controller can be programmable with a read-write or read-only program memory for a specific custom processor. In this case, the program memory can store instructions or just control words, which are longer but require no decoding. In such a circumstance, the SR is called a program counter, the next-state logic is an address generator, and the output logic is RAM or ROM.

Output model. According to the decisions made in the binding tasks, the description of the architecture can be written at the RTL with different levels of detail (that is, without binding or with partial or complete binding). For example, a = b + c executing in state (n) can be written as Figure 3 indicates:

Without any binding:
    state (n): a = b + c;
               go to state (n + 1);

With storage binding:
    state (n): RF(1) = RF(3) + RF(4);
               go to state (n + 1);

With functional-unit binding:
    state (n): a = ALU1 (+, b, c);
               go to state (n + 1);

With storage and functional-unit binding:
    state (n): RF(1) = ALU1 (+, RF(3), RF(4));
               go to state (n + 1);

With storage, functional-unit, and connectivity binding:
    state (n): Bus1 = RF(3); Bus2 = RF(4);
               Bus3 = ALU1 (+, Bus1, Bus2);
               RF(1) = Bus3;
               go to state (n + 1);

Figure 3. RTL description written with different binding details.

When the RTL description includes only partial binding of resources, the logic synthesis step that follows HLS must perform the binding task and the associated optimization. Leaving components unbound in the generated RTL provides the RTL and physical synthesis the flexibility to optimize the bindings on the basis of updated timing estimates that take into account wire loads due to physical (floorplanning and place-and-route) considerations.

Several design flows

Allocation, scheduling, and binding can be performed simultaneously or in specific sequence depending on the strategy and algorithms used. However, they are all interrelated. If they are performed together, the synthesis process becomes too complex to be applied to realistic examples. The order in which they are realized depends on the design constraints and the tool's objectives. For example, allocation will be performed first when scheduling tries to minimize the latency or to maximize the throughput under a resource constraint. Allocation will be determined during scheduling when scheduling tries to minimize the area under timing constraints.[17] Resource-constrained approaches are used when a designer wants to define the data path architecture,[18,19] or wants to accelerate an application by using an FPGA device with a limited amount of available resources.[20] Time-constrained approaches are used when the objective is to reduce a circuit's area while meeting an application's throughput requirements, as in multimedia or telecommunication applications.[21]

In practice, the resource-constrained problem can be solved by using a time-constrained approach or tool (and vice versa). In this case, the designer relaxes the timing constraints until the provided circuit area is acceptable. Latency, throughput, resource count, and area are now well-known constraints and objectives. However, recent work has considered features such as clock period, memory bandwidth, memory mapping, power consumption, and so forth that make the synthesis problem even more difficult to solve.[10,22-24]

Another example of how synthesis tasks can be ordered concerns the allocation and the binding steps. The types and numbers of resources determined in the allocation task are taken as input for the binding task. In practical HLS tools, however, resources are often allocated only partially. Additional resources


are allocated during the binding step according to the design constraints and objectives. These additional resources can be of any type: functional units, multiplexers, or registers. Hence, functional units can be first allocated to schedule and bind the operations. Both registers and multiplexers can then be allocated (created) during the variable-to-register binding step.

The synthesis tasks can be performed manually or automatically. Obviously, many strategies are possible, as exemplified by available EDA tools: these tools might perform each of the aforementioned tasks only partially in automatic fashion and leave the rest to the designer.[10]

Industrial tools

Here, we take a brief look first at commercially available HLS tools for specifying the input description, and then we more carefully examine two state-of-the-art industrial HLS tools.

Input languages and tools

The input specification must capture the intended design functionality at a high abstraction level. Rather than coding low-level implementation details, the designer uses the automation provided by the HLS tool to guide the design decisions, which heavily depend on performance goals and the target technology. For instance, if there is parallelism in an HLS specification, it is extracted using dataflow analysis in accordance with the target technology's capabilities and performance goals (on a slow technology, more parallelism is required to achieve the performance goal).

The latest generation of HLS tools, in most cases, uses either ANSI C, C++, or languages such as SystemC that are based on C or C++ and add hardware-specific constructs such as timing, hardware hierarchy, interface ports, signals, explicit specification of parallelism, and others. Some HLS tools that support C or C++ or derivatives are Mentor's Catapult C (C, C++), Forte's Cynthesizer (SystemC), NEC's CyberWorkbench (C with hardware extensions), Synfora's PICO (C), and Cadence's C-to-Silicon (SystemC). Other languages used for high-level modeling are MathWorks' Matlab and Simulink. Tools that support Matlab or Simulink include AccelDSP (Matlab) and SimplifyDSP (Simulink); both use an IP approach to generate the hardware implementation. An alternative approach for generating an implementation is to use a configurable processor approach, which is the approach of both CoWare's Processor Designer and Tensilica's Xtensa.

Other languages, not based on C or C++ but tailored to specific domains, have also been proposed. Esterel is a synchronous language for the development of reactive systems. Esterel Studio can generate either software or hardware. Bluespec's BSV is a language targeted for specifying concurrency with rule-based atomic transactions.

The input specification to HLS tools must be written with some hardware implementation in mind to get the best results. For example, video line buffering must be coded as part of the algorithm to generate high-throughput designs.[25] Ideally, such code restructuring still preserves much of the abstraction level. It is beyond the scope of this article to provide an overview of the specific writing styles required in the specification for various tools. As tools' capabilities evolve further, fewer modifications will be required to convert an algorithm written for software into an algorithm that is suitable as input to HLS. An analogous evolution has occurred with RTL synthesis, for example, with the optimization of arithmetic expressions.

Catapult C synthesis

Catapult takes an algorithm written in ANSI C++ and a set of user directives as input and generates an RTL that is optimized for the specified target technology. The input can be compiled by any standard compiler compliant with C++; pragmas and directives do not change the functional behavior of the input specification.

Synthesis input. The input specification is sequential and does not include any notion of time or explicit parallelism: it does not hardcode the interface or the design's architecture. Keeping the input abstract is essential because any hard-coding of interface and architectural details significantly limits the range of designs that HLS could generate. Required directives specify the target technology (component library) and the clock period. Optional directives control interface synthesis, array-to-memory mappings, amount of parallelism to uncover by loop unrolling, loop pipelining, hardware hierarchy and block communication, scheduling (latency or cycle) constraints, allocation directives to constrain the number or type of hardware resources, and so on.

Native C++ integer types as well as C++ bit-accurate integer and fixed-point data types are supported


for synthesis. The generated RTL faithfully reflects the bit-accurate behavior specified in the source. Publicly available integer and fixed-point data types provided by Mentor Graphics' (http://www.mentor.com/esl) Algorithmic C data types library (ANSI C++ header files) as well as the synthesizable subset of the SystemC integer and fixed-point data types are supported for synthesis. The support of C++ language constructs meets and exceeds the requirements stated in the most current draft of the Synthesis Subset OSCI standard (http://www.systemc.org).

Generating hardware from ANSI C or C++. One of the advantages of keeping the source untimed is that a very wide range of interfaces and architectures can be generated without changing the input source specification. Another advantage of an untimed source is that it avoids errors resulting from manual coding of architectural details. The interface and the architecture of the generated hardware are all under the control of the designer via synthesis directives. Catapult's GUI provides an interactive environment with both the control and the analysis tools to enable efficient exploration of the design space.

Interface synthesis makes it possible to map the transfer of data that is implied by passing of C++ function arguments to various hardware interfaces such as wires, registers, memories, buses, or more complex user-defined interfaces. All the necessary signals and timing constraints are generated during the synthesis process so that the generated RTL conforms and is optimized to the desired interfaces.

For example, an array in the C source might result in a hardware interface that streams the data or transfers the data through a memory, a register bank, and so forth. Selection of a streaming interface implies that the environment provides data in sequential index order, whereas selection of a memory interface implies that the environment provides data by writing the array into memory. The granularity of the transfer size (for example, the number of array elements provided as a stream transfer or as a memory word) is also specifiable as a user directive.

Hierarchy (block-level concurrency) can be specified by user directives. For example, a C function can be synthesized as a separate hardware block instead of being inlined. Hierarchy can also be specified in a style that corresponds to the Kahn process network computation model (still in sequential ANSI C++). The blocks are connected with the appropriate communication channels, and the required handshaking interfaces are generated to guarantee the correct execution of the specified behavior. The blocks can be synthesized to be driven by different clocks. The clock-domain-crossing logic is generated by Catapult. Communication is optimized according to user directives to enable maximal block-level concurrency of execution of the blocks using FIFO buffers (for streamed data) and ping-pong memories to enable block-level pipelining and thus improve the throughput of the overall design.

All the HLS steps consider accurate component area and timing numbers for the target ASIC or FPGA technology for the designer's RTL synthesis tool of choice. Accurate timing and area numbers for components are essential to generate RTL that meets timing and is optimized for area. During synthesis, Catapult queries the component library so that it can allocate various combinational or pipelining components with different performance and area trade-offs. The queried component library is precharacterized for the target technology and the target RTL synthesis tool. Component libraries can also be built by the designer to incorporate specific characterization for memories, buses, I/O interfaces, or other units of functionality such as pipelined components.

Verification and power estimation flows. The synthesis process generates the required verification infrastructure in SystemC so that the input stimuli from the original C++ testbench can be applied to the generated RTL to verify its functionality against the (golden) source C++ specification using simulation. The synthesis process also generates the required verification infrastructure (wrappers and scripts) to enable the use of sequential equivalence checking between the source C++ specification and the generated RTL. Automatic generation of the verification infrastructure is essential because the interface of the generated hardware heavily depends on interface synthesis.

Power estimation flows with third-party tools let the designer gather switching activity for the design and obtain RTL and gate-level power estimates. By exploring various architectures, the designer can rapidly converge to a low-power design that meets the required performance and area goals.

Catapult has been successfully used in more than 200 ASIC tape-outs and several hundred FPGA designs. Typical applications include computation-intensive


algorithms in communications and video and image processing.

Cynthesizer SystemC synthesis

Cynthesizer takes a SystemC module containing hierarchy, multiple processes, interface protocols, and algorithms and produces RTL Verilog optimized to a specific target technology and clock speed. The target technology is specified by a user-provided .lib file or, for an FPGA implementation, by identifying the targeted Xilinx or Altera part.

Synthesis input. The input to the HLS flow used with Cynthesizer is a pin- and protocol-accurate SystemC model. Because SystemC is a C++ class library, no language translation is required to reuse algorithms written in C++. The synthesizable subset is quite broad, including classes and structures, operator overloading, and C++ template specialization. Constructs that are not supported for synthesis include dynamic allocation (malloc, free, new, and delete), pointer arithmetic, and virtual functions.

The designer puts untimed high-level C++ into a hardware context using SystemC to represent the hardware elements such as ports, clock edges, structural hierarchy, bit-accurate data types, and concurrent processes. As Figure 4 shows, a synthesizable SC_MODULE can contain multiple SC_CTHREAD instances and multiple SC_METHOD instances along with submodule instances and signals for internal connections. I/O ports are signal-level SystemC sc_in and sc_out ports. SC_MODULE instances are C++ classes, so they can also contain C++ member functions and data members (variables), which represent the module behavior and local storage respectively.

Figure 4. SystemC input for synthesis. An SC_MODULE contains ports (clock and reset, required for SC_CTHREAD, plus signal-level ports for reading and writing data), SC_CTHREAD and SC_METHOD processes, submodules, signals, member functions, and data members (storage).

Clocked thread processes implemented as SystemC SC_CTHREAD instances are used for the majority of the module functionality. They contain an infinite loop that implements the bulk of the functionality along with reset code that initializes I/O ports and variables. The SystemC clocked thread construct provides the needed reset semantics. Within a thread, the designer can combine untimed computation code with cycle-accurate protocol code. The designer determines the protocol by writing SystemC code containing port I/O statements and wait() statements. Cynthesizer uses a hybrid scheduling approach in which the protocol code is scheduled in a cycle-accurate way, honoring the clock edges specified by the designer as SystemC wait() statements. The computation code is written without any wait() statements and scheduled by the tool to satisfy latency, pipelining, and other constraints given by the designer.

Triggered methods implemented as SystemC SC_METHOD instances can also be used to implement behaviors that are triggered by activity on signals in a sensitivity list, similar to a Verilog always block. This allows a mix of high- and low-level coding styles to be used, if needed.

Complex subsystems are built and verified by combining modules using structural hierarchy, just as in Verilog or VHDL. The high-level models used as the input to synthesis can be simulated directly to validate both the algorithms and the way the algorithm code interacts with the interface protocol code. Multiple modules are simulated together to validate that they interoperate correctly to implement the functionality of the hierarchical subsystem.

Targeting a specific process technology. To ensure that the synthesized RTL meets timing requirements at a given clock rate using a specific foundry and process technology, the HLS tool requires accurate estimates of the timing characteristics of each operation. Cynthesizer uses an internal data path optimization engine to create a library of gate-level adders, multipliers, and so on. This takes a few hours for a specific process technology and clock speed and can be performed by the designer given any library file. Cynthesizer uses the timing and area characteristics of these components to make trade-offs and optimize the RTL. Designers have the option of using the gates for implementation or of using RTL representations of the data path components for logic synthesis.

Synthesis output. Cynthesizer produces RTL Verilog for use with logic synthesis tools provided by EDA vendors for ASIC and FPGA technology. The RTL consists of an FSM and a set of explicitly instantiated data path components such as multipliers, adders, and multiplexers. More-complex custom data path components that implement arithmetic expressions used in the design are automatically created, and the designer can specify sections of C++ code to be implemented as data path components. The multiplexers directing the dataflow through the data path components and registers are controlled by a conventional FSM consisting of a binary-encoded or one-hot state register and next-state logic implemented in Verilog always blocks.

Strengths of the SystemC flow. SystemC is a good fit for HLS because it supports a high level of abstraction and can directly describe hardware. It combines the high-level and object-oriented features of C++ with hardware constructs that let a designer directly represent structural hierarchy, signals, ports, clock edges, and so on. This combination of characteristics provides a very efficient design and verification flow in which behavioral models of multiple modules can be concurrently simulated to verify their combined algorithm and interface behavior. Most functional errors can be found and eliminated at this high-speed behavioral level, which eliminates the need for time-consuming RTL simulation to validate interfaces and system-level operation, and substantially reduces the overall number of slow RTL simulations required. Once the behavior is functionally correct, the models that were simulated are used directly for synthesis, eliminating opportunities for mistakes or misunderstanding.

ONE INDICATOR OF HOW FAR HLS has come since its early days in the late 1970s is the large number of different HLS tools that are available, both academically and commercially. However, many features must yet be added before these tools become as widely adopted as layout and logic synthesis tools. Moreover, many specific embedded-system applications need particular attention. As HLS moves from block-level to subsystem to full-system design, the interaction of hardware and software becomes both a challenge and an opportunity for further automation. Additional research is vital, because we are still a long way from HLS that automatically searches the design space without the designer's guidance and delivers optimal results for different design constraints and technologies.

References
1. A. Sangiovanni-Vincentelli, "The Tides of EDA," IEEE Design & Test, vol. 20, no. 6, 2003, pp. 59-75.
2. D. MacMillen et al., "An Industrial View of Electronic Design Automation," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, 2000, pp. 1428-1448.
3. D.W. Knapp, Behavioral Synthesis: Digital System Design Using the Synopsys Behavioral Compiler, Prentice Hall, 1996.
4. J.P. Elliot, Understanding Behavioral Synthesis: A Practical Guide to High-Level Design, Kluwer Academic Publishers, 1999.
5. W. Wolf, "A Decade of Hardware/Software Co-design," Computer, vol. 36, no. 4, 2003, pp. 38-43.
6. H. Chang et al., Surviving the SoC Revolution: A Guide to Platform-Based Design, Kluwer Academic Publishers, 1999.
7. M. Keating and P. Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs, Kluwer Academic Publishers, 1998.
8. D. Gajski et al., SpecC: Specification Language and Methodology, Kluwer Academic Publishers, 2000.
9. B. Bailey, G. Martin, and A. Piziali, ESL Design and Verification: A Prescription for Electronic System Level Methodology, Morgan Kaufmann Publishers, 2007.
10. P. Coussy and A. Morawiec, eds., High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.
11. D. Ku and G. De Micheli, High Level Synthesis of ASICs under Timing and Synchronization Constraints, Kluwer Academic Publishers, 1992.
12. D. Gajski et al., High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
13. R.A. Walker and R. Camposano, eds., A Survey of High-Level Synthesis Systems, Springer, 1991.
14. S. Gupta et al., SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits, Kluwer Academic Publishers, 2004.
15. A. Orailoglu and D.D. Gajski, "Flow Graph Representation," Proc. 23rd Design Automation Conf. (DAC 86), IEEE Press, pp. 503-509.


16. M. Girkar and C.D. Polychronopoulos, "Automatic Extraction of Functional Parallelism from Ordinary Programs," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 2, 1992, pp. 166-178.
17. P. Paulin and J.P. Knight, "Algorithms for High-Level Synthesis," IEEE Design and Test, vol. 6, no. 6, 1989, pp. 18-31.
18. M. Reshadi and D. Gajski, "A Cycle-Accurate Compilation Algorithm for Custom Pipelined Datapaths," Proc. Int'l Symp. Hardware/Software Codesign and System Synthesis (CODES+ISSS 05), ACM Press, 2005, pp. 21-26.
19. I. Auge and F. Petrot, "User Guided High Level Synthesis," High-Level Synthesis: From Algorithm to Digital Circuit, P. Coussy and A. Morawiec, eds., Springer, 2008, pp. 171-196.
20. D. Chen, J. Cong, and P. Pan, FPGA Design Automation: A Survey, Now Publishers, 2006.
21. W. Geurts et al., Accelerator Data-Path Synthesis for High-Throughput Signal Processing Applications, Kluwer Academic Publishers, 1996.
22. L. Zhong and N.K. Jha, "Interconnect-Aware Low-Power High-Level Synthesis," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 3, 2005, pp. 336-351.
23. M. Kudlur, K. Fan, and S. Mahlke, "Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines," Proc. Int'l Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS 06), ACM Press, pp. 270-275.
24. M.C. Molina et al., "Area Optimization of Multi-cycle Operators in High-Level Synthesis," Proc. Design, Automation and Test in Europe Conf. (DATE 07), IEEE CS Press, 2007, pp. 1-6.
25. G. Stitt, F. Vahid, and W. Najjar, "A Code Refinement Methodology for Performance-Improved Synthesis from C," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (ICCAD 06), ACM Press, 2006, pp. 716-723.

Philippe Coussy is an associate professor in the Lab-STICC at the Université de Bretagne-Sud, France, where he leads the high-level synthesis (HLS) research group. His research interests include system-level design and methodologies, HLS, CAD for SoCs, embedded systems, and low-power design for FPGAs. He has a PhD in electrical and computer engineering from the Université de Bretagne-Sud. He is a member of the IEEE and the ACM.

Daniel D. Gajski holds the Henry Samueli Endowed Chair in Computer System Design at the University of California, Irvine, where he is also the director of the Center for Embedded Computer Systems. His research interests include embedded systems and information technology, design methodologies and e-design environments, specification languages and CAD software, and the science of design. He has a PhD in computer and information sciences from the University of Pennsylvania. He is a life member and Fellow of the IEEE.

Michael Meredith is the vice president of technical marketing at Forte Design Systems and serves as president of the Open SystemC Initiative. His research interests include development of printed-circuit board layouts, schematic capture, timing-diagram entry, verification, and high-level synthesis tools.

Andres Takach is chief scientist at Mentor Graphics. His research interests include high-level synthesis, low-power design, and hardware-software codesign. He has a PhD from Princeton University in Electrical and Computer Engineering. He is chair of the OSCI Synthesis Working Group.

Direct questions and comments about this article to Philippe Coussy, Lab-STICC, Centre de recherche, rue de saint maude, BP 92116, Lorient 56321, France; [email protected].

For further information about this or any other computing topic, please visit our Digital Library at http://www.computer.org/csdl.
