Compiler Aspects of Hardware Accelerators

Xiaochu Liu Department of Computer Science and Engineering University of California San Diego La Jolla, California 92092 Email: [email protected]

Abstract—Hardware accelerators are first-class building blocks in modern computing devices. Ranging from data centers to mobile devices, hardware accelerators are designed and developed in order to run certain applications more efficiently. The interaction between hardware accelerator design and compiler support has become critical in order to make systems more efficient. In this report, I describe key efforts in hardware accelerators and their philosophy for interacting with compilers. Then I describe my current work on building tools to generate and integrate hardware accelerators automatically.

1. INTRODUCTION

Chip technology scaling is now a limiting factor of hardware system efficiency [1], [2], [3], [4]. Previously, in the classical scaling era, as the transistor size shrank, the power needed to drive more transistors on the same chip area did not change [5]. Hence the frequency increase came for free. However, in the leakage scaling era (where we are right now), the threshold voltage (Vt) no longer shrinks, which stops the supply voltage (Vdd) from shrinking any further (refer to Table 1). Given a constraint on the supply power of the chip, only part of the chip can be actively switched on and off at the same time. The other portions, which have to stay off due to the limitation of supply power, are idle chip area and are referred to as dark silicon. Trading the dark silicon for hardware accelerators is generally a profitable choice in terms of reducing power consumption.

Table 1: Technology scaling table [3].

Param  Description            Relation            Classical scaling  Leakage scaling
B      power budget           -                   1                  1
A      chip size              -                   1                  1
Vt     threshold voltage      -                   1/S                1
Vdd    supply voltage         ~Vt                 1/S                1
tox    oxide thickness        -                   1/S                1/S
W, L   transistor dimension   -                   1/S                1/S
Isat   saturation current     W*Vdd/tox           1/S                1
p      device power           Isat*Vdd            1/S^2              1
Cgate  gate capacitance       W*L/tox             1/S                1/S
F      device frequency       Isat/(Cgate*Vdd)    S                  S
D      devices per chip       A/(W*L)             S^2                S^2
P      full chip power        D*p                 1                  S^2
U      utilization            B/P                 1                  1/S^2

Hardware accelerators are common building blocks nowadays. In addition to specialized functional units like the floating-point unit, processors have added AVX for vector processing and AES-NI for cryptographic operations [6], [7]. ARM has instruction set extensions to support AES encryption/decryption and advanced SIMD (NEON), which map onto hardware accelerators [8]. ARM also supports coprocessor (hardware accelerator) integration by physical register mapping. GPUs are essentially large external accelerators that perform intensive parallel tasks. The latest Apple A9 chip has the M9 as an accelerator to gather data from sensors (accelerometer, gyroscope, compass, and barometer) or even receive voice commands [9]. Adding hardware accelerators has become the consensus in industry and academia for increasing system efficiency in the face of dark silicon.

These hardware accelerators can be built as extensions for CPUs/GPUs through either an ASIC (Application-Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array). CPUs and GPUs themselves are general-purpose and trade efficiency for flexibility. For CPUs, the parallelism and memory bandwidth are limited. GPUs have a massive amount of parallelism; however, they are power hungry and show varying performance across domains. Accelerators built in an ASIC or FPGA can remedy the shortcomings of CPUs and GPUs. Though an ASIC requires higher design and manufacturing costs, it can be specialized for high performance and low power for a particular application. Since it is rare to make any change after computations are hardened into an ASIC, flexibility of the design becomes very important for domain-specific ASIC accelerators. An FPGA is less energy efficient than an ASIC but is re-programmable, which lowers the cost and provides an approach to prototype the hardware before the actual manufacturing.

High-Level Synthesis (HLS) converts programs written in high-level programming languages into hardware written in Hardware Description Languages (HDL). It automates the process of hardware design by shifting the burden of hardware design to its software counterpart. Several HLS tools are available as products or research prototypes - BlueSpec, Catapult C, C-To-Silicon, LegUp, Vivado [10], [11], [12], [13], [14]. The generated hardware can be mapped to an FPGA or an ASIC. Leveraging HLS tools to generate hardware accelerators and integrate them into systems can significantly reduce the design effort.
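As a concrete illustration of the kind of input HLS tools consume, the sketch below shows a small FIR-filter loop in C. This is a hypothetical example of my own (the function and signal names are not taken from any of the cited tools): an HLS tool would schedule the multiply-accumulate operations into clock cycles and emit an equivalent RTL datapath.

#include <stddef.h>

/* 8-tap FIR filter: the inner loop body is the kind of regular,
 * side-effect-free computation that HLS tools map to a pipelined
 * multiply-accumulate datapath in hardware. */
void fir8(const int coeff[8], const int *x, int *y, size_t n) {
    for (size_t i = 7; i < n; i++) {
        int acc = 0;
        for (int t = 0; t < 8; t++) {
            acc += coeff[t] * x[i - t];   /* multiply-accumulate */
        }
        y[i] = acc;
    }
}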

This report explores the design decisions of key hardware accelerator-based systems and focuses on their compiler aspects. I synthesize the compiler aspects of these efforts and divide the accelerators into three categories - ISA-based, configuration-based, and automatically generated. A brief introduction of my current work on generating and integrating hardware accelerators automatically is also presented at the end.

The rest of the report is organized as follows. In Sections 2, 3, and 4, the designs of key hardware accelerator-based architectures are described and their compiler aspects are summarized. In Section 5, I introduce my recent work on building tools to generate hardware accelerators for irregular code. In Section 6, I summarize the content and give some future directions on improving the compiler aspects of hardware accelerators.

2. ISA-based accelerators

Firstly, I describe key efforts in ISA-based hardware accelerators. They serve various domains and make different ISA design choices. The ISA designs target high flexibility while maintaining the specialization of the hardware accelerator. I summarize the design decisions on compiler aspects at the end of this section.

2.1. Large-scale neural network accelerator

Diannao is a hardware accelerator targeting large-scale CNNs (Convolutional Neural Networks) and DNNs (Deep Neural Networks) [15], [16], [17]. It focuses on increasing performance and energy efficiency by reducing the run-time memory footprint. The Diannao design is synthesized into hardware in order to benchmark its power.

Figure 1: Diannao system architecture [15].

Diannao is designed for large-scale CNNs and DNNs. For small-scale neural networks, memory is only used to store the input and the output result; all the neurons and synapses are hardened in the accelerator, which minimizes the execution and communication overhead between neurons. However, this low-overhead design does not scale, since a large neural network would take too much hardware die area. A scalable design should involve memory accesses in the middle of the computation. Diannao therefore buffers input neurons (NBin), output neurons (NBout), and synaptic weights (SB) on chip and streams them through a Neural Functional Unit (NFU) under a control processor (CP), as shown in Figure 1. NBin supports rotations, which enables input neurons to be reused for different output neurons and avoids reloading the same neurons again and again.

2.2. Convolution-operation accelerator

Convolution Engine (CE) is a customized hardware accelerator for the convolution operation, which is composed of map and reduce steps [18]. Using SIMD machines to compute convolutions needs many registers (quadratic in the size of the block). GPGPUs increase performance by 10 times compared with a SIMD machine but cost 100 times more energy [19]. CE is designed to perform this computation pattern efficiently by removing the unnecessary operations.

The convolution pattern is widely used in computational photography, image and video processing. A standard discrete 2-dimensional convolution has the general formula:

(Img * f)[n, m] = Σ_l Σ_k Img[k, l] · f[n−k, m−l]   (1)

Function f is a filter and Img is a mapping from locations to pixel values. The formula contains a map (the product of the filter function and the pixels) and a reduce (the summation of all the products) operation. Abstracting the two operations to a more general format:

(Img ⊗ f)[n, m] = Reduce_{|l|<R, |k|<R} Map(Img[k, l], f[n−k, m−l])   (2)

and, in its most general form:

σ_R(map(x_i, y_i))   (3)

where map is the pairwise operation over the inputs and σ_R is the reduce over a window of size R.
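The map/reduce abstraction in Equations (1)-(3) can be made concrete with a small sketch. The code below is an illustrative software model of my own, not code from the Convolution Engine work: the map and reduce steps are passed in as function pointers, so the same loop nest expresses standard convolution (multiply/add) as well as other stencil-like patterns.

typedef int (*map_fn)(int a, int b);      /* pairwise "map" operation       */
typedef int (*reduce_fn)(int acc, int v); /* associative "reduce" operation */

static int mul(int a, int b) { return a * b; }
static int add(int acc, int v) { return acc + v; }

/* Generalized windowed map-reduce over a (2R+1)x(2R+1) neighbourhood,
 * in the spirit of Equation (2). The caller must ensure the window
 * stays inside the image. */
int conv_point(const int *img, const int *flt, int width, int n, int m,
               int R, map_fn map, reduce_fn reduce) {
    int acc = 0;
    for (int l = -R; l <= R; l++) {
        for (int k = -R; k <= R; k++) {
            int pixel = img[(m + l) * width + (n + k)];
            int coeff = flt[(R - l) * (2 * R + 1) + (R - k)]; /* flipped kernel */
            acc = reduce(acc, map(pixel, coeff));
        }
    }
    return acc;
}

/* Standard convolution at (n, m): conv_point(img, flt, w, n, m, R, mul, add). */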

2.3. Vector accelerator

HWACHA is a vector extension for RISC-V processors [20], [21], [22]. It is invoked in the commit stage of the processor pipeline. A vector instruction is issued to the Hazard and Sequencer units in HWACHA and stays there until it is ready. Then the instruction is fed into an expander, which breaks a single vector instruction into bank micro-ops. Each bank has an SRAM buffer and a vector functional unit (for simple operations). Computation flows from one bank to the next in a systolic way. Larger functional units (integer multipliers, floating-point operators) are shared across banks. The RISC-V core and the vector machine have separate instruction memories, while they share the same data memory (cache). Based on the public source code of their compiler, vector operations are identified and mapped by the compiler automatically. The auto-vectorization compiler pass is able to convert loops into vector primitives based on static code analysis. The vector primitives are then mapped to HWACHA vector instructions by the compiler backend.

2.4. Heterogeneous accelerator

Big challenges exist in building next-generation energy-efficient chip multi-processors [23]. 10X10 is a tiled architecture, each of whose tiles includes six micro-engines and a general-purpose RISC core [24]. Unlike traditional accelerators, the customized micro-engines are tightly coupled, sharing an L1 data cache and local memory. Switching between micro-engines is achieved with a special instruction (under software control) that transfers execution to a different micro-engine. All program state exists as a single image in memory, and the program is a collection of specialized instruction sequences for the micro-engines.

Figure 2: 10x10 tiled architecture [24].

The micro-engines can be roughly divided into two categories - compute-intensive and data-intensive. Compute-intensive engines include Bit-Nibble-Byte (BnB), Fast-Fourier-Transform (FFT), and Vector-Floating-Point (VFP). BnB is a flexible SIMD vector processing unit as wide as 256 bytes. FFT is a widely used operator in image, audio, and digital signal processing applications. VFP supports 2048-bit vectorized floating-point operations. Data-intensive engines include Sort, Data-Layout-Transformation (DLT), and Generalized-Pattern-Matching (GenPM). DLT facilitates data movement (scatter, gather) between main memory and local memory. GenPM is designed for finite-state-machine based applications. Sort is a universal algorithm in many applications. The micro-engines share the same L1 data and L2 unified caches.

Each micro-engine has customized instructions designed for easy integration and programmability. The compiler invokes each micro-engine through the ISA extensions it provides. The instructions of each micro-engine include both data movement and actual processing, which helps improve energy efficiency. Each instruction is built into the compiler as an intrinsic, and programmers interact with the micro-engines in a fine-grained way by calling these intrinsics.

2.5. Video processing accelerator

This ASIC design improves energy efficiency through algorithm-specific optimizations [19]. Rather than designing energy-efficient CPUs [25], it attempts to create customized ASICs using a new approach: an extensible, processor-based full-chip multiprocessor generator is used to explore the design space [26]. Five functions are extensively used in the H.264 application. Integer Motion Estimation (IME) matches each image block to a previous image to represent the motion. Fractional Motion Estimation (FME) refines the result from IME. Intra Prediction (IP) produces a prediction for the current block based on the predictions of its neighbouring blocks. Transformation and Quantization (DCT/Quant) generates coefficients for an image block to reconstruct the pixels. Context Adaptive Binary Arithmetic Coding (CABAC) encodes the coefficients from the previous stage. The ASIC pipelines these functions into customized logic stages. The optimizations on each stage focus on three aspects of the register file, instruction decoding, and data path. Firstly, SIMD width and precision are customized for parallelism. Secondly, instruction fusion is applied for better instruction density. Thirdly, operations are grouped into blocks, also for better instruction density. After these optimizations, the overheads of instruction fetch and register file access found in a general-purpose CMP are removed almost entirely.

This architecture is generated with Tensilica's compiler. Tensilica's custom TIE language is used to customize the software-visible parameters and the datapath. The ISA extension allows the designer to specify VLIW slots, SIMD width, register files, and instruction extensions. The TIE compiler generates intrinsics automatically for the customized instructions (if any). Original algorithms can be adapted to call these intrinsics to leverage the ISA enhancements directly, in addition to the architectural benefits.
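To illustrate the intrinsic-style interface shared by Convolution Engine, HWACHA, 10x10, and the H.264 design, the sketch below shows how a scalar loop might be rewritten to call a compiler intrinsic that maps one-to-one onto a custom instruction. The intrinsic name and its semantics are hypothetical, chosen only for illustration; real TIE- or CE-generated intrinsics have their own names and operand conventions.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical intrinsic: in a real toolchain this prototype would be
 * emitted by the ISA-extension compiler and lower to one custom
 * instruction (here, a 16-way 8-bit saturating add). It is a plain C
 * stand-in so the sketch compiles. */
static inline void acc_vadd_u8x16(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < 16; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);  /* saturate */
    }
}

/* Scalar loop rewritten to invoke the intrinsic once per 16-byte block. */
void saturating_add(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        acc_vadd_u8x16(dst + i, a + i, b + i);   /* one custom instruction per block */
    for (; i < n; i++) {                          /* scalar tail */
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}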

2.6. Database accelerator

The Data Processing Unit (DPU) is a pipelined architecture designed for handling data processing requests (queries) [27], [28]. It defines an Instruction Set Architecture (ISA) and explores the design space of architectures that implement it. The architecture contains eleven tiles for performing eleven data-processing related operations. The partitioner splits a large table into smaller ones. The joiner performs the inner-join operation of two tables. The ALU performs operations on two SIMD column registers to produce one column register. The boolGen compares a column with a constant and generates a new column for the result. The columnfilter takes a column and a bool column as input and produces a new column with rows ruled out based on the bool column. The aggregator aggregates the values of a column. The column selector and stitcher extract columns from a table or aggregate columns into a table. The column concatenator concatenates two columns into a new column with entries from both columns. The appender appends one table to another table with the exact same schema. Memory is used for communication among the tiles (storing temporary and final results).

DPU is aimed at data processing, and so is its programming model. Programmers have to program each query using assembly instructions, since there is no compiler or parser to handle queries written in the Structured Query Language (SQL). The benchmark (TPC-H) has to be modified in order to satisfy the hardware constraints.
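Because the DPU has no SQL front-end, a query plan has to be expressed by hand as a sequence of tile operations. The sketch below is a purely illustrative software model of two of the tiles (boolGen and columnfilter) applied to a simple selection query; the function names and the column representation are mine and do not correspond to the actual DPU assembly.

#include <stdio.h>
#include <stdbool.h>

#define ROWS 5

/* boolGen tile: compare a column against a constant, producing a bool column. */
static void bool_gen_gt(const int *col, int constant, bool *out, int rows) {
    for (int r = 0; r < rows; r++)
        out[r] = col[r] > constant;
}

/* columnfilter tile: keep only the rows whose bool entry is set. */
static int column_filter(const int *col, const bool *keep, int *out, int rows) {
    int n = 0;
    for (int r = 0; r < rows; r++)
        if (keep[r]) out[n++] = col[r];
    return n;
}

int main(void) {
    /* "SELECT price FROM items WHERE price > 40", mapped by hand onto two tiles. */
    int price[ROWS] = {10, 55, 42, 7, 99};
    bool keep[ROWS];
    int result[ROWS];

    bool_gen_gt(price, 40, keep, ROWS);               /* tile 1: boolGen      */
    int n = column_filter(price, keep, result, ROWS); /* tile 2: columnfilter */

    for (int i = 0; i < n; i++) printf("%d\n", result[i]);
    return 0;
}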

2.7. Summary

The accelerators described above make different choices regarding their ISAs. Diannao has a blurry interface to the compiler. The input, output, and intermediate results are stored in scratchpad memory. The control processor (CP) supports five control instructions to issue DMA requests to the buffers and to perform arithmetic computations on the NFU. These instructions can control the topology of the neural network. They have to be hand-coded and are stored in an SRAM attached to the CP. Convolution Engine adds instruction set extensions and leverages compiler intrinsics to use these extensions. The intrinsics are essentially a one-to-one mapping from programming language constructs to architectural instructions. Configuration instructions are used to set the size of the kernel and the ALU operation types. Load/store instructions are used to orchestrate data movement between the appropriate register files and memory. Convolution instructions are used to perform the actual convolutions on the data in the register files. HWACHA uses an intrinsic-style approach (like Convolution Engine). 10X10 uses another intrinsic-style approach, but it uses the Tensilica LISA compiler to ease development. DPU defines an ISA, but applications are hand-coded without a query compiler; each dynamic instruction of the ISA corresponds to an instance of one type of tile (accelerator). The H.264 design uses yet another intrinsic-style approach: it uses the Tensilica TIE compiler to generate the hardware ISA extensions and the compiler intrinsics for programmers to use.

ISA-based accelerators borrow design wisdom from general-purpose processors. These accelerators either have a decoder inside the accelerator (DianNao) or share the decoder of the host processor (10x10). Instructions are stored in instruction memories (the host processor's instruction memory or small DRAMs attached to the accelerators). Domain-specific computations (SQL, etc.) need no support for complex control flow (Q100). Arbitrary computations (C++, etc.) need support for arbitrary jumps, which can be built as ISA extensions of the host processor (Convolution Engine). In the case of an ISA extension, the accelerator needs no initialization (configuring computations) time. In the other case of a stand-alone ISA, the accelerator still has to spend time being initialized with instructions. ISA-based accelerators can only perform whatever operations the ISA supports.

3. Configuration-based accelerators

In this section, I describe another category of hardware accelerators - configuration-based accelerators. They target different application domains and leverage different domain knowledge and technologies for their configurations. I summarize the design decisions on compiler aspects at the end of this section.

3.1. Dynamically configurable accelerator

Dynamically Specializing Execution Resources (DySER) is a hardware accelerator integrated into an out-of-order processor pipeline [29]. It achieves specialization in both functionality and parallelism. Functionality specialization provides dedicated datapaths for certain functions [30], [31], [32], [33], [34]. DySER uses an array of heterogeneous functional units with configurable switches in order to achieve functionality specialization. Such a design eliminates the decode, fetch, commit, and register file read/write that any instruction would otherwise incur in a processor, by combining sequences of instructions into one big operation. A valid-credit system is used in DySER to support pipelined execution. In order to achieve parallelism in DySER, a few compiler optimizations are applied to the original code. Loop unrolling (UNR) and scalar expansion (SCX) are used to grow the parallel code region when the region is too small to fully utilize DySER. Subgraph mapping is used to break large regions into smaller ones that fit into DySER. Strip-mining (STR) and vector port mapping (VEC) are used to vectorize the DySER communication.

The compiler plays the role of optimizing the code and generating the DySER configurations. It first performs transformations on the code to increase its parallelism. Then it identifies the regions of code and generates DySER configurations out of them. Finally, communication instructions are inserted into the original code to configure and invoke DySER.
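The sketch below illustrates, at the C level, the kind of transformation the DySER compiler performs: the memory accesses of a region stay on the processor as a load/store slice, while the computation slice is replaced by sends to and receives from the array. The dyser_send/dyser_recv helpers and the "configured" function are hypothetical stand-ins for the real DySER interface, included only so the example is self-contained.

#include <stddef.h>

/* Hypothetical communication primitives; on real hardware these would be
 * the instructions that move values between the pipeline and the DySER
 * array. Here the array is emulated in software so the sketch runs. */
static int dyser_port;                        /* single emulated input port */
static void dyser_send(int v) { dyser_port = v; }
static int  dyser_recv(void)  { return dyser_port * dyser_port + 1; } /* configured datapath: x*x + 1 */

/* Original region:    for (i) out[i] = in[i] * in[i] + 1;
 * Transformed region: loads/stores remain on the core, the arithmetic
 * runs on the (emulated) configured array.                             */
void region_dyser(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dyser_send(in[i]);     /* load slice: feed operand to the array        */
        out[i] = dyser_recv(); /* computation done by the configured datapath  */
    }
}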

3.2. General-purpose neural networks accelerator

This system includes programming language support [35], a compiler, NPUs (neural processing units), and the interface between the host processor and the NPUs [36]. Neural networks turn out to be very useful for emulating imperative code in software as well [37]. The programming language and compilation framework train the neural networks and transform the original code to take advantage of them. The host processor, which runs the non-approximate part of the program, invokes the NPUs through special instruction set extensions.

Figure 3: The Parrot transformation flow for NPU [36].

The programming model gives the programmers total control over the approximation. The approximate part has to be hot code; otherwise the overhead of invoking the NPU might outweigh the benefit it brings. The target is annotated by programmers as a potential approximate part. The compiler can test the output quality of each annotated part to decide whether to approximate it or not. The annotated part has to be pure, which means no function call or global data access is allowed. The input and output sizes have to be fixed and known. A pointer input is transformed into the data it points to, whose size is fixed and known.

In order to train the neural networks, training data is collected for the annotated kernel. The compiler instruments the annotated functions and records the input-output pairs of executions over the training data. An MLP (multi-layer perceptron) model is trained on the recorded data using the back-propagation algorithm. The compiler generates a configuration for the NPU and the instructions to invoke it in the main program. The configuration contains the topology parameters and the weights. The NPUs are configured by the compiler-generated configurations, and the main program invokes them when it runs into the annotated functions.

The system consists of NPUs that are integrated into the pipeline of the host processor. The NPU communicates with the host processor via FIFOs. Three FIFOs are used: for sending/receiving the configurations of the NPUs, for sending the inputs to the NPUs, and for receiving the outputs from the NPUs. Four instructions were added as ISA extensions to manipulate the three FIFOs (enq.c, deq.c, enq.d and deq.d). The instruction scheduler treats all NPU instructions as accesses to a single architectural register, which makes sure that all the NPU instructions are issued in order. It issues an enqueue request only when the queue is not full and a dequeue request only when the queue is not empty. The NPU starts execution when the input FIFO is full.

The NPU is implemented as a digital design on an FPGA or ASIC. Two levels of approximation are reflected by the nature of neural networks and by approximate hardware circuits. In this particular design, each NPU has eight PEs (processing engines), a config FIFO, an input FIFO, an output FIFO, and a scheduling buffer. This design is determined by speedup testing. The scheduling buffer organizes the execution order of the neurons, each of which is allocated to a PE. Each PE contains a weight buffer, an input buffer, an output buffer, and a sigmoid unit that computes the activation function.

The evaluation is performed on a benchmark suite of diverse domains. It achieves a 2.3x speedup and an energy saving of 3.0x with a quality loss of less than 10% for the whole application.

3.3. Neural network SoC accelerator

SNNAP (systolic neural network accelerator in programmable logic) leverages the FPGA on an SoC to accelerate computations that are error-tolerant using neural networks [38]. FPGAs have been demonstrated to be good at accelerating certain algorithms [39]. SNNAP uses one of two proposed compilation approaches to transform regions in the source code, generating code that invokes the neural network accelerators (with the same parameters as in the original source code). The software batches the invocations and sends them to the accelerators to increase throughput, due to the latency between the host processor and the accelerators. A callback function is then written to copy the results back to the host processor. The invocation is asynchronous, which means the execution of the host processor and the accelerators can interleave.

Two compilation approaches were introduced. The first approach uses an approximate programming language to annotate the approximate part of the program. The annotated part, which is usually a function, is replaced with an invocation of the neural processing unit instead of the original function. The second approach provides a lower-level interface, an API for programmers to specify the communication between the CPU and the accelerators and the computations on the accelerators. This approach uses an asynchronous model where the program configures and starts the accelerator and waits until the results come back from it. The first approach is more automatic, while the second one provides more fine-grained control over exactly how the CPU interacts with the accelerators.
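A minimal sketch of the second, lower-level SNNAP-style interface is shown below. The snnap_* function names are hypothetical placeholders for the C library described above (configure once, enqueue a batch of inputs asynchronously, then wait and read results back); they illustrate the batched, asynchronous model rather than the actual SNNAP API.

#include <stddef.h>

/* Hypothetical accelerator library, emulated here so the sketch compiles.
 * A real implementation would talk to the FPGA over the SoC interconnect. */
static float npu_out[1024];
static size_t npu_count;

static void snnap_configure(const float *weights, size_t n) { (void)weights; (void)n; npu_count = 0; }
static void snnap_enqueue(const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        npu_out[npu_count + i] = in[i] * 0.5f;  /* stand-in for the neural network */
    npu_count += n;
}
static void snnap_wait_all(void) { /* block until the accelerator drains its queue */ }

void approx_kernel(const float *weights, size_t nw,
                   const float *inputs, size_t batches, size_t batch_len,
                   float *outputs) {
    snnap_configure(weights, nw);                        /* one-time configuration         */
    for (size_t b = 0; b < batches; b++)                 /* batch invocations to amortize   */
        snnap_enqueue(inputs + b * batch_len, batch_len);/* asynchronous enqueue            */
    snnap_wait_all();                                    /* overlap ends; results are ready */
    for (size_t i = 0; i < npu_count; i++)               /* "callback": copy results back   */
        outputs[i] = npu_out[i];
}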

The architectural design is built on a PSoC (Programmable System-on-Chip). It has a dual-core ARM Cortex-A9 processor and an FPGA. The accelerator is configured on the FPGA. The host processor (ARM) configures the accelerator through a GPIO (General Purpose I/O) interface. Specifically, AXI (Advanced eXtensible Interface) is used to implement DMA transfers to memory-mapped registers between the ARM and the FPGA. The inputs (parameters) are sent to the accelerator using the ACP (Accelerator Coherency Port), which allows the FPGA to request data from the host processor's cache. To invoke the accelerator, the host processor uses the ARMv7 SEV/WFE instructions.

An MLP (multi-layer perceptron) neural network is the model implemented by the accelerator. An MLP has an input layer, an output layer, and many hidden layers in between. A layer is composed of neurons, and each neuron is calculated as the weighted sum of the neurons of its previous layer passed through an activation function (often a sigmoid).

Figure 4: MLP example with 2 hidden layers. x7 = sigmoid(w47·x4 + w57·x5 + w67·x6), where wxy stands for the weight on the edge between x and y [38].
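To make the per-neuron computation concrete (compare the formula in the caption of Figure 4), the sketch below computes one MLP layer in plain C; it is an illustrative software reference of my own, not the hardware algorithm used by SNNAP.

#include <math.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One fully connected layer: out[j] = sigmoid( sum_i w[j][i] * in[i] ),
 * matching e.g. x7 = sigmoid(w47*x4 + w57*x5 + w67*x6) in Figure 4.     */
void mlp_layer(const double *in, int n_in,
               const double *w,  /* n_out x n_in weights, row-major */
               double *out, int n_out) {
    for (int j = 0; j < n_out; j++) {
        double acc = 0.0;
        for (int i = 0; i < n_in; i++)
            acc += w[j * n_in + i] * in[i];  /* weighted sum */
        out[j] = sigmoid(acc);               /* activation   */
    }
}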

The accelerator hardware implements a neural network using systolic arrays. It contains a scheduler, a bus, and many PUs (processing units). Each PU has a scratchpad memory, a controller, a sigmoid unit, and many PEs (processing elements). One PU computes the weighted product of one neuron and passes the accumulated result to the next PU for the next neuron on the same layer. Weights are stored in BRAMs (block RAMs), and temporary results are stored in a FIFO accumulator until one layer is completed.

Figure 5: Systolic scheduling of neural network layers [38].

The evaluation demonstrates that SNNAP gains an average 3.8x speedup and 2.8x energy savings with a less than 10% error rate.

3.4. GPU neural networks accelerator

Unlike the previous neural accelerators attached to CPUs, accelerators attached to GPUs face a different set of design constraints. GPUs have more cores than CPUs, and each of them is less sophisticated in terms of design. Re-applying neural accelerators designed for CPUs to GPUs directly would take too much die area, which makes the architecture less profitable. The workloads that usually run on GPUs are a good fit for approximate computing [40], [41]. The GPU neural accelerator eliminates costly operations instead of increasing parallelism [42]. It eliminates fetch/decode logic and memory accesses. It reuses the multiply-add functional units inside each GPU streaming multiprocessor (SM). It simulates the sigmoid operator in each neuron with a look-up table.

The system contains a programming language, a compilation flow, ISA extensions, and a GPU-accelerator architecture. It leverages the CUDA programming language with pragma extensions to specify the code section to approximate. For the specified code section, the inputs are the values that are live and referenced inside the section, and the outputs are the values that are live and modified in the section. To get the training/testing data set, the code section is compiled and executed to collect input/output pairs. The neural network model to be trained is a limited-size multi-layer perceptron (MLP). The compilation flow replaces the original code section with an invocation of the trained neural accelerator (built in hardware). An annotated code section looks like the following:

uchar4 p = tex2D(img, x, y);
...
#pragma(begin_approx)
a = min(r, min(g, b));
b = max(r, max(g, b));
z = ((a + b) > 254) ? 255 : 0;
#pragma(end_approx)
...
dst[img.width * y + x] = z;

The architecture attaches the accelerators to each GPU streaming multiprocessor (SM) instead of to each processing unit (PU). Each SM has a set of accelerators with a shared weight buffer and control logic. Each PU inside an SM has its own sigmoid operator (simulated by a LUT) and input/output buffers (for the intermediate results between neuron layers and the final result).

All the PUs inside an SM execute in a lock-step mode controlled by the shared logic. The quality degradation is controlled by the invocation rate of the neural accelerators, which is the number of warps invoking the neural accelerators over the total number of warps. Based on the evaluation, the neural accelerators achieve a 2.4x performance speedup and 2.8x less energy consumption within a quality loss of less than 10%.

3.5. Query processing accelerator

LINQits is a Hardware Template (HAT) based accelerator for data processing languages [43]. The accelerator can be implemented using either an FPGA-based SoC or an ASIC, with different design flavors. It leverages LINQ, which is an advanced query language from Microsoft's .NET framework [44]. In addition to traditional SQL, it supports some user-defined anonymous functions and supports lazy evaluation to avoid unnecessary queries. The compilation process includes a query plan optimizer, mapping to the hardware template, and run-time scheduling. The reconfigurable part of the accelerator resides in the post- and pre-cores. The ARM core is responsible for initializing the partitions of the Partition Reader, which has to be explicitly coded by the user (the paper does not explain this part in detail). The high-level synthesis tool AutoESL is used to generate the post- and pre-cores.

Figure 6: LINQits hardware templates [43].

3.6. Datacenter search engine accelerator

Catapult is an FPGA-based reconfigurable architecture for accelerating large-scale datacenter applications [45]. Instead of using the native bus provided by the host CPU like other FPGA-based CPU systems [46], [47], Catapult designed its own PCIe bus driver. The FPGA is placed on a board along with 8 GB of DRAM, PCIe, and inter-FPGA connections. A rack consists of a group of 24 1U servers, with a torus network connecting the FPGA boards. The inter-FPGA connection speed can go up to 20 Gb/s. This infrastructure has been deployed to production servers in the company.

The programming interface needs to consider both the software interface and the board resource interface. The CPU allocates a buffer in user-level memory space to communicate with the FPGA. The FPGA performs DMA requests to the buffer by periodically checking the full bit. A shell is developed to manage all the FPGA board resources. Developers only need to write the role part, which is the actual application logic. Though programmers do not need knowledge of Catapult's board resources, knowledge of a Hardware Description Language (HDL) is still necessary. From this point of view, Catapult is programming friendly mostly for hardware designers.

3.7. Summary

The accelerators described above have different considerations for their configuration designs. DySER heavily depends on the compiler to generate code. The compiler identifies memory sub-regions, which include the load/store instructions, and converts them into instructions that invoke DySER. The compiler also identifies computation sub-regions and converts them into configurations that can be used to configure DySER. The configuration happens some time before the invocations. NPU uses the compiler to generate code as well as to generate configurations. The compiler transforms the annotated code (a function) into a neural network by performing an off-line training process on some training data. The generated configuration includes the weights of each neuron (the neural network topology is fixed). The compiler instruments the CPU code to configure, send data to, and invoke the accelerators. ISA extensions are added to support the communication of data and configurations. SNNAP uses a similar approach, but in addition it builds a C library to wrap the invocations into an asynchronous streaming model. A similar approach is taken by the GPU neural accelerator, except that it annotates a code region inside a function instead of the entire function. Catapult does not leverage a compiler at the moment. The programmer has to have domain knowledge of both the application and the hardware, but knowledge of this particular platform is unnecessary since a shell is developed to hide the details. LINQits leverages a domain-specific language (LINQ) for compilation and uses HLS to generate configurations for the pre- and post-cores.

Configuration-based accelerators need a configuration stage before performing the actual computation. The configurations can be stored in registers in the accelerators (DySER, NPU, etc.). An interface is used or designed to send the configurations to the accelerators efficiently. With the ability to be reconfigured, the accelerators become more flexible and gain the capability to perform more operations. A challenge is to schedule the configuration stage to overlap with computation so that it does not affect overall efficiency. Correctly designing the configuration point is essential to reduce the total amount of configuration.

Configuration-based accelerators are flexible and efficient. They remove instruction-related overhead by having no ISA at all. They achieve flexibility by setting the reconfigurable parameters in the accelerators. A design space exists to balance the reconfigurable parameters. For accelerators attached to the host processor's pipeline (DySER) and the ones attached to the same bus as the host (NPU, SNNAP), configuration has different designs.

The former is more sensitive to configuration overhead if the accelerator is designed to change configurations regularly. The latter is less sensitive, since the configuration usually happens once and can be reused over many iterations.
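This trade-off can be made concrete with a small back-of-the-envelope model (illustrative numbers only, not measurements from any of the systems above): if a configuration costs C cycles and each invocation does W cycles of useful work, reconfiguring per invocation adds C/W relative overhead, while a one-time configuration amortizes to C/(N*W) over N invocations.

#include <stdio.h>

/* Illustrative overhead model for configuration-based accelerators. */
int main(void) {
    const double C = 1000.0;   /* cycles to load one configuration (assumed)    */
    const double W = 200.0;    /* cycles of useful work per invocation (assumed)*/
    const double N = 10000.0;  /* invocations that reuse the same configuration */

    double per_invocation = C / W;       /* reconfigure every time */
    double amortized      = C / (N * W); /* configure once, reuse  */

    printf("relative overhead, reconfigure each time: %.2f\n", per_invocation);
    printf("relative overhead, configure once:        %.6f\n", amortized);
    return 0;
}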

4. Automatically generated accelerators

The third category of hardware accelerators is the automatically generated ones. These accelerators use a toolchain to convert pieces of software into their functionally equivalent hardware counterparts.

4.1. ASIC-based accelerators

The utilization wall prohibits scaling the frequencies of microprocessor designs. Due to the limitation on the supply voltage (related to the threshold voltage), power does not scale down along with the scaling-down of transistor dimensions. As transistor dimensions shrink, the power budget reaches a point where only a portion of the transistors on a chip die can be powered up to switch at one time. One real-world indication from Intel is that their CPU frequencies have been stable since 2004; another is the support for turbo mode, which boosts one core's frequency by turning off all the other cores.

An effective approach to conquering this problem is Conservation Cores, proposed by UCSD [3], [48], [49]. It trades chip area for lower energy consumption and uses application-specific accelerators to remove the inefficiencies of general-purpose processors (instruction fetch, register file access, etc.). It contains a general-purpose host processor and a group of c-cores, each of which performs a particular function that is offloaded from the host. The host and the c-cores communicate through the L1 cache and scan chains. The L1 cache consistency between them is maintained explicitly by forcing a particular memory access order. The scan chains are used by the host to change any state inside a c-core (register values, control signals, etc.). Patching is also supported through the scan chains in case the application is updated.

Figure 7: Conservation core system architecture [3].

The c-cores are generated by a toolchain from high-level application source code. A profiling tool identifies the hot regions (functions) of the program, and the toolchain turns them into c-cores. At runtime, the host processor initializes a c-core with arguments and other data through the scan chains and then starts the c-core. The functions executed on c-cores are more energy-efficient than on the host, since the overhead incurred by the pipeline stages and instruction memory accesses (instruction fetch latency, branch mis-prediction penalty, etc.) in a general-purpose processor is removed.

The essential part of the toolchain is a translation tool. It takes arbitrary functions written in the C language as input and produces c-core hardware written in a Hardware Description Language (HDL). It uses the selective depipelining technique to schedule instructions [48]. Each c-core has two clocks - a slow clock and a fast clock. The slow clock drives the execution of basic blocks, and the fast clock drives each instruction inside a basic block. One instruction can take more than one cycle to finish (loads/stores, floating point, etc.), in which case two copies of the instruction are scheduled (a send copy and a receive copy). Instructions are given priorities based on their critical paths. Then, based on priority, instructions are scheduled among the fast clock cycles. The policy is a heuristic which packs as many instructions as possible inside a fast clock cycle under certain timing constraints. Deferred instructions (those missing their timing constraints) are then scheduled again in the next fast clock cycle. All the two-stage operations have state registers to store their temporary values across the boundaries of fast clock cycles.

Based on the experiments, functions on c-cores can be up to 16x more energy efficient, while the system energy efficiency can increase by up to 2.1x.

4.2. FPGA-based accelerators

AutoPilot HLS is the predecessor of the Vivado HLS tool [14], [50]. It provides a high-level synthesis system for the integration and verification of designs. AutoPilot outputs RTL descriptions which can be used for simulation and verification. It also generates reports for power/performance/area like any other IC compiler. It supports C, C++, and SystemC as frontend high-level programming languages. Floating-point operations are mapped to precision-variant IP blocks.

AutoPilot uses commonly-used compiler techniques to optimize and generate code. It leverages the Static Single-Assignment (SSA) form of the LLVM infrastructure to perform optimizations and code generation. It uses llvm-gcc as the compiler frontend. Several LLVM passes are mentioned by this work as important for high-level synthesis optimizations. Global value numbering-based approaches, including constant propagation, dead code elimination, and redundant code elimination, are generally useful. Strength reduction replaces expensive operators with cheap ones to reduce the design constraints.
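As a small, generic illustration of why strength reduction matters for HLS (this is not code from the AutoPilot paper), the multiply below can be replaced by a shift; in hardware, this turns a multiplier instance into simple shift wiring, which reduces area and often the critical path.

#include <stdint.h>

/* Before: a multiply by a constant power of two. */
uint32_t scale_mul(uint32_t x) {
    return x * 8u;   /* would synthesize to a multiplier unless strength-reduced */
}

/* After strength reduction: the same function using a shift. */
uint32_t scale_shift(uint32_t x) {
    return x << 3;   /* synthesizes to a fixed 3-bit left shift: no multiplier needed */
}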

Range analysis shows the potential to reduce the precision (bit width) of operations. Loop-related optimizations can expose more parallelism, which improves performance.
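For example, unrolling a reduction loop (a generic sketch, not an example from the AutoPilot paper) exposes independent partial sums that an HLS scheduler can map to parallel adders:

#include <stddef.h>

/* Rolled loop: one accumulation chain, little instruction-level parallelism. */
int sum_rolled(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unrolled by 4: four independent partial sums can be computed by parallel
 * hardware and combined at the end (n is assumed to be a multiple of 4). */
int sum_unrolled4(const int *a, size_t n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}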

improves performance. Memory related optimizations can .... FPGA reduce memory access frequencies. The HLS process needs y[n] = 0; for (i = 0; i < 8; i++) { Self-Profiling Hardware Hardware MIPS Processor to resolve an efficient scheduling, optimizations under con- y[n] += coeff[i] * x[n-i]; 1 MIPSProcessor Processor Accelerator Accelerator C Compiler } (MIPS) straints, efficient resource sharing and memory operation .... CONG et al.:HIGH-LEVELSYNTHESISFORFPGAS:FROMPROTOTYPINGTODEPLOYMENT 477 2 optimizations. AVALON INTERCONNECT Program code second-generation of HLS tools showed interesting capabilities LegUp 5 On-Chip Memory Controller Cache to raise the level of design abstraction, most designers were Altered SW binary (calls HW accelerators) Profiling Data: reluctant to take the risk of moving away from the familiar 3 Execution Cycles Off-Chip Memory RTL design methodology to embrace a new unproven one, 4 High-level Power synthesis Suggested Hardened Cache Misses µP program despite its potential large benefits. Like any major transition program segments to Figure 2: Target system architecture. segments in the EDA industry, designers needed a compelling reason or 6 FPGA fabric target to HW processor/accelerator communication across the Avalon in- event to push them over the “tipping point,” i.e., to adopt the terface or through memory. HLS design methodology. FigureFigure 9: 1: Design Design flow flow of with LegUp LegUp. [13]. The architecture depicted in Fig. 2 represents the target Another important lesson learned is that tradeoffs must be system most natural for an initial release of the tool. The ar- chitecture of processor/accelerator systems is an important made in the design of the tool. Although a designer might wish runsThe on HLS an FPGA-based tools takes three MIPS steps: processor. allocation, We evaluated schedul- direction for future research. for a tool that takes any input program and generates the “best” ing,several and publicly-available binding [51]. Scheduling MIPS processor assigns implementations each software hardware architecture, this goal is not generally practical for instructionand selected to athe particular Tiger MIPS clock processor cycle. LegUp from schedules the University each 4. DESIGN AND IMPLEMENTATION HLS to achieve. Whereas compilers for processors tend to instructionof Cambridge as soon [11], as based all itson dependenciesits full support are of met. the MIPS Bind- instruction set, established tool flow, and well-documented 4.1 High-Level Hardware Synthesis focus on local optimizations with the sole goal of increasing ing determines which hardware resource (functional units, registers,modular Verilog.etc.) the operation of an instruction uses. LegUp High-level synthesis has traditionally been divided into performance, HLS tools must automatically balance perfor- solvesThe binding MIPS processor as a bipartite has been matching augmented problem. with In extra practice, cir- three steps [4]: allocation, scheduling and binding. Alloca- cuitry to profile its own execution. Using its profiling abil- tion determines the amount of hardware resources available mance and implementation cost using global optimizations. high-cost functional units (integer multiplier, floating-point However, it is critical that these optimizations be carefully ity, the processor is able to identify sections of program code for use, and manages other hardware constraints (e.g., speed, units)that would are shared benefit more from frequently hardware than implementation. 
registers and low-cost Specif- area, and power). Scheduling assigns each operation in the implemented using scalable and predictable algorithms, keep- functionalically, the units profiling (integer results adder, drive etc.) the are selection rarely shared. of program program being synthesized to a particular clock cycle (state) ing tool runtimes acceptable for large programs and the results code segments to be re-targeted to custom hardware from and generates a finite state machine. Binding saves area understandable by designers. Moreover, in the inevitable case 4.4.the C Summary source. Profiling a program’s execution in the proces- by sharing functional units between operations, and sharing that the automatic optimizations are insufficient, there must sor itself provides the highest possible accuracy. Presently, registers/memories between variables. be a clear path for a designer to identify further optimization we profile program run-time at the function level. LegUp leverages the low-level virtual machine (LLVM) Fig. 1. AutoESLFigure 8: and AutoPilot Xilinx C-to-FPGA HLS tool design design flow. flow [50]. HavingThe hardware chosen program accelerators segments described to target above to are custom all gener- hard- compiler framework. At the core of LLVM is an inter- opportunities and execute them by rewriting the original source atedware, automatically at step ➂ LegUp from is software. invoked to C-Core compile targets these for segments ASIC mediate representation (IR), which is essentially machine- code. andto synthesizeable has been built Verilog to integrate RTL. into LegUp’s a host hardware processor. synthe- Au- independent assembly language. C code is translated into Hence, it is important to focus on several design goals for and4.3. synthesis SoC based optimizations accelerators to generate optimized synthesiz- toPilotsis and targets software for compilation FPGA. LegUp are part targets of the for same FPGA-based compiler LLVM’s IR then analyzed and modified by a series of com- a HLS tool. able RTL. SoC.framework. Presently, LegUp HLS operates at the function piler optimization passes. LLVM IR instructions are sim- AutoPilot outputs RTL in Verilog, VHDL or cycle-accurate level: entire functions are synthesized to hardware from the ple enough to directly correspond to hardware operations a) Capture designs at a bit-accurate, algorithmic level. The LegUp is a high-level synthesis system to build FPGA- Csource.TheRTLproducedbyLegUpissynthesizedto (e.g., an arithmetic computation). Our HLS tool operates source code should be readable by algorithm specialists. SystemCbased CPU-accelerators for simulation and architectures verification. [13]. To It enable is composed automatic of 5.an Our FPGA work implementation using standard commercial tools directly with the LLVM IR, scheduling the instructions into b) Effectively generate efficient parallel architectures with co-simulation,a MIPS soft processor AutoPilot and creates an automatic test bench generated (TB) wrappers hardware and at step ➃.Instep➄, the C source is modified such that specific clock cycles. LegUp HLS algorithms have been im- We are developing a framework to build accelerators minimal modification of the source code, for paralleliz- transactorsaccelerator. 
4.4. Summary

The hardware accelerators described above are all generated automatically from software. C-Cores target ASICs and have been built to integrate into a host processor. AutoPilot targets FPGAs. LegUp targets an FPGA-based SoC.

5. Our work

We are developing a framework to build accelerators and integrate them into existing systems automatically. Our work falls into the category of automatically generated accelerators; it can potentially be used in micro-controllers.
5.1. System overview

A toolchain is developed to turn software written in a high-level programming language (C) into a hardware design represented in a hardware description language (Verilog). The toolchain goes through several stages, including instruction scheduling, basic block generation, control block generation, and datapath generation, in order to generate an intermediate representation (IR) for digital circuits. The IR is then processed by a converter into an industry-standard hardware design specification (Verilog). This product can be used as input to any CAD tool to produce a hardware design ready to ship for fabrication.
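The stages above form a straight-line pipeline, which the driver sketch below mirrors. Every name here (CircuitIR, schedule_instructions(), emit_verilog(), and so on) is a hypothetical placeholder; the stubs only record the order in which the real stages would run, not their implementations.

    #include <stdio.h>

    /* Placeholder IR: the real one holds basic blocks, scheduled instructions,
     * control state machines, and datapath descriptions. */
    typedef struct {
        const char *function_name;   /* the targeted function */
    } CircuitIR;

    static void schedule_instructions(CircuitIR *ir)  { printf("[%s] schedule instructions per basic block\n", ir->function_name); }
    static void generate_basic_blocks(CircuitIR *ir)  { printf("[%s] map software basic blocks to hardware basic blocks\n", ir->function_name); }
    static void generate_control_block(CircuitIR *ir) { printf("[%s] build the control state machine\n", ir->function_name); }
    static void generate_datapath(CircuitIR *ir)      { printf("[%s] build the datapath and shared operators\n", ir->function_name); }
    static void emit_verilog(const CircuitIR *ir)     { printf("[%s] convert the IR to synthesizable Verilog\n", ir->function_name); }

    int main(void)
    {
        CircuitIR ir = { "foo" };
        schedule_instructions(&ir);
        generate_basic_blocks(&ir);
        generate_control_block(&ir);
        generate_datapath(&ir);
        emit_verilog(&ir);
        return 0;
    }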
    int foo(int *p, int a, int b, int c) {
      int t0 = p[a] * p[b];
      t0 = t0 * p[c];
      return t0;
    }
    (a)

    define i32 @foo(i32* p, i32 a, i32 b, i32 c) {
    call_conv:
      br label entry
    entry:
      t3 = getelementptr p, a
      t6 = load t3
      t14 = getelementptr p, b
      t11 = load t14
      t12 = mul t11, t6
      t25 = getelementptr p, c
      t17 = load t25
      t18 = mul t12, t17
      ret i32 t18
    }
    (b)

(c) instruction schedule for basic block 'entry': states 0 through 8, in which each load and multiply is split into a send slot and a receive slot on the shared operators (for example, t11(send) in state 2 and t11(recv) in state 3, down to t18(recv) in state 8).

(d) architecture diagram: a Control Block Module driving the Basic Block Modules for call_conv and entry.
Figure 10: Code generation process. (a) C source code; (b) software IR; (c) automatic instruction schedule for basic block 'entry'; (d) architecture diagram of the generated hardware accelerator.

The generated hardware accelerator is compatible with the host processor in terms of calling conventions. It can be invoked by software running on the host processor with overhead similar to invoking a software function. The calling-convention interface is designed to be flexible so that the effort of re-targeting the hardware accelerator to another host processor is minimized. The generated hardware accelerator is able to leverage the Split-Phase Multiplexed Operator (SPMO, explained in a later section) interface to access peripheral modules (memory, etc.) just like the host processor does. This gives the hardware accelerator enough opportunity to take over functionality usually performed by the host general-purpose processor.

5.2. Granularity and instruction scheduling

The hardware accelerator is generated at the granularity of a function: a hardware module is generated for the targeted function. Its interface supports reading and writing state inside the hardware accelerator from the host processor, as well as reading and writing peripherals from the hardware accelerator.

The toolchain schedules instructions in units of basic blocks. Inside each basic block, a data dependency graph is built over all instructions. Instruction scheduling is a two-pass process. In the first pass, each instruction is assigned a priority based on its latest starting time on the critical path. In the second pass, a work-list algorithm iterates through the instructions: each instruction is scheduled into a timing slot according to its priority, and the work-list is updated. After instruction scheduling, the software basic blocks are mapped into hardware basic blocks. The translation heuristics are illustrated in Figure 10.
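A minimal sketch of this two-pass scheme is given below under simplifying assumptions: unit latencies, a single shared functional unit per cycle, and a fixed five-instruction block. The arrays and the priority definition (length of the longest dependence chain still to execute, so an earlier latest start time means higher urgency) are illustrative and do not reproduce the toolchain's actual heuristics.

    #include <stdio.h>

    #define N     5   /* instructions, listed in topological order   */
    #define SLOTS 1   /* shared functional units available per cycle */

    static const char *name[N] = { "addr1", "addr2", "ld1", "ld2", "mul" };
    static int n_dep[N]        = {  0,       0,       1,     1,     2   };
    static int dep[N][2]       = { {0,0},   {0,0},   {0,0}, {1,0}, {2,3} };

    int main(void)
    {
        int prio[N], cycle[N], done[N] = { 0 };

        /* Pass 1: priority = longest chain of dependent instructions that
         * must still run after this one (longer chain => more urgent). */
        for (int i = N - 1; i >= 0; i--) {
            prio[i] = 0;
            for (int j = i + 1; j < N; j++)
                for (int d = 0; d < n_dep[j]; d++)
                    if (dep[j][d] == i && prio[j] + 1 > prio[i])
                        prio[i] = prio[j] + 1;
        }

        /* Pass 2: work-list scheduling; pick the most urgent ready instruction,
         * issuing at most SLOTS instructions per cycle. */
        int scheduled = 0;
        for (int c = 0; scheduled < N; c++) {
            int issued = 0;
            while (issued < SLOTS) {
                int pick = -1;
                for (int i = 0; i < N; i++) {
                    if (done[i]) continue;
                    int ready = 1;
                    for (int d = 0; d < n_dep[i]; d++)
                        if (!done[dep[i][d]] || cycle[dep[i][d]] >= c)
                            ready = 0;           /* dependency not finished yet */
                    if (ready && (pick < 0 || prio[i] > prio[pick]))
                        pick = i;
                }
                if (pick < 0) break;             /* nothing ready this cycle */
                cycle[pick] = c;
                done[pick] = 1;
                scheduled++;
                issued++;
                printf("cycle %d: %s (priority %d)\n", c, name[pick], prio[pick]);
            }
        }
        return 0;
    }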

5.3. Control logic and execution model

The control logic drives the execution of the entire hardware accelerator module. The state machine inside the control logic keeps a state for each instruction as well as for each basic block, and it is generated after instruction scheduling is done. It has state signals indicating the current basic block and the particular instruction inside that basic block. The control flow is either sequential inside a basic block or jumps to a new basic block based on the jump instruction in the previous block. It supports issuing and executing multiple instructions in the same cycle as long as they have no dependencies and do not share the same functional unit. The control logic inside a basic block is designed so that execution can be as efficient as a dataflow model, with the only limiting factor being the multiplexed external modules.

5.4. Interface between hardware accelerator and host processor

The accelerator has a standard interface to interact with the host processor and the memory system. The interface is customizable in order to achieve low-overhead access to peripherals. We designed the interface to handle an arbitrary calling convention at run-time. Taking the MIPS ISA as an example, at invocation time the host processor sets the stack pointer, the global pointer, and the registers in the accelerator. The host processor is able to change the state of the hardware accelerator (parameters and temporary values held between cycles) via a tree-structured, pipelined multiplexer. Each register in the hardware accelerator has a unique address in its own register address space, and the interface provides support for the host processor to read or modify any register in that space. More complicated calling conventions (passing more than four parameters, passing unions/structs, etc.) are handled through the data memory cache shared with the host processor. The host processor then keeps probing the attention signal from the accelerator to detect when execution has finished.

5.5. Split-phase multiplexed operator (SPMO)

The Split-Phase Multiplexed Operator (SPMO) is a scalable and efficient mechanism for hardware accelerators to access shared resources. It achieves silicon sharing across the accelerator, the host processor, and external peripherals. The toolchain generates SPMOs automatically for the hardware accelerators. An SPMO uses a valid/enable protocol to access peripherals; for each instruction that uses one, the toolchain generates a send copy and a receive copy of the instruction for scheduling purposes. The only restrictions on achieving maximum parallelism are the number of shared resources of each type and the number of their ports.
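The send/receive split can be modeled functionally as below. SpmoPort, spmo_send(), and spmo_recv() are hypothetical names for a single-outstanding-request port; the real SPMO is generated hardware with a valid/enable handshake, so this C model only captures the scheduling discipline that independent work may sit between the two phases.

    #include <assert.h>
    #include <stdio.h>

    /* Functional model of one split-phase multiplexed port: a request is issued
     * with spmo_send() and its result is collected later with spmo_recv(). */
    typedef struct {
        int busy;      /* is a request outstanding?   */
        int pending;   /* value that recv will return */
    } SpmoPort;

    static int memory[8] = { 3, 1, 4, 1, 5, 9, 2, 6 };   /* toy shared peripheral */

    static void spmo_send(SpmoPort *p, int addr)
    {
        assert(!p->busy);          /* the schedule must respect the port count */
        p->busy = 1;
        p->pending = memory[addr]; /* latency is hidden between send and recv  */
    }

    static int spmo_recv(SpmoPort *p)
    {
        assert(p->busy);
        p->busy = 0;
        return p->pending;
    }

    int main(void)
    {
        SpmoPort port = { 0, 0 };

        spmo_send(&port, 2);            /* send slot of a load              */
        int other = 10 * 10;            /* independent work while in flight */
        int loaded = spmo_recv(&port);  /* receive slot of the same load    */

        printf("loaded %d, other work %d\n", loaded, other);
        return 0;
    }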
6. Conclusion

In the leakage-dominated technology scaling era, systems have to resort to hardware accelerators to take efficiency to the next level. In order to leverage dark silicon, designers are adding more and more accelerators into their systems. The problem of integrating and managing an increasing number of hardware accelerators demands more design consideration from the compiler side. Hardware accelerators continue to be redesigned with compiler aspects in mind: the support hardware accelerators provide to compilers, and the support they require from compilers, interact to make the system architecture more efficient and cost-effective. Proposing new designs for these compiler aspects, or choosing between existing ones, plays a critical role in an accelerator's overall quality.

Automatically generated hardware accelerators provide another way of designing hardware accelerators: instead of the hardware being designed manually, the programs to be accelerated are synthesized into accelerators directly. The generated accelerators can fall into the category of configuration-based accelerators (examples of the configuration parameters are the calling conventions of the host processor's traditional ISA); however, they are generated automatically, which is fundamentally different from manually designed configuration-based accelerators. Automatic generation reduces the manual design effort of hardware, which is rooted in the fact that hardware code can rarely be reused at the scale software can, and it enables quick design exploration when integrating hardware accelerators into a new system for a new application.

References

[1] Michael B Taylor. Is dark silicon useful?: Harnessing the four horsemen of the coming dark silicon apocalypse. In Proceedings of the 49th Annual Design Automation Conference, pages 1131–1136. ACM, 2012.
[2] Michael Bedford Taylor. A landscape of the new dark silicon design regime. Micro, IEEE, 33(5):8–19, 2013.
[3] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ACM SIGARCH Computer Architecture News, volume 38, pages 205–218. ACM, 2010.
[4] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365–376. IEEE, 2011.
[5] Robert H Dennard, V L Rideout, E Bassous, and A R LeBlanc. Design of ion-implanted MOSFETs with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, 1974.
[6] Chris Lomont. Introduction to Intel advanced vector extensions. Intel White Paper, 2011.
[7] Shay Gueron. Advanced encryption standard (AES) instructions set. Intel, http://softwarecommunity.intel.com/articles/eng/3788.htm, accessed 2008.
[8] http://www.arm.com/products/processors/.
[9] https://en.wikipedia.org/wiki/Apple_motion_coprocessors.
[10] http://www.bluespec.com/high-level-synthesis-tools.html.
[11] https://www.mentor.com/hls-lp/.
[12] http://www.cadence.com/products/sd/silicon_compiler/pages/default.aspx.
[13] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. Legup: High-level synthesis for fpga-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.
[14] http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[15] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM SIGPLAN Notices, volume 49, pages 269–284. ACM, 2014.
[16] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
[17] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 92–104. ACM, 2015.

[18] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. Convolution engine: Balancing efficiency & flexibility in specialized computing. In ACM SIGARCH Computer Architecture News, volume 41, pages 24–35. ACM, 2013.
[19] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ACM SIGARCH Computer Architecture News, volume 38, pages 37–47. ACM, 2010.
[20] Yunsup Lee, Andrew Waterman, Rimas Avizienis, Henry Cook, Chen Sun, Vladimir Stojanovic, and Krste Asanovic. A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators. In European Solid State Circuits Conference (ESSCIRC), ESSCIRC 2014-40th, pages 199–202. IEEE, 2014.
[21] Albert Ou, Quan Nguyen, Yunsup Lee, and Krste Asanovic. A case for MVPs: Mixed-precision vector processors.
[22] Huy Vo, Yunsup Lee, Andrew Waterman, and Krste Asanovic. A case for OS-friendly hardware accelerators. Proc. of WIVOSCA, 2013.
[23] Shekhar Borkar and Andrew A Chien. The future of microprocessors. Communications of the ACM, 54(5):67–77, 2011.
[24] Andrew A Chien, Tung Thanh-Hoang, Dilip Vasudevan, Yuanwei Fang, and Amirali Shambayati. 10x10: A case study in highly-programmable and energy-efficient heterogeneous federated architecture. ACM SIGARCH Computer Architecture News, 43(3):2–9, 2015.
[25] James Balfour, William J Dally, David Black-Schaffer, Vishal Parikh, and JongSoo Park. An energy-efficient processor architecture for embedded systems. Computer Architecture Letters, 7(1):29–32, 2008.
[26] Alex Solomatnikov, Amin Firoozshahian, Wajahat Qadeer, Ofer Shacham, Kyle Kelley, Zain Asgar, Megan Wachs, Rehan Hameed, and Mark Horowitz. Chip multi-processor generator. In Proceedings of the 44th Annual Design Automation Conference, pages 262–263. ACM, 2007.
[27] Lisa Wu, Andrea Lottarini, Timothy K Paine, Martha A Kim, and Kenneth A Ross. Q100: The architecture and design of a database processing unit. In ACM SIGPLAN Notices, volume 49, pages 255–268. ACM, 2014.
[28] Lisa Wu, Raymond J Barker, Martha A Kim, and Kenneth A Ross. Navigating big data with high-throughput, energy-efficient data partitioning. ACM SIGARCH Computer Architecture News, 41(3):249–260, 2013.
[29] Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. Dyser: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, (5):38–51, 2012.
[30] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit, volume 28. ACM, 2000.
[31] Timothy J Callahan, John R Hauser, and John Wawrzynek. The Garp architecture and C compiler. Computer, 33(4):62–69, 2000.
[32] Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles, and Krisztian Flautner. An architecture framework for transparent instruction set customization in embedded processors. In ACM SIGARCH Computer Architecture News, volume 33, pages 272–283. IEEE Computer Society, 2005.
[33] Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R Reed Taylor. Piperench: A reconfigurable architecture and compiler. Computer, 33(4):70–77, 2000.
[34] Mahim Mishra, Timothy J Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C Goldstein, and Mihai Budiu. Tartan: Evaluating spatial computation for whole program execution. ACM SIGOPS Operating Systems Review, 40(5):163–174, 2006.
[35] Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. Enerj: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, volume 46, pages 164–174. ACM, 2011.
[36] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 449–460. IEEE Computer Society, 2012.
[37] Lawrence McAfee and Kunle Olukotun. Emeuro: A framework for generating multi-purpose accelerators via deep learning. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 125–135. IEEE Computer Society, 2015.
[38] Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. Snnap: Approximate computing on programmable SoCs via neural acceleration. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 603–614. IEEE, 2015.
[39] Scott Sirowy and Alessandro Forin. Where's the beef? Why FPGAs are so fast. Microsoft Research, Microsoft Corp., Redmond, WA, 98052, 2008.
[40] Mehrzad Samadi, Janghaeng Lee, D Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 13–24. ACM, 2013.
[41] Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, volume 42, pages 35–50. ACM, 2014.
[42] Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh. Neural acceleration for GPU throughput processors. In Proceedings of the 48th International Symposium on Microarchitecture, pages 482–493. ACM, 2015.
[43] Eric S Chung, John D Davis, and Jaewon Lee. Linqits: Big data on little clients. In ACM SIGARCH Computer Architecture News, volume 41, pages 261–272. ACM, 2013.
[44] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ú. Erlingsson, Pradeep Kumar Gunda, Jon Currey, Frank McSherry, and Kannan Achan. Some sample programs written in DryadLINQ. Technical report, Tech. Rep. MSR-TR-2008-74, Microsoft Research, 2008.
[45] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jordan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24. IEEE, 2014.
[46] David Slogsnat, Alexander Giese, Mondrian Nüssle, and Ulrich Brüning. An open-source HyperTransport core. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(3):14, 2008.
[47] Liu Ling, Neal Oliver, Chitlur Bhushan, Wang Qigang, Alvin Chen, Shen Wenbo, Yu Zhihong, Arthur Sheiman, Ian McCallum, Joseph Grecco, et al. High-performance, energy-efficient platforms using in-socket FPGA accelerators. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 261–264. ACM, 2009.
[48] Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. Efficient complex operators for irregular codes. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 491–502. IEEE, 2011.
[49] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. Qscores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pages 163–174. ACM, 2011.

[50] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30(4):473–491, 2011.
[51] Philippe Coussy, Daniel D Gajski, Michael Meredith, and Andres Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, (4):8–17, 2009.
