Compiler Aspects of Hardware Accelerators

Xiaochu Liu Department of Computer Science and Engineering University of California San Diego La Jolla, California 92092 Email: [email protected]

Abstract—Hardware accelerators are first-class building blocks in modern computing devices. Ranging from data centers to mobile devices, hardware accelerators are designed and developed in order to run certain applications more efficiently. The interaction between hardware accelerator design and compiler support has become critical in order to make systems more efficient. In this report, I describe key efforts in hardware accelerators and their philosophy for interacting with compilers. Then I describe my current work on building tools to generate and integrate hardware accelerators automatically.

1. INTRODUCTION

Chip technology scaling is now a limiting factor of hardware system efficiency [1], [2], [3], [4]. Previously, in the classical scaling era, as the transistor size shrank, the power needed to drive more transistors on the same chip area did not change [5]. Hence the frequency increase came for free. However, in the leakage scaling era (where we are right now), the threshold voltage (Vt) no longer shrinks, which stops the supply voltage (Vdd) from shrinking any further (refer to Table 1). Given a constraint on the supply power of the chip, only part of the chip can be actively switched on and off at the same time. The other portions, which have to stay off due to the limitation of supply power, are idle chip area and are referred to as dark silicon. Trading the dark silicon for hardware accelerators is generally a profitable choice in terms of reducing power consumption.

Table 1: Technology scaling table [3].

Param  Description            Relation            Classical scaling  Leakage scaling
B      power budget           -                   1                  1
A      chip size              -                   1                  1
Vt     threshold voltage      -                   1/S                1
Vdd    supply voltage         ~Vt                 1/S                1
tox    oxide thickness        -                   1/S                1/S
W, L   transistor dimension   -                   1/S                1/S
Isat   saturation current     W*Vdd/tox           1/S                1
p      device power           Isat*Vdd            1/S^2              1
Cgate  gate capacitance       W*L/tox             1/S                1/S
F      device frequency       Isat/(Cgate*Vdd)    S                  S
D      devices per chip       A/(W*L)             S^2                S^2
P      full chip power        D*p                 1                  S^2
U      utilization            B/P                 1                  1/S^2

Hardware accelerators are common building blocks nowadays. In addition to specialized functional units like the floating-point unit, processors have added AVX for vector processing and AES-NI for cryptographic operations [6], [7]. ARM has instruction set extensions to support AES encryption/decryption and advanced SIMD (NEON), which map onto hardware accelerators [8]. ARM also supports coprocessor (hardware accelerator) integration by physical register mapping. GPUs are essentially large external accelerators that perform intensive parallel tasks. The latest Apple A9 chip has the M9 as an accelerator to gather data from sensors (accelerometer, gyroscope, compass, and barometer) or even receive voice commands [9]. Adding hardware accelerators has become the consensus in industry and academia for increasing system efficiency in the face of dark silicon.

These hardware accelerators can be built as extensions for CPUs/GPUs through either an ASIC (Application-Specific Integrated Circuit) or an FPGA (Field-Programmable Gate Array). CPUs and GPUs themselves are general-purpose and trade efficiency for flexibility. For CPUs, the parallelism and memory bandwidth are limited. GPUs have a massive amount of parallelism; however, they are power hungry and show varying performance across domains. Accelerators built in an ASIC or FPGA can remedy the shortcomings of CPUs and GPUs. Though an ASIC requires higher design and manufacturing costs, it can be specialized for high performance and low power for a particular application. Since it is rare to make any change after computations are hardened into an ASIC, flexibility of the design becomes very important for domain-specific ASIC accelerators. An FPGA is less energy efficient than an ASIC but is re-programmable, which lowers the cost and provides an approach to prototype the hardware before the actual manufacturing.

High-Level Synthesis (HLS) converts programs written in high-level programming languages into hardware written in Hardware Description Languages (HDL). It automates the process of hardware design by shifting the burden of hardware design to its software counterpart. Several HLS tools are available as products or research prototypes - BlueSpec, Catapult C, C-To-Silicon, LegUp, Vivado [10], [11], [12], [13], [14]. The generated hardware can be mapped to an FPGA or an ASIC. Leveraging HLS tools to generate hardware accelerators and integrate them into systems can significantly reduce the design effort.
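As a concrete illustration of the kind of input HLS tools consume, the sketch below shows a small FIR-filter loop in C. This is a hypothetical example of my own (the function and signal names are not taken from any of the cited tools): an HLS tool would schedule the multiply-accumulate operations into clock cycles and emit an equivalent RTL datapath.

#include <stddef.h>

/* 8-tap FIR filter: the inner loop body is the kind of regular,
 * side-effect-free computation that HLS tools map to a pipelined
 * multiply-accumulate datapath in hardware. */
void fir8(const int coeff[8], const int *x, int *y, size_t n) {
    for (size_t i = 7; i < n; i++) {
        int acc = 0;
        for (int t = 0; t < 8; t++) {
            acc += coeff[t] * x[i - t];   /* multiply-accumulate */
        }
        y[i] = acc;
    }
}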

This report explores the design decisions of key hardware accelerator-based systems and focuses on their compiler aspects. I synthesize the compiler aspects of these efforts and divide the accelerators into three categories - ISA-based, configuration-based, and automatically generated. A brief introduction of my current work on generating and integrating hardware accelerators automatically is also presented at the end.

The rest of the report is organized as follows. In Sections 2, 3, and 4, the designs of key hardware accelerator-based architectures are described and their compiler aspects are summarized. In Section 5, I introduce my recent work on building tools to generate hardware accelerators for irregular code. In Section 6, I summarize the content and give some future directions on improving the compiler aspects of hardware accelerators.

2. ISA-based accelerators

Firstly, I describe key efforts in ISA-based hardware accelerators. They serve various domains and make different ISA design choices. The ISA designs target high flexibility while maintaining the specialization of the hardware accelerator. I summarize the design decisions on compiler aspects at the end of this section.

2.1. Large-scale neural network accelerator

Diannao is a hardware accelerator targeting large-scale CNNs (Convolutional Neural Networks) and DNNs (Deep Neural Networks) [15], [16], [17]. It focuses on increasing performance and energy efficiency by reducing the run-time memory footprint. The Diannao design is synthesized into hardware in order to benchmark its power.

Figure 1: Diannao system architecture [15].

Diannao is designed for large-scale CNNs and DNNs. For small-scale neural networks, memory is only used to store the input and the output result; all the neurons and synapses are hardened in the accelerator, which minimizes the execution and communication overhead between neurons. However, this low-overhead design does not scale, since a large neural network would take too much hardware die area. A scalable design should involve memory accesses in the middle of the computation. Diannao therefore buffers input neurons (NBin), output neurons (NBout), and synaptic weights (SB) on chip and streams them through a Neural Functional Unit (NFU) under a control processor (CP), as shown in Figure 1. NBin supports rotations, which enables input neurons to be reused for different output neurons and avoids reloading the same neurons again and again.

2.2. Convolution-operation accelerator

Convolution Engine (CE) is a customized hardware accelerator for the convolution operation, which is composed of map and reduce steps [18]. Using SIMD machines to compute convolutions needs many registers (quadratic in the size of the block). GPGPUs increase performance by 10 times compared with a SIMD machine but cost 100 times more energy [19]. CE is designed to perform this computation pattern efficiently by removing the unnecessary operations.

The convolution pattern is widely used in computational photography, image and video processing. A standard discrete 2-dimensional convolution has the general formula:

(Img * f)[n, m] = Σ_l Σ_k Img[k, l] · f[n−k, m−l]   (1)

Function f is a filter and Img is a mapping from locations to pixel values. The formula contains a map (the product of the filter function and the pixels) and a reduce (the summation of all the products) operation. Abstracting the two operations to a more general format:

(Img ⊗ f)[n, m] = Reduce_{|l|<R, |k|<R} Map(Img[k, l], f[n−k, m−l])   (2)

and, in its most general form:

σ_R(map(x_i, y_i))   (3)

where map is the pairwise operation over the inputs and σ_R is the reduce over a window of size R.
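The map/reduce abstraction in Equations (1)-(3) can be made concrete with a small sketch. The code below is an illustrative software model of my own, not code from the Convolution Engine work: the map and reduce steps are passed in as function pointers, so the same loop nest expresses standard convolution (multiply/add) as well as other stencil-like patterns.

typedef int (*map_fn)(int a, int b);      /* pairwise "map" operation       */
typedef int (*reduce_fn)(int acc, int v); /* associative "reduce" operation */

static int mul(int a, int b) { return a * b; }
static int add(int acc, int v) { return acc + v; }

/* Generalized windowed map-reduce over a (2R+1)x(2R+1) neighbourhood,
 * in the spirit of Equation (2). The caller must ensure the window
 * stays inside the image. */
int conv_point(const int *img, const int *flt, int width, int n, int m,
               int R, map_fn map, reduce_fn reduce) {
    int acc = 0;
    for (int l = -R; l <= R; l++) {
        for (int k = -R; k <= R; k++) {
            int pixel = img[(m + l) * width + (n + k)];
            int coeff = flt[(R - l) * (2 * R + 1) + (R - k)]; /* flipped kernel */
            acc = reduce(acc, map(pixel, coeff));
        }
    }
    return acc;
}

/* Standard convolution at (n, m): conv_point(img, flt, w, n, m, R, mul, add). */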

2.3. Vector accelerator

HWACHA is a vector extension for RISC-V processors [20], [21], [22]. It is invoked in the commit stage of the processor pipeline. A vector instruction is issued to the Hazard and Sequencer units in HWACHA and stays there until it is ready. Then the instruction is fed into an expander, which breaks a single vector instruction into bank micro-ops. Each bank has an SRAM buffer and a vector functional unit (for simple operations). Computation flows from one bank to the next in a systolic way. Larger functional units (integer multipliers, floating-point operators) are shared across banks. The RISC-V core and the vector machine have separate instruction memories, while they share the same data memory (cache). Based on the public source code of their compiler, vector operations are identified and mapped by the compiler automatically. The auto-vectorization compiler pass is able to convert loops into vector primitives based on static code analysis. The vector primitives are then mapped to HWACHA vector instructions by the compiler backend.

2.4. Heterogeneous accelerator

Big challenges exist in building next-generation energy-efficient chip multi-processors [23]. 10X10 is a tiled architecture, each of whose tiles includes six micro-engines and a general-purpose RISC core [24]. Unlike traditional accelerators, the customized micro-engines are tightly coupled, sharing an L1 data cache and local memory. Switching between micro-engines is achieved with a special instruction (under software control) that transfers execution to a different micro-engine. All program state exists as a single image in memory, and the program is a collection of specialized instruction sequences for the micro-engines.

Figure 2: 10x10 tiled architecture [24].

The micro-engines can be roughly divided into two categories - compute-intensive and data-intensive. Compute-intensive engines include Bit-Nibble-Byte (BnB), Fast-Fourier-Transform (FFT), and Vector-Floating-Point (VFP). BnB is a flexible SIMD vector processing unit as wide as 256 bytes. FFT is a widely used operator in image, audio, and digital signal processing applications. VFP supports 2048-bit vectorized floating-point operations. Data-intensive engines include Sort, Data-Layout-Transformation (DLT), and Generalized-Pattern-Matching (GenPM). DLT facilitates data movement (scatter, gather) between main memory and local memory. GenPM is designed for finite-state-machine based applications. Sort is a universal algorithm in many applications. The micro-engines share the same L1 data and L2 unified caches.

Each micro-engine has customized instructions designed for easy integration and programmability. The compiler invokes each micro-engine through the ISA extensions it provides. The instructions of each micro-engine include both data movement and actual processing, which helps improve energy efficiency. Each instruction is built into the compiler as an intrinsic, and programmers interact with the micro-engines in a fine-grained way by calling these intrinsics.

2.5. Video processing accelerator

This ASIC design improves energy efficiency through algorithm-specific optimizations [19]. Rather than designing energy-efficient CPUs [25], it attempts to create customized ASICs using a new approach: an extensible, processor-based full-chip multiprocessor generator is used to explore the design space [26]. Five functions are extensively used in the H.264 application. Integer Motion Estimation (IME) matches each image block to a previous image to represent the motion. Fractional Motion Estimation (FME) refines the result from IME. Intra Prediction (IP) produces a prediction for the current block based on the predictions of its neighbouring blocks. Transformation and Quantization (DCT/Quant) generates coefficients for an image block to reconstruct the pixels. Context Adaptive Binary Arithmetic Coding (CABAC) encodes the coefficients from the previous stage. The ASIC pipelines these functions into customized logic stages. The optimizations on each stage focus on three aspects of the register file, instruction decoding, and data path. Firstly, SIMD width and precision are customized for parallelism. Secondly, instruction fusion is applied for better instruction density. Thirdly, operations are grouped into blocks, also for better instruction density. After these optimizations, the overheads of instruction fetch and register file access found in a general-purpose CMP are removed almost entirely.

This architecture is generated with Tensilica's compiler. Tensilica's custom TIE language is used to customize the software-visible parameters and the datapath. The ISA extension allows the designer to specify VLIW slots, SIMD width, register files, and instruction extensions. The TIE compiler generates intrinsics automatically for the customized instructions (if any). Original algorithms can be adapted to call these intrinsics to leverage the ISA enhancements directly, in addition to the architectural benefits.
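To illustrate the intrinsic-style interface shared by Convolution Engine, HWACHA, 10x10, and the H.264 design, the sketch below shows how a scalar loop might be rewritten to call a compiler intrinsic that maps one-to-one onto a custom instruction. The intrinsic name and its semantics are hypothetical, chosen only for illustration; real TIE- or CE-generated intrinsics have their own names and operand conventions.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical intrinsic: in a real toolchain this prototype would be
 * emitted by the ISA-extension compiler and lower to one custom
 * instruction (here, a 16-way 8-bit saturating add). It is a plain C
 * stand-in so the sketch compiles. */
static inline void acc_vadd_u8x16(uint8_t *dst, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < 16; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);  /* saturate */
    }
}

/* Scalar loop rewritten to invoke the intrinsic once per 16-byte block. */
void saturating_add(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        acc_vadd_u8x16(dst + i, a + i, b + i);   /* one custom instruction per block */
    for (; i < n; i++) {                          /* scalar tail */
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}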

2.6. Database accelerator

The Data Processing Unit (DPU) is a pipelined architecture designed for handling data processing requests (queries) [27], [28]. It defines an Instruction Set Architecture (ISA) and explores the design space of architectures that implement it. The architecture contains eleven tiles for performing eleven data-processing related operations. The partitioner splits a large table into smaller ones. The joiner performs the inner-join operation of two tables. The ALU performs operations on two SIMD column registers to produce one column register. The boolGen compares a column with a constant and generates a new column for the result. The columnfilter takes a column and a bool column as input and produces a new column with rows ruled out based on the bool column. The aggregator aggregates the values of a column. The column selector and stitcher extract columns from a table or aggregate columns into a table. The column concatenator concatenates two columns into a new column with entries from both columns. The appender appends one table to another table with the exact same schema. Memory is used for communication among the tiles (storing temporary and final results).

DPU is aimed at data processing, and so is its programming model. Programmers have to program each query using assembly instructions, since there is no compiler or parser to handle queries written in the Structured Query Language (SQL). The benchmark (TPC-H) has to be modified in order to satisfy the hardware constraints.
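Because the DPU has no SQL front-end, a query plan has to be expressed by hand as a sequence of tile operations. The sketch below is a purely illustrative software model of two of the tiles (boolGen and columnfilter) applied to a simple selection query; the function names and the column representation are mine and do not correspond to the actual DPU assembly.

#include <stdio.h>
#include <stdbool.h>

#define ROWS 5

/* boolGen tile: compare a column against a constant, producing a bool column. */
static void bool_gen_gt(const int *col, int constant, bool *out, int rows) {
    for (int r = 0; r < rows; r++)
        out[r] = col[r] > constant;
}

/* columnfilter tile: keep only the rows whose bool entry is set. */
static int column_filter(const int *col, const bool *keep, int *out, int rows) {
    int n = 0;
    for (int r = 0; r < rows; r++)
        if (keep[r]) out[n++] = col[r];
    return n;
}

int main(void) {
    /* "SELECT price FROM items WHERE price > 40", mapped by hand onto two tiles. */
    int price[ROWS] = {10, 55, 42, 7, 99};
    bool keep[ROWS];
    int result[ROWS];

    bool_gen_gt(price, 40, keep, ROWS);               /* tile 1: boolGen      */
    int n = column_filter(price, keep, result, ROWS); /* tile 2: columnfilter */

    for (int i = 0; i < n; i++) printf("%d\n", result[i]);
    return 0;
}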

2.7. Summary

The accelerators described above make different choices regarding their ISAs. Diannao has a blurry interface to the compiler. The input, output, and intermediate results are stored in scratchpad memory. The control processor (CP) supports five control instructions to issue DMA requests to the buffers and to perform arithmetic computations on the NFU. These instructions can control the topology of the neural network. They have to be hand-coded and are stored in an SRAM attached to the CP. Convolution Engine adds instruction set extensions and leverages compiler intrinsics to use these extensions. The intrinsics are essentially a one-to-one mapping from programming language constructs to architectural instructions. Configuration instructions are used to set the size of the kernel and the ALU operation types. Load/store instructions are used to orchestrate data movement between the appropriate register files and memory. Convolution instructions are used to perform the actual convolutions on the data in the register files. HWACHA uses an intrinsic-style approach (like Convolution Engine). 10X10 uses another intrinsic-style approach, but it uses the Tensilica LISA compiler to ease development. DPU defines an ISA, but applications are hand-coded without a query compiler; each dynamic instruction of the ISA corresponds to an instance of one type of tile (accelerator). The H.264 design uses yet another intrinsic-style approach: it uses the Tensilica TIE compiler to generate the hardware ISA extensions and the compiler intrinsics for programmers to use.

ISA-based accelerators borrow design wisdom from general-purpose processors. These accelerators either have a decoder inside the accelerator (DianNao) or share the decoder of the host processor (10x10). Instructions are stored in instruction memories (the host processor's instruction memory or small DRAMs attached to the accelerators). Domain-specific computations (SQL, etc.) need no support for complex control flow (Q100). Arbitrary computations (C++, etc.) need support for arbitrary jumps, which can be built as ISA extensions of the host processor (Convolution Engine). In the case of an ISA extension, the accelerator needs no initialization (configuring computations) time. In the other case of a stand-alone ISA, the accelerator still has to spend time being initialized with instructions. ISA-based accelerators can only perform whatever operations the ISA supports.

3. Configuration-based accelerators

In this section, I describe another category of hardware accelerators - configuration-based accelerators. They target different application domains and leverage different domain knowledge and technologies for their configurations. I summarize the design decisions on compiler aspects at the end of this section.

3.1. Dynamically configurable accelerator

Dynamically Specializing Execution Resources (DySER) is a hardware accelerator integrated into an out-of-order processor pipeline [29]. It achieves specialization in both functionality and parallelism. Functionality specialization provides dedicated datapaths for certain functions [30], [31], [32], [33], [34]. DySER uses an array of heterogeneous functional units with configurable switches in order to achieve functionality specialization. Such a design eliminates the decode, fetch, commit, and register file read/write that any instruction would otherwise incur in a processor, by combining sequences of instructions into one big operation. A valid-credit system is used in DySER to support pipelined execution. In order to achieve parallelism in DySER, a few compiler optimizations are applied to the original code. Loop unrolling (UNR) and scalar expansion (SCX) are used to grow the parallel code region when the region is too small to fully utilize DySER. Subgraph mapping is used to break large regions into smaller ones that fit into DySER. Strip-mining (STR) and vector port mapping (VEC) are used to vectorize the DySER communication.

The compiler plays the role of optimizing the code and generating the DySER configurations. It first performs transformations on the code to increase its parallelism. Then it identifies the regions of code and generates DySER configurations out of them. Finally, communication instructions are inserted into the original code to configure and invoke DySER.
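The sketch below illustrates, at the C level, the kind of transformation the DySER compiler performs: the memory accesses of a region stay on the processor as a load/store slice, while the computation slice is replaced by sends to and receives from the array. The dyser_send/dyser_recv helpers and the "configured" function are hypothetical stand-ins for the real DySER interface, included only so the example is self-contained.

#include <stddef.h>

/* Hypothetical communication primitives; on real hardware these would be
 * the instructions that move values between the pipeline and the DySER
 * array. Here the array is emulated in software so the sketch runs. */
static int dyser_port;                        /* single emulated input port */
static void dyser_send(int v) { dyser_port = v; }
static int  dyser_recv(void)  { return dyser_port * dyser_port + 1; } /* configured datapath: x*x + 1 */

/* Original region:    for (i) out[i] = in[i] * in[i] + 1;
 * Transformed region: loads/stores remain on the core, the arithmetic
 * runs on the (emulated) configured array.                             */
void region_dyser(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dyser_send(in[i]);     /* load slice: feed operand to the array        */
        out[i] = dyser_recv(); /* computation done by the configured datapath  */
    }
}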

3.2. General-purpose neural networks accelerator

This system includes programming language support [35], a compiler, NPUs (neural processing units), and the interface between the host processor and the NPUs [36]. Neural networks turn out to be very useful for emulating imperative code in software as well [37]. The programming language and compilation framework train the neural networks and transform the original code to take advantage of them. The host processor, which runs the non-approximate part of the program, invokes the NPUs through special instruction set extensions.

Figure 3: The Parrot transformation flow for NPU [36].

The programming model gives the programmers total control over the approximation. The approximate part has to be hot code; otherwise the overhead of invoking the NPU might outweigh the benefit it brings. The target is annotated by programmers as a potential approximate part. The compiler can test the output quality of each annotated part to decide whether to approximate it or not. The annotated part has to be pure, which means no function call or global data access is allowed. The input and output sizes have to be fixed and known. A pointer input is transformed into the data it points to, whose size is fixed and known.

In order to train the neural networks, training data is collected for the annotated kernel. The compiler instruments the annotated functions and records the input-output pairs of executions over the training data. An MLP (multi-layer perceptron) model is trained on the recorded data using the back-propagation algorithm. The compiler generates a configuration for the NPU and the instructions to invoke it in the main program. The configuration contains the topology parameters and the weights. The NPUs are configured by the compiler-generated configurations, and the main program invokes them when it runs into the annotated functions.

The system consists of NPUs that are integrated into the pipeline of the host processor. The NPU communicates with the host processor via FIFOs. Three FIFOs are used: for sending/receiving the configurations of the NPUs, for sending the inputs to the NPUs, and for receiving the outputs from the NPUs. Four instructions were added as ISA extensions to manipulate the three FIFOs (enq.c, deq.c, enq.d and deq.d). The instruction scheduler treats all NPU instructions as accesses to a single architectural register, which makes sure that all the NPU instructions are issued in order. It issues an enqueue request only when the queue is not full and a dequeue request only when the queue is not empty. The NPU starts execution when the input FIFO is full.

The NPU is implemented as a digital design on an FPGA or ASIC. Two levels of approximation are reflected by the nature of neural networks and by approximate hardware circuits. In this particular design, each NPU has eight PEs (processing engines), a config FIFO, an input FIFO, an output FIFO, and a scheduling buffer. This design is determined by speedup testing. The scheduling buffer organizes the execution order of the neurons, each of which is allocated to a PE. Each PE contains a weight buffer, an input buffer, an output buffer, and a sigmoid unit that computes the activation function.

The evaluation is performed on a benchmark suite of diverse domains. It achieves a 2.3x speedup and an energy saving of 3.0x with a quality loss of less than 10% for the whole application.

3.3. Neural network SoC accelerator

SNNAP (systolic neural network accelerator in programmable logic) leverages the FPGA on an SoC to accelerate computations that are error-tolerant using neural networks [38]. FPGAs have been demonstrated to be good at accelerating certain algorithms [39]. SNNAP uses one of two proposed compilation approaches to transform regions in the source code, generating code that invokes the neural network accelerators (with the same parameters as in the original source code). The software batches the invocations and sends them to the accelerators to increase throughput, due to the latency between the host processor and the accelerators. A callback function is then written to copy the results back to the host processor. The invocation is asynchronous, which means the execution of the host processor and the accelerators can interleave.

Two compilation approaches were introduced. The first approach uses an approximate programming language to annotate the approximate part of the program. The annotated part, which is usually a function, is replaced with an invocation of the neural processing unit instead of the original function. The second approach provides a lower-level interface, an API for programmers to specify the communication between the CPU and the accelerators and the computations on the accelerators. This approach uses an asynchronous model where the program configures and starts the accelerator and waits until the results come back from it. The first approach is more automatic, while the second one provides more fine-grained control over exactly how the CPU interacts with the accelerators.
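A minimal sketch of the second, lower-level SNNAP-style interface is shown below. The snnap_* function names are hypothetical placeholders for the C library described above (configure once, enqueue a batch of inputs asynchronously, then wait and read results back); they illustrate the batched, asynchronous model rather than the actual SNNAP API.

#include <stddef.h>

/* Hypothetical accelerator library, emulated here so the sketch compiles.
 * A real implementation would talk to the FPGA over the SoC interconnect. */
static float npu_out[1024];
static size_t npu_count;

static void snnap_configure(const float *weights, size_t n) { (void)weights; (void)n; npu_count = 0; }
static void snnap_enqueue(const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        npu_out[npu_count + i] = in[i] * 0.5f;  /* stand-in for the neural network */
    npu_count += n;
}
static void snnap_wait_all(void) { /* block until the accelerator drains its queue */ }

void approx_kernel(const float *weights, size_t nw,
                   const float *inputs, size_t batches, size_t batch_len,
                   float *outputs) {
    snnap_configure(weights, nw);                        /* one-time configuration         */
    for (size_t b = 0; b < batches; b++)                 /* batch invocations to amortize   */
        snnap_enqueue(inputs + b * batch_len, batch_len);/* asynchronous enqueue            */
    snnap_wait_all();                                    /* overlap ends; results are ready */
    for (size_t i = 0; i < npu_count; i++)               /* "callback": copy results back   */
        outputs[i] = npu_out[i];
}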

The architectural design is built on a PSoC (Programmable System-on-Chip). It has a dual-core ARM Cortex-A9 processor and an FPGA. The accelerator is configured on the FPGA. The host processor (ARM) configures the accelerator through a GPIO (General Purpose I/O) interface. Specifically, AXI (Advanced eXtensible Interface) is used to implement DMA transfers to memory-mapped registers between the ARM and the FPGA. The inputs (parameters) are sent to the accelerator using the ACP (Accelerator Coherency Port), which allows the FPGA to request data from the host processor's cache. To invoke the accelerator, the host processor uses the ARMv7 SEV/WFE instructions.

An MLP (multi-layer perceptron) neural network is the model implemented by the accelerator. An MLP has an input layer, an output layer, and many hidden layers in between. A layer is composed of neurons, and each neuron is calculated as the weighted sum of the neurons of its previous layer passed through an activation function (often a sigmoid).

Figure 4: MLP example with 2 hidden layers. x7 = sigmoid(w47·x4 + w57·x5 + w67·x6), where wxy stands for the weight on the edge between x and y [38].
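To make the per-neuron computation concrete (compare the formula in the caption of Figure 4), the sketch below computes one MLP layer in plain C; it is an illustrative software reference of my own, not the hardware algorithm used by SNNAP.

#include <math.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One fully connected layer: out[j] = sigmoid( sum_i w[j][i] * in[i] ),
 * matching e.g. x7 = sigmoid(w47*x4 + w57*x5 + w67*x6) in Figure 4.     */
void mlp_layer(const double *in, int n_in,
               const double *w,  /* n_out x n_in weights, row-major */
               double *out, int n_out) {
    for (int j = 0; j < n_out; j++) {
        double acc = 0.0;
        for (int i = 0; i < n_in; i++)
            acc += w[j * n_in + i] * in[i];  /* weighted sum */
        out[j] = sigmoid(acc);               /* activation   */
    }
}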

The accelerator hardware implements a neural network using systolic arrays. It contains a scheduler, a bus, and many PUs (processing units). Each PU has a scratchpad memory, a controller, a sigmoid unit, and many PEs (processing elements). One PU computes the weighted product of one neuron and passes the accumulated result to the next PU for the next neuron on the same layer. Weights are stored in BRAMs (block RAMs), and temporary results are stored in a FIFO accumulator until one layer is completed.

Figure 5: Systolic scheduling of neural network layers [38].

The evaluation demonstrates that SNNAP gains an average 3.8x speedup and 2.8x energy savings with a less than 10% error rate.

3.4. GPU neural networks accelerator

Unlike the previous neural accelerators attached to CPUs, accelerators attached to GPUs face a different set of design constraints. GPUs have more cores than CPUs, and each of them is less sophisticated in terms of design. Re-applying neural accelerators designed for CPUs to GPUs directly would take too much die area, which makes the architecture less profitable. The workloads that usually run on GPUs are a good fit for approximate computing [40], [41]. The GPU neural accelerator eliminates costly operations instead of increasing parallelism [42]. It eliminates fetch/decode logic and memory accesses. It reuses the multiply-add functional units inside each GPU streaming multiprocessor (SM). It simulates the sigmoid operator in each neuron with a look-up table.

The system contains a programming language, a compilation flow, ISA extensions, and a GPU-accelerator architecture. It leverages the CUDA programming language with pragma extensions to specify the code section to approximate. For the specified code section, the inputs are the values that are live and referenced inside the section, and the outputs are the values that are live and modified in the section. To get the training/testing data set, the code section is compiled and executed to collect input/output pairs. The neural network model to be trained is a limited-size multi-layer perceptron (MLP). The compilation flow replaces the original code section with an invocation of the trained neural accelerator (built in hardware). An annotated code section looks like the following:

uchar4 p = tex2D(img, x, y);
...
#pragma(begin_approx)
a = min(r, min(g, b));
b = max(r, max(g, b));
z = ((a + b) > 254) ? 255 : 0;
#pragma(end_approx)
...
dst[img.width * y + x] = z;

The architecture attaches the accelerators to each GPU streaming multiprocessor (SM) instead of to each processing unit (PU). Each SM has a set of accelerators with a shared weight buffer and control logic. Each PU inside an SM has its own sigmoid operator (simulated by a LUT) and input/output buffers (for the intermediate results between neuron layers and the final result).

All the PUs inside an SM execute in a lock-step mode controlled by the shared logic. The quality degradation is controlled by the invocation rate of the neural accelerators, which is the number of warps invoking the neural accelerators over the total number of warps. Based on the evaluation, the neural accelerators achieve a 2.4x performance speedup and 2.8x less energy consumption within a quality loss of less than 10%.

3.5. Query processing accelerator

LINQits is a Hardware Template (HAT) based accelerator for data processing languages [43]. The accelerator can be implemented using either an FPGA-based SoC or an ASIC, with different design flavors. It leverages LINQ, which is an advanced query language from Microsoft's .NET framework [44]. In addition to traditional SQL, it supports some user-defined anonymous functions and supports lazy evaluation to avoid unnecessary queries. The compilation process includes a query plan optimizer, mapping to the hardware template, and run-time scheduling. The reconfigurable part of the accelerator resides in the post- and pre-cores. The ARM core is responsible for initializing the partitions of the Partition Reader, which has to be explicitly coded by the user (the paper does not explain this part in detail). The high-level synthesis tool AutoESL is used to generate the post- and pre-cores.

Figure 6: LINQits hardware templates [43].

3.6. Datacenter search engine accelerator

Catapult is an FPGA-based reconfigurable architecture for accelerating large-scale datacenter applications [45]. Instead of using the native bus provided by the host CPU like other FPGA-based CPU systems [46], [47], Catapult designed its own PCIe bus driver. The FPGA is placed on a board along with 8 GB of DRAM, PCIe, and inter-FPGA connections. A rack consists of a group of 24 1U servers, with a torus network connecting the FPGA boards. The inter-FPGA connection speed can go up to 20 Gb/s. This infrastructure has been deployed to production servers in the company.

The programming interface needs to consider both the software interface and the board resource interface. The CPU allocates a buffer in user-level memory space to communicate with the FPGA. The FPGA performs DMA requests to the buffer by periodically checking the full bit. A shell is developed to manage all the FPGA board resources. Developers only need to write the role part, which is the actual application logic. Though programmers do not need knowledge of Catapult's board resources, knowledge of a Hardware Description Language (HDL) is still necessary. From this point of view, Catapult is programming friendly mostly for hardware designers.

3.7. Summary

The accelerators described above have different considerations for their configuration designs. DySER heavily depends on the compiler to generate code. The compiler identifies memory sub-regions, which include the load/store instructions, and converts them into instructions that invoke DySER. The compiler also identifies computation sub-regions and converts them into configurations that can be used to configure DySER. The configuration happens some time before the invocations. NPU uses the compiler to generate code as well as to generate configurations. The compiler transforms the annotated code (a function) into a neural network by performing an off-line training process on some training data. The generated configuration includes the weights of each neuron (the neural network topology is fixed). The compiler instruments the CPU code to configure, send data to, and invoke the accelerators. ISA extensions are added to support the communication of data and configurations. SNNAP uses a similar approach, but in addition it builds a C library to wrap the invocations into an asynchronous streaming model. A similar approach is taken by the GPU neural accelerator, except that it annotates a code region inside a function instead of the entire function. Catapult does not leverage a compiler at the moment. The programmer has to have domain knowledge of both the application and the hardware, but knowledge of this particular platform is unnecessary since a shell is developed to hide the details. LINQits leverages a domain-specific language (LINQ) for compilation and uses HLS to generate configurations for the pre- and post-cores.

Configuration-based accelerators need a configuration stage before performing the actual computation. The configurations can be stored in registers in the accelerators (DySER, NPU, etc.). An interface is used or designed to send the configurations to the accelerators efficiently. With the ability to be reconfigured, the accelerators become more flexible and gain the capability to perform more operations. A challenge is to schedule the configuration stage to overlap with computation so that it does not affect overall efficiency. Correctly designing the configuration point is essential to reduce the total amount of configuration.

Configuration-based accelerators are flexible and efficient. They remove instruction-related overhead by having no ISA at all. They achieve flexibility by setting the reconfigurable parameters in the accelerators. A design space exists to balance the reconfigurable parameters. For accelerators attached to the host processor's pipeline (DySER) and the ones attached to the same bus as the host (NPU, SNNAP), configuration has different designs.

The former is more sensitive to configuration overhead if the accelerator is designed to change configurations regularly. The latter is less sensitive, since the configuration usually happens once and can be reused over many iterations.
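This trade-off can be made concrete with a small back-of-the-envelope model (illustrative numbers only, not measurements from any of the systems above): if a configuration costs C cycles and each invocation does W cycles of useful work, reconfiguring per invocation adds C/W relative overhead, while a one-time configuration amortizes to C/(N*W) over N invocations.

#include <stdio.h>

/* Illustrative overhead model for configuration-based accelerators. */
int main(void) {
    const double C = 1000.0;   /* cycles to load one configuration (assumed)    */
    const double W = 200.0;    /* cycles of useful work per invocation (assumed)*/
    const double N = 10000.0;  /* invocations that reuse the same configuration */

    double per_invocation = C / W;       /* reconfigure every time */
    double amortized      = C / (N * W); /* configure once, reuse  */

    printf("relative overhead, reconfigure each time: %.2f\n", per_invocation);
    printf("relative overhead, configure once:        %.6f\n", amortized);
    return 0;
}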

4. Automatically generated accelerators

The third category of hardware accelerators is the automatically generated ones. These accelerators use a toolchain to convert pieces of software into their functionally equivalent hardware counterparts.

4.1. ASIC-based accelerators

The utilization wall prohibits scaling the frequencies of microprocessor designs. Due to the limitation on the supply voltage (related to the threshold voltage), power does not scale down along with the scaling-down of transistor dimensions. As transistor dimensions shrink, the power budget reaches a point where only a portion of the transistors on a chip die can be powered up to switch at one time. One real-world indication from Intel is that their CPU frequencies have been stable since 2004; another is the support for turbo mode, which boosts one core's frequency by turning off all the other cores.

An effective approach to conquering this problem is Conservation Cores, proposed by UCSD [3], [48], [49]. It trades chip area for lower energy consumption and uses application-specific accelerators to remove the inefficiencies of general-purpose processors (instruction fetch, register file access, etc.). It contains a general-purpose host processor and a group of c-cores, each of which performs a particular function that is offloaded from the host. The host and the c-cores communicate through the L1 cache and scan chains. The L1 cache consistency between them is maintained explicitly by forcing a particular memory access order. The scan chains are used by the host to change any state inside a c-core (register values, control signals, etc.). Patching is also supported through the scan chains in case the application is updated.

Figure 7: Conservation core system architecture [3].

The c-cores are generated by a toolchain from high-level application source code. A profiling tool identifies the hot regions (functions) of the program, and the toolchain turns them into c-cores. At runtime, the host processor initializes a c-core with arguments and other data through the scan chains and then starts the c-core. The functions executed on c-cores are more energy-efficient than on the host, since the overhead incurred by the pipeline stages and instruction memory accesses (instruction fetch latency, branch mis-prediction penalty, etc.) in a general-purpose processor is removed.

The essential part of the toolchain is a translation tool. It takes arbitrary functions written in the C language as input and produces c-core hardware written in a Hardware Description Language (HDL). It uses the selective depipelining technique to schedule instructions [48]. Each c-core has two clocks - a slow clock and a fast clock. The slow clock drives the execution of basic blocks, and the fast clock drives each instruction inside a basic block. One instruction can take more than one cycle to finish (loads/stores, floating point, etc.), in which case two copies of the instruction are scheduled (a send copy and a receive copy). Instructions are given priorities based on their critical paths. Then, based on priority, instructions are scheduled among the fast clock cycles. The policy is a heuristic which packs as many instructions as possible inside a fast clock cycle under certain timing constraints. Deferred instructions (those missing their timing constraints) are then scheduled again in the next fast clock cycle. All the two-stage operations have state registers to store their temporary values across the boundaries of fast clock cycles.

Based on the experiments, functions on c-cores can be up to 16x more energy efficient, while the system energy efficiency can increase by up to 2.1x.

4.2. FPGA-based accelerators

AutoPilot HLS is the predecessor of the Vivado HLS tool [14], [50]. It provides a high-level synthesis system for the integration and verification of designs. AutoPilot outputs RTL descriptions which can be used for simulation and verification. It also generates reports for power/performance/area like any other IC compiler. It supports C, C++, and SystemC as frontend high-level programming languages. Floating-point operations are mapped to precision-variant IP blocks.

AutoPilot uses commonly-used compiler techniques to optimize and generate code. It leverages the Static Single-Assignment (SSA) form of the LLVM infrastructure to perform optimizations and code generation. It uses llvm-gcc as the compiler frontend. Several LLVM passes are mentioned by this work as important for high-level synthesis optimizations. Global value numbering-based approaches, including constant propagation, dead code elimination, and redundant code elimination, are generally useful. Strength reduction replaces expensive operators with cheap ones to reduce the design constraints.
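As a small, generic illustration of why strength reduction matters for HLS (this is not code from the AutoPilot paper), the multiply below can be replaced by a shift; in hardware, this turns a multiplier instance into simple shift wiring, which reduces area and often the critical path.

#include <stdint.h>

/* Before: a multiply by a constant power of two. */
uint32_t scale_mul(uint32_t x) {
    return x * 8u;   /* would synthesize to a multiplier unless strength-reduced */
}

/* After strength reduction: the same function using a shift. */
uint32_t scale_shift(uint32_t x) {
    return x << 3;   /* synthesizes to a fixed 3-bit left shift: no multiplier needed */
}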

Range analysis shows the potential to reduce the precision (bit width) of operations. Loop-related optimizations can expose more parallelism, which improves performance.
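For example, unrolling a reduction loop (a generic sketch, not an example from the AutoPilot paper) exposes independent partial sums that an HLS scheduler can map to parallel adders:

#include <stddef.h>

/* Rolled loop: one accumulation chain, little instruction-level parallelism. */
int sum_rolled(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Unrolled by 4: four independent partial sums can be computed by parallel
 * hardware and combined at the end (n is assumed to be a multiple of 4). */
int sum_unrolled4(const int *a, size_t n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}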

improves performance. Memory related optimizations can .... FPGA reduce memory access frequencies. The HLS process needs y[n] = 0; for (i = 0; i < 8; i++) { Self-Profiling Hardware Hardware MIPS Processor to resolve an efficient scheduling, optimizations under con- y[n] += coeff[i] * x[n-i]; 1 MIPSProcessor Processor Accelerator Accelerator C Compiler } (MIPS) straints, efficient resource sharing and memory operation .... CONG et al.:HIGH-LEVELSYNTHESISFORFPGAS:FROMPROTOTYPINGTODEPLOYMENT 477 2 optimizations. AVALON INTERCONNECT Program code second-generation of HLS tools showed interesting capabilities LegUp 5 On-Chip Memory Controller Cache to raise the level of design abstraction, most designers were Altered SW binary (calls HW accelerators) Profiling Data: reluctant to take the risk of moving away from the familiar 3 Execution Cycles Off-Chip Memory RTL design methodology to embrace a new unproven one, 4 High-level Power synthesis Suggested Hardened Cache Misses µP program despite its potential large benefits. Like any major transition program segments to Figure 2: Target system architecture. segments in the EDA industry, designers needed a compelling reason or 6 FPGA fabric target to HW processor/accelerator communication across the Avalon in- event to push them over the “tipping point,” i.e., to adopt the terface or through memory. HLS design methodology. FigureFigure 9: 1: Design Design flow flow of with LegUp LegUp. [13]. The architecture depicted in Fig. 2 represents the target Another important lesson learned is that tradeoffs must be system most natural for an initial release of the tool. The ar- chitecture of processor/accelerator systems is an important made in the design of the tool. Although a designer might wish runsThe on HLS an FPGA-based tools takes three MIPS steps: processor. allocation, We evaluated schedul- direction for future research. for a tool that takes any input program and generates the “best” ing,several and publicly-available binding [51]. Scheduling MIPS processor assigns implementations each software hardware architecture, this goal is not generally practical for instructionand selected to athe particular Tiger MIPS clock processor cycle. LegUp from schedules the University each 4. DESIGN AND IMPLEMENTATION HLS to achieve. Whereas compilers for processors tend to instructionof Cambridge as soon [11], as based all itson dependenciesits full support are of met. the MIPS Bind- instruction set, established tool flow, and well-documented 4.1 High-Level Hardware Synthesis focus on local optimizations with the sole goal of increasing ing determines which hardware resource (functional units, registers,modular Verilog.etc.) the operation of an instruction uses. LegUp High-level synthesis has traditionally been divided into performance, HLS tools must automatically balance perfor- solvesThe binding MIPS processor as a bipartite has been matching augmented problem. with In extra practice, cir- three steps [4]: allocation, scheduling and binding. Alloca- cuitry to profile its own execution. Using its profiling abil- tion determines the amount of hardware resources available mance and implementation cost using global optimizations. high-cost functional units (integer multiplier, floating-point However, it is critical that these optimizations be carefully ity, the processor is able to identify sections of program code for use, and manages other hardware constraints (e.g., speed, units)that would are shared benefit more from frequently hardware than implementation. 
registers and low-cost Specif- area, and power). Scheduling assigns each operation in the implemented using scalable and predictable algorithms, keep- functionalically, the units profiling (integer results adder, drive etc.) the are selection rarely shared. of program program being synthesized to a particular clock cycle (state) ing tool runtimes acceptable for large programs and the results code segments to be re-targeted to custom hardware from and generates a finite state machine. Binding saves area understandable by designers. Moreover, in the inevitable case 4.4.the C Summary source. Profiling a program’s execution in the proces- by sharing functional units between operations, and sharing that the automatic optimizations are insufficient, there must sor itself provides the highest possible accuracy. Presently, registers/memories between variables. be a clear path for a designer to identify further optimization we profile program run-time at the function level. LegUp leverages the low-level virtual machine (LLVM) Fig. 1. AutoESLFigure 8: and AutoPilot Xilinx C-to-FPGA HLS tool design design flow. flow [50]. HavingThe hardware chosen program accelerators segments described to target above to are custom all gener- hard- compiler framework. At the core of LLVM is an inter- opportunities and execute them by rewriting the original source atedware, automatically at step ➂ LegUp from is software. invoked to C-Core compile targets these for segments ASIC mediate representation (IR), which is essentially machine- code. andto synthesizeable has been built Verilog to integrate RTL. into LegUp’s a host hardware processor. synthe- Au- independent assembly language. C code is translated into Hence, it is important to focus on several design goals for and4.3. synthesis SoC based optimizations accelerators to generate optimized synthesiz- toPilotsis and targets software for compilation FPGA. LegUp are part targets of the for same FPGA-based compiler LLVM’s IR then analyzed and modified by a series of com- a HLS tool. able RTL. SoC.framework. Presently, LegUp HLS operates at the function piler optimization passes. LLVM IR instructions are sim- AutoPilot outputs RTL in Verilog, VHDL or cycle-accurate level: entire functions are synthesized to hardware from the ple enough to directly correspond to hardware operations a) Capture designs at a bit-accurate, algorithmic level. The LegUp is a high-level synthesis system to build FPGA- Csource.TheRTLproducedbyLegUpissynthesizedto (e.g., an arithmetic computation). Our HLS tool operates source code should be readable by algorithm specialists. SystemCbased CPU-accelerators for simulation and architectures verification. [13]. To It enable is composed automatic of 5.an Our FPGA work implementation using standard commercial tools directly with the LLVM IR, scheduling the instructions into b) Effectively generate efficient parallel architectures with co-simulation,a MIPS soft processor AutoPilot and creates an automatic test bench generated (TB) wrappers hardware and at step ➃.Instep➄, the C source is modified such that specific clock cycles. LegUp HLS algorithms have been im- We are developing a framework to build accelerators minimal modification of the source code, for paralleliz- transactorsaccelerator. 
4.4. Summary

The hardware accelerators described above are all generated automatically from software. C-Cores target ASICs and have been built to integrate into a host processor. AutoPilot targets FPGAs. LegUp targets an FPGA-based SoC.

5. Our work

We are developing a framework to build accelerators and integrate them into existing systems automatically. Our work falls into the category of automatically generated accelerators; it can potentially be used in micro-controllers.
5.1. System overview

A toolchain is developed to turn software written in a high-level programming language (C) into a hardware design represented in a hardware description language (Verilog). The toolchain goes through several stages, including instruction scheduling, basic block generation, control block generation, and datapath generation, in order to generate an intermediate representation (IR) for digital circuits. The IR is then processed by a converter into an industry-standard hardware design specification (Verilog). This product can be used as input to any CAD tool to produce a hardware design ready to ship for fabrication.
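The stages above form a straight-line pipeline, which the driver sketch below mirrors. Every name here (CircuitIR, schedule_instructions(), emit_verilog(), and so on) is a hypothetical placeholder; the stubs only record the order in which the real stages would run, not their implementations.

    #include <stdio.h>

    /* Placeholder IR: the real one holds basic blocks, scheduled instructions,
     * control state machines, and datapath descriptions. */
    typedef struct {
        const char *function_name;   /* the targeted function */
    } CircuitIR;

    static void schedule_instructions(CircuitIR *ir)  { printf("[%s] schedule instructions per basic block\n", ir->function_name); }
    static void generate_basic_blocks(CircuitIR *ir)  { printf("[%s] map software basic blocks to hardware basic blocks\n", ir->function_name); }
    static void generate_control_block(CircuitIR *ir) { printf("[%s] build the control state machine\n", ir->function_name); }
    static void generate_datapath(CircuitIR *ir)      { printf("[%s] build the datapath and shared operators\n", ir->function_name); }
    static void emit_verilog(const CircuitIR *ir)     { printf("[%s] convert the IR to synthesizable Verilog\n", ir->function_name); }

    int main(void)
    {
        CircuitIR ir = { "foo" };
        schedule_instructions(&ir);
        generate_basic_blocks(&ir);
        generate_control_block(&ir);
        generate_datapath(&ir);
        emit_verilog(&ir);
        return 0;
    }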
    int foo(int *p, int a, int b, int c) {
      int t0 = p[a] * p[b];
      t0 = t0 * p[c];
      return t0;
    }
    (a)

    define i32 @foo(i32* p, i32 a, i32 b, i32 c) {
    call_conv:
      br label entry
    entry:
      t3 = getelementptr p, a
      t6 = load t3
      t14 = getelementptr p, b
      t11 = load t14
      t12 = mul t11, t6
      t25 = getelementptr p, c
      t17 = load t25
      t18 = mul t12, t17
      ret i32 t18
    }
    (b)

(c) instruction schedule for basic block 'entry': states 0 through 8, in which each load and multiply is split into a send slot and a receive slot on the shared operators (for example, t11(send) in state 2 and t11(recv) in state 3, down to t18(recv) in state 8).

(d) architecture diagram: a Control Block Module driving the Basic Block Modules for call_conv and entry.
Figure 10: Code generation process. (a) C source code; (b) software IR; (c) automatic instruction schedule for basic block 'entry'; (d) architecture diagram of the generated hardware accelerator.

The generated hardware accelerator is compatible with the host processor in terms of calling conventions. It can be invoked by software running on the host processor with overhead similar to invoking a software function. The calling-convention interface is designed to be flexible so that the effort of re-targeting the hardware accelerator to another host processor is minimized. The generated hardware accelerator is able to leverage the Split-Phase Multiplexed Operator (SPMO, explained in a later section) interface to access peripheral modules (memory, etc.) just like the host processor does. This gives the hardware accelerator enough opportunity to take over functionality usually performed by the host general-purpose processor.

5.2. Granularity and instruction scheduling

The hardware accelerator is generated at the granularity of a function: a hardware module is generated for the targeted function. Its interface supports reading and writing state inside the hardware accelerator from the host processor, as well as reading and writing peripherals from the hardware accelerator.

The toolchain schedules instructions in units of basic blocks. Inside each basic block, a data dependency graph is built over all instructions. Instruction scheduling is a two-pass process. In the first pass, each instruction is assigned a priority based on its latest starting time on the critical path. In the second pass, a work-list algorithm iterates through the instructions: each instruction is scheduled into a timing slot according to its priority, and the work-list is updated. After instruction scheduling, the software basic blocks are mapped into hardware basic blocks. The translation heuristics are illustrated in Figure 10.
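A minimal sketch of this two-pass scheme is given below under simplifying assumptions: unit latencies, a single shared functional unit per cycle, and a fixed five-instruction block. The arrays and the priority definition (length of the longest dependence chain still to execute, so an earlier latest start time means higher urgency) are illustrative and do not reproduce the toolchain's actual heuristics.

    #include <stdio.h>

    #define N     5   /* instructions, listed in topological order   */
    #define SLOTS 1   /* shared functional units available per cycle */

    static const char *name[N] = { "addr1", "addr2", "ld1", "ld2", "mul" };
    static int n_dep[N]        = {  0,       0,       1,     1,     2   };
    static int dep[N][2]       = { {0,0},   {0,0},   {0,0}, {1,0}, {2,3} };

    int main(void)
    {
        int prio[N], cycle[N], done[N] = { 0 };

        /* Pass 1: priority = longest chain of dependent instructions that
         * must still run after this one (longer chain => more urgent). */
        for (int i = N - 1; i >= 0; i--) {
            prio[i] = 0;
            for (int j = i + 1; j < N; j++)
                for (int d = 0; d < n_dep[j]; d++)
                    if (dep[j][d] == i && prio[j] + 1 > prio[i])
                        prio[i] = prio[j] + 1;
        }

        /* Pass 2: work-list scheduling; pick the most urgent ready instruction,
         * issuing at most SLOTS instructions per cycle. */
        int scheduled = 0;
        for (int c = 0; scheduled < N; c++) {
            int issued = 0;
            while (issued < SLOTS) {
                int pick = -1;
                for (int i = 0; i < N; i++) {
                    if (done[i]) continue;
                    int ready = 1;
                    for (int d = 0; d < n_dep[i]; d++)
                        if (!done[dep[i][d]] || cycle[dep[i][d]] >= c)
                            ready = 0;           /* dependency not finished yet */
                    if (ready && (pick < 0 || prio[i] > prio[pick]))
                        pick = i;
                }
                if (pick < 0) break;             /* nothing ready this cycle */
                cycle[pick] = c;
                done[pick] = 1;
                scheduled++;
                issued++;
                printf("cycle %d: %s (priority %d)\n", c, name[pick], prio[pick]);
            }
        }
        return 0;
    }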

5.3. Control logic and execution model

The control logic drives the execution of the entire hardware accelerator module. The state machine inside the control logic keeps a state for each instruction as well as for each basic block, and it is generated after instruction scheduling is done. It has state signals indicating the current basic block and the particular instruction inside that basic block. The control flow is either sequential inside a basic block or jumps to a new basic block based on the jump instruction in the previous block. It supports issuing and executing multiple instructions in the same cycle as long as they have no dependencies and do not share the same functional unit. The control logic inside a basic block is designed so that execution can be as efficient as a dataflow model, with the only limiting factor being the multiplexed external modules.

5.4. Interface between hardware accelerator and host processor

The accelerator has a standard interface to interact with the host processor and the memory system. The interface is customizable in order to achieve low-overhead access to peripherals. We designed the interface to handle an arbitrary calling convention at run-time. Taking the MIPS ISA as an example, at invocation time the host processor sets the stack pointer, the global pointer, and the registers in the accelerator. The host processor is able to change the state of the hardware accelerator (parameters and temporary values held between cycles) via a tree-structured, pipelined multiplexer. Each register in the hardware accelerator has a unique address in its own register address space, and the interface provides support for the host processor to read or modify any register in that space. More complicated calling conventions (passing more than four parameters, passing unions/structs, etc.) are handled through the data memory cache shared with the host processor. The host processor then keeps probing the attention signal from the accelerator to detect when execution has finished.

5.5. Split-phase multiplexed operator (SPMO)

The Split-Phase Multiplexed Operator (SPMO) is a scalable and efficient mechanism for hardware accelerators to access shared resources. It achieves silicon sharing across the accelerator, the host processor, and external peripherals. The toolchain generates SPMOs automatically for the hardware accelerators. An SPMO uses a valid/enable protocol to access peripherals; for each instruction that uses one, the toolchain generates a send copy and a receive copy of the instruction for scheduling purposes. The only restrictions on achieving maximum parallelism are the number of shared resources of each type and the number of their ports.
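The send/receive split can be modeled functionally as below. SpmoPort, spmo_send(), and spmo_recv() are hypothetical names for a single-outstanding-request port; the real SPMO is generated hardware with a valid/enable handshake, so this C model only captures the scheduling discipline that independent work may sit between the two phases.

    #include <assert.h>
    #include <stdio.h>

    /* Functional model of one split-phase multiplexed port: a request is issued
     * with spmo_send() and its result is collected later with spmo_recv(). */
    typedef struct {
        int busy;      /* is a request outstanding?   */
        int pending;   /* value that recv will return */
    } SpmoPort;

    static int memory[8] = { 3, 1, 4, 1, 5, 9, 2, 6 };   /* toy shared peripheral */

    static void spmo_send(SpmoPort *p, int addr)
    {
        assert(!p->busy);          /* the schedule must respect the port count */
        p->busy = 1;
        p->pending = memory[addr]; /* latency is hidden between send and recv  */
    }

    static int spmo_recv(SpmoPort *p)
    {
        assert(p->busy);
        p->busy = 0;
        return p->pending;
    }

    int main(void)
    {
        SpmoPort port = { 0, 0 };

        spmo_send(&port, 2);            /* send slot of a load              */
        int other = 10 * 10;            /* independent work while in flight */
        int loaded = spmo_recv(&port);  /* receive slot of the same load    */

        printf("loaded %d, other work %d\n", loaded, other);
        return 0;
    }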
6. Conclusion

In the leakage-dominated technology scaling era, systems have to resort to hardware accelerators to take efficiency to the next level. In order to leverage dark silicon, designers are adding more and more accelerators into their systems. The problem of integrating and managing an increasing number of hardware accelerators demands more design consideration from the compiler side. Hardware accelerators continue to be redesigned with compiler aspects in mind: the support hardware accelerators provide to compilers, and the support they require from compilers, interact to make the system architecture more efficient and cost-effective. Proposing new designs for these compiler aspects, or choosing between existing ones, plays a critical role in an accelerator's overall quality.

Automatically generated hardware accelerators provide another way of designing hardware accelerators: instead of the hardware being designed manually, the programs to be accelerated are synthesized into accelerators directly. The generated accelerators can fall into the category of configuration-based accelerators (examples of the configuration parameters are the calling conventions of the host processor's traditional ISA); however, they are generated automatically, which is fundamentally different from manually designed configuration-based accelerators. Automatic generation reduces the manual design effort of hardware, which is rooted in the fact that hardware code can rarely be reused at the scale software can, and it enables quick design exploration when integrating hardware accelerators into a new system for a new application.

References

[1] Michael B Taylor. Is dark silicon useful?: Harnessing the four horsemen of the coming dark silicon apocalypse. In Proceedings of the 49th Annual Design Automation Conference, pages 1131–1136. ACM, 2012.
[2] Michael Bedford Taylor. A landscape of the new dark silicon design regime. Micro, IEEE, 33(5):8–19, 2013.
[3] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ACM SIGARCH Computer Architecture News, volume 38, pages 205–218. ACM, 2010.
[4] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365–376. IEEE, 2011.
[5] Robert H Dennard, V L Rideout, E Bassous, and A R LeBlanc. Design of ion-implanted MOSFETs with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256–268, 1974.
[6] Chris Lomont. Introduction to Intel advanced vector extensions. Intel White Paper, 2011.
[7] Shay Gueron. Advanced encryption standard (AES) instructions set. Intel, http://softwarecommunity.intel.com/articles/eng/3788.htm, accessed 2008.
[8] http://www.arm.com/products/processors/.
[9] https://en.wikipedia.org/wiki/Apple_motion_coprocessors.
[10] http://www.bluespec.com/high-level-synthesis-tools.html.
[11] https://www.mentor.com/hls-lp/.
[12] http://www.cadence.com/products/sd/silicon_compiler/pages/default.aspx.
[13] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H Anderson, Stephen Brown, and Tomasz Czajkowski. Legup: High-level synthesis for fpga-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 33–36. ACM, 2011.
[14] http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[15] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM SIGPLAN Notices, volume 49, pages 269–284. ACM, 2014.
[16] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
[17] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 92–104. ACM, 2015.

[18] Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. Convolution engine: Balancing efficiency & flexibility in specialized computing. In ACM SIGARCH Computer Architecture News, volume 41, pages 24–35. ACM, 2013.
[19] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In ACM SIGARCH Computer Architecture News, volume 38, pages 37–47. ACM, 2010.
[20] Yunsup Lee, Andrew Waterman, Rimas Avizienis, Henry Cook, Chen Sun, Vladimir Stojanovic, and Krste Asanovic. A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators. In European Solid State Circuits Conference (ESSCIRC), ESSCIRC 2014-40th, pages 199–202. IEEE, 2014.
[21] Albert Ou, Quan Nguyen, Yunsup Lee, and Krste Asanovic. A case for MVPs: Mixed-precision vector processors.
[22] Huy Vo, Yunsup Lee, Andrew Waterman, and Krste Asanovic. A case for OS-friendly hardware accelerators. Proc. of WIVOSCA, 2013.
[23] Shekhar Borkar and Andrew A Chien. The future of microprocessors. Communications of the ACM, 54(5):67–77, 2011.
[24] Andrew A Chien, Tung Thanh-Hoang, Dilip Vasudevan, Yuanwei Fang, and Amirali Shambayati. 10x10: A case study in highly-programmable and energy-efficient heterogeneous federated architecture. ACM SIGARCH Computer Architecture News, 43(3):2–9, 2015.
[25] James Balfour, William J Dally, David Black-Schaffer, Vishal Parikh, and JongSoo Park. An energy-efficient processor architecture for embedded systems. Computer Architecture Letters, 7(1):29–32, 2008.
[26] Alex Solomatnikov, Amin Firoozshahian, Wajahat Qadeer, Ofer Shacham, Kyle Kelley, Zain Asgar, Megan Wachs, Rehan Hameed, and Mark Horowitz. Chip multi-processor generator. In Proceedings of the 44th Annual Design Automation Conference, pages 262–263. ACM, 2007.
[27] Lisa Wu, Andrea Lottarini, Timothy K Paine, Martha A Kim, and Kenneth A Ross. Q100: The architecture and design of a database processing unit. In ACM SIGPLAN Notices, volume 49, pages 255–268. ACM, 2014.
[28] Lisa Wu, Raymond J Barker, Martha A Kim, and Kenneth A Ross. Navigating big data with high-throughput, energy-efficient data partitioning. ACM SIGARCH Computer Architecture News, 41(3):249–260, 2013.
[29] Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, and Changkyu Kim. Dyser: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, (5):38–51, 2012.
[30] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit, volume 28. ACM, 2000.
[31] Timothy J Callahan, John R Hauser, and John Wawrzynek. The Garp architecture and C compiler. Computer, 33(4):62–69, 2000.
[32] Nathan Clark, Jason Blome, Michael Chu, Scott Mahlke, Stuart Biles, and Krisztian Flautner. An architecture framework for transparent instruction set customization in embedded processors. In ACM SIGARCH Computer Architecture News, volume 33, pages 272–283. IEEE Computer Society, 2005.
[33] Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R Reed Taylor. Piperench: A reconfigurable architecture and compiler. Computer, 33(4):70–77, 2000.
[34] Mahim Mishra, Timothy J Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C Goldstein, and Mihai Budiu. Tartan: Evaluating spatial computation for whole program execution. ACM SIGOPS Operating Systems Review, 40(5):163–174, 2006.
[35] Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. Enerj: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, volume 46, pages 164–174. ACM, 2011.
[36] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 449–460. IEEE Computer Society, 2012.
[37] Lawrence McAfee and Kunle Olukotun. Emeuro: A framework for generating multi-purpose accelerators via deep learning. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 125–135. IEEE Computer Society, 2015.
[38] Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. Snnap: Approximate computing on programmable SoCs via neural acceleration. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 603–614. IEEE, 2015.
[39] Scott Sirowy and Alessandro Forin. Where's the beef? Why FPGAs are so fast. Microsoft Research, Microsoft Corp., Redmond, WA, 98052, 2008.
[40] Mehrzad Samadi, Janghaeng Lee, D Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 13–24. ACM, 2013.
[41] Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, volume 42, pages 35–50. ACM, 2014.
[42] Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh. Neural acceleration for GPU throughput processors. In Proceedings of the 48th International Symposium on Microarchitecture, pages 482–493. ACM, 2015.
[43] Eric S Chung, John D Davis, and Jaewon Lee. Linqits: Big data on little clients. In ACM SIGARCH Computer Architecture News, volume 41, pages 261–272. ACM, 2013.
[44] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ú. Erlingsson, Pradeep Kumar Gunda, Jon Currey, Frank McSherry, and Kannan Achan. Some sample programs written in DryadLINQ. Technical report, Tech. Rep. MSR-TR-2008-74, Microsoft Research, 2008.
[45] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jordan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24. IEEE, 2014.
[46] David Slogsnat, Alexander Giese, Mondrian Nüssle, and Ulrich Brüning. An open-source HyperTransport core. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 1(3):14, 2008.
[47] Liu Ling, Neal Oliver, Chitlur Bhushan, Wang Qigang, Alvin Chen, Shen Wenbo, Yu Zhihong, Arthur Sheiman, Ian McCallum, Joseph Grecco, et al. High-performance, energy-efficient platforms using in-socket FPGA accelerators. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 261–264. ACM, 2009.
[48] Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. Efficient complex operators for irregular codes. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 491–502. IEEE, 2011.
[49] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. Qscores: Trading dark silicon for scalable energy efficiency with quasi-specific cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pages 163–174. ACM, 2011.

[50] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. High-level synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 30(4):473–491, 2011.
[51] Philippe Coussy, Daniel D Gajski, Michael Meredith, and Andres Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, (4):8–17, 2009.
