Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization
Chixiao Chen, Huwan Peng, Xindi Liu, Hongwei Ding and C.-J. Richard Shi
Department of Electrical Engineering, University of Washington, Seattle, WA, 98195
{cxchen2,hwpeng,xindil,cjshi}@uw.edu

ABSTRACT
This paper presents an instruction and Fabric Programmable Neuron Array (iFPNA) architecture, its 28nm CMOS chip prototype, and a compiler for the acceleration of a variety of deep learning neural networks (DNNs), including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and fully connected (FC) networks, on chip. The iFPNA architecture combines instruction-level programmability as in an Instruction Set Architecture (ISA) with logic-level reconfigurability as in a Field-Programmable Gate Array (FPGA), in a sliced structure for scalability. Four data flow models, namely weight stationary, input stationary, row stationary and tunnel stationary, are described as the abstraction of various DNN data and computational dependence. The iFPNA compiler partitions a large-size DNN into smaller networks, each being mapped to, optimized and code generated for, the underlying iFPNA using one or a mixture of the four data-flow models. Experimental results have shown that state-of-the-art large-size CNNs, RNNs, and FC networks can be mapped to the iFPNA processor achieving near-ASIC performance.

CCS CONCEPTS
• Computer systems organization → Neural networks; Data flow architectures; • Software and its engineering → Compilers;

KEYWORDS
Deep Learning Processor, Neural Network, Data Flow, Domain Specific Instruction Set, Tensorization

1 INTRODUCTION
Recent success of artificial intelligence (AI), in particular deep learning algorithms, in various applications, especially in understanding images, videos, natural languages and human intentions, has driven intensified interest in developing custom-silicon deep learning processors (DLPs). Existing work includes dedicated accelerators for convolutional neural networks (CNNs) [1, 2] and recurrent neural networks (RNNs) [3]. Most progress has been made on the efficient implementation of computing primitives beyond scalar and vector operations, such as matrix-vector and matrix-matrix operations, which are referred to as tensor primitives.

Custom tensorization explores problem-specific computing data flows for maximizing energy efficiency and/or throughput. For example, Eyeriss proposed a row-stationary (RS) data flow to reduce memory access by exploiting feature map reuse [1]. However, row stationary is only effective for convolutional layers with small strides, and shows poor performance on the AlexNet CNN CONV1 layer and on RNNs. Systolic arrays, on the other hand, feature both efficiency and high throughput for matrix-matrix computation [4], but they take less advantage of convolutional reuse and thus require more data transfer bandwidth.

Fixed data flow schemes in deep learning processors limit their coverage of advanced algorithms; a flexible data flow engine is desired. Normally, instruction set architectures (ISAs) are utilized to improve such flexibility. The first deep learning specific ISA was proposed in [5] with both vector and matrix instructions and their execution units. A single instruction multiple data (SIMD) processor in [2] enhances instruction-level parallelism by combining multiple sub-instructions into one. However, none of these ISAs changes the tensor-level data flow through compilation.

To alleviate these issues, this paper presents an architecture that realizes flexible tensorization by a data-flow enhanced ISA. A prototype processor chip has been designed and fabricated in 28nm CMOS. A compiler has been developed to map a given deep learning network to the underlying iFPNA processor. Adaptive tensorization is proposed to achieve the best performance according to the hardware constraints. The prototyped processor and its compiler have been run successfully on a set of AI applications, and demonstrated near-ASIC performance on large-size deep learning networks including AlexNet and a 1024-cell long short term memory (LSTM) network, a representative modern RNN.

This paper is organized as follows. Section II describes the background and related work. Section III introduces the iFPNA architecture. The four DNN data flow models and the iFPNA compiler are presented in Sections IV and V, respectively. Experimental results are described in Section VI. Section VII concludes the paper.

2 BACKGROUND AND RELATED WORK
All state-of-the-art neural networks, including CNNs, RNNs, logistic regression and fully connected (FC) networks, follow the basic neuron expression

y = Act.Fct(W · x + b),   (1)

where y denotes the neuron outputs, x stands for the feature/image inputs, W is a weight (also called filter) matrix, b is a biased threshold vector, and Act.Fct is a vector of non-linear activation functions. The matrix-vector product in (1) can be unfolded into a number of vector products that can be computed in parallel in different blocks. A single-instruction-multiple-data (SIMD) architecture is suitable for such a model. The block for vector product computation is usually referred to as a Processing Element (PE). Most state-of-the-art DLPs have an array of PEs.
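To make the unfolding concrete, the short NumPy sketch below (sizes and names are illustrative, not taken from the paper) evaluates (1) as one independent vector product per output neuron, which is exactly the per-PE work unit discussed next.

```python
import numpy as np

def relu(v):
    # One choice of Act.Fct in (1); the iFPNA fabric also supports
    # sigmoid, tanh, and direct output.
    return np.maximum(v, 0.0)

def neuron_layer(W, x, b):
    """y = Act.Fct(W @ x + b), computed row by row.

    Each row of W paired with the shared input x is an independent
    vector product, so the rows can be assigned to different PEs
    (neuron slices) and evaluated in parallel."""
    y = np.empty(W.shape[0])
    for i, w_row in enumerate(W):        # one vector product per PE
        y[i] = np.dot(w_row, x) + b[i]
    return relu(y)

# Hypothetical sizes: 16 output neurons, 64 inputs.
W = np.random.randn(16, 64)
x = np.random.randn(64)
b = np.zeros(16)
y = neuron_layer(W, x, b)
```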

In general, more parallelism requires more data communication, and thus higher memory access bandwidth. Data reuse in DNNs decreases off-chip memory access. Three types of reuse are involved in CNNs [6]: input reuse, weight reuse and convolutional reuse. Among the three, the first two are also deployed for RNNs; FC networks only support input reuse. One critical criterion is how much the mapped data flow decreases repeated memory access without sacrificing PE utilization; the extreme case is that all data are input only once. Various data flows have been proposed to explore data reuse, including input stationary and row stationary [6].

Tensorization maps efficient data flows that exploit these reuses and achieve the best performance on the constrained hardware. Often a neural network requires hundreds of millions of operations, while the hardware only has tens or hundreds of PEs. Large tensors are therefore partitioned into small ones by the compiler and executed by the PEs in a time-multiplexed manner.

3 INSTRUCTION AND FABRIC PROGRAMMABLE ARCHITECTURE
This work proposes iFPNA: an instruction-and-fabric programmable neuron array architecture, as shown in Fig. 1(a). It consists of a central controller with dedicated instructions to enhance data reuse, an array of programmable neuron slices supporting diverse neuron computation, and a single-instruction-multiple-data global register file connecting the controller and the slices. As the main computing PE, a neuron slice includes an SRAM-based weight memory, a register-based partial sum scratchpad memory, and a programmable MAC/activation/pooling engine reconfigurable with switches.

Figure 1(b) defines the computing metrics of the slices, including the I/O bandwidth B, the weight memory size W, the partial sum scratchpad size S, and the number of multipliers M. The object to be mapped comes from the neural network layer description, summarized in Table 2, such as the kernel number K, the channel width of a kernel C, and the planar feature size H. According to the hardware resources, the compiler partitions large-sized kernels to obtain smaller ones suitable for hardware mapping. The corresponding partitioned parameters are denoted as Ks, Fs, Cs.

Figure 1: Proposed instruction-and-fabric programmable DLP architecture. (a) Overall architecture (b) parameterized design table (c) multi-chip parallelism diagram.

Table 1: iFPNA Instruction Set.

Instruction (Opcode)   | Operand I  | Operand II | Operand III
Weight Load (WL)       | Length     | Src. Addr. | Dst. Addr.
Vector Collect (VC)    | Src. Addr. | Dst. Reg.  | Sld./Sps.
Computing Execute (EX) | Mode       | Dst. Addr. | -

3.1 iFPNA Instruction Set
iFPNA adopts an instruction set architecture (ISA), rather than a finite state machine (FSM), as the controller. An FSM restricts possible upgrades for any data flow not discovered at the time of chip design. In contrast, an instruction set with sufficient flexibility is adaptive to any data flow as long as the compiler, in particular the mapping library, is updated.

In addition to common RISC instructions, three new instructions are introduced to perform data movement and computing execution. They are summarized in Table 1 and explained below.

Weight Load. The iFPNA can invoke a direct memory access (DMA) operation to load weights from the off-chip memory without interrupt, referred to as weight load. Its operands include the memory access length, the starting source address, and the destination address (mapped to an address space among the slices).

Vector Collect. The SIMD register file has four 128-bit registers to store the input features and output computing results. The registers are filled by the vector collect instruction. It can load an input feature from the memory (for CNNs/FCs) or directly copy from another register (for RNNs). Note that a sliding/sparse flag is used to arrange data more efficiently: the sliding operation partially updates the register and shifts the remainder, while the sparse operation inserts zeros for convolution padding and sparse networks.
Computing Execute. Most instructions are processed inside the controller, except the computing execute instruction, which invokes each neuron slice to compute a vector product / activation / pooling. Its operands include the mode configuration and the destination address. Note that the destination of computing execute also covers the local partial sum scratchpad index.

Using the extended instruction set, all the data reuse and PE operations can be defined in a previously compiled program. The major task of the iFPNA compiler is to re-order these three instructions (sometimes assisted by other RISC instructions) to maximize data reuse.
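As an illustration of how these three instructions compose, the sketch below emits a hypothetical, mnemonic-level instruction stream for a weight-stationary pass. The operand fields follow Table 1, but the concrete addresses, register names and mode strings are invented for the example and are not the chip's actual encoding.

```python
# A hypothetical, mnemonic-level sketch of the kind of instruction stream the
# iFPNA compiler emits. Weights are loaded once, then VC/EX pairs sweep the
# feature map, which is the re-ordering that maximizes weight reuse.
def weight_stationary_stream(num_positions, weight_len=160):
    stream = [("WL", weight_len, "ddr:0x0000", "slice:0x00")]   # load weights once
    for p in range(num_positions):                              # slide over the feature
        stream.append(("VC", f"fmap:{p}", "reg0", "sld"))        # collect next input vector
        stream.append(("EX", "conv16b", f"psum:{p}"))            # MAC + activation on slices
    return stream

for inst in weight_stationary_stream(num_positions=4):
    print(inst)
```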

Figure 2: Mapping schemes on the iFPNA architecture. (a) Weight Stationary (b) Input Stationary.

3.2 Multi-Level Reconfigurable Fabric
The iFPNA supports multi-level reconfigurability at the fabric. Firstly, the MACs are designed to accommodate various neural network quantization schemes, including 16-bit, 8-bit, 6-bit and 4-bit modes. All the multipliers are connected by a programmable adder tree to obtain the final result.

Secondly, a reconfigurable activation and pooling engine is designed. Configured by the central controller, the engine can accommodate different activation functions (direct output, ReLU, sigmoid, tanh, etc.) and pooling modes (no pooling, 2x2 pooling, 3x3 pooling and average pooling). In addition, element-wise modes, including dot product and element-wise MAC, are also adopted to support LSTM operations and batch normalization.

Finally, the fabric supports inter-slice partial sum transfer. First appearing in [1] and recently emphasized in systolic designs [4], flexible accumulation among PEs is common for emerging data flows. A ring fabric of inter-slice communication is capable of transferring partial sums between neighboring slices.

3.3 Scalability
To handle a large number of kernels, the iFPNA architecture supports scalability with multi-chip parallelism at the package or board level. A stacking package diagram is shown in Fig. 1(c). Each iFPNA chip reserves four pins for chip identification, which connect to different series numbers at the package level. Once an iFPNA chip is reset, on-chip identification registers load the series number. These numbers are encoded as the four most significant bits (MSBs) of the iFPNA address space.

To perform parallel tasks coherently, instructions such as computing execute are applied to all the chips in the package simultaneously, except memory access (weight load and output) instructions. An FPGA host is adopted for synchronization. Effectively, multi-chip parallelism expands the number of multipliers M and the weight memory size W in Fig. 1(b) by a factor of the chip number. With four chip identification pins, a total of 16 iFPNA chips can be stacked at most.
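A minimal sketch of the chip-identification addressing described above is given below; the "four MSBs select the chip" rule comes from the text, while the overall address width is an assumption made only for illustration.

```python
# Minimal sketch of chip-identification addressing, assuming a 16-bit iFPNA
# address space (the width is an assumption; only the "4 MSBs select the
# chip" rule comes from the text).
ADDR_BITS = 16

def global_address(chip_id: int, local_addr: int) -> int:
    assert 0 <= chip_id < 16, "four ID pins allow at most 16 stacked chips"
    assert 0 <= local_addr < (1 << (ADDR_BITS - 4))
    return (chip_id << (ADDR_BITS - 4)) | local_addr

# Chip 3, local address 0x1A2 -> 0x31A2
print(hex(global_address(3, 0x1A2)))
```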
4 DNN DATA FLOW MAPPING
Given the hardware architecture, the next issue is how to map various neural networks efficiently onto the hardware. The key to achieving this efficiency is to utilize the underlying data flow inherent in the various deep learning networks. This section describes four data flow models for DNNs, namely weight stationary, input stationary, row stationary, and tunnel stationary. These four data flow models work for various CNNs, with the key mapping parameters summarized in Table 2. RNNs and FC networks use only the first two data flow models.

Table 2: Shape and mapping parameters for CNNs.

Parameter | Description
K, Ks     | # of 3D weights, # of 3D weights in each partition
C, Cs     | # of weight channels, # of channels in each partition
H, Hs     | input feature height/width, height/width in each partition
Fs        | # of input partitions, Fs = H/Hs
L         | weight height/width
V         | = L x L x Cs, 3D-kernel size

4.1 Data Flow I: Weight Stationary
The weight stationary (WS) data flow is the most straightforward, owing to weight sharing within one layer. On each neuron slice, WS holds the weight memory and slides the feature to obtain the result. Fig. 2(a) shows the WS mapping scheme on a single slice. At t1, weight W1 and feature F1 are loaded in slice 1 for convolution; 16 different weights are loaded in the 16 neuron slices simultaneously and feature F1 is shared across the 16 slices. Then a stride is taken and feature F2 is loaded with the same 16 weights at t2. After completing all the convolutions of the first 16 weights, the next 16 weights are computed.

In the iFPNA, the weight memory reuses its values until a combination of vector collect and computing execute has gone through the whole feature map.
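The loop nest below is a plain Python/NumPy sketch of the WS schedule (channel depth folded away and shapes invented for brevity): the kernel pinned in a slice's weight memory stays fixed while the feature map slides underneath it.

```python
import numpy as np

def conv_weight_stationary(weights, fmap, L):
    """Weight-stationary schedule: the outer loop pins a kernel in the slice's
    weight memory; the inner loops slide over every feature-map position before
    the next kernel is loaded (a Python sketch, not the on-chip code)."""
    K = len(weights)                      # number of kernels
    H = fmap.shape[0]
    out = np.zeros((K, H - L + 1, H - L + 1))
    for k in range(K):                    # weight stays stationary for this loop body
        w = weights[k]
        for i in range(H - L + 1):        # slide the feature under the fixed weight
            for j in range(H - L + 1):
                out[k, i, j] = np.sum(w * fmap[i:i+L, j:j+L])
    return out

# Hypothetical shapes: 16 kernels of 3x3 on an 8x8 input plane.
out = conv_weight_stationary(np.random.randn(16, 3, 3), np.random.randn(8, 8), L=3)
```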

4.2 Data Flow II: Input Stationary
Input stationary (IS) improves the MAC utilization based on the reuse of input feature maps [6]. The IS scheme holds the input while shuffling as many different weights as possible.

Consider an example of convolution with 2 kernels processed sequentially, as shown in Fig. 2(b). At t1, input F1 and weight W1 are loaded, and slice 1 holds the value of F1 into the next computing cycle t2 for convolution with W2. While each slice stores 2 different weights, F1 is held stationary until t2; thus convolution is performed with 2 different weights on the same input in each slice from t1 to t2. Then, from t3 to t4, a stride is taken and F2 is loaded for convolution with W1 and W2 in slice 1. Therefore, the percentage of non-MAC instructions is attenuated by shuffling kernels.

Recent neural network models tend to have hundreds of kernels in each layer. Therefore, IS is the main strategy by which the iFPNA reuses a feature value until a combination of kernel shuffling and computing execute has gone through Ks kernels. The latency penalty of kernel shuffling is negligible.
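For comparison, the following sketch reorders the same loops into the IS schedule: each input patch is fetched once and all kernels are shuffled against it before the window moves (again a simplified Python illustration with invented shapes, not the generated iFPNA code).

```python
import numpy as np

def conv_input_stationary(weights, fmap, L):
    """Input-stationary schedule: each feature patch is loaded once and all
    kernels are shuffled against it before the window strides on (sketch only;
    channel depth omitted for brevity)."""
    K, H = len(weights), fmap.shape[0]
    out = np.zeros((K, H - L + 1, H - L + 1))
    for i in range(H - L + 1):
        for j in range(H - L + 1):
            patch = fmap[i:i+L, j:j+L]     # input held stationary...
            for k in range(K):             # ...while kernels are shuffled
                out[k, i, j] = np.sum(weights[k] * patch)
    return out

out = conv_input_stationary(np.random.randn(16, 3, 3), np.random.randn(8, 8), L=3)
```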

Figure 3: Data flow mapping with convolutional reuse. (a) Row Stationary (b) Tunnel Stationary.

4.3 Data Flow III: Row Stationary
The weight stationary and input stationary data flows do not explore convolutional reuse. The row stationary (RS) data flow model was introduced in [1] to reduce the movement of all types of data, including input features, weights, and output features. The iFPNA mapper supports an inherent RS data flow model sequentially, as shown in Fig. 3(a). Weights are reused horizontally and inputs are reused vertically over time, e.g., weights are reused on each slice and input features are shared across all the slices. Thus, the partial sums are accumulated diagonally with inter-slice communication.

To be specific, each slice stores one row of a weight, so each group of L slices stores the L rows of one whole weight. At t2, feature 1 row 2 is loaded and performs convolution with weight 1 row 1 in slice 1 and row 2 in slice 2. Then the intermediate results are accumulated with the result of the neighboring slice through inter-slice communication; for example, F12W12 in slice 2 is accumulated with F11W11 in slice 1. Reuse within the PE array minimizes the access of both input data and weights and thus improves energy efficiency.

In the iFPNA, input data are organized horizontally rather than vertically. Also, a spatial combination of three slices is utilized to perform convolution with one 3 x 3 x C weight through inter-slice communication.
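The sketch below mimics the RS schedule in plain Python: each "slice" owns one weight row, input rows are broadcast, and the 1-D row convolutions are accumulated diagonally into the output rows, standing in for the inter-slice ring of the hardware. Shapes are illustrative.

```python
import numpy as np

def conv_row_stationary(weight, fmap, L):
    """Row-stationary sketch: slice r holds weight row r; each input row is
    broadcast to the slices, each slice produces a 1-D row convolution, and
    partial sums from neighbouring slices are accumulated diagonally
    (the inter-slice ring in hardware; a plain loop here)."""
    H = fmap.shape[0]
    out = np.zeros((H - L + 1, H - L + 1))
    for r in range(L):                         # slice index = weight row held stationary
        w_row = weight[r]
        for i in range(H):                     # input rows streamed over time
            if 0 <= i - r <= H - L:            # this partial sum lands on output row i-r
                for j in range(H - L + 1):
                    out[i - r, j] += np.dot(w_row, fmap[i, j:j+L])
    return out

out = conv_row_stationary(np.random.randn(3, 3), np.random.randn(8, 8), L=3)
```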

4.4 Data Flow IV: Tunnel Stationary
RS requires the slice number to be a multiple of 3 for a 3 x 3 x C kernel to reach 100% slice utilization. RS also uses more scratchpad to store input features, weights and intermediate results.

We propose a better data flow, referred to as tunnel stationary (TS), to save input data scratchpad and to achieve full slice utilization. Inherited from RS, TS splits a 3D weight with deeper fragmentation: the 1 x 3 x 3 fragments of RS are transformed into fragments of 1 x 1 x 9, shaped like a tunnel.

Similar to IS, TS holds an input feature of size 1 x 1 x C. As illustrated in Fig. 3(b), a 2 x 2 x C weight is segmented into four 1 x 1 x C tunnels; F1 performs convolution with all the tunnel filters T1, T2, T3, and T4. Then the feature changes to the next tunnel. After going through the L x L tunnels, the partial sums are accumulated.

In the iFPNA, TS utilizes less scratchpad than RS because the ifmap scratchpad is eliminated. Also, no inter-slice communication is required and all the slices are utilized.
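A corresponding TS sketch is shown below: the weight is cut into 1 x 1 x C tunnels and each 1 x 1 x C input vector is consumed by every tunnel that covers it, so no input-feature scratchpad is needed. As before, this is an illustrative Python loop nest with invented shapes, not the generated iFPNA program.

```python
import numpy as np

def conv_tunnel_stationary(weight, fmap, L):
    """Tunnel-stationary sketch: the L x L x C weight is split into L*L
    "tunnels" of shape 1 x 1 x C. A 1 x 1 x C input vector is held stationary
    while it is dotted with every tunnel that needs it, and the L*L partial
    sums per output pixel are accumulated."""
    H, _, C = fmap.shape
    out = np.zeros((H - L + 1, H - L + 1))
    for i in range(H):
        for j in range(H):
            x = fmap[i, j]                         # 1 x 1 x C input, held stationary
            for r in range(L):                     # visit every tunnel (r, c) that
                for c in range(L):                 # covers this input position
                    oi, oj = i - r, j - c
                    if 0 <= oi <= H - L and 0 <= oj <= H - L:
                        out[oi, oj] += np.dot(weight[r, c], x)   # one 1 x 1 x C tunnel
    return out

out = conv_tunnel_stationary(np.random.randn(3, 3, 8), np.random.randn(10, 10, 8), L=3)
```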
5 COMPILATION AND TENSORIZATION
5.1 Compilation Flow
With the instruction set and data flow models in place, the iFPNA compiler has been developed. Fig. 4 shows the complete iFPNA compilation flow. It consists of four major steps: front-end compilation, mapping scheme selection, partition, and code generation.

Figure 4: The iFPNA compilation strategy.

Front-end Compilation. Recent progress in neural network frameworks provides many open-source front-end compilation tools that can transform an algorithmic description into an intermediate representation [7]. The input to our compiler uses a common neural network model format, ONNX [8], which includes both tensors and operators represented by a graph. Optimizations such as combining tensor computing, activation and pooling are performed in this step. Tensor primitives are predefined in the front-end compilation, and application programming interfaces (APIs) are utilized to invoke the deep learning processor.
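A minimal front-end sketch using the ONNX Python package is shown below; it only loads a model and walks the operator graph, which is the starting point for the fusion described above (the model file name is hypothetical).

```python
# Minimal front-end sketch: load an ONNX graph and walk its operators.
# Only onnx.load() and the standard graph fields are used; "alexnet.onnx"
# is a placeholder file name.
import onnx

model = onnx.load("alexnet.onnx")
for node in model.graph.node:
    print(node.op_type, list(node.input), list(node.output))
```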

Data Flow Mapping Scheme Selection. Given the NN model and the hardware specification, the compiler selects the most suitable data flow mapping scheme. NNs with few kernels (fewer than the slice number) adopt weight stationary, whereas NNs with many kernels employ input stationary. Row/tunnel stationary is utilized for one-stride CNN layers.

Partition. A large deep learning network is partitioned into a number of subnetworks, which are processed sequentially. The compiler determines a suitable partition based on the neural network description, the selected mapping scheme and the hardware constraints. The overall criterion of the partition is to detect whether a register/memory overflow exists in the selected mapping scheme. The optimization targets are throughput (the reciprocal of latency) and energy efficiency.

Code Generation. The final step is to generate a binary code for running the processor hardware. The generated code orders the instructions weight load, vector collect and computing execute inside one for-iteration loop. To assist debugging, the corresponding assembly language counterpart is generated automatically as well. Fig. 5 shows samples of pseudo code generation for the four data-flow mappings.

Figure 5: Pseudo code samples for the four data-flow mappings (WL: Weight Load; VC: Vector Collect; EX: Computing Execute).

5.2 Adaptive Tensorization Strategy
Adaptive tensorization aims at choosing the best mapping and partition scheme. For CNNs/RNNs, the partition strategy assumes that more than one kernel can be stored in the weight memory (W >= V). Table 3 summarizes the analytical expressions of input loading latency and hardware constraints.

Table 3: Comparison of the four iFPNA data flows.*

Data Flow           | Input Loading Latency                   | HW Constraints
Weight Stationary   | (H/L + 1)^2 * V * K / B                 | S >= 1
Input Stationary    | (H/L + 1)^2 * V * (K/Ks) / B            | S >= Ks, Ks = min{K, S, W/V}
Row Stationary      | [H^2*C + H*(Fs-1)*(L-1)*C] * K / B      | S >= 2L, Fs = (H-L+1)/(S/2L)
Tunnel Stationary   | [H^2*C + H*(Fs-1)*(L-1)*C] * K / B      | S >= L, Fs = (H-L+1)/(S/L)
Hybrid of IS and TS | [H^2*C + H*(Fs-1)*(L-1)*C] * K / (B*Ks) | S >= L*Ks, Ks = min{K, W/V}, Fs = (H-L+1)/(S/(L*Ks))
* It is assumed that W >= V for the entire table.

Input stationary is the most desirable scheme. Though weight stationary consumes the least hardware, its latency is too long. In contrast, input stationary reuses one input feature while shuffling kernels. The first step of partition is to split the whole kernel set into subsets, each of which has Ks kernels; input stationary therefore reduces the loading latency by a factor of Ks. But the maximum Ks is restricted by the sizes of the scratchpad and the weight memory.
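The helper below evaluates the input-loading-latency expressions of Table 3 (as reconstructed here) so that candidate schemes can be compared for a given layer; the layer dimensions and bandwidth used in the example are hypothetical.

```python
def input_loading_latency(scheme, H, L, C, K, B, Ks=1, Fs=1):
    """Input-loading latency per Table 3 (as reconstructed above); all
    quantities in elements and elements per cycle, values hypothetical."""
    V = L * L * C                                   # 3D-kernel size
    if scheme == "WS":
        return (H / L + 1) ** 2 * V * K / B
    if scheme == "IS":
        return (H / L + 1) ** 2 * V * (K / Ks) / B
    if scheme in ("RS", "TS"):
        return (H * H * C + H * (Fs - 1) * (L - 1) * C) * K / B
    if scheme == "IS+TS":
        return (H * H * C + H * (Fs - 1) * (L - 1) * C) * K / (B * Ks)
    raise ValueError(scheme)

# Hypothetical AlexNet-like layer: H=13, L=3, C=192, K=384, B=16 elements/cycle.
for s in ("WS", "IS", "RS", "IS+TS"):
    print(s, round(input_loading_latency(s, 13, 3, 192, 384, 16, Ks=16, Fs=1)))
```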
Row stationary exploits convolutional reuse, which needs more scratchpad memory to store the intermediate convolution results and the input features. With the RS scheme, the input latency can be reduced by nearly L^2 times. If the scratchpad size is too small to support RS over the whole feature, the compiler performs an input partition to split the input into Fs small parts; the value of Fs depends on the sizes of the scratchpad and the inputs. Tunnel stationary improves on row stationary: with the same input loading latency, TS requires no scratchpad for input features. The mixture of input stationary and tunnel stationary performs both input reuse and convolutional reuse and achieves the shortest input loading latency.

When the kernel is large (V > W), e.g., in FC layers, each kernel is divided into a few sub-kernels of channel width Cs. The partition strategy then trades off between weight load and input stationary.

6 EXPERIMENTAL RESULTS
A silicon prototype of the iFPNA processor has been designed and fabricated in a 28nm HPC technology. The die micrograph and the demonstration system are shown in Fig. 8. The entire system consists of a PC running the compiler, a PCIe link transferring data from the PC to the chip, and an FPGA board to buffer the data transmitted into the chip. Each single chip has an I/O bandwidth of 1.6 Gbps with 16 neuron slices. Each slice contains a 256-index, 160-bit weight and bias SRAM.

The system, along with the iFPNA compiler, has been tested on a number of practical AI applications including the well-known AlexNet CNN and an LSTM-1024 RNN. Table 4 summarizes the performance of the prototype chip and compares it with the state-of-the-art Eyeriss accelerator for CNNs [1] and the OCEAN accelerator for RNNs [3]. We can see that the iFPNA processor, although highly programmable, achieves performance comparable to that of state-of-the-art dedicated CNN/RNN accelerators.


Figure 6: Latency of different mapping schemes on Conv Layer 4 in AlexNet.

Figure 7: Latency on test benches: (a), (b) Conv Layer 4 in AlexNet; (c) FC Layer 1 in AlexNet; (d), (e) LSTM-1024.

iFPNA GUI & Compiler Table 4: Measured performance aummary and comparison. Eyeriss[l] OCEAN[3] 'Ibis work Tec:lmology 6!imDLPI 65mn 28mnHPC VDDM 0.82-1.17 0.8-1.2 0.65-0.9 Clack(MHz) 100-200 2ll-400 20-200 Applicalion CNN RNN CNN/RNN Data flow RS I WS/IS/RStrS Bit width 16b 16b 4b-16b Power 278mW 6.6-155.8 mW 33.3mW Peak Performance S6GOPS 3U.6GOPS S3.4GOPS Peak Energy Eftic:iency 0.35TOPS/W 2.0TOP/W 1.6TOPS/W

Figure 8: iFPNA chip prototype and demonstration system.

Fig. 6 shows the measured latencies of one iFPNA chip with various mappings of AlexNet Conv Layer 4. With input and convolutional reuse, IS and pure RS/TS have shorter latency than WS; the combination of IS and RS/TS achieves the shortest latency.

Fig. 7 shows the measured shortest latencies that can be achieved by running the iFPNA compiler on the iFPNA architecture under different hardware constraints for a variety of state-of-the-art DNNs. Figs. 7(a) and (b) show the results of mapping AlexNet convolution layer 4. With a small weight memory or scratchpad memory, only the TS or IS scheme is adopted. As the weight memory and scratchpad memory sizes increase, the slices have space to store more kernels and intermediate results, and both the TS and IS schemes are then used to reduce the latency. Figs. 7(c), (d) and (e) are the measured latency results of fully-connected layer 1 in AlexNet and a 1024-cell LSTM network; here only the IS mapping scheme is used.

7 CONCLUSION
With the rapid evolution of deep learning algorithms and applications, it is highly desirable to have a programmable custom-silicon deep learning processor rather than a high-performance ASIC with limited applications. This paper presented iFPNA: one such processor, with instruction-level programmability similar to a traditional ISA (Instruction Set Architecture), logic-level programmability similar to an FPGA (Field-Programmable Gate Array), and a sliced architecture easy to scale. A compiler has been developed to map a deep learning neural network description from commonly used AI frameworks such as TensorFlow [9] to the iFPNA processor hardware automatically and efficiently. Four data flow models, including weight stationary, input stationary, row stationary, and tunnel stationary, are supported to explore the data and computational dependence inherent in DNNs. A silicon prototype of the iFPNA processor has been fabricated in 28nm CMOS and tested. The effectiveness of the proposed architecture and compilation strategies has been evaluated on commonly used DNNs found in a number of practical AI applications.

REFERENCES
[1] Y.-H. Chen, T. Krishna, J. S. Emer and V. Sze, "Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Jan. 2017.
[2] B. Moons, R. Uytterhoeven, W. Dehaene and M. Verhelst, "Envision: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in IEEE International Solid-State Circuits Conference (ISSCC), 2017.
[3] C. Chen et al., "OCEAN: an on-chip incremental-learning enhanced processor with gated recurrent neural network accelerators," in European Solid-State Circuits Conference (ESSCIRC), 2017.
[4] X. Wei et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Design Automation Conference (DAC), 2017.
[5] S. Liu et al., "Cambricon: an instruction set architecture for neural networks," in International Symposium on Computer Architecture (ISCA), 2016.
[6] V. Sze, T.-J. Yang, Y.-H. Chen and J. Emer, "Efficient processing of deep neural networks: a tutorial and survey," in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017.
[7] T. Chen et al., "TVM: end-to-end compilation stack for deep learning," in SysML Conference, 2018.
[8] Open Neural Network Exchange (ONNX). URL: https://onnx.ai
[9] TensorFlow. URL: https://www.tensorflow.org/