Fast Inference of Deep Neural Networks in FPGAs for Particle Physics (arXiv:1804.06913)

Jennifer Ngadiuba, Maurizio Pierini [CERN] Javier Duarte, Sergo Jindariani, Ben Kreis, Ryan Rivera, Nhan Tran [Fermilab] Edward Kreinar [Hawkeye 360] Song Han, Phil Harris [MIT] Zhenbin Wu [University of Illinois at Chicago]

Research Techniques Seminar Fermilab 4/24/2018

1 Outline

• Introduction and Motivation • Machine Learning in HEP • FPGAs and High-Level Synthesis (HLS) • Industry Trends • HEP Latency Landscape • hls4ml: HLS for Machine Learning • Case Study and Design Exploration • Summary and Outlook • Cloud-scale Acceleration

Javier Duarte I hls4ml 2 Introduction

Javier Duarte I hls4ml 3 ReconstructionMachine chain:Learning Jet tagging in HEP Task to find the particle ID of a jet, e.g. b-quark • Learning optimized nonlinear functions of many inputs for … performingvertices difficult tasks from (real or simulated) data … Many successes in HEP: identificationData/MC of b-quarkuser jets, Higgs • Tag Info tagger corrections … candidates,Jet, particle energy regression, analysis selection, … particles CMS-PAS-BTV-15-001 s=13 TeV, 2016 1 CMS Simulation Preliminary Key features:tt events AK4jets (p > 30 GeV) T • Long lifetimeCSVv2 of heavy 10−1 flavour quarksDeepCSV cMVAv2

• Displacedmisid. probability tracks, … • Usage10−2 of ML standard for this problem

udsg Neural network based on 10−3 • 11 c high-level features 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 b-jet efficiency

Javier Duarte I hls4ml 4 Machine Learning in HEP • Learning optimized nonlinear functions of many inputs for performing difficult tasks from (real or simulated) data

• Many successes in 11.1HEP: Significance identification of the signal and of its strengthb-quark jets, Higgs 41 candidates, particle energy regression, analysis selection, … arXiv:1407.0558 19.7 fb-1 (8 TeV) + 5.1 fb-1 (7 TeV) ×1043 CMS 3.5 S/(S+B) weighted sum H → γγ Data 3 ML algorithms used in every S+B fits (weighted sum) 2.5 B component aspect of Higgs discovery: ±1σ energy regression 2 ±2σ S/B discrimination, 1.5 1 +0.26 µ = 1.14 −0.23 … 0.5 mH = 124.70 ± 0.34 GeV

0 S/(S+B) weighted events / GeV 200 Typically applied offline, B component subtracted not online (trigger-level) 100 0

-100 110 115 120 125 130 135 140 145 150 Javier Duarte I hls4ml 5 mγγ (GeV) Figure 19: Diphoton mass spectrum weighted by the ratio S/(S + B) in each event class, to- gether with the background subtracted weighted mass spectrum.

Table 5: Values of the best-fit signal strength, µˆ, when mH is treated as an unconstrained pa- rameter, for the 7 TeV, 8 TeV, and combined datasets. The corresponding best-fit value of mH, mH, is also given.

µˆ mH (GeV) b +0.62 7 TeV 2.22 0.55 124.2 +0.26 b 8 TeV 0.90 0.23 124.9 +0.26 Combined 1.14 0.23 124.7 section times the relevant branching fractions, relative to the SM expectation. In Fig. 20 the combined best-fit signal strength, µˆ, is shown as a function of the Higgs boson mass hypothesis, both for the standard analysis (left) and for the cut-based analysis (right). The two analyses agree well across the entire mass range. In addition to the signal around 125 GeV, both analyses see a small upward fluctuation at 150 GeV, which is found to have a maximum local significance of just over 2 s at mH = 151 GeV—slightly beyond the mass range of our analysis.

The best-fit signal strength for the main analysis, when the value of mH is treated as an un- +0.26 constrained parameter in the fit, is µˆ = 1.14 0.23, with the corresponding best-fit mass being mH = 124.7 GeV. The expected uncertainties in the best-fit signal strength, at this mass, are +0.24 and 0.22. The values of the best-fit signal strength, derived separately for the 7 and 8b TeV datasets, are listed in Table 5. For the cut-based analysis the corresponding value is +0.29 µˆ = 1.29 0.26 at mH = 124.6 GeV, and for the sideband background model analysis the value +0.26 measured is µˆ = 1.06 0.23 at mH = 124.7 GeV. These values are shown in Table 6 together with the expected uncertainty,b and the corresponding values for the main analysis. The uncertainty in the signalb strength may be separated into statistical and systematic con- tributions, with the latter further divided into those having, or not, a theoretical origin: µˆ = CMS Trigger 1 kHz 100 kHz 1 MB/evt

High-Level 40 MHz Trigger L1 Trigger

• Level-1 Trigger (hardware) • High-Level Trigger (software) • 99.75% rejected • 99% rejected • decision in ~4 μs • decision in ~100s ms

• After trigger, 99.99975% of events are gone forever

Javier Duarte I hls4ml 6 HEP Latency Landscape

CMS: L1 Trigger HLT Offline 1 ns 1 us 1 ms 1 s

FPGAs CPUs

Javier Duarte I hls4ml 7 MACHINE LEARNING & FPGAS 7

Machine learning algorithms are ubiquitous in HEP

FPGA usage broad across HEP experiments Field-Programmable Gate Array Centered on DAQ and triggerALL FPGA development ARCHITECTURE 16 • Flexible: re-programmable Some early adaptions ofO(50-100) ML techniques optical in triggerFPGA [1] interconnects betweentransceivers “programmable hardware” FPGA developmentconfigurable becominglogic blocksrunning atmore and accessible ~O(15) Gbs High Levelembedded Synthesis, components OpenCL • LUTs (logic), What are FPGAs?FPGA Flip-Flops (registers), “programmable hardware” FPGA diagram Field Programmable Gate Arrays are reprogrammable FPGA interestDSPs in (arithmetic),industry is growingintegrated circuits

Contain array of logic blocks used to configure low level Block RAMs (memory) operations (bit masking,DSPs shifting, (multiply-accumulate,addition) etc.) Programmable hardware with structures Flip Flops (registers/distributed memory) Logic blocks wired together through routing channels High Throughput: O(100)Typical modern FPGA: LUTs (logic) that maps• nicely onto ML architecturesSupport highly parallel and pipelinedBlock algorithm RAMs (memories) (Kintex ultrascale+) implementations with guaranteed latency optical transceivers 1.3Mrunning FFs at 700k LUTs O(15) Gbs 5500 DSPs Typical modern FPGADSPs (multiply-accumulate,LogicLogic block cell etc.) 2200 BRAMs (Kintex UltraScale):Flip Flops (registers/distributed memory) • Massively parallel 1.3 M FFs LUTs (logic) Block RAMsLook-up (memories) Flip-flop 700k LUTs table • Low power (relative to CPU/ 5500 DSPs 2200 BRAMs 9 GPU) 25.04.2018 Jennifer Ngadiuba - hls4ml: deep neural networks in FPGAs [1] Carnes et al., https://indico.cern.ch/event/567550/contributions/2629686/ Javier Duarte I hls4ml 8 CPUs, GPUs, FPGAs, and ASICs How are they different? • FPGAs are the middle ground of latency, energy AeRCHITECTURESfficiency, and flexibility 41

ASICs

FPGAs

GPUs

CPUs

Source: Bob Broderson, Berkeley Wireless group

Javier* GPUs Duarte still Ibest hls4ml option for training 9 * FPGAs generally much more power efficient High-Level Synthesis • FPGA development is becoming more accessible, with tools like High-Level Synthesis: untimed C code with additional Motivation for High-Level Synthesis (HLS) directives that is synthesized into RTL Verilog/VHDL Motivationmodule dut(rst, clk for, q); High-Level Synthesis (HLS) input rst; uint8 dut() { input clk; static uint8 c; moduleoutputdutq;(rst , clk, q); c+=1; inputregrst[7:0]; c; vs. } uint8 dut() { clk; inputalways @ (posedge clk) static uint8 c; Untimed C code outputbeginq; c+=1; reg [7:0]if (rst c;== 1b’1) begin vs. } c <= 8'b00000000; HLS alwaysend @ (posedge clk) Untimed C code beginelse begin rst 1 c <= c + 1; if (rst == 1b’1) begin end + q c <= 8'b00000000; 0 1 HLS 0 c 8 end assign q = c; elseendmodule begin rst 1 clk c <= c RTL+ 1; Verilog Javier Duarteend I hls4ml 10 + q An0 1 8-bit counter 0 c 8 4 assign q = c; endmodule clk RTL Verilog An 8-bit counter 4 Industry Trends • Industry is moving toward custom hardware and FPGAs to quickly apply ML algorithms • Already being used as backend accelerators for Bing web searches, Siri queries, and more…

Microsoft Apple

Google Intel

Javier Duarte I hls4ml 11 hls4ml 100 kHz

40 MHz L1 Trigger

• hls4ml: neural network translation library for HLS • Support for common ML workflows and architectures • Tunable configuration for different use cases • Focus on L1 trigger as 1st application • What can we do in < μs on one FPGA?

Javier Duarte I hls4ml 12 Case Study and Design Exploration

Javier Duarte I hls4ml 13 Case Study: Jet Substructure

• Illustrative example: not necessarily the most realistic for L1 today, but lessons are generic

Javier Duarte I hls4ml 14 Jet Substructure Inputs

ECFs Observables mass mmMDT =1,2 N2 =1,2 M2 =0,1,2 C1 =1,2 C2 =1,2 D2 ↵, = 1,1 , 1,2 multiplicity ( ) ( ) ( ) D2 z log z Multiplicity Õ

Table 1: A summary of the observables used in the analysis.

Groomed mass separates top, W/Z, and quark/gluon this study [51–54]. A brief description of• each of these variables is presented in Ref. [55]. These are used as expert-level inputs to a neural network• top/gluon classifier which have is greater near optimal multiplicity3. than W/Z/quark • ECF N2β=1 separates 2 and 3-prong jets (W/Z/top) from 1-prong jets (quark/gluon) Benchmark networks and floating point performance We train a neural network for the classification task of q, g, W, Z, and t discrimination. The data are Javier Duarte I hls4ml 15 randomly split into training (60%), validation (20%), and testing (20%) datasets. The input features are standardized by removing the mean and scaling to unit variance. The architecture, illustrated in Fig. 4 (left), is a fully-connected neural network with three hidden layers. The activation function for the hidden layers is ReLU [56] while the output layer activation function is a softmax function to provide probabilities for each class. The categorical cross-entropy loss function is minimized with and without L1 regularization of the weights (Sec. 2.3) using the Adam algorithm [57] with an initial 4 learning rate of 10 and a minibatch size of 1024. The learning rate is halved if the validation loss fails to improve over 10 epochs. Training is performed on an AWS EC2 P2 GPU instance [58] with [10]. We also consider a simpler architecture with one hidden layer, see Fig. 4 (right), when studying the final FPGA implementation on a specific device. This is described further in Sec. 3.3. The performance of the neural network classifier is shown in Fig. 5. The general features of this performance plot are typical of jet substructure classification tasks. Top-quark jets, by virtue of their large mass and three-prong nature, have the best separation from the rest of the jet types. The W and Z jets are similar in performance because of their masses and two-prong nature while quark and gluon jets are notoriously challenging to classify. Given this multi-jet classifier performance, we explore how to implement such a neural network architecture in an FPGA using hls4ml.

3More sophisticated approaches exist, but the goal of this study is not to achieve better performance than existing algorithms. Instead, the goal is to examine the implementation of several eective neural network architectures in FPGAs.

–8– Case Study: Jet Substructure

• 5 output multi-classifier auc = area under ROC curve (100% is perfect, 20% is random) • Does a jet originate from a quark, gluon, W/Z boson, top quark? • Fully connected network • 16 expert inputs • jet mass, multiplicity, ECFs better 16 inputs

64 nodes, ReLU

32 nodes, ReLU

32 nodes, ReLU

5 outputs, softmax

Javier Duarte I hls4ml 16 Neural Network k k 1 `j = (Wij`i + bj)

k k 1 `j = (Wij`i + bj) k k 1 k k 1 `j = (Wij`i +`j b=j) (Wij`i + bj)

Javier Duarte I hls4ml 17 Neural Networkmultiplication k k 1 `j = (Wij`i + bj) activation function addition

k k 1 `j = (Wij`i + bj) k k 1 k k 1 `j = (Wij`i +`j b=j) (Wij`i + bj)

Javier Duarte I hls4ml 18 Neural Networkmultiplication k k 1 `j = (Wij`i + bj) activation function addition

NN = multiplications, additions, and pre-computed activation functions k k 1 `j = (Wij`i + bj) k k 1 k k 1 `j = (Wij`i +`j b=j) (Wij`i + bj)

Javier Duarte I hls4ml 19 ML in FPGAs? FPGA 16×64 +64×32

+32×32

+5×32 = 4,256 synapses / mult.

How many resources? DSPs, LUTs, FFs? Can we fit in the latency requirements?

Javier Duarte I hls4ml 20 Efficient Neural Networks • Compression • Maintain same performance while removing redundant synapses and neurons • Quantization 32-bit floating point math is overkill • CHAPTER 3. PRUNING DEEP NEURAL NETWORKS 20 • 20-bit, 18-bit, 8-bit, …? fixed point, integers? binarized?

For further reading:Figure 3.1: arXiv:1510.00149 Pruning the synapses and neurons of a deep neural network.

Javierthe connections Duarte I that hls4ml have been removed. The phases21 of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights — this parallels the human brain development [109] [110], where excess synapses formed in the first few months of life are gradually "pruned", with neurons losing little-used connections while preserving the functionally important connections. On the ImageNet dataset, the pruning method reduced the number of parameters of AlexNet by a factor of 9 (61 to 6.7 million), without incurring accuracy loss. Similar experiments with × VGG-16 found that the total number of parameters can be reduced by 13 (138 to 10.3 million), × again with no loss of accuracy. We also experimented with the more efficient fully-convolutional neural networks: GoogleNet (Inception-V1), SqueezeNet, and ResNet-50, which have zero or very thin fully connected layers. From these experiments we find that they share very similar pruning ratios before the accuracy drops: 70% of the parameters in those fully-convolutional neural networks can be pruned. GoogleNet is pruned from 7 million to 2 million parameters, SqueezeNet from 1.2 million to 0.38 million, and ResNet-50 from 25.5 million to 7.47 million, all with no loss of Top-1 and Top-5 accuracy on Imagenet. In the following sections, we provide solutions on how to prune neural networks and how to retrain the pruned model to recover prediction accuracy. We also demonstrate the speedup and energy efficiency improvements of the pruned model when run on commodity hardware.

3.2 Pruning Methodology

Our pruning method employs a three-step process: training connectivity, pruning connections, and retraining the remaining weights. The last two steps can be done iteratively to obtain better compression ratios. The process is illustrated in Figure 3.2 and Algorithm 1. Design Exploration hls4ml Keras TensorFlow PyTorch … Co-processing kernel

model hls 4 ml

compressed model HLS HLS conversion project Custom firmware HLS 4 ML design Usual ML software workflow tune configuration precision reuse/pipeline

Javier Duarte I hls4ml 22 Training Training/Compression Example: https://github.com/hls-fpga- machine-learning/keras-training/ (Some useful stuff for compression too)

json_string = keras_model.to_json() Export Model keras_model.save_weights(‘my_model_weights.h5’)

json

Javier Duarte I hls4ml 23 Training for Compression

Train with L1 Prune 1st iteration

• Many possible schemes for compression Our approach is a simplified version of iterative parameter pruning and retraining [59, 70] with • Simple, iterative version: L1 regularization, where the loss function L is augmented with an additional penalty term, • Train with L1 regularization (down-weights unimportant synapses) L w = L w + w . w 1 = wi (2.3) ( ) ( ) k k1 k k i | | • Remove X% of weights and retrain P L1 regularization is known to produce sparse models, provide built-in feature selection [71], and Repeat is a readily available option• in many machine learning workflows. In principle, training with Lp regularization with 0 p < 1 [62] may improve the sparsity and performance of the model, but these Javier Duarte I hls4ml 24 regularizers are not always easy to implement. While we take this simplified approach, we note that there are other, more sophisticated, approaches to compression in the literature which may yield even better results. 4 We train the model with L1 regularization with = 10 . We then sort the weights based on their absolute value relative to the maximum absolute value of the weights in a particular layer. With L1 regularization we see two separate sub-populations of weights with one at smaller values and one at larger values. Weights falling below a certain percentile, corresponding to the smaller-value sub- population, are removed. Next, we retrain the model again with L1 regularization while constraining the previously pruned weights to remain zero. We stop after seven iterations of this procedure at which point the sum of the pruned weight sub-population is 3% of the original summed weight population and the model is compressed by 70% (3051 weights pruned out of 4389 original weights and biases). Fig. 6 illustrates this procedure. The top left of Fig. 6 shows the distribution of the weights before compression. From the top left to the bottom right, the arrows indicate the following steps of the pruning and retraining procedure and the resulting distribution of weights is shown. Finally, in the bottom right, we present the final distribution of the weights after compression. We observe no significant change in the pruned network performance when compared with the original.

Quantization Quantized [59, 72–75] and even binarized [76–79] neural networks have been studied in detail as an additional way to compress neural networks by reducing the number of bits required to represent each weight. FPGAs provide considerable freedom in the choice of data type and precision. Both are important to consider to prevent wasting FPGA resources and incurring additional latency. In hls4ml we use fixed point arithmetic, which uses less resources and latency than floating point arithmetic. The inputs, weights, biases, sums, and outputs of each layer (see Eq. 2.1) are all represented as fixed point numbers. For each, the number of bits above and below the binary point can be configured for the use case. It is broadly observed that precision can be reduced significantly without causing a loss in performance [75], but this must be done with care. In Fig. 7, we show the distribution of the absolute value of the weights after the compression described in Sec. 2.3. In this case, to avoid underflow/overflow in the weights, at least three bits should be assigned above the binary point — two to envelope the largest absolute value and one for the sign. The neuron values, xm, and intermediate signals in the FPGA used to compute them may require more bits to avoid underflows/overflows. We determine the number of bits to assign below the binary point by scanning physics performance as a function of the bit precision.

– 11 – Training for Compression

Train with L1 Prune 1st iteration … … …

Retrain with L1 Prune 7th iteration

70% reduction with no loss in performance

Javier Duarte I hls4ml 25 Other Compression Schemes Louizos et al. 2017 arXiv:1712.01312 • Train with Lp (0≤p<1) regularization p 1/p to promote sparsity (though w p =( i wi ) difficult to optimize) k k | | P • as p → 0, Lp → L0 • “Optimal brain damage:” use second derivatives of loss Figure 1: Lp norm penalties for a parameter ✓ according to different values of p. It is easily observed function to rank parameter that both weight decay and Lasso, p =2and p =1respectively, impose shrinkage for large values of ✓. By gradually allowing p<1 we observe that the shrinkageLeCun is reduced and et at theal. limit 1989 of p =0 we observe that the penalty is a constant for ✓ =0. saliency (rather than using 6 NIPS 250 parameter magnitude) is achieved by transforming continuous random variables (r.v.s) with a hard nonlinearity, the hard- sigmoid. We further propose and employ a novel distribution obtained by this procedure; the hard concrete. It is obtained by “stretching” a binary concrete random variable (Maddison et al., 2016; Jang et al., 2016) and then passing its samples through a hard-sigmoid. We demonstrate the effectiveness • Weight-sharing using k-meansof this simple procedure in various experiments. clustering to identify weights to 2MINIMIZING THE L0 NORM OF PARAMETRIC MODELSHan et al. 2015

share One way to sparsify parametric models, such as deep neuralarXiv:1510.00149 networks, with the least assumptions about the parameters is the following; let be a dataset consisting of N i.i.d. input output pairs D (x1, y1),...,(xN , yN ) and consider a regularized empirical risk minimization procedure with an L{ regularization on the parameters} ✓ of a hypothesis (e.g. a neural network) h( ; ✓): • Huffman coding (optimal prefix)0 · N ✓ 1 | | (✓)= h(xi; ✓), yi + ✓ 0, ✓ 0 = I[✓j = 0], (1) R N L k k k k 6 ✓ i=1 ◆ j=1 Javier Duarte I hls4ml 26 X X ✓⇤ = arg min (✓) , ✓ {R } where ✓ is the dimensionality of the parameters, is a weighting factor for the regularization and ( ) corresponds| | to a loss function, e.g. cross-entropy loss for classification or mean-squared error for L · regression. The L0 norm penalizes the number of non-zero entries of the parameter vector and thus encourages sparsity in the final estimates ✓⇤. The Akaike Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz et al., 1978), well-known model selection criteria, correspond to specific choices of . Notice that the L0 norm induces no shrinkage on the actual values of the parameters ✓; this is in contrast to e.g. L1 regularization and the Lasso (Tibshirani, 1996), where the sparsity is due to shrinking the actual values of ✓. We provide a visualization of this effect in Figure 1. Unfortunately, optimization under this penalty is computationally intractable due to the non- ✓ differentiability and combinatorial nature of 2| | possible states of the parameter vector ✓. How can we relax the discrete nature of the L0 penalty such that we allow for efficient continuous optimization of Eq. 1, while allowing for exact zeros in the parameters? This section will present the necessary details of our approach.

2.1 A GENERAL RECIPE FOR EFFICIENTLY MINIMIZING L0 NORMS

Consider the L0 norm under a simple re-parametrization of ✓:

✓ | | ✓ = ✓˜ z ,z0, 1 , ✓˜ =0, ✓ = z , (2) j j j j 2 { } j 6 k k0 j j=1 X

2 Translation

Translation python keras-to-hls.py -c keras-config.yml

Inputs

Config

• IOType: parallelize or serialize

• ReuseFactor: how much to parallelize my-hls-test/: build_prj.tcl DefaultPrecision: inputs, weights, biases firmware • myproject_test.cpp

Javier Duarte I hls4ml 27 Network Tuning: Parallelization

• ReuseFactor: how much to parallelize

mult reuse = 4 use 1 multiplier 4 times

mult reuse = 2 use 2 multipliers 2 times each mult

mult

mult reuse = 1 use 4 multipliers 1 time each mult

mult related to the Initiation Interval = when new inputs are introduced to the algo.

Javier Duarte I hls4ml 28 HLS Project

Build HLS project vivado_hls -f build_prj.tcl

################# # HLS4ML # ################# open_project -reset myproject_prj set_top myproject add_files firmware/myproject.cpp -cflags "-I[file normalize ../../nnet_utils]" add_files -tb myproject_test.cpp -cflags "-I[file normalize ../../nnet_utils]" add_files -tb firmware/weights open_solution -reset "solution1" set_part {xcku115-flvf1924-2-i} create_clock -period 5 -name default csim_design produces an “IP core” csynth_design cosim_design -trace_level all that can be dropped into export_design -format ip_catalog a full firmware design exit

Javier Duarte I hls4ml 29 Study Results

Xilinx Vivado 2017.2 Clock frequency: 200 MHz FPGA: Xilinx Kintex Ultrascale (XCKU115-FLVB2104)

Javier Duarte I hls4ml 30 Compression

75 ns

• Big reduction in DSP usage with pruned model! • ~15 clocks @ 200 MHz = 75 ns inference

Javier Duarte I hls4ml 31 Quantization Scan integer bits Scan fractional bits

Full performance Full performance with 6 integer bits with 8 fractional bits Quantization

ap_fixed • Quantify the performance of the classifier with the AUC General strategy: avoid overflows in • 0101.1011101010 integer bit then scan the decimal bit • Expected AUC = AUC achieved by 32-bit floating point integer fractional inference of the neural network until reaching optimal performance width

Javier Duarte I hls4ml 32 Scan integer bits Scan fractional bits Fractional bits fixed to 8 Integer bits fixed to 6

Full performance at 6 integer bits Full performance at 8 fractional bits FPGA AUC / Expected FPGA AUC / Expected

25.04.2018 Jennifer Ngadiuba - hls4ml: deep neural networks in FPGAs 29 Resource Usage and Timing

15 clocks [75 ns] time

16 × 64 64 × 32 32 × 32 32 × 5 softmax (5)

reuse = 1 BRAM DSP FF LUT <16, 6> bits Total 13 954 53k 36k

% Usage ~0% 17% 3% 5%

Javier Duarte I hls4ml 33 Resource Usage with Reuse

• Tuning the throughput with reuse factor reduces the DSP usage • Steady increase of LUTs and FFs vs. bit precision • Spikes in LUTs at the DSP precision transitions (not present in final implementation)

Javier Duarte I hls4ml 34 Timing with Reuse

• Additional latency introduced by reusing the multipliers • Initiation interval scales with the reuse factor

Javier Duarte I hls4ml 35 Implementing the HLS Design “place and route” • How optimal is the HLS design (vs. RTL)? • For DSPs, HLS seems close to “back- of-the-envelope” optimal estimate • HLS is good for quickly getting a conservative estimate of resources

• Power decreases as throughput is decreased (by increasing reuse factor)

Javier Duarte I hls4ml 36 HLS vs. Implementation

• HLS estimates are conservative compared to final implementation • No spikes in LUTs at the DSP precision transitions in implementation

Javier Duarte I hls4ml 37 Note: different model Summary and Outlook

Javier Duarte I hls4ml 38 hls4ml Status hls-fpga-machine-learning.github.io/hls4ml

FERMILAB-PUB-18-089-E

P JINST

Fast inference of deep neural networks in FPGAs for particle physics

Javier Duartea , Song Hanb , Philip Harrisb , Sergo Jindariania , Edward Kreinarc , Benjamin Kreisa , Jennifer Ngadiubad , Maurizio Pierinid , Ryan Riveraa , Nhan Trana , Zhenbin Wue aFermi National Accelerator Laboratory, Batavia, IL 60510, USA bMassachusetts Institute of Technology, Cambridge, MA 02139, USA cHawkEye360, Herndon, VA 20170, USA dCERN, CH-1211 Geneva 23, Switzerland eUniversity of Illinois at Chicago, Chicago, IL 60607, USA E-mail: [email protected]

A: Recent results at the Large Hadron Collider (LHC) have pointed to enhanced physics capabilities through the improvement of the real-time event processing techniques. Machine learning • Beta version is live! arXiv:1804.06913 methods are ubiquitous and have proven to be very powerful in LHC physics, and particle physics as a whole. However, exploration of the use of such techniques in low-latency, low-power FPGA hardware has only just begun. FPGA-based trigger and data acquisition (DAQ) systems have extremely low, sub-microsecond latency requirements that are unique to particle physics. We present a case study for • Next steps: neural network inference in FPGAs focusing on a classifier for jet substructure which would enable, among many other physics scenarios, searches for new dark sector particles and novel measurements of the Higgs boson. While we focus on a specific example, the lessons are far-reaching. We develop Applications? a package based on High-Level Synthesis (HLS) called hls4ml to build machine learning models • in FPGAs. The use of HLS increases accessibility across a broad user community and allows for a drastic decrease in firmware development time. We map out FPGA resource usage and latency versus neural network hyperparameters to identify the problems in particle physics that would benefit from

arXiv:1804.06913v2 [physics.ins-det] 20 Apr 2018 performing neural network inference with FPGAs. For our example jet substructure model, we fit well • Bigger networks? within the available resources of modern FPGAs with a latency on the scale of 100 ns.

Javier Duarte I hls4ml 39 L1 Muon Trigger Application ACAT 2017 BDT L1T • Current BDT for CMS Level-1 muon pT assignment based on defection angles (ΔΦ, Δθ) and other variables • Implemented using a 1 GB pre-computed LUT that stores the BDT output for every the possible input (compressed to 30- bits) • CMS studying NN implemented with hls4ml, which allows additional inputs (e.g. 68 variables with 18 bits each = 1228 bits!) and improved pT resolution

Javier Duarte I hls4ml 40 Big Convolutional Neural Networks • Main task is /image recognition • Control the number of parameters by baking in assumptions like locality and translation invariance to share weights within a layer Krizhevsky, et al. NIPS 4824

8 layers 62 million parameters (94% are in the FC layers)

AlexNet (2012)

Javier Duarte I hls4ml 41 Big Convolutional Neural Networks • Main task is computer vision/image recognition • Control the number of parameters by baking in assumptions like locality and translation invariance to share weights within a layer Szegedy et al. arXiv:1409.4842 iue3 ogee ewr ihalteblsadwhistles and bells the all with network GoogLeNet 3: Figure

22 layers 1x1+1(S) 1x1+1(S)

5 million parameters Conv Conv 1x1+1(S) Conv 1x1+1(S) 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) 1x1+1(S) Conv Conv Conv Conv SoftmaxActivation Conv Conv DepthConcat DepthConcat AveragePool 7x7+1(V) softmax2 1x1+1(S) 3x3+1(S) DepthConcat 1x1+1(S) FC Conv Conv 3x3+2(S) MaxPool Conv 1x1+1(S) 5x5+1(S) 1x1+1(S) 5x5+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) 3x3+1(S) Conv Conv Conv Conv 1x1+1(S) 1x1+1(S) 1x1+1(S) Conv Conv Conv Conv Conv Conv Conv DepthConcat DepthConcat 1x1+1(S) 5x5+1(S) 1x1+1(S) 3x3+1(S) Conv Conv DepthConcat Conv Conv 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) MaxPool MaxPool 1x1+1(S) 5x5+1(S) 1x1+1(S) 5x5+1(S) Conv Conv 1x1+1(S) 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) 3x3+1(S) 7 Conv Conv Conv Conv Conv Conv Conv Conv Conv Conv LocalRespNorm LocalRespNorm DepthConcat DepthConcat DepthConcat 3x3+1(S) 1x1+1(S) MaxPool 1x1+1(V) 7x7+2(S) 3x3+2(S) 3x3+1(S) 3x3+2(S) 3x3+2(S) 1x1+1(S) 5x5+1(S) MaxPool MaxPool MaxPool Conv input Conv Conv Conv Conv Conv 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) MaxPool MaxPool 1x1+1(S) 5x5+1(S) 1x1+1(S) 5x5+1(S) 1x1+1(S) 5x5+1(S) Conv Conv Conv Conv Conv Conv Conv Conv SoftmaxActivation AveragePool 5x5+3(V) 1x1+1(S) 3x3+1(S) 1x1+1(S) softmax1 MaxPool Conv Conv FC FC 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) 3x3+1(S) 1x1+1(S) MaxPool MaxPool MaxPool Conv Conv Conv SoftmaxActivation AveragePool 5x5+3(V) 1x1+1(S) softmax0 Conv FC FC

GoogLeNet (2014)

Javier Duarte I hls4ml 42 Convolutional CNNs Neural in Neutrino Networks Experiments • Showing• Readout a electron detector neutrino as ainteraction (multidimensional) and feature mapsimage extracted from the convolutionalShown to kernels be eff afterective the at first classifying inception module NoνA neutrino events • The• strong features extracted are the shower as opposed to the muon track • Adapted from GoogLeNet Can we implement these on an FPGA?

νe

: e- :

FEATURE MAPS

Evan Niner in NoνA arXiv:1604.01444 32 9/1/17Javier EvanDuarte Niner | DeepI hls4ml Learning in NOvA 43 HEP Latency Landscape Tools to infer big networks: • Use BRAM to store weights Leads to O(ms) latencies • Increase reuse factor ×1000 • Additional compression: weight sharing, Huffman coding • Multi-pumping DSPs CMS: L1 Trigger HLT Offline 1 ns 1 us 1 ms 1 s DUNE: DAQ Offline

Pure FPGAs

Javier Duarte I hls4ml 44 HEP Latency Landscape Tools to infer big networks: • Use BRAM to store weights Leads to O(ms) latencies • Increase reuse factor ×1000 Natural application: • Additional compression: weight sharing, Huffman coding accelerate offline workflows • Multi-pumping DSPs CMS: L1 Trigger HLT Offline 1 ns 1 us 1 ms 1 s DUNE: DAQ Offline

Pure FPGAs Acceleration with FPGAs

Javier Duarte I hls4ml 45 Under review as a conference paper at ICLR 2017

Han et al. 2016 Table 2: Comparing SqueezeNet to model compression approaches. By modelarXiv:1602.07360 size, we mean the numberUnder review of bytes as requireda conference to store paper all at of ICLR the parameters 2017 in the trained model. Gschwend 2016 CNN architecture CompressionSqueezeNet Approach Data Original Reduction in Top-1 Top-5 Type Compressed! Model Model Size ImageNetZynqNetImageNet 6-bit SqueezeNet smaller than 32-bitSize AlexNet byvs. a AlexNet factor Accuracyof 500 Accuracy •AlexNet None (baseline) 32 bit 240MB 1x 57.2% 80.3% AlexNetand achievesSVD (Dentonthe same et al., accuracy32 bit 240MB(Han et48MB al. 2016) 5x 56.0% 79.4% 2014) ! •AlexNetFits on oneNetwork FPGA Pruning with (Han on32 board bit memory240MB 27MB (Gschwend9x 2016)57.2% 80.3% Table 2: Comparing SqueezeNetet al., 2015b) to model compression! approaches. By model size, we mean the number•AlexNetOthers of bytes requiredhave also toDeep store demonstrated all of the5-8 bit parameters CNNs240MB in on the6.9MB FPGAs trained model.35x 57.2% 80.3% Compression (Han ! CNN architecture Compression Approach DataSupercomputing SystemsOriginal AG Reduction in Top-1 Top-5 et al., 2015a) Type Compressed! Model Model Size ImageNet ImageNetSupercomputing Systems AG SqueezeNet (ours) None 32 bit 4.8MBSize vs. AlexNet50x Accuracy57.5% Accuracy80.3% SqueezeNetCNNAlexNet (ours)Optimization:DeepNone Compression (baseline) FPGA 328 bit bit 4.8MB240MB0.66MB 363x1x 57.2%57.5% 80.3% ! SqueezeNetAlexNet (ours) SVDDeep (Denton Compression et al., 326 bit bit 4.8MB240MB 0.47MB48MB 510x5x 56.0%57.5% 79.4%80.3% Backup!! Slide: FPGA Utilization • Optimizations: SqueezeNet2014) to ZynqNet CNN 1.18M 2.61M 35.62M 435.6k 37.17M 435.6k 136.29M 871.2k 1.68M 35.83M 209.95k 76.14M 314.93k 80.62M 314.93k AdditionalAlexNet optimizationNetwork for Pruning FPGA (Han 32 bit 240MB 27MB173.87M 9x 57.2% 80.3% macc comp comp macc comp macc comp macc comp comp macc comp macc comp macc comp macc comp comp macc comp macc comp macc comp comp macc comp macc comp macc comp ops et al., 2015b) ! ops 55×55 27×27 27×27 27×27 27×27 13×13 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 55×55 55×55 55×55 55×55 55×55 27×27 27×27 27×27 27×27 13×13 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ dim_out dim_out 227x227 111x111 111x111 55x55 55x55 55x55 55x55 27x27 27x27 27x27 27x27 16ch 16ch 64ch 16ch 64ch 32ch 128ch 32ch 128ch 48ch 192ch 48ch 192ch 64ch 256ch 64ch 256ch

AlexNet Deep 5-8 bit 240MB 15×15 1×1 6.9MB 35x 57.2% 80.3% ⋅ ⋅ 55×55 55×55 55×55 27×27 27×27 27×27 27×27 27×27 13×13 13×13 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 111×111 55×55 ch_out ch_out 3 96 96 96 128 128 256 256 256 384 384 ⋅ ⋅ 227×227 (edit) ⋅ 6 ! 3ch 3ch 96ch 96ch 128ch 128ch 256ch 256ch 256ch 384ch 384ch 512ch 512ch 512ch 1000ch 1000ch ;re2/expand3x3 ;re3/expand3x3 ;re4/expand3x3 ;re5/expand3x3 ;re6/expand3x3 ;re7/expand3x3 ;re8/expand3x3 ;re9/expand3x3 to SqueezeNet,;re2/relu_expand3x3 ;re3/relu_expand3x3 using;re4/relu_expand3x3 Compression 33%;re5/relu_expand3x3 sparsity (Han;re6/relu_expand3x3 and;re7/relu_expand3x3 8-bit;re8/relu_expand3x3 quantization.;re9/relu_expand3x3 This yields a 0.66 MB model (363 dim_in Table 5.2.: Resource Requirementsdim_in 227x227 227x227 111x111 111x111 55x55 55x55 55x55 and55x55 27x27 FPGA27x27 27x27 Utilization for the ZynqNet Accelerator when synthe- loss data pool1 pool4 pool8 drop9 conv1 pool10 conv10 relu_conv1 relu_conv10 ;re2/concat ;re3/concat ;re4/concat ;re5/concat ;re6/concat ;re7/concat ;re8/concat ;re9/concat Summary: ;re2/squeeze1x1 ;re3/squeeze1x1 ;re4/squeeze1x1 ;re5/squeeze1x1 ;re6/squeeze1x1 ;re7/squeeze1x1 ;re8/squeeze1x1 ;re9/squeeze1x1 ch_in ch_in 3 3 96 96 96 128 128 256 256 256 384 ⇥ ;re2/relu_squeeze1x1 ;re3/relu_squeeze1x1 ;re4/relu_squeeze1x1 et al.,;re5/relu_squeeze1x1 2015a);re6/relu_squeeze1x1 ;re7/relu_squeeze1x1 ;re8/relu_squeeze1x1 ;re9/relu_squeeze1x1 Network Analysis Network smallerSqueezeNet than 32-bit AlexNet) with equivalent accuracysized to for AlexNet. the Zynq XC-7Z045. Further,FPGA applying resources Deep Compres- ;re2/expand1x1 ;re3/expand1x1 ;re4/expand1x1 ;re5/expand1x1 ;re6/expand1x1 ;re7/expand1x1 ;re8/expand1x1 ;re9/expand1x1 SqueezeNet;re2/relu_expand1x1 (ours) ;re3/relu_expand1x1 ;re4/relu_expand1x1 None;re5/relu_expand1x1 ;re6/relu_expand1x1 ;re7/relu_expand1x1 32 bit;re8/relu_expand1x1 ;re9/relu_expand1x1 4.8MB 57.5% 80.3% 55×55 55×55 55×55 55×55 55×55 27×27 27×27 27×27 27×27 13×13 55×55 27×27 27×27 27×27 27×27 13×13 50x type type Data Convolution ReLU Pooling submodule(6) submodule(6) submodule(6) Pooling submodule(6) submodule(6) submodule(6) ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅

16ch 16ch 64ch 16ch 64ch 32ch 32ch 48ch 48ch 64ch 64ch 510 sion with 6-bit quantization 128ch 128ch and 33% 192ch sparsity 192ch 256ch on SqueezeNet, 256ch we produce a 0.47MB model (

SqueezeNet (ours) Deep Compression 8 bit 4.8MB 0.66MBname 57.5% 80.3% resourcename data conv1 relu_conv1 pool1 ;re2 ;re3 Block;re4 pool4 363x;re5 ;re6 RAM;re7 DSP Slices FF LUT⇥ ID ! ID 1 2 3 4 5 12 19 26 27 34 41 smallerSqueezeNet than (ours) 32-bit AlexNet)Deep Compression with equivalent•6 bit accuracy.4.8MB Our0.47MB small model is indeed57.5% amenable80.3% to 1.05M 79.69M 589.82k 50.33M 589.82k 79.69M 294.91k 50.33M 294.91k 79.69M 147.46k 39.85M 114.69k 43.12M 39.94k 30.05M 54.27k 47.1k remove Pooling 28.31M 510x 48.23M macc comp macc comp macc comp macc comp macc comp macc comp macc comp macc comp macc comp comp macc macc comp macc comp macc comp macc comp macc comp macc comp macc comp macc comp macc comp comp macc ops !ops used 996 739 137k 154k 32×32 32×32 16×16 16×16 8×8 8×8 8×8 8×8 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 64×64 64×64 64×64 64×64 32×32 32×32 16×16 16×16 compression.⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 8×8 8×8 ⋅ ⋅ dim_out dim_out 256x256 128x128 128x128 64x64 64x64 32x32 32x32 16x16 16x16 8x8 8x8 8x8 8x8

16ch 16ch 64ch 16ch 64ch 32ch 128ch 32ch 128ch 64ch 256ch 64ch 192ch 112ch 256ch 112ch 368ch available 1090 900 437k 218k 736ch 736ch 512ch 8×8 1×1 ⋅ ⋅ 64×64 64×64 32×32 32×32 16×16 16×16 8×8 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 128×128 ⋅ 256×256 ch_out ch_out 3 64 64 128 128 256 256 512 384 512 736 736 512 ⋅ (edit) 3ch 3ch 64ch 128ch 128ch 256ch 256ch 512ch 384ch 512ch 1024ch 1024ch :re2/expand3x3 :re3/expand3x3 :re4/expand3x3 :re5/expand3x3 :re6/expand3x3 :re7/expand3x3 :re8/expand3x3 :re9/expand3x3 conv10/split2 :re2/relu_expand3x3 :re3/relu_expand3x3 :re4/relu_expand3x3 :re5/relu_expand3x3 :re6/relu_expand3x3 :re7/relu_expand3x3 :re8/relu_expand3x3 :re9/relu_expand3x3 dim_in dim_in 256x256 256x256 128x128 128x128 64x64 64x64 32x32 32x32 16x16 16x16 8x8 8x8 8x8 loss data drop9 In addition,conv1 these results demonstrate that Deep Compressionutilization (Han et al.,91 % 2015a) 82 not % only 31 % works 70 % pool10 conv10 relu_conv1 :re2/concat :re3/concat :re4/concat :re5/concat :re6/concat :re7/concat :re8/concat :re9/concat Summary: :re2/squeeze3x3 :re3/squeeze1x1 :re4/squeeze3x3 :re5/squeeze1x1 :re6/squeeze3x3 :re7/squeeze1x1 :re8/squeeze3x3 :re9/squeeze1x1 ch_in ch_in 3 3 64 64 128 128 256 256 512 384 512 736 736 :re2/relu_squeeze3x3 :re3/relu_squeeze1x1 :re4/relu_squeeze3x3 :re5/relu_squeeze1x1 :re6/relu_squeeze3x3 :re7/relu_squeeze1x1 :re8/relu_squeeze3x3 :re9/relu_squeeze1x1 ZynqNet Network Analysis Network

6 conv10/split1 :re2/expand1x1 :re3/expand1x1 :re4/expand1x1 :re5/expand1x1 :re6/expand1x1 :re7/expand1x1 :re8/expand1x1 :re9/expand1x1 8×8 8×8

well on CNN architectures with many parameters⋅ ⋅ (e.g. AlexNet and VGG), but it is also able to :re2/relu_expand1x1 :re3/relu_expand1x1 :re4/relu_expand1x1 :re5/relu_expand1x1 :re6/relu_expand1x1 :re7/relu_expand1x1 :re8/relu_expand1x1 :re9/relu_expand1x1

8×8 8×8 8×8 8×8 1x1 to SqueezeNet, using 33% sparsity and⋅ ⋅ 8-bit⋅ ⋅ quantization. This yields a 0.66 MB model (363 32×32 32×32 16×16 16×16 64×64 64×64 64×64 64×64 32×32 32×32 16×16 16×16 ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ type 736ch 736ch 512ch type Data Convolution ReLU submodule(6) submodule(6) submodule(6) submodule(6) submodule(6) submodule(6) submodule(6) submodule(6) Dropout submodule(1) 112ch 112ch 256ch 112ch 368ch 16ch 16ch 64ch 16ch 64ch 32ch 32ch 64ch 64ch compress the already 128ch compact, 128ch 256ch fully 192ch 8x8 convolutional SqueezeNet architecture. Deep Compression⇥ name smaller than 32-bit AlexNet)16x16 with equivalent accuracy toN AlexNet.name data conv1 relu_conv1 :re2 :re3 :re4 :re5 Further,:re6 :re7 :re8 :re9 drop9 conv10 applying Deep Compres- ID 64x64 32x32 ID 1 2 3 4 11 18 25 32 39 46 53 60 61 256x256128x128 • resizeResource layers Utilizationto 2 sioncompressed with 6-bit SqueezeNet quantization by 10 and 33%while sparsity preserving on SqueezeNet, the baseline we accuracy. produce In a summary:0.47MB model by combin- (510 ing CNN architectural innovation⇥ (SqueezeNet) with state-of-the-art compression techniques (Deep⇥ smaller than 32-bit AlexNet) with equivalentThe final accuracy. ZynqNetOur FPGA Accelerator small model contains isN indeedPE = 16 computational amenable to units, which concur- Compression),Javier Duarte we achieved I hls4ml a 510 reduction in46 model size with no decrease in accuracy compared compression. ⇥ rently operate on the calculation of different output feature maps. Each computational unit to the baseline. contains a pipelined 3 3 Multiply-Accumulate unit with 9 separate floating-point multipli- In addition, these results demonstrate that Deep Compression◊ (Han et al., 2015a) not only works Finally, note that Deep Compression (Haners et and al., an 2015b) adder tree uses for a thecodebook summationas of part their ofproducts. its scheme This results for in a total of 144 well on CNN architectures with many parametersfloating-point (e.g. multipliers AlexNet and 128 and floating-point VGG), but adders it is which also constitute able to the computational compressquantizing the CNN already parameters compact, to 6- fully or 8-bits convolutional of precision. SqueezeNet Therefore, architecture. on most commodity Deep Compression processors, 32core of the accelerator. The computational units32 are fed from on-chip caches. In total, up to compressedit is not trivial SqueezeNet to achieve by a10 speedupwhile of preserving8 =4x with the baseline 8-bit quantization accuracy. In or summary:6 =5.3x bywith combin- 6-bit ⇥ 1.7 MB CNN parameters (432k single-precision floating-point weights) and 133 kB image ingquantization CNN architectural using the innovation scheme developed (SqueezeNet)data in Deep are with buffered Compression. state-of-the-art in the on-chip However, compression Block RAM. Han When techniqueset synthesized al. developed (Deep for the Zynq XC-7Z045 Compression),custom hardware we – achievedEfficient a Inference510 reduction EngineFPGA, in(EIE) this model configuration– that size can with results compute no in decrease the codebook-quantized resource in requirements accuracy compared and CNNs utilization figures shown tomore the efficiently baseline. (Han et al., 2016a).⇥ In addition,in table 5.2. inThe the fact that months more than since90 % weof all released Block RAM SqueezeNet, resources and more than 80 % of P. Gysel developed a strategy called Ristrettothe DSPfor slices linearly are used, quantizing shows that the SqueezeNet design has been to properly 8 bits (Gysel,fitted to the FPGA resources 2016).Finally, Specifically, note that Deep Ristretto Compression does computation (Hanavailable. et al., in 2015b) 8 bits, and uses it a storescodebook parametersas part and of its activations scheme for in quantizing CNN parameters to 6- or 8-bits of precision. Therefore, on most commodity processors, 8-bit data types. Using the Ristretto strategy32 for 8-bit computation in SqueezeNet32 inference, Gysel it is not trivial to achieve a speedup of 8 =4x with 8-bit quantization or 6 =5.3x with 6-bit observed less than 1 percentage-point of dropMaximum in accuracy Clock Frequency when using 8-bit instead of 32-bit data types.quantization using the scheme developed in Deep Compression. However, Han et al. developed custom hardware – Efficient Inference Engine (EIE) – that can compute codebook-quantized CNNs Despite the high resource utilization and the resulting long routing paths, the ZynqNet FPGA more efficiently (Han et al., 2016a). In addition, in the months since we released SqueezeNet, accelerator can still be synthesized for an adequate clock frequency of f = 200 MHz. This P. Gysel developed a strategy called Ristretto for linearly quantizing SqueezeNet to 8 bits (Gysel,max 5CNNMICROARCHITECTURE Dis mostlyESIGN becauseSPACE the architectureEXPLORATION fully distributes the computation as well as all the required 2016). Specifically, Ristretto does computationdata onto in 8 the bits, different and it computational stores parameters units. There and are activations no inter-dependencies in between So8-bit far, data we types. have proposed Using the architectural Ristretto strategy designthe for individual strategies 8-bit computationcomputational for small models, units, in SqueezeNet even followed their results inference, these are principles accumulated Gysel separately. This toobserved create SqueezeNet, less than 1 percentage-point and discovered that of drop SqueezeNetresults in in accuracy mostly is local 50x when routing smaller using and few than 8-bit global AlexNet instead interconnections, with of 32-bit equivalent which data can all be sufficiently accuracy.types. However, SqueezeNet and otherpipelined. models reside in a broad and largely unexplored design space of CNN architectures. Now, in Sections 5 and 6, we explore several aspects of the design space. We divide this architectural explorationOperation into two Schedule main topics: microarchitectural exploration (per-module5CNNM layerICROARCHITECTURE dimensions and configurations)DESIGN andSPACEmacroarchitecturalEXPLORATION exploration (high-level end-to-end organization of modules and otherThe last layers). factor which determines the system throughput is the efficiency of the operation So far, we have proposed architectural designschedule. strategies The nested for small loops used models, in the followed algorithm theseallow (in principles principle) a fully pipelined In this section, we design and execute experiments with the goal of providing intuition about the to create SqueezeNet, and discovered that SqueezeNetoperation where is a 50x new smallerpixel (or channel) than AlexNet is fetched with and processedequivalent in every clock cycle. accuracy.shape of the However, microarchitectural SqueezeNet design and other spaceThere models with are norespect reside data dependencies in to a the broad design or and feedback strategies largely loops unexplored that which we would proposed design hinder pipelining within a spacein Section of CNN 3.1. architectures.Note that our goal Now, here in Sections is notsingleto layer. maximize 5 and 6, we accuracy explore in several every experiment, aspects of the but design rather space.to understand We divide the this impact architectural of CNN architectural exploration into choices two on main model topics: sizemicroarchitectural and accuracy. exploration (per-module layer dimensions and configurations) and macroarchitectural exploration (high-level Pipeline Flushing Issue in Vivado HLS 2016.2 end-to-end organization of modules and other layers). 6Note that, due to the storage overhead of storing sparse matrix indices, 33% sparsity leads to somewhat lessIn this than section, a 3 decrease we design in model and size. execute experimentsHowever, the with state the machine goal which of providing determines intuition the operation about schedule the is automatically shape of the⇥ microarchitectural design spacederived with from respect the high-level to the software design model strategies in Vivado that HLS we during proposed synthesis. As described in in Section 3.1. Note that our goal here is notsectionto maximize 4.4.4, Vivado accuracy HLS currently in has every an issue experiment, with the derivation but rather of an efficient operation to understand the impact of CNN architectural choices7 on model size and accuracy.

5.2 ZynqNet FPGA Accelerator Performance 69 6Note that, due to the storage overhead of storing sparse matrix indices, 33% sparsity leads to somewhat less than a 3 decrease in model size. ⇥

7 FPGA Co-Processor Acceleration Card

Leverage recent advances/trends in industry

128 MB

8 GB

8 lanes, 8 Gbps per lane

CPU

Javier Duarte I hls4ml 47 Machine Learning Forum Andrew Putnam Cloud Scale (Microsoft Research) >100 k FPGAs can May 14 @ 1pm communicate to each other L2 Microsoft Catapult

L1 L1

TOR TOR

TOR TOR Deep neural Expensive networks compression

Web search Bioinformatics ranking

Web search ranking TOR

(a) (b)

Fig. 1. (a) Decoupled Programmable Hardware Plane, (b) Server + FPGA schematic. A. Caulfield, et al., Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (2016). https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/ tion hardware is tightly coupled with the datacenter network— services mapped to remote FPGA resources. Ideally, servers placing a layer of FPGAs between the servers’ NICs and not using all of their local FPGA resources can donate Javier Duarte I hls4ml 48 the Ethernet network switches. Figure 1b shows how the those resources to the global pool, while servers that need accelerator fits into a host server. All network traffic is routed additional resources can request the available resources on through the FPGA, allowing it to accelerate high-bandwidth remote servers. Failing nodes are removed from the pool network flows. An independent PCIe connection to the host with replacements quickly added. As demand for a service CPUs is also provided, allowing the FPGA to be used as a local grows or shrinks, a global manager grows or shrinks the pools compute accelerator. The standard network switch and topol- correspondingly. Services are thus freed from having a fixed ogy removes the impact of failures on neighboring servers, ratio of CPU cores per FPGAs, and can instead allocate (or removes the need for non-standard cabling, and eliminates the purchase, in the case of IaaS) only the resources of each type need to track the physical location of machines in each rack. needed. While placing FPGAs as a network-side “bump-in-the-wire” solves many of the shortcomings of the torus topology, much Space limitations prevent a complete description of the more is possible. By enabling the FPGAs to generate and management policies and mechanisms for the global resource consume their own networking packets independent of the manager. Instead, this paper focuses first on the hardware hosts, each and every FPGA in the datacenter can reach architecture necessary to treat remote FPGAs as available every other one (at a scale of hundreds of thousands) in resources for global acceleration pools. We describe the com- a small number of microseconds, without any intervening munication protocols and mechanisms that allow nodes in software. This capability allows hosts to use remote FPGAs for a remote acceleration service to connect, including a proto- acceleration with low latency, improving the economics of the col called LTL (Lightweight Transport Layer) that supports accelerator deployment, as hosts running services that do not lightweight connections between pairs of FPGAs, with mostly use their local FPGAs can donate them to a global pool and lossless transport and extremely low latency (small numbers extract value which would otherwise be stranded. Moreover, of microseconds). This protocol makes the datacenter-scale this design choice essentially turns the distributed FPGA remote FPGA resources appear closer than either a single local resources into an independent computer in the datacenter, SSD access or the time to get through the host’s networking at the same scale as the servers, that physically shares the stack. Then, we describe an evaluation system of 5,760 servers network wires with software. Figure 1a shows a logical view which we built and deployed as a precursor to hyperscale of this plane of computation. production deployment. We measure the performance charac- This model offers significant flexibility. From the local teristics of the system, using web search and network flow perspective, the FPGA is used as a compute or a network encryption as examples. We show that significant gains in accelerator. From the global perspective, the FPGAs can be efficiency are possible, and that this new architecture enables a managed as a large-scale pool of resources, with acceleration much broader and more robust architecture for the acceleration Summary and Outlook • We introduce an HLS-based software/firmware compiler for ultra low-latency applications • Case study with jet substructure in Level-1 trigger • Tunable configuration for a broad range of use cases • Upcoming features: • More network architectures (CNN, RNN, etc.) • Support for Altera/Intel Quartus HLS • FPGAs (with support from industry/cloud computing) may help to accelerate HEP computing workflows • Exciting times ahead at the intersection of machine learning, custom hardware, and high energy physics

Javier Duarte I hls4ml 49 Backup

Javier Duarte I hls4ml 50 CNNs on FPGAs

• NIPS 2017 Demo: https://docs.google.com/presentation/d/ 1mTqsm5TronnB8MFD6yyq3CSPCV4bfck_aQ-ericM9cQ/ edit#slide=id.p3 • Snowflake: arXiv:1708.02579 • DNNWeaver: http://act-lab.org/artifacts/dnnweaver/ • fpgaConvNet: http://cas.ee.ic.ac.uk/people/sv1310/ fpgaConvNet.html • Caffeine

Javier Duarte I hls4ml 51 Recurrent Neural Network

• Main task is language processing, time sequence prediction • LSTM layers allow learned information to persist; network can learn long-term dependences in sequences

LSTM (1997)

Javier Duarte I hls4ml 52 HLS Study Details • Xilinx Vivado 2017.2 • Clock frequency: 200 MHz • FPGA: Xilinx Kintex Ultrascale (XCKU115-FLVB2104)

• A note on inputs: • We assume network inputs have already been computed • Not a good assumption in the jet substructure case with “expert” features • Convolutional and recurrent networks which more naturally operate on “raw” features are in development

• Note: resource usage comes from HLS estimates • Discussion on differences w.r.t. implementation later

Javier Duarte I hls4ml 53 Logic Cell

M A U Out LUT FF X B

CLK

Javier Duarte I hls4ml 54 Meeting HLS Target Timing v2016.4 v2017.2

Timing result, 3 hidden pruned, <16,8>, reuse=2 Timing result, 3 hidden pruned, <16,8>, reuse=2 1 1 20 20 19 19 18 18 17 17 0.8 0.8 16 16 15 15 14 14 13 13 0.6 0.6 12 12 11 11 10 10 Implementation clock period [ns] Implementation clock period [ns] 9 Implementation clock period [ns] 9 0.4 0.4 8 8 7 7 6 6 5 5 0.2 0.2 4 4 3 3 2 2 1 1 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 HLS clock period [ns] HLS clock period [ns] • Implementation timing meets HLS target in v2017.2 for clock periods ≥ 11 ns • Implementation timing meets HLS target in v2016.4 for clock periods ≥ 4 ns

Javier Duarte I hls4ml 55 Kintex® UltraScale™ FPGAs

Device Name KU025(1) KU035 KU040 KU060 KU085 KU095 KU115 System Logic Cells (K) 318 444 530 726 1,088 1,176 1,451 Logic Resources CLB Flip-Flops 290,880 406,256 484,800 663,360 995,040 1,075,200 1,326,720 CLB LUTs 145,440 203,128 242,400 331,680 497,520 537,600 663,360 Maximum Distributed RAM (Kb) 4,230 5,908 7,050 9,180 13,770 4,800 18,360 Block RAM/FIFO w/ECC (36Kb each) 360 540 600 1,080 1,620 1,680 2,160 Memory Resources Block RAM/FIFO (18Kb each) 720 1,080 1,200 2,160 3,240 3,360 4,320 Total Block RAM (Mb) 12.7 19.0 21.1 38.0 56.9 59.1 75.9 CMT (1 MMCM, 2 PLLs) 6 10 10 12 22 16 24 Clock Resources I/O DLL 24 40 40 48 56 64 64 Maximum Single-Ended HP I/Os 208 416 416 520 572 650 676 Maximum Differential HP I/O Pairs 96 192 192 240 264 288 312 I/O Resources Maximum Single-Ended HR I/Os 104 104 104 104 104 52 156 Maximum Differential HR I/O Pairs 48 48 48 48 56 24 72 DSP Slices 1,152 1,700 1,920 2,760 4,100 768 5,520 System Monitor 1 1 1 1 2 1 2 Integrated IP PCIe® Gen1/2/3 1 2 3 3 4 4 6 Resources Interlaken 0 0 0 0 0 2 0 100G Ethernet 0 0 0 0 0 2 0 16.3Gb/s Transceivers (GTH/GTY) 12 16 20 32 56 64(2) 64 Commercial -1 -1 -1 -1 -1 -1 -1 Speed Grades Extended -2 -2 -3 -2 -3 -2 -3 -2 -3 -2 -2 -3 Industrial -1 -2 -1 -1L -2 -1 -1L -2 -1 -1L -2 -1 -1L -2 -1 -2 -1 -1L -2 Package Package Dimensions HR I/O, HP I/O, GTH/GTY Footprint(3, 4, 5, 6) (mm) A784(7) 23x23(8) 104, 364, 8 104, 364, 8 A676(7) 27x27 104, 208, 16 104, 208, 16 A900(7) 31x31 104, 364, 16 104, 364, 16 A1156 35x35 104, 208, 12 104, 416, 16 104, 416, 20 104, 416, 28 52, 468, 28 A1517 40x40 104, 520, 32 104, 520, 48 104, 520, 48 C1517 40x40 52, 468, 40 Footprint D1517 40x40 104, 234, 64 Compatible with B1760 42.5x42.5 104, 572, 44 52, 650, 48 104, 598, 52 Virtex® UltraScale A2104 47.5x47.5 156, 676, 52 Devices B2104 47.5x47.5 52, 650, 64 104, 598, 64 D1924 45x45 156, 676, 52 F1924 45x45 104, 520, 56 104, 624, 64 Notes: 1. Certain advanced configuration features are not supported in the KU025. Refer to the Configuring FPGAs section in DS890, UltraScale Architecture and Product Overview. 2. GTY transceivers in KU095 devices support data rates up to 16.3Gb/s. 3. Packages with the same package footprint designator, e.g., A2104, are footprint compatible with all other UltraScale devices with the same sequence. See the migration table for details on inter-family migration. 4. Maximum achievable performance is device and package dependent; consult the associated data sheet for details. 5. For full part number details, see the Ordering Information section in DS890, UltraScale Architecture and Product Overview. 6. See UG575, UltraScale Architecture Packaging and Pinouts User Guide for more information. 7. GTH transceivers in A784, A676, and A900 packages support data rates up to 12.5Gb/s. 8. 0.8mm ball pitch. All other packages listed 1mm ball pitch. Page 2 © Copyright 2013–2016 Xilinx .

Javier Duarte I hls4ml 56