Fast Inference of Deep Neural Networks in Fpgas for Particle Physics (Arxiv:1804.06913)
Total Page:16
File Type:pdf, Size:1020Kb
Fast Inference of Deep Neural Networks in FPGAs for Particle Physics (arXiv:1804.06913) Jennifer Ngadiuba, Maurizio Pierini [CERN] Javier Duarte, Sergo Jindariani, Ben Kreis, Ryan Rivera, Nhan Tran [Fermilab] Edward Kreinar [Hawkeye 360] Song Han, Phil Harris [MIT] Zhenbin Wu [University of Illinois at Chicago] Research Techniques Seminar Fermilab 4/24/2018 1 Outline • Introduction and Motivation • Machine Learning in HEP • FPGAs and High-Level Synthesis (HLS) • Industry Trends • HEP Latency Landscape • hls4ml: HLS for Machine Learning • Case Study and Design Exploration • Summary and Outlook • Cloud-scale Acceleration Javier Duarte I hls4ml 2 Introduction Javier Duarte I hls4ml 3 ReconstructionMachine chain:Learning Jet tagging in HEP Task to find the particle ID of a jet, e.g. b-quark • Learning optimized nonlinear functions of many inputs for … performingvertices difficult tasks from (real or simulated) data … Many successes in HEP: identificationData/MC of b-quarkuser jets, Higgs • Tag Info tagger corrections … candidates,Jet, particle energy regression, analysis selection, … particles CMS-PAS-BTV-15-001 s=13 TeV, 2016 1 CMS Simulation Preliminary Key features:tt events AK4jets (p > 30 GeV) T • Long lifetimeCSVv2 of heavy 10−1 flavour quarksDeepCSV cMVAv2 • Displacedmisid. probability tracks, … • Usage10−2 of ML standard for this problem udsg Neural network based on 10−3 • 11 c high-level features 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 b-jet efficiency Javier Duarte I hls4ml 4 Machine Learning in HEP • Learning optimized nonlinear functions of many inputs for performing difficult tasks from (real or simulated) data • Many successes in 11.1HEP: Significance identification of the signal and of its strengthb-quark jets, Higgs 41 candidates, particle energy regression, analysis selection, … arXiv:1407.0558 19.7 fb-1 (8 TeV) + 5.1 fb-1 (7 TeV) ×1043 CMS 3.5 S/(S+B) weighted sum H → γγ Data 3 ML algorithms used in every S+B fits (weighted sum) 2.5 B component aspect of Higgs discovery: ±1σ energy regression 2 ±2σ S/B discrimination, 1.5 1 +0.26 µ = 1.14 −0.23 … 0.5 mH = 124.70 ± 0.34 GeV 0 S/(S+B) weighted events / GeV 200 Typically applied offline, B component subtracted not online (trigger-level) 100 0 -100 110 115 120 125 130 135 140 145 150 Javier Duarte I hls4ml 5 mγγ (GeV) Figure 19: Diphoton mass spectrum weighted by the ratio S/(S + B) in each event class, to- gether with the background subtracted weighted mass spectrum. Table 5: Values of the best-fit signal strength, µˆ, when mH is treated as an unconstrained pa- rameter, for the 7 TeV, 8 TeV, and combined datasets. The corresponding best-fit value of mH, mH, is also given. µˆ mH (GeV) b +0.62 7 TeV 2.22 0.55 124.2 +−0.26 b 8 TeV 0.90 0.23 124.9 +−0.26 Combined 1.14 0.23 124.7 − section times the relevant branching fractions, relative to the SM expectation. In Fig. 20 the combined best-fit signal strength, µˆ, is shown as a function of the Higgs boson mass hypothesis, both for the standard analysis (left) and for the cut-based analysis (right). The two analyses agree well across the entire mass range. In addition to the signal around 125 GeV, both analyses see a small upward fluctuation at 150 GeV, which is found to have a maximum local significance of just over 2 s at mH = 151 GeV—slightly beyond the mass range of our analysis. The best-fit signal strength for the main analysis, when the value of mH is treated as an un- +0.26 constrained parameter in the fit, is µˆ = 1.14 0.23, with the corresponding best-fit mass being − mH = 124.7 GeV. The expected uncertainties in the best-fit signal strength, at this mass, are +0.24 and 0.22. The values of the best-fit signal strength, derived separately for the 7 and − 8b TeV datasets, are listed in Table 5. For the cut-based analysis the corresponding value is +0.29 µˆ = 1.29 0.26 at mH = 124.6 GeV, and for the sideband background model analysis the value − +0.26 measured is µˆ = 1.06 0.23 at mH = 124.7 GeV. These values are shown in Table 6 together with the expected uncertainty,b − and the corresponding values for the main analysis. The uncertainty in the signalb strength may be separated into statistical and systematic con- tributions, with the latter further divided into those having, or not, a theoretical origin: µˆ = CMS Trigger 1 kHz 100 kHz 1 MB/evt High-Level 40 MHz Trigger L1 Trigger • Level-1 Trigger (hardware) • High-Level Trigger (software) • 99.75% rejected • 99% rejected • decision in ~4 μs • decision in ~100s ms • After trigger, 99.99975% of events are gone forever Javier Duarte I hls4ml 6 HEP Latency Landscape CMS: L1 Trigger HLT Offline 1 ns 1 us 1 ms 1 s FPGAs CPUs Javier Duarte I hls4ml 7 MACHINE LEARNING & FPGAS 7 Machine learning algorithms are ubiquitous in HEP FPGA usage broad across HEP experiments Field-Programmable Gate Array Centered on DAQ and triggerALL FPGA development ARCHITECTURE 16 • Flexible: re-programmable Some early adaptions ofO(50-100) ML techniques optical in triggerFPGA [1] interconnects betweentransceivers “programmable hardware” FPGA developmentconfigurable becominglogic blocksrunning atmore and accessible ~O(15) Gbs High Levelembedded Synthesis, components OpenCL • LUTs (logic), What are FPGAs?FPGA Flip-Flops (registers), “programmable hardware” FPGA diagram Field Programmable Gate Arrays are reprogrammable FPGA interestDSPs in (arithmetic),industry is growingintegrated circuits Contain array of logic blocks used to configure low level Block RAMs (memory) operations (bit masking,DSPs shifting, (multiply-accumulate,addition) etc.) Programmable hardware with structures Flip Flops (registers/distributed memory) Logic blocks wired together through routing channels High Throughput: O(100)Typical modern FPGA: LUTs (logic) that maps• nicely onto ML architecturesSupport highly parallel and pipelinedBlock algorithm RAMs (memories) (Kintex ultrascale+) implementations with guaranteed latency optical transceivers 1.3Mrunning FFs at 700k LUTs O(15) Gbs 5500 DSPs Typical modern FPGADSPs (multiply-accumulate,LogicLogic block cell etc.) 2200 BRAMs (Kintex UltraScale):Flip Flops (registers/distributed memory) • Massively parallel 1.3 M FFs LUTs (logic) Block RAMsLook-up (memories) Flip-flop 700k LUTs table • Low power (relative to CPU/ 5500 DSPs 2200 BRAMs 9 GPU) 25.04.2018 Jennifer Ngadiuba - hls4ml: deep neural networks in FPGAs [1] Carnes et al., https://indico.cern.ch/event/567550/contributions/2629686/ Javier Duarte I hls4ml 8 CPUs, GPUs, FPGAs, and ASICs How are they different? • FPGAs are the middle ground of latency, energy AeRCHITECTURESfficiency, and flexibility 41 ASICs FPGAs GPUs CPUs Source: Bob Broderson, Berkeley Wireless group Javier* GPUs Duarte still Ibest hls4ml option for training 9 * FPGAs generally much more power efficient High-Level Synthesis • FPGA development is becoming more accessible, with tools like High-Level Synthesis: untimed C code with additional Motivation for High-Level Synthesis (HLS) directives that is synthesized into RTL Verilog/VHDL Motivationmodule dut(rst, clk for, q); High-Level Synthesis (HLS) input rst; uint8 dut() { input clk; static uint8 c; moduleoutputdutq;(rst , clk, q); c+=1; inputregrst[7:0]; c; vs. } uint8 dut() { clk; inputalways @ (posedge clk) static uint8 c; Untimed C code outputbeginq; c+=1; reg [7:0]if (rst c;== 1b’1) begin vs. } c <= 8'b00000000; HLS alwaysend @ (posedge clk) Untimed C code beginelse begin rst 1 c <= c + 1; if (rst == 1b’1) begin end + q c <= 8'b00000000; 1 0 HLS 0 c 8 end assign q = c; elseendmodule begin rst 1 clk c <= c RTL+ 1; Verilog Javier Duarteend I hls4ml 10 + q An1 0 8-bit counter 0 c 8 4 assign q = c; endmodule clk RTL Verilog An 8-bit counter 4 Industry Trends • Industry is moving toward custom hardware and FPGAs to quickly apply ML algorithms • Already being used as backend accelerators for Bing web searches, Siri queries, and more… Microsoft Apple Google Intel Javier Duarte I hls4ml 11 hls4ml 100 kHz 40 MHz L1 Trigger • hls4ml: neural network translation library for HLS • Support for common ML workflows and architectures • Tunable configuration for different use cases • Focus on L1 trigger as 1st application • What can we do in < μs on one FPGA? Javier Duarte I hls4ml 12 Case Study and Design Exploration Javier Duarte I hls4ml 13 Case Study: Jet Substructure • Illustrative example: not necessarily the most realistic for L1 today, but lessons are generic Javier Duarte I hls4ml 14 Jet Substructure Inputs ECFs Observables mass mmMDT β=1,2 N2 β=1,2 M2 β=0,1,2 C1 β=1,2 C2 β=1,2 D2 ↵,β = 1,1 , 1,2 multiplicity ( ) ( ) ( ) D2 z log z Multiplicity Õ Table 1: A summary of the observables used in the analysis. Groomed mass separates top, W/Z, and quark/gluon this study [51–54]. A brief description of• each of these variables is presented in Ref. [55]. These are used as expert-level inputs to a neural network• top/gluon classifier which have is greater near optimal multiplicity3. than W/Z/quark • ECF N2β=1 separates 2 and 3-prong jets (W/Z/top) from 1-prong jets (quark/gluon) Benchmark networks and floating point performance We train a neural network for the classification task of q, g, W, Z, and t discrimination. The data are Javier Duarte I hls4ml 15 randomly split into training (60%), validation (20%), and testing (20%) datasets. The input features are standardized by removing the mean and scaling to unit variance. The architecture, illustrated in Fig. 4 (left), is a fully-connected neural network with three hidden layers.