“GPU in HEP: online high quality trigger processing”

ISOTDAQ 15.1.2020 Valencia

Gianluca Lamanna (Univ.Pisa & INFN) G.Lamanna – ISOTDAQ – 15/1/2020 Valencia The World in 2035 in The World 2 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia The problem 2035 in The problem What the triggers will will like in look the triggersWhat FCC by trigger! by target, Fixed (Future Circular Collider) Flavour factories, … the physics reach will be defined defined be reachwill physics … the factories, is only an example an only is 2035 ? 3 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia The trigger 2035… in The trigger …but will be also … willbe tracks, clusters, rings clusters, tracks, respecttoday: with more be complicated will primitives The events interesting on trigger to the ability limit Up will Pile and backgroundThe higher resolution High Fast decision events interesting for efficiency High factor reduction High similar to the current trigger… different … 4 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia The trigger 2035… in The trigger done in firmware may now be done in in done be now may firmware in done done be now previously What All of these effects go in the samedirection Higher luminosity Higher energy processing granularitymore & resolutionMore up pile ID becomes challenging, crossing Bunch track better granularity → region forwardin occupancy High high for Resolution - calo correlation in firmware; What What firmware; in pt had to be be to had leptons → high → leptons done in hardware may may hardware in done  - precision primitives precision more data moremore& was previously previously was software! 5 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia generation on detector ( ondetector generation limitation Main traditional Is a Classic trigger in the future? the in Classic trigger Pre Getting alldata in one place Cost and dimension Yes and no • • • • FPGA: not suitable for complicated processing latency) No “slow” detectors can participate to trigger (limited New links Software: commodity - processing on - > data flow : high quality trigger primitives trigger quality high: - “pipelined” detector could help hw processing trigger possible? trigger ) 6 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Pipelined trigger in 2025… in trigger Pipelined 7 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Classic Pipeline: Processing Classic Pipeline: The performances of The performances FPGA+CPU of new introduction thewith the future in change would scenarioThis as fast is notas in “standard” capability in computing The increasing the problem on dependsdevice computing hybrid devices hybrid CPU FPGA FPGA as 8 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Triggerless ? Main limitation Main PCs? on all data to bring possibleIs it CMS CMS & ATLAS LHCb • • • switch, x2000 computing 4 PB/s readout data, 4M links, x10 in performance for 1 PB/s) ev.building (Maybe) No in 2035: links 30 : yes in 2021 MHz readout, : probablyno (in2035) (for comparison largest Google data center = : data transport data 40 Tb/s data network, 4000 cores, 8800 track+calo =2PB/s + 5 PB/s 9 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia 10 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Nowadays standard market is not interested in this application. this in interested is not marketstandard Nowadays serializers):hard (rad purposes HEP with compatible is not consumption powerthe But increasing steadily is bandwidth The links Triggerless 4M links links 4M 5Gb/sper 500mW is e.g.lpGBT  2 MW only for links MW only for on detector2 links : Data Links 11 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Example: an alternative alternative an approach Example: Focus Focus on Data Links Triggerless processing Focus on On Classic pipeline: : - detector detector detector Buffers      Trigger: High Latency Focus Focus on Large buffers Large softwarein implementedTrigger trigger multiplexedTime network Toroidal nodes computing Heterogeneous On -

DET

Disk Buffer Merging Toroidal Network 12 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia GPU: GPU: Graphics Units Processing 13 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Moore’s Law Moore’s Faster clock means higher voltage means higher Faster clock clockthe to is related performance ofincreasingThe every18 months” double will transistorstheir of and thenumber law: “The Moore’smicroprocessorsperformanceof  power wall 14 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Parallel programming Parallel programming physical constraints to frequency scaling frequencyto constraints physical the due to mainly is architecturesof parallel The use forlonger somethingis no Allthe processors nowadays are multicores 15 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Parallel Parallel computing maximum speed maximum caseIn anythe concurrently be solved to problems smaller in split be can Severalproblems Law ( resources the on depends parallelizable part of amount the if improve can situation The Amdahls’s codethe ( part of serial the on depends it but linear , not ) law  Gustafson’s Gustafson’s ) - up is up  16 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Parallel programming GPU on Parallel programming helps a lot. are typicalapplication whereparallelization can Rendering, Image transformation, ray tracing, etc. programming for graphical application. The GPUs are processors are processors to parallel dedicated 17 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia "a single a of definition technical The manycores multi a introduce to firms different 2007 ( NVIDIA in by been introduced ( generic computing the use to possibility The per second.“ polygons million 10of processinga minimum of capable is that engines rendering and setup/clipping, triangle lighting, transform, integrated In 2008 OpenCL: consortium of 2008 consortium In OpenCL: What are What the GPUs? - platform language forlanguage platform CUDA - chip processor with with processor chip computing. computing. ) GPGPU GPU ) has ) GPU for is (1997) (2019) 18 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Why the the GPUs? Why cores of computing power,with Easy to have adesktopPCwith 2 ineachgeneration Impressive derivativealmost a factor of The PCno longer getfaster, just wider. GPU computing… scientific simulation, HPC, in Several applications development Continuous problems « power for computing Very high SIMD/SIMT is to a way cheat theMoore’s law . parallel architecture architecture parallel thousand of vectorizable teraflops » 19 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Computing power Computing 20 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Why?: CPU vs GPU CPU Why?: 21 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CPU CPU oriented design oriented latencyCPU: Large control part Memory management Powerful ALU Branch prediction Caches Multilevel and Large branching reduce latency in To accessmemory latency long Convert 22 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia GPU Limited controlLimited cachingLimited level parallelism Thread kernels to execute Processors) Multi (Streaming SMX architecture Thread/Data) Multiple instruction (Single SIMT/SIMD branch predication branch but prediction, branchNo oriented design oriented throughputGPU: 23 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CPU GPU v CPU T 24 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CPU GPU vs CPU watt performanceper Low Cache misses costly bandwidth memorylow Relatively ALU + Powerful prediction + Branch caches+ Large rate+ Fast clock memorymain + Large Extension card Extension per Low capacity memory Limited performanceper watt+ High resources compute + More (parallelism) tolerant+ Latency memory main bandwidth + High - thread performancethread 25 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CPU Heterogeneous = CPU + GPU GPU The winning application uses both 100X faster than CPU for parallel code) code)parallel forCPU than 100X faster be wins (can where throughput part parallel for GPUs code) sequentialfor GPU be 10X thanfaster(can parts forsequential CPUs C omputing CPU and 26 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Processing structure with CUDA with structure Processing Three steps: What is 3) copy back results back copy3) copy2) Device fromdata copy1) processorgraphics the of functions the all almost control to allow APIs Dedicated on computing enable the to C/C++ extensions a setof It is Kernel CUDA GPGPU NVIDIA GPUs NVIDIA and executeand ? Host to 27 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Grids, blocks and and threads blocks Grids, kernel launch time The “shape”of the system is at decided parallelmodel of computation: (and physically) groupedin aflexible The computing resources are logically same time the at same instruction execute block same in threads interact, not do blocks different Threadsin a block in synchronizeand threadscommunicateOnly can 3D 2D and 1D, With 3D 2D and 1D, With 3D 1D,2D and grid threads blocks 28 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Mapping on the hardware the on Mapping GP100 29 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Memory Most of the memory managing and data data and managing memory the of Most GPU in fundamental is hierarchy memory The Shared memory/registers Shared Memory Global Space Address Unified user the to left is locality accessible from kernel only from kernel accessible of blocks/threads, very On Chip, fast, lifetime from hostanddevice accessible On board,application, of the relatively slow, lifetime programming 30 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Streams superimposed to avoid superimposed be can execution kernel and copy data asynchronous the device to host from transfer data multiple of Incase latency the to hideis all the of purposeThe main dead time dead GPU computing 31 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Other ways to program GPU program ways to Other HIP, SYCL, … program OpenCL use acceleratefunctions, dedicated you could consider to If your code is almost “low level” CUDA NVIDIA GPUs supports GPUs NVIDIA (Thrust, Libraries ( Directives is the“best”toprogram way is aframework to equivalent CUDA to multiplatforms OpenMP ArrayFire , OpenACC OpenCL ,…) CPU (GPU, CPU, DSP, FPGA,…). , …) . or if you needto NVIDIA GPU NVIDIA at 32 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Example you have thergbcode for each pixel ingreyscale Assume you want to convert an image inwhich carries only intensity information. intensity only carries a is image A greyscale pixel each in blue and red, green of quantity the to define a standard is Rgb : to RGB gray scale scale n image in which the value of each pixel of value the which in n image conversion 33 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia grayPixel Example do: For formula: Conversion : to RGB } __global__ 255] [0, characters unsigned is encodedas image // Theinput 3 #define CHANNELS [I,J] = 0.21* [I,J] int int } if grayImage point constants by floating // Wemultiply it store rescaling and the // perform char unsigned char unsigned char unsigned int gray scaleimage the columns than times // CHANNEL having image ofthe RGB think // onecan int image grayscale for the coordinate // get1D (x y x < = = rgbOffset grayOffset width threadIdx.y threadIdx.x void [ grayOffset && grayscale colorConvert b g r = y = RGB to corresponding channels // wehave3 = = = grayOffset < y + + rgbImage rgbImage * height) { rgbImage width blockIdx.y blockIdx.x ] = 0.21f ( r + int unsignedchar [ [ * rgbOffset rgbOffset [ CHANNELS; x; rgbOffset + 0.71* * width, r * * conversion + blockDim.y blockDim.x 0.71f each unsigned char int + + * 3 2 * g ]; ]; height) { grayImage + ; ; //blue value for pixel //green value for pixel ]; 0.07f // redvalue for pixel g pixel (I, J) pixel (I,J) * * b; , + 0.07* rgbImage , b in Lab 14! programming More on GPU 34 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Triggers Triggers and and GPUs 35 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Next generation trigger Next generation Flexibility, Scalability, Upgradability Upgradability Scalability,Flexibility, selectiononlineAccurate band readoutHigher effects: tiny will lookfor experiments generationNext More softwareMoreless hardware readout detectorthe to closerand closerselection quality High nodes processing data fasteron bring to New links morebecomeThe triggersystemsand important more 36 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Where to use GPU in the trigger? the use in Where to GPU In In  datasourcethe to close processorsof flexibility power and Bring speed on factorgain can you algorithm) forevents or (either for be parallelized can problem yourIf place. “natural” the It is Low High more physics - up Level Trigger Level Level Trigger Level  smaller number of PC in Online Farm Online in PC of smaller number Krasznahorkay@CHEP2019 Teng@CHEP2019 Aaji@HOW2019 VomBruch@CHEP2019 Bocci@CHEP2019 Gutsche@CHEP2019 Chen@CHEP2019 Rohr@HSF Betev@CHEP2019 - workhop 37 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia cheaper than full halves GTX 580 Cellular automaton + TPC tracks/event 2 kHz input at HLT, 5x10 ALICE: HLT TPC online Tracking in RUN1 in Tracking ALICE: TPC HLT online the number of computer nodes ( (in2011) and CPU ) Kalman AMD S9000 7 B/event, 25 GB/s, filter (2015) 1.5 MCHF  20000 GPUs 38 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia ALICE: ALICE: O the concept trigger New O Run3 beam reconstruction , when no Asynchronous: final taking data compression during data Synchronous: calibration and events) crossing (1 TF is about 1000 frames instead(TF) of bunch X100events rate and time Detector modifiedwrt Run1/2 2 - less readout less (online+offline) 2 readout 39 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia ALICE: ALICE: O the disk on Data are transfered events final the builds and reconstrution runs calibration, applies Node) (Event Processing EPN detectorsread receivesdata from levelprocessor)(Firts FLP in the full chain the full in dataOnly compression: ~40x any stage at applied selection triggerNot frames) assemblesSFT (sub FLP (mainly from Tbyte/sTPC)3.5 read 9000 - out links out 2 readout - out - time time 3.5 3.5 TB/s FLP 635 GB/s 5.5x EPN 90 GB/s 7.1x Storage 40 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia performant algorithms and powerfull GPUs powerfulland algorithms performant speed Total TPC by dominated is fully reconstructionThe online ALICE and GPU and ALICE One GPU can replace ~40 CPU cores CPU ~40can replace One GPU 25x algorithms on GPUshave aspeed TPCThe processing online the is in doneITS tracking few Only GPUs 1500 - up is the sum of more of sum isthe up - up of 20 up - 41 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Cellular Cellular Design from scratch for parallel application parallel forscratch fromDesign tracks possible the all among realtrackthe find to rule some Apply segments possiblethe Connect layers detector from segments trackslocal Build Highly parallelizable Highly Automaton for track seeding 42 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CMS: CMS: different processors in CMSSW in processorsdifferent on computing accomodate a frameworkto build to Effort a solution be co otherand GPU (with computing Heterogeneous a 4x for only account can performance The foreseenCPUs increasein HLT in load computing HL In Similar Similar for ATLAS pile higher upgrade, detectors~3xfrom ~1.3x from heterogenous - LHC era CMS expectsLHC eraCMS 30x - up, ~7.5x from rate event up, ~7.5x - processors) can processors) computing in the high high the in level trigger 43 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CMS: GPU in RUN3 in CMS: GPU framework Use the CPU for interaction with the software algorithm on GPU Offload various steps of the reconstruction Tracking on GPU ready for HLT in Run3 (Patatrack) power thecomputing GPU exploitingthequality fitting Improve automatonCellular detectorpixelsthein verticesand tracks of Reconstruction 44 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CMS: CMS: of the 2018farm (~80%) Result: possible reduction GPU CPU Improved physics performances physics Improved events 10/16 concurrent cores) NVIDIA TeslaSingle T4(2560 16with threads 4 jobs cores2x16 6130 Gold Xeon socketDual Patatrack Throughput performances 45 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia about (5120 ThunderX2+Volta Cavium Preliminar Encouraging CMS: CMS: NVIDIA Jetson AGX Xavier 150 W 150 2560 Comparison Rediced power consumption: 30 W consumption: power Rediced (512 cores) GPU Volta integrated an with cores 8ARMv8 with computerboard Single 1800 cores Heterogenous cores results ) with T4: with ev results give /s on Computing Computing 46 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CMS: CMS: factor 6x factor additional isan GPU on CLUE algorithm clustering the present than faster ~30x factor isa CPU on CLUE the HGCAL for designed New algorithm GPU on (CLUE) Energyby Clustering by using streams and multipleGPUs The data transfer timecan be reduced Factor 30x if exclude data transfer time Parallelizable5 steps algorithm clustering on GPU on 47 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia HLT1 andHLT2 areplaces idealwhere to useGPUs! LHCb : Run3 DAQ for HLT2 HLT1 HLT2: HLT1: more pile 5x rateand 30x higher removed Trigger L0 Hardware triggers,… PID, vertices, exclusive reconstruction offline track quality Detector calibration and factor 30x Reduce the rate by a MHz!!!!) reconstruction (@30 Full charged track - up 48 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia LHCb : the Allen : Allen the HLT 1 on GPU (O(500)) HLT2 O(1000) Calibration Event Storage (O(250)) building project 80 Gb/s 1 - 2 Tb/s Next step: useGPU in Next HLT2 Reduction of data bandwidth computing power Use GPU toincrease the farm switch, Bulding in the Event before the Move theHLT1 From 40 Tb/s to 1 From40to Tb/s Parallelize algorithms events on Natural Parallelprocessing HLT1at reconstruction Full - 2 Tb/s 49 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia LHCb TDR TDR with consistent performace physicswith 1 MHzto MHz 30 from reducedrate Event Selection: GPU on produced All theHLT1primitives muon identification, … 1 Track, 2 Tracks, High Muon: particleidentification SciFi: Tracks reconstruction UT: Tracks reconstruction Velo: clustering, tracking, vertexing : the Allen : Allen the project - pt muons, 50 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia LHCb Tb/s 40 to a respectwith cheaper easier and is network commutate Switch1 Tb/sa tested Several GPU 1 MHz to output the reduce and MHz) (30 rate thewith input O(500)cope to Use HLT1 full chain Run : Allen timing : timing Allen performaces 51 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia algorithm seeding tracking Bestresult: hough ofbased tracking Muon, CA on and jet finding Calorimeter, Automata on based trackingDetector, Inner Framework Extension( AcceleratorProcess Run1 in Demonstrators GPU in HLT HLT ATLAS in GPU clusterization transforms (CA) Cellular Cellular APE x28 ) in in based the “Athena MT” framework (based on TBB) aynchrounous New studies are on going to study the interaction of concurrency and multithreading ATLAS software that was not able to manage The reason is related to the use of “Athena”, the in ATLAS The conclusion of this study was • The gain is marginal Muon run of heterogeneous not to use the GPU accelarators Calorimeter Tracking 52 in G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Low level level Low Brute force: Brute Rock Solid: Rock Solid: flexibility re to (sometimes R&Dof years several Pro and links processors withdedicated board own your Build junk toprocessjust resources Cons: Pro routers. smart) eventually(and fast using farm, pc hugeon a data all Bring - rebuild the wheel), limitedwheel),the rebuild : power, reliability; power, : flexibility; toprogram, easy : very expensive, most of of most expensive, very trigger:Different Custom Hardware PCs [LHCb] Cons : . [Korda’s talk] Solutions Off Elegant: - Latency Pro reasons otherfor developed continuously purposes other built for hardware toexploit Try logic. and clock FPGAby limited complexity algorithm Cons Pro conditions. trigger your toapply way a flexibletohave logic a programmable Use the : cheap, flexible, scalable, PC based. scalable,flexible, cheap,: latency;deterministic low and flexibility : - : not so easy (up to now) toprogram, tonow) (up easy so not : shelf: shelf: FPGA GPU Cons 53 : G.Lamanna – ISOTDAQ – 15/1/2020 Valencia decision at tens of MHz events rate? events MHztens of at decision power: Computing systems? trigger in synchronous usage for enoughstable latency low level Is the a system?triggerof with the tinylatency Latency: GPU in low level trigger level low in GPU Is the GPU Is the Is latency per event small enough to cope enoughsmall event per latency GPU fast enough to take trigger to enoughfast 54 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Low Level Level Low NA62 trigger: Test bench NA62: RICH: Look Look for ultra extraction) Fixed target experiment on SPS (slow 10 MHz events: about 20 hits per particle MisID Time resolution: 70 2 spots of 1000 PMs each GeV between pions and muons from 15 to 35 Reconstruct Cherenkov Rings to distinguish 1 17 m long, 3 m in diameter, with Ne filled at atm : 5x10 - 3 - rare kaon decays ( ps K → pnn ) 55 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia TEL62 primitives trigger arefor used data Readout buffer readout for memories DDR2 5 channels 512 FPGAs ( eth through transmission primitives Data and UDP HPTDC ) 56 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia The The NA62 TDAQ “standard” system RICH L1/L2 PC 1 MHz 10 MHz Data Trigger primitives trigger L1 trigger L0 L1/L2 PC MUV L1/L2 PC CEDAR GigaEth SWITCH L0TP CDR L1/L2 PC STRAWS O(KHz) L1/L2 PC LKR L1/L2 PC 100 kHz 1 MHz L1/L2 LAV PC

10 MHz

L0 EB L1/2 L2 L1 L0 kHz. kHz 100 information “Complete 1 MHz detector”. “Single Max MHz 10 synchronous : : : Software Software Software Hardware latency to to to 100 kHz 100 1 MHz level level few level 1 ms level . . ”. . . . 57 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Latency: main problem of GPU computing GPU of main Latency: problem HostPC NIC set chip VRAM GPU CPU express PCI RAM computing computing multi the optimizing latency the of component “Hide” transfertime: Decreasedata the RAM Host in copy double by dominated latency Total NIC Custom manage of Memory Access) DMA buffers some (Direct - events 58 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia NaNet - 10 (foreseen100 40 on Working …) decompression, FPGA Streamon processing Links Gb/s 4x10 UDP GPUDirect board) board ALTERA up to 10Gb/s) 4 SFP+ ports (Link speed PCIe offload support offload (merging, (merging, (TERASIC (TERASIC DE5 x8 Gen3 (8 GB/s) Stratix /RDMA GbE GbE V - dev Net ) 59 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia NaNet - 10 VCI 2016 VCI 16/02/2016 60 60 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia NA62 GPU trigger system trigger NA62 GPU Max output bandwidth Total buffering Max Latency L0 trigger rate Events rate 2024 TDC channels, 4 TEL62 : 10 MHz : 1 : (per (per board): 1 MHz ms TEL62 (per (per board): 8 GB GPU reduced event Readout event: 4 Gb/s GPU NVIDIA K20 4x1Gb/s primitives 4x1Gb/s 8x1Gb/s • • • • • 1.5 kb Bandwidth: PCI ex.gen3 6GB 3.9 2688 GPU trigger Standardtrigger for linksreadout data Teraflops : 300 b VRAM cores (1.5 (1.5 Gb/s) : (3 (3 Gb/s) 250 GB/s 61 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Ring fitting problem fitting Ring Fast Trackless Accurate Low latency Offline resolution resolution required Offline (synchronous) Online trigger Events rateof MHzat levels oftens iterativeNot procedure information from to mergemany L0Difficult at detectors no information from thetracker market: the on rings Multi transform, … Metropolis possibilisticclustering, Trackless: fiTQun, APFit, Constrained Hough, … With seeds: Likelihood, - Hastings, Hough 62 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Almagest Design a procedure for parallel implementation parallelfor a procedure Design “ AD*BC+AB*DC=AC*BD the relation:if (the and cyclicvertex is validon acircle)if only lie Newalgorithm( Almagest ) basedon) Ptolemy’s theorem : “A quadrilater is 63 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Almagest conditionthen satisfythe Ptolemy’s nextpointdoes not remainingpoints:the if ii) Loop on theLoopon fit considerit conditionthen Ptolemy’s satisfythe iii) If If the point starting points) i ) Selecta reject it for the triplet B (3 A C iv) …again… D D D singlefit ring v) Perform a points alreadyused excludingthe vi) Repeat by 64 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Histogram each point in the grid The histograms of the distances is created for The XY plane isdividedin a Grid 65 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Processing flow Processing TEL62 TEL62 TEL62 TEL62 (~10us) Datareceived are – – – – S2 S1 J2 J1 - - - ( Pre <10us) Merging Decompression Protocoloffload - VRAM CopythroughPCIe in processing gathering : NaNet Processing - 10 Eventsignal to GPU DMA Computing realtime), … realtime), Send,Re computing GPU KERNEL available Results - synchro (for synchro Clop GPU 66 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Results in 2017 NA62 2017 Run in NA62 Results beam 25% ~ Testbed time Gathering (9*10 the NA62 NA62 the Processing (P100) per time Processing P100 nVIDIA 32 GB DDR3 2.0 GHz Intel Patsburg QF Intel C602 Supermicro X9DRG : 350 Xeon 11 intensity K20c and target Pps E5 m requirements - s 2602 latency ) - : below event ) : 1 : 200 us us (K20c), <0.20 ( compatible with with us 67 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Where to use HEP? use in Where to GPU DNN andML Simulation & Analysis Trigger Jet reconstruction Data quality and (maybe) inferenceTraining Fast linearalgebra generators number Random V Geant recognition Pattern Calorimeters, Tracking, O(10 MHz)(perto Rate board) up 10 fromorder Latency - 1000 us 68 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Conclusions for several reasons… reasons… several for is GPUs program addition In advantages several give can purposes other for built processors with triggering and job our towards migration The architectures new to design help to solution possible a Heterogeneous unavoidable is power more computing today: In 2035 the trigger will look different than will lookdifferenttrigger the THANK YOU!!! THANK computing with with computing COTS quite funnyquite is a trend in trendis a GPU could be could … 69 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia SPARES 70 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia PANDA 1 TB/s 1 GB/s 10 respect to CPU) application (a GPUs are code on FPGA and on GPU: the Comparison between the same First Tracking, EMC, PID,… 10000 selection: assuming Full reconstruction for online 7 events/s exercice – 100000 CPU 100000CPU cores 30% faster : online tracking factor 200 1 for this - 10 with with ms  71 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia CBM STAR & CBM Cellular the beam Since the trigger ~1000 10 7 Au+Au - tracks/event less automaton+KF continuos ~10000 tracks/frame ~10000 collisions /s Grid=100000 Grid=100000 cores structure of 72 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Computing vs LUT Computing processors LUT Sin,cos, log, … aim to shrink aim shrink to this space In any GPUs case the … It depends Where is this limit? Complexity 73 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Accelerators standard bus standard through connected are processors co Nowdays computing intensive for co Accelerators: - processors - 74 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia Computing power Computing 75 G.Lamanna – ISOTDAQ – 15/1/2020 Valencia 76