HEP Computing Trends
Andrew Washbrook, University of Edinburgh
ATLAS Software & Computing Tutorials, 19th January 2015, UTFSM, Chile

Introduction
• I will cover future computing trends for High Energy Physics, with a leaning towards the ATLAS experiment
• Some examples from non-LHC experiments where appropriate
• This is a broad subject area (distributed computing, storage, I/O), so here I will focus on the readiness of HEP experiments for changing trends in computing architectures
• Also some shameless promotion of work I have been involved in…
Many thanks to all the people providing me with material for this talk!

LHC Context
Run 2
• Increase in centre-of-mass energy to 13 TeV
• Increase in pile-up from ~20 to ~50
• Increase in trigger rate up to 1 kHz
• More computing resources required to reconstruct events
[Figure: RAW-to-ESD reconstruction time]
High Luminosity LHC
• HL-LHC starts after LS3 (~2022)
• Aim to provide 300 fb⁻¹ per year
• Pile-up of 150 expected
• 200 PB/year of extra storage
[Figure: HL-LHC timeline]

CPU Evolution
• Die shrinks are getting smaller
• Research down to 5 nm, depending on lithography and materials
• Clock speed improvement has slowed
• More cores per socket
• A server at a Grid computing centre has at least 16 cores, typically more
• Extrapolation from 2013 predicts 25% server performance improvement per year
[Figure: processor scaling trends 1980-2010, relative scaling of transistors, clock, power, performance and performance/W]

Low Power Processors
• Power efficiency is becoming increasingly important in data centres
• Cost-effective high-throughput computing
• ARM processors are now 64-bit, with much higher performance/watt than Intel Xeon
• Intel have the Atom processor line (currently Silvermont)
[Figures: AMD 64-bit board; Intel Atom roadmap; Geant4 simulation studies on ARM – Andrea Dotti]

CPU Parallelism
• High Energy Physics does embarrassingly parallel very well - events can be processed independently of each other
• However, there are other levels of parallelism in the CPU that are not used
• Hyperthreading is enabled by some Grid sites
[Images from A. Nowak – The evolving marriage of hardware and software]

CPU Utilisation
• By sticking with scalar single-threaded code, HEP is losing more performance capability with each new generation of CPU
• Performance loss at each level of parallelism is multiplicative
• Can avoid under-utilisation by running one application instance per core
• If the footprint of the application is high this leads to memory pressure
• The memory-per-core ratio trend in data centres is flat (or falling)
[Emiliano Politano – Intel HPC Portfolio (September 2014); A. Nowak – An update on off-the-shelf computing]

AthenaMP
• ATLAS uses AthenaMP to reduce memory utilisation on a server running multiple Athena processes
• Forking after initialisation allows memory sharing between processes
• Note that this is still event-level parallelism
• For AthenaMP details see the talk by Atilla this morning
• Some effort required to run this efficiently on distributed computing resources
[Figure: AthenaMP model – Rocco Mandrysch]
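To illustrate the forking idea, here is a minimal sketch of the fork-after-initialisation pattern, not the actual AthenaMP implementation: the worker count, the initialise() placeholder and the round-robin event loop are all illustrative assumptions.

```cpp
// Minimal sketch of a fork-after-initialisation pattern (illustrative only;
// function names, worker count and the event loop are placeholders).
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

void initialise() { /* load geometry, conditions data, configure algorithms */ }

void process_events(int worker, int n_workers) {
    // Each worker handles a disjoint subset of events (event-level parallelism).
    for (int evt = worker; evt < 1000; evt += n_workers) { /* run algorithms on evt */ }
}

int main() {
    const int n_workers = 8;   // assumed: one worker per core
    initialise();              // expensive setup done once in the parent

    std::vector<pid_t> children;
    for (int w = 0; w < n_workers; ++w) {
        pid_t pid = fork();    // child shares initialised pages copy-on-write
        if (pid == 0) { process_events(w, n_workers); _exit(0); }
        children.push_back(pid);
    }
    for (pid_t pid : children) waitpid(pid, nullptr, 0);
    std::printf("all workers finished\n");
    return 0;
}
```

Because pages written during initialisation are only duplicated when a worker later modifies them (copy-on-write), the memory for geometry, conditions data and configuration is paid once per node rather than once per core, which is the saving AthenaMP is after.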
Multithreading
• The large majority of HEP code is still single-threaded
• Sub-event or in-algorithm parallelisation methods are being pursued using Intel Threading Building Blocks (TBB)
• Other MT methods: MPI, Cilk, OpenMP and many more
• ATLAS are attempting to introduce framework parallelisation through the development of Gaudi Hive
• CMS have an equivalent example, processing events along parallel streams
• See Atilla's talk this morning for more details
[Figures: Gaudi Hive parallelism model; Gaudi Hive whiteboard; CMS multithreading model]

Vectorisation
• Vector widths are getting larger with each new generation of CPU
• SIMD operations (Single Instruction Multiple Data) give access to the full vector width
• Accessible through auto-vectorisation capabilities in compilers and through explicit SIMD instructions in the code
• HEP code is not easily vectorizable - this level of parallelism has not been exploited fully in LS1
[Figure: AVX 256-bit registers, c[] = a[] x b[]: a[0..3] multiplied element-wise by b[0..3] gives c[0..3], using all 256 bits]
[Emiliano Politano – Intel HPC Portfolio (September 2014); speedup of MLfit on different Intel microarchitectures – CERN openlab Platform Competence Centre report]
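As a hedged illustration of these two additional levels of parallelism, the sketch below distributes a loop over threads with Intel TBB while keeping the loop body simple enough for the compiler to auto-vectorise; the array contents and the c[] = a[] * b[] operation mirror the AVX diagram and are illustrative, not taken from any experiment framework.

```cpp
// Sketch: thread-level parallelism with Intel TBB plus an auto-vectorisable
// inner loop (the compiler can map it onto AVX registers, four doubles at a time).
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;                     // illustrative problem size
    std::vector<double> a(n, 2.0), b(n, 3.0), c(n, 0.0);

    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            // Simple, dependency-free loop body: a good candidate for
            // compiler auto-vectorisation (SIMD) within each thread.
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                c[i] = a[i] * b[i];
        });

    std::printf("c[0] = %f\n", c[0]);                  // expect 6.0
    return 0;
}
```

Building with something like g++ -O3 -march=native -ltbb exercises both levels at once; whether the inner loop was actually vectorised can be checked with the compiler's vectorisation report (e.g. -fopt-info-vec with gcc).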
Optimisation and Performance Measurement
• Instruction-level efficiency explored by low-level profiling and sampling tools
• Very useful when evaluating timing performance improvements at the sub-ms level in online software
• Valgrind suite popular in HEP (e.g. cachegrind, callgrind)
• Gooda (a Linux perf data visualiser and analyser) provides function costs at the instruction level in ATLAS
• Collaboration with Google
• Intel VTune also proving useful
No shortage of tools available, and hotspots are easily locatable - but difficult to make a real impact on performance without considerable expertise and co-ordination
[Figure: ATLAS Inner Detector optimisation using Callgrind (top) and Gooda (right)]

Many Core Technologies
• HEP has a reasonable grasp of event-level parallelism using multiple cores - even if the use of the CPU itself is sub-optimal
• Unrealistic to optimise the entire code base, with millions of lines, down to the instruction level
• However, significant gains in performance are available by parallelising a few select algorithms
• The adaptation of key routines for offloading to co-processors offers huge speed-up potential
• This is an active area of study in HEP
• Note that many-core does not mean multi-core - these are specialist devices
[Figure: architecture diagram of the Nvidia Fermi GPU]

Intel MIC: Xeon Phi
• Optimised for highly parallel workloads
• Intel's first-generation Many Integrated Core device is available with three product lines
• PCI Express form factor
• Runs its own micro-OS
• Up to 61 cores (1.24 GHz) and 244 hardware threads
• Selling point is the low barrier of entry for Xeon-based code - no rewriting of source code
• Customer experience: optimise on the Xeon first before expecting significant speed-up on the Phi
[Andrea Dotti]

MIC: Knights Landing
• Next-generation MIC expected next year - Knights Landing
• The architecture will be a lot different
• 14 nm chip architecture
• Up to 72 cores
• AVX-512 support
• ~3 TFlops performance
• High-bandwidth memory
• Experience in HEP - use the Xeon Phi as an exploration device in preparation for Knights Landing
[A. Nowak – Intel Knights Landing – what's old, what's new?]

GPUs
• GPUs offer huge potential to accelerate algorithms in HEP
• O(TFlops) available in each device
• More powerful GPUs (with better performance per watt) are in the pipeline
• Algorithm speed-ups of 100-1,000x reported in a number of verticals (oil and gas, finance, CFD)
[Figure: Nvidia Kepler GPU]
So why the delay in widespread deployment in HEP?
• Not a generic device - thousands of SIMD cores
• Sections of code have to be rewritten specifically for GPU execution
• A number of code options - CUDA (platform specific), OpenCL (platform agnostic) and OpenMP/OpenACC
The optimisation of the data handling is the real issue - not the re-coding of the software (see the back-of-envelope sketch at the end of this document)

Nvidia Device Roadmap
[Images courtesy of Nvidia (December 2014)]
• The leading manufacturer is Nvidia, but AMD and ATI also produce O(TFlop) devices
• K40/K80 is the most recent Nvidia enterprise model
• Low-power (Tegra K1) and embedded devices (Jetson TK1) are also available
• Large-scale deployment of GPUs at HPC facilities (Titan and CORAL)
• Recent developments in inter-GPU communication and faster access to GDDR5 memory (NVLink, stacked memory, unified memory, GPUDirect RDMA)
• C++11 support, the Thrust STL-like library and CUB building blocks in the latest CUDA version

ATLAS Trigger
• GPU development activity in HEP is most prominent in online software
• Fast track reconstruction in the ATLAS trigger is computationally expensive (50-70% of processing time)
• Tracking routines are a natural fit for parallelisation
• Bytestream decoding and cluster show
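To make the data-handling point concrete, here is the back-of-envelope sketch referred to in the GPUs section above. Every number in it is an assumption chosen for illustration (per-event CPU time, raw kernel speed-up, event size and PCIe bandwidth), not a measurement from the ATLAS trigger.

```cpp
// Back-of-envelope estimate of effective GPU speed-up once PCIe transfers are
// included. All inputs are illustrative assumptions, not measured values.
#include <cstdio>

int main() {
    const double cpu_time_s      = 0.010;   // assumed CPU time for the algorithm per event
    const double kernel_speedup  = 100.0;   // assumed raw kernel speed-up on the GPU
    const double event_bytes     = 2.0e6;   // assumed data shipped each way per event
    const double pcie_bw_bytes_s = 12.0e9;  // assumed effective PCIe gen3 x16 bandwidth

    const double gpu_kernel_s = cpu_time_s / kernel_speedup;
    const double transfer_s   = 2.0 * event_bytes / pcie_bw_bytes_s;  // to device and back
    const double gpu_total_s  = gpu_kernel_s + transfer_s;

    std::printf("kernel: %.2e s, transfer: %.2e s, effective speed-up: %.1fx\n",
                gpu_kernel_s, transfer_s, cpu_time_s / gpu_total_s);
    return 0;
}
```

With these assumed numbers a 100x kernel gain collapses to roughly a 20x end-to-end gain, which is why the developments listed on the Nvidia roadmap slide (NVLink, unified memory, GPUDirect RDMA) and strategies such as batching many events per transfer matter as much as the kernel code itself.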