HEP Computing Trends

Andrew Washbrook University of Edinburgh ATLAS Software & Computing Tutorials 19th January 2015 UTFSM, Chile

Introduction

• I will cover future computing trends for High Energy Physics with a leaning towards the ATLAS experiment

• Some examples of non-LHC experiments where appropriate

• This is a broad subject area (distributed computing, storage, I/O) so here I will focus on the readiness of HEP experiments to changing trends in computing architectures

• Also some shameless promotion of work I have been involved in..

Many thanks to all the people providing me with material for this talk!

LHC Context

Run 2
• Increase in centre-of-mass energy to 13 TeV
• Increase in pile-up from ~20 to ~50
• Increase in trigger rate up to 1 kHz
• More computing resources required to reconstruct events

RAW to ESD reconstruction time

High Luminosity LHC
• HL-LHC starts after LS3 (~2022)
• Aim to provide ~300 fb⁻¹ per year
• Pile-up of 150 expected
• 200 PB/year of extra storage

HL-LHC timeline

CPU Evolution

• Process feature sizes keep shrinking, with research down to 5 nm depending on lithography and materials
• Clock speed improvement has slowed
• More cores per socket
• A typical server at a Grid computing centre has at least 16 cores, often more
• Extrapolation from 2013 predicts ~25% server performance improvement per year (a rough compounding estimate follows below)
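As a rough compounding check on that extrapolation (an illustration, not a figure from the talk): 25% per year gives 1.25^3 ≈ 1.95, so per-server throughput roughly doubles every three years if the trend holds.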

Processor scaling trends

[Figure: relative scaling of transistor count, clock speed, power, performance and performance per watt versus year, 1980 to 2010]

Low Power Processors

• Power efficiency is becoming increasingly important in data centres
• Cost-effective high-throughput computing
• ARM processors are now 64-bit, with much higher performance per watt than Xeon
• Intel has the low-power Atom processor line

AMD 64-bit board

Intel ATOM roadmap

Geant 4 simulation studies on ARM – Andrea Dotti

CPU Parallelism

• High Energy Physics does embarrassingly parallel very well: events can be processed independently of each other
• However, there are other levels of parallelism in the CPU that are not being used
• Hyperthreading is enabled at some Grid sites

Images from A. Nowak – The evolving marriage of hardware and software

CPU Utilisation

Emiliano Politano – Intel HPC Portfolio (September 2014)

• By sticking with scalar single-threaded code, HEP loses more performance capability with each new generation of CPU
• Performance loss at each level of parallelism is multiplicative (see the illustrative estimate below)
• Under-utilisation can be avoided by running one application instance per core
• If the application footprint is high this leads to memory pressure
• The memory-per-core ratio in data centres is flat (or falling)

A. Nowak – An update on off the shelf computing
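As a back-of-the-envelope illustration of the multiplicative loss (numbers invented for illustration): a scalar, single-threaded job on a 16-core server with 8-wide vector units uses at best

    1/16 (cores) x 1/8 (vector lanes) = 1/128

i.e. under 1% of the server's peak floating-point capability, before instruction-level effects are even counted.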

AthenaMP

• ATLAS uses AthenaMP to reduce memory utilisation on a server running multiple Athena processes
• Forking after initialisation allows memory sharing between processes (a minimal sketch of this copy-on-write pattern is shown below)
• Note that this is still event-level parallelism
• For AthenaMP details see the talk by Atilla this morning
• Some effort is required to run this efficiently on distributed computing resources

Rocco Mandrysch
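To make the fork-and-share idea concrete, here is a minimal, self-contained C++ sketch of the pattern (not AthenaMP itself; the worker count, event ranges and "conditions" payload are invented for illustration). The parent pays the initialisation cost once and the forked workers share those pages copy-on-write:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Expensive one-off initialisation (geometry, conditions, ...)
        // done once in the parent; the pages stay shared copy-on-write
        // with the forked workers as long as they are not modified.
        std::vector<double> sharedConditions(50000000, 1.0);

        const int nWorkers = 4;              // illustrative worker count
        const int eventsPerWorker = 250;     // illustrative event range

        for (int w = 0; w < nWorkers; ++w) {
            pid_t pid = fork();
            if (pid == 0) {                  // child: process its own events
                for (int evt = w * eventsPerWorker;
                     evt < (w + 1) * eventsPerWorker; ++evt) {
                    // read-only access keeps the conditions pages shared
                    volatile double x = sharedConditions[evt % sharedConditions.size()];
                    (void)x;
                }
                std::printf("worker %d done\n", w);
                _exit(0);
            }
        }
        while (wait(nullptr) > 0) {}         // parent waits for all workers
        return 0;
    }

Because the workers only read the shared data, the operating system never duplicates those pages, which is the memory saving AthenaMP exploits while keeping event-level parallelism.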

AthenaMP model

Multithreading

• The large majority of HEP code is still single-threaded
• Sub-event and in-algorithm parallelisation methods are being pursued using Intel Threading Building Blocks (a minimal example follows below)
• Other multithreading options: MPI, Cilk, OpenMP and many more
• ATLAS is attempting to introduce framework parallelisation through the development of Gaudi Hive
• CMS has an equivalent approach, processing events along parallel streams
• See Atilla’s talk this morning for more details
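As a flavour of the task-level parallelism that Intel Threading Building Blocks provides (a generic sketch, not Gaudi Hive or any experiment code; the toy "calibration" step and event count are invented):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>
    #include <vector>
    #include <cmath>

    int main() {
        const std::size_t nEvents = 10000;        // illustrative workload
        std::vector<double> energy(nEvents, 50.0);

        // Process events concurrently: TBB splits the range into tasks
        // and schedules them across the available hardware threads.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, nEvents),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) {
                    // toy per-event "calibration" standing in for real work
                    energy[i] = std::sqrt(energy[i]) * 1.02;
                }
            });
        return 0;
    }

tbb::parallel_for splits the event range into tasks and schedules them over the available hardware threads, the same task scheduler that Gaudi Hive builds on.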

Gaudi Hive parallelism model

Gaudi Hive whiteboard

CMS multithreading model

Vectorisation

• Vector widths are getting larger with each new generation of CPU
• SIMD operations (Single Instruction Multiple Data) give access to the full vector width
• Accessible through auto-vectorisation capabilities in compilers and through explicit SIMD instructions in the code

[Diagram: AVX 256-bit registers computing c[] = a[] x b[], with a[0..3], b[0..3] and c[0..3] packed into 256-bit registers]

HEP code is not easily vectorisable: this level of parallelism has not been fully exploited during LS1 (a short AVX sketch is shown below)

Emiliano Politano – Intel HPC Portfolio (September 2014)
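To show what the c[] = a[] x b[] operation from the diagram above looks like with explicit 256-bit AVX intrinsics (a generic sketch, not taken from any experiment's code; the alignment and size assumptions are noted in the comments):

    #include <immintrin.h>
    #include <cstddef>

    // Multiply two float arrays element-wise using 256-bit AVX registers.
    // Assumes n is a multiple of 8 and the pointers are 32-byte aligned.
    void vmul_avx(const float* a, const float* b, float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_load_ps(a + i);               // load 8 floats from a[]
            __m256 vb = _mm256_load_ps(b + i);               // load 8 floats from b[]
            _mm256_store_ps(c + i, _mm256_mul_ps(va, vb));   // c[] = a[] * b[]
        }
    }

    // The equivalent scalar loop is often auto-vectorised by the compiler
    // (e.g. at -O2/-O3 with AVX enabled) when the data layout and loop
    // structure let it prove the transformation is safe:
    void vmul_scalar(const float* a, const float* b, float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

In practice much HEP code does not take the auto-vectorised route because of branchy logic, virtual calls and array-of-structures data layouts, which is why the vector units often sit idle.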

Speedup of MLfit on different Intel platforms – CERN Openlab Platform Competence Center report

Optimisation and Performance Measurement

• Instruction-level efficiency explored by low-level profiling and sampling tools
• Very useful when evaluating timing performance improvements at the sub-ms level in online software
• Valgrind suite popular in HEP (e.g. cachegrind, callgrind)
• Gooda (Linux perf data visualiser and analyser) provides function costs at the instruction level in ATLAS
• Collaboration with Google
• Intel VTune also proving useful

No shortage of tools available and hotspots are easily locatable – but difficult to make real impact on performance without considerable expertise and co-ordination

ATLAS Inner Detector optimisation using Callgrind (top) and Gooda (right)

Many Core Technologies

• HEP has a reasonable grasp of event-level parallelism using multiple cores, even if the use of the CPU itself is sub-optimal
• It is unrealistic to optimise an entire code base of millions of lines down to the instruction level
• However, significant performance gains are available by parallelising a few select algorithms
• The adaptation of key routines for offloading to co-processors offers huge speed-up potential
• This is an active area of study in HEP
• Note that many-core does not mean multi-core: these are specialist devices

Architecture diagram of Fermi GPU

Intel MIC: Xeon Phi

• Optimised for highly parallel workloads
• Intel's first-generation Many Integrated Core device is available in three product lines
• PCI Express form factor
• Runs its own micro-OS
• Up to 61 cores (1.24 GHz) and 244 hardware threads
• Selling point is the low barrier of entry for Xeon-based code: no rewriting of source code
• Customer experience: optimise on the Xeon first before expecting significant speed-up on the Phi

Andrea Dotti

MIC: Knights Landing

• Next-generation MIC expected next year: Knights Landing
• The architecture will be significantly different
• 14 nm chip architecture
• Up to 72 cores
• AVX-512 support
• ~3 TFlops performance
• High-bandwidth memory
• Experience in HEP: use the Xeon Phi as an exploration device in preparation for Knights Landing

A. Nowak – Intel Knights Landing – what’s old, what’s new?

GPUs

• GPUs offer huge potential to accelerate algorithms in HEP
• O(TFlops) available in each device
• More powerful GPUs (with better performance per watt) are in the pipeline
• Algorithm speed-ups of 100-1,000x reported in a number of verticals (oil and gas, finance, CFD)

Nvidia Kepler GPU

So why the delay in widespread deployment in HEP?
• Not a generic device: thousands of SIMD cores
• Sections of code have to be rewritten specifically for GPU execution
• A number of coding options: CUDA (platform specific), OpenCL (platform agnostic) and OpenMP/OpenACC

The optimisation of the data handling is the real issue, not the re-coding of the software (the minimal CUDA sketch below makes the transfer costs explicit)
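A minimal, generic CUDA sketch of the offload pattern (not any experiment's code; the problem size and kernel are invented for illustration), with the host-device transfers that often dominate for small kernels marked in the comments:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Element-wise multiply on the device: one thread per element.
    __global__ void vmul(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] * b[i];
    }

    int main() {
        const int n = 1 << 20;                        // illustrative size
        const size_t bytes = n * sizeof(float);
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

        // Host-to-device copies: for a kernel this light the transfers,
        // not the arithmetic, tend to dominate the wall-clock time.
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        const int threads = 256;
        vmul<<<(n + threads - 1) / threads, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // device-to-host copy
        std::printf("c[0] = %f\n", hc[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

For a kernel this small the two cudaMemcpy calls typically cost more than the computation itself, which is why batching the work and improving data movement (e.g. GPUDirect RDMA, mentioned below) matter more than the kernel code.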

Nvidia Device Roadmap

Images courtesy of Nvidia (December 2014)

• The leading manufacturer is Nvidia, but AMD/ATI also produce O(TFlop) devices
• K40/K80 are the most recent Nvidia enterprise models
• Low-power (Tegra K1) and embedded devices (Jetson TK1) are also available
• Large-scale deployment of GPUs at HPC facilities (Titan and CORAL)
• Recent developments in inter-GPU communication and faster access to GDDR5 memory (NVLink, stacked memory, unified memory, GPUDirect RDMA)
• C++11 support, Thrust STL and CUDA building blocks (CUB) in the latest CUDA version

ATLAS Trigger

• GPU development activity in HEP is most prominent in online software • Fast track reconstruction in the ATLAS trigger is computationally expensive (50-70% of processing time) • Tracking routines are a natural fit for parallelisation

• Bytestream decoding and clustering show a 26x speed-up
• Track formation and clone removal show a 12x speed-up (both using a C2050 GPU)

Bytestream decoding results

Track formation and clone removal results

Other GPU Tracking Studies

Kalman Filter
• GPU Kalman Filter code is less dependent on track multiplicity
• Processing time improves on newer-generation devices

Track seeding – J. Mattmann, C. Schmitt

Z-Finder
• The ATLAS HLT track-finding algorithm was an early GPU pathfinder
• 35x speed-up observed

Z-Finder algorithm

Other GPU Tracking in HEP

ALICE
• Not just speculative feasibility studies in HEP
• ALICE has had a GPU tracker in production since 2010
• 64 GPU compute nodes (Nvidia Fermi)
• Ran in 2012 without incident
• The upgraded HLT farm uses 180 more recent AMD S9000 GPUs
• Three-fold performance of the GPU tracker compared to all the CPUs in a node

• ATLAS - muon reconstruction trigger algorithms (GAP-RT)
• NA62 - GPU in L0 RICH and low-level triggers
• CBM - Kalman Filter track fitting for first-level event selection
• PANDA at FAIR - GPUs for triplet finding

See the Pisa GPU workshop for more examples

Offline GPU examples

• Not as well covered as online computing but plenty of examples studied

• GooFit - maximum-likelihood fitting speed-up, adapting RooFit for GPUs

• Vegas - Monte Carlo integration algorithm

• TMVA-ANN - parallel neural network processing for data analysis

GooFit Architecture

TMVA neural network with 35 input parameters: GPU processing time is independent of network complexity

GPUs in Software Frameworks

• Framework integration of GPUs in ATLAS software through a client-server architecture
• Accelerator agnostic: in principle any co-processor can be used
• Software sprint conducted last month, aiming for a full demonstrator later this year

• LHCb is also developing a Gaudi-like tool to offload algorithms

APE Architecture

Outlook

• The “natural” gains from processor clock speed for single threaded event level parallelism are no longer available

• Multi-process implementations of software frameworks (AthenaMP) allow for better utilisation of multi-core servers

• Significant gains can be made in HEP code by using the full capability of the CPU (multithreading, vectorisation, ILP)

• Co-processors have been demonstrated to deliver real benefits to tracking algorithms

• The focus of GPU development is currently on online software; there is no use case to deploy GPU devices at scale at WLCG computing centres
• HPC facilities could be used for this purpose

• Disruptive technologies have not been considered!