HEP Computing Trends

Andrew Washbrook University of Edinburgh ATLAS Software & Computing Tutorials 19th January 2015 UTFSM, Chile

Introduction

• I will cover future computing trends for High Energy Physics with a leaning towards the ATLAS experiment

• Some examples of non-LHC experiments where appropriate

• This is a broad subject area (distributed computing, storage, I/O) so here I will focus on the readiness of HEP experiments to changing trends in computing architectures

• Also some shameless promotion of work I have been involved in..

Many thanks to all the people providing me with material for this talk!

LHC Context

Run 2
• Increase in centre-of-mass energy to 13 TeV
• Increase in pile-up from ~20 to ~50
• Increase in trigger rate up to 1 kHz
• More computing resources required to reconstruct events

RAW to ESD reconstruction time

High Luminosity LHC
• HL-LHC starts after LS3 (~2022)
• Aim to provide ~300 fb⁻¹ per year
• Pile-up of 150 expected
• 200 PB/year of extra storage

HL-LHC timeline

CPU Evolution

• Process feature sizes keep shrinking, with research down to 5 nm depending on lithography and materials
• Clock speed improvement has slowed
• More cores per socket
• A typical server at a Grid computing centre has at least 16 cores, often more
• Extrapolation from 2013 predicts ~25% server performance improvement per year (a rough compounding estimate follows below)
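As a rough compounding check on that extrapolation (an illustration, not a figure from the talk): 25% per year gives 1.25^3 ≈ 1.95, so per-server throughput roughly doubles every three years if the trend holds.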

Processor scaling trends

[Figure: relative scaling of transistor count, clock speed, power, performance and performance per watt versus year, 1980 to 2010]

Low Power Processors

• Power efficiency is becoming increasingly important in data centres
• Cost-effective high-throughput computing
• ARM processors are now 64-bit, with much higher performance per watt than Xeon
• Intel has the low-power Atom processor line

AMD 64-bit board

Intel ATOM roadmap

Geant 4 simulation studies on ARM – Andrea Dotti

CPU Parallelism

• High Energy Physics does embarrassingly parallel very well: events can be processed independently of each other
• However, there are other levels of parallelism in the CPU that are not being used
• Hyperthreading is enabled at some Grid sites

Images from A. Nowak – The evolving marriage of hardware and software

CPU Utilisation

Emiliano Politano – Intel HPC Portfolio (September 2014)

• By sticking with scalar single-threaded code, HEP loses more performance capability with each new generation of CPU
• Performance loss at each level of parallelism is multiplicative (see the illustrative estimate below)
• Under-utilisation can be avoided by running one application instance per core
• If the application footprint is high this leads to memory pressure
• The memory-per-core ratio in data centres is flat (or falling)

A. Nowak – An update on off the shelf computing
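As a back-of-the-envelope illustration of the multiplicative loss (numbers invented for illustration): a scalar, single-threaded job on a 16-core server with 8-wide vector units uses at best

    1/16 (cores) x 1/8 (vector lanes) = 1/128

i.e. under 1% of the server's peak floating-point capability, before instruction-level effects are even counted.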

AthenaMP

• ATLAS uses AthenaMP to reduce memory utilisation on a server running multiple Athena processes
• Forking after initialisation allows memory sharing between processes (a minimal sketch of this copy-on-write pattern is shown below)
• Note that this is still event-level parallelism
• For AthenaMP details see the talk by Atilla this morning
• Some effort is required to run this efficiently on distributed computing resources

Rocco Mandrysch
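To make the fork-and-share idea concrete, here is a minimal, self-contained C++ sketch of the pattern (not AthenaMP itself; the worker count, event ranges and "conditions" payload are invented for illustration). The parent pays the initialisation cost once and the forked workers share those pages copy-on-write:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Expensive one-off initialisation (geometry, conditions, ...)
        // done once in the parent; the pages stay shared copy-on-write
        // with the forked workers as long as they are not modified.
        std::vector<double> sharedConditions(50000000, 1.0);

        const int nWorkers = 4;              // illustrative worker count
        const int eventsPerWorker = 250;     // illustrative event range

        for (int w = 0; w < nWorkers; ++w) {
            pid_t pid = fork();
            if (pid == 0) {                  // child: process its own events
                for (int evt = w * eventsPerWorker;
                     evt < (w + 1) * eventsPerWorker; ++evt) {
                    // read-only access keeps the conditions pages shared
                    volatile double x = sharedConditions[evt % sharedConditions.size()];
                    (void)x;
                }
                std::printf("worker %d done\n", w);
                _exit(0);
            }
        }
        while (wait(nullptr) > 0) {}         // parent waits for all workers
        return 0;
    }

Because the workers only read the shared data, the operating system never duplicates those pages, which is the memory saving AthenaMP exploits while keeping event-level parallelism.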

AthenaMP model

Multithreading

• The large majority of HEP code is still single-threaded
• Sub-event and in-algorithm parallelisation methods are being pursued using Intel Threading Building Blocks (a minimal example follows below)
• Other multithreading options: MPI, Cilk, OpenMP and many more
• ATLAS is attempting to introduce framework parallelisation through the development of Gaudi Hive
• CMS has an equivalent approach, processing events along parallel streams
• See Atilla’s talk this morning for more details
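As a flavour of the task-level parallelism that Intel Threading Building Blocks provides (a generic sketch, not Gaudi Hive or any experiment code; the toy "calibration" step and event count are invented):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>
    #include <vector>
    #include <cmath>

    int main() {
        const std::size_t nEvents = 10000;        // illustrative workload
        std::vector<double> energy(nEvents, 50.0);

        // Process events concurrently: TBB splits the range into tasks
        // and schedules them across the available hardware threads.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, nEvents),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) {
                    // toy per-event "calibration" standing in for real work
                    energy[i] = std::sqrt(energy[i]) * 1.02;
                }
            });
        return 0;
    }

tbb::parallel_for splits the event range into tasks and schedules them over the available hardware threads, the same task scheduler that Gaudi Hive builds on.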

Gaudi Hive parallelism model

Gaudi Hive whiteboard

CMS multithreading model

Vectorisation

• Vector widths are getting larger with each new generation of CPU
• SIMD operations (Single Instruction Multiple Data) give access to the full vector width
• Accessible through auto-vectorisation capabilities in compilers and through explicit SIMD instructions in the code

[Diagram: AVX 256-bit registers computing c[] = a[] x b[], with a[0..3], b[0..3] and c[0..3] packed into 256-bit registers]

HEP code is not easily vectorisable: this level of parallelism has not been fully exploited during LS1 (a short AVX sketch is shown below)

Emiliano Politano – Intel HPC Portfolio (September 2014)
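To show what the c[] = a[] x b[] operation from the diagram above looks like with explicit 256-bit AVX intrinsics (a generic sketch, not taken from any experiment's code; the alignment and size assumptions are noted in the comments):

    #include <immintrin.h>
    #include <cstddef>

    // Multiply two float arrays element-wise using 256-bit AVX registers.
    // Assumes n is a multiple of 8 and the pointers are 32-byte aligned.
    void vmul_avx(const float* a, const float* b, float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_load_ps(a + i);               // load 8 floats from a[]
            __m256 vb = _mm256_load_ps(b + i);               // load 8 floats from b[]
            _mm256_store_ps(c + i, _mm256_mul_ps(va, vb));   // c[] = a[] * b[]
        }
    }

    // The equivalent scalar loop is often auto-vectorised by the compiler
    // (e.g. at -O2/-O3 with AVX enabled) when the data layout and loop
    // structure let it prove the transformation is safe:
    void vmul_scalar(const float* a, const float* b, float* c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i];
    }

In practice much HEP code does not take the auto-vectorised route because of branchy logic, virtual calls and array-of-structures data layouts, which is why the vector units often sit idle.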

Speedup of MLfit on different Intel platforms – CERN Openlab Platform Competence Center report

Optimisation and Performance Measurement

• Instruction-level efficiency explored by low-level profiling and sampling tools
• Very useful when evaluating timing performance improvements at the sub-ms level in online software
• Valgrind suite popular in HEP (e.g. cachegrind, callgrind)
• Gooda (Linux perf data visualiser and analyser) provides function costs at the instruction level in ATLAS
• Collaboration with Google
• Intel VTune also proving useful

No shortage of tools available and hotspots are easily locatable – but difficult to make real impact on performance without considerable expertise and co-ordination

ATLAS Inner Detector optimisation using Callgrind (top) and Gooda (right)

Many Core Technologies

• HEP has a reasonable grasp of event-level parallelism using multiple cores, even if the use of the CPU itself is sub-optimal
• It is unrealistic to optimise an entire code base of millions of lines down to the instruction level
• However, significant performance gains are available by parallelising a few select algorithms
• The adaptation of key routines for offloading to co-processors offers huge speed-up potential
• This is an active area of study in HEP
• Note that many-core does not mean multi-core: these are specialist devices

Architecture diagram of Fermi GPU

Intel MIC: Xeon Phi

• Optimised for highly parallel workloads
• Intel's first-generation Many Integrated Core device is available in three product lines
• PCI Express form factor
• Runs its own micro-OS
• Up to 61 cores (1.24 GHz) and 244 hardware threads
• Selling point is the low barrier of entry for Xeon-based code: no rewriting of source code
• Customer experience: optimise on the Xeon first before expecting significant speed-up on the Phi

Andrea Dotti

MIC: Knights Landing

• Next-generation MIC expected next year: Knights Landing
• The architecture will be significantly different
• 14 nm chip architecture
• Up to 72 cores
• AVX-512 support
• ~3 TFlops performance
• High-bandwidth memory
• Experience in HEP: use the Xeon Phi as an exploration device in preparation for Knights Landing

A. Nowak – Intel Knights Landing – what’s old, what’s new?

GPUs

• GPUs offer huge potential to accelerate algorithms in HEP
• O(TFlops) available in each device
• More powerful GPUs (with better performance per watt) are in the pipeline
• Algorithm speed-ups of 100-1,000x reported in a number of verticals (oil and gas, finance, CFD)

Nvidia Kepler GPU

So why the delay in widespread deployment in HEP?
• Not a generic device: thousands of SIMD cores
• Sections of code have to be rewritten specifically for GPU execution
• A number of coding options: CUDA (platform specific), OpenCL (platform agnostic) and OpenMP/OpenACC

The optimisation of the data handling is the real issue, not the re-coding of the software (the minimal CUDA sketch below makes the transfer costs explicit)
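A minimal, generic CUDA sketch of the offload pattern (not any experiment's code; the problem size and kernel are invented for illustration), with the host-device transfers that often dominate for small kernels marked in the comments:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Element-wise multiply on the device: one thread per element.
    __global__ void vmul(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] * b[i];
    }

    int main() {
        const int n = 1 << 20;                        // illustrative size
        const size_t bytes = n * sizeof(float);
        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

        // Host-to-device copies: for a kernel this light the transfers,
        // not the arithmetic, tend to dominate the wall-clock time.
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        const int threads = 256;
        vmul<<<(n + threads - 1) / threads, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // device-to-host copy
        std::printf("c[0] = %f\n", hc[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }

For a kernel this small the two cudaMemcpy calls typically cost more than the computation itself, which is why batching the work and improving data movement (e.g. GPUDirect RDMA, mentioned below) matter more than the kernel code.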

Nvidia Device Roadmap

Images courtesy of Nvidia (December 2014)

• The leading manufacturer is Nvidia, but AMD/ATI also produce O(TFlop) devices
• K40/K80 are the most recent Nvidia enterprise models
• Low-power (Tegra K1) and embedded devices (Jetson TK1) are also available
• Large-scale deployment of GPUs at HPC facilities (Titan and CORAL)
• Recent developments in inter-GPU communication and faster access to GDDR5 memory (NVLink, stacked memory, unified memory, GPUDirect RDMA)
• C++11 support, Thrust STL and CUDA building blocks (CUB) in the latest CUDA version

ATLAS Trigger

• GPU development activity in HEP is most prominent in online software • Fast track reconstruction in the ATLAS trigger is computationally expensive (50-70% of processing time) • Tracking routines are a natural fit for parallelisation

• Bytestream decoding and clustering show a 26x speed-up
• Track formation and clone removal show a 12x speed-up (both using a C2050 GPU)

Bytestream decoding results

Track formation and clone removal results

Other GPU Tracking Studies

Kalman Filter
• GPU Kalman Filter code is less dependent on track multiplicity
• Processing time improves on newer-generation devices

Track seeding – J. Mattmann, C. Schmitt

Z-Finder
• The ATLAS HLT track-finding algorithm was an early GPU pathfinder
• 35x speed-up observed

Z-Finder algorithm

Other GPU Tracking in HEP

ALICE
• Not just speculative feasibility studies in HEP
• ALICE has had a GPU tracker in production since 2010
• 64 GPU compute nodes (Nvidia Fermi)
• Ran in 2012 without incident
• The upgraded HLT farm uses 180 more recent AMD S9000 GPUs
• Three-fold performance of the GPU tracker compared to all the CPUs in a node

• ATLAS - muon reconstruction trigger algorithms (GAP-RT)
• NA62 - GPU in L0 RICH and low-level triggers
• CBM - Kalman Filter track fitting for first-level event selection
• PANDA at FAIR - GPUs for triplet finding

See the Pisa GPU workshop for more examples

Offline GPU examples

• Not as well covered as online computing but plenty of examples studied

• GooFit - maximum-likelihood fitting speed-up, adapting RooFit for GPUs

• Vegas - Monte Carlo integration algorithm

• TMVA-ANN - parallel neural network processing for data analysis

GooFit Architecture

TMVA neural network with 35 input parameters: GPU processing time is independent of network complexity

GPUs in Software Frameworks

• Framework integration of GPUs in ATLAS software through a client-server architecture
• Accelerator agnostic: in principle any co-processor can be used
• Software sprint conducted last month, aiming for a full demonstrator later this year

• LHCb is also developing a Gaudi-like tool to offload algorithms

APE Architecture

Outlook

• The “natural” gains from processor clock speed for single threaded event level parallelism are no longer available

• Multi-process implementations of software frameworks (AthenaMP) allow for better utilisation of multi-core servers

• Significant gains can be made in HEP code by using the full capability of the CPU (multithreading, vectorisation, ILP)

• Co-processors have been demonstrated to deliver real benefits to tracking algorithms

• The focus of GPU development is currently on online software; there is no use case to deploy GPU devices at scale at WLCG computing centres
• HPC facilities could be used for this purpose

• Disruptive technologies have not been considered!