High-Performance Computing at CSCS: Co-Design and New Separations of Concerns for HPC

Thomas C. Schulthess

Computer performance and application performance increase ~10^3 every decade, while power budgets grow from ~100 kilowatts to ~5 megawatts, with 20-30 MW projected for ~1 exaflop/s.

§ 1988: Cray YMP, 8 processors, 1 gigaflop/s – first sustained GFlop/s (Gordon Bell Prize 1988)
§ 1998: Cray T3E, 1'500 processors, 1.02 teraflop/s – first sustained TFlop/s (Gordon Bell Prize 1998)
§ 2008: Cray XT5, 150'000 processors, 1.35 petaflop/s – first sustained PFlop/s (Gordon Bell Prize 2008); this system was built with commodity processors
§ 2018: 100 million or billion processing cores (!) – another 1,000x increase in sustained performance

HOW WELL CAN APPLICATIONS DEAL WITH CONCURRENCY?

HOW EFFICIENT ARE THEY?

Applications running at scale @ ORNL (Spring 2011)

Domain area  | Code name  | Institution     | # of cores | Performance | Notes
Materials    | DCA++      | ORNL            | 213,120    | 1.9 PF      | 2008 Gordon Bell Prize Winner
Materials    | WL-LSMS    | ORNL/ETH        | 223,232    | 1.8 PF      | 2009 Gordon Bell Prize Winner
Chemistry    | NWChem     | PNNL/ORNL       | 224,196    | 1.4 PF      | 2008 Gordon Bell Prize Finalist
Materials    | DRC        | ETH/UTK         | 186,624    | 1.3 PF      | 2010 Gordon Bell Prize Hon. Mention
Nanoscience  | OMEN       | Duke            | 222,720    | > 1 PF      | 2010 Gordon Bell Prize Finalist
Biomedical   | MoBo       | GaTech          | 196,608    | 780 TF      | 2010 Gordon Bell Prize Winner
Chemistry    | MADNESS    | UT/ORNL         | 140,000    | 550 TF      |
Materials    | LS3DF      | LBL             | 147,456    | 442 TF      | 2008 Gordon Bell Prize Winner
Seismology   | SPECFEM3D  | USA (multiple)  | 149,784    | 165 TF      | 2008 Gordon Bell Prize Finalist
Combustion   | S3D        | SNL             | 147,456    | 83 TF       |
Weather      | WRF        | USA (multiple)  | 150,000    | 50 TF       |

Hirsch-Fye quantum Monte Carlo with delayed updates (or "Ed" updates), Ed D'Azevedo, ORNL

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_0) + [a_0|a_1|\dots|a_k] \times [b_0|b_1|\dots|b_k]^t$

Complexity for k updates remains $O(k N_t^2)$, but we can replace k rank-1 updates with one matrix-matrix multiply plus some additional bookkeeping.

[Plot: time to solution (sec) vs. delay k, mixed precision vs. double precision]

G. Alvarez, M. S. Summers, D. E. Maxwell, M. Eisenbach, J. S. Meredith, J. M. Larkin, J. Levesque, T. A. Maier, P. R. C. Kent, E. F. D'Azevedo, T. C. Schulthess, "New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08), Article 61, 2008.

Arithmetic intensity of a computation

$\alpha = \dfrac{\text{floating point operations}}{\text{data transferred}}$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t \qquad \alpha \approx O(1)$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_0) + [a_0|a_1|\dots|a_k] \times [b_0|b_1|\dots|b_k]^t \qquad \alpha \approx O(k)$
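In code, the delayed-update idea amounts to buffering the rank-1 vectors and applying them in one rank-k update. Below is a minimal sketch of that pattern only: plain loops stand in for the BLAS/cuBLAS calls a production code would use, the bookkeeping that keeps the buffered updates consistent with the Monte Carlo proposal step is omitted, and all names are illustrative rather than taken from DCA++.

#include <vector>
#include <cstddef>

// Accumulate k rank-1 corrections to G and apply them as one rank-k update:
// G_k = G_0 + A * B^t, where the columns of A and B are the vectors a_j and b_j.
// The flop count stays O(k*N^2), but the flops now live in a GEMM-like kernel
// (arithmetic intensity ~O(k)) instead of k BLAS-2 updates (~O(1)).
struct DelayedUpdate {
    std::size_t N;
    std::vector<double> A;  // N x k_max, column-major
    std::vector<double> B;  // N x k_max, column-major
    std::size_t k = 0;

    DelayedUpdate(std::size_t n, std::size_t k_max)
        : N(n), A(n * k_max), B(n * k_max) {}

    // Record one rank-1 update instead of applying it to G right away.
    void push(const std::vector<double>& a, const std::vector<double>& b) {
        for (std::size_t i = 0; i < N; ++i) {
            A[k * N + i] = a[i];
            B[k * N + i] = b[i];
        }
        ++k;
    }

    // Apply all accumulated updates at once: G += A * B^t (G is N x N, column-major).
    // In practice this triple loop would be a single dgemm/cublasDgemm call.
    void flush(std::vector<double>& G) {
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t p = 0; p < k; ++p)
                for (std::size_t i = 0; i < N; ++i)
                    G[j * N + i] += A[p * N + i] * B[p * N + j];
        k = 0;
    }
};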

Algorithmic motifs and their arithmetic intensity (number of operations per word of memory transferred), ordered from O(1) through O(log N) to O(N):

§ Application motifs: finite difference / stencil in S3D and WRF (& COSMO), rank-1 update in HF-QMC, QMR in WL-LSMS, sparse linear algebra (all near O(1)); rank-N update in DCA++ and Linpack (Top500) toward O(N)
§ Generic kernels: matrix-vector and vector-vector operations (BLAS 1 & 2) at O(1); fast Fourier transforms (FFTW & SPIRAL) at O(log N); dense matrix-matrix operations (BLAS 3) at O(N)
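As a rough check on where the two ends of this scale come from, standard operation and traffic counts (added here for illustration, not numbers from the slide) give:

% Dense matrix-matrix multiply (BLAS 3): C = A*B with N x N matrices
% flops ~ 2N^3, data ~ 3N^2 words  =>  alpha grows with N
\alpha_{\mathrm{GEMM}} \approx \frac{2N^3}{3N^2} = \mathcal{O}(N)
% Low-order stencil sweep: a fixed handful of flops and of words per grid point
% (the exact constant depends on caching)  =>  alpha stays bounded
\alpha_{\mathrm{stencil}} \approx \frac{\mathcal{O}(1)\ \text{flops/point}}{\mathcal{O}(1)\ \text{words/point}} = \mathcal{O}(1)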

Supercomputers are designed for certain algorithmic motifs – which ones?

Petaflop/s = 10^15 64-bit floating point operations per second. Which takes more energy?

A 64-bit floating-point fused multiply-add, or moving three 64-bit operands 20 mm across the die?

Example fused multiply-add: 934,569.299814557 × 52.827419489135904 + 4.20349729193958 = 49,370,888.64646892 (intermediate product 49,370,884.442971624253823), versus moving the three operands 20 mm across the die.

Moving the operands takes over 3x the energy of the arithmetic, and loading the data from off-chip takes more than 10x more still (source: Steve Scott, Cray Inc.). Moving data is expensive; exploiting data locality is critical to energy efficiency. If we care about energy consumption, we have to worry about these and other physical considerations of the computation. But where is the separation of concerns?

COSMO in production for Swiss weather prediction

§ ECMWF: 2x per day, 16 km lateral grid, 91 layers
§ COSMO-7: 3x per day, 72h forecast, 6.6 km lateral grid, 60 layers
§ COSMO-2: 8x per day, 24h forecast, 2.2 km lateral grid, 60 layers

§ Some of the products generated from these simulations
  § Daily weather forecast
  § Forecasting for air traffic control (Skyguide)
  § Safety management in the event of nuclear incidents

Insight into the model/methods/algorithms used in COSMO

§ PDE on a structured grid (variables: velocity, temperature, pressure, humidity, etc.)
§ Explicit solve horizontally (I, J) using finite difference stencils
§ Implicit solve in the vertical direction (K) with a tri-diagonal solve in every column (applying the Thomas algorithm in parallel; can be expressed as a stencil)

Due to the implicit solve in the vertical we can work with longer time steps: the ~2 km horizontal grid spacing, not the ~60 m vertical spacing, determines the time step.

[Figure: grid column with ~2 km horizontal and ~60 m vertical spacing; tri-diagonal solves run along K in each (I, J) column]
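For reference, the per-column solve behind this motif is the Thomas algorithm. The sketch below is a generic textbook version for a single column (the array names and the diagonal-dominance assumption are mine, not the COSMO code); in the dynamical core it is applied independently to every (I, J) column, which is what keeps the motif parallel over the horizontal plane.

#include <vector>

// Thomas algorithm: solve a_k x_{k-1} + b_k x_k + c_k x_{k+1} = d_k for one
// vertical column of K levels. No pivoting; assumes the system is diagonally
// dominant, as the implicit vertical operators in NWP models typically are.
void thomas_solve(const std::vector<double>& a,   // sub-diagonal,   a[0] unused
                  const std::vector<double>& b,   // diagonal
                  const std::vector<double>& c,   // super-diagonal, c[K-1] unused
                  std::vector<double>& d)         // right-hand side in, solution out
{
    const std::size_t K = b.size();
    std::vector<double> cp(K);

    // Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0];
    d[0]  = d[0] / b[0];
    for (std::size_t k = 1; k < K; ++k) {
        const double m = 1.0 / (b[k] - a[k] * cp[k - 1]);
        cp[k] = c[k] * m;
        d[k]  = (d[k] - a[k] * d[k - 1]) * m;
    }

    // Backward substitution (loop-carried dependency in K: inherently sequential per column).
    for (std::size_t k = K - 1; k-- > 0; )
        d[k] -= cp[k] * d[k + 1];
}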

Hence the algorithmic motifs are:

§ Tri-diagonal solve
  § vertical K-direction
  § with loop-carried dependencies in K
§ Finite difference stencil computations
  § focus on horizontal IJ-plane access
  § no loop-carried dependencies
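A plain-loop sketch of the stencil motif (field layout and loop nest chosen for illustration, not taken from the COSMO Fortran): because each output point reads only the unchanged input field, there are no loop-carried dependencies and the horizontal loops parallelize directly, here with OpenMP.

#include <vector>
#include <cstddef>

// Simple 3D field with one fixed storage order (I fastest); the stencil library
// discussed later abstracts exactly this choice away.
struct Field {
    std::size_t ni, nj, nk;
    std::vector<double> data;
    Field(std::size_t I, std::size_t J, std::size_t K)
        : ni(I), nj(J), nk(K), data(I * J * K, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + ni * (j + nj * k)];
    }
};

// 2nd-order horizontal Laplacian: independent per point, OpenMP-parallel over J.
void horizontal_laplacian(Field& lap, Field& in) {
    #pragma omp parallel for
    for (long j = 1; j < (long)in.nj - 1; ++j)
        for (long k = 0; k < (long)in.nk; ++k)
            for (long i = 1; i < (long)in.ni - 1; ++i)
                lap(i, j, k) = -4.0 * in(i, j, k)
                             + in(i + 1, j, k) + in(i - 1, j, k)
                             + in(i, j + 1, k) + in(i, j - 1, k);
}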

Performance profile of the (original) COSMO-CCLM

Runtime profile based on the 2 km production model of MeteoSwiss.

[Chart: % of code lines (F90) vs. % of runtime per component]

Analyzing two examples/motifs – how are they different?

Physics: 3 memory accesses, 136 FLOPs → compute bound

Dynamics: 3 memory accesses, 5 FLOPs → memory bound
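In arithmetic-intensity terms (assuming 8-byte operands; this conversion is an added illustration, not from the slide):

\alpha_{\mathrm{physics}} \approx \frac{136\ \text{flops}}{3 \times 8\ \text{bytes}} \approx 5.7\ \text{flops/byte}
\qquad
\alpha_{\mathrm{dynamics}} \approx \frac{5\ \text{flops}}{3 \times 8\ \text{bytes}} \approx 0.2\ \text{flops/byte}

At roughly 0.2 flops/byte the dynamics kernel sits far below the machine balance of that generation's CPUs and GPUs (of order a few flops per byte), so its speed is set by memory bandwidth; the physics kernel, at several flops per byte, can keep the floating-point units busy.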

Running the simple examples on the Cray XK6

Compute bound (physics) problem:
Machine  | Interlagos | Fermi (M2090) | GPU + transfer
Time     | 1.31 s     | 0.17 s        | 1.9 s
Speedup  | 1.0 (REF)  | 7.6           | 0.7

Memory bound (dynamics) problem:
Machine  | Interlagos | Fermi (M2090) | GPU + transfer
Time     | 0.16 s     | 0.038 s       | 1.7 s
Speedup  | 1.0 (REF)  | 4.2           | 0.1

The simple lesson: leave data on the GPU!
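In CUDA runtime terms, "leave data on the GPU" looks like the sketch below (the field size, kernel body, and output interval are illustrative; the real code keeps its fields inside the stencil library's storage abstraction): allocate and upload once, run many steps back to back on the device, and copy back only when output is actually needed.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dynamics_step(double* field, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) field[idx] += 0.1 * field[idx];   // stand-in for a real stencil update
}

int main() {
    const int n = 1 << 24;                 // illustrative field size
    const int steps = 1000, output_every = 100;

    double* h_field = new double[n]();
    double* d_field = nullptr;
    cudaMalloc(&d_field, n * sizeof(double));

    // One upload at the start ...
    cudaMemcpy(d_field, h_field, n * sizeof(double), cudaMemcpyHostToDevice);

    for (int t = 1; t <= steps; ++t) {
        dynamics_step<<<(n + 255) / 256, 256>>>(d_field, n);
        // ... and a download only when output is needed, not once per kernel
        // launch (per-step transfers are what kill the "GPU + transfer" column).
        if (t % output_every == 0)
            cudaMemcpy(h_field, d_field, n * sizeof(double), cudaMemcpyDeviceToHost);
    }
    cudaDeviceSynchronize();
    std::printf("field[0] = %f\n", h_field[0]);

    cudaFree(d_field);
    delete[] h_field;
    return 0;
}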

Performance profile of the (original) COSMO-CCLM (revisited)

Runtime profile based on the 2 km production model of MeteoSwiss.

[Chart: % of code lines (F90) vs. % of runtime per component]

Annotated on the profile: original code (with OpenACC for GPU) vs. rewrite in C++ (with CUDA backend for GPU).

Stencil Library Ideas

§ Implement a stencil library using C++ and template meta-programming
  § 3D structured grid
  § Parallelization in the horizontal IJ-plane (sequential loop in K for tridiagonal solves)
  § Multi-node support using explicit halo exchange (Generic Communication Library; not covered by this presentation)
§ Abstract the hardware platform (CPU/GPU)
  § Adapt loop order and storage layout to the platform (a sketch of this follows the table below)
  § Leverage software caching
§ Hide complex and "ugly" optimizations
  § Blocking
§ Single source code compiles to multiple platforms
  § Currently, efficient back-ends are implemented for CPU and GPU

                                  | CPU    | GPU
Storage order (Fortran notation)  | KIJ    | IJK
Parallelization                   | OpenMP | CUDA
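A minimal sketch of how a compile-time layout policy can realize the table above without changing user code; the class and policy names are invented for illustration, and the actual stencil library is far more elaborate (blocking, software caching, GPU memory management).

#include <cstddef>
#include <vector>

// Compile-time storage-order policies (Fortran notation: the first index is stride-1).
struct LayoutKIJ {  // CPU back-end in the table above
    static std::size_t index(std::size_t i, std::size_t j, std::size_t k,
                             std::size_t ni, std::size_t nj, std::size_t nk) {
        return k + nk * (i + ni * j);
    }
};
struct LayoutIJK {  // GPU back-end in the table above
    static std::size_t index(std::size_t i, std::size_t j, std::size_t k,
                             std::size_t ni, std::size_t nj, std::size_t nk) {
        return i + ni * (j + nj * k);
    }
};

// A field whose memory layout is fixed at compile time by the Layout policy.
// User code writes f(i, j, k) and never sees the platform-specific ordering.
template <typename T, typename Layout>
class Field3D {
    std::size_t ni_, nj_, nk_;
    std::vector<T> data_;
public:
    Field3D(std::size_t ni, std::size_t nj, std::size_t nk)
        : ni_(ni), nj_(nj), nk_(nk), data_(ni * nj * nk) {}
    T& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[Layout::index(i, j, k, ni_, nj_, nk_)];
    }
};

// Selecting the back-end then amounts to a typedef, e.g.:
//   using CpuField = Field3D<double, LayoutKIJ>;
//   using GpuField = Field3D<double, LayoutIJK>;  // plus device allocation in the real library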

"Apples to Oranges" comparison of new vs. current COSMO

Speedup of the new code on a K20x GPU vs. the current code on AMD Interlagos (@ 2.1 GHz).

[Chart: speedup for Physics, Dynamics, and Dyn+Physics at domain sizes 64x128, 128x128, and 256x256, on a 0-7x scale; the new C++ code with CUDA (*) sits near the top of the range, the current F90 code with OpenACC around 2x]

(*) new C++ code with OpenMP is about 2x faster than current F90 code

Solving the Kohn-Sham equation is the bottleneck of most DFT-based materials science codes

Kohn-Sham equation:
$\left(-\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{LDA}}(\vec r)\right)\psi_i(\vec r) = \epsilon_i\,\psi_i(\vec r)$

Ansatz:
$\psi_i(\vec r) = \sum_\mu c_{i\mu}\,\phi_\mu(\vec r)$

Hermitian matrix:
$H_{\mu\nu} = \int \phi_\mu^*(\vec r)\left(-\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{LDA}}(\vec r)\right)\phi_\nu(\vec r)\,d\vec r$

The basis may not be orthogonal:
$S_{\mu\nu} = \int \phi_\mu^*(\vec r)\,\phi_\nu(\vec r)\,d\vec r$

Solve the generalized eigenvalue problem:
$(H - \epsilon_i S)\,c_i = 0$

Solving the Kohn-Sham equation is the bottleneck of most DFT-based materials science codes (continued)

Remarks:
§ Typical matrix size is up to 10'000
§ Only a few methods require high accuracy and matrix sizes up to 100'000
§ Some methods try to avoid eigensolvers (orthogonalization will be the important motif instead)
§ When lower accuracy is an option, order-N methods are used instead (this is covered by one of our HP2C projects, see CP2K on hp2c.ch)

Dominant motif: solve the generalized eigenvalue problem $(H - \epsilon_i S)\,c_i = 0$; we will need between 10% and 50% of the eigenvectors.

Solving the generalized eigenvalue problem

Standard one-stage solver for $A x = \lambda B x$:

§ xPOTRF: Cholesky factorization $B = L L^H$
§ xHEGST: transform to a standard problem $A' = L^{-1} A L^{-H}$, so that $A' y = \lambda y$
§ xHEEVx: solve $A' y = \lambda y$
  § xHETRD: tridiagonalization $T = Q^H A' Q$ (most time-consuming step, dominated by level-2 BLAS, memory bound)
  § xSTExx: solve $T y' = \lambda y'$
  § xUNMTR: back-transform the eigenvectors $y = Q y'$
§ xTRSM: back-substitute $x = L^{-H} y$
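For concreteness, a single-node sketch of this one-stage path using standard LAPACKE/CBLAS routines (assuming itype = 1, lower triangles, column-major storage, and all eigenvectors requested; the solvers discussed in this talk use ScaLAPACK, ELPA, or MAGMA rather than these serial calls, and the header names may differ by BLAS distribution):

#include <complex>
#include <lapacke.h>
#include <cblas.h>

// Solve A x = lambda B x for Hermitian A and Hermitian positive definite B.
// A and B are n x n, column-major; on return A holds the eigenvectors x and
// w holds the eigenvalues. std::complex<double> is layout-compatible with
// lapack_complex_double, hence the reinterpret_cast.
int generalized_eig_one_stage(int n, std::complex<double>* A,
                              std::complex<double>* B, double* w) {
    auto* a = reinterpret_cast<lapack_complex_double*>(A);
    auto* b = reinterpret_cast<lapack_complex_double*>(B);

    // xPOTRF: B = L L^H (B is overwritten by L).
    if (LAPACKE_zpotrf(LAPACK_COL_MAJOR, 'L', n, b, n) != 0) return -1;

    // xHEGST: A' = L^{-1} A L^{-H} (A is overwritten by A').
    if (LAPACKE_zhegst(LAPACK_COL_MAJOR, 1, 'L', n, a, n, b, n) != 0) return -2;

    // xHEEVD: A' y = lambda y (internally xHETRD + xSTEDC + xUNMTR);
    // A is overwritten by the eigenvectors y.
    if (LAPACKE_zheevd(LAPACK_COL_MAJOR, 'V', 'L', n, a, n, w) != 0) return -3;

    // xTRSM: x = L^{-H} y, i.e. solve L^H x = y for every eigenvector column.
    const std::complex<double> one(1.0, 0.0);
    cblas_ztrsm(CblasColMajor, CblasLeft, CblasLower, CblasConjTrans, CblasNonUnit,
                n, n, &one, B, n, A, n);
    return 0;
}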

Solving the generalized eigenvalue problem with a two-stage solver

$A x = \lambda B x$:

§ xPOTRF: $B = L L^H$
§ xHEGST: $A' = L^{-1} A L^{-H}$, giving the standard problem $A' y = \lambda y$
§ Reduction to banded form: $A'' = Q_1^H A' Q_1$ (most time-consuming step, but dominated by BLAS-3)
§ Tri-diagonalize: $T = Q_2^H A'' Q_2$
§ Solve $T y' = \lambda y'$
§ Two eigenvector back-transformations are needed (but easy to parallelize): $y'' = Q_2 y'$, then $y = Q_1 y''$
§ xTRSM: $x = L^{-H} y$

The two-stage solver maps best to a hybrid CPU-GPU architecture

Standard problem $A' y = \lambda y$:

§ Reduction to banded form $A'' = Q_1^H A' Q_1$: hybrid (CPU + GPU)
§ Tri-diagonalization $T = Q_2^H A'' Q_2$: GPU/CPU
§ Divide and conquer solve of $T y' = \lambda y'$: CPU
§ The two eigenvector back-transformations $y'' = Q_2 y'$ and $y = Q_1 y''$ (easy to parallelize): GPU

Comparison of general eigensolvers: multi-core vs. hybrid (double complex, matrix size 10'000)

§ Hybrid with MAGMA (*): 6 threads + GPU (1x Intel X5650, 1x Nvidia M2090)
§ Multi-core with MKL: 12 threads (2x Intel X5650)
§ Multi-core with ELPA: 12 processes (2x Intel X5650)

[Charts: time to solution for all eigenvectors and for 10% of the eigenvectors]

(*) A. Haidar, S. Tomov, J. Dongarra, R. Solcà, and T. C. Schulthess, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," to appear in Int. J. on High-Perf. Computing.

> Increase data locality with tridiagonalization in two stages
> True hybrid CPU+GPU implementation of the solver (see Azzam Haidar's talk)

Essential lessons learned

In order to improve/maximize data locality, work on algorithms and application software code will be essential.

§ DCA++ code/algorithm:
  § Changed the original Hirsch-Fye quantum Monte Carlo algorithm in order to replace the rank-1 Green's function update with a rank-k update
§ COSMO (weather/climate):
  § Improved CPU performance by a factor of ~2 by (1) recomputing parameters on the fly to reduce data movement and (2) a significant software rewrite with improved data locality (no free lunch)
  § The CUDA implementation of the DyCore is ~3x faster on GPU than OpenMP on CPU (reflecting the difference in memory bandwidth, no magic!)
§ Eigensolvers for ab initio materials science codes:
  § Replaced the single-stage tridiagonalization, which relies on matrix-vector (BLAS-2) operations, with a two-stage approach in which the computation is dominated by a reduction to banded form and BLAS-3
  § Implemented for hybrid CPU-GPU

A pragmatic/practical list of things we need (today, and certainly when we reach exascale)

§ Algorithm redesign and code refactoring – continue investing in this area!
  § Increasing data locality is a key factor
  § Scalability has to be designed into the application from the bottom up – a problem for legacy codes!
§ Programming model/environment for distributed memory & complex nodes
  § Massive multi-threading on a node
  § Heterogeneous memory and memory hierarchies
  § Hybrid: scalar and parallel/multi-threaded processing units
  § Has to be usable along with standard HPC languages (C, C++, Fortran)
  § Support for the creation of domain/data-structure-specific embedded languages
§ This must include practical tools/mechanisms, such as
  § Numerical libraries (as usual)
  § Data-structure-specific tools/templates/languages
  § ...

WHAT ABOUT THE SEPARATION OF CONCERNS?

The layers from model to machine:

§ Physical model (velocities, pressure, temperature, water, turbulence)
§ Mathematical description
§ Discretization / algorithm
§ Code / implementation:

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation
§ A given supercomputer

"Port" serial code to supercomputers: > vectorize > parallelize > petascaling > exascaling > ...

The same layers, annotated with who drives what:

§ Physical model (velocities, pressure, temperature, water, turbulence), mathematical description, discretization / algorithm – driven by domain science
§ Code / implementation – libraries, DSL / DSEL; needs to be properly embedded – but how?

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation, optimal algorithm, auto-tuning, architectural options / design – driven by vendors

COSMO current and new code could hint at a new separation of concerns

[Diagram: software stacks of the current and the new COSMO code, side by side. Both run from main through dynamics, physics, and boundary conditions down to MPI and the system; in the new code the dynamics sits on a stencil library with X86 and GPU back-ends and halo exchange via GCL. The layers map onto: prototyping code / interactive data analysis, application code, Domain Specific Libraries & Tools (DSL&T), basic libraries (incl. BLAS, LAPACK, FFT, ...), and the system.]

Some DSL&T could cut across multiple domains (e.g. grid/structured grid tools, or tools for other algorithmic motifs)
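To make the DSL&T idea concrete, here is a toy sketch of application code written against a stencil-library-style interface; Grid and apply_interior are invented for illustration and are not the API of the actual library, which in addition owns blocking, storage order, halo exchange, and the GPU back-end. The point is only the split of responsibilities: the library owns the traversal, the application supplies the per-point update.

#include <vector>
#include <cstddef>

// A toy "stencil library": the library owns the grid traversal (and, in a real
// implementation, the platform-specific loop order, blocking, and back-end);
// the application only supplies the per-point update as a callable.
struct Grid {
    std::size_t ni, nj, nk;
    std::vector<double> data;
    Grid(std::size_t I, std::size_t J, std::size_t K)
        : ni(I), nj(J), nk(K), data(I * J * K, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + ni * (j + nj * k)];
    }
};

template <typename Stencil>
void apply_interior(Grid& out, Grid& in, Stencil&& stencil) {
    for (std::size_t k = 0; k < in.nk; ++k)
        for (std::size_t j = 1; j + 1 < in.nj; ++j)
            for (std::size_t i = 1; i + 1 < in.ni; ++i)
                out(i, j, k) = stencil(in, i, j, k);
}

int main() {
    Grid in(64, 64, 60), lap(64, 64, 60);
    in(32, 32, 30) = 1.0;

    // Application code: only the stencil body, no loops, no layout, no platform details.
    apply_interior(lap, in, [](Grid& d, std::size_t i, std::size_t j, std::size_t k) {
        return -4.0 * d(i, j, k)
             + d(i + 1, j, k) + d(i - 1, j, k)
             + d(i, j + 1, k) + d(i, j - 1, k);
    });
    return 0;
}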

The same picture once more, now with the PASC co-design projects marked:

§ Physical model (velocities, pressure, temperature, water, turbulence), mathematical description, discretization / algorithm – driven by domain science
§ Code / implementation – libraries, DSL / DSEL; needs to be properly embedded

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation, optimal algorithm, auto-tuning, architectural options / design – driven by vendors
§ PASC co-design projects

Platform for Advanced Scientific Computing

§ High-risk & high-impact projects (www.hp2c.ch)
§ Application-driven co-design of a pre-exascale supercomputing ecosystem (proposal under consideration)

Timeline (2009-2017):
§ 2009: begin construction of new building
§ 2010: new building complete
§ Monte Rosa, Cray XT5, 14'762 cores; hex-core upgrade to 22'128 cores; upgrade to Cray XE6, 47'200 cores
§ Development & procurement of petaflop/s scale supercomputer(s)
§ Development & procurement of pre-exaflop/s scale supercomputers (toward 2017)
§ Platform for Advanced Scientific Computing spanning the later years

Acknowledgements

§ Team working on the COSMO refactoring
  § Oliver Fuhrer (MeteoSwiss)
  § Tobias Gysi and Daniel Müller (Supercomputing Systems)
  § Xavier Lapillonne, Carlos Osuna (C2SM@ETH)
  § Will Sawyer, Mauro Bianco, Ugo Varetto, Ben Cumming, TCS (CSCS)
  § Tim Schröder, Peter Messmer (NVIDIA)
  § Ulli Schättler, Michael Baldauf (DWD)
§ Team working on eigensolvers for hybrid architectures (MAGMA)
  § Stan Tomov, Azzam Haidar, Jack Dongarra (UTK), Raffaele Solcà, and TCS (ETH)
§ Discussions with and inputs for the materials science part
  § Volker Blum (FHI, Berlin), Paul Kent (ORNL), Anton Kozhevnikov (ETH)
  § Nicola Marzari (EPFL), Joost VandeVondele (ETH), Alessandro Curioni (IBM)
  § PRACE 2IP WP8 Materials team, in particular Marc Torrent (CEA), Fabio Affinito (CINECA), Georg Huhs (BSC), Xavier Gonze (U. of Louvain-la-Neuve)
§ Funding for much of the HPC-related work
  § Swiss University Conference through the HP2C Platform (www.hp2c.ch)
  § PRACE 2IP WP8 and all institutions that had to provide matching funds

THANK YOU!
