High-Performance Computing at CSCS: Co-Design and New Separations of Concerns for HPC

Thomas C. Schulthess

Computer performance and application performance increase ~10^3 every decade, while power budgets grow from ~100 kilowatts to ~5 megawatts, with 20-30 MW projected for ~1 exaflop/s.

§ 1988: Cray YMP, 8 processors, 1 gigaflop/s – first sustained GFlop/s (Gordon Bell Prize 1988)
§ 1998: Cray T3E, 1'500 processors, 1.02 teraflop/s – first sustained TFlop/s (Gordon Bell Prize 1998)
§ 2008: Cray XT5, 150'000 processors, 1.35 petaflop/s – first sustained PFlop/s (Gordon Bell Prize 2008); this system was built with commodity processors
§ 2018: 100 million or billion processing cores (!) – another 1,000x increase in sustained performance

HOW WELL CAN APPLICATIONS DEAL WITH CONCURRENCY?

HOW EFFICIENT ARE THEY?

Applications running at scale @ ORNL (Spring 2011)

Domain area  | Code name  | Institution     | # of cores | Performance | Notes
Materials    | DCA++      | ORNL            | 213,120    | 1.9 PF      | 2008 Gordon Bell Prize Winner
Materials    | WL-LSMS    | ORNL/ETH        | 223,232    | 1.8 PF      | 2009 Gordon Bell Prize Winner
Chemistry    | NWChem     | PNNL/ORNL       | 224,196    | 1.4 PF      | 2008 Gordon Bell Prize Finalist
Materials    | DRC        | ETH/UTK         | 186,624    | 1.3 PF      | 2010 Gordon Bell Prize Hon. Mention
Nanoscience  | OMEN       | Duke            | 222,720    | > 1 PF      | 2010 Gordon Bell Prize Finalist
Biomedical   | MoBo       | GaTech          | 196,608    | 780 TF      | 2010 Gordon Bell Prize Winner
Chemistry    | MADNESS    | UT/ORNL         | 140,000    | 550 TF      |
Materials    | LS3DF      | LBL             | 147,456    | 442 TF      | 2008 Gordon Bell Prize Winner
Seismology   | SPECFEM3D  | USA (multiple)  | 149,784    | 165 TF      | 2008 Gordon Bell Prize Finalist
Combustion   | S3D        | SNL             | 147,456    | 83 TF       |
Weather      | WRF        | USA (multiple)  | 150,000    | 50 TF       |

Hirsch-Fye quantum Monte Carlo with delayed updates (or "Ed" updates), Ed D'Azevedo, ORNL

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_0) + [a_0|a_1|\dots|a_k] \times [b_0|b_1|\dots|b_k]^t$

Complexity for k updates remains $O(k N_t^2)$, but we can replace k rank-1 updates with one matrix-matrix multiply plus some additional bookkeeping.

[Plot: time to solution (sec) vs. delay k, mixed precision vs. double precision]

G. Alvarez, M. S. Summers, D. E. Maxwell, M. Eisenbach, J. S. Meredith, J. M. Larkin, J. Levesque, T. A. Maier, P. R. C. Kent, E. F. D'Azevedo, T. C. Schulthess, "New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors," Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC'08), Article 61, 2008.

Arithmetic intensity of a computation

$\alpha = \dfrac{\text{floating point operations}}{\text{data transferred}}$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t \qquad \alpha \approx O(1)$

$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_0) + [a_0|a_1|\dots|a_k] \times [b_0|b_1|\dots|b_k]^t \qquad \alpha \approx O(k)$
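In code, the delayed-update idea amounts to buffering the rank-1 vectors and applying them in one rank-k update. Below is a minimal sketch of that pattern only: plain loops stand in for the BLAS/cuBLAS calls a production code would use, the bookkeeping that keeps the buffered updates consistent with the Monte Carlo proposal step is omitted, and all names are illustrative rather than taken from DCA++.

#include <vector>
#include <cstddef>

// Accumulate k rank-1 corrections to G and apply them as one rank-k update:
// G_k = G_0 + A * B^t, where the columns of A and B are the vectors a_j and b_j.
// The flop count stays O(k*N^2), but the flops now live in a GEMM-like kernel
// (arithmetic intensity ~O(k)) instead of k BLAS-2 updates (~O(1)).
struct DelayedUpdate {
    std::size_t N;
    std::vector<double> A;  // N x k_max, column-major
    std::vector<double> B;  // N x k_max, column-major
    std::size_t k = 0;

    DelayedUpdate(std::size_t n, std::size_t k_max)
        : N(n), A(n * k_max), B(n * k_max) {}

    // Record one rank-1 update instead of applying it to G right away.
    void push(const std::vector<double>& a, const std::vector<double>& b) {
        for (std::size_t i = 0; i < N; ++i) {
            A[k * N + i] = a[i];
            B[k * N + i] = b[i];
        }
        ++k;
    }

    // Apply all accumulated updates at once: G += A * B^t (G is N x N, column-major).
    // In practice this triple loop would be a single dgemm/cublasDgemm call.
    void flush(std::vector<double>& G) {
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t p = 0; p < k; ++p)
                for (std::size_t i = 0; i < N; ++i)
                    G[j * N + i] += A[p * N + i] * B[p * N + j];
        k = 0;
    }
};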

Algorithmic motifs and their arithmetic intensity (number of operations per word of memory transferred), ordered from O(1) through O(log N) to O(N):

§ Application motifs: finite difference / stencil in S3D and WRF (& COSMO), rank-1 update in HF-QMC, QMR in WL-LSMS, sparse linear algebra (all near O(1)); rank-N update in DCA++ and Linpack (Top500) toward O(N)
§ Generic kernels: matrix-vector and vector-vector operations (BLAS 1 & 2) at O(1); fast Fourier transforms (FFTW & SPIRAL) at O(log N); dense matrix-matrix operations (BLAS 3) at O(N)
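As a rough check on where the two ends of this scale come from, standard operation and traffic counts (added here for illustration, not numbers from the slide) give:

% Dense matrix-matrix multiply (BLAS 3): C = A*B with N x N matrices
% flops ~ 2N^3, data ~ 3N^2 words  =>  alpha grows with N
\alpha_{\mathrm{GEMM}} \approx \frac{2N^3}{3N^2} = \mathcal{O}(N)
% Low-order stencil sweep: a fixed handful of flops and of words per grid point
% (the exact constant depends on caching)  =>  alpha stays bounded
\alpha_{\mathrm{stencil}} \approx \frac{\mathcal{O}(1)\ \text{flops/point}}{\mathcal{O}(1)\ \text{words/point}} = \mathcal{O}(1)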

Supercomputers are designed for certain algorithmic motifs – which ones?

Petaflop/s = 10^15 64-bit floating point operations per second. Which takes more energy?

A 64-bit floating-point fused multiply-add, or moving three 64-bit operands 20 mm across the die?

Example fused multiply-add: 934,569.299814557 × 52.827419489135904 + 4.20349729193958 = 49,370,888.64646892 (intermediate product 49,370,884.442971624253823), versus moving the three operands 20 mm across the die.

Moving the operands takes over 3x the energy of the arithmetic, and loading the data from off-chip takes more than 10x more still (source: Steve Scott, Cray Inc.). Moving data is expensive; exploiting data locality is critical to energy efficiency. If we care about energy consumption, we have to worry about these and other physical considerations of the computation. But where is the separation of concerns?

COSMO in production for Swiss weather prediction

§ ECMWF: 2x per day, 16 km lateral grid, 91 layers
§ COSMO-7: 3x per day, 72h forecast, 6.6 km lateral grid, 60 layers
§ COSMO-2: 8x per day, 24h forecast, 2.2 km lateral grid, 60 layers

§ Some of the products generated from these simulations
  § Daily weather forecast
  § Forecasting for air traffic control (Skyguide)
  § Safety management in the event of nuclear incidents

Insight into the model/methods/algorithms used in COSMO

§ PDE on a structured grid (variables: velocity, temperature, pressure, humidity, etc.)
§ Explicit solve horizontally (I, J) using finite difference stencils
§ Implicit solve in the vertical direction (K) with a tri-diagonal solve in every column (applying the Thomas algorithm in parallel; can be expressed as a stencil)

Due to the implicit solve in the vertical we can work with longer time steps: the ~2 km horizontal grid spacing, not the ~60 m vertical spacing, determines the time step.

[Figure: grid column with ~2 km horizontal and ~60 m vertical spacing; tri-diagonal solves run along K in each (I, J) column]
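For reference, the per-column solve behind this motif is the Thomas algorithm. The sketch below is a generic textbook version for a single column (the array names and the diagonal-dominance assumption are mine, not the COSMO code); in the dynamical core it is applied independently to every (I, J) column, which is what keeps the motif parallel over the horizontal plane.

#include <vector>

// Thomas algorithm: solve a_k x_{k-1} + b_k x_k + c_k x_{k+1} = d_k for one
// vertical column of K levels. No pivoting; assumes the system is diagonally
// dominant, as the implicit vertical operators in NWP models typically are.
void thomas_solve(const std::vector<double>& a,   // sub-diagonal,   a[0] unused
                  const std::vector<double>& b,   // diagonal
                  const std::vector<double>& c,   // super-diagonal, c[K-1] unused
                  std::vector<double>& d)         // right-hand side in, solution out
{
    const std::size_t K = b.size();
    std::vector<double> cp(K);

    // Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0];
    d[0]  = d[0] / b[0];
    for (std::size_t k = 1; k < K; ++k) {
        const double m = 1.0 / (b[k] - a[k] * cp[k - 1]);
        cp[k] = c[k] * m;
        d[k]  = (d[k] - a[k] * d[k - 1]) * m;
    }

    // Backward substitution (loop-carried dependency in K: inherently sequential per column).
    for (std::size_t k = K - 1; k-- > 0; )
        d[k] -= cp[k] * d[k + 1];
}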

Hence the algorithmic motifs are:

§ Tri-diagonal solve
  § vertical K-direction
  § with loop-carried dependencies in K
§ Finite difference stencil computations
  § focus on horizontal IJ-plane access
  § no loop-carried dependencies
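A plain-loop sketch of the stencil motif (field layout and loop nest chosen for illustration, not taken from the COSMO Fortran): because each output point reads only the unchanged input field, there are no loop-carried dependencies and the horizontal loops parallelize directly, here with OpenMP.

#include <vector>
#include <cstddef>

// Simple 3D field with one fixed storage order (I fastest); the stencil library
// discussed later abstracts exactly this choice away.
struct Field {
    std::size_t ni, nj, nk;
    std::vector<double> data;
    Field(std::size_t I, std::size_t J, std::size_t K)
        : ni(I), nj(J), nk(K), data(I * J * K, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + ni * (j + nj * k)];
    }
};

// 2nd-order horizontal Laplacian: independent per point, OpenMP-parallel over J.
void horizontal_laplacian(Field& lap, Field& in) {
    #pragma omp parallel for
    for (long j = 1; j < (long)in.nj - 1; ++j)
        for (long k = 0; k < (long)in.nk; ++k)
            for (long i = 1; i < (long)in.ni - 1; ++i)
                lap(i, j, k) = -4.0 * in(i, j, k)
                             + in(i + 1, j, k) + in(i - 1, j, k)
                             + in(i, j + 1, k) + in(i, j - 1, k);
}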

Performance profile of the (original) COSMO-CCLM

Runtime profile based on the 2 km production model of MeteoSwiss.

[Chart: % of code lines (F90) vs. % of runtime per component]

Analyzing two examples/motifs – how are they different?

Physics: 3 memory accesses, 136 FLOPs → compute bound

Dynamics: 3 memory accesses, 5 FLOPs → memory bound
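In arithmetic-intensity terms (assuming 8-byte operands; this conversion is an added illustration, not from the slide):

\alpha_{\mathrm{physics}} \approx \frac{136\ \text{flops}}{3 \times 8\ \text{bytes}} \approx 5.7\ \text{flops/byte}
\qquad
\alpha_{\mathrm{dynamics}} \approx \frac{5\ \text{flops}}{3 \times 8\ \text{bytes}} \approx 0.2\ \text{flops/byte}

At roughly 0.2 flops/byte the dynamics kernel sits far below the machine balance of that generation's CPUs and GPUs (of order a few flops per byte), so its speed is set by memory bandwidth; the physics kernel, at several flops per byte, can keep the floating-point units busy.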

Running the simple examples on the Cray XK6

Compute bound (physics) problem:
Machine  | Interlagos | Fermi (M2090) | GPU + transfer
Time     | 1.31 s     | 0.17 s        | 1.9 s
Speedup  | 1.0 (REF)  | 7.6           | 0.7

Memory bound (dynamics) problem:
Machine  | Interlagos | Fermi (M2090) | GPU + transfer
Time     | 0.16 s     | 0.038 s       | 1.7 s
Speedup  | 1.0 (REF)  | 4.2           | 0.1

The simple lesson: leave data on the GPU!
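In CUDA runtime terms, "leave data on the GPU" looks like the sketch below (the field size, kernel body, and output interval are illustrative; the real code keeps its fields inside the stencil library's storage abstraction): allocate and upload once, run many steps back to back on the device, and copy back only when output is actually needed.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void dynamics_step(double* field, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) field[idx] += 0.1 * field[idx];   // stand-in for a real stencil update
}

int main() {
    const int n = 1 << 24;                 // illustrative field size
    const int steps = 1000, output_every = 100;

    double* h_field = new double[n]();
    double* d_field = nullptr;
    cudaMalloc(&d_field, n * sizeof(double));

    // One upload at the start ...
    cudaMemcpy(d_field, h_field, n * sizeof(double), cudaMemcpyHostToDevice);

    for (int t = 1; t <= steps; ++t) {
        dynamics_step<<<(n + 255) / 256, 256>>>(d_field, n);
        // ... and a download only when output is needed, not once per kernel
        // launch (per-step transfers are what kill the "GPU + transfer" column).
        if (t % output_every == 0)
            cudaMemcpy(h_field, d_field, n * sizeof(double), cudaMemcpyDeviceToHost);
    }
    cudaDeviceSynchronize();
    std::printf("field[0] = %f\n", h_field[0]);

    cudaFree(d_field);
    delete[] h_field;
    return 0;
}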

Performance profile of the (original) COSMO-CCLM (revisited)

Runtime profile based on the 2 km production model of MeteoSwiss.

[Chart: % of code lines (F90) vs. % of runtime per component]

Annotated on the profile: original code (with OpenACC for GPU) vs. rewrite in C++ (with CUDA backend for GPU).

Stencil Library Ideas

§ Implement a stencil library using C++ and template meta-programming
  § 3D structured grid
  § Parallelization in the horizontal IJ-plane (sequential loop in K for tridiagonal solves)
  § Multi-node support using explicit halo exchange (Generic Communication Library; not covered by this presentation)
§ Abstract the hardware platform (CPU/GPU)
  § Adapt loop order and storage layout to the platform (a sketch of this follows the table below)
  § Leverage software caching
§ Hide complex and "ugly" optimizations
  § Blocking
§ Single source code compiles to multiple platforms
  § Currently, efficient back-ends are implemented for CPU and GPU

                                  | CPU    | GPU
Storage order (Fortran notation)  | KIJ    | IJK
Parallelization                   | OpenMP | CUDA
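A minimal sketch of how a compile-time layout policy can realize the table above without changing user code; the class and policy names are invented for illustration, and the actual stencil library is far more elaborate (blocking, software caching, GPU memory management).

#include <cstddef>
#include <vector>

// Compile-time storage-order policies (Fortran notation: the first index is stride-1).
struct LayoutKIJ {  // CPU back-end in the table above
    static std::size_t index(std::size_t i, std::size_t j, std::size_t k,
                             std::size_t ni, std::size_t nj, std::size_t nk) {
        return k + nk * (i + ni * j);
    }
};
struct LayoutIJK {  // GPU back-end in the table above
    static std::size_t index(std::size_t i, std::size_t j, std::size_t k,
                             std::size_t ni, std::size_t nj, std::size_t nk) {
        return i + ni * (j + nj * k);
    }
};

// A field whose memory layout is fixed at compile time by the Layout policy.
// User code writes f(i, j, k) and never sees the platform-specific ordering.
template <typename T, typename Layout>
class Field3D {
    std::size_t ni_, nj_, nk_;
    std::vector<T> data_;
public:
    Field3D(std::size_t ni, std::size_t nj, std::size_t nk)
        : ni_(ni), nj_(nj), nk_(nk), data_(ni * nj * nk) {}
    T& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[Layout::index(i, j, k, ni_, nj_, nk_)];
    }
};

// Selecting the back-end then amounts to a typedef, e.g.:
//   using CpuField = Field3D<double, LayoutKIJ>;
//   using GpuField = Field3D<double, LayoutIJK>;  // plus device allocation in the real library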

"Apples to Oranges" comparison of new vs. current COSMO

Speedup of the new code on a K20x GPU vs. the current code on AMD Interlagos (@ 2.1 GHz).

[Chart: speedup for Physics, Dynamics, and Dyn+Physics at domain sizes 64x128, 128x128, and 256x256, on a 0-7x scale; the new C++ code with CUDA (*) sits near the top of the range, the current F90 code with OpenACC around 2x]

(*) new C++ code with OpenMP is about 2x faster than current F90 code

Solving the Kohn-Sham equation is the bottleneck of most DFT-based materials science codes

Kohn-Sham equation:
$\left(-\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{LDA}}(\vec r)\right)\psi_i(\vec r) = \epsilon_i\,\psi_i(\vec r)$

Ansatz:
$\psi_i(\vec r) = \sum_\mu c_{i\mu}\,\phi_\mu(\vec r)$

Hermitian matrix:
$H_{\mu\nu} = \int \phi_\mu^*(\vec r)\left(-\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{LDA}}(\vec r)\right)\phi_\nu(\vec r)\,d\vec r$

The basis may not be orthogonal:
$S_{\mu\nu} = \int \phi_\mu^*(\vec r)\,\phi_\nu(\vec r)\,d\vec r$

Solve the generalized eigenvalue problem:
$(H - \epsilon_i S)\,c_i = 0$

Solving the Kohn-Sham equation is the bottleneck of most DFT-based materials science codes (continued)

Remarks:
§ Typical matrix size is up to 10'000
§ Only a few methods require high accuracy and matrix sizes up to 100'000
§ Some methods try to avoid eigensolvers (orthogonalization will be the important motif instead)
§ When lower accuracy is an option, order-N methods are used instead (this is covered by one of our HP2C projects, see CP2K on hp2c.ch)

Dominant motif: solve the generalized eigenvalue problem $(H - \epsilon_i S)\,c_i = 0$; we will need between 10% and 50% of the eigenvectors.

Solving the generalized eigenvalue problem

Standard one-stage solver for $A x = \lambda B x$:

§ xPOTRF: Cholesky factorization $B = L L^H$
§ xHEGST: transform to a standard problem $A' = L^{-1} A L^{-H}$, so that $A' y = \lambda y$
§ xHEEVx: solve $A' y = \lambda y$
  § xHETRD: tridiagonalization $T = Q^H A' Q$ (most time-consuming step, dominated by level-2 BLAS, memory bound)
  § xSTExx: solve $T y' = \lambda y'$
  § xUNMTR: back-transform the eigenvectors $y = Q y'$
§ xTRSM: back-substitute $x = L^{-H} y$
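For concreteness, a single-node sketch of this one-stage path using standard LAPACKE/CBLAS routines (assuming itype = 1, lower triangles, column-major storage, and all eigenvectors requested; the solvers discussed in this talk use ScaLAPACK, ELPA, or MAGMA rather than these serial calls, and the header names may differ by BLAS distribution):

#include <complex>
#include <lapacke.h>
#include <cblas.h>

// Solve A x = lambda B x for Hermitian A and Hermitian positive definite B.
// A and B are n x n, column-major; on return A holds the eigenvectors x and
// w holds the eigenvalues. std::complex<double> is layout-compatible with
// lapack_complex_double, hence the reinterpret_cast.
int generalized_eig_one_stage(int n, std::complex<double>* A,
                              std::complex<double>* B, double* w) {
    auto* a = reinterpret_cast<lapack_complex_double*>(A);
    auto* b = reinterpret_cast<lapack_complex_double*>(B);

    // xPOTRF: B = L L^H (B is overwritten by L).
    if (LAPACKE_zpotrf(LAPACK_COL_MAJOR, 'L', n, b, n) != 0) return -1;

    // xHEGST: A' = L^{-1} A L^{-H} (A is overwritten by A').
    if (LAPACKE_zhegst(LAPACK_COL_MAJOR, 1, 'L', n, a, n, b, n) != 0) return -2;

    // xHEEVD: A' y = lambda y (internally xHETRD + xSTEDC + xUNMTR);
    // A is overwritten by the eigenvectors y.
    if (LAPACKE_zheevd(LAPACK_COL_MAJOR, 'V', 'L', n, a, n, w) != 0) return -3;

    // xTRSM: x = L^{-H} y, i.e. solve L^H x = y for every eigenvector column.
    const std::complex<double> one(1.0, 0.0);
    cblas_ztrsm(CblasColMajor, CblasLeft, CblasLower, CblasConjTrans, CblasNonUnit,
                n, n, &one, B, n, A, n);
    return 0;
}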

Solving the generalized eigenvalue problem with a two-stage solver

$A x = \lambda B x$:

§ xPOTRF: $B = L L^H$
§ xHEGST: $A' = L^{-1} A L^{-H}$, giving the standard problem $A' y = \lambda y$
§ Reduction to banded form: $A'' = Q_1^H A' Q_1$ (most time-consuming step, but dominated by BLAS-3)
§ Tri-diagonalize: $T = Q_2^H A'' Q_2$
§ Solve $T y' = \lambda y'$
§ Two eigenvector back-transformations are needed (but easy to parallelize): $y'' = Q_2 y'$, then $y = Q_1 y''$
§ xTRSM: $x = L^{-H} y$

The two-stage solver maps best to a hybrid CPU-GPU architecture

Standard problem $A' y = \lambda y$:

§ Reduction to banded form $A'' = Q_1^H A' Q_1$: hybrid (CPU + GPU)
§ Tri-diagonalization $T = Q_2^H A'' Q_2$: GPU/CPU
§ Divide and conquer solve of $T y' = \lambda y'$: CPU
§ The two eigenvector back-transformations $y'' = Q_2 y'$ and $y = Q_1 y''$ (easy to parallelize): GPU

Comparison of general eigensolvers: multi-core vs. hybrid (double complex, matrix size 10'000)

§ Hybrid with MAGMA (*): 6 threads + GPU (1x Intel X5650, 1x Nvidia M2090)
§ Multi-core with MKL: 12 threads (2x Intel X5650)
§ Multi-core with ELPA: 12 processes (2x Intel X5650)

[Charts: time to solution for all eigenvectors and for 10% of the eigenvectors]

(*) A. Haidar, S. Tomov, J. Dongarra, R. Solcà, and T. C. Schulthess, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," to appear in Int. J. on High-Perf. Computing.

> Increase data locality with tridiagonalization in two stages
> True hybrid CPU+GPU implementation of the solver (see Azzam Haidar's talk)

Essential lessons learned

In order to improve/maximize data locality, work on algorithms and application software code will be essential.

§ DCA++ code/algorithm:
  § Changed the original Hirsch-Fye quantum Monte Carlo algorithm in order to replace the rank-1 Green's function update with a rank-k update
§ COSMO (weather/climate):
  § Improved CPU performance by a factor of ~2 by (1) recomputing parameters on the fly to reduce data movement and (2) a significant software rewrite with improved data locality (no free lunch)
  § The CUDA implementation of the DyCore is ~3x faster on GPU than OpenMP on CPU (reflecting the difference in memory bandwidth, no magic!)
§ Eigensolvers for ab initio materials science codes:
  § Replaced the single-stage tridiagonalization, which relies on matrix-vector (BLAS-2) operations, with a two-stage approach in which the computation is dominated by a reduction to banded form and BLAS-3
  § Implemented for hybrid CPU-GPU

A pragmatic/practical list of things we need (today, and certainly when we reach exascale)

§ Algorithm redesign and code refactoring – continue investing in this area!
  § Increasing data locality is a key factor
  § Scalability has to be designed into the application from the bottom up – a problem for legacy codes!
§ Programming model/environment for distributed memory & complex nodes
  § Massive multi-threading on a node
  § Heterogeneous memory and memory hierarchies
  § Hybrid: scalar and parallel/multi-threaded processing units
  § Has to be usable along with standard HPC languages (C, C++, Fortran)
  § Support for the creation of domain/data-structure-specific embedded languages
§ This must include practical tools/mechanisms, such as
  § Numerical libraries (as usual)
  § Data-structure-specific tools/templates/languages
  § ...

WHAT ABOUT THE SEPARATION OF CONCERNS?

The layers from model to machine:

§ Physical model (velocities, pressure, temperature, water, turbulence)
§ Mathematical description
§ Discretization / algorithm
§ Code / implementation:

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation
§ A given supercomputer

"Port" serial code to supercomputers: > vectorize > parallelize > petascaling > exascaling > ...

The same layers, annotated with who drives what:

§ Physical model (velocities, pressure, temperature, water, turbulence), mathematical description, discretization / algorithm – driven by domain science
§ Code / implementation – libraries, DSL / DSEL; needs to be properly embedded – but how?

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation, optimal algorithm, auto-tuning, architectural options / design – driven by vendors

COSMO current and new code could hint at a new separation of concerns

[Diagram: software stacks of the current and the new COSMO code, side by side. Both run from main through dynamics, physics, and boundary conditions down to MPI and the system; in the new code the dynamics sits on a stencil library with X86 and GPU back-ends and halo exchange via GCL. The layers map onto: prototyping code / interactive data analysis, application code, Domain Specific Libraries & Tools (DSL&T), basic libraries (incl. BLAS, LAPACK, FFT, ...), and the system.]

Some DSL&T could cut across multiple domains (e.g. grid/structured grid tools, or tools for other algorithmic motifs)
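To make the DSL&T idea concrete, here is a toy sketch of application code written against a stencil-library-style interface; Grid and apply_interior are invented for illustration and are not the API of the actual library, which in addition owns blocking, storage order, halo exchange, and the GPU back-end. The point is only the split of responsibilities: the library owns the traversal, the application supplies the per-point update.

#include <vector>
#include <cstddef>

// A toy "stencil library": the library owns the grid traversal (and, in a real
// implementation, the platform-specific loop order, blocking, and back-end);
// the application only supplies the per-point update as a callable.
struct Grid {
    std::size_t ni, nj, nk;
    std::vector<double> data;
    Grid(std::size_t I, std::size_t J, std::size_t K)
        : ni(I), nj(J), nk(K), data(I * J * K, 0.0) {}
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data[i + ni * (j + nj * k)];
    }
};

template <typename Stencil>
void apply_interior(Grid& out, Grid& in, Stencil&& stencil) {
    for (std::size_t k = 0; k < in.nk; ++k)
        for (std::size_t j = 1; j + 1 < in.nj; ++j)
            for (std::size_t i = 1; i + 1 < in.ni; ++i)
                out(i, j, k) = stencil(in, i, j, k);
}

int main() {
    Grid in(64, 64, 60), lap(64, 64, 60);
    in(32, 32, 30) = 1.0;

    // Application code: only the stencil body, no loops, no layout, no platform details.
    apply_interior(lap, in, [](Grid& d, std::size_t i, std::size_t j, std::size_t k) {
        return -4.0 * d(i, j, k)
             + d(i + 1, j, k) + d(i - 1, j, k)
             + d(i, j + 1, k) + d(i, j - 1, k);
    });
    return 0;
}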

The same picture once more, now with the PASC co-design projects marked:

§ Physical model (velocities, pressure, temperature, water, turbulence), mathematical description, discretization / algorithm – driven by domain science
§ Code / implementation – libraries, DSL / DSEL; needs to be properly embedded

lap(i,j,k) = -4.0 * data(i,j,k) +
             data(i+1,j,k) + data(i-1,j,k) +
             data(i,j+1,k) + data(i,j-1,k);

§ Code compilation, optimal algorithm, auto-tuning, architectural options / design – driven by vendors
§ PASC co-design projects

Platform for Advanced Scientific Computing

§ High-risk & high-impact projects (www.hp2c.ch)
§ Application-driven co-design of a pre-exascale supercomputing ecosystem (proposal under consideration)

Timeline (2009-2017):
§ 2009: begin construction of new building
§ 2010: new building complete
§ Monte Rosa, Cray XT5, 14'762 cores; hex-core upgrade to 22'128 cores; upgrade to Cray XE6, 47'200 cores
§ Development & procurement of petaflop/s scale supercomputer(s)
§ Development & procurement of pre-exaflop/s scale supercomputers (toward 2017)
§ Platform for Advanced Scientific Computing spanning the later years

Acknowledgements

§ Team working on the COSMO refactoring
  § Oliver Fuhrer (MeteoSwiss)
  § Tobias Gysi and Daniel Müller (Supercomputing Systems)
  § Xavier Lapillonne, Carlos Osuna (C2SM@ETH)
  § Will Sawyer, Mauro Bianco, Ugo Varetto, Ben Cumming, TCS (CSCS)
  § Tim Schröder, Peter Messmer (NVIDIA)
  § Ulli Schättler, Michael Baldauf (DWD)
§ Team working on eigensolvers for hybrid architectures (MAGMA)
  § Stan Tomov, Azzam Haidar, Jack Dongarra (UTK), Raffaele Solcà, and TCS (ETH)
§ Discussions with and inputs for the materials science part
  § Volker Blum (FHI, Berlin), Paul Kent (ORNL), Anton Kozhevnikov (ETH)
  § Nicola Marzari (EPFL), Joost VandeVondele (ETH), Alessandro Curioni (IBM)
  § PRACE 2IP WP8 Materials team, in particular Marc Torrent (CEA), Fabio Affinito (CINECA), Georg Huhs (BSC), Xavier Gonze (U. of Louvain-la-Neuve)
§ Funding for much of the HPC-related work
  § Swiss University Conference through the HP2C Platform (www.hp2c.ch)
  § PRACE 2IP WP8 and all institutions that had to provide matching funds

THANK YOU!
