“Piz Daint” & “Piz Kesch”: from general-purpose supercomputing to an appliance for weather forecasting

Thomas C. Schulthess

“Piz Daint”: Cray XC30 with 5,272 hybrid, GPU-accelerated compute nodes

Compute node:
> Host: Intel Xeon E5-2670 (Sandy Bridge, 8 cores)
> Accelerator: NVIDIA Tesla K20X GPU (GK110)

“Piz Kesch”
September 15, 2015: “Today’s Outlook: GPU-accelerated Weather Forecasting,” by John Russell

Swiss High-Performance Computing & Networking Initiative (HPCN)

Three-pronged approach of the HPCN Initiative:
1. New, flexible, and efficient building
2. Efficient supercomputers
3. Efficient applications

High-risk & high-impact projects (www.hp2c.ch), application-driven co-design.

[Timeline, 2009–2017: begin construction of new building (2009); new building complete (2010); “Monte Rosa” Cray XT5, 14’762 cores; hex-core upgrade to 22’128 cores; Cray XE6 upgrade, 47,200 cores; Phase I: development & procurement of petaflop/s-scale system(s) with Aries network & multi-core processors; upgrade to K20X-based hybrid (2014); Phase II of the pre-exascale supercomputing ecosystem: upgrade to Pascal-based hybrid (2016/17)]

Platform for Advanced Scientific Computing (PASC)

Structuring project of the Swiss University Conference (swissuniversities)
‣ 5 domain science networks: Climate, Materials simulations, Physics, Life Sciences, Solid Earth Dynamics
‣ Distributed application support
‣ >20 projects (see: www.pasc-ch.org):
1. ANSWERS
2. Angiogenesis
3. AV-FLOPW
4. CodeWave
5. Coupled Cardiac Simulations
6. DIAPHANE
7. Direct GPU to GPU com.
8. Electronic Structure Calc.
9. ENVIRON
10. Genomic Data Processing
11. GeoPC
12. GeoScale
13. Grid Tools
14. Heterogen. Compiler Platform
15. HPC-ABGEM
16. MD-based drug design
17. Multiscale applications
18. Multiscale economical data
19. Particles and fields
20. Snowball sampling

Leutwyler, D., O. Fuhrer, X. Lapillonne, D. Lüthi, and C. Schär, 2015: Continental-Scale Climate Simulation at Kilometer Resolution. ETH Zurich Online Resource, DOI: http://dx.doi.org/10.3929/ethz-a-010483656, online video: http://vimeo.com/136588806

Meteo Swiss production suite until March 30, 2016:
‣ ECMWF: 2x per day, 16 km lateral grid, 91 layers
‣ COSMO-7: 3x per day, 72h forecast, 6.6 km lateral grid, 60 layers
‣ COSMO-2: 8x per day, 24h forecast, 2.2 km lateral grid, 60 layers

Some of the products generated from these simulations:
‣ Daily weather forecast on TV / radio
‣ Forecasting for air traffic control (Sky Guide)
‣ Safety management in the event of nuclear incidents

“Albis” & “Lema”: CSCS production systems for Meteo Swiss until March 2016

Cray XE6, procured in spring 2012, based on 12-core AMD Opteron processors

Improving simulation quality requires higher performance – what exactly and by how much?

Resource-determining factors for Meteo Swiss’ simulations

Current model (running through spring 2016):
‣ COSMO-2: 24h forecast in 30 min., 8x per day

New model (starting operation in spring 2016):
‣ COSMO-1: 24h forecast in 30 min., 8x per day (~10x COSMO-2)
‣ COSMO-2E: 21-member ensemble, 120h forecast in 150 min., 2x per day (~26x COSMO-2)
‣ KENDA: 40-member ensemble, 1h forecast in 15 min., 24x per day (~5x COSMO-2)

The new production system must deliver ~40x the simulation performance of “Albis” and “Lema”.
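A quick tally of the multipliers above, in units of one COSMO-2 production run, shows where the ~40x requirement comes from:

$$10\ (\text{COSMO-1}) + 26\ (\text{COSMO-2E}) + 5\ (\text{KENDA}) = 41 \approx 40$$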

State-of-the-art implementation of the new system for Meteo Swiss

Albis & Lema: 3 cabinets of Cray XE6, installed Q2/2012

• New system needs to be installed Q2–3/2015
• Assuming a 2x improvement in per-socket performance, ~20x more x86 sockets would be needed, i.e. about 30 Cray XC cabinets (see the check below)
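A back-of-the-envelope check of that cabinet count, assuming (a rough guess, not stated in the source) that one XC cabinet packs about twice as many sockets as one XE6 cabinet:

$$\frac{\sim 40\times}{2\times \text{ per socket}} \approx 20\times\ \text{sockets}, \qquad 3\ \text{XE6 cabinets} \times \frac{20}{2} = 30\ \text{XC cabinets}$$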

A new system for Meteo Swiss built the way the German Weather Service (DWD), the UK Met Office, or ECMWF built theirs would occupy 30 XC racks; the current Cray XC30/XC40 platform sits in a machine room with space for 40 XC racks.

Thinking inside the box is not a good option!

CSCS machine room

COSMO: old and new (refactored) code

Current code (used in production by most weather services and most HPC centres, incl. MeteoSwiss until 3/2016):
‣ main (Fortran)
‣ dynamics (Fortran) and physics (Fortran)
‣ boundary conditions & halo exchange (Fortran)
‣ MPI
‣ system

New, refactored code (HP2C/PASC development; in production on “Piz Daint” since 01/2014 and for Meteo Swiss since 04/2016):
‣ main (Fortran)
‣ dynamics (C++) built on a stencil library with x86 and GPU backends
‣ physics (Fortran) with OpenMP / OpenACC
‣ halo exchange via a Generic Communication Library on shared infrastructure (MPI or whatever)
‣ system
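To make the refactoring concrete, here is a minimal C++ sketch of the stencil-library idea. It is an illustrative toy under stated assumptions, not the actual API of the library used in COSMO (the names Field, laplacian, and apply are hypothetical). The point is that the point-wise operator is written once, while the backend owns the loop nest and storage order, so an x86 backend and a GPU backend can share the same operator:

// Minimal sketch of the stencil-library idea (hypothetical API, not the
// actual interface of the library used in COSMO): the point-wise operator
// is written once; the backend owns the loop nest and storage order.
#include <cstdio>
#include <vector>

struct Field {                       // simple 3D field with i-j-k storage
    int ni, nj, nk;
    std::vector<double> data;
    Field(int ni_, int nj_, int nk_)
        : ni(ni_), nj(nj_), nk(nk_),
          data(static_cast<size_t>(ni_) * nj_ * nk_, 0.0) {}
    double& operator()(int i, int j, int k) {
        return data[(static_cast<size_t>(k) * nj + j) * ni + i];
    }
};

// Stencil operator: 2D horizontal Laplacian, defined per grid point.
inline double laplacian(Field& in, int i, int j, int k) {
    return in(i + 1, j, k) + in(i - 1, j, k)
         + in(i, j + 1, k) + in(i, j - 1, k) - 4.0 * in(i, j, k);
}

// CPU "backend": a plain loop nest. A GPU backend would launch the same
// operator as a CUDA kernel, possibly with a different storage layout.
void apply(Field& out, Field& in) {
    for (int k = 0; k < in.nk; ++k)
        for (int j = 1; j < in.nj - 1; ++j)
            for (int i = 1; i < in.ni - 1; ++i)
                out(i, j, k) = laplacian(in, i, j, k);
}

int main() {
    Field in(16, 16, 8), out(16, 16, 8);
    in(8, 8, 0) = 1.0;                                // point source
    apply(out, in);
    std::printf("lap(8,8,0) = %f\n", out(8, 8, 0));   // expect -4.0
    return 0;
}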

Piz Kesch / Piz Escha: appliance for meteorology

• Water-cooled rack (48U)
• 12 compute nodes, each with:
  ‣ 2 Intel Xeon E5-2690v3 (12 cores @ 2.6 GHz)
  ‣ 256 GB 2133 MHz DDR4 memory
  ‣ 8 NVIDIA Tesla K80 GPUs
• 3 login nodes
• 5 post-processing nodes
• Mellanox FDR InfiniBand
• Cray CLFS Lustre storage
• Cray Programming Environment
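Each Tesla K80 card carries two GK210 chips, so the per-rack accelerator count works out to:

$$12\ \text{nodes} \times 8\ \text{K80} = 96\ \text{K80 cards} = 192\ \text{GK210 GPUs}$$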

Origin of the factor-40 performance improvement

Performance of COSMO running on the new “Piz Kesch” (as of Sept. 2015) compared to:
(1) the previous production system, a Cray XE6 with AMD Barcelona
(2) “Piz Dora”, a Cray XC40 with Intel Haswell (E5-2690v3)

• Current production system installed in 2012; new Piz Kesch/Escha installed in 2015
• Processor performance (Moore’s law): 2.8x
• Improved system utilisation: 2.8x
• Software refactoring: general software performance 1.7x, port to GPU architecture 2.3x
• Increase in number of processors: 1.3x
• Total performance improvement: ~40x
• Bonus: the simulation running on GPUs is 3x more energy efficient than on conventional state-of-the-art CPUs
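The listed factors compound multiplicatively, which is where the ~40x total comes from:

$$2.8 \times 2.8 \times 1.7 \times 2.3 \times 1.3 \approx 39.9 \approx 40$$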

A factor-40 improvement within the same footprint

Current production system: Albis & Lema
New system: Kesch & Escha

[Roadmap, 2011 to 2017+: GPU-accelerated hybrid vs. multi-core (accelerated) systems; DARPA HPCS; MeteoSwiss (2015), U. Tokyo (2016); 2017+: Summit, Aurora, post-K, Tsubame 3.0]

Both architectures have heterogeneous memory!