The Swiss High-Performance Computing and Networking (HPCN) Initiative

Thomas C. Schulthess

| T. Schulthess !1 What is considered success in HPC?

| T. Schulthess !2 High-Performance Computing Initiative (HPCN) in Switzerland

Three-pronged approach of the HPCN Initiative: (1) a new, flexible, and efficient building; (2) efficient supercomputers; (3) efficient applications, through high-risk & high-impact projects (www.hp2c.ch) and application-driven co-design.

Timeline 2009–2017: begin construction of the new building and complete it; Monte Rosa (Cray XT5, 14’762 cores), hex-core upgrade to 22’128 cores, upgrade to a Cray XE6 with 47,200 cores; development & procurement of petaflop/s-scale system(s); Phase I with the Aries network & multi-core nodes; Phase II upgrade to a K20X-based hybrid; upgrade to a Pascal-based hybrid; Phase III of the pre-exascale supercomputing ecosystem.

| T. Schulthess !3 FACT SHEET

“Piz Daint”, one of the most powerful supercomputers

A hardware upgrade in the final quarter of 2016 saw “Piz Daint”, Europe’s most powerful supercomputer, more than triple its computing performance. ETH Zurich invested around CHF 40 million in the upgrade, so that simulations, data analysis and visualisation can be performed more efficiently than ever before.

With a peak performance of seven petaflops, “Piz Daint” has been Europe’s fastest supercomputer since its debut in November 2013. And it is set to remain number one for now thanks to a hardware upgrade in late 2016, which boosted its peak performance to more than 25 petaflops. This increase in performance is vital for enabling the higher-resolution, compute- and data-intensive simulations used in modern materials science, physics, geophysics, life sciences and climate science. Data science too, an area where ETH Zurich is establishing strategic research strength, calls for high-power computing facilities. These fields involve the processing of vast amounts of data.

Thanks to the new hardware, researchers can run their simulations more realistically and more efficiently. In the future, big science experiments such as the Large Hadron Collider at CERN will also see their data analysis support provided by “Piz Daint”.

ETH Zurich has invested CHF 40 million in the upgrade of “Piz Daint” – from a Cray XC30 to a Cray XC40/XC50. The upgrade involved replacing two types of compute nodes as well as the integration of a novel technology from Cray known as DataWarp. DataWarp’s “burst buffer mode” quadruples the effective bandwidth to and from storage devices, markedly accelerating data input and output rates and so facilitating the analysis of millions of small, unstructured files. Thus, “Piz Daint” is able to analyse the results of its computations even while they are still in progress. The revamped “Piz Daint” remains an extremely energy-efficient and balanced system where simulations and data analyses are scalable from a few to thousands of compute nodes. The new system is now well equipped to provide an infrastructure that will accommodate the increasing demands in high performance computing (HPC) up until the end of the decade.

“Piz Daint” 2017 fact sheet

Piz Daint specifications: ~5’000 NVIDIA P100 GPU-accelerated nodes and ~1’400 dual multi-core-socket nodes

Model: Cray XC40/Cray XC50
Number of Hybrid Compute Nodes: 5 320
Number of Multicore Compute Nodes: 1 431
Theoretical Peak Floating-point Performance per Hybrid Node: 4.761 teraflops (Intel Xeon E5-2690 v3 / Nvidia Tesla P100)
Theoretical Peak Floating-point Performance per Multicore Node: 1.210 teraflops (Intel Xeon E5-2695 v4)
Theoretical Hybrid Peak Performance: 25.326 petaflops
Theoretical Multicore Peak Performance: 1.731 petaflops
Hybrid Memory Capacity per Node: 64 GB; 16 GB CoWoS HBM2
Multicore Memory Capacity per Node: 64 GB or 128 GB
Total System Memory: 437.9 TB; 83.1 TB
System Interconnect: Cray Aries routing and communications ASIC with Dragonfly network topology
Sonexion 3000 Storage Capacity: 6.2 PB
Sonexion 3000 Parallel File System Theoretical Peak Performance: 112 GB/s
Sonexion 1600 Storage Capacity: 2.5 PB
Sonexion 1600 Parallel File System Theoretical Peak Performance: 138 GB/s
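As a quick sanity check (my own arithmetic, not part of the fact sheet), the system-level peak numbers follow directly from the node counts and the per-node peaks:

```python
# Consistency check of the fact-sheet numbers: total peak = nodes x per-node peak.
hybrid_nodes, hybrid_tflops = 5320, 4.761        # XC50 nodes (Xeon E5-2690 v3 + Tesla P100)
multicore_nodes, multicore_tflops = 1431, 1.210  # XC40 nodes (Xeon E5-2695 v4)

print(f"hybrid peak    ~ {hybrid_nodes * hybrid_tflops / 1000:.2f} petaflops")      # ~25.33
print(f"multicore peak ~ {multicore_nodes * multicore_tflops / 1000:.2f} petaflops")  # ~1.73
```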

http://www.cscs.ch/publications/fact_sheets/index.html

| T. Schulthess !4

First production level GPU deployment in 2013

• 2009: Call for application development projects
• 2010: Start of 11 high-risk, high-impact development projects (incl. COSMO)
• Nov. 2011: decision to engage in a study with Cray (x86 vs. Xeon Phi vs. GPU)
• GPUs are NOT plan of record for Cray XC systems
• Jan. 2012: study of 9 applications shows GPUs are a viable option
• GPU blade design begins, but Xeon Phi (KNC) has higher priority
• Jul. 2012: difficulties in getting applications to perform on KNC
• Sep.–Oct. 2012: demonstrate that KNC can’t work (deficient memory subsystem)
• Nov. 2012: Cray switches priorities, putting GPU ahead of Xeon Phi
• Nov. 2012: first deployment of Piz Daint (CPU)
• Nov. 2013: full scale-out of Piz Daint with Kepler (K20X) GPUs
• Feb. 2014 – Apr. 2015: analysis, design, negotiations of the upgrade
• Nov. 2016: upgrade to Pascal GPUs
• Apr. 2017: fully integrated platform for compute (GPU & CPU) and data services

| T. Schulthess !5 FACT SHEET

“Piz Daint”, one of the most powerful supercomputers


“Piz Daint” 2017 fact sheet

Piz Daint specifications (as in the table above)

Institutions using Piz Daint
• User Lab (including PRACE Tier-0 allocations)
• University of Zurich, USI, PSI, EMPA
• MaterialsCloud and HBP Collaboratory (EPFL)
• CHIPP (since Aug. 2017)
• Others, e.g. the Swiss Data Science Center (exploratory)


| T. Schulthess !6

| T. Schulthess !7 Higher resolution is necessary for quantitative agreement with experiment (18 days for July 9–27, 2006)

Altdorf (Reuss valley) Lodrino (Leventina)

COSMO-2 COSMO-1

source: Oliver Fuhrer, MeteoSwiss

| T. Schulthess !8 Prognostic uncertainty: the weather system is chaotic → rapid growth of small perturbations (butterfly effect)

Ensemble method: compute the distribution over many simulations (figure labels: start, prognostic timeframe)
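To make the idea concrete, here is a toy sketch (my own illustration, not from the talk): the Lorenz-63 system stands in for the chaotic atmosphere, and an ensemble started from tiny perturbations of the same initial state shows how the spread grows over the prognostic timeframe.

```python
import numpy as np

def lorenz63_step(state, dt=0.005, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system (a chaotic toy model)."""
    x, y, z = state[..., 0], state[..., 1], state[..., 2]
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.stack([dx, dy, dz], axis=-1)

rng = np.random.default_rng(0)
n_members = 50
# All members start from (almost) the same analysis; only tiny perturbations differ.
ensemble = np.array([1.0, 1.0, 1.0]) + 1e-6 * rng.standard_normal((n_members, 3))

for step in range(1, 4001):                      # the "prognostic timeframe"
    ensemble = lorenz63_step(ensemble)
    if step % 1000 == 0:
        print(f"step {step:5d}  ensemble std (x, y, z) = {ensemble.std(axis=0)}")

# The spread grows from ~1e-6 to order one: far into the forecast a single run is
# not reliable, but the distribution over the ensemble still carries information.
```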

| T. Schulthess !9 Benefit of ensemble forecast (heavy thunderstorms on July 24, 2015)

Adelboden (reliable?)

source: Oliver Fuhrer, MeteoSwiss

| T. Schulthess !10 Benefit of ensemble forecast (heavy thunderstorms on July 24, 2015)

Adelboden

source: Oliver Fuhrer, MeteoSwiss

| T. Schulthess !11 MeteoSwiss’ performance ambitions in 2013

Requirements from MeteoSwiss: data assimilation (6x), ensemble with multiple forecasts (24x), grid refinement from 2.2 km to 1.1 km (10x), with a constant budget for investments and operations.

We need a 40x improvement between 2012 and 2015 at constant cost

| T. Schulthess !12 COSMO: old and new (refactored) code

Old code (current, Fortran): main calls dynamics (Fortran) and physics (Fortran), communicating via MPI on the system.

New code (refactored): main (Fortran) calls dynamics rewritten in C++ on top of a stencil library with boundary conditions & halo exchange, and physics (Fortran) with OpenMP/OpenACC; both sit on a generic communication library and shared infrastructure (MPI or whatever) and target x86 and GPU back-ends.
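The actual rewrite uses a C++ stencil library (with the Fortran physics annotated for OpenMP/OpenACC); the following is only an illustrative Python/mpi4py sketch of the two ingredients the diagram names: a stencil update on the interior of each subdomain and a halo exchange between neighbouring subdomains.

```python
# Illustrative sketch only: a 5-point stencil on the interior of a local
# subdomain plus a halo exchange with neighbouring ranks (1-D decomposition).
# Run with e.g.:  mpirun -n 4 python stencil_halo_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx, ny = 64, 64
field = np.zeros((nx + 2, ny))          # one halo row above and below the interior
field[1:-1, :] = float(rank)            # rank-dependent initial data

up, down = (rank - 1) % size, (rank + 1) % size   # periodic neighbours

for _ in range(10):
    # Halo exchange: send the first/last interior rows, receive into the halo rows.
    comm.Sendrecv(sendbuf=field[1, :],  dest=up,   recvbuf=field[-1, :], source=down)
    comm.Sendrecv(sendbuf=field[-2, :], dest=down, recvbuf=field[0, :],  source=up)

    # Stencil: a 5-point Laplacian applied to interior points (y boundaries held fixed).
    lap = (field[:-2, 1:-1] + field[2:, 1:-1] +
           field[1:-1, :-2] + field[1:-1, 2:] - 4.0 * field[1:-1, 1:-1])
    field[1:-1, 1:-1] += 0.1 * lap

print(f"rank {rank}: mean of local interior = {field[1:-1, :].mean():.3f}")
```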

| T. Schulthess !13 “Piz Kesch” (“Today’s Outlook: GPU-accelerated Weather Forecasting”, John Russell, September 15, 2015)

| T. Schulthess !14 Where the factor 40 improvement came from

Investment in software allowed mathematical improvements and change in architecture

Requirements from MeteoSwiss (data assimilation 6x, ensemble with multiple forecasts 24x, grid 2.2 km → 1.1 km 10x, at a constant budget for investments and operations) were met by:
• 1.7x from software refactoring (old vs. new implementation on x86)
• 2.8x from mathematical improvements (resource utilisation, precision)
• 2.3x from the change in architecture (CPU → GPU), with a bonus: reduction in power!
• 2.8x from Moore’s Law & architectural improvements on x86
• 1.3x from additional processors
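A quick check (my own arithmetic, not on the slide) confirms that these factors compound to roughly the required 40x:

```python
factors = {
    "software refactoring":      1.7,
    "mathematical improvements": 2.8,
    "CPU -> GPU architecture":   2.3,
    "Moore's law & x86 arch.":   2.8,
    "additional processors":     1.3,
}
total = 1.0
for factor in factors.values():
    total *= factor
print(f"combined speed-up ~ {total:.1f}x")   # ~39.8x, i.e. about the required 40x
```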

There is no silver bullet!

| T. Schulthess !15 Setting a new baseline for atmospheric simulations

The state-of-the-art implementation of COSMO running at most weather services on multi-core hardware.

~10x

The refactored version of COSMO running at MeteoSwiss on multi-core or GPU accelerated hardware.

| T. Schulthess !16 What is common to both systems (User Lab & MeteoSwiss)

We started by rewriting large applications

| T. Schulthess !17 “We develop algorithms, we don’t have time to deal with C/C++ or MPI”

–a well-known computer science colleague working in machine learning

| T. Schulthess !18 … echoed by many scientists working with data

Nishant Shukla (2017)

| T. Schulthess !19 Using Jupyter to get our heads around interactive supercomputing

• Jupyter allows users to
  • integrate development, execution of computations, and pre- and post-processing with visualisation into one “workflow”;
  • share these workflows in a team; and
  • document their work
• Need to provide an environment where the web-based front-end of the notebook is separated from the computation backend (see the sketch below)
  • run some of the computation on a supercomputer
• Interactive simply means sub-second response time
  • this requires properly organising data in the memory/storage subsystem
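One possible way to realise this separation (a hypothetical sketch, not the CSCS deployment; dask_jobqueue, the queue name and the resource numbers are my own choices) is to let the notebook request workers through the batch system and ship only results back to the front-end:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import dask.array as da

# The notebook front-end stays lightweight; the heavy lifting runs in SLURM jobs.
cluster = SLURMCluster(
    queue="normal",          # hypothetical partition name
    cores=12,                # cores per worker job
    memory="60GB",           # memory per worker job
    walltime="01:00:00",
)
cluster.scale(jobs=8)        # ask the batch system for 8 worker jobs
client = Client(cluster)     # the notebook now talks to remote workers

x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
print(x.mean().compute())    # computed on the workers; only the scalar result
                             # travels back to the notebook front-end
```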

| T. Schulthess !20 Docker solves the software deployment problem

Docker Container Service at CSCS
• Operational since 2016 for HBP SGA1 and now SGA2
• Currently using the “Shifter” solution to address HPC security problems
• Sarus, an OCI-compliant container runtime for Docker images on HPC systems, is under development
• Ongoing collaboration with Cray and Docker to improve HPC container orchestration

| T. Schulthess !21 Data Flow Architectural Developments – Traditional Architecture Research Community CSCS User

CSCS External Login Access (ELA) → Piz Daint login & management → /store and Piz Daint compute

| T. Schulthess !22 Data Flow Architectural Developments – Improved Architecture Based on an External Portal (research community, CSCS user)

Research community → domain-specific portal (repository, workflow manager, access) → CSCS External Login Access (ELA) → Piz Daint login & management → /store and Piz Daint compute. Callout on the diagram: “Does Not Scale”.

| T. Schulthess !23 Architectural developments – Service Oriented Architecture (SOA) Research Community CSCS User

Domain Specific Portal: Repository, Workflow Manager, access

CSCS Infrastructure Services

Authentication & authorization, User Management, Data Management, Workflow Automation, Capacity Management

IT Infrastructure

Networking & security, OpenStack Services, Archival Storage, DWH, Active Storage, HPC Services
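To make the service-oriented idea concrete, here is a purely hypothetical sketch (none of these endpoints or fields are real CSCS APIs): a portal obtains a token from the authentication & authorization service and then talks to a job/workflow service over HTTP instead of logging in to the machine.

```python
import requests

AUTH_URL = "https://auth.example.org/token"       # hypothetical authentication service
JOBS_URL = "https://compute.example.org/v1/jobs"  # hypothetical job/workflow service

# 1) The portal authenticates once and receives a short-lived access token.
token = requests.post(AUTH_URL, data={
    "grant_type": "client_credentials",
    "client_id": "materials-portal",
    "client_secret": "REPLACE_ME",
}).json()["access_token"]

headers = {"Authorization": f"Bearer {token}"}

# 2) Work is submitted through the service instead of an SSH login.
job = requests.post(JOBS_URL, headers=headers,
                    json={"script": "run_simulation.sh", "nodes": 4}).json()

# 3) Status (and, in the same spirit, data movement via a data-management
#    service) is queried over the same kind of interface.
status = requests.get(f"{JOBS_URL}/{job['id']}", headers=headers).json()
print(status)
```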


| T. Schulthess !24 … and the service should be up most of the time (like 99+ %)

| T. Schulthess !25 Supporting Federation using SOA

Research communities and CSCS users access domain-specific portals (repository, workflow manager, access); these build on software services and platform services, which in turn run on infrastructure services delivered by multiple infrastructure providers, each operating its own infrastructure.

| T. Schulthess !26 Fenix: federated data and compute infrastructure

| T. Schulthess !27 Summary and conclusions

• Start by investing in application development
• Access to modern data centres and hardware is necessary
• System development based on rewritten applications
• Don’t build new stuff on top of legacy codes
• The COSMO example shows the performance gain can be a (non-trivial) 10x
• (Real) cloud technologies, i.e. a Service Oriented Architecture, add tremendous value to the scientific computing enterprise
• Federated services because of resilience, not political correctness

| T. Schulthess !28