
Mineure HPC-SBD
A short overview of High Performance Computing
Stéphane Vialle
[email protected]
http://www.metz.supelec.fr/~vialle

What is « HPC » ?
• High performance hardware
• Numerical algorithms
• Serial optimization – tiling
• Vectorization
• Multithreading
• Parallel programming
• « HPDA »: intensive computing & large scale applications
TeraFlops, PetaFlops, ExaFlops – TeraBytes, PetaBytes, ExaBytes

High Performance Hardware

Inside … high performance hardware

High Performance Hardware – From core to SuperComputer

Computing cores

High Performance Hardware – From core to SuperComputer

Computing cores with vector units
(instructions, ALUs, AVX units)
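A hedged illustration of what these vector units execute: the following C sketch uses AVX intrinsics so that one 256-bit instruction processes 8 single-precision floats at once (the function name, the unaligned loads and the assumption that n is a multiple of 8 are illustrative choices, not taken from the slides).

    #include <immintrin.h>   /* AVX intrinsics */

    /* Hypothetical example: z[i] = x[i] + y[i], 8 floats per AVX instruction. */
    void add_avx(const float *x, const float *y, float *z, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(&x[i]);              /* load 8 floats  */
            __m256 vy = _mm256_loadu_ps(&y[i]);
            _mm256_storeu_ps(&z[i], _mm256_add_ps(vx, vy));  /* 8 adds at once */
        }
    }

In practice compilers can often generate such vector instructions automatically from a plain loop (auto-vectorization), which is one of the topics listed in the course overview.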

High Performance Hardware – From core to SuperComputer

Computing cores with vector units
→ Multi-core processor
→ Multi-core PC/node
→ Multi-core PC cluster
→ SuperComputer: fat-tree interconnect + hardware accelerators

High Performance Hardware – Top500 site: HPC roadmap

[Roadmap diagram: required computing power vs time – machine currently in production, new machine installed, expected machine beyond the present day.]
Modeling and numerical investigations, then implementation, debug and test (on small problems).
→ Regular improvement makes it possible to plan for future machines.

High Performance Hardware – HPC in the cloud ?

« AZUR BigCompute »:
• High performance nodes
• High performance interconnect (Infiniband)
→ Customers can allocate a part of a HPC cluster.

Other cloud offers:
• Allocation of a huge number of nodes for a short time
• No high performance interconnection network
→ Comfortable for Big Data scaling benchmarks.

Some HPC or large scale PC-clusters exist in some clouds, but no SuperComputer is available in a cloud.

Architecture issues
Why multi-core processors ? Shared or distributed memory ?

Architecture issues – Re-interpreted Moore's law

Processor performance increase has been due to parallelism since 2004:
CPU performance has been increasing through (different) parallel mechanisms for several years now…
… and these mechanisms require explicit parallel programming.
(W. Kirshenmann, EDF & EPI AlGorille, after a first study by SpiralGen)

It became impossible to dissipate so much electrical power in the silicon.
(Source: Jack Dongarra)

Architecture issues – Re-interpreted Moore's law

Bounding the frequency while increasing the number of cores is energy efficient:
2 × 0.75³ ≈ 0.84

It became impossible to dissipate the energy consumed by the semiconductor when the frequency kept increasing!
(Source: Jack Dongarra)
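To make the 2 × 0.75³ figure explicit, a short worked derivation (assuming the classical dynamic-power model of CMOS circuits, P ≈ C·V²·f, with the supply voltage V scaling roughly with the frequency f – a textbook approximation, not stated on the slide):

    P \;\propto\; C\,V^{2}f \quad\text{with}\quad V \propto f
    \;\;\Rightarrow\;\; P \;\propto\; f^{3}

    \frac{P_{\,2\ \text{cores at}\ 0.75f}}{P_{\,1\ \text{core at}\ f}} \;=\; 2 \times 0.75^{3} \;\approx\; 0.84
    \qquad
    \frac{\text{Perf}_{\,2\ \text{cores at}\ 0.75f}}{\text{Perf}_{\,1\ \text{core at}\ f}} \;=\; 2 \times 0.75 \;=\; 1.5

So two slower cores deliver about 1.5× the throughput for slightly less power than one full-speed core – provided the code is parallel.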

Architecture issues – Re-interpreted Moore's law

Initial (electronic) Moore's law: every 18 months → ×2 number of transistors per µm².
Previous interpretation: every 18 months → ×2 processor speed.
New computer science interpretation: every 24 months → ×2 number of cores.
Leads to a massive parallelism challenge: many codes must be split into 100, 1000, … 10⁶ threads … 10⁷ threads!!

Architecture issues – 3 classic parallel architectures

Shared-memory machines (Symmetric MultiProcessor):
[diagram: processors – network – memory]
One principle:
- several implementations,
- different costs,
- different speeds.
(Source: Overview of Recent Supercomputers, Aad J. van der Steen & Jack J. Dongarra)
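On such a shared-memory machine every core sees the same memory, so a loop can simply be spread over the cores. A minimal sketch in C with OpenMP (the vector-sum kernel and the array names are illustrative assumptions):

    #include <omp.h>

    /* All threads share the arrays a, b and c in the same memory space. */
    void vector_sum(const double *a, const double *b, double *c, int n)
    {
        #pragma omp parallel for      /* iterations distributed over the cores */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }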

Architecture issues – 3 classic parallel architectures

Distributed-memory machines (clusters):
[diagram: Mem + proc nodes connected by a network]
Cluster basic principles, but cost and speed depend on the interconnection network!
Hypercubes, fat trees, Gigabit Ethernet, …
Highly scalable architecture.

Distributed shared memory machines (DSM):
Non Uniform Memory Architecture (ccNUMA): extends the cache mechanism.
Up to 1024 nodes.
Supports global multithreading.
Hardware implementation: fast & expensive; software implementation: slow & cheap!
(Source: Overview of Recent Supercomputers, Aad J. van der Steen & Jack J. Dongarra)
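On a distributed-memory cluster, by contrast, each process owns its memory and data only moves through explicit messages over the interconnection network. A minimal sketch in C with MPI (the ring-style exchange is an illustrative assumption):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process sends its rank to the next one and receives from the previous one. */
        int token = rank, received = -1;
        int next = (rank + 1) % size, prev = (rank + size - 1) % size;
        MPI_Sendrecv(&token, 1, MPI_INT, next, 0,
                     &received, 1, MPI_INT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Process %d of %d received %d\n", rank, size, received);
        MPI_Finalize();
        return 0;
    }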

Architecture issues – 3 classic parallel architectures

• Shared memory « SMP »: simple and efficient up to ≈ 16 processors. Limited solution.
• Distributed memory « Cluster »: unlimited scalability, but efficiency and price depend on the interconnect.
• Distributed shared memory « DSM »: comfortable and efficient solution; efficient hardware implementation up to ≈ 1000 processors. BUT …

Architecture issues – Evolution of parallel architectures

[Two plots: % of computing power in the Top500 and % of the 500 systems in the Top500, 1993–2017, for pure SIMD, single processor, SMP, shared-memory vector machine, MPP (cluster with custom interconnect) and cluster architectures.]
2016: almost all supercomputers have a cluster-like architecture.

Architecture issues – Modern parallel architecture

Cluster of NUMA nodes:
[diagram: one PC cluster; one PC = one node; each node is a NUMA node with several processors and memories connected by networks]

Time to access data in the RAM attached to the local processor
< time to access data in a RAM attached to another processor of the node
  (several access times exist inside one NUMA node)
< time to send/recv data to/from another node.

→ « Hierarchical architecture »: data location becomes critical.

Architecture issues – Multi-paradigm programming

Cluster of multi-processor NUMA nodes with hardware vector accelerators
→ hierarchical and hybrid architectures:
• AVX vectorization:       #pragma …
• + CPU multithreading:    #pragma omp parallel for
• + message passing:       MPI_Send(…, Me-1, …);  MPI_Recv(…, Me+1, …);
• + GPU vectorization:     myKernel<<<…>>>(…)
• + checkpointing
→ multi-paradigm programming (or a new high-level paradigm…) – a minimal hybrid sketch is given below.

Architecture issues – Distributed application deployment

Distributed software architecture: several processes (each with the memory space of the process and of its threads, its code, the stack and code of the main thread, and the stacks of threads x, y, z) deployed on a distributed hardware architecture (e.g. nodes in EU/France and in the USA).
• Achieving efficient mapping of software onto hardware resources?
• Achieving fault tolerance: which strategy and mechanisms?
• Achieving scalability: are my mapping and fault-tolerance strategies adapted to larger systems?
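To make the combination of paradigms concrete, here is a minimal hedged sketch in C mixing message passing (MPI), node-level multithreading (OpenMP) and core-level vectorization (OpenMP simd) on a block-wise reduction; the block size, the array initialization and the absence of GPU offloading and checkpointing are illustrative simplifications, not the course's reference code:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_LOCAL 1000000   /* assumed block size per MPI process */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *a = malloc(N_LOCAL * sizeof(double));
        double local_sum = 0.0, global_sum = 0.0;

        /* Multithreading across the cores of the node + vectorization inside each core. */
        #pragma omp parallel for simd reduction(+:local_sum)
        for (int i = 0; i < N_LOCAL; i++) {
            a[i] = rank + i * 1e-9;
            local_sum += a[i];
        }

        /* Message passing between the nodes of the cluster. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        free(a);
        MPI_Finalize();
        return 0;
    }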

Fault tolerance – Fault tolerance in HPC

Can we run a very large and long parallel computation, and succeed?
Can a one-million-core parallel program run during one week?

Fault tolerance – Mean Time Between Failures

MTBF definition: the average time a system runs without any failure.

Fault tolerance – Mean Time Between Failures

Experiments:
« The Cray-1 required extensive maintenance. Initially, MTBF was on the order of 50 hours. MTBF is Mean Time Between Failures, and in this case, it was the average time the Cray-1 worked without any failures. Two hours of every day was typically set aside for preventive maintenance… » (Cray-1: 1976)
– System Resilience at Extreme Scale, White Paper prepared for Dr. William Harrod, Defense Advanced Research Projects Agency (DARPA)

« Today, 20% or more of the computing capacity in a large high-performance computing system is wasted due to failures and recoveries. Typical MTBF is from 8 hours to 15 days. As systems increase in size […], the MTBF will go lower and more capacity will be lost. »
– Report produced by a workshop on "Addressing Failures in Exascale Computing", 2012–2013

Fault tolerance – Why do we need fault tolerance ?

Processor frequency is limited and the number of cores increases
→ we use more and more cores.
We do not attempt to speed up our applications
→ we process larger problems in constant time! (Gustafson's law)
We use more and more cores during the same time
→ the probability of failure increases!

→ We (really) need fault tolerance, or large parallel applications will never end!
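A hedged back-of-the-envelope illustration of why scale hurts (the 5-year per-node MTBF and the 10 000-node machine are assumed figures, not taken from the slides): if node failures are independent, the MTBF of an N-node system is roughly the node MTBF divided by N,

    \text{MTBF}_{\text{system}} \;\approx\; \frac{\text{MTBF}_{\text{node}}}{N}
    \qquad\Rightarrow\qquad
    \frac{5 \times 365 \times 24\ \text{h}}{10\,000} \;\approx\; 4.4\ \text{h},

so a week-long run on such a machine has essentially no chance to finish without a fault-tolerance mechanism.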

Fault tolerance – Fault tolerance strategies

High Performance Computing: big computations (batch mode)
→ Checkpoint/restart is the usual solution
→ Complexifies the source code, time consuming, disk consuming!

High Throughput Computing: flow of small and time-constrained tasks
→ Small and independent tasks
→ A task is re-run (entirely) when a failure happens.

Big Data: a different approach!
→ Data storage redundancy
→ Computation on (frequently) incomplete data sets …

Fault tolerance in HPC remains a « hot topic ».

Energy consumption

1 PetaFlops: 2.3 MW!  →  1 ExaFlops: 2.3 GW!!  350 MW!  …  20 MW?
Perhaps we will be able to build the machines, but not be able to pay for their energy consumption!!
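A minimal sketch of the checkpoint/restart strategy mentioned above, in C (the file name, the single-variable state and the checkpoint period are illustrative assumptions; real HPC codes checkpoint large distributed states, often through dedicated libraries):

    #include <stdio.h>

    #define N_STEPS      1000000
    #define CKPT_PERIOD  10000
    #define CKPT_FILE    "state.ckpt"   /* assumed checkpoint file name */

    int main(void)
    {
        long step = 0;
        double state = 0.0;

        /* Restart: reload the last checkpoint if one exists. */
        FILE *f = fopen(CKPT_FILE, "rb");
        if (f) {
            fread(&step, sizeof step, 1, f);
            fread(&state, sizeof state, 1, f);
            fclose(f);
        }

        for (; step < N_STEPS; step++) {
            state += 1e-6 * step;               /* the "computation" */

            if (step % CKPT_PERIOD == 0) {      /* periodic checkpoint */
                f = fopen(CKPT_FILE, "wb");
                if (f) {
                    fwrite(&step, sizeof step, 1, f);
                    fwrite(&state, sizeof state, 1, f);
                    fclose(f);
                }
            }
        }
        printf("final state = %f\n", state);
        return 0;
    }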

Energy consumption – How much electrical power for an Exaflops ?

1.0 Exaflops should be reached close to 2020:
• 2.0 GWatts with the flops/watt ratio of the 2008 Top500 #1 machine
• 1.2 GWatts with the flops/watt ratio of the 2011 Top500 #1 machine
• 350 MWatts if the flops/watt ratio keeps increasing regularly
  … « the maximum energy cost we can support! » (2010)
• 20 MWatts if we succeed in improving the architecture?
  … « the maximum cost for a large set of customers » (2014)
• How to program these machines?
• How to train large teams?

Energy consumption – From Petaflops to Exaflops

1.03 Petaflops – June 2008: RoadRunner (IBM), Opteron + PowerXCell, 122 440 « cores », 2.35 MWatt, 500 Gb/s (IO).
122 Petaflops – June 2018: Summit (IBM, Oak Ridge, USA), IBM POWER9 22C 3.07GHz + Volta GV100, 2 282 544 « cores », 8.8 MWatt, 25 Tb/s (IO).
1.00 Exaflops – expected 2018-2020, then 2020-2022: ×1000 computing power with ×100 cores/node and ×10 nodes, ×50 IO, but only ×10 energy → 20/35 MWatt max.
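The gigawatt order of magnitude can be checked from the RoadRunner numbers above (a rough worked estimate based only on the flops/watt ratio):

    \frac{1.03 \times 10^{15}\ \text{flops}}{2.35\ \text{MW}} \;\approx\; 0.44\ \text{Gflops/W}
    \qquad\Rightarrow\qquad
    \frac{10^{18}\ \text{flops}}{0.44 \times 10^{9}\ \text{flops/W}} \;\approx\; 2.3\ \text{GW},

consistent with the « 1 ExaFlops : 2.3 GW » warning on the previous slide.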

Energy consumption – TaihuLight (China): N°1 2016-2017

• 93.0 Pflops
• 41 000 Sunway SW26010 260C 1.45GHz processors → 10 649 600 « cores »
• Sunway interconnect: 5-level integrated hierarchy (Infiniband-like?)
• 15.4 MWatt

Energy consumption – Summit (USA): N°1 June 2018

• 122.3 Pflops (×1.31)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs → 2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR Infiniband
• 8.8 MWatt (×0.57)

Flops/Watt: ×2.3
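The ×2.3 energy-efficiency factor follows directly from the figures above (a simple worked check):

    \frac{122.3\ \text{Pflops} \,/\, 8.8\ \text{MW}}{93.0\ \text{Pflops} \,/\, 15.4\ \text{MW}}
    \;=\; \frac{13.9\ \text{Gflops/W}}{6.0\ \text{Gflops/W}} \;\approx\; 2.3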

Energy consumption – Summit (USA): N°1 November 2018

• 143.5 Pflops (×1.54)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs → 2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR Infiniband
• 9.8 MWatt (×0.64)

Flops/Watt: ×2.4

Energy consumption – What is the sustainable architecture ?

Different strategies compete in the Top500:
• Performance at any price, with big power-hungry CPUs:
  Cray XT6: 1.7 Pflops, 6.9 MWatts – K-Computer: 10.5 Pflops, 12.6 MWatts
• Many moderately powerful, low-power processors: IBM Blue Gene (discontinued product line)
• Hardware accelerators: GPU, Xeon Phi, …
  → hybrid machines (CPU + accelerators), hard to program and not suited to every problem

What is (are) the right choice(s) to reach the Exaflops?
What is the relevant choice for « smaller » clusters?

Cooling – Cooling is strategic !

Processors consume less energy:
• the consumption of each processor is limited
• processors switch to an economy mode when they are idle
• the flops/watt efficiency keeps improving
But processor density is increasing:
• a trend towards limiting the total floor space of the machines (in m²)

→ Need for efficient and cheap (!) cooling.
Cooling is often estimated at close to 30% of the energy consumption: optimization is mandatory!

Cooling – Optimized air flow

Optimization of the air flow at the cabinet inlets and outlets:
• Blue Gene architecture: high processor density
• Objectives: minimal floor footprint and minimal energy consumption
• Triangular shapes added to optimize the air flow (IBM Blue Gene)

Cooling – Cold door (air + water cooling)

A « door/grid » is cooled by water; the air flow that has just cooled the machine passes through it.
The cooling is concentrated on the cabinet.

Cooling – Direct liquid cooling

Cold water is brought directly onto the hot spots, but the water remains isolated from the electronics.
• Experimental in 2009: IBM experimental board (Blue Water project)
• Adopted since then (IBM, BULL, …): IBM compute blade commercialized in 2014

Cooling – Liquid and immersive cooling

Boards are immersed in an electrically neutral, cooled liquid.
• Immersive liquid cooling on the CRAY-2 in 1985
• Immersive liquid cooling tested in 2012 by SGI & Novec

Cooling – Extreme air cooling

Cooling with air at ambient temperature:
- circulating at high speed
- circulating in large volumes

→ The CPUs run close to their maximum supported temperature
  (e.g. 35°C on a motherboard without any problem)
→ The air flow itself is not cooled: economical!
But the machine must be stopped when the ambient air is too hot (in summer)!

Examples: a Grid'5000 machine in Grenoble (the only one using extreme cooling),
and the Iliium installation at CentraleSupélec in Metz (blockchain, 2018).

Interesting history of the CRAY company

A short overview of High Performance Computing

Questions ?
