What is « HPC » ?

Mineure HPC-SBD
A short overview of High Performance Computing
Stéphane Vialle
[email protected] – http://www.metz.supelec.fr/~vialle

• High performance hardware
• Numerical algorithms
• Serial optimization – tiling
• Vectorization
• Multithreading
• Message passing
• Parallel programming

High Performance Computing: intensive computing and large scale applications
(TeraFlops, PetaFlops, ExaFlops)
« HPDA »: Big Data (TeraBytes, PetaBytes, ExaBytes)
High Performance Hardware
From core to SuperComputer

Inside high performance hardware, from the core up:
• Computing cores (instructions, ALUs, AVX units)
• Computing cores with vector units
• Multi-core processor
• Multi-core PC/node
• Multi-core PC cluster
• SuperComputer: multi-core PC cluster + hardware accelerators,
  with a high performance interconnect (e.g. Fat Tree)
High Performance Hardware
Top500 site: HPC roadmap

• Current machine in production
• New machine installed
• Expected machine: required computing power

On each generation: modeling and numerical investigations, then
implementation, debug and test (on small problems).
Regular performance improvement makes it possible to plan the future.
High Performance Hardware
HPC in the cloud ?

« Azure BigCompute »:
• High performance nodes
• High performance interconnect (InfiniBand)
• Customers can allocate a part of a HPC cluster

Generic clouds:
• Allow to allocate a huge number of nodes for a short time
• No high performance interconnection network
• Comfortable for Big Data scaling benchmarks

Some HPC or large scale PC-clusters exist in some clouds,
but no SuperComputer is available in a cloud.

Architecture issues

Why multi-core processors ? Shared or distributed memory ?
Architecture issues
Re-interpreted Moore's law

Processor performance increase has been due to parallelism since 2004: it
became impossible to dissipate the energy consumed by the semiconductor
when the frequency increased! CPU performance increase has relied on
(different) parallel mechanisms for several years, and they require
explicit parallel programming.

Bounding the frequency while increasing the number of cores is energy
efficient: two cores at 75% frequency consume about 2×0.75³ ≈ 0.84 of the
power of one full-speed core.

(W. Kirshenmann, EDF & EPI AlGorille, after a first study by SpiralGen;
source: Jack Dongarra)
Architecture issues
Re-interpreted Moore's law

Initial (electronic) Moore's law:
every 18 months, ×2 number of transistors per µm²

Previous computer science interpretation:
every 18 months, ×2 processor speed

New computer science interpretation:
every 24 months, ×2 number of cores

This leads to a massive parallelism challenge: many codes have to be split
into 100, 1000, … 10⁶ threads … 10⁷ threads!!

Architecture issues
3 classic parallel architectures

Shared-memory machines (Symmetric MultiProcessor): processors access one
memory through a network. One principle, but several implementations, with
different costs and different speeds.
(Overview of Recent Supercomputers, Aad J. van der Steen, Jack J. Dongarra)
Architecture issues
3 classic parallel architectures

Distributed-memory machines (clusters): each processor has its own memory,
and nodes communicate through an interconnection network (Gigabit Ethernet,
hypercubes, fat trees, …). Basic cluster principles, but cost and speed
depend on the interconnection network! A highly scalable architecture.

Distributed Shared Memory machines (DSM): cache-coherent Non Uniform Memory
Architecture (ccNUMA) extends the cache mechanism. Up to 1024 nodes;
supports global multithreading.
Hardware implementation: fast & expensive… Software implementation: slow &
cheap!
(Overview of Recent Supercomputers, Aad J. van der Steen, Jack J. Dongarra)
Architecture issues
3 classic parallel architectures

• Shared memory « SMP »: simple and efficient up to ~16 processors;
  a limited solution.
• Distributed memory « Cluster »: unlimited scalability, but efficiency and
  price depend on the interconnect.
• Distributed shared memory « DSM »: comfortable and efficient solution;
  efficient hardware implementation up to ~1000 processors. BUT…

Architecture issues
Evolution of parallel architectures

Top500 statistics from 1993 to 2017 (% of the 500 systems, and % of the
computing power) cover pure SIMD machines, SMP, MPP (clusters with custom
interconnect), single processors, shared-memory vector machines, and
clusters. 2016: almost all supercomputers have a cluster-like architecture.
Architecture issues
Modern parallel architecture

Cluster of NUMA nodes: one PC / one node, one PC cluster, connected by a
network. Several access times exist inside one NUMA node:

  time to access a data in the RAM of the processor
  < time to access a data in a RAM attached to another processor
  < time to send/recv a data to/from another node

« Hierarchical architecture »: {data – thread} location becomes critical.
Architecture issues
Multi-paradigms programming

Clusters of multi-processor NUMA nodes with hardware vector accelerators
are hierarchical and hybrid architectures: they require multi-paradigms
programming:
• AVX vectorization:   #pragma simd
• CPU multithreading:  #pragma omp parallel for
• Message passing:     MPI_Send(…, Me-1, …); MPI_Recv(…, Me+1, …);
• GPU vectorization:   myKernel<<<…>>>(…);

Architecture issues
Distributed application deployment

A distributed application is a set of processes deployed on machines spread
across sites (EU/France, China, USA, …). Each process has its own memory
space: code, stack and code of the main thread, stacks and code of threads
x, y, z. How to map software onto hardware resources ?
• Achieving efficient mapping
• Achieving fault tolerance: which strategy and mechanisms ?
• Achieving scalability: are my algorithm and my mapping scalable ?
Fault tolerance
Mean Time Between Failures

MTBF definition: the Mean Time Between Failures is the average time a
system works without any failure.

Can we run a very large and long parallel computation, and succeed ?
Can a one-million core parallel program run during one week ?

Experiments:
"The Cray-1 required extensive maintenance. Initially, MTBF was on the
order of 50 hours. […] it was the average time the Cray-1 worked without
any failures. Two hours of every day were typically set aside for
preventive maintenance." (Cray-1: 1976)

"Today, 20% or more of the computing capacity in a large high-performance
computing system is wasted due to failures and recoveries. Typical MTBF is
from 8 hours to 15 days. As systems increase in size to field petascale
computing capability and beyond, the MTBF will go lower and more capacity
will be lost."
(System Resilience at Extreme Scale, white paper prepared for Dr. William
Harrod, Defense Advanced Research Project Agency (DARPA); see also the
workshop report Addressing Failures in Exascale Computing, 2012-2013)

Why do we need fault tolerance ?
Processor frequency is limited and the number of cores increases, so we use
more and more cores. We do not attempt to speed up our applications: we
process larger problems in constant time (Gustafson's law). So we use more
and more cores during the same time, and the probability of failure
increases!

We (really) need fault tolerance, or large parallel applications will
never end!
Fault tolerance strategies

• High Performance Computing: big computations (batch mode).
  Checkpoint/restart is the usual solution; it complexifies the source
  code, and is time consuming and disk consuming!
• High Throughput Computing: flow of small and time-constrained tasks.
  Tasks are small and independent; a task is re-run (entirely) when a
  failure happens.
• Big Data: a different approach! Data storage redundancy, and computation
  on (frequently) incomplete data sets.

Fault tolerance in HPC remains a « hot topic ».

Energy consumption

1 PetaFlops: 2.3 MW !  1 ExaFlops: 2.3 GW !!  350 MW ? … 20 MW ?
Perhaps we will be able to build the machines, but not to pay for the
energy consumption !!
Energy consumption
How much electrical power for an Exaflops ?

1.03 Petaflops, June 2008: RoadRunner (IBM), Opteron + PowerXCell,
122 440 « cores », 500 Gb/s (IO), 2.35 MWatt.

122 Petaflops, June 2018: Summit (IBM, Oak Ridge, USA), IBM POWER9 22C
3.07GHz + NVIDIA Volta GV100, 2 282 544 « cores », 8.8 MWatt.

From Petaflops to Exaflops: 1.00 Exaflops expected around 2018-2020,
then 2020-2022. ×1000 perf (×100 cores/node, ×10 nodes), ×50 IO (25 Tb/s),
but only ×10 energy (20/35 MWatt max).

Projected power of an Exaflops machine:
• 2.0 GWatts with the flop/watt ratio of the 2008 Top500 1st machine
• 1.2 GWatts with the flop/watt ratio of the 2011 Top500 1st machine
• 350 MWatts if the flop/watt ratio increases regularly
• 20 MWatts if we succeed to improve the architecture ?
  … « the maximum energy cost we can support ! » (2010)
• 2 MWatts … « the maximum cost for a large set of customers » (2014)

And then: how to program these machines ? How to train large programmer
teams ?
Energy consumption
Sunway TaihuLight (China, N°1 2016-2017) vs Summit (USA, N°1 June 2018)

Sunway TaihuLight: 93.0 Pflops
• 41 000 Sunway SW26010 260C 1.45GHz processors: 10 649 600 « cores »
• Sunway interconnect: 5-level integrated hierarchy (InfiniBand like ?)
• 15.4 MWatt

Summit: 122.3 Pflops (×1.31)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs:
  2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR InfiniBand
• 8.8 MWatt (×0.57) → Flops/Watt: ×2.3

Energy consumption
Summit (USA): N°1 November 2018

143.5 Pflops (×1.54)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs:
  2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR InfiniBand
• 9.8 MWatt (×0.64) → Flops/Watt: ×2.4

Energy consumption
What is the sustainable architecture ?

Different strategies compete in the Top500:
• Performance at any price, with big power-hungry CPUs
  (Cray XT6: 1.7 Pflops, 6.9 MWatts; K-Computer: 10.5 Pflops, 12.6 MWatts)
• Many moderately powerful and low-power processors
  (IBM Blue Gene; product line discontinued)
• Hardware accelerators: GPU, Xeon Phi, … hybrid machines (CPU +
  accelerators), difficult to program and not suited to all problems

What is (are) the right choice(s) to reach the Exaflops ?
What is the relevant choice for « smaller » clusters ?
Cooling
Cooling is strategic !

Processors consume less energy:
• we try to limit the consumption of each processor
• processors switch to an economic mode when unused
• the flops/watt ratio improves

But processor density increases:
• a trend towards limiting the total size of the machines (floor space
  in m²)

So efficient and cheap (!) cooling is needed: cooling is often estimated at
close to 30% of the energy consumption. Optimization is mandatory!
Cooling
Optimized air flow

Air flow optimization at the input and output of the racks:
• Blue Gene architecture: high processor density
• objective of minimal (floor) footprint and minimal energy consumption
• triangular shapes added to optimize the air flow (IBM Blue Gene)

Cooling
Cold doors (air + water cooling)

A « door/grid » in which the air flow circulates is cooled by water, after
this air flow has cooled the machine. The cooling concentrates on the rack.
Cooling
Direct liquid cooling

Cold water is brought directly to the hot spot, but the water remains
isolated from the electronics.
• Experimental in 2009 (IBM experimental board, Blue Water project)
• Adopted since (IBM, BULL, …); IBM compute blade commercialized in 2014

Cooling
Liquid and immersive cooling

Boards are immersed in an electrically neutral, cooled liquid.
• Immersive liquid cooling on the CRAY-2 in 1985
• Immersive liquid cooling tested in 2012 by SGI & Novec
Cooling
Extreme air cooling

Cooling with air at ambient temperature:
• circulating at high speed
• circulating in large volume

The CPUs run close to their maximum supported temperature
(e.g. 35°C on a motherboard without problem).

There is no cooling of the air flow: economical! But the machine has to be
stopped when the ambient air is too hot (in summer)!
• A Grid'5000 machine in Grenoble (the only one in Extreme Cooling)
• The Iliium installation at CentraleSupélec in Metz (blockchain, 2018)
Interesting history of the CRAY company

A short overview of High Performance Computing

Questions ?