What is « HPC » ?

Mineure HPC-SBD
A short overview of High Performance Computing
Stéphane Vialle
[email protected] – http://www.metz.supelec.fr/~vialle

• High performance hardware
• Numerical algorithms
• Serial optimization – tiling
• Vectorization
• Multithreading
• Message passing
• Parallel programming

High Performance Computing: intensive computing and large scale applications
(TeraFlops, PetaFlops, ExaFlops)
« HPDA »: Big Data (TeraBytes, PetaBytes, ExaBytes)
High Performance Hardware
From core to SuperComputer

Inside high performance hardware, from the core up:
• Computing cores (instructions, ALUs, AVX units)
• Computing cores with vector units
• Multi-core processor
• Multi-core PC/node
• Multi-core PC cluster
• SuperComputer: multi-core PC cluster + hardware accelerators,
  with a high performance interconnect (e.g. Fat Tree)
High Performance Hardware
Top500 site: HPC roadmap

• Current machine in production
• New machine installed
• Expected machine: required computing power

On each generation: modeling and numerical investigations, then
implementation, debug and test (on small problems).
Regular performance improvement makes it possible to plan the future.
High Performance Hardware
HPC in the cloud ?

« Azure BigCompute »:
• High performance nodes
• High performance interconnect (InfiniBand)
• Customers can allocate a part of a HPC cluster

Generic clouds:
• Allow to allocate a huge number of nodes for a short time
• No high performance interconnection network
• Comfortable for Big Data scaling benchmarks

Some HPC or large scale PC-clusters exist in some clouds,
but no SuperComputer is available in a cloud.

Architecture issues

Why multi-core processors ? Shared or distributed memory ?
Architecture issues
Re-interpreted Moore's law

Processor performance increase has been due to parallelism since 2004: it
became impossible to dissipate the energy consumed by the semiconductor
when the frequency increased! CPU performance increase has relied on
(different) parallel mechanisms for several years, and they require
explicit parallel programming.

Bounding the frequency while increasing the number of cores is energy
efficient: two cores at 75% frequency consume about 2×0.75³ ≈ 0.84 of the
power of one full-speed core.

(W. Kirshenmann, EDF & EPI AlGorille, after a first study by SpiralGen;
source: Jack Dongarra)
Architecture issues
Re-interpreted Moore's law

Initial (electronic) Moore's law:
every 18 months, ×2 number of transistors per µm²

Previous computer science interpretation:
every 18 months, ×2 processor speed

New computer science interpretation:
every 24 months, ×2 number of cores

This leads to a massive parallelism challenge: many codes have to be split
into 100, 1000, … 10⁶ threads … 10⁷ threads!!

Architecture issues
3 classic parallel architectures

Shared-memory machines (Symmetric MultiProcessor): processors access one
memory through a network. One principle, but several implementations, with
different costs and different speeds.
(Overview of Recent Supercomputers, Aad J. van der Steen, Jack J. Dongarra)
Architecture issues
3 classic parallel architectures

Distributed-memory machines (clusters): each processor has its own memory,
and nodes communicate through an interconnection network (Gigabit Ethernet,
hypercubes, fat trees, …). Basic cluster principles, but cost and speed
depend on the interconnection network! A highly scalable architecture.

Distributed Shared Memory machines (DSM): cache-coherent Non Uniform Memory
Architecture (ccNUMA) extends the cache mechanism. Up to 1024 nodes;
supports global multithreading.
Hardware implementation: fast & expensive… Software implementation: slow &
cheap!
(Overview of Recent Supercomputers, Aad J. van der Steen, Jack J. Dongarra)
Architecture issues
3 classic parallel architectures

• Shared memory « SMP »: simple and efficient up to ~16 processors;
  a limited solution.
• Distributed memory « Cluster »: unlimited scalability, but efficiency and
  price depend on the interconnect.
• Distributed shared memory « DSM »: comfortable and efficient solution;
  efficient hardware implementation up to ~1000 processors. BUT…

Architecture issues
Evolution of parallel architectures

Top500 statistics from 1993 to 2017 (% of the 500 systems, and % of the
computing power) cover pure SIMD machines, SMP, MPP (clusters with custom
interconnect), single processors, shared-memory vector machines, and
clusters. 2016: almost all supercomputers have a cluster-like architecture.
Architecture issues
Modern parallel architecture

Cluster of NUMA nodes: one PC / one node, one PC cluster, connected by a
network. Several access times exist inside one NUMA node:

  time to access a data in the RAM of the processor
  < time to access a data in a RAM attached to another processor
  < time to send/recv a data to/from another node

« Hierarchical architecture »: {data – thread} location becomes critical.
Architecture issues
Multi-paradigms programming

Clusters of multi-processor NUMA nodes with hardware vector accelerators
are hierarchical and hybrid architectures: they require multi-paradigms
programming:
• AVX vectorization:   #pragma simd
• CPU multithreading:  #pragma omp parallel for
• Message passing:     MPI_Send(…, Me-1, …); MPI_Recv(…, Me+1, …);
• GPU vectorization:   myKernel<<<…>>>(…);

Architecture issues
Distributed application deployment

A distributed application is a set of processes deployed on machines spread
across sites (EU/France, China, USA, …). Each process has its own memory
space: code, stack and code of the main thread, stacks and code of threads
x, y, z. How to map software onto hardware resources ?
• Achieving efficient mapping
• Achieving fault tolerance: which strategy and mechanisms ?
• Achieving scalability: are my algorithm and my mapping scalable ?
Fault tolerance
Mean Time Between Failures

MTBF definition: the Mean Time Between Failures is the average time a
system works without any failure.

Can we run a very large and long parallel computation, and succeed ?
Can a one-million core parallel program run during one week ?

Experiments:
"The Cray-1 required extensive maintenance. Initially, MTBF was on the
order of 50 hours. […] it was the average time the Cray-1 worked without
any failures. Two hours of every day were typically set aside for
preventive maintenance." (Cray-1: 1976)

"Today, 20% or more of the computing capacity in a large high-performance
computing system is wasted due to failures and recoveries. Typical MTBF is
from 8 hours to 15 days. As systems increase in size to field petascale
computing capability and beyond, the MTBF will go lower and more capacity
will be lost."
(System Resilience at Extreme Scale, white paper prepared for Dr. William
Harrod, Defense Advanced Research Project Agency (DARPA); see also the
workshop report Addressing Failures in Exascale Computing, 2012-2013)

Why do we need fault tolerance ?
Processor frequency is limited and the number of cores increases, so we use
more and more cores. We do not attempt to speed up our applications: we
process larger problems in constant time (Gustafson's law). So we use more
and more cores during the same time, and the probability of failure
increases!

We (really) need fault tolerance, or large parallel applications will
never end!
Fault tolerance strategies

• High Performance Computing: big computations (batch mode).
  Checkpoint/restart is the usual solution; it complexifies the source
  code, and is time consuming and disk consuming!
• High Throughput Computing: flow of small and time-constrained tasks.
  Tasks are small and independent; a task is re-run (entirely) when a
  failure happens.
• Big Data: a different approach! Data storage redundancy, and computation
  on (frequently) incomplete data sets.

Fault tolerance in HPC remains a « hot topic ».

Energy consumption

1 PetaFlops: 2.3 MW !  1 ExaFlops: 2.3 GW !!  350 MW ? … 20 MW ?
Perhaps we will be able to build the machines, but not to pay for the
energy consumption !!
Energy consumption
How much electrical power for an Exaflops ?

1.03 Petaflops, June 2008: RoadRunner (IBM), Opteron + PowerXCell,
122 440 « cores », 500 Gb/s (IO), 2.35 MWatt.

122 Petaflops, June 2018: Summit (IBM, Oak Ridge, USA), IBM POWER9 22C
3.07GHz + NVIDIA Volta GV100, 2 282 544 « cores », 8.8 MWatt.

From Petaflops to Exaflops: 1.00 Exaflops expected around 2018-2020,
then 2020-2022. ×1000 perf (×100 cores/node, ×10 nodes), ×50 IO (25 Tb/s),
but only ×10 energy (20/35 MWatt max).

Projected power of an Exaflops machine:
• 2.0 GWatts with the flop/watt ratio of the 2008 Top500 1st machine
• 1.2 GWatts with the flop/watt ratio of the 2011 Top500 1st machine
• 350 MWatts if the flop/watt ratio increases regularly
• 20 MWatts if we succeed to improve the architecture ?
  … « the maximum energy cost we can support ! » (2010)
• 2 MWatts … « the maximum cost for a large set of customers » (2014)

And then: how to program these machines ? How to train large programmer
teams ?
Energy consumption
Sunway TaihuLight (China, N°1 2016-2017) vs Summit (USA, N°1 June 2018)

Sunway TaihuLight: 93.0 Pflops
• 41 000 Sunway SW26010 260C 1.45GHz processors: 10 649 600 « cores »
• Sunway interconnect: 5-level integrated hierarchy (InfiniBand like ?)
• 15.4 MWatt

Summit: 122.3 Pflops (×1.31)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs:
  2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR InfiniBand
• 8.8 MWatt (×0.57) → Flops/Watt: ×2.3

Energy consumption
Summit (USA): N°1 November 2018

143.5 Pflops (×1.54)
• 9 216 IBM POWER9 22C 3.07GHz processors + 27 648 Volta GV100 GPUs:
  2 282 544 « cores »
• Interconnect: dual-rail Mellanox EDR InfiniBand
• 9.8 MWatt (×0.64) → Flops/Watt: ×2.4

Energy consumption
What is the sustainable architecture ?

Different strategies compete in the Top500:
• Performance at any price, with big power-hungry CPUs
  (Cray XT6: 1.7 Pflops, 6.9 MWatts; K-Computer: 10.5 Pflops, 12.6 MWatts)
• Many moderately powerful and low-power processors
  (IBM Blue Gene; product line discontinued)
• Hardware accelerators: GPU, Xeon Phi, … hybrid machines (CPU +
  accelerators), difficult to program and not suited to all problems

What is (are) the right choice(s) to reach the Exaflops ?
What is the relevant choice for « smaller » clusters ?
Cooling
Cooling is strategic !

Processors consume less energy:
• we try to limit the consumption of each processor
• processors switch to an economic mode when unused
• the flops/watt ratio improves

But processor density increases:
• a trend towards limiting the total size of the machines (floor space
  in m²)

So efficient and cheap (!) cooling is needed: cooling is often estimated at
close to 30% of the energy consumption. Optimization is mandatory!
Cooling
Optimized air flow

Air flow optimization at the input and output of the racks:
• Blue Gene architecture: high processor density
• objective of minimal (floor) footprint and minimal energy consumption
• triangular shapes added to optimize the air flow (IBM Blue Gene)

Cooling
Cold doors (air + water cooling)

A « door/grid » in which the air flow circulates is cooled by water, after
this air flow has cooled the machine. The cooling concentrates on the rack.
Cooling
Direct liquid cooling

Cold water is brought directly to the hot spot, but the water remains
isolated from the electronics.
• Experimental in 2009 (IBM experimental board, Blue Water project)
• Adopted since (IBM, BULL, …); IBM compute blade commercialized in 2014

Cooling
Liquid and immersive cooling

Boards are immersed in an electrically neutral, cooled liquid.
• Immersive liquid cooling on the CRAY-2 in 1985
• Immersive liquid cooling tested in 2012 by SGI & Novec
Cooling
Extreme air cooling

Cooling with air at ambient temperature:
• circulating at high speed
• circulating in large volume

The CPUs run close to their maximum supported temperature
(e.g. 35°C on a motherboard without problem).

There is no cooling of the air flow: economical! But the machine has to be
stopped when the ambient air is too hot (in summer)!
• A Grid'5000 machine in Grenoble (the only one in Extreme Cooling)
• The Iliium installation at CentraleSupélec in Metz (blockchain, 2018)
Interesting history of the CRAY company

A short overview of High Performance Computing

Questions ?