High Performance Computing ADVANCED SCIENTIFIC COMPUTING

Prof. Dr. – Ing. Morris Riedel, Adjunct Associate Professor, School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland; Research Group Leader, Juelich Supercomputing Centre, Forschungszentrum Juelich, Germany

@MorrisRiedel LECTURE 1

High Performance Computing

September 5, 2019 Room V02-258 Review of Practical Lecture 0.2 – Short Intro to Programming & Scheduling

. Many C/C++ Programs used in HPC today . Multi-user HPC usage needs Scheduling

Many multi-physics application challenges on different scales & levels of granularity drive the need for HPC
[3] LLview

[1] Terrestrial Systems SimLab [2] NEST Web page


Lecture 1 – High Performance Computing 2/ 50 Outline of the Course

1. High Performance Computing
2. Parallel Programming with MPI
3. Parallelization Fundamentals
4. Advanced MPI Techniques
5. Parallel Algorithms & Data Structures
6. Parallel Programming with OpenMP
7. Graphical Processing Units (GPUs)
8. Parallel & Scalable Machine & Deep Learning
9. Debugging & Profiling & Performance Toolsets
10. Hybrid Programming & Patterns
11. Scientific Visualization & Scalable Infrastructures
12. Terrestrial Systems & Climate
13. Systems Biology & Bioinformatics
14. Molecular Systems & Libraries
15. Computational Fluid Dynamics & Finite Elements
16. Epilogue

+ additional practical lectures & webinars for our hands-on assignments in context
(lectures are grouped into practical topics and theoretical / conceptual topics)

Lecture 1 – High Performance Computing 3/ 50 Outline

. High Performance Computing (HPC) Basics . Four basic building blocks of HPC . TOP500 & Performance Benchmarks . Multi-core CPU Processors . Shared Memory & Distributed Memory Architectures . Hybrid Architectures & Programming

. HPC Ecosystem Technologies . HPC System Software Environment – Revisited . System Architectures & Network Topologies . Many-core GPUs & Supercomputing Co-Design . Relationships to Big Data & Machine/Deep Learning . Data Access & Large-scale Infrastructures

Lecture 1 – High Performance Computing 4/ 50 Selected Learning Outcomes

. Students understand …
. Latest developments in parallel processing & high performance computing (HPC)
. How to create and use high-performance clusters
. What are scalable networks & data-intensive workloads
. The importance of domain decomposition
. Complex aspects of parallel programming
. HPC environment tools that support programming or analyze behaviour
. Different abstractions of parallel computing on various levels
. Foundations and approaches of scientific domain-specific applications

. Students are able to …
. Program and use HPC programming paradigms
. Take advantage of innovative scientific computing simulations & technology
. Work with technologies and tools to handle parallelism complexity

Lecture 1 – High Performance Computing 5 / 50 High Performance Computing (HPC) Basics

Lecture 1 – High Performance Computing 6/ 50 What is High Performance Computing?

. Wikipedia redirects from 'HPC' to 'Supercomputer' . Interesting – this already gives us a hint of what it is generally about

. A supercomputer is a computer at the frontline of contemporary processing capacity – particularly speed of calculation

[4] Wikipedia ‘Supercomputer’ Online

. HPC includes work on 'four basic building blocks' in this course
. Theory (numerical laws, physical models, speed-up performance, etc.)
. Technology (multi-core, networks, storage, etc.)
. Architecture (shared-memory, distributed-memory, interconnects, etc.)
. Software (libraries, schedulers, monitoring, applications, etc.)

[5] Introduction to High Performance Computing for Scientists and Engineers

Lecture 1 – High Performance Computing 7/ 50 Understanding High Performance Computing (HPC) – Revisited

. High Performance Computing (HPC) is based on computing resources that enable the efficient use of parallel computing techniques through specific support with dedicated hardware such as high performance CPU/core interconnections.

HPC: network interconnection important

. High Throughput Computing (HTC) is based on commonly available computing resources such as commodity PCs and small clusters that enable the execution of 'farming jobs' without providing a high performance interconnection between the CPU/cores.

HTC: network interconnection less important!

 The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course focuses on High Throughput Computing

Lecture 1 – High Performance Computing 8/ 50 Parallel Computing

. All modern supercomputers depend heavily on parallelism . Parallelism can be achieved with many different approaches

. We speak of parallel computing whenever a number of ‘compute elements’ (e.g. cores) solve a problem in a cooperative way

[5] Introduction to High Performance Computing for Scientists and Engineers

. Often known as 'parallel processing' of some problem space
. Tackle problems in parallel to enable the 'best performance' possible
. Includes not only parallel computing, but also parallel input/output (I/O)
. 'The measure of speed' in High Performance Computing matters
. Common measure for parallel computers established by TOP500 list

. Based on benchmark for ranking the best 500 computers worldwide [6] TOP500 Supercomputing Sites

 Lecture 3 will give in-depth details on parallelization fundamentals & performance term relationships & theoretical considerations

Lecture 1 – High Performance Computing 9/ 50 TOP 500 List (June 2019)

[TOP500 list, June 2019: figure annotations highlight the power challenge and the highest-ranked EU system (EU #1)]

[6] TOP500 Supercomputing Sites

Lecture 1 – High Performance Computing 10 / 50 LINPACK Benchmarks and Alternatives

. TOP500 ranking is based on the LINPACK benchmark

. LINPACK solves a dense system of linear equations of unspecified size [7] LINPACK Benchmark implementation

. LINPACK covers only a single architectural aspect ('critics exist')
. Measures 'peak performance': all involved 'supercomputer elements' operate at maximum performance
. Available through a wide variety of 'open source implementations'
. Success via 'simplicity & ease of use', thus it has been used for over two decades (a minimal sketch of this kind of computation follows below)
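As a rough illustration of the kind of computation LINPACK times (a minimal sketch, not the actual HPL code, which uses blocked, parallel routines on very large matrices), the following C program solves a small dense linear system by Gaussian elimination with partial pivoting; the matrix size and values are made up for illustration.

/* Minimal sketch: solve a dense linear system A*x = b by Gaussian
 * elimination with partial pivoting. Only illustrates the kind of
 * computation LINPACK/HPL times; the real benchmark uses blocked,
 * parallel routines and much larger matrices. */
#include <stdio.h>
#include <math.h>

#define N 3  /* illustrative size; real HPL runs use very large N */

int main(void)
{
    double A[N][N] = {{4, 1, 2}, {1, 3, 0}, {2, 0, 5}};
    double b[N]    = {7, 4, 7};
    double x[N];

    /* forward elimination with partial pivoting */
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (int j = 0; j < N; j++) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        double t = b[k]; b[k] = b[p]; b[p] = t;

        for (int i = k + 1; i < N; i++) {
            double f = A[i][k] / A[k][k];
            for (int j = k; j < N; j++) A[i][j] -= f * A[k][j];
            b[i] -= f * b[k];
        }
    }

    /* back substitution */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }

    for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, x[i]);
    return 0;
}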

. Realistic application benchmark suites might be alternatives
. HPC Challenge benchmarks (includes 7 tests) [8] HPC Challenge Benchmark Suite
. JUBE benchmark suite (based on real applications) [9] JUBE Benchmark Suite

Lecture 1 – High Performance Computing 11 / 50 Multi-core CPU Processors

. Significant advances in CPUs (or microprocessor chips)
. Multi-core architecture with dual, quad, six, or n processing cores
. Processing cores are all on one chip

. Multi-core CPU chip architecture
. Hierarchy of caches (on/off chip)
. L1 cache is private to each core; on-chip
. L2 cache is shared; on-chip
. L3 cache or Dynamic Random Access Memory (DRAM); off-chip
[10] Distributed & Cloud Computing Book

. Clock-rate for single processors increased from 10 MHz (Intel 286) to 4 GHz (Pentium 4) in 30 years
. A further clock-rate increase beyond roughly 5 GHz unfortunately reached a limit due to power limitations / heat
. Multi-core CPU chips therefore have quad, six, or n processing cores on one chip and use cache hierarchies (see the core-count sketch below)
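As a small aside, a minimal C sketch (assuming a Linux/POSIX system) shows how a program can ask the operating system how many processing cores such a multi-core node exposes:

/* Minimal sketch (assumes a Linux/POSIX system): query how many processing
 * cores the operating system exposes on one multi-core node. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores currently online */
    printf("This node exposes %ld processing cores\n", cores);
    return 0;
}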

Lecture 1 – High Performance Computing 12 / 50 Dominant Architectures of HPC Systems

. Traditionally two dominant types of architectures
. Shared-Memory Computers: shared-memory parallelization with OpenMP
. Distributed-Memory Computers: distributed-memory parallel programming with the Message Passing Interface (MPI) standard

. Often hierarchical (hybrid) systems of both in practice
. In the last couple of years the community has been dominated by X86-based commodity clusters running the Linux OS on Intel/AMD processors
. More recently, accelerators also play a significant role (e.g., many-core chips)

. More recently
. Both of the above are also considered as 'programming models'
. Emerging computing models are getting relevant for HPC: e.g., quantum devices, neuromorphic devices

Lecture 1 – High Performance Computing 13 / 50 Shared-Memory Computers

. A shared-memory parallel computer is a system in which a number of CPUs work on a common, shared physical address space

[5] Introduction to High Performance Computing for Scientists and Engineers

. Two varieties of shared-memory systems: 1. Uniform Memory Access (UMA) 2. Cache-coherent Nonuniform Memory Access (ccNUMA)

. The problem of 'cache coherence' (in UMA/ccNUMA systems)
. Different CPUs use caches and may 'modify the same cached values'
. Consistency between cached data & data in memory must be guaranteed
. 'Cache coherence protocols' ensure a consistent view of memory (a small sketch of coherence effects follows below)
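A minimal C/OpenMP sketch (a hypothetical micro-benchmark, not part of the lecture material) shows cache coherence at work: two threads update counters that happen to share one cache line, so the coherence protocol repeatedly invalidates the other core's cached copy ('false sharing'), while counters padded onto separate cache lines avoid that traffic.

/* Hypothetical micro-benchmark: two OpenMP threads increment counters.
 * In the first run both counters share one cache line, so the cache
 * coherence protocol keeps invalidating the other core's copy ("false
 * sharing"); in the second run each counter is padded onto its own line. */
#include <stdio.h>
#include <omp.h>

#define ITER 100000000L

int main(void)
{
    long shared_line[2] = {0, 0};                            /* same cache line */
    struct { long v; char pad[64]; } padded[2] = {{0}, {0}}; /* one line each   */

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITER; i++) shared_line[id]++;   /* coherence traffic */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITER; i++) padded[id].v++;      /* no false sharing  */
    }
    double t2 = omp_get_wtime();

    printf("shared line: %.2f s   padded: %.2f s   (checks: %ld %ld)\n",
           t1 - t0, t2 - t1,
           shared_line[0] + shared_line[1], padded[0].v + padded[1].v);
    return 0;
}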

Lecture 1 – High Performance Computing 14 / 50 Shared-Memory with UMA

. UMA systems use ‘flat memory model’: Latencies and bandwidth are the same for all processors and all memory locations. . Also called Symmetric Multiprocessing (SMP) [5] Introduction to High Performance Computing for Scientists and Engineers

. Selected Features
. Socket is a physical package (with multiple cores), typically a replaceable component
. Two dual-core chips (2 cores/socket)
. P = Processor core
. L1D = Level 1 Cache – Data (fastest)
. L2 = Level 2 Cache (fast)
. Memory = main memory (slow)
. Chipset = enforces cache coherence and mediates connections to memory

Lecture 1 – High Performance Computing 15 / 50 Shared-Memory with ccNUMA

. ccNUMA systems logically share memory that is physically distributed (similar to distributed-memory systems)
. Network logic makes the aggregated memory appear as one single address space [5] Introduction to High Performance Computing for Scientists and Engineers

. Selected Features . Eight cores (4 cores/socket); L3 = Level 3 Cache . Memory interface = establishes a coherent link to enable one ‘logical’ single address space of ‘physically distributed memory’

Lecture 1 – High Performance Computing 16 / 50 Programming with Shared Memory using OpenMP

. Shared-memory programming enables immediate access to all data from all processors without explicit communication
. OpenMP is the dominant shared-memory programming standard today (v3)
. OpenMP is a set of compiler directives to 'mark parallel regions' (see the minimal sketch below)

[11] OpenMP API Specification

. Features
. Bindings are defined for the C, C++, and Fortran languages
. Threads TX are 'lightweight processes' that mutually access data

[Figure: threads T1–T5 accessing shared memory]
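A minimal C sketch of such a compiler directive, assuming a compiler with OpenMP support (e.g., compiled with gcc -fopenmp); the array size is illustrative only.

/* Minimal OpenMP sketch: the pragma marks a parallel region in which
 * lightweight threads share the arrays a, b, c without any explicit
 * communication. Compile e.g. with: gcc -fopenmp example.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* work-sharing loop: iterations are distributed across the threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + 2.0 * b[i];

    printf("c[0] = %f, threads available: %d\n", c[0], omp_get_max_threads());
    return 0;
}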

 Lecture 6 will give in-depth details on the shared-memory programming model with OpenMP and using its compiler directives

Lecture 1 – High Performance Computing 17 / 50 Distributed-Memory Computers

. A distributed-memory parallel computer establishes a ‘system view’ where no process can access another process’ memory directly

[5] Introduction to High Performance Computing for Scientists and Engineers

. Features
. Processors communicate via Network Interfaces (NI)
. NI mediates the connection to a Communication network
. This setup is rarely used as pure hardware today; it rather survives as a programming model view

Lecture 1 – High Performance Computing 18 / 50 Programming with Distributed Memory using MPI

. Distributed-memory programming enables explicit message passing as communication between processors
. Message Passing Interface (MPI) is the dominant distributed-memory programming standard today (available in many different versions)
. MPI is a standard defined and developed by the MPI Forum

[12] MPI Standard

. Features
. No remote memory access on distributed-memory systems
. Requires to 'send messages' back and forth between processes PX
. Many free Message Passing Interface (MPI) libraries available
. Programming is tedious & complicated, but the most flexible method (see the minimal sketch below)
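A minimal C sketch of explicit message passing between two processes, assuming an MPI installation with the usual mpicc wrapper and mpirun launcher (exact commands depend on the system); the payload value is illustrative only.

/* Minimal MPI sketch: each process PX has its own memory, so data is
 * exchanged by explicitly sending messages. Compile e.g. with
 * mpicc example.c and run with e.g. mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int value = 42;                      /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("process 0 of %d sent %d\n", size, value);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 of %d received %d\n", size, value);
    }

    MPI_Finalize();
    return 0;
}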

 Lecture 2 & 4 will give in-depth details on the distributed-memory programming model with the Message Passing Interface (MPI)

Lecture 1 – High Performance Computing 19 / 50 MPI Standard – GNU OpenMPI Implementation Example – Revisited

. Message Passing Interface (MPI)
. A standardized and portable message-passing standard
. Designed to support different HPC architectures
. A wide variety of MPI implementations exist
. Standard defines the syntax and semantics of a core of library routines used in C, C++ & Fortran [12] MPI Standard
. OpenMPI Implementation
. Open source license based on the BSD license
. Full MPI (version 3) standards conformance [13] OpenMPI Web page
. Developed & maintained by a consortium of academic, research, & industry partners
. Typically available as modules on HPC systems and used with the mpicc compiler wrapper
. Often built with the GNU compiler set and/or Intel compilers

 Lecture 2 will provide a full introduction and many more examples of the Message Passing Interface (MPI) for parallel programming

Lecture 1 – High Performance Computing 20 / 50 Hierarchical Hybrid Computers

. A hierarchical hybrid parallel computer is neither a purely shared-memory nor a purely distributed-memory type system but a mixture of both . Large-scale ‘hybrid’ parallel computers have shared-memory building blocks interconnected with a fast network today

[5] Introduction to High Performance Computing for Scientists and Engineers

. Features . Shared-memory nodes (here ccNUMA) with local NIs . NI mediates connections to other remote ‘SMP nodes’

Lecture 1 – High Performance Computing 21 / 50 Programming Hybrid Systems & Patterns

. Hybrid systems programming uses MPI as explicit internode communication and OpenMP for parallelization within the node . Parallel Programming is often supported by using ‘patterns’ such as stencil methods in order to apply functions to the domain decomposition

. Experience from HPC Practice
. Most parallel applications still take no notice of the hardware structure
. Use of pure MPI for parallelization remains the dominant programming model
. Historical reason: old supercomputers were all of the distributed-memory type
. Use of accelerators is significantly increasing in practice today
. Challenges with the 'mapping problem'
. Performance of hybrid (as well as pure MPI) codes depends crucially on factors not directly connected to the programming model
. It largely depends on the association of threads and processes to cores
. Patterns (e.g., stencil methods) support the parallel programming (a minimal hybrid sketch follows below)
 Lecture 10 will provide insights into hybrid programming models and introduces selected patterns used in parallel programming
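A minimal C sketch of the hybrid model under these assumptions: MPI distributes a simple one-dimensional domain decomposition across processes, an OpenMP directive parallelizes each process' chunk across its node-local threads, and the 'work' inside the loop is only a stand-in (no real stencil). Compiled e.g. with mpicc -fopenmp.

/* Minimal hybrid sketch: MPI handles explicit communication between the
 * (shared-memory) nodes, OpenMP threads parallelize the work inside each
 * node. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 8000000L   /* illustrative global problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* simple 1D domain decomposition: each MPI process owns a chunk of indices */
    long chunk = N / size, local_sum = 0, global_sum = 0;
    long lo = rank * chunk, hi = (rank == size - 1) ? N : lo + chunk;

    /* inside the node: OpenMP threads share the work on the local chunk */
    #pragma omp parallel for reduction(+ : local_sum)
    for (long i = lo; i < hi; i++)
        local_sum += i % 7;                  /* stand-in for real work */

    /* explicit internode communication: combine the partial results */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %ld (from %d MPI processes)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}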

Lecture 1 – High Performance Computing 22 / 50 [Video] Juelich – Supercomputer Upgrade

Lecture 1 – High Performance Computing 23 / 50 HPC Ecosystem Technologies

Lecture 1 – High Performance Computing 24 / 50 HPC System Software Environment – Revisited (cf. Practical Lecture 0.2)

. HPC systems and supercomputers typically provide a software environment that supports the processing of parallel and scalable applications

. Operating System
. Former times often a 'proprietary OS', nowadays often a (reduced) 'Linux'

. Scheduling Systems
. Manage concurrent access of users on supercomputers
. Different scheduling algorithms can be used with different 'batch queues'
. Examples: SLURM @ JÖTUNN Cluster, LoadLeveler @ JUQUEEN, etc.
. Scheduling systems enable a method by which user processes are given access to processors

. Monitoring Systems
. Monitor and test the status of the system ('system health checks/heartbeat')
. Enable a view of the usage of the system per node/rack ('system load')
. Examples: LLView, INCA, Ganglia @ JÖTUNN Cluster, etc.
. Monitoring systems offer a comprehensive view of the current status of an HPC system or supercomputer

. Performance Analysis Systems
. Measure the performance of an application and recommend improvements (e.g., Scalasca, Vampir, etc.)

 Lecture 9 will offer more insights into performance analysis systems with debugging, profiling, and HPC performance toolsets

Lecture 1 – High Performance Computing 25 / 50 Scheduling vs. Emerging Interactive Supercomputing Approaches

. JupyterHub is a multi-user version of the Jupyter Notebook, designed for companies, classrooms, and research labs

[20] A. Lintermann & M. Riedel et al., 'Enabling Interactive Supercomputing at JSC – Lessons Learned'
[21] A. Streit & M. Riedel et al., 'UNICORE 6 – Recent and Future Advancements'
[22] Project Jupyter Web page

Lecture 1 – High Performance Computing 26 / 50 Modular Supercomputer JUWELS – Revisited

Lecture 1 – High Performance Computing 27 / 50 HPC System Architectures

. HPC systems are very complex 'machines' with many elements
. CPUs & multi-cores with 'multi-threading' capabilities
. Data access levels & different levels of caches
. Network topologies & various interconnects
. Architecture impacts
. Vendor designs, e.g., IBM BlueGene/Q
. Infrastructure, e.g., cooling & power lines in the computing hall
. HPC faced a significant change in practice with respect to performance increases after years
. Getting more speed for free by waiting for new CPU generations does not work any more
. Multi-core processors emerge that require using those multiple resources efficiently in parallel
. Many-core processors emerge that are used to accelerate certain parts of computing applications

Lecture 1 – High Performance Computing 28 / 50 Example: IBM BlueGene Architecture Evolution

. BlueGene/P

. BlueGene/Q

Lecture 1 – High Performance Computing 29 / 50 Network Topologies

. Large-scale HPC Systems have special network setups . Dedicated I/O nodes, fast interconnects, e.g. Infiniband (IB) . Different network topologies, e.g. tree, 5D Torus network, mesh, etc. (raise challenges in task mappings and communication patterns)

network interconnection important

[5] Introduction to High Performance Computing for Scientists and Engineers Source: IBM

Lecture 1 – High Performance Computing 30 / 50 HPC System Architecture & Continuous Developments

. Increasing number of other 'new' emerging system architectures
. HPC is at the cutting edge of computing and integrates new hardware developments as a continuous activity
. General Purpose Computation on Graphics Processing Units (GPGPUs/GPUs)
. Use of GPUs for computing instead of computer graphics
. Programming models are OpenCL and NVIDIA CUDA
. Getting more and more adopted in many application fields
. Field Programmable Gate Arrays (FPGAs)
. Integrated circuit designed to be configured by a user after shipping
. Enables updates of functionality and reconfigurable 'wired' interconnects
. Processors
. Enable the combination of general-purpose cores with co-processing elements that accelerate dedicated forms of computations
. Artificial Intelligence with methods from machine learning and deep learning influences HPC system architectures today
. Complements the initial focus on compute-intensive with data-intensive application co-design activities

Lecture 1 – High Performance Computing 31 / 50 Many-core GPGPUs

. Use of very many simple cores
. High throughput computing-oriented architecture
. Use massive parallelism by executing a lot of concurrent threads slowly
. Handle an ever-increasing number of multiple instruction threads
. CPUs instead typically execute a single long thread as fast as possible
[10] Distributed & Cloud Computing Book

. Graphics Processing Unit (GPU) is great for data parallelism and task parallelism
. Compared to multi-core CPUs, GPUs consist of a many-core architecture with hundreds to even thousands of very simple cores executing threads rather slowly
. Named General-Purpose Computing on GPUs (GPGPU)
. Different programming models emerge
. Many-core GPUs are used in large clusters and within massively parallel supercomputers today

Lecture 1 – High Performance Computing 32 / 50 GPU Acceleration

. GPU accelerator architecture example (e.g., NVIDIA card)
. GPUs can have 128 cores on one single GPU chip
. Each core can work with eight threads of instructions
. GPU is able to concurrently execute 128 * 8 = 1024 threads
. Interaction between CPU and GPU is via memory interactions and thus a major (bandwidth) bottleneck
[10] Distributed & Cloud Computing Book

. GPU acceleration means that GPUs accelerate computing due to massive parallelism with thousands of threads, compared to only a few threads used by conventional CPUs
. GPUs are designed to compute large numbers of floating point operations in parallel
. E.g., applications that use matrix-vector / matrix-matrix multiplications (e.g., deep learning algorithms); a plain-C sketch of such a data-parallel loop follows below
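A plain, sequential C sketch of such a data-parallel matrix-vector product: every row of the result can be computed independently, which is exactly the kind of loop a CUDA or OpenCL kernel would map onto thousands of GPU threads (one thread per row); the matrix values are illustrative only.

/* Plain-C sketch of a data-parallel matrix-vector product y = A*x.
 * Every row of the result is independent of the others, which is the
 * kind of loop a GPU maps onto thousands of lightweight threads
 * (e.g. one CUDA/OpenCL thread per row); here it runs sequentially. */
#include <stdio.h>

#define ROWS 4
#define COLS 3

int main(void)
{
    double A[ROWS][COLS] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 0, 1}};
    double x[COLS] = {1, 1, 1};
    double y[ROWS];

    /* independent per-row work: no ordering between iterations is required */
    for (int i = 0; i < ROWS; i++) {
        y[i] = 0.0;
        for (int j = 0; j < COLS; j++)
            y[i] += A[i][j] * x[j];
    }

    for (int i = 0; i < ROWS; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}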

 Lecture 10 will introduce the programming of accelerators with different approaches and their key benefits for applications

Lecture 1 – High Performance Computing 33 / 50 NVIDIA Fermi GPU Example

[10] Distributed & Cloud Computing Book

Lecture 1 – High Performance Computing 34 / 50 DEEP Learning takes advantage of Many-Core Technologies

. Innovation via specific layers and architecture types

[26] A. Rosebrock

[25] Neural Network 3D Simulation

 Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and how many-core HPC is used

Lecture 1 – High Performance Computing 35 / 50 Deep Learning Application Example – Using High Performance Computing

. Using Convolutional Neural Networks (CNNs) with hyperspectral remote sensing image data [27] J. Lange and M. Riedel et al., IGARSS Conference, 2018

. Find hyperparameters & joint 'new-old' modeling & transfer learning given rare labeled/annotated data in science (e.g., 36,000 vs. 14,197,122 images in ImageNet)

[28] G. Cavallaro, M. Riedel et al., IGARSS 2019

 Lecture 8 will provide more details about parallel & scalable machine & deep learning algorithms and remote sensing applications

Lecture 1 – High Performance Computing 36 / 50 HPC Relationship to ‘Big Data‘ in Machine & Deep Learning

[Figure: model performance / accuracy vs. dataset volume ('Big Data'): traditional learning models & small neural networks (statistical computing with R, SVMs, Random Forests; tools such as MatLab, Octave, Weka, scikit-learn) rely on manual feature engineering and 'small datasets', while medium and large deep learning networks change the ordering and use High Performance Computing & Cloud Computing (e.g., JURECA) for the training time]

[15] www.big-data.tips

Lecture 1 – High Performance Computing 37 / 50 Deep Learning Application Example – Using Cloud Computing

. Performing parallel computing with Apache Spark across different worker nodes

[23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019

. Using Autoencoder deep neural networks with Cloud computing

[24] Apache Spark Web page

 The complementary Cloud Computing & Big Data – Parallel Machine & Deep Learning Course teaches Apache Spark Approaches

Lecture 1 – High Performance Computing 38 / 50 HPC Relationship to ‘Big Data‘ in Simulation Sciences

[14] F. Berman: Maximising the Potential of Research Data

Lecture 1 – High Performance Computing 39 / 50 Data Access & Challenges

. P = Processor core elements
. Compute: floating point or integer operations
. Arithmetic units (compute operations)
. Registers (feed those units with operands)
. 'Data access' for applications happens on different levels
. Registers: 'accessed w/o any delay'
. L1D = Level 1 Cache – Data (fastest, normal)
. L2 = Level 2 Cache (fast, often)
. L3 = Level 3 Cache (still fast, less often)
. Main memory (slow, but larger in size)
. Storage media like harddisks, tapes, etc. (cheaper, but too slow to be used in direct computing)
. The DRAM gap is the large discrepancy between main memory and cache bandwidths

[5] Introduction to High Performance Computing for Scientists and Engineers

Lecture 1 – High Performance Computing 40 / 50 Big Data Drives Data-Intensive HPC Architecture Designs

. More recently, system architectures are influenced by 'big data'
. CPU speed has surpassed the I/O capabilities of existing HPC resources
. Scalable I/O gets more and more important in application scalability
. Requirements for Hierarchical Storage Management ('Tiers')
. Mass storage devices (tertiary storage) are too slow to enable active processing of 'big data'
. Increases in simulation time/granularity mean TBs are equally important as FLOP/s
. Tapes are cheap but slowly accessible; direct access from compute nodes is needed
. These requirements drive new 'tier-based' designs

[16] A. Szalay et al., ‘GrayWulf: Scalable Clustered Architecture for Data Intensive Computing’

Lecture 1 – High Performance Computing 41 / 50 Application Co-Design of HPC Architectures – Modular Supercomputing Example

. The modular supercomputing architecture (MSA) [17] DEEP Projects Web Page enables a flexible HPC system design co-designed by the need of different application workloads

Lecture 1 – High Performance Computing 42 / 50 New HPC Architectures – Modular Supercomputing Architecture Example

[Figure: Modular Supercomputing Architecture example with modules (CN, DN, BN, NAM, GCE) combining general-purpose CPUs, many-core CPUs, FPGAs, memory (MEM), and NVRAM, mapped to a possible application workload]

[17] DEEP Projects Web Page

Lecture 1 – High Performance Computing 43 / 50 Large-scale Computing Infrastructures

. Large computing systems are often embedded in infrastructures
. Grid computing for distributed data storage and processing via middleware
. The success of Grid computing was renowned when it was mentioned by Prof. Rolf-Dieter Heuer, CERN Director General, in the context of the Higgs Boson discovery: 'Results today only possible due to extraordinary performance of Accelerators – Experiments – Grid computing'
. Other large-scale distributed infrastructures exist
. Partnership for Advanced Computing in Europe (PRACE): EU HPC
. Extreme Science and Engineering Discovery Environment (XSEDE): US HPC

[18] Grid Computing Video

 Lecture 11 will give in-depth details on scalable approaches in large-scale HPC infrastructures and how to use them with middleware

Lecture 1 – High Performance Computing 44 / 50 [Video] PRACE – Introduction to Supercomputing

[19] PRACE – Introduction to Supercomputing

Lecture 1 – High Performance Computing 45 / 50 Lecture Bibliography

Lecture 1 – High Performance Computing 46 / 50 Lecture Bibliography (1)

. [1] Terrestrial Systems Simulation Lab, Online: http://www.hpsc-terrsys.de/hpsc-terrsys/EN/Home/home_node.html . [2] Nest:: The Neural Simulation Technology Initiative, Online: https://www.nest-simulator.org/ . [3] T. Bauer, ‘System Monitoring and Job Reports with LLView‘, Online: https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/supercomputer-ressources-2018-11/12b-sc-llview.pdf?__blob=publicationFile . [4] Wikipedia ‘Supercomputer’, Online: http://en.wikipedia.org/wiki/Supercomputer . [5] Introduction to High Performance Computing for Scientists and Engineers, Georg Hager & Gerhard Wellein, Chapman & Hall/CRC Computational Science, ISBN 143981192X, English, ~330 pages, 2010, Online: http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X . [6] TOP500 Supercomputing Sites, Online: http://www.top500.org/ . [7] LINPACK Benchmark, Online: http://www.netlib.org/benchmark/hpl/ . [8] HPC Challenge Benchmark Suite, Online: http://icl.cs.utk.edu/hpcc/ . [9] JUBE Benchmark Suite, Online: http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html . [10] K. Hwang, G. C. Fox, J. J. Dongarra, ‘Distributed and Cloud Computing’, Book, Online: http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049 . [11] The OpenMP API specification for parallel programming, Online: http://openmp.org/wp/openmp-specifications/

Lecture 1 – High Performance Computing 47 / 50 Lecture Bibliography (2)

. [12] The MPI Standard, Online: http://www.mpi-forum.org/docs/ . [13] OpenMPI Web page, Online: https://www.open-mpi.org/ . [14] Fran Berman, ‘Maximising the Potential of Research Data’ . [15] Big Data Tips – Big Data Mining & Machine Learning, Online: http://www.big-data.tips/ . [16] A. Szalay et al., ‘GrayWulf: Scalable Clustered Architecture for Data Intensive Computing’, Proceedings of the 42nd Hawaii International Conference on System Sciences – 2009 . [17] DEEP Projects Web page, Online: http://www.deep-projects.eu/ . [18] How EMI Contributed to the Higgs Boson Discovery, YouTube Video, Online: http://www.youtube.com/watch?v=FgcoLUys3RY&list=UUz8n-tukF1S7fql19KOAAhw . [19] PRACE – Introduction to Supercomputing, Online: https://www.youtube.com/watch?v=D94FJx9vxFA . [20] A. Lintermann & M. Riedel et al., ‘Enabling Interactive Supercomputing at JSC – Lessons Learned’, ISC 2019, Frankfurt, Germany, Online: https://www.researchgate.net/publication/330621591_Enabling_Interactive_Supercomputing_at_JSC_Lessons_Learned_ISC_High_Performance_2018_Intern ational_Workshops_FrankfurtMain_Germany_June_28_2018_Revised_Selected_Papers . [21] A. Streit & M. Riedel et al., ‘UNICORE 6 – Recent and Future Advancements ’, Online: https://www.researchgate.net/publication/225005053_UNICORE_6_-_recent_and_future_advancements . [22] Project Jupyter Web page, Online: https://jupyter.org/hub

Lecture 1 – High Performance Computing 48 / 50 Lecture Bibliography (3)

. [23] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019, Online: https://www.researchgate.net/publication/335181248_Cloud_Deep_Networks_for_Hyperspectral_Image_Analysis . [24] Apache Spark, Online: https://spark.apache.org/ . [25] YouTube Video, ‘Neural Network 3D Simulation‘, Online: https://www.youtube.com/watch?v=3JQ3hYko51Y . [26] A. Rosebrock, ‘Get off the deep learning bandwagon and get some perspective‘, Online: http://www.pyimagesearch.com/2014/06/09/get-deep-learning-bandwagon-get-perspective/ . [27] J. Lange, G. Cavallaro, M. Goetz, E. Erlingsson, M. Riedel, ‘The Influence of Sampling Methods on Pixel-Wise Hyperspectral Image Classification with 3D Convolutional Neural Networks’, Proceedings of the IGARSS 2018 Conference, Online: https://www.researchgate.net/publication/328991957_The_Influence_of_Sampling_Methods_on_Pixel- Wise_Hyperspectral_Image_Classification_with_3D_Convolutional_Neural_Networks . [28] G. Cavallaro, Y. Bazi, F. Melgani, M. Riedel, ‘Multi-Scale Convolutional SVM Networks for Multi-Class Classification Problems of Remote Sensing Images’, Proceedings of the IGARSS 2019 Conference, to appear

Lecture 1 – High Performance Computing 50 / 50