Introduction to Parallel Programming for Multicore/Manycore Clusters

General Introduction: Invitation to Supercomputing

Kengo Nakajima, Information Technology Center, The University of Tokyo
Takahiro Katagiri, Information Technology Center, Nagoya University

Motivation for Parallel Computing (and this class)
• Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science opens new frontiers of science and engineering.
• Why parallel computing?
  – faster & larger
  – "larger" is more important from the viewpoint of "new frontiers of science & engineering", but "faster" is also important
  – + more complicated
  – Ideal: scalable
    • Solving an N-times larger problem using N times the computational resources in the same computation time (weak scaling)
    • Solving a fixed-size problem using N times the computational resources in 1/N the computation time (strong scaling)

Scientific Computing = SMASH
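The weak- and strong-scaling definitions above can be sketched numerically. The `ideal_time` helper below is a hypothetical, idealized model (perfect load balance, zero communication overhead), not a measurement from any real machine:

```python
# Hypothetical illustration of the strong- and weak-scaling definitions.
# ideal_time(N, P): idealized run time for a problem of size N on P
# processors, assuming the work divides evenly with no overhead.

def ideal_time(problem_size, processors):
    """Idealized parallel run time: work split evenly, no overhead."""
    return problem_size / processors

# Strong scaling: fixed problem size, N-times the resources -> 1/N the time.
t1 = ideal_time(1000, 1)
t8 = ideal_time(1000, 8)
print(t1 / t8)  # speedup of 8.0

# Weak scaling: N-times larger problem on N-times the resources -> same time.
print(ideal_time(1000, 1) == ideal_time(8000, 8))  # True
```

Real codes fall short of these ideals because of communication and serial portions, which is why scalability is described as the "ideal" on the slide above.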

SMASH: Science – Modeling – Algorithm – Software – Hardware
• You have to learn many things.
• Collaboration (or co-design) will be important for the future career of each of you, as a scientist and/or an engineer.
  – You have to communicate with people with different backgrounds.
  – This is more difficult than communicating with foreign scientists from the same area.
• (Q): Computer Science, Computational Science, or Numerical Algorithms?

Computer Science and Computational Science
• Overview of the Class

Computer & CPU

CPU (Central Processing Unit, 中央処理装置)
• The CPUs used in PCs and in supercomputers are based on the same architecture.
• GHz: clock rate
  – Frequency: number of operations by the CPU per second
    • GHz -> 10^9 operations/sec
  – 4-8 simultaneous instructions per clock

Multicore CPU
• Core (コア) = central part of a CPU
  – Single core: 1 core/CPU, dual core: 2 cores/CPU, quad core: 4 cores/CPU
• Multicore CPUs with 4-8 cores are popular
  – Low power
• GPU: manycore, with O(10^1)-O(10^2) cores
• More and more cores -> parallel computing
• Oakleaf-FX at the University of Tokyo: 16 cores
  – SPARC64™ IXfx

GPU/Manycores
• GPU: Graphics Processing Unit
  – GPGPU: General-Purpose GPU
  – O(10^2) cores
  – High memory bandwidth
  – Cheap
  – NO stand-alone operation: a host CPU is needed
  – Programming: CUDA, OpenACC
• Intel Xeon Phi: manycore CPU
  – >60 cores
  – High memory bandwidth
  – Unix, Fortran and C compilers available
  – Currently a host CPU is needed
    • Stand-alone version: Knights Landing

Parallel Supercomputers
Multicore CPUs are connected through a network.
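A minimal sketch (not from the slides) of what "using several cores at once" looks like in practice, here with Python's standard `multiprocessing` module; the `square` function and the choice of 4 workers are illustrative assumptions:

```python
# Hypothetical example: divide independent work among worker processes,
# so a multicore CPU can execute the pieces in parallel.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:           # 4 workers ~ 4 cores
        results = pool.map(square, range(8))  # work split among workers
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

`Pool.map` returns results in input order, so the parallel version produces the same list as a sequential `map`; the courses introduced here typically teach the same idea with OpenMP and MPI in Fortran/C.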

CPU CPU CPU CPU CPU

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core Intro 9 Supercomputers with Heterogeneous/Hybrid Nodes

CPU CPU CPU CPU CPU

Core Core Core Core Core Core Core Core Core Core

Core Core Core Core Core Core Core Core Core Core

GPU GPU GPU GPU GPU Manycore Manycore Manycore Manycore Manycore C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C ・・・・・・C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C Intro Performance of Supercomputers • Performance of CPU: Clock Rate • FLOPS (Floating Point Operations per Second) – Real Number • Recent Multicore CPU – 4-8 FLOPS per Clock – (e.g.) Peak performance of a core with 3GHz • 3×10 9×4(or 8)=12(or 24) ×10 9 FLOPS=12(or 24)GFLOPS

• 10^6 FLOPS = 1 Mega FLOPS = 1 MFLOPS
• 10^9 FLOPS = 1 Giga FLOPS = 1 GFLOPS
• 10^12 FLOPS = 1 Tera FLOPS = 1 TFLOPS
• 10^15 FLOPS = 1 Peta FLOPS = 1 PFLOPS
• 10^18 FLOPS = 1 Exa FLOPS = 1 EFLOPS

Peak Performance of Reedbush-U
Intel Xeon E5-2695 v4 (Broadwell-EP)

• 2.1 GHz
  – 16 DP (double-precision) FLOP operations per clock
• Peak performance (1 core)
  – 2.1 × 16 = 33.6 GFLOPS
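The peak-performance arithmetic used on these slides (clock rate × FLOP operations per clock, per core) can be checked with a short script. `peak_gflops` is a hypothetical helper name; the input figures are the ones quoted above:

```python
# Per-core peak performance, as computed on the slides:
# peak [GFLOPS] = clock rate [GHz] x FLOP operations per clock.

def peak_gflops(clock_ghz, flops_per_clock):
    """Peak floating-point performance of one core, in GFLOPS."""
    return clock_ghz * flops_per_clock

# Generic 3 GHz core with 4 or 8 FLOPs per clock:
print(peak_gflops(3.0, 4))   # 12.0 GFLOPS
print(peak_gflops(3.0, 8))   # 24.0 GFLOPS

# Reedbush-U core (Intel Xeon E5-2695 v4, 2.1 GHz, 16 DP FLOPs/clock):
print(peak_gflops(2.1, 16))  # 33.6 GFLOPS
```

Multiplying by the number of cores per CPU and CPUs per system gives the machine-wide peak; measured application performance is normally well below this figure.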