Introduction to Parallel Programming for Multicore/Manycore Clusters

Introduction

Kengo Nakajima Information Technology Center The University of Tokyo

http://nkl.cc.u-tokyo.ac.jp/18s/

Descriptions of the Class
• Technical & Scientific Computing I (4820-1027)
  – 科学技術計算Ⅰ
  – Department of Mathematical Informatics
• Special Lecture on Computational Science I (4810-1215)
  – 計算科学アライアンス特別講義Ⅰ
  – Department of Computer Science
• Multithreaded Parallel Computing (3747-110) (from 2016)
  – スレッド並列コンピューティング
  – Department of Electrical Engineering & Information Systems
• This class is certified as a category "F" lecture of the Computational Alliance, the University of Tokyo

History of the Class
• 2009-2014: Introduction to FEM Programming
  – FEM: Finite-Element Method (有限要素法)
  – Summer (I): FEM Programming for Solid Mechanics
  – Winter (II): Parallel FEM using MPI
  – The 1st part (summer) is essential for the 2nd part (winter)
  – Problems
    • Many new (international) students in Winter who did not take the 1st part in Summer
    • They are generally more diligent than Japanese students
• 2015
  – Summer (I): Multicore programming by OpenMP
  – Winter (II): FEM + Parallel FEM by MPI/OpenMP for Heat Transfer
  – Parts I & II are independent (maybe...)
• 2017
  – Lectures are given in English
  – Reedbush-U System

Motivation for Parallel Computing (and this class)
• Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science develops new frontiers of science and engineering.
• Why parallel computing?
  – faster & larger
  – "larger" is more important from the viewpoint of "new frontiers of science & engineering", but "faster" is also important
  – + more complicated
  – Ideal: Scalable
    • Solving an N x scale problem using N x computational resources in the same computation time (weak scaling)
    • Solving a fixed-size problem using N x computational resources in 1/N computation time (strong scaling)

Scientific Computing = SMASH

[SMASH stack: Science – Modeling – Algorithm – Software – Hardware]
• You have to learn many things.
• Collaboration (or Co-Design) will be important for the future career of each of you, as a scientist and/or an engineer.
  – You have to communicate with people with different backgrounds.
  – It is more difficult than communicating with foreign scientists from the same area.
• (Q): Computer Science, Computational Science, or Numerical Algorithms?

This Class ...

[SMASH stack: Science – Modeling – Algorithm – Software – Hardware]
• Target: Parallel FVM (Finite-Volume Method) using OpenMP
• Science: 3D Poisson Equations
• Modeling: FVM
• Algorithm: Iterative Solvers etc.
• You have to know many components to learn FVM, although you have already learned each of these in undergraduate and high-school classes.

Road to Programming for "Parallel" Scientific Computing

Programming for Parallel Scientific Computing (e.g. Parallel FEM/FDM)

Programming for Real World Scientific Computing (e.g. FEM, FDM)

Big gap here !!

Programming for Fundamental Numerical Analysis (e.g. Gauss-Seidel, RK etc.)

Unix, Fortran, C etc.

The third step is important !

4. Programming for Parallel Scientific Computing (e.g. Parallel FEM/FDM)
3. Programming for Real World Scientific Computing (e.g. FEM, FDM)
2. Programming for Fundamental Numerical Analysis (e.g. Gauss-Seidel, RK etc.)
1. Unix, Fortran, C etc.

• How to parallelize applications ?
  – How to extract parallelism ?
  – If you understand the methods, algorithms, and implementation of the original code, it's easy.
  – "Data structure" is important.
• How to understand the code ?
  – Reading the application code !!
  – It seems primitive, but it is very effective.
  – In this class, "reading the source code" is encouraged.
  – 3: FVM, 4: Parallel FVM

Kengo Nakajima 中島研吾 (1/2)
• Current Position
  – Professor, Supercomputing Research Division, Information Technology Center, The University of Tokyo (情報基盤センター)
    • Department of Mathematical Informatics, Graduate School of Information Science & Engineering, The University of Tokyo (情報理工・数理情報学)
    • Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo (工・電気系工学)
  – Deputy Director, RIKEN CCS (Center for Computational Science) (Kobe) (20%)
• Research Interests
  – High-Performance Computing
  – Parallel Numerical Linear Algebra (Preconditioning)
  – Parallel Programming Models
  – Computational Mechanics, Computational Fluid Dynamics
  – Adaptive Mesh Refinement, Parallel Visualization

Kengo Nakajima (2/2)
• Education
  – B.Eng. (Aeronautics, The University of Tokyo, 1985)
  – M.S. (Aerospace Engineering, University of Texas, 1993)
  – Ph.D. (Quantum Engineering & System Sciences, The University of Tokyo, 2003)
• Professional
  – Mitsubishi Research Institute, Inc. (1985-1999)
  – Research Organization for Information Science & Technology (1999-2004)
  – The University of Tokyo
    • Department of Earth & Planetary Science (2004-2008)
    • Information Technology Center (2008-)
  – JAMSTEC (2008-2011), part-time
  – RIKEN (2009-2016), part-time

• Supercomputers and Computational Science
• Overview of the Class
• Future Issues

Computer & CPU
• CPU: Central Processing Unit (中央処理装置)
• CPUs used in PCs and supercomputers are based on the same architecture
• GHz: Clock Rate
  – Frequency: number of operations by the CPU per second
    • GHz -> 10^9 operations/sec
  – Simultaneous 4-8 instructions per clock

Multicore CPU
• Core (コア) = central processing part of a CPU
• Multicore CPUs with 4-8 cores are popular
  – Low Power
  [Figure: Single Core (1 core/CPU), Dual Core (2 cores/CPU), Quad Core (4 cores/CPU)]
• GPU: Manycore
  – O(10^1)-O(10^2) cores
• Reedbush-U: 18 cores x 2
  – Intel Broadwell-EP
• More and more cores
  – Parallel computing

GPU/Manycores
• GPU: Graphics Processing Unit
  – GPGPU: General Purpose GPU
  – O(10^2) cores
  – High Memory Bandwidth
  – Cheap
  – NO stand-alone operation
    • Host CPU needed
  – Programming: CUDA, OpenACC
• Intel Xeon Phi: Manycore CPU
  – 60+ cores
  – High Memory Bandwidth
  – Unix, Fortran, C compiler
  – Host CPU needed in the 1st generation
    • Stand-alone operation is possible now (Knights Landing, KNL)

Parallel Supercomputers
Multicore CPUs are connected through a network.

[Figure: nodes, each with a multicore CPU, connected through a network]

Supercomputers with Heterogeneous/Hybrid Nodes

[Figure: nodes combining multicore CPUs with GPUs/manycore accelerators, connected through a network]

Performance of Supercomputers
• Performance of CPU: Clock Rate
• FLOPS (Floating Point Operations per Second)
  – Real Number
• Recent Multicore CPU
  – 4-8 FLOPS per Clock
  – (e.g.) Peak performance of a core with a 3 GHz clock rate:
    • 3×10^9 × 4 (or 8) = 12 (or 24) ×10^9 FLOPS = 12 (or 24) GFLOPS

• 10^6 FLOPS = 1 Mega FLOPS = 1 MFLOPS
• 10^9 FLOPS = 1 Giga FLOPS = 1 GFLOPS
• 10^12 FLOPS = 1 Tera FLOPS = 1 TFLOPS
• 10^15 FLOPS = 1 Peta FLOPS = 1 PFLOPS
• 10^18 FLOPS = 1 Exa FLOPS = 1 EFLOPS
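The peak-performance arithmetic above can be checked with a few lines of C. This is only an illustration using the example numbers from the preceding slide (3 GHz, 4 or 8 FLOP per clock), not a measurement of any real machine:

    #include <stdio.h>

    /* Peak performance [GFLOPS] = clock rate [GHz] x FLOP operations per clock x number of cores */
    static double peak_gflops(double clock_ghz, int flop_per_clock, int cores)
    {
        return clock_ghz * flop_per_clock * cores;
    }

    int main(void)
    {
        printf("1 core, 3 GHz, 4 FLOP/clock: %6.1f GFLOPS\n", peak_gflops(3.0, 4, 1));
        printf("1 core, 3 GHz, 8 FLOP/clock: %6.1f GFLOPS\n", peak_gflops(3.0, 8, 1));
        return 0;
    }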

Peak Performance of Reedbush-U
Intel Xeon E5-2695 v4 (Broadwell-EP)

• 2.1 GHz
  – 16 DP (Double Precision) FLOP operations per Clock
• Peak Performance (1 core)
  – 2.1 × 16 = 33.6 GFLOPS
• Peak Performance (1 node)
  – 1 Socket, 18 cores: 604.8 GFLOPS
  – 2 Sockets, 36 cores: 1,209.6 GFLOPS

TOP 500 List
http://www.top500.org/
• Ranking list of supercomputers in the world
• Performance (FLOPS rate) is measured by "Linpack", which solves large-scale linear equations.
  – Since 1993
  – Updated twice a year (International Conferences in June and November)
• Linpack
  – iPhone version is available

• PFLOPS: Peta (=10^15) FLoating OPerations per Sec.
• Exa-FLOPS (=10^18) will be attained after 2020
http://www.top500.org/

Benchmarks
• TOP 500 (Linpack, HPL (High Performance Linpack))
  – Direct Linear Solvers, FLOPS rate
  – Regular Dense Matrices, Continuous Memory Access
  – Computing Performance
• HPCG
  – Preconditioned Iterative Solvers, FLOPS rate
  – Irregular Sparse Matrices derived from FEM Applications with Many "0" Components
    • Irregular/Random Memory Access
    • Closer to "Real" Applications than HPL
  – Performance of Memory, Communications
• Green 500
  – FLOPS/W rate for HPL (TOP500)

50th TOP500 List (November, 2017)

#  | Site | Computer, Year, Vendor | Cores | Rmax (TFLOPS) | Rpeak (TFLOPS) | Power (kW)
1  | National Supercomputing Center in Wuxi, China | Sunway TaihuLight: Sunway MPP, Sunway SW26010 260C 1.45GHz, 2016, NRCPC | 10,649,600 | 93,015 (= 93.0 PF) | 125,436 | 15,371
2  | National Supercomputing Center in Tianjin, China | Tianhe-2: Intel Xeon E5-2692, TH Express-2, 2013, NUDT | 3,120,000 | 33,863 (= 33.9 PF) | 54,902 | 17,808
3  | Swiss Natl. Supercomputer Center, Switzerland | Piz Daint: Cray XC30/NVIDIA P100, 2013, Cray | 361,760 | 19,590 | 25,326 | 2,272
4  | JAMSTEC, Japan | Gyoukou: ZettaScaler-2.2 HPC, Xeon D-1571, 2017, ExaScaler | 19,860,000 | 19,136 | 28,192 | 1,350
5  | Oak Ridge National Laboratory, USA | Titan: Cray XK7/NVIDIA K20x, 2012, Cray | 560,640 | 17,590 | 27,113 | 8,209
6  | Lawrence Livermore National Laboratory, USA | Sequoia: BlueGene/Q, 2011, IBM | 1,572,864 | 17,173 | 20,133 | 7,890
7  | DOE/NNSA/LANL/SNL | Trinity: Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, 2017, Cray | 979,968 | 14,137 | 43,903 | 3,844
8  | DOE/SC/LBNL/NERSC, USA | Cori: Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Cray Aries, 2016, Cray | 632,400 | 14,015 | 27,881 | 3,939
9  | Joint Center for Advanced High Performance Computing, Japan | Oakforest-PACS: PRIMERGY CX600 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path, 2016, Fujitsu | 557,056 | 13,555 | 24,914 | 2,719
10 | RIKEN AICS, Japan | K computer: SPARC64 VIIIfx, 2011, Fujitsu | 705,024 | 10,510 | 11,280 | 12,660

Rmax: Performance of Linpack (TFLOPS), Rpeak: Peak Performance (TFLOPS), Power: kW
http://www.top500.org/

HPCG Ranking (November, 2017)

#  | Computer | Cores | HPL Rmax (Pflop/s) | TOP500 Rank | HPCG (Pflop/s) | HPCG/Peak
1  | K computer | 705,024 | 10.510 | 10 | 0.603 | 5.3%
2  | Tianhe-2 (MilkyWay-2) | 3,120,000 | 33.863 | 2 | 0.580 | 1.1%
3  | Trinity | 979,072 | 14.137 | 7 | 0.546 | 1.2%
4  | Piz Daint | 361,760 | 19.590 | 3 | 0.486 | 1.9%
5  | Sunway TaihuLight | 10,649,600 | 93.015 | 1 | 0.481 | 0.4%
6  | Oakforest-PACS | 557,056 | 13.555 | 9 | 0.386 | 1.5%
7  | Cori | 632,400 | 13.832 | 8 | 0.355 | 1.3%
8  | Sequoia | 1,572,864 | 17.173 | 6 | 0.330 | 1.6%
9  | Titan | 560,640 | 17.590 | 4 | 0.322 | 1.2%
10 | TSUBAME3.0 | 136,080 | 8.125 | 13 | 0.189 | 1.6%

http://www.hpcg-benchmark.org/

Green 500 Ranking (November, 2016)

#  | Site | Computer | System, CPU, Vendor | HPL Rmax (Pflop/s) | TOP500 Rank | Power (MW) | GFLOPS/W
1  | NVIDIA Corporation | DGX SATURNV | NVIDIA DGX-1, Xeon E5-2698v4 20C 2.2GHz, Infiniband EDR, NVIDIA Tesla P100 | 3.307 | 28 | 0.350 | 9.462
2  | Swiss National Supercomputing Centre (CSCS) | Piz Daint | Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100 | 9.779 | 8 | 1.312 | 7.454
3  | RIKEN ACCS | Shoubu | ZettaScaler-1.6 etc. | 1.001 | 116 | 0.150 | 6.674
4  | National SC Center in Wuxi | Sunway TaihuLight | Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway | 93.01 | 1 | 15.37 | 6.051
5  | SFB/TR55 at Fujitsu Tech. Solutions GmbH | QPACE3 | PRIMERGY CX1640 M1, Intel Xeon Phi 7210 64C 1.3GHz, Intel Omni-Path | 0.447 | 375 | 0.077 | 5.806
6  | JCAHPC | Oakforest-PACS | PRIMERGY CX1640 M1, Intel Xeon Phi 7250 68C 1.4GHz, Intel Omni-Path | 13.555 | 6 | 2.719 | 4.986
7  | DOE/SC/Argonne National Lab. | Theta | Cray XC40, Intel Xeon Phi 7230 64C 1.3GHz, Aries interconnect | 5.096 | 18 | 1.087 | 4.688
8  | Stanford Research Computing Center | XStream | Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, Nvidia K80 | 0.781 | 162 | 0.190 | 4.112
9  | ACCMS, Kyoto University | Camphor 2 | Cray XC40, Intel Xeon Phi 7250 68C 1.4GHz, Aries interconnect | 3.057 | 33 | 0.748 | 4.087
10 | Jefferson Natl. Accel. Facility | SciPhi XVI | KOI Cluster, Intel Xeon Phi 7230 64C 1.3GHz, Intel Omni-Path | 0.426 | 397 | 0.111 | 3.837

http://www.top500.org/

Green 500 Ranking (June, 2017)

#  | Site | Computer | System, CPU, Vendor | HPL Rmax (TFLOPS) | TOP500 Rank | Power (kW) | GFLOPS/W
1  | Tokyo Tech. | TSUBAME3.0 | SGI ICE XA, IP139-SXM2, Xeon E5-2680v4, NVIDIA Tesla P100 SXM2, HPE | 1,998.0 | 61 | 142 | 14.110
2  | Yahoo Japan | kukai | ZettaScaler-1.6, Xeon E5-2650Lv4, NVIDIA Tesla P100, ExaScaler | 460.7 | 465 | 33 | 14.046
3  | AIST, Japan | AIST AI Cloud | NEC 4U-8GPU Server, Xeon E5-2630Lv4, NVIDIA Tesla P100 SXM2, NEC | 961.0 | 148 | 76 | 12.681
4  | CAIP, RIKEN, Japan | RAIDEN GPU subsystem | NVIDIA DGX-1, Xeon E5-2698v4, NVIDIA Tesla P100, Fujitsu | 635.1 | 305 | 60 | 10.603
5  | Univ. Cambridge, UK | Wilkes-2 | Dell C4130, Xeon E5-2650v4, NVIDIA Tesla P100, Dell | 1,193.0 | 100 | 114 | 10.428
6  | Swiss Natl. SC Center (CSCS) | Piz Daint | Cray XC50, Xeon E5-2690v3, NVIDIA Tesla P100, Cray Inc. | 19,590.0 | 3 | 2,272 | 10.398
7  | JAMSTEC, Japan | Gyoukou | ZettaScaler-2.0 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler | 1,677.1 | 69 | 164 | 10.226
8  | Inst. for Env. Studies, Japan | GOSAT-2 (RCF2) | SGI Rackable C1104-GP1, Xeon E5-2650v4, NVIDIA Tesla P100, NSSOL/HPE | 770.4 | 220 | 79 | 9.797
9  | Facebook, USA | Penguin Relion | Xeon E5-2698v4/E5-2650v4, NVIDIA Tesla P100, Acer Group | 3,307.0 | 31 | 350 | 9.462
10 | NVIDIA, USA | DGX SaturnV | Xeon E5-2698v4, NVIDIA Tesla P100, Nvidia | 3,307.0 | 32 | 350 | 9.462
11 | ITC, U.Tokyo, Japan | Reedbush-H | SGI Rackable C1102-GP8, Xeon E5-2695v4, NVIDIA Tesla P100 SXM2, HPE | 802.4 | 203 | 94 | 8.575

http://www.top500.org/

Green 500 Ranking (November, 2017)

#  | Site | Computer | System, CPU, Vendor | HPL Rmax (TFLOPS) | TOP500 Rank | Power (kW) | GFLOPS/W
1  | RIKEN, Japan | Shoubu system B | ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler | 842.0 | 259 | 50 | 17.009
2  | KEK, Japan | Suiren2 | ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler | 788.2 | 307 | 47 | 16.759
3  | PEZY, Japan | Sakura | ZettaScaler-2.2 HPC system, Xeon E5-2618Lv3, PEZY-SC2, ExaScaler | 824.7 | 276 | 50 | 16.657
4  | NVIDIA, USA | DGX SaturnV Volta | Xeon E5-2698v4, NVIDIA Tesla V100, Nvidia | 1,070.0 | 149 | 97 | 15.113
5  | JAMSTEC, Japan | Gyoukou | ZettaScaler-2.2 HPC system, Xeon D-1571, PEZY-SC2, ExaScaler | 19,135.8 | 4 | 1,350 | 14.173
6  | Tokyo Tech. | TSUBAME3.0 | SGI ICE XA, IP139-SXM2, Xeon E5-2680v4, NVIDIA Tesla P100 SXM2, HPE | 8,125.0 | 13 | 792 | 13.704
7  | AIST, Japan | AIST AI Cloud | NEC 4U-8GPU Server, Xeon E5-2630Lv4, NVIDIA Tesla P100 SXM2, NEC | 961.0 | 148 | 76 | 12.681
8  | CAIP, RIKEN, Japan | RAIDEN GPU subsystem | NVIDIA DGX-1, Xeon E5-2698v4, NVIDIA Tesla P100, Fujitsu | 635.1 | 305 | 60 | 10.603
9  | Univ. Cambridge, UK | Wilkes-2 | Dell C4130, Xeon E5-2650v4, NVIDIA Tesla P100, Dell | 1,193.0 | 100 | 114 | 10.428
10 | Swiss Natl. SC Center (CSCS) | Piz Daint | Cray XC50, Xeon E5-2690v3, NVIDIA Tesla P100, Cray Inc. | 19,590.0 | 3 | 2,272 | 10.398
11 | ITC, U.Tokyo, Japan | Reedbush-L | SGI Rackable C1102-GP8, Xeon E5-2695v4, NVIDIA Tesla P100 SXM2, HPE | 805.6 | 291 | 79 | 10.167

http://www.top500.org/

Computational Science
The 3rd Pillar of Science

• Theoretical & Experimental Science
• Computational Science
  – The 3rd Pillar of Science
  – Simulations using Supercomputers
[Figure: three pillars – Theoretical, Experimental, Computational]

Methods for Scientific Computing
• Numerical solutions of PDEs (Partial Differential Equations)
• Grids, Meshes, Particles
  – Large-Scale Linear Equations
  – Finer meshes provide more accurate solutions

3D Simulations for Earthquake Generation Cycle
San Andreas Faults, CA, USA
Stress Accumulation at Transcurrent Plate Boundaries

Adaptive FEM: high resolution is needed at meshes with large deformation (large accumulation)

Typhoon Simulations by FDM: Effect of Resolution

[Figures: simulations at ∆h=100km, ∆h=50km, ∆h=5km]

[JAMSTEC]

Simulation of Typhoon MANGKHUT in 2003 using the Earth Simulator

[JAMSTEC]

Simulation of Geologic CO2 Storage

[Dr. Hajime Yamamoto, Taisei]

Simulation of Geologic CO2 Storage

• International/Interdisciplinary Collaborations
  – Taisei (Science, Modeling)
  – Lawrence Berkeley National Laboratory, USA (Modeling)
  – Information Technology Center, the University of Tokyo (Algorithm, Software)
  – JAMSTEC (Earth Simulator Center) (Software, Hardware)
  – NEC (Software, Hardware)
• 2010 Japan Geotechnical Society (JGS) Award

Simulation of Geologic CO2 Storage
• Science
  – Behavior of CO2 in supercritical state at a deep reservoir
• PDEs
  – 3D Multiphase Flow (Liquid/Gas) + 3D Mass Transfer
• Method for Computation
  – TOUGH2 code based on FVM, developed by Lawrence Berkeley National Laboratory, USA
  – More than 90% of computation time is spent on solving large-scale linear equations with more than 10^7 unknowns
• Numerical Algorithm
  – Fast algorithm for large-scale linear equations developed by the Information Technology Center, the University of Tokyo
• Supercomputer
  – Earth Simulator II (NEC SX-9, JAMSTEC, 130 TFLOPS)
  – Oakleaf-FX (Fujitsu PRIMEHPC FX10, U.Tokyo, 1.13 PFLOPS)

Diffusion-Dissolution-Convection Process

[Figure: injection well, caprock (low-permeability seal), supercritical CO2, convective mixing]
• Buoyant scCO2 overrides onto groundwater
• Dissolution of CO2 increases water density
• Denser fluid lies on lighter fluid
• Rayleigh-Taylor instability invokes convective mixing of groundwater
• The mixing significantly enhances the CO2 dissolution into groundwater, resulting in more stable storage
Preliminary 2D simulation (Yamamoto et al., GHGT11)
[Dr. Hajime Yamamoto, Taisei]


Density convections for 1,000 years:

Flow Model

Only the far side of the vertical cross section passing through the injection well is depicted.

[Dr. Hajime Yamamoto, Taisei]
• The meter-scale fingers gradually developed into larger ones in the field-scale model
• A huge number of time steps (> 10^5) was required to complete the 1,000-year simulation
• Onset time (10-20 yrs) is comparable to theory (linear stability analysis, 15.5 yrs)

Simulation of Geologic CO2 Storage
30 million DoF (10 million grids × 3 DoF/grid node)
[Plot: average calculation time (sec) for solving the matrix for one time step vs. number of processors; TOUGH2-MP on FX10; 2-3 times speedup]
[Dr. Hajime Yamamoto, Taisei]
Fujitsu FX10 (Oakleaf-FX), 30M DOF: 2x-3x improvement
3D Multiphase Flow (Liquid/Gas) + 3D Mass Transfer

Motivation for Parallel Computing again
• Large-scale parallel computers enable fast computing in large-scale scientific simulations with detailed models. Computational science develops new frontiers of science and engineering.
• Why parallel computing?
  – faster & larger
  – "larger" is more important from the viewpoint of "new frontiers of science & engineering", but "faster" is also important
  – + more complicated
  – Ideal: Scalable
    • Solving an N x scale problem using N x computational resources in the same computation time (weak scaling)
    • Solving a fixed-size problem using N x computational resources in 1/N computation time (strong scaling)

• Supercomputers and Computational Science
• Overview of the Class
• Future Issues

Our Current Target: Multicore Cluster
Multicore CPUs are connected through a network.
[Figure: nodes, each with a multicore CPU and its own memory, connected through a network]

• OpenMP
  – Multithreading
  – Intra Node (Intra CPU)
  – Shared Memory
• MPI (after October)
  – Message Passing
  – Inter Node (Inter CPU)
  – Distributed Memory
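As a concrete picture of the OpenMP (shared-memory, intra-node) side of this target, here is a minimal sketch in C; it is illustrative only (array names and sizes are made up), not part of the course code:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double x[N], y[N];
        double sum = 0.0;

        /* All threads of one process work on parts of the shared arrays. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }

        /* Reduction: per-thread partial sums are combined automatically. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            sum += x[i] * y[i];
        }

        printf("threads = %d, dot product = %e\n", omp_get_max_threads(), sum);
        return 0;
    }

Such a program would be built with an OpenMP-capable compiler (e.g. gcc -fopenmp or icc -qopenmp), and the number of threads is selected through the OMP_NUM_THREADS environment variable.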

Flat MPI vs. Hybrid

Flat MPI: Each Core -> Independent
• MPI only
• Intra/Inter Node
[Figure: one MPI process per core, each with its own portion of memory]

Hybrid: Hierarchical Structure
• OpenMP
• MPI
[Figure: one MPI process per node/socket; threads within a process share memory]

Example of OpenMP/MPI Hybrid
Sending Messages to Neighboring Processes
MPI: Message Passing, OpenMP: Threading with Directives

!C
!C-- SEND
!C    pack boundary data into WS using OpenMP threads, then send it with MPI
      do neib= 1, NEIBPETOT
        II= (LEVEL-1)*NEIBPETOT
        istart= STACK_EXPORT(II+neib-1)
        inum  = STACK_EXPORT(II+neib  ) - istart
!$omp parallel do
        do k= istart+1, istart+inum
          WS(k-NE0)= X(NOD_EXPORT(k))
        enddo

        call MPI_Isend (WS(istart+1-NE0), inum, MPI_DOUBLE_PRECISION,  &
     &                  NEIBPE(neib), 0, MPI_COMM_WORLD,               &
     &                  req1(neib), ierr)
      enddo

• In order to make full use of modern supercomputer systems with multicore/manycore architectures, hybrid parallel programming with message passing and multithreading is essential.
• MPI for message passing and OpenMP for multithreading are the most popular approaches to parallel programming on multicore/manycore clusters.
• In this class, we "parallelize" a finite-volume method code with Krylov iterative solvers for Poisson's equation on the Reedbush-U supercomputer (Intel Broadwell-EP) at the University of Tokyo.
• Because of the limited time, we (mainly) focus on multithreading by OpenMP.
• The ICCG solver (Conjugate Gradient iterative solver with Incomplete Cholesky preconditioning) is a widely used method for solving linear equations.
• Because it includes "data dependency", where writing/reading data to/from memory could occur simultaneously, parallelization using OpenMP is not straightforward.
• We need a certain kind of reordering in order to extract parallelism.
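To see why this data dependency matters, compare a sequential forward sweep (as in Gauss-Seidel or the IC forward substitution) with a multicolor-reordered version. The sketch below is illustrative only, with hypothetical CRS-like array names; it is not the course's FVM/ICCG code:

    /* Strictly lower triangle of row i is stored in item[index[i]..index[i+1]-1]
       with values AL[...]; D[i] is the diagonal. All names here are hypothetical. */

    void forward_sweep_sequential(int n, const int *index, const int *item,
                                  const double *AL, const double *D,
                                  const double *b, double *x)
    {
        /* x[i] uses x-values of neighbors updated earlier in the same sweep, so the
           i-loop carries a dependency: a naive "#pragma omp parallel for" here
           would produce wrong results. */
        for (int i = 0; i < n; i++) {
            double t = b[i];
            for (int j = index[i]; j < index[i+1]; j++)
                t -= AL[j] * x[item[j]];
            x[i] = t / D[i];
        }
    }

    void forward_sweep_multicolor(int ncolors, const int *color_ptr, const int *perm,
                                  const int *index, const int *item,
                                  const double *AL, const double *D,
                                  const double *b, double *x)
    {
        /* Multicoloring: unknowns of the same color are not directly coupled, so
           the loop over one color has no dependency and can be multithreaded;
           colors are processed one after another. */
        for (int ic = 0; ic < ncolors; ic++) {
            #pragma omp parallel for
            for (int k = color_ptr[ic]; k < color_ptr[ic+1]; k++) {
                int i = perm[k];              /* unknown in reordered numbering     */
                double t = b[i];
                for (int j = index[i]; j < index[i+1]; j++)
                    t -= AL[j] * x[item[j]];  /* neighbors belong to earlier colors */
                x[i] = t / D[i];
            }
        }
    }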

• Lectures and exercises on the following issues related to OpenMP will be conducted:
  – Finite-Volume Method (FVM)
  – Krylov Iterative Method
  – Preconditioning
  – Implementation of the Program
  – Introduction to OpenMP
  – Reordering/Coloring Method
  – Parallel FVM Code using OpenMP

Date        ID      Title
Apr-09 (M)  CS-01   Introduction
Apr-16 (M)  CS-02   FVM (1/3)
Apr-23 (M)  CS-03   FVM (2/3)
Apr-30 (M)  (no class) (National Holiday)
May-07 (M)  CS-04   FVM (3/3)
May-14 (M)  CS-05   Login to Reedbush-U, OpenMP (1/3)
May-21 (M)  CS-06   OpenMP (2/3)
May-28 (M)  CS-07   OpenMP (3/3)
Jun-04 (M)  CS-08   Reordering (1/2)
Jun-11 (M)  CS-09   Reordering (2/2)
Jun-18 (M)  CS-10   Tuning
Jun-25 (M)  (canceled)
Jul-02 (M)  CS-11   Parallel Code by OpenMP (1/2)
Jul-09 (M)  CS-12   Parallel Code by OpenMP (2/2)
Jul-23 (M)  CS-13   Exercises

"Prerequisites"

• Fundamental physics and mathematics
  – Linear algebra, analysis
• Experience in fundamental numerical algorithms
  – LU factorization/decomposition, Gauss-Seidel
• Experience in programming in C or Fortran
• Experience and knowledge in UNIX
• A user account of ECCS2016 must be obtained:
  – http://www.ecc.u-tokyo.ac.jp/doc/announce/newuser.html

Strategy
• If you can develop programs by yourself, it is ideal... but difficult.
  – Focused on "reading", not developing by yourself
  – Programs are in C and Fortran
• Lectures are done by ...
• Lecture Materials
  – Available at NOON on Friday through the WEB
    • http://nkl.cc.u-tokyo.ac.jp/17s/
  – NO hardcopy is provided
• Starting at 08:30
  – You can enter the building after 08:00
• Take seats from the front row.
• Terminals must be shut down after class.

Grades

• 1 or 2 reports on programming

If you have any questions, please feel free to contact me !

• Office: 3F Annex, Information Technology Center, #36 (情報基盤センター別館3F 36号室)
• ext.: 22719
• e-mail: nakajima(at)cc.u-tokyo.ac.jp
• NO specific office hours; appointment by e-mail

• http://nkl.cc.u-tokyo.ac.jp/18s/
• http://nkl.cc.u-tokyo.ac.jp/seminars/multicore/ (materials in Japanese, partial)

Keywords for OpenMP

• OpenMP
  – Directive-based, (seems to be) easy
  – Many books

• Data Dependency
  – Conflict of reading from/writing to memory
  – Appropriate reordering of data is needed for "consistent" parallel computing
  – NO detailed information in OpenMP books: very complicated

Some Technical Terms
• Processor, Core
  – Processing Unit (H/W); Processor = Core for single-core processors
• Process
  – Unit of MPI computation, nearly equal to "core"
  – Each core (or processor) can host multiple processes (but not efficiently)
• PE (Processing Element)
  – PE originally means "processor", but it is sometimes used as "process" in this class. Moreover, it can mean "domain" (next)
  – In multicore processors, PE generally means "core"
• Domain
  – domain = process (= PE), each of "MD" in "SPMD", each data set
• Process ID of MPI (ID of PE, ID of domain) starts from "0"
  – If you have 8 processes (PEs, domains), the IDs are 0~7
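A minimal illustration of the SPMD style and of process IDs starting at 0 (a generic MPI hello-world sketch, not the course code):

    #include <stdio.h>
    #include <mpi.h>

    /* Every process runs the same program (SPMD); behavior differs only by rank. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* process ID: 0, 1, ..., size-1      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes (PEs, domains) */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpirun -np 8 ./a.out, the ranks printed are 0 through 7.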

• Supercomputers and Computational Science
• Overview of the Class
• Future Issues

Key Issues towards Applications/Algorithms on Exascale Systems
Jack Dongarra (ORNL/U. Tennessee) at ISC 2013

• Hybrid/Heterogeneous Architecture
  – Multicore + GPU/Manycores (Intel MIC/Xeon Phi)
• Data Movement, Hierarchy of Memory
• Communication/Synchronization Reducing Algorithms
• Mixed Precision Computation
• Auto-Tuning/Self-Adapting
• Fault Resilient Algorithms
• Reproducibility of Results

Supercomputers with Heterogeneous/Hybrid Nodes

[Figure: nodes combining multicore CPUs with GPUs/manycore accelerators, connected through a network]

Hybrid Parallel Programming Model is essential for Post-Peta/Exascale Computing
• Message Passing (e.g. MPI) + Multithreading (e.g. OpenMP, CUDA, OpenCL, OpenACC etc.)
• On the K computer and FX10, hybrid parallel programming is recommended
  – MPI + Automatic Parallelization by Fujitsu's Compiler
• Expectations for Hybrid
  – Number of MPI processes (and sub-domains) to be reduced
  – O(10^8-10^9)-way MPI might not scale on Exascale Systems
  – Easily extended to Heterogeneous Architectures
    • CPU+GPU, CPU+Manycores (e.g. Intel MIC/Xeon Phi)
    • MPI+X: OpenMP, OpenACC, CUDA, OpenCL

This class is also useful for this type of parallel system

[Figure: nodes combining multicore CPUs with GPUs/manycore accelerators, connected through a network]

Parallel Programming Models

• Multicore Clusters (e.g. K, FX10)
  – MPI + OpenMP and (Fortran/C/C++)
• Multicore + GPU (e.g. Tsubame)
  – GPU needs host CPU
  – MPI and [(Fortran/C/C++) + CUDA, OpenCL]
    • complicated
  – MPI and [(Fortran/C/C++) with OpenACC]
    • close to MPI + OpenMP and (Fortran/C/C++)
• Multicore + Intel MIC/Xeon Phi
• Intel MIC/Xeon Phi Only
  – MPI + OpenMP and (Fortran/C/C++) is possible
    • + Vectorization
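A minimal sketch of the "MPI + OpenMP and (Fortran/C/C++)" combination for multicore clusters, written here in C for illustration only (not the course code): one MPI process per node or socket, several OpenMP threads inside each process.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;

        /* Request a thread-tolerant MPI level; FUNNELED = only the master thread
           makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        #pragma omp parallel
        {
            /* Threaded (OpenMP) work inside each MPI process goes here. */
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nprocs, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Future of Supercomputers (1/2)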

• Technical Issues
  – Power Consumption
  – Reliability, Fault Tolerance, Fault Resilience
  – Scalability (Parallel Performance)
• Petascale System
  – 2 MW including A/C, 2 M$/year, O(10^5~10^6) cores
• Exascale System (10^3 x Petascale)
  – After 2021 (?)
    • 2 GW (2 B$/year !), O(10^8~10^9) cores
  – Various types of innovations are on-going
    • to keep power consumption at 20 MW (100x efficiency)
    • CPU, Memory, Network ...
  – Reliability

Future of Supercomputers (2/2)

• Not only hardware, but also numerical models and algorithms must be improved:
  – Power-Aware/Reducing Algorithms (省電力)
  – Fault Resilient Algorithms (耐故障)
  – Communication Avoiding/Reducing Algorithms (通信削減)

• Co-Design by experts from various areas (SMASH) is important
  – An exascale system will be a special-purpose system, not a general-purpose one.