Peta-scale Physics Simulations on the Inexpensive GPU Supercomputer TSUBAME
High Performance, Low Power Consumption

Global Scientific Information and Computing Center, Tokyo Institute of Technology
Takayuki Aoki

What is a GPU?

TOP500 Ranking, November 2013

[Chart: TOP500 / Graph500 ranking and performance history of TSUBAME 1.1 (49.5 TF), 1.2, 2.0, and 2.5 from 2007 to 2013, compared with the K Computer (10.5 PF) and Xeon Phi (MIC) systems; TSUBAME 2.5 (K20X) reaches 17 PF single precision.]

TSUBAME 2.0 → TSUBAME 2.5: GPU upgrade from Fermi core (Tesla M2050 ×4224) to Kepler core (K20X ×4224).

System (58 racks, 1442 nodes, 2952 CPU sockets, 4264 GPUs)
  Performance: 224.7 TFLOPS (CPU, with Turbo Boost) + 5.562 PFLOPS (GPU)
  Total: 5.7 PFLOPS double precision, 17.1 PFLOPS single precision
  Memory: 116 TB

Rack (30 nodes)
  Performance: 122 TFLOPS, Memory: 2.28 TB

Compute node (2 CPUs, 3 GPUs)
  Performance: 4.08 TFLOPS, Memory: 58.0 GB (CPU) + 18 GB (GPU)

CUDA (Compute Unified Device Architecture)

CUDA is available for NVIDIA GPUs.

■ GPU computing (GPGPU) framework:
  C/C++ and Fortran languages + GPGPU extensions (compiler),
  runtime API, libraries, documentation, and sample programs

■ CUDA 5.5 is the current release.

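As a concrete illustration of this framework (not from the slides; all names are hypothetical), a minimal CUDA program copies data to the GPU, launches a kernel, and copies the result back:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical minimal example: each thread adds one vector element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, bytes); cudaMalloc(&b_d, bytes); cudaMalloc(&c_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);  // 256-thread blocks
    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);   // implicit sync

    printf("c[0] = %f\n", c[0]);                         // expect 3.0
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a); free(b); free(c);
    return 0;
}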

Bandwidth Bottlenecks

■ Memory hierarchy:
  GPU (Tesla K20X): 384-bit memory interface, 250 GB/sec to VRAM
  CPU memory (DDR3-1333): 32 GB/sec
  PCI Express 2.0 x16: 64 Gbps (8 GB/sec)
  InfiniBand QDR: 32 Gbps (4 GB/sec)
  1000BASE-T: 0.125 GB/sec

Domain Decomposition

2D decomposition into sub-domains (sub-domain #1, sub-domain #2, ...), one GPU per sub-domain. Boundary data between GPUs on different nodes moves in three steps:
(1) GPU → CPU: cudaMemcpy over PCI Express
(2) CPU → CPU: MPI over InfiniBand
(3) CPU → GPU: cudaMemcpy over PCI Express

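A minimal sketch of this three-step boundary exchange for one face, written as a blocking update (buffer and neighbor-rank names are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

// One face exchange: (1) GPU -> CPU, (2) CPU <-> CPU via MPI, (3) CPU -> GPU.
// boundary_d / halo_d are device pointers; send_h / recv_h are host buffers.
void exchangeHalo(const float *boundary_d, float *halo_d,
                  float *send_h, float *recv_h,
                  int count, int left, int right)
{
    size_t bytes = count * sizeof(float);
    cudaMemcpy(send_h, boundary_d, bytes, cudaMemcpyDeviceToHost);   // (1)
    MPI_Sendrecv(send_h, count, MPI_FLOAT, right, 0,
                 recv_h, count, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);                 // (2)
    cudaMemcpy(halo_d, recv_h, bytes, cudaMemcpyHostToDevice);       // (3)
}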

Overlapping Computation and Communication

Asynchronous data transfer: the boundary data is packed into a SendBuffer, exchanged by MPI into a RecvBuffer, and synchronized on one CUDA stream (Stream 1), while the interior computation proceeds concurrently on another (Stream 2).
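A sketch of the two-stream structure (hypothetical kernel and buffer names; host buffers are assumed pinned with cudaMallocHost so the asynchronous copies can overlap):

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical kernels: pack/unpack the halo layer, update interior cells.
__global__ void packBoundary(const float *f, float *send);
__global__ void unpackBoundary(float *fnew, const float *recv);
__global__ void updateInterior(const float *f, float *fnew);

void stepWithOverlap(float *f_d, float *fnew_d,
                     float *send_d, float *recv_d,
                     float *send_h, float *recv_h,   // pinned host buffers
                     int count, int left, int right,
                     dim3 gridB, dim3 gridI, dim3 block)
{
    size_t bytes = count * sizeof(float);
    cudaStream_t comm, comp;
    cudaStreamCreate(&comm);
    cudaStreamCreate(&comp);

    // Stream 1: pack the halo and start the GPU -> CPU transfer.
    packBoundary<<<gridB, block, 0, comm>>>(f_d, send_d);
    cudaMemcpyAsync(send_h, send_d, bytes, cudaMemcpyDeviceToHost, comm);

    // Stream 2: interior cells need no halo data, so compute them now.
    updateInterior<<<gridI, block, 0, comp>>>(f_d, fnew_d);

    cudaStreamSynchronize(comm);                   // SendBuffer ready on host
    MPI_Sendrecv(send_h, count, MPI_FLOAT, right, 0,
                 recv_h, count, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Stream 1: CPU -> GPU transfer and boundary update.
    cudaMemcpyAsync(recv_d, recv_h, bytes, cudaMemcpyHostToDevice, comm);
    unpackBoundary<<<gridB, block, 0, comm>>>(fnew_d, recv_d);

    cudaDeviceSynchronize();                       // both streams complete
    cudaStreamDestroy(comm);
    cudaStreamDestroy(comp);
}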

Japanese Weather News

Japan Meteorological Agency (http://www.jma.go.jp/jma/index.html)

Next-Generation Weather Prediction: Atmosphere Model

Collaboration: Japan Meteorological Agency

Dynamical process: meso-scale atmosphere model (a few km resolution over a ~2000 km domain); full 3-D Navier-Stokes equations; cloud-resolving non-hydrostatic model; compressible equations taking sound waves into consideration:

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho}\nabla P + \nu\nabla^{2}\mathbf{u} - \boldsymbol{\Omega}\times(\boldsymbol{\Omega}\times\mathbf{r}) + \mathbf{g} + \mathbf{F}$$

Physical process: cloud physics, moisture, solar radiation, condensation, latent heat release, chemical processes, boundary layer; target phenomena include tornadoes, downbursts, typhoons, and heavy rain.

"Parameterization": empirical rules involving sin, cos, exp, ...


WRF GPU Computing / Full-GPU Implementation: ASUCA

■ WRF (Weather Research and Forecast)
  Accelerator approach: initial condition → dynamics → physics → output run on the CPU, with selected modules offloaded to the GPU.
  WSM5 (WRF Single Moment 5-tracer) microphysics*: represents condensation, precipitation, and the thermodynamic effects of latent heat release. It is 1% of the lines of code but 25% of the elapsed time ⇒ a 20× boost in microphysics (1.2-1.3× overall improvement).
  WRF-Chem** simulates chemistry and aerosols from cloud scales to regional scales ⇒ 8.5× speedup.

■ ASUCA (JMA's new non-hydrostatic model): full-GPU approach, with initial condition, dynamics, and physics all running on the GPU.

* J. Michalakes and M. Vachharajani: GPU Acceleration of Numerical Weather Prediction. Parallel Processing Letters, Vol. 18, No. 4, World Scientific, Dec. 2008, pp. 531-548.
** J. C. Linford, J. Michalakes, M. Vachharajani, and A. Sandu: Multi-core acceleration of chemical kinetics for simulation and prediction. Proceedings of SC'09, ACM, 2009.
J. Ishida, C. Muroi, K. Kawano, and Y. Kitamura: Development of a new nonhydrostatic model "ASUCA" at JMA. CAS/JSC WGNE Research Activities in Atmospheric and Oceanic Modelling.

Entire Porting from Fortran to CUDA: TSUBAME 2.0 (1 GPU)

■ Rewrite from scratch: the original Fortran code at JMA was rewritten in C/C++ and then in CUDA, changing the array ordering from z,x,y (k,i,j) to x,z,y (i,k,j).

■ Single-GPU speedup: 12× compared to a Xeon X5670 socket, 50× compared to one CPU core.

■ One year by one Ph.D. student, introducing many optimizations: overlapping the computation with the communication, kernel fusion, kernel re-ordering, ... (see the indexing sketch below).
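A sketch of what the (k,i,j) → (i,k,j) re-ordering means for memory coalescing (illustrative index macros and kernel, not ASUCA's actual code):

#define NX 256
#define NY 256
#define NZ 64

// Fortran-style layout: z (k) fastest; threads indexed in x stride in memory.
#define IDX_KIJ(k, i, j) ((k) + NZ * ((i) + NX * (j)))
// Re-ordered layout: x (i) fastest; adjacent threads touch adjacent words.
#define IDX_IKJ(i, k, j) ((i) + NX * ((k) + NZ * (j)))

__global__ void stencilX(const float *u, float *unew)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread index runs in x
    int k = blockIdx.y;
    int j = blockIdx.z;
    if (i > 0 && i < NX - 1) {
        // With (i,k,j) ordering this load is coalesced across the warp;
        // with (k,i,j) ordering each thread would read NZ words apart.
        unew[IDX_IKJ(i, k, j)] =
            0.5f * (u[IDX_IKJ(i + 1, k, j)] + u[IDX_IKJ(i - 1, k, j)]);
    }
}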

ASUCA Typhoon Simulation: 5 km horizontal resolution, 479 × 466 × 48 mesh

TSUBAME 2.0 Weak Scaling

145.0 TFLOPS single precision and 76.1 TFLOPS double precision on 3990 Tesla M2050 (Fermi-core) GPUs.

■ SC'10 technical paper: Best Student Award finalist


ASUCA Typhoon Simulation: 500 m horizontal resolution, 4792 × 4696 × 48 mesh, using 437 GPUs

Air Flow over a Wide Area of Tokyo

Computed with the lattice Boltzmann method.


Lattice Boltzmann Method with LES (Large-Eddy Simulation)

Time evolution of the distribution functions $f_i$ on a D3Q19 lattice (directions 0-18), with BGK collision:

$$\frac{\partial f_i}{\partial t} + \mathbf{e}_i \cdot \nabla f_i = -\frac{1}{\tau}\left(f_i - f_i^{eq}\right)$$

split into a collision step and a streaming step. Equilibrium distribution:

$$f_i^{eq} = w_i\,\rho\left[1 + \frac{3}{c^2}\,\mathbf{e}_i\cdot\mathbf{u} + \frac{9}{2c^4}(\mathbf{e}_i\cdot\mathbf{u})^2 - \frac{3}{2c^2}\,\mathbf{u}\cdot\mathbf{u}\right]$$

The relaxation time $\tau$ for the LES model is set from the sum of the molecular (GS) kinetic viscosity and the eddy (SGS) viscosity. Strongly memory-bound problem. [Figure: D3Q19 velocity set; energy spectrum for the LES model.] A collision-streaming sketch follows below.
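A minimal sketch of one BGK collision + streaming update (illustrative names, periodic boundaries; the production code is far more heavily optimized):

#define NX 128
#define NY 128
#define NZ 128
#define NCELL (NX * NY * NZ)

// D3Q19 velocity set and weights, filled in from the host at startup.
__constant__ int   ex[19], ey[19], ez[19];
__constant__ float w[19];

__global__ void lbmStep(const float *f, float *fnew, const float *rho,
                        const float *ux, const float *uy, const float *uz,
                        float omega)   // omega = 1/tau
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y, z = blockIdx.z;
    if (x >= NX) return;
    int c = x + NX * (y + NY * z);
    float uu = ux[c]*ux[c] + uy[c]*uy[c] + uz[c]*uz[c];

    for (int q = 0; q < 19; q++) {
        float eu  = ex[q]*ux[c] + ey[q]*uy[c] + ez[q]*uz[c];
        // Equilibrium distribution in lattice units (c = 1).
        float feq = w[q] * rho[c] * (1.0f + 3.0f*eu + 4.5f*eu*eu - 1.5f*uu);
        // Collision step: BGK relaxation toward equilibrium.
        float fq    = f[q * NCELL + c];
        float fpost = fq - omega * (fq - feq);
        // Streaming step: push to the neighbor in direction q (periodic wrap).
        int xn = (x + ex[q] + NX) % NX;
        int yn = (y + ey[q] + NY) % NY;
        int zn = (z + ez[q] + NZ) % NZ;
        fnew[q * NCELL + xn + NX * (yn + NY * zn)] = fpost;
    }
}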

SGS Model

Smagorinsky model
  ○ simple
  △ inaccurate for flows with wall boundaries
  △ empirical tuning of the constant model coefficient

Dynamic Smagorinsky model
  ○ applicable to wall boundaries
  △ complicated calculation
  △ averaging process over a wide area → not available for complex-shaped bodies, not suitable for large-scale problems

Coherent-Structure Smagorinsky model*
  ○ applicable to wall boundaries
  ○ model coefficient determined locally, from the second invariant of the velocity-gradient tensor

* H. Kobayashi, Phys. Fluids 17 (2005).

Computational Area

Major part of Tokyo, 10 km × 10 km, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku, Shibuya, and Shinagawa. Building data: Pasco Co. Ltd. TDM 3D. Map © 2012 Google, ZENRIN.


Weak Scalability

600 TFLOPS on 4000 GPUs


DriVer: BMW-Audi (Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München)

TSUBAME 2.0 → 2.5

▲ TSUBAME 2.5 (overlap): 1142 TFLOPS on 3968 GPUs (288 GFLOPS/GPU)
● TSUBAME 2.0 (overlap): 149 TFLOPS on 1000 GPUs (149 GFLOPS/GPU)

Performance improvement: ×1.93 [Plot: TFLOPS vs. number of GPUs]

Development of New Materials

Mechanical structure and material microstructure.

Low-carbon society

Improving fuel efficiency by reducing the weight of transportation equipment.

Developing lightweight, high-strength materials by controlling the microstructure.

Dendritic Growth


Phase-Field Model: Al-Si Binary Alloy

The phase-field model is derived from non-equilibrium statistical physics. The phase field $\phi$ takes $\phi = 0$ in phase A and $\phi = 1$ in phase B, with a diffusive interface of finite thickness where $0 < \phi < 1$.

Time evolution of the phase field $\phi$: Allen-Cahn equation.
Time evolution of the concentration $c$: diffusion equation.

Requirements of the Finite Difference Method / Peta-scale Computing

Previous works were 2D computations; a ×1000 larger 3D computation runs on TSUBAME 2.0. A single dendrite is at the mm scale.

3D finite-difference stencils: the phase field $\phi_{i,j,k}$ is solved with a 19-point stencil and the concentration $c_{i,j,k}$ with a 7-point stencil, each spanning the planes $z = k-1$, $z = k$, $z = k+1$. A sketch of the 7-point update follows below.
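A minimal sketch of a 7-point stencil update for c, written as a generic explicit diffusion step (the actual phase-field right-hand side is more involved):

#define NX 512
#define NY 512
#define NZ 512

// Illustrative 7-point stencil: explicit diffusion update for c.
__global__ void diffuse7pt(const float *c, float *cnew, float coef)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z;
    if (i < 1 || i >= NX - 1 || j < 1 || j >= NY - 1 ||
        k < 1 || k >= NZ - 1) return;
    int id = i + NX * (j + NY * k);
    cnew[id] = c[id] + coef * (c[id - 1]       + c[id + 1]
                             + c[id - NX]      + c[id + NX]
                             + c[id - NX * NY] + c[id + NX * NY]
                             - 6.0f * c[id]);
}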

Domain Decomposition / Weak Scaling

2D decomposition: one GPU per sub-domain; boundary data moves (1) GPU → CPU (cudaMemcpy), (2) CPU → CPU (MPI), (3) CPU → GPU, as described earlier.

Weak scaling at 4096 × 6400 × 12800 mesh on 4000 (40 × 100) GPUs + 16,000 CPU cores (NVIDIA Tesla M2050 / Intel Xeon X5670 2.93 GHz on TSUBAME 2.0), mesh size 4096 × 160 × 128 per GPU:

□ GPU-only (no overlapping)
◯ Hybrid-YZ (y, z boundaries computed by the CPU)
▲ Hybrid-Y (y boundary computed by the CPU)

The Hybrid-Y method reaches 2.0000045 PFLOPS in single precision (GPU: 1.975 PFLOPS, CPU: 24.69 TFLOPS), an efficiency of 44.5% (2.000 PFLOPS / 4.497 PFLOPS peak).

TSUBAME 2.0 → 2.5 Performance Improvement

Mesh size per subdomain (1 GPU + 4 CPU cores): 4096 × 162 × 130.

TSUBAME 2.5: 3.406 PFLOPS, 4,096 × 5,022 × 16,640 mesh (3,968 GPUs + 15,872 CPU cores)
TSUBAME 2.0: 2.000 PFLOPS, 4,096 × 6,480 × 13,000 mesh (4,000 GPUs + 16,000 CPU cores)

Improvement: ×1.75

Dendrite Analysis

Simulation of Granular Materials

DEM (Distinct Element Method): contact interaction between particles.
  Normal direction: spring + viscosity (dashpot)
  Tangential direction: spring + friction + viscosity

Contact force (spring-dashpot form):

$$\mathbf{F}_{ij} = -k\,\Delta\mathbf{x}_{ij} - \eta\,\Delta\dot{\mathbf{x}}_{ij}$$

with spring constant $k$ and damping coefficient $\eta$. A force sketch follows below.
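A sketch of the normal-direction spring-dashpot force between two spheres of radius r (simplified and illustrative; the tangential spring-friction terms are omitted):

#include <cuda_runtime.h>

// Force on particle i from particle j: linear spring on the overlap plus
// viscous damping on the normal relative velocity (names are illustrative).
__device__ float3 contactForce(float3 xi, float3 xj, float3 vi, float3 vj,
                               float r, float k, float eta)
{
    float3 d = make_float3(xj.x - xi.x, xj.y - xi.y, xj.z - xi.z);
    float dist = sqrtf(d.x*d.x + d.y*d.y + d.z*d.z);
    float overlap = 2.0f * r - dist;                  // > 0 when in contact
    if (overlap <= 0.0f || dist < 1e-12f)
        return make_float3(0.0f, 0.0f, 0.0f);
    float3 n = make_float3(d.x / dist, d.y / dist, d.z / dist);  // i -> j
    float vn = (vj.x - vi.x) * n.x + (vj.y - vi.y) * n.y
             + (vj.z - vi.z) * n.z;                   // normal relative speed
    float fmag = k * overlap - eta * vn;              // repulsion + damping
    return make_float3(-fmag * n.x, -fmag * n.y, -fmag * n.z);
}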

DEM Simulation on Multi-GPU / Particles Across Boundaries

■ Load balance among GPUs to keep parallel efficiency
■ De-fragmentation of GPU memory
■ Particle search on the GPU to move the boundary

The boundary line between Node 1 and Node 2 is moved from its old position toward the region with many particles and away from the region with few particles; the new boundary position is located with a two-pass particle search (1st search, then 2nd search).


De-fragmentation / Optimization of De-fragmentation Frequency

As some particles are sent to neighbor sub-domains and new ones arrive, the particle arrays in GPU memory become fragmented. Re-numbering (executed on the CPU) compacts the arrays, but its computational time increases linearly, so the frequency must be tuned: the time variation was compared for no sorting and for re-numbering every 20, 100, 1000, and 2000 steps. A GPU-side variant of the idea is sketched below.
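The slides note that re-numbering is executed on the CPU; as an alternative illustration of the same compaction, a GPU-side sort with Thrust (illustrative names), ordering particles by cell id so their data become contiguous again:

#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/sort.h>

// Re-number (compact) fragmented particle arrays by sorting on the cell id,
// so that particles in the same cell end up contiguous in memory again.
void defragment(thrust::device_vector<int>   &cellId,
                thrust::device_vector<float> &posX,
                thrust::device_vector<float> &posY,
                thrust::device_vector<float> &posZ)
{
    thrust::sort_by_key(
        cellId.begin(), cellId.end(),
        thrust::make_zip_iterator(thrust::make_tuple(
            posX.begin(), posY.begin(), posZ.begin())));
}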


Benchmark Test

The computational field is divided into 64 sub-domains, and a single GPU is assigned to each sub-domain.

Neighbor Particle List

Linked-list method: for each cell of the local domain, a head array stores the first particle index, and each particle stores the index of the next one, terminated by NULL. 87 percent of the memory usage is reduced compared to a regular neighbor list. A build sketch follows below.
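A minimal sketch of building such a linked list on the GPU (illustrative names; -1 plays the role of NULL and head[] is pre-filled with -1):

// One atomic exchange per particle pushes it onto its cell's list:
// head[c] holds the most recently inserted particle of cell c, and
// next[p] chains to the particle inserted before it (-1 terminates).
__global__ void buildLinkedList(const int *cellOf, int *head, int *next, int n)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;
    int c = cellOf[p];                   // cell containing particle p
    next[p] = atomicExch(&head[c], p);   // insert p at the head of cell c
}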

Granular Conveyor / Granular Stirring Simulation

64 GPUs, 5 million particles in each simulation.

Gas-Liquid Two-Phase Flow / Level-Set Method (LSM)

Mesh method (surface capture) vs. particle method (e.g., SPH): particle methods give low accuracy with fewer than 10^6-10^7 particles, cannot resolve splash, and suffer numerical noise and unphysical oscillation; mesh methods give high accuracy with more than 10^8-10^9 mesh points. (Fig.: T. Himeno et al., Trans. JSME B 65-635 (1999), pp. 2333-2340.)

Numerical scheme:
■ Navier-Stokes solver: fractional step
■ Time integration: 3rd-order TVD Runge-Kutta
■ Advection term: 5th-order WENO
■ Diffusion term: 4th-order finite difference
■ Poisson equation: MG-BiCGStab
■ Surface tension: CSF model
■ Surface capture: CLSVOF (THINC + Level-Set)

The Level-Set method uses the signed distance function $\phi$ to capture the interface, which is represented by the zero-level set (zero contour); $H(\phi)$ is the Heaviside function. The Level-Set function is re-initialized to remain a distance function.
Advantages: curvature calculation, interface boundary conditions. Drawback: volume conservation.

Continuous Surface Force (CSF) Model

The CSF model of Brackbill, Kothe and Zemach (1992) transforms the interfacial surface force into a volume force in the region near the interface via an approximate delta function; the surface tension force is built from the curvature and the interface normal vector.

Anti-Diffusive Interface Capture: THINC Scheme

THINC (tangent of hyperbola for interface capturing) [Xiao et al., Int. J. Numer. Meth. Fluids 48 (2005) 1023]:
・a VOF (volume-of-fluid)-type interface-capturing method
・flux computed from a tangent-of-hyperbola profile
・semi-Lagrangian time integration
・a 1D implementation can be applied to 2D and 3D → simple
・finite-volume-like usage: since THINC only computes fluxes, the three kernels (x, y, z) can be fused into one kernel, a merit for memory reads and writes

Sparse Matrix Solver: BiCGStab + MG

Collaboration with Mizuho Information & Research Institute. Solve $A\mathbf{x} = \mathbf{b}$ for a sparse matrix $A$.
・Krylov subspace methods: CG, BiCGStab, GMRes, ...
・Preconditioners $M$: incomplete Cholesky, ILU, MG, AMG, block-diagonal Jacobi, ...
・Non-zero packing: CRS → ELL, JDL

Preconditioned BiCGStab iteration:
Set $k = 0$, $\ \mathbf{r}_0 = \mathbf{p}_0 = M^{-1}(\mathbf{b} - A\mathbf{x}_0)$
for $k = 0, 1, \ldots, N$:
  $\alpha_k = \langle\mathbf{r}_0, \mathbf{r}_k\rangle \,/\, \langle\mathbf{r}_0, M^{-1}A\mathbf{p}_k\rangle$
  $\mathbf{q}_k = \mathbf{r}_k - \alpha_k M^{-1}A\mathbf{p}_k$
  $\omega_k = \langle M^{-1}A\mathbf{q}_k, \mathbf{q}_k\rangle \,/\, \langle M^{-1}A\mathbf{q}_k, M^{-1}A\mathbf{q}_k\rangle$
  $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{p}_k + \omega_k\mathbf{q}_k$
  $\mathbf{r}_{k+1} = \mathbf{q}_k - \omega_k M^{-1}A\mathbf{q}_k$
  if $\langle\mathbf{r}_{k+1}, \mathbf{r}_{k+1}\rangle < \varepsilon\,\langle\mathbf{b}, \mathbf{b}\rangle$: exit
  $\beta_k = (\alpha_k/\omega_k)\,\langle\mathbf{r}_0, \mathbf{r}_{k+1}\rangle \,/\, \langle\mathbf{r}_0, \mathbf{r}_k\rangle$
  $\mathbf{p}_{k+1} = \mathbf{r}_{k+1} + \beta_k\left(\mathbf{p}_k - \omega_k M^{-1}A\mathbf{p}_k\right)$

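As an illustration of the CRS → ELL re-packing named above, a minimal ELL-format sparse matrix-vector product, the kernel at the heart of each BiCGStab iteration (layout and names are assumptions, not the production solver):

// ELL format: every row is padded to maxNnz entries and stored column-major
// (val[r + nRows * slot]) so that consecutive threads read consecutive words.
__global__ void spmvEll(int nRows, int maxNnz,
                        const int *colIdx, const float *val,
                        const float *x, float *y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRows) return;
    float sum = 0.0f;
    for (int s = 0; s < maxNnz; s++) {
        int c = colIdx[r + nRows * s];   // padded slots store c = -1
        if (c >= 0) sum += val[r + nRows * s] * x[c];
    }
    y[r] = sum;
}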

MG V-Cycle

Smoother: Red & Black ILU. $A^F$: fine matrix, $A^C$: coarse matrix, $R$: restriction, $P$: prolongation; the coarse-grid correction equation is formed as $A^C = R\,A^F P$.

Starting from the Poisson equation on the fine grid, $L^0\phi^0 = S^0$, the residual is restricted down through the correction equations $L^1 v^1 = R^1$, $L^2 v^2 = R^2$, $L^3 v^3 = R^3$, $L^4 v^4 = R^4$, and the corrections are prolongated back up, advancing the solution from step $n$ to step $n+1$.

Milk Crown


Drop on a Dry Floor

Industrial Application: Steering Oil


SUMMARY

■ Peta-scale GPU applications successfully running on TSUBAME 2.0/2.5:
  ・Meso-scale weather prediction
  ・Lattice Boltzmann method
  ・Gas-liquid two-phase flow ... ongoing
  ・Phase-field model for developing new materials

■ GPU green computing: less electrical power to get application results.
