Petascaling Commodity onto Exascale with GPUs: TSUBAME1.2 onto TSUBAME2.0

Satoshi Matsuoka, Professor/Dr.Sci.

Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology

GPUs as Commodity Massively Parallel Vector Processors for Ultra Low Power

• E.g., NVIDIA GPUs, AMD FireStream
– High peak performance (> 1 TFlops): good for tightly coupled code, e.g. N-body
– High memory bandwidth (> 100 GB/s): good for sparse codes, e.g. CFD
– Low latency over shared memory: thousands of threads hide latency with zero overhead
– Slow, parallel, and efficient vector engines for HPC
– Restrictions: limited non-stream memory access, PCI-Express overhead, the programming model, etc.
• How do we exploit them, given our vector-computing experience?

GPUs as Commodity Vector Engines for Ultra Low Power

Unlike conventional accelerators, GPUs have high memory bandwidth as well as ultra low power.

Since the latest high-end GPUs support double precision, GPUs also work as commodity vector processors, and their target application area is very wide. (Chart: computational density vs. memory-access frequency, positioning N-body, MatMul, dense HPC kernels, and FFT relative to cache-based CPUs, accelerators, and vector processors.) Restrictions remain: limited non-stream memory access, PCI-Express overhead, etc. How do we utilize them easily?

From CPU-Centric to GPU-Centric Nodes for Scaling

CPU-“centric” node: DDR3 with 4-8 GB and 15-20 GB/s STREAM per socket, roughly 2 GB of dataset per core. GPU-centric node: GDDR5 with > 150 GB/s STREAM per socket; the CPU and GPU datasets are isomorphic, with the CPU's roles reduced to the OS, services, and irregular/sparse work. Node fabric: PCIe x16 (4 GB/s) to the GPU, PCIe x8 (2 GB/s) links to IB QDR HCAs (40 Gbps each, x n), and ~200 GB of flash at ~400 MB/s per node.

All-to-All 3-D Protein Docking Challenge (Collaboration with Yutaka Akiyama, Tokyo Tech.)

Fast PPI (protein-protein interaction) candidate screening with FFT: a 1,000 x 1,000 all-to-all docking fitness evaluation (15 deg. pitch) will take only 1-2 months with a 32-node HPC-GPGPU cluster (128 GPGPUs).

cf. ~500 years with a single CPU (sustained 1+ GFlops), or ~2 years with a 1-rack BlueGene/L (Blue Protein system, CBRC, AIST; 4 racks, 8192 nodes).

Algorithm for 3-D All-to-All Protein Docking

For each candidate pair (Protein 1 as receptor, Protein 2 as ligand): (1) generate 3-D voxel data for Protein 1; (2) generate 3-D voxel data for Protein 2; (3) 3-D convolution via a 3-D FFT and its inverse, on NVIDIA CUDA; (4) rotate the ligand (6 deg. or 15 deg. pitch); (5) cluster analysis of candidate docking sites on the host CPUs, yielding a docking confidence level for the complex.

Calculation for a single protein-protein pair: ~200 Tera-ops — a 3-D complex convolution, O(N^3 log N) with typically N = 256, over R = 54,000 possible rotations (6 deg. pitch). The full 1000 x 1000 screening therefore needs ~200 Exa-ops.
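As a rough consistency check (assuming the standard 5·N·log2 N flop count per 1-D FFT, i.e. 5·N^3·log2(N^3) per 3-D FFT):

$$2 \times 5N^{3}\log_2(N^{3})\Big|_{N=256} \approx 4\times10^{9}\ \text{ops per rotation},\qquad 4\times10^{9}\times 54{,}000 \approx 2\times10^{14}\ \text{ops per pair},\qquad \times\,10^{6}\ \text{pairs} \approx 2\times10^{20}\ \text{ops}.$$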

High Performance 3-D FFT on NVIDIA CUDA GPUs [Nukada et al. SC08]

The Fast Fourier Transform is an O(N log N) algorithm with O(N) memory access.

Acceleration of FFT is therefore much harder than N-body, MatMul, etc., but possible: at least the GPUs' theoretical memory bandwidth is much higher than that of CPUs.

Multi-row FFT algorithm for GPU [SC08]

The algorithm accesses multiple streams (e.g. 4 input and 4 output streams), but each stream is accessed contiguously, so loads and stores stay coalesced. Since each thread computes an independent set of small FFTs (e.g. 4-point FFTs), many registers are required. For a 256-point FFT, two passes of 16-point FFT kernels are used.
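To make the "many independent small FFTs per thread, contiguous streams" idea concrete, here is a minimal, hedged CUDA sketch of a single radix-4 pass in a Stockham-style layout. It is not the SC08 multi-row kernel; a complete multi-pass FFT would additionally apply twiddle factors between passes.

```cuda
// Hedged sketch: one radix-4 pass where each thread owns a small independent FFT,
// while the four streams it touches are accessed contiguously (coalesced) across
// the block. Layout and names are assumptions, not the actual SC08 code.
#include <cuda_runtime.h>

__device__ inline float2 cadd(float2 a, float2 b) { return make_float2(a.x + b.x, a.y + b.y); }
__device__ inline float2 csub(float2 a, float2 b) { return make_float2(a.x - b.x, a.y - b.y); }
__device__ inline float2 cmul_minus_i(float2 a)   { return make_float2(a.y, -a.x); } // a * (-i)

// N complex points, processed as 4 contiguous streams of length N/4.
__global__ void radix4_pass(const float2* __restrict__ in, float2* __restrict__ out, int N)
{
    int quarter = N / 4;
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one 4-point FFT per thread
    if (t >= quarter) return;

    // Coalesced loads: consecutive threads read consecutive elements of each stream.
    float2 x0 = in[t];
    float2 x1 = in[t + quarter];
    float2 x2 = in[t + 2 * quarter];
    float2 x3 = in[t + 3 * quarter];

    // 4-point DFT butterfly, kept entirely in registers (hence the register pressure).
    float2 a = cadd(x0, x2), b = csub(x0, x2);
    float2 c = cadd(x1, x3), d = csub(x1, x3);
    float2 X0 = cadd(a, c);
    float2 X1 = cadd(b, cmul_minus_i(d));   // b - i*d
    float2 X2 = csub(a, c);
    float2 X3 = csub(b, cmul_minus_i(d));   // b + i*d

    // Coalesced stores into four contiguous output streams.
    out[t]               = X0;
    out[t + quarter]     = X1;
    out[t + 2 * quarter] = X2;
    out[t + 3 * quarter] = X3;
}
```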

Performance of 1-D FFT
Note: an earlier engineering sample clocked at 1.3 GHz is used for the Tesla S1070.

3-D FFT Performance Comparison on various CPUs, CellBE, and GPUs

In this comparison the GPU reaches roughly 140 GFLOPS.

Overcoming the PCI-e BW Limitation: Keeping the Protein Data on the GPU

The entire docking main loop runs on the GPGPU card: generate the receptor and ligand grids, rotate the ligand, forward 3-D FFTs, point-wise multiplication (MUL), inverse 3-D FFT, and finally a search for the maximum scores (values). Only the scores return to the host, so host-to-device and device-to-host transfers are a small fraction of the measured time per pair (timing chart, in ms, for 8800 GT, 8800 GTS, and 8800 GTX). Conclusion: since the main loop lives entirely on the card, pack as many cards into each node as possible.
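A hedged host-side sketch of this keep-it-on-the-GPU main loop, assuming cuFFT; the grid-generation, rotation, multiply, and reduction kernels are placeholders, not the authors' code.

```cuda
// Hedged sketch of the "keep everything on the GPU" docking loop, assuming cuFFT.
// Kernel names (generate_grid, rotate_ligand, pointwise_mul_conj, reduce_max) are
// illustrative placeholders only.
#include <cufft.h>
#include <cuda_runtime.h>

void dock_pair(cufftComplex* d_receptor, cufftComplex* d_ligand,
               cufftComplex* d_work, int N /* grid size per axis */,
               int num_rotations, float* h_best_scores)
{
    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_C2C);

    // The receptor grid is generated and transformed once, then stays resident.
    // generate_grid<<<...>>>(d_receptor, ...);                 // placeholder
    cufftExecC2C(plan, d_receptor, d_receptor, CUFFT_FORWARD);

    for (int r = 0; r < num_rotations; ++r) {
        // rotate_ligand<<<...>>>(d_ligand, r, ...);            // placeholder: voxelize rotated ligand
        cufftExecC2C(plan, d_ligand, d_work, CUFFT_FORWARD);

        // Correlation in Fourier space: multiply by the conjugate receptor spectrum.
        // pointwise_mul_conj<<<...>>>(d_work, d_receptor, N*N*N);  // placeholder

        cufftExecC2C(plan, d_work, d_work, CUFFT_INVERSE);

        // Only one score per rotation leaves the card, keeping PCI-e traffic tiny.
        // reduce_max<<<...>>>(d_work, N*N*N, d_score);          // placeholder
        // cudaMemcpy(&h_best_scores[r], d_score, sizeof(float), cudaMemcpyDeviceToHost);
    }
    cufftDestroy(plan);
}
```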

Heavily GPU Accelerated Windows HPC Prototype Cluster

• 32 compute nodes, 128 NVIDIA 8800 GTS GPUs, one head node
• Gigabit Ethernet network; three 40U rack cabinets
• Windows Compute Cluster Server 2008, Visual Studio 2005 SP1, NVIDIA CUDA 2.x

Performance Estimation of 3D PPD, Single Node

System                    Power (W)    Peak (GFLOPS)  3D-FFT (GFLOPS)  Docking (GFLOPS)  Nodes per 40U rack
Blue Gene/L               20           5.6            -                1.8               1024
TSUBAME node (Opteron)    1000 (est.)  76.8 (DP)      18.8 (DP)        26.7 (DP)         10
8800 GTS x4 node          570          1664           256              207               8~13

System total (DP = double precision; TSUBAME counted with CPUs only):

System                                     # of nodes        Power (kW)  Peak (TFLOPS)  Docking (TFLOPS)  MFLOPS/W
Blue Gene/L (Blue Protein @ AIST, Japan)   4096 (4 racks)    80          22.9           7.0               87.5
TSUBAME (Opteron only)                     655 (~70 racks)   ~700        50.3 (DP)      17.5 (DP)         25
GPU-accelerated WinHPC                     32 (4 racks)      18          53.2           6.5               361

This can compute the 1000 x 1000 screening in 1 month (15 deg. pitch) or 1 year (6 deg. pitch). On the full TSUBAME 1.2, at ~100 TFlops and 1 MW, that corresponds to ~52 racks of BG/L.

TSUBAME 1.2 Experimental Evolution (Oct. 2008): The First “Petascale” SC in Japan

• Storage: 1.5 Petabytes (Sun x4500 x 60, 48 x 500 GB disks each) plus 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 60 GB/s aggregate I/O BW
• Network: unified InfiniBand (Voltaire ISR9288 x8, ~1310+50 ports), ~13.5 Terabits/s (3 Tbits/s bisection), 10 Gbps x2 uplinks plus 10 Gbps external NW
• Compute: 10,000 CPU cores — Sun x4600 (16 Opteron cores each), 32-128 GBytes/node, 10,480 cores / 655 nodes, 21.4 TeraBytes of memory, 50.4 TeraFlops; NEC SX-8i
• GCOE TSUBASA: Harpertown Xeon, 90 nodes / 720 CPUs, 8.2 TeraFlops
• ClearSpeed CSX600 SIMD accelerators: 648 boards, 52.2 TeraFlops SFP/DFP
• NEW deployment: 170 NVIDIA Tesla S1070 units (~680 Tesla cards) — 300,000 SIMD cores, > 20 million threads, ~900 TFlops SFP / ~170 TFlops DFP, 80 TB/s memory BW (1/2 of the Earth Simulator)
• OS: Linux (SuSE 9, 10); NAREGI Grid MW
• High performance in many BW-intensive apps, with only a 10% power increase over TSUBAME 1.0
• The 680-unit Tesla installation was carried out while TSUBAME was in production service (!)

TSUBAME 1.2 Node Configuration

NVIDIA Tesla T10: 55 nm, ~470 mm², 1.4 billion transistors, organized as Thread Processor Clusters (TPCs) of Thread Processor Arrays (TPAs, i.e. SMs) of SPs — “powerful scalar” cores delivering > 1 TFlops SFP and 90 GFlops DFP per GPU.

Per node:
• x86 host: 16 Opteron cores @ 2.4 GHz, 80 GFlops, 32 GB memory, 20 GBytes/s
• 2 Tesla accelerators, each with 240 SP cores (30 SMs) @ 1.4 GHz, 4 GB GDDR3 at 102 GBytes/s, 1.08 TFlops SFP / 90 GFlops DFP, attached via PCI-e x8 gen1 (2 GB/s)
• Dual-rail x4 IB SDR: 2 x 1 GB/s

TSUBAME in Top500 Ranking

                 Jun 06   Nov 06   Jun 07   Nov 07   Jun 08   Nov 08   Jun 09
Rmax (TFlops)    38.18    47.38    48.88    56.43    67.70    77.48    87.01
Rank             7        9        14       16       24       29       41

[HPDC 2008]  (Hardware added over time: Opteron; ClearSpeed x360; ClearSpeed x648; Xeon; Tesla)

• Performance improved on six consecutive lists
• The 2nd fastest heterogeneous system in the world (No. 1 is Roadrunner)

Linpack on TSUBAME: Heterogeneity at Work [Endo et al. IPDPS08, etc.]

• A Linpack implementation that cooperatively uses:
– 10,368 Opteron cores
– 612 Tesla GPUs
– 648 ClearSpeed accelerators
– 640 Xeon cores (another cluster, named “TSUBASA”)
• A different strategy than on Roadrunner is required
• 87.01 TFlops: the world's highest Linpack performance on a GPU-accelerated cluster

Electric Power Consumption
• About 1090 kW during Linpack (an estimate from sampling; cooling and network are excluded)
• Component breakdown:

Component     DP peak (TF)  Linpack (TF)  Power (kW)
Opteron       49.766        25.87         920.16
Xeon          7.253         3.48          30.4
ClearSpeed    52.254        27.37         16.2
Tesla         53.913        30.28         124.8
Total         163.2         87.0          ~1092

• Even for dense problems, GPUs are effective low-power vector-like engines: higher performance with a much lower memory footprint

GPU vs. Standard Many Cores?

(Diagram: a GPU — many SMs, each with 8 SPs and a small shared memory, fed by six GDDR3 memory controllers — versus a standard many-core with a modest grid of fat, cache-based PEs.)

• Consider strong scaling of a simple stencil computation, e.g. SOR.

Multi-GPU in CFD: RIKEN Himeno Benchmark (Joint Work w/ NEC)

The Himeno benchmark solves a Poisson equation for the pressure p in generalized coordinates, containing all six second derivatives ∂²p/∂x², ∂²p/∂y², ∂²p/∂z², ∂²p/∂x∂y, ∂²p/∂x∂z, ∂²p/∂y∂z. It is discretized with 2nd-order central differences — the cross terms via, e.g., (p(i+1,j+1,k) − p(i+1,j−1,k) − p(i−1,j+1,k) + p(i−1,j−1,k)) / (4 Δx Δy).

Himeno for CUDA: each CUDA block computes a 16 x 16 x 8 region — loaded with halos as (16+2) x (16+2) x (8+2) — using 256 threads and ~16 kB of shared memory; 256 blocks give 65,536 threads in total.
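As an illustration of this block decomposition, here is a hedged CUDA sketch (not the actual Himeno port): a shared-memory-tiled 3-D stencil marching along z; for brevity it applies a 7-point Laplacian rather than the full 18-neighbour generalized-coordinate stencil.

```cuda
// Hedged sketch: shared-memory tiled 3-D stencil in the spirit of the decomposition
// above (16x16 tile in x-y, marching along z). Names and tile sizes are assumptions.
#include <cuda_runtime.h>

#define TX 16
#define TY 16

__global__ void stencil7(const float* __restrict__ p, float* __restrict__ pn,
                         int nx, int ny, int nz)
{
    __shared__ float tile[TY + 2][TX + 2];

    int i = blockIdx.x * TX + threadIdx.x;   // global x
    int j = blockIdx.y * TY + threadIdx.y;   // global y
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    bool inside = (i < nx) && (j < ny);      // keep all threads alive for __syncthreads()

    for (int k = 1; k < nz - 1; ++k) {       // march along z, one x-y slab at a time
        size_t idx = (size_t)k * nx * ny + (size_t)j * nx + i;

        if (inside) {
            // Load the slab interior plus a 1-point halo into shared memory.
            tile[ty][tx] = p[idx];
            if (threadIdx.x == 0 && i > 0)           tile[ty][0]      = p[idx - 1];
            if (threadIdx.x == TX - 1 && i < nx - 1) tile[ty][TX + 1] = p[idx + 1];
            if (threadIdx.y == 0 && j > 0)           tile[0][tx]      = p[idx - nx];
            if (threadIdx.y == TY - 1 && j < ny - 1) tile[TY + 1][tx] = p[idx + nx];
        }
        __syncthreads();

        if (inside && i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            // z-neighbours come straight from global memory; x/y neighbours from the tile.
            float up   = p[idx + (size_t)nx * ny];
            float down = p[idx - (size_t)nx * ny];
            pn[idx] = (tile[ty][tx - 1] + tile[ty][tx + 1] +
                       tile[ty - 1][tx] + tile[ty + 1][tx] + up + down) / 6.0f;
        }
        __syncthreads();   // make the tile safe to overwrite in the next k iteration
    }
}
```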

Each update accesses 18 neighbour points; the boundary (halo) region is used for data transfer.

Parallelization of Himeno Benchmark

The domain (e.g. 1025 x 513 x 513) is divided along the X axis across processes, so communication of border (halo) data is required between neighbouring processes (a sketch of the exchange follows).
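A hedged sketch of the resulting halo exchange under the 1-D X-axis decomposition, assuming one ghost YZ-plane on each side and host-staged MPI transfers; buffer and variable names are illustrative, not the actual code.

```cuda
// Hedged sketch: 1-D (X-axis) halo exchange for a GPU stencil solver. Each rank owns
// nx local YZ-planes plus one ghost plane on each side; boundary planes are staged
// through host buffers and exchanged with MPI (MPI_PROC_NULL at physical boundaries).
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_x_halos(float* d_p, float* h_buf_lo, float* h_buf_hi,
                      int nx, int ny, int nz, int left, int right)
{
    size_t plane = (size_t)ny * nz;                 // one YZ-plane of the local block
    size_t bytes = plane * sizeof(float);

    // Download the two owned boundary planes (x = 1 and x = nx) to the host.
    cudaMemcpy(h_buf_lo, d_p + 1 * plane,  bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_buf_hi, d_p + nx * plane, bytes, cudaMemcpyDeviceToHost);

    // Exchange with both neighbours.
    MPI_Sendrecv_replace(h_buf_lo, (int)plane, MPI_FLOAT, left,  0, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv_replace(h_buf_hi, (int)plane, MPI_FLOAT, right, 1, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Upload the received planes into the ghost layers (x = nx+1 and x = 0).
    cudaMemcpy(d_p + (size_t)(nx + 1) * plane, h_buf_lo, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_p,                            h_buf_hi, bytes, cudaMemcpyHostToDevice);
}
```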

• Although the original Himeno benchmark supports 3-D division, our GPU version currently supports only 1-D division.

Himeno size L / L' (257 x 257 x 513): CPU vs. GPU scaling on TSUBAME 1.2, *** strong scaling ***
• On 2, 4, 8, and 16 nodes (16 CPUs and 20 GB/s per node), the CPU-only runs reach 7, 13, 22, and 26 GFLOPS, peaking at ~30 GFlops; the GPU runs (1 or 2 GPUs per node) scale from 74 up to 258 GFLOPS, peaking at ~260 GFlops.

Conjugate Gradient Solver on a Multi-GPU Cluster (A. Cevahir, A. Nukada, S. Matsuoka) [ICCS09 and follow-on]
• Hypergraph partitioning for reduction of communication — GPU computing is fast, so the communication bottleneck is even more severe
• All matrix and vector operations are implemented on the GPUs; the MxV kernel is auto-selected, and different GPUs may run different kernels (a minimal CSR variant is sketched below)
• 152 GFlops on 32 NVIDIA GPUs on TSUBAME (cf. NPB CG at ~3 GFlops on 32 TSUBAME nodes)
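A minimal sketch of the simplest such MxV kernel (scalar CSR, one row per thread); the actual solver auto-selects among several more sophisticated formats per GPU.

```cuda
// Hedged sketch: scalar CSR sparse matrix-vector (MxV) kernel, one row per thread,
// illustrating what runs on each GPU inside the CG iteration. Names are illustrative.
__global__ void spmv_csr(int nrows,
                         const int* __restrict__ rowptr,
                         const int* __restrict__ colidx,
                         const float* __restrict__ val,
                         const float* __restrict__ x,
                         float* __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows) {
        float sum = 0.0f;
        for (int jj = rowptr[row]; jj < rowptr[row + 1]; ++jj)
            sum += val[jj] * x[colidx[jj]];   // irregular gather from x: GPU memory BW helps
        y[row] = sum;
    }
}
```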

GPUs vs. CPUs on TSUBAME: performance comparisons over well-known matrices (Fast CG results (1)-(3), *** strong scaling ***, in GFLOPS) — approximately 50-100x more power efficient, thanks to the GPUs plus the algorithmic improvements.

(Faster Than) Real-Time Tsunami Simulation (Prof. Takayuki Aoki, Tokyo Tech.; ADPC: Asian Disaster Preparedness Center)
• Early-warning system: (faster-than) real-time CFD instead of data-based extrapolation, for high accuracy
• Shallow-water equations in conservative form: assuming hydrostatic balance in the vertical direction, the 3-D problem reduces to a 2-D equation
• 8 GPUs cover a 400 km x 800 km region at 100 m mesh resolution
• Asynchronous streams overlap computation and communication

Pipeline per step, sketched below: GPU → send buffer → MPI → receive buffer → GPU, with a final synchronization.
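A hedged sketch of this send-buffer / MPI / receive-buffer pipeline with CUDA streams; kernel names are placeholders, and h_send/h_recv are assumed to be pinned (cudaHostAlloc) so the asynchronous copies can actually overlap.

```cuda
// Hedged sketch: overlapping halo exchange with interior computation using CUDA streams.
// compute_interior, compute_boundary, and pack_halo are illustrative placeholders.
#include <mpi.h>
#include <cuda_runtime.h>

void step_with_overlap(float* d_field, float* d_send, float* d_recv,
                       float* h_send, float* h_recv, int halo_elems,
                       int left, int right, cudaStream_t s_comp, cudaStream_t s_comm)
{
    // 1. Launch the interior update on its own stream; it needs no halo data.
    // compute_interior<<<grid_i, block, 0, s_comp>>>(d_field);          // placeholder

    // 2. Meanwhile, pack and download the boundary layers on the communication stream.
    // pack_halo<<<grid_b, block, 0, s_comm>>>(d_field, d_send);         // placeholder
    cudaMemcpyAsync(h_send, d_send, halo_elems * sizeof(float),
                    cudaMemcpyDeviceToHost, s_comm);
    cudaStreamSynchronize(s_comm);

    // 3. Exchange halos with the neighbouring ranks (1-D decomposition: left/right).
    MPI_Sendrecv(h_send, halo_elems, MPI_FLOAT, right, 0,
                 h_recv, halo_elems, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4. Upload the received halo and update the boundary cells.
    cudaMemcpyAsync(d_recv, h_recv, halo_elems * sizeof(float),
                    cudaMemcpyHostToDevice, s_comm);
    // compute_boundary<<<grid_b, block, 0, s_comm>>>(d_field, d_recv);  // placeholder

    // 5. Both streams must finish before the next time step.
    cudaDeviceSynchronize();
}
```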

GPU Tsunami Prediction of the Northern Japan Pacific Coast

• Bathymetry data for the coast is used.

CPU scalability (chart: speed-up vs. number of CPUs, up to ~1200 CPUs, against the ideal)

GPU scalability (chart: speed-up vs. number of GPUs, up to ~32 GPUs — 16 nodes with 2 GPUs each — against the ideal)

Multi-node CPU/GPU comparison, *** strong scaling ***
• Results on TSUBAME 1.2 (log-log chart of speed-up vs. number of GPUs / CPUs / nodes): 16 GPUs outperform more than 1,000 CPUs

Lattice Boltzmann Method on Multi-GPUs [Aoki et al.]

The LBM evolution equation (BGK collision): ∂f_i/∂t + e_i · ∇f_i = −(1/τ)(f_i − f_i^eq)

Example: a flow at Re = 3000 (characteristic velocity U0), computed on 64 GPUs (Tesla S1070).

In the D3Q19 decomposition, each rank K exchanges with its neighbouring ranks (KT, KB, KF, KD) only the five distribution functions that cross the shared face — e.g. in one direction it sends f3, f7, f8, f15, f17 and receives f4, f9, f10, f16, f18; in another it sends f5, f11, f12, f15, f16 and receives f6, f13, f14, f17, f18; and so on for the remaining faces.
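A hedged sketch of packing one face's worth of distribution functions before the MPI exchange; the direction numbering, SoA layout, and names are illustrative assumptions, not the code of [Aoki et al.].

```cuda
// Hedged sketch: pack the five D3Q19 distribution functions that leave through one
// face (here the +z face) into a contiguous send buffer, so a single copy + MPI
// message moves them.
#include <cuda_runtime.h>

// f is stored as structure-of-arrays: f[dir * nx*ny*nz + cell].
__global__ void pack_plus_z(const float* __restrict__ f, float* __restrict__ sendbuf,
                            int nx, int ny, int nz)
{
    // Example set of directions with a +z component (cf. f5, f11, f12, f15, f16 above).
    const int dirs[5] = {5, 11, 12, 15, 16};

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index in the top plane
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // y index in the top plane
    if (i >= nx || j >= ny) return;

    size_t ncell = (size_t)nx * ny * nz;
    size_t plane = (size_t)nx * ny;
    size_t cell  = (size_t)(nz - 1) * plane + (size_t)j * nx + i;  // topmost z layer

    for (int d = 0; d < 5; ++d)
        sendbuf[d * plane + (size_t)j * nx + i] = f[dirs[d] * ncell + cell];
}
```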

Test lattice: 384 x 384 x 384.

Lattice Boltzmann Scaling on Multi-GPUs on TSUBAME1.2

Comparison between the performances of multi-CPU and multi-GPU computations

Dendritic Solidification of a Pure Metal with a Phase-Field Model

A 3-D phase-field model based on the Allen-Cahn equation, taking interface anisotropy into account.
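The slide's equations are only partially legible; as an assumption, a representative pure-metal anisotropic phase-field system of this kind (not necessarily with the exact coefficients used here) reads:

$$\frac{\partial \phi}{\partial t} \;=\; M\!\left[\nabla\!\cdot\!\big(W(\theta)^{2}\nabla\phi\big) \;+\; 4W\,\phi(1-\phi)\!\left(\phi-\frac{1}{2}+\frac{15L}{2W}\,\frac{T_m-T}{T_m}\,\phi(1-\phi)+a\,\chi\right)\right]$$

$$\frac{\partial T}{\partial t} \;=\; k\,\nabla^{2}T \;+\; 30\,\phi^{2}(1-\phi)^{2}\,\frac{L}{C}\,\frac{\partial \phi}{\partial t}$$

with W(θ) the anisotropic gradient coefficient, χ a uniform random number in [−0.5, 0.5], and a the noise amplitude.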

The order-parameter (phase-field) equation — with mobility M, double-well height W, anisotropy introduced through the interface energy, and a noise term of amplitude a using random numbers in [−0.5, 0.5] — is coupled to a heat-conduction equation containing a latent-heat source term.

Physical parameters: interface thickness [m], interface energy, kinetic coefficient, melting point Tm, latent heat L, thermal diffusivity, specific heat per unit volume C, anisotropy strength, and a parameter determining the interface width.

Discretization: 2nd-order central differences in space; 1st-order forward Euler time integration.

Satoi Ogawa @ Aoki Lab., GSIC, Tokyo Institute of Technology (2009)

Scalability (1 GPU/node) of the Multi-GPU Phase Field (new version of the code): 60 GPUs on TSUBAME:

The new code reaches 10 TFlops on 60 GPUs (log-log chart of performance in GFLOPS vs. number of GPUs, for problem sizes 512, 960, 1920, and 2400, each in "ch" and "noch" variants). On TSUBAME 2.0, perhaps even 1 PFlops on a 10,000³ problem?

(Satoi Ogawa @ Aoki Lab., GSIC, Tokyo Institute of Technology, 2009)

Auto-Tuning FFT for CUDA GPUs [SC09]

HPC libraries should support GPU variations:
• # of SPs/SMs, core/memory clock speeds
• Device/shared-memory sizes, memory-bank properties, etc.

So, auto-tune the various parameters:
• Selection of kernel radices (FFT-specific)
• Memory padding size, # of threads (GPU-specific)

Benchmarked against MKL 11.0, FFTW 3.2.1, and CUFFT 2.3 over transform sizes 128-512 (chart in GFLOPS, up to ~350), our auto-tuned library (NukadaFFT) is:
• ~4x more power efficient than the FFT libraries on CPUs
• 1.2-4x faster than the vendor's GPU library
• Released November 2009 (!)
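A hedged sketch of the kind of empirical parameter search described above; the configuration space and run_fft_candidate() are illustrative placeholders, not the NukadaFFT tuner.

```cuda
// Hedged sketch: empirically time candidate (radix plan, thread count, padding)
// configurations with CUDA events and keep the fastest, per GPU and transform size.
#include <cuda_runtime.h>
#include <cfloat>

struct Config { int radix_plan; int threads; int padding; };

// Placeholder: launches one candidate FFT kernel configuration on d_data.
// void run_fft_candidate(const Config& c, float2* d_data, int n);

Config autotune_fft(float2* d_data, int n)
{
    const int radix_plans[]   = {2, 4, 8, 16};   // which small-radix kernels to compose
    const int thread_counts[] = {64, 128, 256};
    const int paddings[]      = {0, 1, 2};       // extra elements to avoid shared-memory bank conflicts

    Config best{}; float best_ms = FLT_MAX;
    cudaEvent_t t0, t1; cudaEventCreate(&t0); cudaEventCreate(&t1);

    for (int r : radix_plans)
      for (int t : thread_counts)
        for (int p : paddings) {
            Config c{r, t, p};
            cudaEventRecord(t0);
            // run_fft_candidate(c, d_data, n);   // placeholder kernel launch(es)
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms = 0.0f; cudaEventElapsedTime(&ms, t0, t1);
            if (ms < best_ms) { best_ms = ms; best = c; }
        }
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return best;            // remembered per (GPU model, transform size)
}
```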

High Precision GPU Power Measurement (Reiji Suda, U. Tokyo, ULPHPC)
• “Marker” = compute + data transfer
• Markers are detected automatically
• 1000s of executions are sampled in 10s of seconds

Example: an “add” kernel and its measurement (chart: power of roughly 70-84 W vs. performance, for thread counts from nthr = 1 up to nthr >= 32), with automatic marker detection.

GPU Power Consumption Modeling

Employ learning theory to predict GPU power from performance counters:

GPU power = Σ ci×pi

pi: GPU performance counters (12 types); ci: learned coefficients.
Prediction vs. measurement over 54 GPU kernels (leave-one-out): average error 7%, although one extraneously erroneous kernel does need to be compensated for (!).
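A hedged sketch of the linear model and its leave-one-out evaluation; the least-squares fitting routine is left as a placeholder and all names are illustrative.

```cuda
// Hedged sketch: linear power model "GPU power = sum_i c_i * p_i" and its
// leave-one-out evaluation over measured kernels (host-side C++).
#include <vector>
#include <cmath>

constexpr int kNumCounters = 12;

struct Sample {
    double p[kNumCounters];   // performance-counter readings for one kernel run
    double watts;             // measured average power
};

// Placeholder: fit coefficients c[] by least squares on all samples except `skip`.
// void fit_least_squares(const std::vector<Sample>& s, int skip, double c[kNumCounters]);

double predict(const double c[kNumCounters], const double p[kNumCounters])
{
    double w = 0.0;
    for (int i = 0; i < kNumCounters; ++i) w += c[i] * p[i];  // one term per counter
    return w;
}

double leave_one_out_error(const std::vector<Sample>& samples)
{
    double rel_err_sum = 0.0;
    for (size_t k = 0; k < samples.size(); ++k) {
        double c[kNumCounters] = {0};
        // fit_least_squares(samples, (int)k, c);    // train on the other 53 kernels
        double pred = predict(c, samples[k].p);      // predict the held-out kernel
        rel_err_sum += std::fabs(pred - samples[k].watts) / samples[k].watts;
    }
    return rel_err_sum / samples.size();             // slide reports ~7% on average
}
```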

TODO: fewer counters for real-time prediction; higher-precision, non-linear modeling.

Low Power Scheduling in GPU Clusters [IEEE IPDPS-HPPAC 09]

Objective: optimize scheduling over heterogeneous CPU/GPU resources for the energy-delay product (EDP).
• Optimally schedule mixed sets of jobs that can execute on either CPU or GPU, but with different performance and power
• Assume each job's GPU acceleration factor (%) is known
• ~30% improvement in EDP (measured in 10^9 s·J) over the Fixed and ECT schedulers, for both the static and on-line versions of the proposed scheduler

TODO: a more realistic environment
• Different application power profiles
• PCI bus vs. memory conflicts: GPU applications slow down by 10% or more when co-scheduled with a memory-intensive CPU application (chart: SRAD execution-time increase vs. background CPU-process bandwidth in MB/s)
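For reference, a hedged sketch of the basic per-job CPU-vs-GPU EDP comparison underlying such a scheduler; the power figures and the Job structure are illustrative assumptions, and the actual scheduler in the paper handles whole job mixes and queues.

```cuda
// Hedged sketch: pick the device that minimizes a job's energy-delay product,
// given a known GPU acceleration factor (host-side C++).
#include <cstdio>

struct Job {
    double cpu_time_s;     // estimated runtime on the CPU
    double gpu_speedup;    // known GPU acceleration factor (e.g. 5.0 means 5x faster)
};

enum class Device { CPU, GPU };

Device choose_device(const Job& j, double cpu_watts, double gpu_watts)
{
    double t_cpu = j.cpu_time_s;
    double t_gpu = j.cpu_time_s / j.gpu_speedup;

    // EDP = energy * time = (P * t) * t = P * t^2 for each placement.
    double edp_cpu = cpu_watts * t_cpu * t_cpu;
    double edp_gpu = gpu_watts * t_gpu * t_gpu;

    return (edp_gpu < edp_cpu) ? Device::GPU : Device::CPU;
}

int main() {
    Job j{100.0, 5.0};                        // 100 s on CPU, 5x faster on GPU (assumed)
    Device d = choose_device(j, 80.0, 200.0); // assumed 80 W CPU vs. 200 W GPU deltas
    std::printf("run on %s\n", d == Device::GPU ? "GPU" : "CPU");
    return 0;
}
```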

Game Changing Cluster Design for 2010 and Beyond
• Multi-petascale clusters should be built with:
– Fat, teraflop-class, compute-dense, high-memory-bandwidth vector processors for single-node scalability
– Multithreaded shared-memory processors to hide latency, with limited memory capacity per node (i.e., GPUs)
– A high- but modest-bandwidth, low-latency, nearly full-bisection network for inter-node scalability
– High-bandwidth node memory and I/O channels to accommodate all of the above
– Node-wise non-volatile silicon storage for scalable storage I/O
– Software layers for attaining bandwidth, fault tolerance, programmability, and low power
– Such an architecture is the basis towards exascale

Tokyo Tech. TSUBAME 2.0 (Oct 2010) vs. the Entire Japan — Japan's supercomputers totalling less than 1 PFlops in 2010

(Map, by Kento Aida, National Institute of Informatics: roughly 60 Japanese supercomputer sites — national laboratories, universities, and agencies — binned into < 1 TFLOPS, 1-10 TFLOPS, 10-40 TFLOPS, and >= 40 TFLOPS tiers, totalling less than 1 PFlops in 2010 and more than $300 million.)

TSUBAME 2.0 alone (Tokyo Tech. GSIC) will be 1-X times faster than all of them combined, in both peak and real applications:
• ~3 PFlops CPU+GPU (double precision)
• ~PB/s aggregate memory BW
• ~PB of SSD, 0.5-1 TB/s I/O BW
• Several hundred Tb/s bisection bandwidth
• ~65 racks including storage (200 m²), 1 MW operational power
• Dynamic provisioning of Windows HPC + Linux

Highlights of TSUBAME 2.0 Design (Oct. 2010)
• Next-gen multi-core x86 + next-gen GPU
– 1 CPU : 1 GPU ratio (or more), several thousand nodes
– Massively parallel vector multithreading for scalability
• Near petabyte/s aggregate memory BW, with restrained memory capacity
• Modest IB-QDR bandwidth, full bisection BW
• Flash memory per node: ~petabyte total, ~terabyte/s I/O
• Low power and efficient cooling, comparable to TSUBAME 1.0 (despite ~40x speedup)
• Virtualization and dynamic provisioning of Windows HPC + Linux, job migration, etc.

TSUBAME 2.0 Performance

(Performance roadmap, 2002-2012, log scale from 1 TF to beyond 10 PF: Titech Campus Grid 1.3 TF and Earth Simulator 40 TF in 2002; KEK 59 TF BG/L + SR11100; BlueGene/L 360 TF in 2005; TSUBAME near 100 TF by 2008; US/EU and US NSF/DoE petascale systems in 2009 2H (peak); TSUBAME 2.0 in the multi-petaflops range in 2010; Japanese NLP > 10 PF in 2012 and US > 10 PF in 2011-12. Earth Simulator ⇒ TSUBAME was a 40x downsizing in 4 years; TSUBAME ⇒ TSUBAME 2.0 may be a 40x downsizing again.)