Being Very Green with Tsubame 2.5 towards 3.0 and beyond to Exascale

Satoshi Matsuoka, Professor, Global Scientific Information and Computing (GSIC) Center, Tokyo Institute of Technology; ACM Fellow / SC13 Tech Program Chair

NVIDIA Theater Presentation, 2013/11/19, Denver, Colorado

TSUBAME2.0, Nov. 1, 2010: "The Greenest Production Supercomputer in the World"

TSUBAME 2.0 New Development

[Figure: TSUBAME2.0 system overview. Memory bandwidth: >600 TB/s (full system), >12 TB/s, >1.6 TB/s, >400 GB/s (per node). Network: 220 Tbps bisection bandwidth, 80 Gbps per node. Power: 1.4 MW max (system), 35 kW max, ~1 kW max (per node). 32 nm / 40 nm process.]

Performance Comparison of CPU vs. GPU

[Figure: bar charts of peak performance (GFLOPS) and memory bandwidth (GByte/s), GPU vs. CPU]

x5-6 socket-to-socket advantage in both compute and memory bandwidth, at the same power (200W GPU vs. 200W CPU+memory+NW+…)

TSUBAME2.0 Compute Node
Thin node, productized as the HP ProLiant SL390s: 1.6 TFlops, 400 GB/s memory BW, 80 Gbps network (InfiniBand QDR x2), ~1 kW max

HP SL390G7 (developed for TSUBAME 2.0)
- GPU: NVIDIA Fermi M2050 x 3 (515 GFlops, 3 GByte memory per GPU)
- CPU: Intel Westmere-EP 2.93 GHz x 2 (12 cores/node)
- Multiple I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
- Memory: 54 or 96 GB DDR3-1333
- SSD: 60 GB x 2 or 120 GB x 2
Total system: 2.4 PFlops peak, ~100 TB memory, ~200 TB SSD

2010: TSUBAME2.0 as No.1 in Japan

TSUBAME2.0: 2.4 Petaflops total, more than all other Japanese centers on the Top500 combined (2.3 PetaFlops); #4 on the Top500, Nov. 2010

TSUBAME Wins Awards…

"Greenest Production Supercomputer in the World", the Green 500, Nov. 2010 and June 2011 (#4 on the Top500, Nov. 2010)

3 times more power efficient than a laptop!

TSUBAME Wins Awards…

ACM Gordon Bell Prize 2011 (Special Achievements in Scalability and Time-to-Solution): 2.0 Petaflops dendrite simulation, "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer"

TSUBAME Three Key Application Areas

“Of High National Interest and Societal Benefit to the Japanese Taxpayers”

1. Safety/Disaster & Environment
2. Medical & Pharmaceutical
3. Manufacturing & Materials
Plus co-design for general IT industry and ecosystem impact (IDC, Big Data, etc.)

Lattice-Boltzmann LES with Coherent-structure SGS model [Onodera & Aoki 2013]

Coherent-structure Smagorinsky model

The model parameter is locally determined by the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε).
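For concreteness, a compact way to write this local determination, following Kobayashi's coherent-structure Smagorinsky closure; the exact form and the constant C1 ≈ 1/22 are assumptions here, not taken from the slides:

```latex
% Sketch of the coherent-structure Smagorinsky closure (constants assumed, not from the slides)
\[
\nu_{\mathrm{SGS}} = C\,\Delta^{2}\,|\bar{S}|, \qquad
C = C_{1}\,|F_{\mathrm{CS}}|^{3/2}, \qquad
F_{\mathrm{CS}} = \frac{Q}{E}, \qquad C_{1} \approx 1/22,
\]
\[
Q = \tfrac{1}{2}\bigl(\bar{W}_{ij}\bar{W}_{ij} - \bar{S}_{ij}\bar{S}_{ij}\bigr), \qquad
E = \tfrac{1}{2}\bigl(\bar{W}_{ij}\bar{W}_{ij} + \bar{S}_{ij}\bar{S}_{ij}\bigr),
\qquad |\bar{S}| = \sqrt{2\,\bar{S}_{ij}\bar{S}_{ij}}
\]
```

Here S̄_ij and W̄_ij are the resolved strain-rate and vorticity tensors; Q is the second invariant and E the energy-dissipation term of the resolved velocity gradient, so the coefficient C adapts locally to coherent turbulent structures without averaging or wall damping.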

◎ Turbulent flow around a complex object
◎ Large-scale parallel computation

Computational Area – Entire Downtown Tokyo

Major part of Tokyo, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku, …

10km×10km

Building data: Pasco Co. Ltd. TDM 3D; map ©2012 Google, ZENRIN
Achieved 0.592 Petaflops using over 4000 GPUs (15% efficiency)
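As a quick cross-check of the quoted efficiency (assuming the 0.592 PFlops figure is single precision and roughly 4000 M2050 GPUs at 1.03 TFlops SFP peak each, per the node specs above):

```latex
\[
\frac{0.592\ \mathrm{PFlops}}{4000 \times 1.03\ \mathrm{TFlops}}
  \;\approx\; \frac{0.592}{4.12} \;\approx\; 0.14 \;\approx\; 15\%\ \text{of peak}
\]
```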

Area Around the Metropolitan Government Building
Flow profile at 25 m height above the ground (wind direction indicated; domain 960 m × 640 m)

Map data ©2012 Google, ZENRIN

LBM: DrivAer reference car geometry (BMW-Audi), grid 3,000 × 1,500 × 1,500, Re = 1,000,000; Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München

* Number of grid points: 3,623,878,656 (3,072 × 1,536 × 768); vehicle speed 60 km/h

* Grid resolution: 4.2 mm (domain 13 m × 6.5 m × 3.25 m)

* Number of GPUs: 288 (96 nodes)

TSUBAME2.0 Contributed Significantly to Creation of the National Earthquake Hazard Map

Industry program: TOTO Inc.

TSUBAME (150 GPUs) vs. in-house cluster: accelerating in-silico screening and data mining

Mixed-Precision AMBER on TSUBAME2.0 for Industrial Drug Discovery

x10 faster with mixed precision; nucleosome (25,095 particles)

75% energy efficient. With a development cost of $500 million ~ $1 billion per drug, a >10% improvement of the process will more than pay for TSUBAME.

Towards TSUBAME 3.0: Interim Upgrade TSUBAME2.0 to 2.5 (Sept. 10th, 2013)
• Upgrade the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X (TSUBAME2.0 compute nodes: 3 x 1408 = 4224 Fermi GPUs, replaced in summer 2013)
• SFP/DFP peak from 4.8 PF / 2.4 PF => 17 PF / 5.7 PF (c.f. the K Computer: 11.2 / 11.2)
• Considerable acceleration of important apps

Significant Capacity Improvement at low cost & w/o Power Increase

TSUBAME3.0: 2H2015

TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade
Thin node, productized as the HP ProLiant SL390s and modified for TSUBAME2.5: peak 4.08 TFlops, ~800 GB/s memory BW, 80 Gbps network (InfiniBand QDR x2), ~1 kW max

HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5)
- GPU: NVIDIA Kepler K20X x 3 (1310 GFlops, 6 GByte memory per GPU)
- CPU: Intel Westmere-EP 2.93 GHz x 2
- Multiple I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
- Memory: 54 or 96 GB DDR3-1333
- SSD: 60 GB x 2 or 120 GB x 2

GPU upgrade: NVIDIA Fermi M2050 (1039/515 GFlops SFP/DFP) => NVIDIA Kepler K20X (3950/1310 GFlops SFP/DFP)

TSUBAME2.0 => 2.5 Changes
• Doubled to tripled performance
  – 2.4 (DFP) / 4.8 (SFP) Petaflops => 5.76 (x2.4) / 17.1 (x3.6)
  – Preliminary results: ~2.7 PF Linpack (x2.25), ~3.4 PF dendrite GB app (x1.7)
• Bigger and higher-bandwidth GPU memory
  – 3 GB => 6 GB per GPU, 150 GB/s => 250 GB/s
• Higher reliability
  – Resolved a minor HW bug; compute-node fail-stop occurrences expected to decrease by up to 50%
• Lower power
  – Observing ~20% drop in power/energy (tentative)
• Better programmability with new GPU features
  – Dynamic tasks, Hyper-Q, CPU/GPU shared memory
• Prolongs TSUBAME2 lifetime by at least 1 year
  – TSUBAME 3.0 in FY2015 Q4

TSUBAME2.0 vs. TSUBAME2.5 thin node (x 1408 units):
- Node machine: HP ProLiant SL390s (no change)
- CPU: Intel X5670 (6-core, 2.93 GHz, Westmere) x 2 (no change)
- GPU: NVIDIA Tesla M2050 x 3 => NVIDIA Tesla K20X x 3
  – 448 CUDA cores (Fermi) => 2688 CUDA cores (Kepler)
  – SFP 1.03 TFlops => 3.95 TFlops
  – DFP 0.515 TFlops => 1.31 TFlops
  – 3 GiB GDDR5 memory => 6 GiB GDDR5 memory
  – 150 GB/s peak, ~90 GB/s STREAM => 250 GB/s peak, ~180 GB/s STREAM memory BW
- Node performance (incl. CPU turbo boost):
  – SFP 3.40 TFlops => 12.2 TFlops
  – DFP 1.70 TFlops => 4.08 TFlops
  – ~500 GB/s peak, ~300 GB/s STREAM => ~800 GB/s peak, ~570 GB/s STREAM memory BW
- Total system performance:
  – SFP 4.80 PFlops => 17.1 PFlops (x3.6)
  – DFP 2.40 PFlops => 5.76 PFlops (x2.4)
  – Peak ~0.70 PB/s, STREAM ~0.440 PB/s => peak ~1.16 PB/s, STREAM ~0.804 PB/s memory BW (x1.8)

2013: TSUBAME2.5 No.1 in Japan* in Single Precision FP, 17 Petaflops (*but not in Linpack)

All university centers combined: ~9 Petaflops SFP

TSUBAME2.5 total: 17.1 Petaflops SFP, 5.76 Petaflops DFP; K Computer: 11.4 Petaflops SFP/DFP
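These totals follow directly from the per-node figures in the comparison table above (1408 thin nodes at 12.2 TFlops SFP / 4.08 TFlops DFP each, CPUs included); the small differences are rounding:

```latex
\[
1408 \times 12.2\ \mathrm{TFlops} \approx 17.2\ \mathrm{PFlops\ (SFP)}, \qquad
1408 \times 4.08\ \mathrm{TFlops} \approx 5.74\ \mathrm{PFlops\ (DFP)}
\]
```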

Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.], Gordon Bell 2011 Winner
Weak scaling on TSUBAME (single precision); mesh size per 1 GPU + 4 CPU cores: 4096 x 162 x 130

TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), mesh 4,096 x 5,022 x 16,640

TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), mesh 4,096 x 6,480 x 13,000
Application target: developing lightweight, strengthened materials by controlling the microstructure, towards a low-carbon society.
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials.
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution

Peta-scale stencil application: A Large-scale LES Wind Simulation using the Lattice Boltzmann Method [Onodera, Aoki]

Large-scale wind simulation for a 10 km x 10 km area in metropolitan Tokyo
Weak scalability in single precision (N = 192 x 256 x 256); largest run 10,080 x 10,240 x 512 on 4,032 GPUs

[Figure: weak scaling, performance (TFlops) vs. number of GPUs, with computation/communication overlap]
TSUBAME 2.5: 1142 TFlops on 3968 GPUs (288 GFlops/GPU), x1.93 over TSUBAME 2.0
TSUBAME 2.0: 149 TFlops on 1000 GPUs (149 GFlops/GPU)
The above peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012.
• An LES wind simulation of a 10 km × 10 km area at 1 m resolution had never been done before anywhere in the world.
• We achieved 1.14 PFLOPS using 3968 GPUs on the TSUBAME 2.5 supercomputer.

AMBER pmemd benchmark (nucleosome, 25,095 atoms), ns/day [Dr. Sekijima @ Tokyo Tech]
- TSUBAME2.5, K20X x 8: 11.39; x 4: 6.66; x 2: 4.04; x 1: 3.11
- TSUBAME2.0, M2050 x 8: 3.44; x 4: 2.22; x 2: 1.85; x 1: 0.99
- CPU only, MPI: 4 nodes 0.31; 2 nodes 0.15; 1 node (12 cores) 0.11

Application performance, TSUBAME2.0 => TSUBAME2.5 (boost ratio):
- Top500 / Linpack, 4131 GPUs (PFlops): 1.192 => 2.843 (x2.39)
- Semi-Definite Programming / Nonlinear Optimization, 4080 GPUs (PFlops): 1.019 => 1.713 (x1.68)
- Gordon Bell Dendrite Stencil, 3968 GPUs (PFlops): 2.000 => 3.444 (x1.72)
- LBM LES Whole-City Airflow, 3968 GPUs (PFlops): 0.592 => 1.142 (x1.93)
- Amber 12 pmemd, 4 nodes / 8 GPUs (ns/day): 3.44 => 11.39 (x3.31)
- GHOSTM Genome Homology Search, 1 GPU (sec): 19361 => 10785 (x1.80)
- MEGADOC Protein Docking, 1 node / 3 GPUs (vs. 1 CPU core): 37.11 => 83.49 (x2.25)

Stay tuned for the TSUBAME2.5 and TSUBAME-KFC submission numbers later in this talk.

TSUBAME2.0 => 2.5: 10~20% power reduction

[Figure: measured power, Sept. 2012 vs. Sept. 2013]

TSUBAME Evolution Towards Exascale and Extreme Big Data

[Roadmap figure: TSUBAME2.5 (5.7 PF) with Phase 1 fast I/O (250 TB, 300 GB/s) evolving to TSUBAME3.0 (25-30 PF, ~1 TB/s, 2015H2) with Phase 2 fast I/O (5~10 PB, 10 TB/s, >100 million IOPS); data growing from 30 PB/day towards 1 ExaB/day; Graph 500 No. 3 (2011); awards]

Focused Research Towards TSUBAME 3.0 and Beyond towards Exa
• New memory systems – pushing the envelope of low power vs. capacity; Communication and Synchronization Reducing Algorithms (CSRA)
• Post-petascale networks – HW, topology, routing algorithms, placement…
• Green computing: ultra power-efficient HPC
• Scientific "Extreme" Big Data – ultra-fast I/O, Hadoop acceleration, large graphs
• Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Post-petascale programming – OpenACC and other many-core programming substrates, task parallelism
• Scalable algorithms for many-core – apps/system/HW co-design

JST-CREST “Ultra Low Power (ULP)-HPC” Project 2007-2012

Auto-tuning for performance & power across novel low-power components:
• Ultra multi-core: slow & parallel
• ULP-HPC SIMD-vector (GPGPU, etc.)
• Low-power memory: MRAM, PRAM, Flash
• Low-power networks, etc.

Bayesian fusion of model and measurement:
• Bayes model and prior distribution, with the execution time estimated by the performance model as the prior mean:
  \( y_i \sim N(\mu_i, \sigma_i^2), \quad \mu_i \mid \sigma_i^2 \sim N(x_i^T \theta,\; \sigma_i^2/\kappa_0), \quad \sigma_i^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \sigma_0^2) \)
• Posterior predictive distribution after n measurements of the execution time:
  \( y_i \mid (y_{i1}, \dots, y_{in}) \sim t_{\nu_n}\!\bigl(\mu_{in},\; \sigma_{in}^2 (\kappa_n + 1)/\kappa_n\bigr) \)
  \( \kappa_n = \kappa_0 + n, \quad \nu_n = \nu_0 + n, \quad \mu_{in} = (\kappa_0\, x_i^T \theta + n \bar{y}_i)/\kappa_n \)
  \( \nu_n \sigma_{in}^2 = \nu_0 \sigma_0^2 + \sum_m (y_{im} - \bar{y}_i)^2 + \kappa_0 n (\bar{y}_i - x_i^T \theta)^2/\kappa_n, \quad \bar{y}_i = \tfrac{1}{n} \sum_m y_{im} \)

ABCLibScript: algorithm selection. Directives specify pre-execution auto-tuning and the algorithm-selection process; the input variables used in the cost definition functions are CacheS, NB, and NPrc:
!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
  [target region 1 (algorithm 1)]
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
  [target region 2 (algorithm 2)]
!ABCLib$ select sub region end
!ABCLib$ static select region end

[Figure: power vs. performance trade-off, optimizing at the model-predicted operating point using novel low-power components in HPC; target x10 power efficiency]
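To make the directive-based selection above concrete, here is a minimal, illustrative Python sketch of the same idea: evaluate the two cost-definition functions for the current inputs and pick the cheaper variant before execution. This is not the ABCLibScript implementation; the function and parameter names are hypothetical stand-ins for CacheS, NB, and NPrc.

```python
# Illustrative sketch only (not ABCLibScript itself): select between two algorithm
# variants before execution using the cost-definition functions from the directives.
import math


def cost_algorithm1(cache_size: float, nb: float, nprc: float) -> float:
    """Cost-definition function for target region 1 (algorithm 1)."""
    return (2.0 * cache_size * nb) / (3.0 * nprc)


def cost_algorithm2(cache_size: float, nb: float, nprc: float) -> float:
    """Cost-definition function for target region 2 (algorithm 2)."""
    return (4.0 * cache_size * math.log(nb)) / (2.0 * nprc)


def select_algorithm(cache_size: float, nb: float, nprc: float) -> str:
    """Return the name of the variant with the lower estimated cost."""
    c1 = cost_algorithm1(cache_size, nb, nprc)
    c2 = cost_algorithm2(cache_size, nb, nprc)
    return "algorithm1" if c1 <= c2 else "algorithm2"


if __name__ == "__main__":
    # Example: 2 MB cache, block size NB = 64, 16 processes.
    print(select_algorithm(cache_size=2 * 1024 * 1024, nb=64, nprc=16))
```

In the actual project, as the slide describes, the estimated costs would themselves be refined by the Bayesian model, i.e., selection would use the posterior predictive execution time after n measurements rather than the raw model formula.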

Power-aware and optimizable applications, driven by performance models and algorithms.

How do we achieve x1000 improvement in 10 years?
• Process shrink: x100
• Many-core GPU usage: x5 (ULP-HPC Project 2007-12)
• DVFS & other low-power SW: x1.5 (ULP-HPC Project 2007-12)
• Efficient cooling: x1.4 (Ultra Green Supercomputing Project 2011-15)
=> x1000 !!!

Power efficiency of the dendrite application, from TSUBAME1.0 through the JST-CREST ULP-HPC prototype running the Gordon Bell dendrite app: a factor of ~1680 expected.
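The x1000 target is simply the product of the individual factors listed above (rounded):

```latex
\[
100 \times 5 \times 1.5 \times 1.4 = 1050 \;\approx\; 1000
\]
```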

[Figure: measured and projected power efficiency across GPU generations (GT200, Fermi, K10, K20), with an extrapolation line from GT200 to K20 and the x1000-in-10-years target line]

TSUBAME-KFC: Ultra-Green Supercomputer Testbed [2011-2015]
Fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container (GRC submersion rack, heat exchanger, compute nodes)
Heat dissipation chain: processors (K20X GPUs) at 80~90°C

⇒ coolant oil at 35~45°C ⇒ water at 25~35°C (cooling tower) ⇒ outdoor air

Compute nodes: NEC/SMC 1U server x 40, per node:
• Intel IvyBridge 2.1 GHz 6-core x 2
• NVIDIA Tesla K20X GPU x 4
• DDR3 memory 64 GB, SSD 120 GB
• 4x FDR InfiniBand 56 Gbps
Total peak: 210 TFlops (DP), 630 TFlops (SP) in a 20-foot container (16 m²)
Coolant oil: Spectrasyn8
Target:
• World's top power efficiency (>3 GFlops/Watt)
• Average PUE 1.05, lower component power
• Field test of ULP-HPC results on TSUBAME-KFC
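For reference, PUE (power usage effectiveness) is the ratio of total facility power to IT equipment power, so an average PUE of 1.05 means roughly 5% overhead for cooling and power delivery (standard definition, not from the slides):

```latex
\[
\mathrm{PUE} = \frac{P_{\mathrm{total\ facility}}}{P_{\mathrm{IT\ equipment}}}, \qquad
\mathrm{PUE} = 1.05 \;\Rightarrow\; P_{\mathrm{overhead}} = 0.05\,P_{\mathrm{IT\ equipment}}
\]
```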

Towards TSUBAME3.0 and Beyond: shooting for #1 on the Nov. 2013 Green 500!
TSUBAME-KFC: 4238.49 MFlops/W or greater
TSUBAME 2.5: 3068.71 MFlops/W

TSUBAME 2.0: 958.35 MFlops/W

Machine (year) | Power (incl. cooling) | Linpack Perf (PF) | Linpack MFlops/W | Factor | Total Mem BW TB/s (STREAM) | Mem BW MB/s per W | Factor | Boost
Earth Simulator 1 | 10 MW | 0.036 | 3.6 | 13,400 | 160 | 16 | 312 |
Tsubame1.0 (2006Q1) | 1.8 MW | 0.038 | 21 | 2,368 | 13 | 7.2 | 692 |
ORNL Jaguar (XT5, 2009Q4) | ~9 MW | 1.76 | 196 | 256 | 432 | 48 | 104 |
Tsubame2.0 (2010Q4) | 1.8 MW | 1.2 | 667 | 75 | 440 | 244 | 20 | x31.6, x34
K Computer (2011Q2) | ~16 MW | 10 | 625 | 80 | 3300 | 206 | 24 |
BlueGene/Q (2012Q1) | ~12 MW? | 17 | ~1400 | ~35 | 3000 | 250 | 20 |
TSUBAME2.5 (2013Q3) | 1.4 MW | ~2.8 | ~2000 | ~25 | 802 | 572 | 8.7 |
Tsubame3.0 (2015Q4~2016Q1) | 1.5 MW | ~20 | ~13,000 | ~4 | 6000 | 4000 | 1.25 | ~x20, ~x13.7
EXA (2019~20) | 20 MW | 1000 | 50,000 | 1 | 100K | 5000 | 1 |

Summary
• TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> …
  – TSUBAME 2.5: Number 1 in Japan, 17 Petaflops SFP
  – Template for future and IDC machines
• TSUBAME3.0 early 2016
  – New supercomputing leadership
  – Tremendous power efficiency, extreme big data, extreme high reliability, …
• Lots of background R&D for TSUBAME3.0 and towards Exascale
  – Green Computing: ULP-HPC & TSUBAME-KFC
  – Extreme Big Data: convergence of HPC and IDC!
  – Exascale resilience
  – Programming with millions of cores
  – …
• Please stay tuned!