TSUBAME 2.5 Towards 3.0 and Beyond to Exascale
Being Very Green with TSUBAME 2.5, towards 3.0 and beyond to Exascale
Satoshi Matsuoka
Professor, Global Scientific Information and Computing (GSIC) Center, Tokyo Institute of Technology
ACM Fellow / SC13 Tech Program Chair
NVIDIA Theater Presentation, 2013/11/19, Denver, Colorado

TSUBAME2.0, Nov. 1, 2010: "The Greenest Production Supercomputer in the World"
- [System overview figure] A new development spanning scales: >600 TB/s, >12 TB/s, >1.6 TB/s, and >400 GB/s memory bandwidth; 220 Tbps network bisection bandwidth and 80 Gbps per-node network bandwidth; 1.4 MW, 35 kW, and ~1 kW max power; 40 nm and 32 nm silicon.

Performance Comparison of CPU vs. GPU
- Charts of peak performance [GFLOPS] and memory bandwidth [GB/s] show a 5-6x socket-to-socket advantage for the GPU in both compute and memory bandwidth, at the same power (200 W GPU vs. 200 W CPU + memory + network + ...).

TSUBAME2.0 Compute Node (thin node: 1.6 TFlops, 400 GB/s memory BW, 80 Gbps network BW, ~1 kW max)
- HP SL390G7, developed for TSUBAME 2.0 and productized as the HP ProLiant SL390s
- GPU: NVIDIA Fermi M2050 x 3 (515 GFlops, 3 GB memory per GPU)
- CPU: Intel Westmere-EP 2.93 GHz x 2 (12 cores/node)
- Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) serving 3 GPUs + 2 InfiniBand QDR links (80 Gbps)
- Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB x 2 or 120 GB x 2
- System totals: 2.4 PFlops peak, ~100 TB memory, ~200 TB SSD

2010: TSUBAME2.0 as No. 1 in Japan
- 2.4 PFlops total, #4 on the Top500 (Nov. 2010)
- More than all other Japanese centers on the Top500 combined (2.3 PFlops)

TSUBAME Wins Awards
- "Greenest Production Supercomputer in the World" on the Green500, Nov. 2010 and June 2011 (#4 on the Top500, Nov. 2010); 3 times more power efficient than a laptop
- ACM Gordon Bell Prize 2011, Special Achievements in Scalability and Time-to-Solution, for the 2.0 PFlops dendrite simulation "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer"

TSUBAME Three Key Application Areas ("of high national interest and societal benefit to the Japanese taxpayers")
1. Safety/Disaster & Environment
2. Medical & Pharmaceutical
3. Manufacturing & Materials
Plus co-design for general IT industry and ecosystem impact (IDC, Big Data, etc.)

Lattice-Boltzmann LES with a Coherent-Structure SGS Model [Onodera & Aoki 2013]
- Coherent-structure Smagorinsky model: the model parameter is determined locally from the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε); a sketch of the closure follows the computational-area summary below
- Well suited to turbulent flow around complex objects and to large-scale parallel computation

Computational Area: Entire Downtown Tokyo
- Major part of Tokyo, 10 km x 10 km, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, and Chuo-ku
- Building data: Pasco Co. Ltd.; map (c) 2012 Google, ZENRIN
- Achieved 0.592 PFlops using over 4,000 GPUs (15% efficiency)
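For reference, here is a minimal sketch of the coherent-structure Smagorinsky closure in its commonly published (Kobayashi-style) form, which matches the slide's description of a model parameter determined locally from Q and the dissipation; the constant C_1 and the exact exponent are assumptions and may differ from the formulation Onodera & Aoki actually use.

```latex
% Coherent-structure Smagorinsky SGS closure (common published form; sketch).
% C_1 and the exact exponent are assumptions, not taken from the slides.
\begin{align*}
  \nu_{\mathrm{SGS}} &= C\,\Delta^{2}\,|\bar{S}|,
    & |\bar{S}| &= \sqrt{2\,\bar{S}_{ij}\bar{S}_{ij}},\\
  C &= C_{1}\,\lvert F_{\mathrm{CS}}\rvert^{3/2},
    & F_{\mathrm{CS}} &= \frac{Q}{E}, \qquad -1 \le F_{\mathrm{CS}} \le 1,\\
  Q &= \tfrac{1}{2}\bigl(\bar{W}_{ij}\bar{W}_{ij}-\bar{S}_{ij}\bar{S}_{ij}\bigr),
    & E &= \tfrac{1}{2}\bigl(\bar{W}_{ij}\bar{W}_{ij}+\bar{S}_{ij}\bar{S}_{ij}\bigr).
\end{align*}
```

Here S̄_ij and W̄_ij are the resolved strain-rate and rotation-rate tensors, Q is the second invariant of the resolved velocity gradient tensor, and E its magnitude (related to the dissipation noted on the slide). Because F_CS is evaluated pointwise from resolved quantities, the coefficient adapts to local coherent structures without wall damping functions or spatial averaging, which is what makes this class of model convenient for complex geometry and large-scale GPU parallelization.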
Area Around the Metropolitan Government Building
- Flow profile at 25 m height above ground, with the wind direction indicated, over a 960 m x 640 m area (map data (c) 2012 Google, ZENRIN)

LBM Simulation of the DrivAer (BMW-Audi) Car Model
- Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München; Re = 1,000,000 at 60 km/h
- Number of grid points: 3,623,878,656 (3,072 x 1,536 x 768)
- Grid resolution: 4.2 mm over a 13 m x 6.5 m x 3.25 m domain
- Number of GPUs: 288 (96 nodes)

TSUBAME2.0 also contributed significantly to the creation of the National Earthquake Hazard Map.

Industry Program: TOTO Inc.
- 150 TSUBAME GPUs complement TOTO's in-house cluster to accelerate in-silico screening and data mining

Mixed-Precision Amber on TSUBAME2.0 for Industrial Drug Discovery
- 10x faster and 75% energy efficient with mixed precision (nucleosome benchmark: 25,095 particles)
- At $500M to $1B development cost per drug, a >10% improvement of the process more than pays for TSUBAME

Towards TSUBAME 3.0: Interim Upgrade from TSUBAME2.0 to 2.5 (Sept. 10, 2013)
- Upgrade the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X in summer 2013: 3 GPUs x 1,408 nodes = 4,224 GPUs
- SFP/DFP peak rises from 4.8 PF / 2.4 PF to 17 PF / 5.7 PF (c.f. the K Computer: 11.2 / 11.2)
- Considerable acceleration of important applications; significant capacity improvement at low cost and without a power increase
- TSUBAME3.0 to follow in 2H 2015

TSUBAME2.0 => 2.5 Thin Node Upgrade (peak 4.08 TFlops DFP, ~800 GB/s memory BW, 80 Gbps network BW, ~1 kW max)
- HP SL390G7, developed for TSUBAME 2.0 and modified for TSUBAME2.5 (productized as the HP ProLiant SL390s); InfiniBand QDR x 2 (80 Gbps)
- GPU: NVIDIA Kepler K20X x 3 (1,310 GFlops DFP, 6 GB memory per GPU); per-GPU peak goes from 1,039/515 GFlops SFP/DFP (Fermi M2050) to 3,950/1,310 GFlops SFP/DFP (K20X)
- CPU: Intel Westmere-EP 2.93 GHz x 2 (unchanged)
- Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) serving 3 GPUs + 2 IB QDR (unchanged)
- Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB x 2 or 120 GB x 2 (unchanged)

TSUBAME2.0 => 2.5 Changes
- Doubled to tripled performance: 2.4 (DFP) / 4.8 (SFP) PFlops => 5.76 (x2.4) / 17.1 (x3.6) PFlops; preliminary results: ~2.7 PF Linpack (x2.25) and ~3.4 PF for the dendrite Gordon Bell app (x1.7)
- Bigger, higher-bandwidth GPU memory: 3 GB => 6 GB per GPU, 150 GB/s => 250 GB/s
- Higher reliability: a minor HW bug was resolved, expected to reduce compute-node fail-stop occurrences by up to 50%
- Lower power: observing a ~20% drop in power/energy (tentative)
- Better programmability with new GPU features: dynamic tasks (dynamic parallelism), Hyper-Q, CPU/GPU shared memory
- Prolongs the TSUBAME2 lifetime by at least one year, with TSUBAME 3.0 in FY2015 Q4

TSUBAME2.0 => TSUBAME2.5 Thin Node Comparison (x 1,408 units)
- Machine: HP ProLiant SL390s (no change)
- CPU: Intel Xeon X5670 (6-core, 2.93 GHz, Westmere) x 2 (no change)
- GPU: NVIDIA Tesla M2050 x 3 => NVIDIA Tesla K20X x 3
  - CUDA cores: 448 (Fermi) => 2,688 (Kepler)
  - SFP: 1.03 TFlops => 3.95 TFlops; DFP: 0.515 TFlops => 1.31 TFlops
  - Memory: 3 GiB GDDR5 => 6 GiB GDDR5
  - Memory BW: 150 GB/s peak, ~90 GB/s STREAM => 250 GB/s peak, ~180 GB/s STREAM
- Node performance (incl. CPU turbo boost):
  - SFP: 3.40 TFlops => 12.2 TFlops; DFP: 1.70 TFlops => 4.08 TFlops
  - Memory BW: ~500 GB/s peak, ~300 GB/s STREAM => ~800 GB/s peak, ~570 GB/s STREAM
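As a sanity check on the node and system peak figures above, the following back-of-the-envelope sketch recomputes them from core counts and clock rates; the K20X and Xeon X5670 clock and FLOPs-per-cycle constants come from public spec sheets, not from the slides, so treat them as assumptions.

```python
# Back-of-the-envelope peak-performance check for a TSUBAME2.5 thin node.
# Clock rates and FLOPs/cycle below are assumptions from public spec sheets,
# not figures taken from the slides.

NODES = 1408
GPUS_PER_NODE = 3

# NVIDIA K20X (GK110): 2688 CUDA cores at ~732 MHz, 2 SP FLOPs/core/cycle;
# DP peak is one third of SP peak.
gpu_sfp = 2688 * 0.732e9 * 2          # ~3.94e12 FLOPS
gpu_dfp = gpu_sfp / 3.0               # ~1.31e12 FLOPS

# Intel Xeon X5670 (Westmere-EP): 2 sockets x 6 cores at 2.93 GHz base clock,
# 8 SP / 4 DP FLOPs per core per cycle with SSE (turbo boost adds a little more).
cpu_sfp = 2 * 6 * 2.93e9 * 8          # ~0.28e12 FLOPS
cpu_dfp = cpu_sfp / 2.0               # ~0.14e12 FLOPS

node_sfp = GPUS_PER_NODE * gpu_sfp + cpu_sfp
node_dfp = GPUS_PER_NODE * gpu_dfp + cpu_dfp

print(f"node:   SFP ~{node_sfp / 1e12:.1f} TFlops, DFP ~{node_dfp / 1e12:.2f} TFlops")
print(f"system: SFP ~{NODES * node_sfp / 1e15:.1f} PFlops, "
      f"DFP ~{NODES * node_dfp / 1e15:.2f} PFlops")
# Prints roughly 12.1 / 4.08 TFlops per node and 17.0 / 5.74 PFlops system-wide,
# in line with the 12.2 / 4.08 TFlops and 17.1 / 5.76 PFlops quoted above.
```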
TSUBAME2.0 => TSUBAME2.5 Total System
- Performance: SFP 4.80 PFlops => 17.1 PFlops (x3.6); DFP 2.40 PFlops => 5.76 PFlops (x2.4)
- Memory BW: ~0.70 PB/s peak, ~0.440 PB/s STREAM => ~1.16 PB/s peak, ~0.804 PB/s STREAM (x1.8)

2013: TSUBAME2.5 No. 1 in Japan* in Single-Precision FP at 17 Petaflops (*but not in Linpack)
- TSUBAME2.5: 17.1 PFlops SFP, 5.76 PFlops DFP
- All university centers combined: ~9 PFlops SFP
- K Computer: 11.4 PFlops SFP/DFP

Phase-Field Simulation of Dendritic Solidification [Shimokawabe, Aoki et al.], Gordon Bell 2011 Winner
- Goal: developing lightweight, high-strength materials by controlling microstructure, towards a low-carbon society
- Weak scaling on TSUBAME in single precision; mesh size per 1 GPU + 4 CPU cores: 4,096 x 162 x 130
- TSUBAME 2.5: 3.444 PFlops with 3,968 GPUs + 15,872 CPU cores (4,096 x 5,022 x 16,640 mesh)
- TSUBAME 2.0: 2.000 PFlops with 4,000 GPUs + 16,000 CPU cores (4,096 x 6,480 x 13,000 mesh)
- Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials
- 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution

Peta-Scale Stencil Application: A Large-Scale LES Wind Simulation Using the Lattice Boltzmann Method [Onodera, Aoki]
- Large-scale wind simulation of a 10 km x 10 km area of metropolitan Tokyo: 10,080 x 10,240 x 512 grid on 4,032 GPUs (a rough footprint sketch appears at the end of this document)
- Weak scaling in single precision (N = 192 x 256 x 256 per GPU) with computation/communication overlap on both TSUBAME 2.0 and 2.5 [figure: performance in TFlops vs. number of GPUs]
- TSUBAME 2.5: 1,142 TFlops on 3,968 GPUs (288 GFlops/GPU), x1.93 per GPU over TSUBAME 2.0's 149 TFlops on 1,000 GPUs (149 GFlops/GPU)
- These peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012
- An LES wind simulation of a 10 km x 10 km area at 1 m resolution had never been done before anywhere in the world; we achieved 1.14 PFlops using 3,968 GPUs on the TSUBAME 2.5 supercomputer

AMBER pmemd Benchmark (nucleosome, 25,095 atoms; ns/day; Dr. Sekijima, Tokyo Tech)
- TSUBAME2.5 (K20X): x 8: 11.39; x 4: 6.66; x 2: 4.04; x 1: 3.11
- TSUBAME2.0 (M2050): x 8: 3.44; x 4: 2.22; x 2: 1.85; x 1: 0.99
- CPU only (MPI): 4 nodes: 0.31; 2 nodes: 0.15; 1 node (12 cores): 0.11

Application Performance: TSUBAME2.0 => TSUBAME2.5 (boost ratio)
- Top500/Linpack, 4,131 GPUs: 1.192 => 2.843 PFlops (x2.39)
- Semi-definite programming (nonlinear optimization), 4,080 GPUs: 1.019 => 1.713 PFlops (x1.68)
- Gordon Bell dendrite stencil, 3,968 GPUs: 2.000 => 3.444 PFlops (x1.72)
- LBM LES whole-city airflow, 3,968 GPUs: 0.592 => 1.142 PFlops (x1.93)
- Amber 12 pmemd, 4 nodes / 8 GPUs: 3.44 => 11.39 ns/day (x3.31)
- GHOSTM genome homology search, 1 GPU: 19,361 => 10,785 sec (x1.80)
- MEGADOC protein docking, 1 node / 3 GPUs: 37.11 => 83.49 speedup vs. 1 CPU core (x2.25)

Stay tuned for the TSUBAME2.5 and TSUBAME-KFC Green500 submission numbers later in this talk.

TSUBAME2.0 => 2.5: a 10-20% power reduction observed (Sept. 2012 vs. Sept. 2013)

TSUBAME Evolution Towards Exascale and Extreme Big Data
- Roadmap figure: 5.7 PF (TSUBAME2.5), 25-30 PF, 1 TB/s, Graph 500 No.
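Finally, a rough sketch (referenced from the LBM wind-simulation summary above) of why the 10,080 x 10,240 x 512 city grid fits on 4,032 K20X GPUs, and of the per-GPU throughput implied by the reported totals; the D3Q19 velocity set, single-precision storage, and double buffering are assumptions about the implementation, not facts from the slides.

```python
# Rough memory-footprint and throughput check for the 10 km x 10 km Tokyo
# LBM wind simulation. The D3Q19 velocity set, single-precision storage, and
# double buffering are assumptions about the implementation, not slide facts.

nx, ny, nz = 10_080, 10_240, 512       # ~1 m resolution over the 10 km x 10 km area
gpus = 4_032
cells = nx * ny * nz                   # ~5.3e10 lattice cells

q = 19                                 # distribution functions per cell (assumed D3Q19)
bytes_per_cell = q * 4 * 2             # 4-byte floats, two copies (ping-pong buffers)

total_tb = cells * bytes_per_cell / 1e12
print(f"distribution data: ~{total_tb:.1f} TB total, "
      f"~{cells * bytes_per_cell / gpus / 1e9:.1f} GB per GPU (K20X has 6 GB)")

# Per-GPU throughput implied by the totals reported on the slides.
t25 = 1142e3 / 3968                    # GFlops per GPU on TSUBAME 2.5
t20 = 149e3 / 1000                     # GFlops per GPU on TSUBAME 2.0
print(f"TSUBAME2.5 ~{t25:.0f} GFlops/GPU vs. TSUBAME2.0 ~{t20:.0f} GFlops/GPU "
      f"(x{t25 / t20:.2f})")
```

Under these assumptions the distribution data come to roughly 8 TB in total, or about 2 GB per GPU, leaving room in the 6 GB K20X memory for geometry and auxiliary arrays; the second print reproduces the 288 vs. 149 GFlops/GPU (x1.93) figures quoted on the slide.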