Programming on K Computer

Koh Hotta
The Next Generation Technical Computing
Fujitsu Limited
Copyright 2010 FUJITSU LIMITED

System Overview of “K computer”
- Target performance: 10 PF
  - Over 80,000 processors
  - Over 640K cores
  - Over 1 petabyte of memory
- Cutting-edge technologies
  - CPU: SPARC64 VIIIfx, 8 cores, 128 GFlops, an extension of SPARC V9
  - Interconnect “Tofu”: 6-D mesh/torus
  - Parallel programming environment

I have a dream that one day you just compile your programs and enjoy high performance on your high-end supercomputer.
So we must provide an easy hybrid parallel programming method, including compiler and run-time system support.

User I/F for Programming for K computer
[Diagram. Recoverable labels: Client System (IDE, Interactive Debugger GUI, Profiler, Visualized Data); K computer Front End (IDE Interface, Debugger Interface, Command, Job Control, Data Converter, Sampling Data); K computer Back End, 80000 nodes (App, Data Sampler, debugging partition, official op partition); Stage out (InfiniBand).]

Parallel Programming

Hybrid Parallelism on over-640K cores
- Too many processes to manage
  - To reduce the number of processes, hybrid thread-process programming is required
- But hybrid parallel programming is annoying for programmers
- Even for multi-threading, procedure-level or outer-loop parallelism was desired
  - There is little opportunity for such coarse-grain parallelism
  - System support for “fine grain” parallelism is required

Targeting inner-most loop parallelization
- Automatic vectorization technology has matured, and vector tuning is easy for programmers.
- Inner-most loop parallelism, which is fine-grain, should be an important part of peta-scale parallelization.
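
To make the idea of inner-most loop parallelization concrete, here is a minimal sketch in Python. It assumes a static block partition: the loop's iteration range is cut into one contiguous chunk per thread, and the threads are joined at the end, which acts as a barrier. The slides do not say how the Fujitsu compiler actually divides iterations, so the scheme and the names `chunk_bounds` and `parallel_axpy` are illustrative assumptions, not the product's method.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, num_threads):
    """Split range(n) into num_threads contiguous [lo, hi) blocks.

    Block sizes differ by at most one (static block partition, assumed here).
    """
    base, extra = divmod(n, num_threads)
    bounds = []
    start = 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def parallel_axpy(a, x, y, num_threads=8):
    """Compute y[i] += a * x[i], one contiguous chunk of i per thread."""

    def work(bound):
        lo, hi = bound
        for i in range(lo, hi):  # each thread touches a disjoint index range
            y[i] += a * x[i]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Consuming the iterator waits for every chunk, like a barrier.
        list(pool.map(work, chunk_bounds(len(y), num_threads)))
    return y


if __name__ == "__main__":
    print(chunk_bounds(10, 8))
    print(parallel_axpy(2.0, [1.0] * 10, [0.0] * 10))
```

Python threads do not speed up this loop because of the interpreter lock; the sketch only shows how the iteration space is partitioned and where the synchronization point falls.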

Inner-most loop acceleration by multi-threading technology
- The CPU architecture is designed to reuse vectorization methodology efficiently.
- The target is automatic parallelization of the inner-most loop for multi-core processors.

VISIMPACT™: you need not think about multi-cores
- Efficient multi-thread execution on multiple, tightly coupled cores
- Collaboration between hardware architecture and compiler optimization yields high efficiency
  - Shared L2 cache on a chip
  - High-speed hardware barrier on a chip
  - Automatic parallelization
- The automatic parallelization facility makes multiple cores behave like a single high-speed core
  - You need not think about cores in a CPU chip.

VISIMPACT™ Performance on DAXPY
- Benchmark: Euroben 8 (DAXPY)

      Do i = 1, n
        y(i,jsw) = y(i,jsw) + c0*x1(i)
      End Do

- [Chart: Rate (Mflop/s) versus problem size, with tick marks at 100, 1000, and 10000 on both axes.]
- Series:
  - SPARC64™ VIIIfx 2.0GHz, SIMD, 8 threads
  - FX1 (SPARC64™ VII) 2.52GHz, 4 threads
  - BX900 Nehalem 2.93GHz, 8 threads (2 chips)
- Callout: the shared cache provides twice the performance of the Nehalem 2.93GHz.

MPI
- Open MPI based
- Tuned to the “Tofu” interconnect

MPI Approach for the K computer
- Open MPI based
  - Open standard, open source, multi-platform including PC clusters
- Adding extensions to Open MPI for the “Tofu” interconnect
- High performance
  - Short-cut message path for low-latency communication
  - Torus-oriented protocol: sensitive to message size, location, and hop count
  - Trunking communication that uses multi-dimensional network links through Tofu selective routing

Goal for MPI on K system
- High performance
  - Low latency and high bandwidth
- High scalability
  - Collective performance optimized for the Tofu interconnect
- High availability, flexibility, and ease of use
  - Providing a logical 3D torus for each job, excluding failed nodes
  - Providing new versions of the MPI standard functions as soon as possible

MPI Software stack
[Diagram: two layered stacks side by side. Recoverable content follows.]
- Original Open MPI software stack (using openib BTL), top to bottom: MPI, COLL, PML ob1, BML r2, BTL openib, OpenFabrics Verbs
- Stack adapted to Tofu: MPI, Extension, COLL, PML ob1, BML r2, BTL tofu, LLP, tofu common, Tofu Library
- Layer names:
  - PML: Point-to-Point Messaging Layer
  - BML: BTL Management Layer
  - BTL: Byte Transfer Layer
  - LLP: Low Latency Path
- Annotations in the diagram:
  - Special Bcast, Allgather, Alltoall, and Allreduce supported for Tofu
  - Adapting to tofu
  - Hardware dependent
  - Providing common data processing and structures for BTL, LLP, and COLL
  - Special hardware-dependent layer for the Tofu interconnect
  - Rendezvous protocol optimization, etc.

Flexible Process Mapping to Tofu environment
- You can allocate your processes as you like.
- Dimension specification for each rank:

      1D (x)    2D (x,y)    3D (x,y,z)
      (0)       (0,0)       (0,0,0)
      (1)       (1,0)       (1,0,0)
      (2)       (2,0)       (0,1,0)
      (3)       (3,0)       (1,1,0)
      (7)       (3,1)       (0,0,1)
      (6)       (2,1)       (0,1,1)
      (5)       (1,1)       (1,0,1)
      (4)       (0,1)       (1,1,1)

- [Figure: the same rank layouts drawn on 1D, 2D, and 3D grids with x, y, and z axes.]

Performance Tuning
- Performance comes not only from compiler optimization; you can also tune it yourself
  - Compiler directives to tune programs
  - Tools to help you tune your programs, e.g. watching your program with event counters

Performance Tuning (Event Counter Example)
- 3-D job example
  - Displays 4096 procs in 16 x 16 x 16 cells
  - Cells are colored according to proc status (e.g. CPU time)
  - Cut a slice of the job along the x-, y-, or z-axis to view it

Conclusion: Automation and transparency of performance
- VISIMPACT™ lets you treat an 8-core CPU as a single high-speed core.
  - Collaboration between the CPU architecture and the compiler
  - High-speed hardware barrier to reduce synchronization overhead
  - Shared L2 cache to improve memory access
  - Automatic parallelization to recognize parallelism and accelerate your program
- Open MPI-based MPI to utilize the “Tofu” interconnect.
- Tuning facility shows the activity of parallel programs.
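
The process-mapping slide says each rank can be given its own coordinate. A minimal sketch of what checking such a user-supplied mapping might involve is shown below, in Python. It assumes that the position of a coordinate in the list is the rank number and that a valid mapping places every rank on a distinct cell inside the grid. Neither assumption is confirmed by the slides, and `validate_mapping` is a hypothetical helper, not part of Fujitsu's MPI. The sample data is the 2D column from the slide, read in the order listed.

```python
from itertools import product


def validate_mapping(coords, shape):
    """Return a {coordinate: rank} dict for a one-to-one placement.

    coords: one coordinate tuple per rank (list index = rank, an assumption).
    shape:  grid extent per dimension, e.g. (4, 2) for x in 0..3, y in 0..1.
    Raises ValueError if a coordinate is off-grid or used twice.
    """
    cells = set(product(*(range(n) for n in shape)))
    placement = {}
    for rank, coord in enumerate(coords):
        if coord not in cells:
            raise ValueError(f"rank {rank}: {coord} is outside grid {shape}")
        if coord in placement:
            raise ValueError(
                f"rank {rank}: {coord} already holds rank {placement[coord]}"
            )
        placement[coord] = rank
    return placement


# 2D (x,y) column from the slide, in listed order (assumed to be ranks 0..7).
coords_2d = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (2, 1), (1, 1), (0, 1)]
placement = validate_mapping(coords_2d, (4, 2))

if __name__ == "__main__":
    # Print the grid row by row, top row (y = 1) first.
    for y in (1, 0):
        print([placement[(x, y)] for x in range(4)])
```

Under these assumptions the ranks snake through the 4 x 2 grid, so consecutive ranks always sit on neighboring cells.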
