Programming on K computer

Koh Hotta
The Next Generation Technical Computing Limited
Copyright 2010 FUJITSU LIMITED

System Overview of “K computer”

• Target performance: 10 PFlops
  • Over 80,000 processors
  • Over 640K cores
  • Over 1 petabyte of memory
• Cutting-edge technologies
  • CPU: SPARC64 VIIIfx (8 cores, 128 GFlops, extension of SPARC V9)
  • Interconnect, “Tofu”: 6-D mesh/torus
  • Parallel programming environment
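The headline figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below is only an illustration: it treats the lower bound of 80,000 processors as an exact value, and all variable names are hypothetical.

```python
# Back-of-the-envelope check of the system overview figures.
# Assumption: the slide's lower bound of 80,000 processors is used as-is.
PROCESSORS = 80_000          # "over 80,000 processors"
CORES_PER_CPU = 8            # SPARC64 VIIIfx
GFLOPS_PER_CPU = 128         # peak per CPU

total_cores = PROCESSORS * CORES_PER_CPU
peak_pflops = PROCESSORS * GFLOPS_PER_CPU / 1_000_000  # GFlops -> PFlops
gflops_per_core = GFLOPS_PER_CPU / CORES_PER_CPU

print(f"cores: {total_cores:,}")
print(f"peak: {peak_pflops:.2f} PFlops")
print(f"per core: {gflops_per_core:.0f} GFlops")
```

With these inputs the core count comes out at 640,000 and the peak slightly above the 10 PFlops target, which matches the "over 640K cores" and "10PF" figures.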

I have a dream that one day you will just compile your programs and enjoy high performance on your high-end supercomputer.

So we must provide an easy hybrid parallel programming method, including compiler and run-time system support.

User I/F for Programming for K computer

[Figure: user interface for programming. Three columns: Client System, K computer Front End, and K computer Back End (80,000 nodes). Labels: IDE, IDE interface, command and job control, interactive debugger GUI, debugger interface, app interface, data converter, data sampler, profiler, visualized data, sampling data, stage out (InfiniBand), debugging partition, official op partition.]

Parallel Programming

Hybrid Parallelism on over-640K cores

• The number of processes is too large to manipulate
  • To reduce the number of processes, hybrid thread-process programming is required
• But hybrid parallel programming is annoying for programmers
  • Even for multi-threading, procedure-level or outer-loop parallelism was desired
  • There is little opportunity for such coarse-grain parallelism
• System support for “fine-grain” parallelism is required
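To make the reduction in process count concrete, here is a minimal sketch that compares a flat layout (one process per core) with a hybrid layout (one process per 8-core CPU). The function name and the assumption that every process runs the same number of threads are illustrative, not taken from the slides.

```python
# Number of MPI processes needed to occupy a given number of cores.
# Assumption: every process runs the same number of threads, one per core.
def process_count(total_cores: int, threads_per_process: int) -> int:
    if total_cores % threads_per_process != 0:
        raise ValueError("cores must divide evenly among processes")
    return total_cores // threads_per_process

TOTAL_CORES = 640_000   # lower bound from the slides
flat = process_count(TOTAL_CORES, 1)     # one process per core
hybrid = process_count(TOTAL_CORES, 8)   # one process per 8-core CPU
print(flat, hybrid)
```

Running one process per CPU cuts the number of processes by a factor of eight, at the cost of having to parallelize within each process as well.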

Targeting inner-most loop parallelization

Automatic vectorization technology has matured, and vector tuning is easy for programmers. Inner-most loop parallelism, which is fine-grained, should account for an important share of peta-scale parallelization.

Inner-most loop acceleration by multi-threading technology

Inner-most loop acceleration by multi-threading technology

• The CPU architecture is designed to reuse vectorization methodology efficiently.
• Targeting automatic parallelization of the inner-most loop on a multi-core processor.

VISIMPACT™: you need not think about multi-cores

• Efficient multi-thread execution on multiple cores tightly coupled with each other
• Collaboration between the hardware architecture and compiler optimization yields high efficiency
  • Shared L2 cache on a chip
  • High-speed hardware barrier on a chip
  • Automatic parallelization
• The automatic parallelization facility makes multiple cores behave like a single high-speed core
  • You need not think about the cores in a CPU chip.

VISIMPACT™ Performance on DAXPY

Euroben 8 (DAXPY):

    Do i = 1, n
      y(i,jsw) = y(i,jsw) + c0*x1(i)
    End Do

[Figure: rate (Mflop/s) versus problem size. Series:
  SPARC64™ VIIIfx 2.0GHz, SIMD, 8 threads
  FX1 (SPARC64™ VII) 2.52GHz, 4 threads
  BX900 Nehalem 2.93GHz, 8 threads (2 chips)]

The shared cache provides twice the performance of the Nehalem 2.93GHz.
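For readers unfamiliar with the kernel, the following sketch shows the DAXPY update and how a rate in Mflop/s is derived from it. It is a plain Python illustration, not the Euroben benchmark code: it drops the second index `jsw` of the loop on the slide, the helper names are hypothetical, and the rate it prints says nothing about the machines in the chart.

```python
import time

def daxpy(c0, x, y):
    """In-place y[i] = y[i] + c0 * x[i], a 1-D version of the slide's loop."""
    for i in range(len(y)):
        y[i] = y[i] + c0 * x[i]
    return y

def mflops(n, seconds):
    """DAXPY performs one multiply and one add per element: 2*n flops."""
    return 2 * n / seconds / 1e6

n = 10_000
x = [1.0] * n
y = [2.0] * n
start = time.perf_counter()
daxpy(3.0, x, y)
elapsed = time.perf_counter() - start
print(f"n={n}: {mflops(n, elapsed):.1f} Mflop/s (pure Python, illustration only)")
```

The kernel does only two floating-point operations for every three memory accesses, so its rate is bound by memory and cache bandwidth rather than by arithmetic.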

MPI

• Open MPI based
• Tuned to the “Tofu” interconnect

MPI Approach for the K computer

• Open MPI based
  • Open standard, open source, multi-platform including PC clusters
  • Adding extensions to Open MPI for the “Tofu” interconnect
• High performance
  • Short-cut message path for low-latency communication
  • Torus-oriented protocol: sensitive to message size, location, and hop count
  • Trunking communication that uses multi-dimensional network links through Tofu selective routing
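The slides do not say how the protocol weighs hop count, so the following is only a sketch of what "hops" means on a torus: along each axis a message can travel directly or wrap around, and the shorter path counts. The function name and the 4 x 4 x 4 shape are hypothetical and do not describe the real Tofu geometry, which combines mesh and torus axes in six dimensions.

```python
def torus_hops(a, b, shape):
    """Minimal hop count between coordinates a and b on a torus.

    Each axis wraps around, so the distance along an axis of size n
    is the shorter of the direct and the wrap-around path.
    """
    hops = 0
    for ai, bi, n in zip(a, b, shape):
        d = abs(ai - bi) % n
        hops += min(d, n - d)
    return hops

shape = (4, 4, 4)   # hypothetical 3-D torus, not the real Tofu geometry
print(torus_hops((0, 0, 0), (3, 0, 0), shape))  # wrap-around path: 1
print(torus_hops((0, 0, 0), (2, 2, 2), shape))  # 2 hops per axis: 6
```

A protocol that is sensitive to location could use a distance of this kind, together with the message size, to choose between transfer strategies.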

Goal for MPI on K system

• High performance
  • Low latency and high bandwidth
• High scalability
  • Collective performance optimized for the Tofu interconnect
• High availability, flexibility, and ease of use
  • Providing a logical 3D torus for each job while excluding failed nodes
  • Providing new versions of the MPI standard functions as soon as possible

MPI Software stack

[Figure: two software stacks side by side.]

Original Open MPI software stack (using the openib BTL):
  MPI
  PML (Point-to-Point Messaging Layer): ob1
  BML (BTL Management Layer): r2
  BTL (Byte Transfer Layer): openib
  OpenFabrics Verbs

Stack adapted to Tofu:
  MPI
  COLL: supports special Bcast, Allgather, Alltoall, and Allreduce for Tofu
  PML: ob1
  BML: r2
  BTL: tofu, alongside LLP (Low Latency Path)
  tofu common: common data processing and structures for BTL, LLP, and COLL
  Tofu Library

Other labels in the figure: extension; hardware dependent; special hardware-dependent layer for the Tofu interconnect; rendezvous protocol optimization, etc.

Flexible Process Mapping to Tofu environment

• You can allocate your processes as you like.
• Dimension specification for each rank

rank   1D: (x)   2D: (x,y)   3D: (x,y,z)
0      (0)       (0,0)       (0,0,0)
1      (1)       (1,0)       (1,0,0)
2      (2)       (2,0)       (0,1,0)
3      (3)       (3,0)       (1,1,0)
4      (7)       (3,1)       (0,0,1)
5      (6)       (2,1)       (0,1,1)
6      (5)       (1,1)       (1,0,1)
7      (4)       (0,1)       (1,1,1)
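The example coordinates above can be checked mechanically. The sketch below reads the entries in order as ranks 0 to 7 (an assumption about the slide's layout) and verifies that each specification places exactly one rank on every cell of its grid. The dictionary and function names are hypothetical, and this is not the format of any real rank-map file.

```python
# Rank-to-coordinate maps copied from the example on the slide,
# reading the listed entries in order as ranks 0..7 (assumption).
MAP_1D = {0: (0,), 1: (1,), 2: (2,), 3: (3,),
          4: (7,), 5: (6,), 6: (5,), 7: (4,)}
MAP_2D = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0),
          4: (3, 1), 5: (2, 1), 6: (1, 1), 7: (0, 1)}
MAP_3D = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 0), 3: (1, 1, 0),
          4: (0, 0, 1), 5: (0, 1, 1), 6: (1, 0, 1), 7: (1, 1, 1)}

def is_valid_mapping(mapping, shape):
    """True if every grid cell of the given shape hosts exactly one rank."""
    coords = list(mapping.values())
    cells = 1
    for n in shape:
        cells *= n
    in_range = all(
        len(c) == len(shape) and all(0 <= v < n for v, n in zip(c, shape))
        for c in coords
    )
    return in_range and len(set(coords)) == len(coords) == cells

def rank_at(mapping, coord):
    """Inverse lookup: which rank sits at a coordinate."""
    inverse = {c: r for r, c in mapping.items()}
    return inverse[coord]

print(is_valid_mapping(MAP_1D, (8,)),
      is_valid_mapping(MAP_2D, (4, 2)),
      is_valid_mapping(MAP_3D, (2, 2, 2)))
print(rank_at(MAP_2D, (3, 1)))
```

In the 2D case the ranks snake along the first row and back along the second, so consecutive ranks 3 and 4 stay neighbors. This is the kind of placement freedom the slide refers to.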

[Figure: placement of ranks 0 to 7 on the 1D (x), 2D (x, y), and 3D (x, y, z) grids.]

Performance Tuning

• Performance does not depend on compiler optimization alone; you can also tune it yourself
  • Compiler directives to tune programs
  • Tools to support your tuning effort
    • e.g., watch your program using event counters

Performance Tuning (Event Counter Example)

• 3-D job example
• Displays 4096 processes in 16 x 16 x 16 cells
• Cells are colored according to process status (e.g., CPU time)
• Cut a slice of the job along the x-, y-, or z-axis to view it
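The slicing idea can be illustrated with a small sketch: 4096 per-process values are placed in a 16 x 16 x 16 grid and one plane is extracted for display. The rank-to-cell layout (x varies fastest), the function names, and the sample metric are all assumptions made for illustration. The slides do not describe the tool's actual data model.

```python
DIM = 16  # 16 x 16 x 16 cells = 4096 processes

def rank_to_cell(rank):
    """Assumed layout: x varies fastest, then y, then z."""
    x = rank % DIM
    y = (rank // DIM) % DIM
    z = rank // (DIM * DIM)
    return x, y, z

def slice_along(values, axis, index):
    """Return the 16 x 16 plane of values whose coordinate on 'axis' is 'index'.

    values: list of 4096 numbers indexed by rank
    axis: 0 for x, 1 for y, 2 for z
    """
    plane = [[None] * DIM for _ in range(DIM)]
    for rank, v in enumerate(values):
        cell = rank_to_cell(rank)
        if cell[axis] == index:
            # The two remaining coordinates index the plane.
            u, w = [c for k, c in enumerate(cell) if k != axis]
            plane[w][u] = v
    return plane

# Hypothetical per-process metric (e.g. CPU time); here just the rank itself.
cpu_time = [float(r) for r in range(DIM ** 3)]
z0 = slice_along(cpu_time, axis=2, index=0)
print(z0[0][:4])   # first four cells of the row y = 0 in the plane z = 0
```

A viewer would then map each value in the plane to a color, so that an overloaded or idle region of the job stands out at a glance.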

Conclusion: Automatic and transparent performance

• VISIMPACT™ lets you treat an 8-core CPU as a single high-speed core.
  • Collaboration between the CPU architecture and the compiler
    • High-speed hardware barrier to reduce synchronization overhead
    • Shared L2 cache to improve memory access
    • Automatic parallelization to recognize parallelism and accelerate your program
• Open MPI-based MPI to utilize the “Tofu” interconnect.
• Tuning facility shows the activity of parallel programs.
