Programming for Fujitsu Supercomputers

Koh Hotta The Next Generation Technical Computing Fujitsu Limited To Programmers who are busy on their own research, Fujitsu provides environments for Parallel Programming Tuning Parallel Programs

User/ISV Applications HPC Portal / System Management Portal Technical Computing Suite System operations management High-performance file system Compilers  Hybrid parallel programming  System configuration management  Lustre-based distributed file system  Sector cache support  System control  High scalability  SIMD / Register file extensions  System monitoring  IO bandwidth guarantee  System installation & operation  High reliability & availability Support Tools  IDE TM  Profiler & Tuning tools Job operations management VISIMPACT  Interactive debugger  Job manager  Shared L2 cache on a chip  Job scheduler  Hardware intra-processor MPI Library  Resource management synchronization  Scalability of High-Func.  Parallel execution environment  Barrier Comm.

 Expanded Registers  reduces the wait-time of instructions  SIMD instructions  reduces the number of instructions  Sector Cache  improves cache efficiency

 NPB3.3 LU: operation wait-time dramatically reduced  NPB3.3 MG: number of instruction is halved

NPB3.3 LU NPB3.3 MG Execution time comparison (relative values) Execution time comparison (relative values) Memory wait Cache misses Memory wait Cache misses Operation wait Instructions committed Operation wait Instructions committed 1.2 1.2 Efficient use of SIMD implementation 1 Expanded registers 1 reduces Instructions 0.8 reduces operation wait 0.8 committed 0.6 0.6 0.4 0.4 0.2 0.2 Faster Faster 0 0 FX1 PRIMEHPC FX10 FX1 PRIMEHPC FX10 Copyright 2012 FUJITSU LIMITED Parallel Programming for Busy Researchers

 Large # of Parallelism for Large Scale Systems  Large # processes need Large Memory & Overhead • hybrid thread-process programming to reduce number of processes  Hybrid parallel programming is annoying for programmers  Even for multi-threading, the more coarse grain, the better.  Procedure level or outer loop parallelism is desired  Little opportunity for such coarse grain parallelism  System support for “fine grain” parallelism is required  VISIMPACT solves these problems

 Mechanism that treats multiple cores as one high-speed CPU through automatic parallelization  Just program and compile, and enjoy high-speed

 You need not think about hybrid CPU CPU L2$ Core Core

Process Process

. L2$ Inter-core

Memory Memory thread

. .

parallel L2$ Core processing

Process Core

 Technologies to make multiple cores to single high speed CPU  Shared L2 cache memory to avoid false sharing  Inter-core hardware barrier facilities to reduce overhead of thread synchronization Barrier synchronization

Core Core • • • Core

Thread 1 Thread 2 Thread N

• • • Time  Thread parallel programs Hardware barrier synchronization: 10 times faster than conventional system

use these technologies • • •

 thread parallel in a chip  Auto-parallel or explicit parallel description using OpenMPTM API  VISIMPACT enables low overhead thread execution

 process parallel beyond CPU  MPI programming  Collective communication is tuned using Tofu barrier facility

Original Open MPI Software Stack Supported special Bcast・Allgather・ Alltoall・Allreduce for Tofu (Using openib BTL) Extension MPI MPI Extension MPI MPI COLL COLL PML ob1 PML PML ob1 PML PML (Point-to-Point Messaging Layer) BML BML BML r2 BML Adapting BML r2 BML to tofu tofu (BTL Management Layer) BTL LLP BTL BTL openib BTL Hardware BTL tofu BTL dependent (Byte Transfer Layer) tofu common OpenFabrics Verbs Tofu Library

LLP (Low Latency Path) Providing Common Data processing and structures for BTL・LLP・COLL Special Hardware dependent layer For Tofu Interconnect Rendezvous Protocol Optimization etc Copyright 2012 FUJITSU LIMITED Customized MPI Library for High Scalability

Point-to-Point communication •Two methods for inside communication •The transfer method selection by •the data length •process location •number of hops Quotation from K computer Collective communication performance data •Barrier, Allreduce, Bcast and Reduce use Tofu-barrier facility •Bcast, Allgather, Allgatherv, Allreduce and Alltoall use Tofu-optimized algorithm

Collecting Analysis & Overall Job Info. Tuning tuning Tofu-PA Profiler PAPI Vampir-trace RMATT

MPI Tuning Rank Execution mapping CPU Tuning

Fujitsu Tools Profiler Vampir-trace Open Source

 3-D job example  Display 4096 procs in 16 x 16 x 16 cells  Cells painted in colors according to the proc status (e.g. CPU time)  Cut a slice of jobs along x-, y-, or z-axis to view

input Network Construction RMATT Optimized Rank Map Communication Pattern (Communication Reduce number of hop and congestion processing contents between Rank) output

Apply MPI_Allgather Communication Processing Performance apply •Rank number : 4096 rank •Network construction : 16x16x16 node (4096) Remapping used RMATT 5.5ms x,y,z order mapping 22.3ms 4 times performance Up •8 •6 •2 •0 •1 •2

•3 •4 •5 •1 •3 •0

•6 •7 •8 •5 •7 •4