Quick viewing(Text Mode)

Programming for Fujitsu Supercomputers

Programming for Fujitsu Supercomputers

Programming for Fujitsu

Koh Hotta The Next Generation Technical Computing Fujitsu Limited To who are busy on their own research, Fujitsu provides environments for Parallel Programming Tuning Parallel Programs

Copyright 2012 FUJITSU LIMITED System Software Stack

User/ISV Applications HPC Portal / System Management Portal Technical Computing Suite System operations management High-performance Compilers  Hybrid parallel programming  System configuration management  Lustre-based distributed file system  Sector cache support  System control  High  SIMD / extensions  System monitoring  IO bandwidth guarantee  System installation & operation  High reliability & availability Support Tools  IDE TM  Profiler & Tuning tools Job operations management VISIMPACT  Interactive debugger  Job manager  Shared L2 cache on a chip  Job scheduler  Hardware intra- MPI Library  Resource management  Scalability of High-Func.  Parallel environment  Barrier Comm.

Linux-based enhanced Red Hat Enterprise Super Computer: PRIMEHPC FX10 PC cluster: PRIMERGY Copyright 2012 FUJITSU LIMITED HPC-ACE architecture & Compiler

 Expanded Registers  reduces the wait-time of instructions  SIMD instructions  reduces the number of instructions  Sector Cache  improves cache efficiency

Copyright 2012 FUJITSU LIMITED Compiler Gives Life to HPC-ACE

 NPB3.3 LU: operation wait-time dramatically reduced  NPB3.3 MG: number of instruction is halved

NPB3.3 LU NPB3.3 MG Execution time comparison (relative values) Execution time comparison (relative values) Memory wait Cache misses Memory wait Cache misses Operation wait Instructions committed Operation wait Instructions committed 1.2 1.2 Efficient use of SIMD implementation 1 Expanded registers 1 reduces Instructions 0.8 reduces operation wait 0.8 committed 0.6 0.6 0.4 0.4 0.2 0.2 Faster Faster 0 0 FX1 PRIMEHPC FX10 FX1 PRIMEHPC FX10 Copyright 2012 FUJITSU LIMITED Parallel Programming for Busy Researchers

 Large # of Parallelism for Large Scale Systems  Large # processes need Large Memory & Overhead • hybrid thread- programming to reduce number of processes  Hybrid parallel programming is annoying for programmers  Even for multi-threading, the more coarse grain, the better.  Procedure level or outer loop parallelism is desired  Little opportunity for such coarse grain parallelism  System support for “fine grain” parallelism is required  VISIMPACT solves these problems

Copyright 2012 FUJITSU LIMITED VISIMPACTTM (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

 Mechanism that treats multiple cores as one high-speed CPU through automatic parallelization  Just program and compile, and enjoy high-speed

 You need not think about hybrid CPU CPU L2$ Core Core

Process Process

.

. L2$ Inter-core

.

.

Memory Memory thread

. .

parallel L2$ Core processing

Process Core

Copyright 2012 FUJITSU LIMITED VISIMPACTTM (Virtual Single Processor by Integrated Multi-core Parallel Architecture)

 Technologies to make multiple cores to single high speed CPU  Shared L2 cache memory to avoid false  Inter-core hardware barrier facilities to reduce overhead of thread synchronization Barrier synchronization

Core Core • • • Core

Thread 1 Thread 2 Thread N

• • • Time  Thread parallel programs Hardware barrier synchronization: 10 times faster than conventional system

use these technologies • • •

Copyright 2012 FUJITSU LIMITED thread & process Hybrid-Parallel Programming

 thread parallel in a chip  Auto-parallel or explicit parallel description using OpenMPTM API  VISIMPACT enables low overhead thread execution

 process parallel beyond CPU  MPI programming  Collective communication is tuned using Tofu barrier facility

Copyright 2012 FUJITSU LIMITED MPI Software stack

Original Open MPI Software Stack Supported special Bcast・Allgather・ Alltoall・Allreduce for Tofu (Using openib BTL) Extension MPI MPI Extension MPI MPI COLL COLL PML ob1 PML PML ob1 PML PML (Point-to-Point Messaging Layer) BML BML BML r2 BML Adapting BML r2 BML to tofu tofu (BTL Management Layer) BTL LLP BTL BTL openib BTL Hardware BTL tofu BTL dependent (Byte Transfer Layer) tofu common OpenFabrics Verbs Tofu Library

LLP (Low Latency Path) Providing Common Data processing and structures for BTL・LLP・COLL Special Hardware dependent layer For Tofu Interconnect Rendezvous Protocol Optimization etc Copyright 2012 FUJITSU LIMITED Customized MPI Library for High Scalability

Point-to-Point communication •Two methods for inside communication •The transfer method selection by •the data length •process location •number of hops Quotation from K computer Collective communication performance data •Barrier, Allreduce, Bcast and Reduce use Tofu-barrier facility •Bcast, Allgather, Allgatherv, Allreduce and Alltoall use Tofu-optimized algorithm

Copyright 2012 FUJITSU LIMITED Application Tuning Cycle and Tools

Collecting Analysis & Overall Job Info. Tuning tuning Tofu-PA Profiler PAPI Vampir-trace RMATT

MPI Tuning Rank Execution mapping CPU Tuning

Fujitsu Tools Profiler Vampir-trace Open Source

Tools Copyright 2012 FUJITSU LIMITED Program Profile (Event Counter Example)

 3-D job example  Display 4096 procs in 16 x 16 x 16 cells  Cells painted in colors according to the proc status (e.g. CPU time)  Cut a slice of jobs along x-, y-, or z-axis to view

Copyright 2012 FUJITSU LIMITED Rank Mapping Optimization (RMATT)

input Network Construction RMATT Optimized Rank Communication Pattern (Communication Reduce number of hop and congestion processing contents between Rank) output

Apply MPI_Allgather Communication Processing Performance apply •Rank number : 4096 rank •Network construction : 16x16x16 node (4096) Remapping used RMATT 5.5ms x,y,z order mapping 22.3ms 4 times performance Up •8 •6 •2 •0 •1 •2

•3 •4 •5 •1 •3 •0

•6 •7 •8 •5 •7 •4

Copyright 2012 FUJITSU LIMITED Copyright 2012 FUJITSU LIMITED