
Non-Blocking Collectives for MPI – overlap at the highest level –

Torsten Höfler

Open Systems Lab, Indiana University, Bloomington, IN, USA

Institut für Wissenschaftliches Rechnen, Technische Universität Dresden, Dresden, Germany
27th May 2008

Outline

1 Computer Architecture Past & Future

2 Why Non-blocking Collectives?

3 LibNBC

4 And Applications?

5 Ongoing Efforts


Fundamental Assumptions (I)

We need more powerful machines! Solutions for real-world scientific problems need huge processing power, more than is currently available.

The capabilities of single PEs have fundamental limits: the scaling/frequency race is currently stagnating; Moore's law still holds (number of transistors per chip); instruction-level parallelism is limited (pipelining, VLIW, multi-scalar).

Explicit parallelism seems to be the only solution: single chips and transistors get cheaper, while implicit transistor use (ILP, branch prediction) has its limits.

Fundamental Assumptions (II)

Parallelism requires communication: local or even global data dependencies exist, so off-chip communication becomes necessary and has to bridge a physical distance (many PEs).

Communication latency is bounded from below: it is widely accepted that the speed of light limits data transmission. Example: the minimal 0-byte latency over 1 m is ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE.

Bandwidth can hide latency only partially: bandwidth itself is limited by physical constraints, which leads to the problem of "scaling out" (especially for iterative solvers).

Assumptions about Parallel Program Optimization

Collective Operations: collective operations (COs) are an optimization tool; CO performance influences application performance; an optimized implementation and the analysis of COs are non-trivial.

Hardware Parallelism: more PEs handle more tasks in parallel; transistors/PEs take over communication processing; communication and computation could run simultaneously.

Overlap of Communication and Computation: overlap can hide latency and improves application performance.

The LogGP Model

[Figure: LogGP time diagram. The sender and receiver CPUs each incur the overhead o (o_s, o_r), the network adds the latency L, and an m-byte message occupies each network endpoint for g + m·G.]
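A reading aid (not stated on the original slide): with the parameters from the figure, the LogGP estimate for a single m-byte point-to-point message is

  T_p2p ≈ 2o + L + m · G

which matches the per-stage term of the Allreduce model later in this deck; only the two o terms keep the host CPU busy.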

Interconnect Trends

Technology Change: modern interconnects offload communication to co-processors (Quadrics, InfiniBand, Myrinet); TCP/IP stacks are optimized for lower host overhead (e.g., Gamma); even legacy Ethernet supports protocol offload. In LogGP terms: L + g + m · G >> o

⇒ we verify these expectations with benchmarks of the user-level CPU overhead

LogGP Model Examples - TCP

[Plot: LogGP parameters over data size s (0–60000 bytes) for MPICH2 and raw TCP; curves G·s + g (network time) and o (CPU overhead); y-axis: time in microseconds (0–600).]

LogGP Model Examples - Myrinet/GM

[Plot: LogGP parameters over data size s (0–60000 bytes) for Open MPI and raw Myrinet/GM; curves G·s + g and o; y-axis: time in microseconds (0–350).]

LogGP Model Examples - InfiniBand/OpenIB

[Plot: LogGP parameters over data size s (0–60000 bytes) for Open MPI and raw OpenIB; curves G·s + g and o; y-axis: time in microseconds (0–90).]


Isend/Irecv is there - Why Collectives?

Gorlatch, '04: "Send-Receive Considered Harmful" ⇔ Dijkstra, '68: "Go To Statement Considered Harmful"

point-to-point:

  if (rank == 0) then
    call MPI_SEND(...)
  else
    call MPI_RECV(...)
  end if

vs. collective:

  call MPI_GATHER(...)

compare: math libraries vs. hand-written loops

Sparse Collectives

But my algorithm only needs nearest-neighbor communication!? ⇒ this is a collective too, just a sparse one (cf. sparse BLAS): sparse communication with neighbors on a process topology; a graph topology makes it generic; many optimization possibilities (process placement, overlap, message scheduling/forwarding); easy to implement; not part of MPI, but fully implemented in LibNBC and proposed to the MPI Forum.
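To illustrate what such a sparse collective encapsulates, here is a minimal sketch using only MPI-1 calls: the neighborhood comes from a graph topology and the exchange uses non-blocking point-to-point operations. The buffer layout, message size, and the limit on the neighbor count are assumptions for the example; this is not LibNBC's sparse-collective interface.

  /* Sketch: nearest-neighbor exchange on an MPI-1 graph topology.
   * This is the operation a sparse (neighbor) collective would wrap. */
  #include <mpi.h>

  void neighbor_exchange(MPI_Comm graph_comm, double *sendbuf,
                         double *recvbuf, int count)
  {
      int rank, nneigh;
      MPI_Comm_rank(graph_comm, &rank);
      MPI_Graph_neighbors_count(graph_comm, rank, &nneigh);

      int neighbors[64];                       /* assumes <= 64 neighbors */
      MPI_Graph_neighbors(graph_comm, rank, 64, neighbors);

      MPI_Request reqs[128];
      for (int i = 0; i < nneigh; i++) {
          MPI_Irecv(&recvbuf[i * count], count, MPI_DOUBLE, neighbors[i],
                    0, graph_comm, &reqs[i]);
          MPI_Isend(&sendbuf[i * count], count, MPI_DOUBLE, neighbors[i],
                    0, graph_comm, &reqs[nneigh + i]);
      }
      MPI_Waitall(2 * nneigh, reqs, MPI_STATUSES_IGNORE);
  }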

Why Non-blocking Collectives?

Collectives typically scale with O(log₂P) sends; wasted CPU time: log₂P · (L + G_all). Typical latencies (µs): Fast Ethernet L = 50-60, Gigabit Ethernet L = 15-20, InfiniBand L = 2-7; 1 µs ≈ 6000 FLOP on a 3 GHz machine ... and many collectives synchronize unnecessarily.
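To put numbers on this (an illustration only; P = 1024 is an assumed machine size and G_all is ignored):

  log₂P · L = 10 · 5 µs = 50 µs ≈ 50 · 6000 = 300,000 FLOP

wasted per blocking collective call and per process, taking the InfiniBand latency L ≈ 5 µs and the 3 GHz machine from above.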

Modelling the Benefits

LogGP model for Allreduce:

  t_allred = 2 · (2o + L + m · G) · ⌈log₂P⌉ + m · γ · ⌈log₂P⌉

[Plot: CPU overhead (1 KiB message), network time (1 KiB message), and a constant 15 µs non-blocking overhead over 10-10000 processes; y-axis: time in µs, log scale 0.01-1000.]
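A small sketch that evaluates this model numerically; all parameter values below are placeholders, not measurements from this deck.

  /* Evaluate the LogGP Allreduce model from the slide:
   * t_allred = 2*(2o + L + m*G)*ceil(log2 P) + m*gamma*ceil(log2 P). */
  #include <math.h>
  #include <stdio.h>

  static double t_allred(double o, double L, double G, double gamma,
                         double m, int P)
  {
      double rounds = ceil(log2((double)P));
      return 2.0 * (2.0 * o + L + m * G) * rounds + m * gamma * rounds;
  }

  int main(void)
  {
      double o = 2.0, L = 5.0;          /* us, roughly InfiniBand-like   */
      double G = 0.003, gamma = 0.001;  /* us per byte (made-up values)  */
      for (int P = 2; P <= 1024; P *= 2)
          printf("P = %4d  t_allred = %8.1f us (m = 1024 B)\n",
                 P, t_allred(o, L, G, gamma, 1024.0, P));
      return 0;
  }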

CPU Overhead Benchmarks

Allreduce, LAM/MPI 7.1.2/TCP over GigE

[3D surface plot: CPU usage (share, 0-0.03) over data size (1-100000) and communicator size (10-60).]

Performance Benefits

Overlap: leverages hardware parallelism (e.g., InfiniBand™); overlap works much like non-blocking point-to-point.

Pseudo-synchronization: explicit pseudo-synchronization is avoided, which limits the influence of OS noise.

Process Skew

Caused by OS interference or an unbalanced application; worse if processors are overloaded; multiplies on big systems and can cause a dramatic performance decrease: all nodes wait for the last one.

Example: Petrini et al. (2003), "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q"


MPI_Bcast with P0 delayed - Jumpshot

[Jumpshot trace: processes P0-P3 over time.]

MPI_Ibcast with P0 delayed + overlap - Jumpshot

[Jumpshot trace: processes P0-P3 over time.]


Non-Blocking Collectives - Interface

An extension to MPI-2; a "mixture" of non-blocking point-to-point and collective operations; uses MPI_Request objects and MPI_Test/MPI_Wait.

Interface:

  MPI_Ibcast(buf, count, MPI_INT, 0, comm, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);

Proposal: Hoefler et al. (2006), "Non-Blocking Collective Operations for MPI-2"
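A minimal sketch of the intended usage pattern, written against the interface shown above (LibNBC provides the same operations with an NBC_ prefix and an NBC_Handle instead of an MPI_Request); independent_computation() is a hypothetical placeholder for work that does not touch buf.

  #include <mpi.h>

  void independent_computation(void);   /* hypothetical overlapped work */

  void overlapped_bcast(int *buf, int count, MPI_Comm comm)
  {
      MPI_Request req;

      MPI_Ibcast(buf, count, MPI_INT, 0, comm, &req);  /* start collective */
      independent_computation();                        /* overlap with it  */
      MPI_Wait(&req, MPI_STATUS_IGNORE);                /* complete before
                                                           buf is used      */
  }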


Non-Blocking Collectives - Implementation

The implementation is available as LibNBC, written in ANSI C and using only MPI-1. Its central element is the collective schedule: any collective algorithm can be represented as a schedule, which makes adding new algorithms trivial. Example: dissemination barrier, 4 nodes, schedule of node 0:

  send to 1; recv from 3; end
  send to 2; recv from 2; end

(each "end" closes one round of the schedule)

LibNBC download: http://www.unixer.de/NBC
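To make the schedule idea concrete, here is a small sketch that builds such a schedule as an explicit list of operations for a dissemination barrier. It only illustrates the concept; it is not LibNBC's actual internal data structure.

  /* Sketch: a collective "schedule" as a flat list of operations.
   * OP_END separates the rounds of the algorithm. */
  #include <stdio.h>

  typedef enum { OP_SEND, OP_RECV, OP_END } op_type;
  typedef struct { op_type type; int peer; } sched_entry;

  /* Dissemination barrier: in round r a rank sends to (rank + 2^r) mod P
   * and receives from (rank - 2^r) mod P. */
  static int build_dissemination_barrier(int rank, int P, sched_entry *s)
  {
      int n = 0;
      for (int dist = 1; dist < P; dist *= 2) {
          s[n].type = OP_SEND; s[n].peer = (rank + dist) % P;      n++;
          s[n].type = OP_RECV; s[n].peer = (rank - dist + P) % P;  n++;
          s[n].type = OP_END;  s[n].peer = -1;                     n++;
      }
      return n;
  }

  int main(void)
  {
      sched_entry s[64];                 /* enough for P up to ~2^21 */
      int n = build_dissemination_barrier(0, 4, s);
      for (int i = 0; i < n; i++) {
          if      (s[i].type == OP_SEND) printf("send to %d\n", s[i].peer);
          else if (s[i].type == OP_RECV) printf("recv from %d\n", s[i].peer);
          else                           printf("end\n");
      }
      return 0;
  }

For node 0 of a 4-node barrier this prints exactly the schedule shown above.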

Overhead Benchmarks - Gather with InfiniBand/MVAPICH on 64 nodes

[Plot: runtime of MPI_Gather vs. NBC_Igather over data size (0-300000 bytes).]

Overhead Benchmarks - Scatter with InfiniBand/MVAPICH on 64 nodes

[Plot: runtime of MPI_Scatter vs. NBC_Iscatter over data size (0-300000 bytes).]

Overhead Benchmarks - Alltoall with InfiniBand/MVAPICH on 64 nodes

[Plot: runtime of MPI_Alltoall vs. NBC_Ialltoall over data size (0-300000 bytes).]

Overhead Benchmarks - Allreduce with InfiniBand/MVAPICH on 64 nodes

[Plot: runtime of MPI_Allreduce vs. NBC_Iallreduce over data size (0-300000 bytes).]


Linear Solvers - Domain Decomposition

First Example: Naturally Independent Computation - a 3D Poisson Solver

Iterative linear solvers are used in many scientific kernels; a frequently used operation is the vector-matrix multiply; the matrix is domain-decomposed (e.g., in 3D); only the outer (border) elements need to be communicated, and this communication can be overlapped (a sketch follows below).
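A minimal sketch of one overlapped solver iteration, assuming the proposed non-blocking variant of Alltoallv (prefixed NBC_ in LibNBC); pack_halos, unpack_halos, compute_inner, and compute_border are hypothetical application routines.

  #include <mpi.h>

  void pack_halos(const double *u, double *sendbuf);
  void unpack_halos(double *u, const double *recvbuf);
  void compute_inner(double *u);    /* points that need no remote data */
  void compute_border(double *u);   /* points that need halo data      */

  void solver_iteration(double *u, double *sendbuf, double *recvbuf,
                        int *scnt, int *sdsp, int *rcnt, int *rdsp,
                        MPI_Comm comm)
  {
      MPI_Request req;

      pack_halos(u, sendbuf);                          /* copy border elements */
      MPI_Ialltoallv(sendbuf, scnt, sdsp, MPI_DOUBLE,  /* start halo exchange  */
                     recvbuf, rcnt, rdsp, MPI_DOUBLE, comm, &req);

      compute_inner(u);                                /* overlapped work      */

      MPI_Wait(&req, MPI_STATUS_IGNORE);               /* halos have arrived   */
      unpack_halos(u, recvbuf);
      compute_border(u);
  }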

Domain Decomposition

Nearest-neighbor communication; can be implemented with MPI_Alltoallv or with sparse collectives.

[Figure: 2D domain decomposed across processes P0-P11; each process holds its process-local data surrounded by halo data.]

Parallel (Best Case)

[Two plots: speedup over number of CPUs (8-96), blocking vs. non-blocking, for Gigabit Ethernet (left) and InfiniBand (right).]

Cluster: 128 nodes with 2 GHz Opteron 246. Interconnect: Gigabit Ethernet, InfiniBand™. System size 800x800x800 (1 node ≈ 5300 s).

Parallel Data Compression

Second Example: Data-Parallel Loops - Parallel Compression

Automatic transformations (C++ templates); typical loop structure:

  for (i = 0; i < N/P; i++) {
    compute(i);
  }
  comm(N/P);
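A sketch of what an overlapped version of this loop can look like: the iteration space is tiled, and the communication for each finished tile is started non-blocking while the next tile is computed. The tile size T, compute(), and comm_start() are illustrative placeholders, not the actual template-generated code.

  #include <mpi.h>

  #define T 1024                       /* hypothetical tile size            */

  void compute(int i);                 /* per-element work                  */
  void comm_start(int first, int n,    /* start non-blocking communication  */
                  MPI_Request *req);   /* for one finished tile             */

  void overlapped_loop(int N, int P)
  {
      MPI_Request reqs[4096];          /* assumes at most 4096 tiles        */
      int ntiles = 0;

      for (int start = 0; start < N / P; start += T) {
          int end = (start + T < N / P) ? start + T : N / P;
          for (int i = start; i < end; i++)
              compute(i);
          comm_start(start, end - start, &reqs[ntiles++]); /* overlaps with
                                                              the next tile  */
      }
      MPI_Waitall(ntiles, reqs, MPI_STATUSES_IGNORE);
  }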

Parallel Compression Communication Overhead

[Plot: communication overhead in seconds (0-0.5) for MPI/BL, MPI/NBC, and OF/NBC at 64, 32, 16, and 8 processes.]

Parallel 3d Fast Fourier Transform

Third Example: Specialized Algorithms - a Parallel 3D-FFT with Overlap

Specialized design to achieve the highest overlap. Less cache-friendly!

Non-blocking Collectives - 3D-FFT

Derivation from the "normal" implementation: the data distribution is identical to a "normal" 3D-FFT; the first FFT in z direction and the index swap are identical.

Design goals to minimize communication overhead: start communication as early as possible; achieve the maximum overlap time.

Solution: start an MPI_Ialltoall as soon as the first xz-plane is ready; calculate the next xz-plane; start the next communication accordingly ... optionally collect multiple xz-planes before communicating (tile factor). A sketch of this pipeline follows below.
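A minimal sketch of the pipeline just described, using the proposed MPI_Ialltoall (NBC_Ialltoall in LibNBC). The plane buffers, datatype, and fft_plane() routine are illustrative placeholders; a tile factor of 1 is assumed.

  #include <mpi.h>
  #include <stdlib.h>

  void fft_plane(double *plane);       /* 1D-FFTs within one xz-plane        */

  void fft_with_overlap(double **send_planes, double **recv_planes,
                        int nplanes, int count, MPI_Comm comm)
  {
      MPI_Request *req = malloc(nplanes * sizeof *req);

      for (int p = 0; p < nplanes; p++) {
          fft_plane(send_planes[p]);                   /* transform plane p  */
          MPI_Ialltoall(send_planes[p], count, MPI_DOUBLE,
                        recv_planes[p], count, MPI_DOUBLE,
                        comm, &req[p]);                /* start its exchange;
                                                          it overlaps with
                                                          plane p+1          */
      }
      for (int p = 0; p < nplanes; p++) {
          MPI_Wait(&req[p], MPI_STATUS_IGNORE);        /* plane p arrived    */
          fft_plane(recv_planes[p]);                   /* transform it in the
                                                          next direction     */
      }
      free(req);
  }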


Transformation in z Direction

Data already transformed in y direction

[Figure: 3x3x3 grid, axes x, y, z]
1 block = 1 double value (3x3x3 grid)

Transformation in z Direction

Transform first xz plane in z direction
[Figure: 3x3x3 grid]
pattern means that data was transformed in y and z direction

Transformation in z Direction

start MPI_Ialltoall of first xz plane and transform second plane
[Figure: 3x3x3 grid]
cyan color means that data is communicated in the background

Transformation in z Direction

start MPI_Ialltoall of second xz plane and transform third plane
[Figure: 3x3x3 grid]
data of two planes is not accessible due to communication

Transformation in x Direction

start communication of the third plane and ...
[Figure: 3x3x3 grid]
we need the first xz plane to go on ...

Transformation in x Direction

... so MPI_Wait for the first MPI_Ialltoall!
[Figure: 3x3x3 grid]
and transform first plane (new pattern means xyz transformed)

Transformation in x Direction

Wait and transform second xz plane
[Figure: 3x3x3 grid]
first plane's data could be accessed for the next operation

Transformation in x Direction

wait and transform last xz plane
[Figure: 3x3x3 grid]
done! → 1 complete 1D-FFT overlaps a communication

1024³ 3D-FFT over InfiniBand

[Plot: FFT time in seconds (0-7) for MPI/BL, MPI/NB, and NBC/NB, at 1 ppn and 2 ppn.]

P=128, "Coyote"@LANL - 128/64 dual-socket 2.6 GHz Opteron nodes

1024³ 3D-FFT on the XT4

[Plot: FFT time in seconds (0-20) for MPI/BL and NBC/NB, at 32, 64, and 128 processes.]

"Jaguar"@ORNL - Cray XT4, dual-socket dual-core 2.6 GHz Opteron

1024³ 3D-FFT on the XT4 (Communication Overhead)

[Plot: communication overhead in seconds (0-4.5) for MPI/BL and NBC/NB, at 32, 64, and 128 processes.]

"Jaguar"@ORNL - Cray XT4, dual-socket dual-core 2.6 GHz Opteron

640³ 3D-FFT over InfiniBand (Communication Overhead)

[Plot: FFT communication time in seconds (0-0.7) for MPI/BL, MPI/NBC, and OF/NBC, at 64, 32, 16, 8, 4, and 2 processes.]

"Odin"@IU - dual-socket dual-core 2.0 GHz Opteron, InfiniBand


Ongoing Work

LibNBC: analysis of the multi-threaded implementation; optimized collective algorithms.

Collective Communication: optimized collectives for InfiniBand™ (topology-aware), using special hardware support.

Applications: work on more applications ⇒ interested in collaborations (ask me!)

Discussion
THE END
Questions?

Thank you for your attention!