FUTURE TECHNOLOGIES GROUP

One-sided vs. Two-sided Communication Paradigms on Relaxed Ordering Interconnects

Khaled Ibrahim

Paul Hargrove, Costin Iancu, Kathy Yelick

LAWRENCE BERKELEY NATIONAL LABORATORY

Overview

 Application performance of one-sided vs. two-sided communication on the Cray XE6 (Gemini interconnect)
 Brief overview of the Cray software stack and hardware on Hopper
. Support for relaxed ordering
 Performance comparison of communication primitives using strict and relaxed ordering
 Interaction of the one-sided and two-sided communication paradigms with relaxed ordering

Motivation

Figure: Percentage UPC speedup over MPI for the NAS benchmarks (ep, ft, is, lu, mg, sp, bt) and their harmonic mean, at 64 and 256 processes.

Also: why is GASNet not matching the vendor's performance? Why does one-sided communication perform better than two-sided MPI?

Cray Software Stack

 uGNI: multiple communication protocols
. RDMA through the Block Transfer Engine (BTE)
• Optimized for large messages
. FMA
• Optimized for small messages
 DMAPP communication protocol
. High-level protocol on top of GNI for one-sided communication (see the sketch below)
. Supports collectives

Figure: Cray software stack. User level: Fortran, C, UPC; MPICH2, SHMEM, libpgas; GNI and DMAPP. Kernel level: GNI (kGNI), GNI Core, Generic Hardware Abstraction Layer (GHAL). Hardware: Gemini/Aries.
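To make the DMAPP layer concrete, below is a minimal sketch of a blocking one-sided put through DMAPP's C interface. It is an illustration only: the entry points (dmapp_init, dmapp_get_jobinfo, dmapp_sheap_malloc, dmapp_put) are the ones documented by Cray, but the attribute handling and structure field names are written from memory and should be checked against dmapp.h.

```c
/* Illustrative DMAPP one-sided put (sketch; verify names against dmapp.h). */
#include <dmapp.h>
#include <stdio.h>

int main(void)
{
    dmapp_rma_attrs_t requested = {0}, actual;
    dmapp_jobinfo_t   job;

    /* Real code fills in the RMA attributes (outstanding non-blocking
     * limit, ordering mode, ...); this sketch relies on the defaults.  */
    dmapp_init(&requested, &actual);
    dmapp_get_jobinfo(&job);      /* my PE, number of PEs, sheap segment */

    /* Symmetric-heap allocation: remotely accessible memory whose
     * location is unambiguous on every PE.                             */
    long *buf = (long *)dmapp_sheap_malloc(sizeof(long));
    *buf = (long)job.pe;

    dmapp_pe_t peer = (job.pe + 1) % job.npes;

    /* Blocking put of one 8-byte element to the peer; DMAPP decides
     * internally whether FMA or the BTE carries the transfer.          */
    dmapp_put(buf, &job.sheap_seg, peer, buf, 1, DMAPP_QW);

    dmapp_sheap_free(buf);
    dmapp_finalize();
    return 0;
}
```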

Hopper Hardware

Hopper node (Cray XE6):
. Two twelve-core 2.1 GHz AMD 'Magny-Cours' processors per node
. Four DDR3-1333 memory channels per twelve-core 'Magny-Cours' processor
. 201.6 Gflops per node
. L1: 64 KB and L2: 512 KB caches per core
. 6 MB L3 cache shared among 6 cores
. Dies connected by HyperTransport 3 (HT3)
. HT3 links to the Gemini NICs (NIC 0 and NIC 1); a 48-port YARC router connects Gemini to the 3D torus

Relaxation of Memory Transfers

 HyperTransport transaction types
. Posted (writes without ack)
. Non-posted (reads, or writes with ack)
 Relaxation on Gemini
. Strict (no reordering)
. Default (non-posted GETs may pass posted writes)
. Relaxed (all non-posted transactions may pass posted writes)
 Relaxation affects both in-node memory transactions and remote memory transactions
 Interconnect relaxation: how?
. Dynamic routing
. Multiple virtual channels
 Why?
. Performance: it is easier to improve throughput (bandwidth) than to improve latency (wire delay)
. Resilience and fault tolerance

Microbenchmark

 Objective
. Compare the performance of low-level APIs vs. high-level runtimes
 Dialects
. uGNI (FMA & BTE)
. DMAPP
. Berkeley UPC
. Cray UPC
. OSU MBW (MPI)
 Ranks: m
 Window: w = number of outstanding non-blocking operations (see the sketch below)
 Testbed
. Bidirectional
. 48 communication pairs
. Multiple outstanding messages (window size), default 1

Figure: pairing of UPC threads across nodes; thread i on nodes 0 & 1 communicates with thread i+m on nodes 2 & 3.
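A hedged sketch of the measurement loop, in the style of the OSU multi-pair bandwidth test: each sender keeps a window of w non-blocking operations in flight before waiting. The parameters (MSG_SIZE, WINDOW, ITERS) and the adjacent-rank pairing are illustrative, not the settings of the actual runs, and the real test is bidirectional across 48 pairs; this sketch shows one direction only.

```c
/* Windowed bandwidth loop (illustrative, OSU-MBW style).
 * Assumes an even number of ranks, paired as (0,1), (2,3), ...      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (8 * 1024)   /* bytes per message (assumption)        */
#define WINDOW   64           /* outstanding non-blocking ops per wait */
#define ITERS    100

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_SIZE);
    MPI_Request req[WINDOW];
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank % 2 == 0)   /* even ranks send, odd ranks receive */
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
            else
                MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[w]);
        }
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double secs = MPI_Wtime() - t0;

    if (rank == 0)
        printf("per-pair bandwidth: %.1f MB/s\n",
               (double)ITERS * WINDOW * MSG_SIZE / secs / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```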

DMAPP vs. uGNI (FMA and BTE) - Strict

Figure: Bandwidth (MB/s) vs. message size (8 B to 4 MB) for DMAPP, FMA, and BTE transfers under strict ordering.

DMAPP vs. uGNI (FMA and BTE) - Relaxed

Figure: Bandwidth (MB/s) vs. message size (8 B to 4 MB) for DMAPP, FMA, and BTE transfers under relaxed ordering.

DMAPP can hide the complexity of uGNI (FMA and BTE).

Strict Ordering and Concurrency

Figure: Bandwidth (MB/s) vs. message size (8 B to 4 MB) for the FMA and BTE transports under strict ordering, at concurrencies of 1, 2, 4, 8, 16, and 24 processes per node.

Relaxed Ordering and Concurrency

Figure: Bandwidth (MB/s) vs. message size (8 B to 4 MB) for the FMA and BTE transports under relaxed ordering, at concurrencies of 1, 2, 4, 8, 16, and 24 processes per node.

Low-Level API Summary

 DMAPP hides the switching complexity between FMA and BTE
. It also provides collectives
 Relaxed ordering improves performance (as expected)
 Strict ordering hurts performance MOSTLY under high concurrency
 Registration is an expensive memory operation that is best removed from the critical path of execution
 The performance difference between one-sided and two-sided is NOT due to different performance of the low-level APIs

(Software stack figure repeated: Fortran, C, UPC; MPICH2, SHMEM, libpgas; GNI, DMAPP; kernel-level GNI (kGNI), GNI Core, GHAL; hardware: Gemini/Aries.)

Communication Paradigms

Figure: One-sided UPC vs. two-sided MPI. In UPC, each thread (P0, P1) has private local memory plus shared and promoted regions, and data moves with put/get operations. In MPI, each process has only local memory (promoted for communication as needed), and data moves with matched send/recv operations.
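The contrast in source form (illustrative fragments, not the benchmark code). In the UPC version, thread 0 writes directly into thread 1's shared element and the target thread does nothing but synchronize; in the MPI version, the same transfer cannot complete until the receiver posts a matching receive.

```c
/* one_sided.upc -- one-sided put in UPC (requires a UPC compiler,
 * e.g. Berkeley UPC or Cray UPC, and at least 2 threads).           */
#include <upc_relaxed.h>
#include <stdio.h>

shared double data[THREADS];        /* data[1] has affinity to thread 1 */

int main(void)
{
    double x = 3.14;
    if (MYTHREAD == 0)
        upc_memput(&data[1], &x, sizeof(double));   /* remote put        */
    upc_barrier;                                    /* completion + sync */
    if (MYTHREAD == 1)
        printf("thread 1 sees %g\n", data[1]);
    return 0;
}
```

```c
/* two_sided.c -- the same movement with two-sided MPI: the transfer
 * is only complete once rank 1 posts the matching receive.          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 3.14, y = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", y);
    }
    MPI_Finalize();
    return 0;
}
```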

Communication Paradigm Territory

Traditional
 Shared memory - coherent caches
. One-sided load/store
. Default to relaxed ordering
• Strict for synchronization
 Message passing - no coherence
. Send/recv matching
. Strict matching

Communication Paradigm Territory

Traditional
 Shared memory - coherent caches
. One-sided load/store
. Default to relaxed ordering
• Strict for synchronization
 Message passing - no coherence
. Send/recv matching
. Strict matching

PGAS
 Shared memory - coherent caches
. One-sided load/store
. Default to relaxed ordering
• Strict for synchronization
 Partitioned Global Address Space between nodes
. One-sided put/get (and load/store)
. Strict ordering if blocking
• Relaxed if non-blocking

Communication Paradigm Territory

Traditional
 Shared memory - coherent caches
. One-sided load/store
. Default to relaxed ordering
• Strict for synchronization
 Message passing - no coherence
. Send/recv matching
. Strict matching

MPI+MPI
 Shared memory - coherent caches
. Send/recv matching
. Shared memory bypass
 Message passing between nodes
. Send/recv matching
. Matching is strict, progress is not

Relaxed Ordering and Communication Paradigm

 Requirements to service multiple outstanding requests
. Minimal remote-end involvement
. Unambiguous destination
 What HPC runtimes expect from the API
. Point-to-point ordering
• Node-to-node ordering
• Rank-to-rank ordering (multi-core makes this different from node-to-node)
. Or relaxed ordering, with multiple completion semantics (local and global)

Unambiguous Destination (Registration)

Registration (resolving remote ambiguity; see the sketch after this slide):
. Prevents memory swapping
. Allows offloading communication to the NIC
. Allows out-of-order progress of many transfers
. Expensive (hopefully not frequent)
. Limited and shared resources

Figure: (a) Registration and (b) unregistration cost (usec, log scale) vs. memory segment size (2 KB to 8 MB) for 1, 4, and 24 processes with 4 KB pages, and 24 processes with 2 MB pages; alongside the put/get (UPC) vs. send/recv (MPI) memory-layout diagram (local, shared, promoted).
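A small UPC sketch of the point being made (an illustration of the idea, not the runtime's actual mechanism): memory allocated in the shared space is known to the runtime when it is created, so it can be registered once, up front and off the critical path, whereas an arbitrary private buffer handed to a two-sided library is only discovered at communication time.

```c
/* registration.upc -- shared vs. private memory from the runtime's
 * point of view (requires a UPC compiler; illustrative only).       */
#include <upc.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    /* Shared allocation with affinity to this thread: the runtime sees
     * it at creation and can register it immediately.                 */
    shared [] char *shared_buf = (shared [] char *)upc_alloc(N);

    /* Ordinary private memory: a one-sided runtime never needs to
     * guess about it, but a two-sided library only learns it will be
     * communicated when it appears as a send/recv buffer.             */
    char *private_buf = (char *)malloc(N);

    /* ... put/get out of shared_buf; private_buf stays local ...      */

    upc_free(shared_buf);
    free(private_buf);
    return 0;
}
```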

UPC Shared Memory Ordering Challenge - Example

 Cray's definition of "Global Completion"
. The source of a GASNet Heisenbug!

Globally addressed memory, address m(A_node, offset); all locations initialized to zero. P0 and P1 sit behind one Gemini NIC; memory A1 is on a second node (P2, P3) across the 3D torus.

P0:
 01: Get (m(A0,0) <- m(A1,0))
 02: m(A0,0) <- m(A0,0) + 1
 03: Put (m(A0,0) -> m(A1,0))
 04: m(A0,1) <- 1   (raise the flag)

P1:
 11: Wait until m(A0,1) is set
 12: Get (m(A0,2) <- m(A1,0))
 13: Print m(A0,2)
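Roughly the same pattern written as UPC source (a sketch; the actual Heisenbug was inside the GASNet/DMAPP completion handling, not in user code). Unless the put in line 03 is globally complete before the flag write in line 04 becomes visible, P1 can observe the flag and still read the old value from the remote node. The strict qualifier on the flag is what the language offers to forbid that reordering; an explicit upc_fence before the flag write would serve the same purpose.

```c
/* ordering.upc -- producer/consumer flag pattern from the slide
 * (requires a UPC compiler and THREADS >= 3).                       */
#include <upc_relaxed.h>
#include <stdio.h>

relaxed shared int data[THREADS];  /* data[2] lives on a third thread */
strict  shared int flag;           /* affinity to thread 0; statically */
                                   /* zero-initialized                 */
int main(void)
{
    if (MYTHREAD == 0) {
        int v = data[2];           /* 01: get from the remote thread   */
        v = v + 1;                 /* 02: increment locally            */
        data[2] = v;               /* 03: put the new value back       */
        flag = 1;                  /* 04: strict write fences the put  */
    }
    if (MYTHREAD == 1) {
        while (flag == 0) ;        /* 11: wait for thread 0's flag     */
        printf("%d\n", data[2]);   /* 12-13: must print 1              */
    }
    return 0;
}
```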

MPI Ordering

 Non-overtake rule (determinism), illustrated in the example after the figure
. If a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same receive, the receive operation will receive Message 1 before Message 2.
. If a receiver posts two receives (Receive 1 and Receive 2) in succession, and both are looking for the same message, Receive 1 will receive the message before Receive 2.

Figure: Ordering example across three nodes on the 3D torus (P0/P1, P2/P3, P4/P6 behind their Gemini NICs). Sends are written S with their source and destination ranks and are issued in order; receives are written R and name either a specific source rank or ANY.
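A small program that leans on the non-overtake rule (plain MPI, illustrative): both messages from rank 0 can match either of the wildcard-tag receives on rank 1, yet the standard guarantees the first receive completes with Message 1.

```c
/* Non-overtake rule: two sends from the same source to the same
 * destination that match the same receives are received in order.  */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a, b;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int m1 = 1, m2 = 2;
        MPI_Send(&m1, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* message 1 */
        MPI_Send(&m2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* message 2 */
    } else if (rank == 1) {
        /* Both receives match either message, yet a==1 and b==2 is
         * guaranteed by the non-overtake rule.                        */
        MPI_Recv(&a, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&b, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d then %d\n", a, b);   /* always 1 then 2 */
    }
    MPI_Finalize();
    return 0;
}
```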

Cray MPI Method to Exploit Gemini Relaxed Ordering

 Typical eager vs. rendezvous protocol selection
 Cray MPI uses two eager modes (E0, E1) and two rendezvous modes (R0, R1)
. Switching is controlled by message size
. A registration cache is used (the library may not be able to tell in advance which memory will be used for communication)

Protocol regions along the message-size axis (0 ... 512 B ... 1 KB ... 8 KB ... 512 KB ... 4 MB):
. E0: up to MPICH_GNI_MAX_VSHORT_MSG_SIZE
. E1: up to MPICH_GNI_MAX_EAGER_MSG_SIZE
. R0: up to MPICH_GNI_NDREG_MAXSIZE
. R1: above MPICH_GNI_NDREG_MAXSIZE

MPI Eager 0

The sender (PE0) writes into per-source SMSG mailboxes at the receiver (PE1):
 1. GNI SMSG send (header + data) into the receiver's mailbox
 2. Completion queue (CQ) notification at the receiver
 3. The receiver copies the data out of the mailbox into the application buffer

The maximum E0 message size changes with the number of ranks.

MPI Eager 1

Sender (PE0), receiver (PE1), SMSG mailboxes:
. The sender copies the data into pre-allocated (registered) runtime buffers and sends a GNI SMSG header to the receiver's mailbox (CQ notification at the receiver)
. The receiver pulls the data with an RDMA FMA GET from the runtime buffers
. A GNI SMSG (Recv Done) notifies the sender
. The data is copied into the application buffer

MPI Rendezvous 0

Sender (PE0), receiver (PE1), SMSG mailboxes:
 1. The sender registers the send data (or looks it up in the registration cache)
 2. GNI SMSG send (header) to the receiver's mailbox
 3. CQ notification at the receiver
 4. The receiver registers the receive data (or looks it up in the registration cache)
 5. The receiver pulls the data with an RDMA BTE GET
 6. GNI SMSG (Recv Done) back to the sender

MPI Rendezvous 1

Sender (PE0), receiver (PE1), SMSG mailboxes:
 1. GNI SMSG send (header) to the receiver's mailbox
 2. CQ notification at the receiver
 3. The receiver registers the receive data (in chunks)
 4. GNI SMSG (CTS) back to the sender; the sender registers the send data (in chunks)
 5. Pipelined GNI RDMA BTE PUTs move the data
 6. GNI SMSG send (done)

Two-sided Summary of Challenges

 Remote involvement
. Rank-to-rank strict ordering can cause
• Node-to-node strict ordering by the hardware
• How to handle message cancellation?!
. Message size ambiguity (see the example below)
• int MPI_Irecv(buf, count, datatype, source, tag, ...)
– count reflects the buffer size, NOT the message size
– The receiver cannot tell which protocol will be used before matching
• Probing obstructs relaxed-ordering progress
 Remote readiness for transfer
. All memory defaults to local
. Registration is on demand
• Performance varies with hits and misses in the registration cache
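The "count reflects buffer size, not message size" point made concrete (illustrative): the receiver must size the buffer for the largest possible message, and only learns the actual size from the status after completion, which is one reason the library cannot pick a protocol before matching.

```c
/* The count argument of MPI_Irecv sizes the buffer, not the incoming
 * message; the actual size is only known after completion.          */
#include <mpi.h>
#include <stdio.h>

#define MAX_MSG 4096        /* must cover the largest possible message */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        char payload[100] = {0};               /* sender picks 100 bytes */
        MPI_Send(payload, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        char buf[MAX_MSG];
        MPI_Request req;
        MPI_Status  status;
        int received;

        /* Before matching, only the upper bound on the size is known. */
        MPI_Irecv(buf, MAX_MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &status);
        MPI_Get_count(&status, MPI_CHAR, &received);
        printf("buffer size %d, actual message size %d\n", MAX_MSG, received);
    }
    MPI_Finalize();
    return 0;
}
```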

MPI vs. UPC 8B Transactions

Figure: Percentage of peak bandwidth as a function of window size (1, 4, or 16 outstanding messages per rank) and concurrency (1, 2, 4, 8, 16, 24 ranks per node), for MPI default, MPI buffers (no BTE), UPC shared destination, and UPC non-shared destination; panels cover 8 B, 8 KB, 256 KB, and 2 MB messages.

 Experiment:
. Vary the number of outstanding messages per rank (window size)
. Vary the number of ranks per node

MPI vs. UPC 8KB Transactions

Figure: The same window-size/concurrency sweep, focusing on 8 KB transactions.

Application developer perspective:
. What level of concurrency should I use?
. What is the implication for intra-node communication?

MPI vs. UPC 2MB Transactions

Figure: The same window-size/concurrency sweep, focusing on 2 MB transactions.

Two-sided MPI vs. One-sided UPC


Figure: Percentage of peak bandwidth with window size = 1 for UPC shared, UPC non-shared, and MPI, with 8 B and 2 MB messages at 1, 4, and 24 processes per node.

The difference is NOT due to:
. UPC vs. MPI runtime optimization
. Registration overhead
. Copying

The difference is due to:
. Language semantics and the ability to exploit relaxed ordering

Conclusion

 One-sided communication exploits relaxed ordering with ease
. Communicable memory is easy to identify
• Annotated shared memory can be registered up front
. Relaxed vs. strict ordering is explicitly specified
• By the programmer (or the programming language); the default is relaxed
. Achieves a large percentage of peak performance for different communication patterns
 Two-sided communication faces the following challenges
. Strict matching between sends and receives (large startup overhead)
. Notion of locality (everything defaults to local)
• Expensive to prepare a buffer for communication (registration)
. Receiver ambiguity about transactions (probing or over-allocation)
. Performance may require high node concurrency (problematic for in-node communication)
