Recent Activities in Programming Models and Runtime Systems at ANL

Rajeev Thakur, Deputy Director, Mathematics and Computer Science Division, Argonne National Laboratory

Joint work with Pavan Balaji, Darius Buntinas, Jim Dinan, Dave Goodell, Bill Gropp, Rusty Lusk, Marc Snir

Programming Models and Runtime Systems Efforts at Argonne

. MPI and Low-level Runtime Systems
. Data Movement
. High-level Tasking and Global Address Programming Models
. Accelerators and Heterogeneous Computing Systems

2 MPI at Exascale

. We believe that MPI has a role to play even at exascale, particularly in the context of an MPI+X model
. The investment in application codes and libraries written using MPI is on the order of hundreds of millions of dollars
. Until a viable alternative to MPI for exascale is available, and applications have been ported to the new model, MPI must evolve to run as efficiently as possible on future systems
. Argonne has long-term strength in all things related to MPI

3 MPI and Low-level Runtime Systems

. MPICH implementation of MPI
– Scalability of MPI to very large systems
• Scalability improvements to the MPI implementation
• Scalability to large numbers of threads
– MPI-3 algorithms and interfaces
• Remote memory access
• Fault tolerance
– Research on features for future MPI standards
• Interaction with accelerators
• Interaction with other programming models
. Data movement models on complex processor and memory architectures
– Data movement for heterogeneous memory (accelerator memory, NVRAM)

4 Pavan Balaji wins DOE Early Career Award

. Pavan Balaji was recently awarded the highly competitive and prestigious DOE Early Career Award ($500K/yr for 5 years)
. “Exploring Efficient Data Movement Strategies for Exascale Systems with Deep Memory Hierarchies”

5 MPICH-based Implementations of MPI

. IBM MPI for the Blue Gene/Q
– IBM has successfully scaled the LAMMPS application to over 3 million MPI ranks and submitted a special Gordon Bell prize entry
• So it’s not true that MPI won’t scale!
– Will be used on Livermore (Sequoia), Argonne (Mira), and other major BG/Q installations
. Unified IBM implementation for BG and Power systems
. Cray MPI for XE/XK-6
– On Blue Waters, Oak Ridge (Jaguar, Titan), NERSC (Hopper), HLRS Stuttgart, and other major Cray installations
. Intel MPI for clusters
. Microsoft MPI
. Myricom MPI
. MVAPICH2 from Ohio State

6 MPICH Collaborators/Partners
. Core MPICH developers
– IBM
– INRIA
– Microsoft
– Intel
– University of Illinois
– University of British Columbia
. Derivative implementations
– Cray
– Myricom
– Ohio State University
. Other Collaborators
– Absoft
– Pacific Northwest National Laboratory
– QLogic
– Queen’s University, Canada
– Totalview Technologies
– University of Utah

7 Accelerators and Heterogeneous Computing Systems

. Accelerator-augmented data movement models
– Integrated data movement models that can move data from any memory region of a given process to any other memory region of another process
• E.g., move data from the accelerator memory of one process to that of another
– Internal runtime optimizations for efficient data movement
– External programming model improvements for improved productivity
. Virtualized accelerators
– Allow applications to (semi-)transparently utilize lightweight virtual accelerators instead of physical accelerators
– Backend improvements for performance, power, fault tolerance, etc.

8 Recent Work

1. MPI-ACC (ANL, NCSU, VT)
– Integrate awareness of accelerator memory in MPICH2
– Productivity and performance benefits
– To be presented at the HPCC 2012 conference, Liverpool, UK, June 2012

2. Virtual OpenCL (ANL, VT, SIAT CAS)
– OpenCL implementation that allows a program to use remote accelerators
– One-to-many model, better resource usage, load balancing, FT, …
– Published at CCGrid 2012, Ottawa, Canada, May 2012

3. Scioto-ACC (ANL, OSU, PNNL)
– Task-parallel programming model, scalable runtime system
– Co-scheduling of CPU and GPU, automatic data movement

9 Current MPI+GPU Programming

. MPI operates on data in host memory only
. Manual copies between host and GPU memory serialize the PCIe and interconnect transfers
– Can do better than this, but will incur protocol overheads multiple times
. Productivity: manual data movement
. Performance: inefficient, unless a large, non-portable investment is made in tuning
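A minimal sketch of this status quo, assuming CUDA and a contiguous buffer; the buffer names and the pinned staging buffer are illustrative, and error handling is omitted.

/* Status quo: the application stages GPU data through host memory by hand,
 * so the PCIe copy and the network send cannot overlap. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange(void *dev_buf, int bytes, int peer, int is_sender, MPI_Comm comm)
{
    void *host_buf;
    cudaMallocHost(&host_buf, bytes);              /* pinned staging buffer */

    if (is_sender) {
        /* The GPU-to-host copy must finish before MPI can touch the data. */
        cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, bytes, MPI_BYTE, peer, 0, comm);
    } else {
        MPI_Recv(host_buf, bytes, MPI_BYTE, peer, 0, comm, MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    }
    cudaFreeHost(host_buf);
}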

10 MPI-ACC Interface: Passing GPU Buffers to MPI

[Figure: UVA lookup overhead (CPU-CPU): latency (μs) versus message size for basic MPI and for MPI with an explicit parameter check, a datatype attribute check, or automatic detection. Accompanying diagram: OpenCL objects (CL_Context, CL_Mem, CL_Device_ID, CL_Cmd_queue) attached as attributes to an MPI datatype.]

. Unified Virtual Address (UVA) space
– Allows device pointers to be passed to MPI routines directly
– Currently supported only by CUDA and newer NVIDIA GPUs
– Query cost is high and is added to every operation (CPU-CPU)
. Explicit interface, e.g., MPI_CUDA_Send(…)
. MPI datatypes
– Compatible with MPI and many accelerator models
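The chart above measures the cost of this query. A minimal sketch of how an accelerator-aware MPI library might perform the automatic detection with the CUDA driver API follows; the helper name is an assumption and error handling is simplified.

#include <cuda.h>
#include <stdbool.h>
#include <stdint.h>

/* Under UVA, ask the CUDA driver whether a user buffer lives in device or
 * host memory. This is the per-operation query whose overhead the chart
 * above reports for CPU-CPU transfers. */
static bool is_device_pointer(const void *buf)
{
    CUmemorytype memtype = (CUmemorytype)0;
    CUresult rc = cuPointerGetAttribute(&memtype,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        (CUdeviceptr)(uintptr_t)buf);
    /* Plain host pointers that CUDA does not know about return an error;
     * treat them as host memory. */
    return (rc == CUDA_SUCCESS && memtype == CU_MEMORYTYPE_DEVICE);
}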

11 MPI-ACC: Integrated, Optimized Data Movement

. Use MPI for all data movement
– Support multiple accelerators and programming models (CUDA, OpenCL, …)
– Allow applications to portably leverage system-specific optimizations

. Inter-node data movement:
– Pipelining: fully utilize PCIe and network links
– GPUDirect (CUDA): multi-device pinning eliminates data copying
– Handle caching (OpenCL): avoid expensive command queue creation

. Intra-node data movement:
– Shared-memory protocol
• Sender and receiver drive independent DMA transfers
– Direct DMA protocol
• GPU-GPU DMA transfer (CUDA IPC)
– Both protocols are needed; PCIe limitations serialize DMA across I/O hubs
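A minimal sketch of the direct-DMA path using CUDA IPC; the handle exchange over MPI, the function names, and the flags are illustrative, and error handling is omitted.

#include <mpi.h>
#include <cuda_runtime.h>

/* Owner side: export an IPC handle for a device buffer and send it to a
 * peer process on the same node. */
void export_buffer(void *d_src, int peer, MPI_Comm comm)
{
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_src);
    MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, peer, 0, comm);
}

/* Peer side: map the remote device buffer and issue a GPU-GPU DMA copy. */
void import_and_copy(void *d_dst, size_t bytes, int owner, MPI_Comm comm)
{
    cudaIpcMemHandle_t handle;
    void *d_remote;
    MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, owner, 0, comm,
             MPI_STATUS_IGNORE);
    cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(d_dst, d_remote, bytes, cudaMemcpyDeviceToDevice); /* GPU-GPU DMA */
    cudaIpcCloseMemHandle(d_remote);
}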

12 Integrated Support for User-Defined Datatypes

[Figure: packing time (μs) versus packed buffer size (bytes) for a 4D subarray, comparing the GPU packing kernel with manual CUDA packing.]
MPI_Send(buffer, count, datatype, to, …)
MPI_Recv(buffer, count, datatype, from, …)

. What if the datatype is noncontiguous?
. CUDA doesn’t support arbitrary noncontiguous transfers
. Pack data on the GPU
– Flatten the datatype tree representation
– Packing kernel that can saturate memory bus/banks
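For concreteness, the user-visible side is a standard MPI datatype; the extents below are illustrative. An accelerator-aware implementation can then pack such a type with a GPU kernel instead of requiring the user to flatten it by hand.

#include <mpi.h>

/* Describe a noncontiguous 4D subarray once; it can then be used directly as
 * MPI_Send(buf, 1, subarray, dest, tag, comm). */
MPI_Datatype make_4d_subarray(void)
{
    int sizes[4]    = {64, 64, 64, 64};   /* full array extents        */
    int subsizes[4] = {16, 16, 16, 16};   /* region to communicate     */
    int starts[4]   = { 8,  8,  8,  8};   /* offset of that region     */
    MPI_Datatype subarray;

    MPI_Type_create_subarray(4, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subarray);
    MPI_Type_commit(&subarray);
    return subarray;
}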

13 Inter-node GPU-GPU Latency

[Figure: inter-node GPU-GPU latency (μs) versus message size for MPI-ACC, manual blocking transfers, manual non-blocking (pipelined) transfers, and host-to-host MPI send/receive (CPU-only lower bound).]

. Pipelining (PCIe + IB) pays off for large messages
– roughly 2/3 of the latency of manual data movement

14 GPUs as a Service: Virtual OpenCL

. Clusters and cloud systems provide GPUs on a subset of nodes
. OpenCL provides a powerful abstraction
– Kernels are compiled on the fly, i.e., at the device
– Enables transparent virtualization, even across different devices
. Support GPUs as a service
– One-to-many: one process can drive many GPUs
– Resource utilization: share a GPU across applications, use hybrid nodes
– System management, fault tolerance: transparent migration

15 Virtual OpenCL (VOCL) Framework Components

[Figure: VOCL framework. On the local node, the application calls the OpenCL API through the VOCL library, which exposes virtual GPUs (VGPUs); a proxy process on each remote node invokes the native OpenCL library on the physical GPUs.]

. VOCL library (local) and proxy process (system service)
. API- and ABI-compatible with OpenCL – transparent to the application
. Utilize both local and remote GPUs
– Local GPUs: call native OpenCL functions
– Remote GPUs: the runtime uses MPI to forward function calls to the proxy
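A hypothetical sketch of the forwarding pattern, not the actual VOCL implementation: the library intercepts an OpenCL call, packs its arguments into a message, and ships it to the proxy over MPI, which invokes the native function and returns the result. The message layout, tags, and vocl_* names are assumptions for illustration only.

#include <mpi.h>
#include <CL/cl.h>

enum { VOCL_TAG_CALL = 100, VOCL_TAG_RET = 101 };   /* illustrative tags */

struct vocl_get_device_ids_args {
    cl_device_type device_type;
    cl_uint        num_entries;
};

/* Client side: forward clGetDeviceIDs to the proxy instead of calling the
 * native library locally. Device handles stay on the proxy; the client only
 * sees opaque identifiers. */
cl_int vocl_get_device_ids(MPI_Comm proxy_comm, int proxy_rank,
                           cl_device_type type, cl_uint n, cl_uint *num_found)
{
    struct vocl_get_device_ids_args args = { type, n };
    cl_int err;

    MPI_Send(&args, (int)sizeof(args), MPI_BYTE, proxy_rank,
             VOCL_TAG_CALL, proxy_comm);
    MPI_Recv(num_found, 1, MPI_UNSIGNED, proxy_rank, VOCL_TAG_RET,
             proxy_comm, MPI_STATUS_IGNORE);
    MPI_Recv(&err, 1, MPI_INT, proxy_rank, VOCL_TAG_RET,
             proxy_comm, MPI_STATUS_IGNORE);
    return err;
}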

16 High-level Tasking and Global Address Programming Models

. Led by Jim Dinan
. High-level tasking execution model for performing computations in dynamic environments
– Programming model interface and implementation
– Multiple programming model implementations for this tasking model to work with MPI, PGAS models, etc.
. Interoperable global address space and tasking models
– Unified runtime infrastructure
. Benchmarking
– Unbalanced Tree Search family of benchmarks

17 Scioto-ACC: Accelerated, Scalable Collections of Task Objects

[Figure: execution model. Processes 0 … n run SPMD code, enter a collective task-parallel phase, and resume SPMD execution after termination.]

. Express computation as a collection of tasks
– Tasks operate on data in a global address space (GA, MPI-RMA, …)
– Executed in collective task-parallel phases
. The Scioto runtime system manages task execution
– Load balancing, locality optimization, fault resilience
– Mapping to accelerator/CPU, data movement
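A hypothetical interface sketch, not the actual Scioto API, illustrating the model: processes seed a shared task collection during an SPMD phase and then enter a collective task-parallel phase in which the runtime load-balances execution. The task_pool_* names, task_fn_t, and the usage below are invented for illustration.

#include <mpi.h>

typedef void (*task_fn_t)(const void *args);    /* task body */
typedef struct task_pool task_pool_t;           /* opaque runtime object */

task_pool_t *task_pool_create(MPI_Comm comm);
void task_pool_add(task_pool_t *pool, task_fn_t fn, const void *args, int nbytes);
void task_pool_process(task_pool_t *pool);      /* collective; returns at termination */
void task_pool_destroy(task_pool_t *pool);

/* Typical usage: each process seeds its own tasks, then all processes
 * process the pool collectively; idle ranks may steal tasks. */
static void block_multiply(const void *args)
{
    (void)args;  /* fetch operand blocks via Get, compute, Put the result */
}

void run_phase(MPI_Comm comm, int my_first, int my_last)
{
    task_pool_t *pool = task_pool_create(comm);
    for (int b = my_first; b < my_last; b++)
        task_pool_add(pool, block_multiply, &b, (int)sizeof b);
    task_pool_process(pool);
    task_pool_destroy(pool);
}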

Led by Jim Dinan

18 Other Recent Papers

. “Scalable Distributed Consensus to Support MPI Fault Tolerance”
– Darius Buntinas, IPDPS 2012, Shanghai, China, May 2012
– Scalable algorithm for a set of MPI processes to collectively determine which subset of processes among them have failed

. “Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication”
– James Dinan, Pavan Balaji, Jeff Hammond, Sriram Krishnamoorthy, Vinod Tipparaju, IPDPS 2012, Shanghai, China, May 2012
– Implements the ARMCI interface over MPI one-sided communication

. “Efficient Multithreaded Context ID Allocation in MPI”
– James Dinan, David Goodell, William Gropp, Rajeev Thakur, Pavan Balaji, submitted to EuroMPI 2012
– Clever algorithm for multithreaded context ID generation for MPI_Comm_create_group (a new function in MPI-3)

19 Non-Collective Communicator Creation

. Create a communicator collectively only on the new members

. Global Arrays process groups
– Past: collectives implemented using MPI Send/Recv

. Overhead reduction
– Multi-level parallelism
– Small communicators when the parent is large

. Recovery from failures
– Not all ranks in the parent can participate

. Load balancing

20 Non-Collective Communicator Creation Algorithm (using MPI 2.2 functionality)

[Figure: starting from MPI_COMM_SELF, subgroups of ranks 0–7 are repeatedly joined using MPI_Intercomm_create (producing an intercommunicator) and MPI_Intercomm_merge (producing an intracommunicator) until one communicator spans the full group.]
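A minimal sketch of the recursive construction pictured above, using only MPI-2.2 functionality. It assumes every member passes the same ordered list of its ranks in the parent communicator and that non-members do not call the routine; error handling is omitted.

#include <mpi.h>

MPI_Comm group_comm_create(MPI_Comm parent, const int members[], int n)
{
    int prank, me = -1;
    MPI_Comm comm;

    MPI_Comm_rank(parent, &prank);
    for (int i = 0; i < n; i++)
        if (members[i] == prank) me = i;            /* my index within the group */

    MPI_Comm_dup(MPI_COMM_SELF, &comm);             /* start from a singleton */

    /* log(n) rounds: merge each right subgroup of size s into its left neighbor. */
    for (int s = 1; s < n; s *= 2) {
        int left  = (me / (2 * s)) * (2 * s);       /* leader index, left subgroup  */
        int right = left + s;                       /* leader index, right subgroup */
        MPI_Comm inter, merged;

        if (right >= n)                             /* no partner this round */
            continue;

        if (me < right) {                           /* I am in the left subgroup  */
            MPI_Intercomm_create(comm, 0, parent, members[right], s, &inter);
            MPI_Intercomm_merge(inter, 0, &merged); /* keep left subgroup first   */
        } else {                                    /* I am in the right subgroup */
            MPI_Intercomm_create(comm, 0, parent, members[left], s, &inter);
            MPI_Intercomm_merge(inter, 1, &merged);
        }
        MPI_Comm_free(&inter);
        MPI_Comm_free(&comm);
        comm = merged;
    }
    return comm;
}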

21 Non-Collective Algorithm in Detail

[Figure: algorithm detail. Translate the group ranks to an ordered list of ranks on the parent communicator, calculate my group ID, and recursively combine the left and right groups.]

22 Evaluation: Microbenchmark

[Figure: communicator creation time (msec) versus group size (1–4096 processes) for the group-collective algorithm and MPI_Comm_create; annotations indicate O(log g) and O(log p) scaling, respectively.]

23 New MPI-3 Function MPI_Comm_create_group

MPI_Comm_create_group(MPI_Comm in, MPI_Group grp, int tag, MPI_Comm *out)
– Collective only over the processes in “grp”
– The tag allows for safe communication and distinction between calls

. Eliminate the O(log g) factor by a direct implementation
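A usage sketch of the new routine; the rank list and tag are illustrative, and only the listed ranks make the call.

#include <mpi.h>

MPI_Comm make_subset_comm(const int ranks[], int n, int tag)
{
    MPI_Group world_grp, sub_grp;
    MPI_Comm subcomm = MPI_COMM_NULL;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Group_incl(world_grp, n, ranks, &sub_grp);

    /* Collective only over the processes in sub_grp. */
    MPI_Comm_create_group(MPI_COMM_WORLD, sub_grp, tag, &subcomm);

    MPI_Group_free(&sub_grp);
    MPI_Group_free(&world_grp);
    return subcomm;
}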

24 Deadlock scenario in the existing multithreaded context ID generation algorithm in MPICH2

25 Avoiding the deadlock

. Simple solution: add a barrier on entry to the algorithm
. However, it is additional synchronization overhead

. Better solution
– Use the synchronization to attempt to acquire a context ID
– Partition the context ID space into two segments: eager and base
– Instead of the barrier, do an Allreduce on the eager segment
– If eager allocation fails, the Allreduce acts like a barrier, and the context ID is acquired in the second call to Allreduce
– If eager allocation succeeds (which in most cases it will), the overhead of an additional barrier is avoided

[Figure: the context ID space is partitioned into an eager segment and a base segment.]

26 Performance

27 Global Arrays over MPI One sided

28 Global Arrays, a Global-View Data Model

[Figure: a shared multidimensional array X[M][M][N] spans a global address space distributed across Proc0 … Procn; each process also retains private memory, and any process can address a section such as X[1..9][1..9] in the shared space.]

. Distributed, shared multidimensional arrays
– Aggregate the memory of multiple nodes into a global data space
– The programmer controls data distribution and can exploit locality
. One-sided data access: Get/Put({i, j, k}…{i’, j’, k’})
. NWChem data management: large coefficient tables (100 GB+)
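A hedged sketch of this global-view access pattern using the Global Arrays C interface as commonly documented (it assumes MPI and GA are already initialized); the array dimensions and the accessed patch are illustrative.

#include <ga.h>
#include <macdecls.h>

void ga_example(void)
{
    int dims[2]  = {1024, 1024};
    int chunk[2] = {-1, -1};                 /* let GA choose the distribution */
    int lo[2] = {0, 0}, hi[2] = {9, 9};      /* 10x10 patch, inclusive bounds  */
    double patch[10][10];
    int ld = 10;                             /* leading dimension of the local buffer */

    int g_a = NGA_Create(C_DBL, 2, dims, "X", chunk);

    NGA_Get(g_a, lo, hi, patch, &ld);        /* one-sided read of the patch   */
    NGA_Put(g_a, lo, hi, patch, &ld);        /* one-sided write, no receiver  */
    GA_Sync();                               /* complete outstanding operations */

    GA_Destroy(g_a);
}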

29 ARMCI: The Aggregate Remote Memory Copy Interface

. GA runtime system
– Manages shared memory
– Provides portability
– Natively implemented
. One-sided communication
– Get, put, accumulate, …
– Load/store on local data
– Noncontiguous operations
. Mutexes, atomics, collectives, processor groups, …
. Location-consistent data access
– I see my operations in issue order
[Figure: a GA_Put({x,y},{x’,y’}) on the global array is translated into ARMCI_PutS(rank, addr, …) operations on the processes (0–3) that own the target patch.]

30 Implementing ARMCI

. ARMCI support
– Natively implemented
– Sparse vendor support
– Implementations lag systems
. MPI is ubiquitous
– Has supported one-sided communication for 15 years
. Goal: use MPI RMA to implement ARMCI
1. Portable one-sided communication for NWChem users
2. MPI-2: drive implementation performance, one-sided tools
3. MPI-3: motivate features
4. Interoperability: increase the resources available to the application
• ARMCI and MPI share progress, buffer pinning, network and host resources
. Challenge: mismatch between MPI RMA and ARMCI
[Figure: two software stacks. Native: NWChem → Global Arrays → ARMCI → native network interface. ARMCI-MPI: NWChem → Global Arrays → ARMCI → MPI → native network interface.]
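A hedged sketch of the core translation idea, not the actual ARMCI-MPI code: an ARMCI-style contiguous put is mapped onto an MPI-2 passive-target epoch. The lookup_window helper is hypothetical and stands in for the bookkeeping that maps an absolute rank and remote address to a window and displacement (the mismatch detailed on the next slide); rank translation between the parent communicator and the window is also omitted.

#include <mpi.h>

/* Hypothetical lookup: which window backs `remote_addr` on `rank`, and at
 * what displacement? (Maintained when shared segments are allocated.) */
int lookup_window(void *remote_addr, int rank, MPI_Win *win, MPI_Aint *disp);

int armci_put_sketch(void *src, void *remote_addr, int bytes, int rank)
{
    MPI_Win win;
    MPI_Aint disp;

    if (!lookup_window(remote_addr, rank, &win, &disp))
        return -1;

    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);     /* passive-target epoch */
    MPI_Put(src, bytes, MPI_BYTE, rank, disp, bytes, MPI_BYTE, win);
    MPI_Win_unlock(rank, win);                       /* completes the put    */
    return 0;
}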

31 ARMCI/MPI-RMA Mismatch

1. Shared data segment representation
– MPI: ⟨window, displacement⟩, window rank
– ARMCI: ⟨address⟩, absolute rank
→ Perform translation
2. Data consistency model
– ARMCI: relaxed; location consistent for RMA; concurrent conflicting accesses (CCA) undefined
– MPI: explicit synchronization (lock and unlock); CCA is an error
→ Explicitly maintain consistency, avoid CCA

32 NWChem Performance (XE6)

[Figure: Cray XE6. CCSD and (T) times (minutes) versus number of cores (744–5952) for ARMCI-MPI and ARMCI-Native.]

33 NWChem Performance (BG/P)

[Figure: Blue Gene/P. CCSD time (minutes) versus number of nodes (up to 1024) for ARMCI-MPI and ARMCI-Native.]

34 An idea and a proposal in need of funding

[Figure: proposed software architecture. Legend: old library codes, old MPI code, application codes, new programming models, new codes (miniapps), technology leveraged, and major new components. DSL-to-programming-model translation via Eclipse refactoring plug-ins and ROSE translators; programming-model-to-DDM translation via ROSE and Orio.]

[Figure, continued: Swift (TAU), DPLASMA, Exposé, DDM-to-DEE translation (DAGuE), cost models, a DXS library and DXS runtime (Charm++, DAGuE, Nemesis), Pbound, low-level system services (power management, resilience), a commercial compiler, and instrumented hardware (real and emulated). Partners: ANL, UIUC, UTexas, UTK, UOregon, LLNL.]

36