Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime: An MVAPICH2-X Approach

Talk at OSC theater (SC ’15) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1–P3 and their memories — the Shared Memory Model (e.g. DSM) with a logical shared memory, the Distributed Memory Model (e.g. MPI, Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (e.g. UPC, Chapel, X10, CAF, …).]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Additionally, OpenMP can be used to parallelize computation within the node
• Each model has strengths and drawbacks and suits different problems or applications

Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays

OpenSHMEM: PGAS library

• SHMEM implementations
  – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API, across versions – example:

                   SGI SHMEM       Quadrics SHMEM    Cray SHMEM
  Initialization   start_pes(0)    shmem_init        start_pes
  Process ID       _my_pe          my_pe             shmem_my_pe

• Made application codes non-portable
• OpenSHMEM is an effort to address this:

  “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0, by University of Houston and Oak Ridge National Lab
• SGI SHMEM is the baseline
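A minimal OpenSHMEM 1.0-style sketch of the consolidated API (illustrative only; the buffer name and neighbor pattern are arbitrary, and the v1.0 routine names start_pes/_my_pe/_num_pes are used since SGI SHMEM is the baseline):

  /* Symmetric allocation plus a one-sided put, OpenSHMEM 1.0 style. */
  #include <stdio.h>
  #include <shmem.h>

  int main(void) {
      start_pes(0);                               /* portable initialization in v1.0 */
      int me   = _my_pe();
      int npes = _num_pes();

      int *dst = (int *) shmalloc(sizeof(int));   /* symmetric heap: same object on every PE */
      *dst = -1;

      int peer = (me + 1) % npes;
      shmem_int_put(dst, &me, 1, peer);           /* one-sided put into the right neighbor */
      shmem_barrier_all();                        /* ensure delivery before reading */

      printf("PE %d of %d received %d\n", me, npes, *dst);
      shfree(dst);
      return 0;
  }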

Unified Parallel C: PGAS language
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  – Introduction to UPC and Language Specification, 1999
  – UPC Language Specifications, v1.0, Feb 2001
  – UPC Language Specifications, v1.1.1, Sep 2004
  – UPC Language Specifications, v1.2, June 2005
  – UPC Language Specifications, v1.3, Nov 2013
• UPC Consortium
  – Academic institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland…
  – Government institutions: ARSC, IDA, LBNL, SNL, US DOE…
  – Commercial institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …
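For flavor, a minimal hypothetical UPC fragment (not taken from any specification document): a cyclically distributed shared array updated through an affinity-driven upc_forall loop.

  /* Classic vector-add pattern in UPC; compile with a UPC compiler such as
   * Berkeley UPC or GCC UPC. */
  #include <upc_relaxed.h>
  #include <stdio.h>

  #define N (100 * THREADS)         /* THREADS appears once: legal in dynamic-threads mode */

  shared int v1[N], v2[N], vsum[N]; /* default block size 1: elements laid out round-robin */

  int main(void) {
      /* Each thread executes only the iterations whose affinity matches it
       * (affinity expression i means i % THREADS == MYTHREAD). */
      upc_forall (int i = 0; i < N; i++; i) {
          v1[i]   = i;
          v2[i]   = 2 * i;
          vsum[i] = v1[i] + v2[i];
      }
      upc_barrier;

      if (MYTHREAD == 0)
          printf("vsum[N-1] = %d with %d threads\n", vsum[N-1], THREADS);
      return 0;
  }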

Co-array Fortran (CAF): Language-level PGAS support in Fortran

• An extension to Fortran to support global shared arrays in parallel Fortran applications
• CAF = CAF compiler + CAF runtime (libcaf)

• Coarray syntax and basic synchronization support in Fortran 2008, e.g.:
    real :: a(n)[*]
    x(:)[q] = x(:) + x(:)[p]
    name[i] = name
    sync all
• Collective communication and atomic operations in the upcoming Fortran 2015, e.g.:
    ATOMIC_ADD
    ATOMIC_FETCH_ADD
    ...

MPI+PGAS for Exascale Architectures and Applications

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared-memory programming models

• Applications can have kernels with different communication patterns
• Can benefit from different models

• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead

MVAPICH2 Software
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,475 organizations in 76 countries
  – More than 308,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters (June ’15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (8th in Jun ’15, 519,640 cores, 5.168 PFlops)

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (a minimal sketch follows the reference below)

[Figure: an HPC application whose kernels are individually mapped to models — Kernel 1 (MPI), Kernel 2 (PGAS + MPI), Kernel 3 (MPI), …, Kernel N (PGAS + MPI).]

• Benefits:
  – Best of Distributed Computing Model (MPI)
  – Best of Shared Memory Computing Model (PGAS)
• Exascale Roadmap*:
  – “Hybrid Programming is a practical way to program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
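To make the kernel-by-kernel idea concrete, here is a minimal, hypothetical sketch of a hybrid MPI+OpenSHMEM program in the style MVAPICH2-X supports; the buffer names, the neighbor pattern, and the particular mix of calls are illustrative, and it assumes the unified runtime keeps MPI ranks and OpenSHMEM PEs consistent.

  /* Kernel A uses an MPI collective for a regular exchange; kernel B uses a
   * one-sided OpenSHMEM put for an irregular access.  Both run in one process. */
  #include <mpi.h>
  #include <shmem.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      start_pes(0);                        /* OpenSHMEM 1.0 initialization */

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int *sym = (int *) shmalloc(sizeof(int));   /* symmetric heap allocation */
      *sym = rank;

      /* Kernel A: regular, collective communication expressed in MPI. */
      int sum = 0;
      MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      /* Kernel B: irregular, one-sided access expressed in OpenSHMEM. */
      int peer = (rank + 1) % size;
      shmem_int_put(sym, &sum, 1, peer);   /* deposit the result at the neighbor */
      shmem_barrier_all();

      shfree(sym);
      MPI_Finalize();
      return 0;
  }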

MVAPICH2-X for Hybrid MPI + PGAS Applications

[Architecture: MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI + PGAS) applications issue OpenSHMEM, CAF, UPC, and MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE, and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF, available with MVAPICH2-X 1.9 (2012) onwards!
  – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)
  – Scalable inter-node and intra-node communication – point-to-point and collectives

OpenSHMEM Collective Communication: Performance

[Charts: collective latency in microseconds at 1,024 processes for MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM — Reduce and Broadcast vs. message size (12X and 18X improvements), Collect vs. message size (180X improvement), and Barrier vs. number of processes (3X improvement).]

Hybrid MPI+UPC NAS-FT

[Chart: NAS FT execution time in seconds for UPC (GASNet), UPC (MV2-X), and MPI+UPC (MV2-X) across problem-size/system-size combinations B-64, C-64, B-128, and C-128.]

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall (sketched after the reference below)
• Truly hybrid program
• For FT (Class C, 128 processes)
  – 34% improvement over UPC (GASNet)
  – 30% improvement over UPC (MV2-X)

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS ’10), October 2010
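A hedged sketch of the flavor of that modification (buffer sizes and contents are illustrative, not the actual NAS FT code; it assumes the unified runtime keeps UPC threads and MPI ranks aligned, as in MVAPICH2-X):

  /* The dense all-to-all transpose is expressed as one MPI collective instead
   * of THREADS separate UPC bulk copies; the rest of the kernel stays in UPC. */
  #include <upc.h>
  #include <mpi.h>
  #include <stdlib.h>

  #define CHUNK 1024                       /* doubles exchanged with each peer (illustrative) */

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      /* Private per-thread staging buffers for the exchange. */
      double *sendbuf = malloc((size_t) CHUNK * THREADS * sizeof(double));
      double *recvbuf = malloc((size_t) CHUNK * THREADS * sizeof(double));
      for (int i = 0; i < CHUNK * THREADS; i++)
          sendbuf[i] = MYTHREAD + 0.001 * i;

      /* One MPI_Alltoall replaces a loop of upc_memget/upc_memput operations. */
      MPI_Alltoall(sendbuf, CHUNK, MPI_DOUBLE,
                   recvbuf, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);

      upc_barrier;                         /* hand back to the UPC part of the kernel */

      free(sendbuf);
      free(recvbuf);
      MPI_Finalize();
      return 0;
  }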

CAF One-sided Communication: Performance

[Charts: Get and Put bandwidth (MB/s) and latency (ms) vs. message size for GASNet-IBV, GASNet-MPI, and MV2X, with annotated MV2X gains of 3.5X in Put bandwidth and 29% in Put latency; and NAS-CAF class D execution time (sec) on 256 processes for bt, cg, ep, ft, mg, and sp, with 12% and 18% reductions.]

• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test-suite)
  – Put bandwidth: 3.5X improvement at 4KB; Put latency: reduced by 29% at 4 bytes
• Application performance improvement (NAS-CAF one-sided implementation)
  – Execution time reduced by 12% (SP.D.256) and 18% (BT.D.256)

J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran support with MVAPICH2-X: Initial experience and evaluation, HIPS’15

Graph500 – BFS Traversal Time

[Charts: BFS traversal time in seconds for MPI-Simple, MPI-CSC, MPI-CSR, and the Hybrid design at 4K–16K processes, and strong-scaling traversed edges per second at 1,024–8,192 processes, annotated with 13X and 7.6X improvements for the Hybrid design.]

• Hybrid design performs better than the MPI implementations
• At 16,384 processes:
  – 1.5X improvement over MPI-CSR
  – 13X improvement over MPI-Simple (same communication characteristics)
• Strong scaling, Graph500 problem scale = 29

Hybrid MPI+OpenSHMEM Sort Application

[Charts: execution time in seconds for MPI, Hybrid-SR, and Hybrid-ER with 1TB input on 1,024–8,192 processes (45% improvement), and weak-scaling sort rate in TB/min with 1TB per 512 cores on 512–4,096 processes (38% improvement).]

• Performance of the Hybrid (MPI+OpenSHMEM) Sort Application
• Execution time (seconds) – 1TB input at 8,192 cores: MPI – 164, Hybrid-SR (Simple Read) – 92.5, Hybrid-ER (Eager Read) – 90.36
  – 45% improvement over the MPI-based design
• Weak scalability (input size of 1TB per 512 cores) – at 4,096 cores: MPI – 0.25 TB/min, Hybrid-SR – 0.34 TB/min, Hybrid-ER – 0.38 TB/min
  – 38% improvement over the MPI-based design

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC ’13), June 2013

Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

• MaTEx: an MPI-based machine-learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped data flow; one-sided data transfer; circular-buffer structure

[Charts: k-NN execution time for the truncated KDD dataset (30MB) on 256 cores and the full KDD dataset (2.5GB) on 512 cores, showing 27.6% and 9.0% reductions with the hybrid design.]

• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda. Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015

CUDA-Aware Concept

• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect RDMA support
  – GPU-to-GPU direct transfer

  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks

[Diagram: CPU and GPU attached through the chipset to an InfiniBand adapter, with GPU memory accessed directly for GPUDirect RDMA transfers.]

Limitations of OpenSHMEM for GPU Computing

• OpenSHMEM memory model does not support disjoint memory address spaces – the case with GPU clusters

Existing OpenSHMEM model with CUDA (GPU-to-GPU data movement):

PE 0:
  host_buf = shmalloc (…);
  cudaMemcpy (host_buf, dev_buf, size, …);      /* stage device data in host memory */
  shmem_putmem (host_buf, host_buf, size, pe);
  shmem_barrier (…);

PE 1:
  host_buf = shmalloc (…);
  shmem_barrier (…);
  cudaMemcpy (dev_buf, host_buf, size, …);      /* copy received data back to the device */

• Copies severely limit the performance
• Synchronization negates the benefits of one-sided communication
• Similar issues with UPC

Global Address Space with Host and Device Memory

[Diagram: on each PE, both host memory and device memory contain a private region and a shared region; the shared (symmetric) space can be placed on host memory or on device memory.]

• Extended APIs:
  – heap_on_device / heap_on_host – a way to indicate the location of the heap
  – host_buf = shmalloc (sizeof(int), 0);
  – dev_buf = shmalloc (sizeof(int), 1);

CUDA-Aware OpenSHMEM (same design for UPC):

PE 0:
  dev_buf = shmalloc (size, 1);
  shmem_putmem (dev_buf, dev_buf, size, pe);

PE 1:
  dev_buf = shmalloc (size, 1);

S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, IPDPS ’13
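A fuller sketch of how the extended interface might be used for a GPU-to-GPU exchange (hypothetical: the two-argument shmalloc is the extension proposed above, not standard OpenSHMEM, so its prototype differs from the standard header; buffer size and neighbor pattern are illustrative):

  /* The second shmalloc argument selects the heap (0 = host, 1 = device),
   * following the extension described on the slide above. */
  #include <shmem.h>
  #include <cuda_runtime.h>

  #define NBYTES (1 << 20)

  int main(void) {
      start_pes(0);
      int me   = _my_pe();
      int npes = _num_pes();
      int peer = (me + 1) % npes;

      char *dev_buf = (char *) shmalloc(NBYTES, 1);  /* symmetric buffer in device memory */
      cudaMemset(dev_buf, me, NBYTES);               /* a GPU kernel could produce this data */

      /* One-sided put straight from device memory to the peer's device memory;
       * the CUDA-aware runtime stages, uses IPC, or uses GPUDirect RDMA internally. */
      shmem_putmem(dev_buf, dev_buf, NBYTES, peer);
      shmem_barrier_all();

      shfree(dev_buf);
      return 0;
  }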

CUDA-aware OpenSHMEM and UPC runtimes

• After device memory becomes part of the global shared space:
  – Accessible through standard UPC/OpenSHMEM communication APIs
  – Data movement transparently handled by the runtime
  – Preserves one-sided semantics at the application level
• Efficient designs to handle communication
  – Inter-node transfers use host-staged transfers with pipelining
  – Intra-node transfers use CUDA IPC
  – Possibility to take advantage of GPUDirect RDMA (GDR)

• Goal: enabling high-performance one-sided communication semantics with GPU devices


Exploiting GDR: OpenSHMEM Inter-node Evaluation

[Charts: D-D shmem_put latency for small (1B–4KB) and large (8KB–2MB) messages, and D-D shmem_get latency for small messages, comparing the Host-Pipeline and GDR designs; 7X and 9X improvements for small-message Put and Get.]

• GDR for small/medium message sizes
• Host-staging for large messages to avoid PCIe bottlenecks
• Hybrid design brings the best of both
• 3.13 us Put latency for 4B (7X improvement) and 4.7 us latency for 4KB
• 9X improvement for Get latency of 4B

Application Evaluation: GPULBM and 2DStencil

[Charts: weak scaling on 8–64 GPU nodes for the Host-Pipeline and Enhanced-GDR designs — GPULBM (64x64x64) evolution time in msec and 2DStencil (2Kx2K) execution time in sec, with 19% and 45% improvements at 64 nodes.]

• Redesigned the applications from CUDA-Aware MPI (Send/Recv) to hybrid CUDA-Aware MPI+OpenSHMEM
  – cudaMalloc => shmalloc(size, 1);
  – MPI_Send/Recv => shmem_put + fence
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• New designs achieve 20% and 19% improvements on 32 and 64 GPU nodes (GPULBM), and 53% and 45% improvements (2DStencil)
• Degradation is due to the small input size

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C.-H. Chu and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015

Concluding Remarks

• Presented an overview of hybrid MPI+PGAS models
• Outlined research challenges in designing efficient runtimes for these models on InfiniBand clusters
• Demonstrated the benefits of hybrid MPI+PGAS models for a set of applications on host-based systems
• Presented the model extensions and runtime support for an accelerator-aware hybrid MPI+PGAS model
• The hybrid MPI+PGAS model is an emerging paradigm that can lead to high-performance and scalable implementations of applications on exascale computing systems with accelerators

Funding Acknowledgments

Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students:
  – A. Augustine (M.S.), A. Awan (Ph.D.), A. Bhat (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), K. Kulkarni (M.S.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Shankar
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold

Past Students:
  – P. Balaji (Ph.D.), S. Bhagvat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Post-Docs: X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Research Scientist: S. Sur
Past Programmers: D. Bureddy

Thank You!
[email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
