High-Performance MPI and Deep Learning on OpenPOWER Platform

Talk at OpenPOWER SUMMIT (Mar ‘18) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Increasing Usage of HPC, Big Data and Deep Learning

HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the cloud!

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1-P3 and per-process memory — Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI), and Partitioned Global Address Space (PGAS) with a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...)]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (a minimal sketch follows below)
• Benefits:
  – Best of the distributed memory computing model (MPI)
  – Best of the shared memory computing model (PGAS)

[Figure: an HPC application whose sub-kernels (Kernel 1 ... Kernel N) each use MPI or PGAS as appropriate]
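To make the hybrid model concrete, the sketch below mixes MPI collectives with OpenSHMEM one-sided communication in a single program. It is a minimal illustration only, assuming a unified-runtime library such as MVAPICH2-X; the buffer name and neighbor exchange are made up for the example, and the exact initialization/finalization ordering rules are defined by the library's user guide.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);      /* MPI portion of the hybrid program */
        shmem_init();                /* OpenSHMEM portion (shared, unified runtime) */

        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric heap allocation, remotely accessible by all PEs */
        long *win = (long *) shmem_malloc(sizeof(long));
        *win = me;
        shmem_barrier_all();

        /* One-sided OpenSHMEM get from the left neighbor */
        long left = 0;
        shmem_long_get(&left, win, 1, (me + npes - 1) % npes);

        /* MPI collective over the same set of processes */
        long sum = 0;
        MPI_Allreduce(&left, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (me == 0) printf("sum of neighbor ranks = %ld\n", sum);

        shmem_free(win);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }

With MVAPICH2-X such a program would typically be compiled with the library's OpenSHMEM wrapper (e.g., oshcc) and launched with mpirun_rsh, so that both models share a single communication runtime.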

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: middleware stack with co-design opportunities and challenges across the various layers]
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Hardware: networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi-/many-core architectures, accelerators (GPU and FPGA)
• Cross-cutting goals: performance, scalability, resilience

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Support for Big Data (Hadoop and Spark) on OpenPOWER (Talk at 2:15 pm)

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 85 countries
  – More than 455,000 (> 0.45 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core Oakforest-PACS in Japan
    • 12th, 368,928-core Stampede2 at TACC
    • 17th, 241,108-core Pleiades at NASA
    • 48th, 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors, Linux distros (RedHat and SuSE), and OpenHPC
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC (being updated with KNL-based designs)

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site (2004-2018), annotated with major releases from MV 0.9.0 through MV2 1.x, MV2 2.3b/2.3rc1, MV2-X 2.3b, MV2-GDR 2.3a, MV2-Virt 2.2, MV2-MIC 2.0, and OSU INAM 0.9.3]

Architecture of MVAPICH2 Software Family

High-Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High-Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point primitives, collectives algorithms, job startup, remote memory access, energy-awareness, I/O and file systems, fault tolerance, virtualization, active messages, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC
• Modern networking features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM*
• Modern architecture features: MCDRAM*, NVLink, CAPI*

* Upcoming

OpenPOWER Platform Support in MVAPICH2 Libraries

• MVAPICH2
  – Basic MPI support
    • since MVAPICH2 2.2rc1 (March 2016)
• MVAPICH2-X
  – PGAS (OpenSHMEM and UPC) and hybrid MPI+PGAS support
    • since MVAPICH2-X 2.2b (March 2016)
  – Advanced collective support with CMA
    • since MVAPICH2-X 2.3b (Oct 2017)
• MVAPICH2-GDR
  – NVIDIA GPGPU support with NVLink (CORAL-like systems)
    • since MVAPICH2-GDR 2.3a (Nov 2017)

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Intra-node Point-to-Point Performance on OpenPower

[Figure: intra-socket small/large-message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; MVAPICH2 delivers about 0.30 us small-message latency]

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Inter-node Point-to-Point Performance on OpenPower

[Figure: inter-node small/large-message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Multi-Pair Latency Comparison

[Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair latency comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Basic point-to-point multi-pair latency comparison
  – Up to 48% and 45% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Multi-pair Bandwidth Comparison

[Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair bandwidth comparing MV2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MV2-XPMEM]

• Basic point-to-point multi-pair bandwidth comparison
  – Up to 83% and 62% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Process Mapping Support in MVAPICH2 for Multi-Core Platforms

Process-mapping support in MVAPICH2 (available since v1.4)

• Preset binding policies
  – Policy: bunch (default), scatter, hybrid
  – Granularity: core (default), socket, numanode
• User-defined binding
  – MPI rank-to-core binding
• MVAPICH2 detects the processor architecture at job launch

Process and Thread Binding Policies in Hybrid MPI+Threads

• A new process binding policy – "hybrid"
  – MV2_CPU_BINDING_POLICY = hybrid
• A new environment variable for co-locating threads with MPI processes
  – MV2_THREADS_PER_PROCESS = k
  – Automatically set to OMP_NUM_THREADS if OpenMP is being used
  – Provides a hint to the MPI runtime to spare resources for application threads
• New variable for thread binding with respect to the parent process and architecture
  – MV2_THREADS_BINDING_POLICY = {linear | compact | spread}
    • linear – binds MPI ranks and OpenMP threads sequentially (one after the other)
      – Recommended on non-hyperthreaded systems with MPI+OpenMP
    • compact – binds each MPI rank to a physical core and places its OpenMP threads on the hardware threads
      – Recommended on multi-/many-core systems, e.g., KNL, POWER8, and hyper-threaded Xeon
  (A sample launch line using these variables appears after the User-Defined Process Mapping slide below.)

User-Defined Process Mapping

• User has complete control over process mapping
• To run 4 processes on cores 0, 1, 4, 5:
  – $ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out
• Use ',' or '-' to bind to a set of cores:
  – $ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0,2-4:1:5:6 ./a.out
• Is process binding working as expected?
  – MV2_SHOW_CPU_BINDING=1
    • Displays CPU binding information
    • Launcher independent
    • Example: MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=scatter
        -------------CPU AFFINITY-------------
        RANK:0  CPU_SET:  0
        RANK:1  CPU_SET:  8

• Refer to the "Running with Efficient CPU (Core) Mapping" section of the MVAPICH2 user guide for more information
  – http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3rc1-userguide.html#x1-600006.5
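As a concrete illustration of the hybrid binding knobs above, the launch line below starts a two-node, four-rank MPI+OpenMP job with 10 threads per rank (20 cores per node). The host file, process counts, and binary name are placeholders, and the exact set of variables to pass should be checked against the MVAPICH2 user guide for the release in use.

    $ mpirun_rsh -np 4 -hostfile hosts \
          OMP_NUM_THREADS=10 \
          MV2_CPU_BINDING_POLICY=hybrid \
          MV2_THREADS_PER_PROCESS=10 \
          MV2_THREADS_BINDING_POLICY=compact \
          MV2_SHOW_CPU_BINDING=1 ./a.out

Setting MV2_SHOW_CPU_BINDING=1 prints the resulting affinity, so the compact placement can be verified before running at scale.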

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Scalable Collectives with CMA (Intra-node, Reduce & Alltoall)

[Figure: intra-node (Nodes=1, PPN=20) Reduce and Alltoall latency comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Up to 5X and 3X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively

Scalable Collectives with CMA (Intra-node, Gather & Scatter)

[Figure: intra-node (Nodes=1, PPN=20) Gather and Scatter latency comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Up to 24X and 15X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively

Scalable Collectives with CMA (Multi-node, Reduce & Alltoall)

[Figure: multi-node (Nodes=4, PPN=20) Reduce and Alltoall latency comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0]

Up to 12.4X and 8.5X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively

Scalable Collectives with CMA (Multi-node, Gather & Scatter)

[Figure: multi-node (Nodes=4, PPN=20) Gather and Scatter latency comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0]

Up to 17.8X and 3.9X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Optimized MVAPICH2 All-Reduce with XPMEM

[Figure: All-Reduce latency at Nodes=1 and Nodes=2 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 All-Reduce with XPMEM

[Figure: All-Reduce latency at Nodes=3 and Nodes=4 (PPN=20) comparing MVAPICH2-2.3rc1, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI did not run for > 2 processes)

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 Reduce with XPMEM

[Figure: Reduce latency at Nodes=1 and Nodes=2 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI Reduce design in MVAPICH2
  – Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 Reduce with XPMEM

[Figure: Reduce latency at Nodes=3 and Nodes=4 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI Reduce design in MVAPICH2
  – Up to 7X performance improvement over OpenMPI at scale and up to 25% over MVAPICH2-2.3rc1

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

MiniAMR Performance using Optimized XPMEM-based Collectives

[Figure: MiniAMR execution time at 10, 20, 40, and 60 MPI processes (proc-per-sock=10, proc-per-node=20), MVAPICH2-2.3rc1 vs. MVAPICH2-XPMEM; 36-45% improvement]

• MiniAMR application execution time comparing MVAPICH2-2.3rc1 and the optimized All-Reduce design
  – Up to 45% improvement over MVAPICH2-2.3rc1 in the mesh-refinement time of MiniAMR for a weak-scaling workload on up to four POWER8 nodes

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current model – separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9)
  – http://mvapich.cse.ohio-state.edu

OpenSHMEM PUT/GET using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) shmem_put/shmem_get latency for small and large messages]

• OpenSHMEM one-sided PUT/GET benchmark using MVAPICH2-X 2.3b on an OpenPOWER cluster (see the sketch below)
  – Near-native performance benefits are observed on OpenPOWER
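For readers unfamiliar with the one-sided operations being benchmarked, a minimal sketch appears below. The symmetric buffer name and message size are illustrative only, and it assumes an OpenSHMEM 1.2+ library such as the one shipped with MVAPICH2-X.

    #include <shmem.h>
    #include <string.h>
    #include <stdio.h>

    #define MSG_SIZE 512   /* illustrative size, within the small-message range above */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric (remotely accessible) buffer on every PE */
        char *remote = (char *) shmem_malloc(MSG_SIZE);
        char  local[MSG_SIZE];
        memset(local, me, MSG_SIZE);
        shmem_barrier_all();

        int peer = (me + 1) % npes;
        shmem_putmem(remote, local, MSG_SIZE, peer);  /* one-sided put into peer's buffer */
        shmem_quiet();                                /* wait for completion of the put */
        shmem_barrier_all();

        shmem_getmem(local, remote, MSG_SIZE, peer);  /* one-sided get from peer's buffer */
        if (me == 0) printf("put/get of %d bytes completed\n", MSG_SIZE);

        shmem_free(remote);
        shmem_finalize();
        return 0;
    }

The OSU OpenSHMEM micro-benchmarks used for the numbers above follow the same pattern, timing repeated put/get operations between a pair of PEs.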

UPC PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) upc_putmem/upc_getmem latency for small and large messages]

• UPC one-sided memput/memget benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

UPC++ PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) upcxx_async_put/upcxx_async_get latency for small and large messages]

• UPC++ one-sided async_copy_put/get benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
  – CUDA-aware MPI
  – Optimized GPUDirect RDMA (GDR) Support
  – Optimized MVAPICH2 for OpenPower
• Support for Deep Learning

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2; a complete sketch follows below)

High Performance and High Productivity
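To show what the two calls above look like in a complete program, here is a minimal sketch of CUDA-aware point-to-point communication between two ranks. The buffer name and message size are illustrative, and it assumes a CUDA-aware MPI library such as MVAPICH2-GDR so that device pointers can be passed directly to MPI.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* illustrative 1M-element message */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the communication buffer directly in GPU memory */
        float *devbuf;
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0) {
            cudaMemset(devbuf, 0, N * sizeof(float));
            /* Device pointer passed straight to MPI_Send; no cudaMemcpy staging needed */
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", N);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2-GDR this would typically be built with the library's mpicc against the CUDA runtime and launched with MV2_USE_CUDA=1, which enables the CUDA-aware path.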

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, MV2-(NO-GDR) vs. MV2-GDR-2.3a; about 1.88 us latency (11X better), 9X higher bandwidth, and 10X higher bi-bandwidth]

Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox Connect-X4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes, MV2 vs. MV2+GDR; about 2X improvement]

• Platform: Wilkes (Intel Ivy Bridge + K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs) for default, callback-based, and event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the Cosmo application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS '16

MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

[Figure: intra-node and inter-node GPU-GPU latency and bandwidth on OpenPOWER, intra-socket (NVLink) vs. inter-socket]

Intra-node latency: 14.6 us (without GPUDirect RDMA); intra-node bandwidth: 33.9 GB/sec (NVLink)
Inter-node latency: 23.8 us (without GPUDirect RDMA); inter-node bandwidth: 11.9 GB/sec (EDR)

Available in MVAPICH2-GDR 2.3a
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect

Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• List of applications (16 contributions so far)
  – Amber, Cloverleaf, GAPgeofem, HOOMD-blue, HPCG, LAMMPS, LU, Lulesh, MILC, Neuro, POP2, SMG2000, SPEC MPI, TERA_TF, UMR, and WRF2
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning
  – New challenges for MPI Run-time
  – Optimizing Large-message Collectives
  – Co-designing DL Frameworks

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up performance (cuDNN, cuBLAS, NCCL) vs. scale-out performance (MPI, gRPC, Hadoop), with the proposed co-designs targeting both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)

Large Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA

[Figure: Reduce, Allreduce, and Bcast latency as a function of message size (2-128 MB) and GPU count (up to 192 GPUs)]

A sketch of what these large-message reductions look like at the application level follows below.
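DL frameworks reduce gradients with exactly this kind of large-message Allreduce. The sketch below averages a gradient buffer that lives in GPU memory; the buffer name and size are illustrative, and it assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) so that the device pointer can be passed directly to MPI_Allreduce.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define GRAD_ELEMS (16 * 1024 * 1024)   /* illustrative 64 MB of float gradients */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Gradients produced by the local GPU (zero-filled here for brevity) */
        float *d_grad;
        cudaMalloc((void **)&d_grad, GRAD_ELEMS * sizeof(float));
        cudaMemset(d_grad, 0, GRAD_ELEMS * sizeof(float));

        /* Sum the gradients across all ranks directly on device buffers */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, GRAD_ELEMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* A real framework would now scale by 1/nranks on the GPU and apply the update */
        (void)nranks;

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }

This is the pattern whose latency the Reduce/Allreduce/Bcast curves above measure at 64-128 MB message sizes.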

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
• 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

[Figure: Allreduce latency vs. message size (4 bytes to 512 MB) for MVAPICH2, Baidu, and OpenMPI; MV2 is ~20% better than Baidu for small messages and ~3.5X-10X better at large messages, while OpenMPI is ~4X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a and higher
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

[Figure: Allreduce latency vs. message size (4 bytes to 512 MB) for MVAPICH2, Baidu, and OpenMPI on 16 GPUs; MV2 is ~2X better than Baidu for small messages and ~4X-30X better at large messages, while OpenMPI is ~5X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a and higher
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance in most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

[Figure: Bcast latency vs. message size (4 bytes to 64 MB) with 1 GPU/node and 2 GPUs/node, NCCL2 vs. MVAPICH2-GDR; MVAPICH2-GDR is ~4X-10X better]

*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

[Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); Caffe on 128 GPUs is an invalid use case]

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon

Concluding Remarks

• Next-generation HPC systems need to be designed with a holistic view of Big Data and Deep Learning
• The OpenPOWER platform is emerging with many novel features
• Presented some of the approaches and results along these directions from the MVAPICH2 and HiDL projects
• Solutions enable high-performance and scalable HPC and Deep Learning on OpenPOWER platforms

High-Performance Hadoop and Spark on OpenPOWER Platform

Talk at 2:15 pm

Funding Acknowledgments

Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), R. Biswas (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)
Current Students (Undergraduate): N. Sarkauskas (B.S.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialists: J. Smith, M. Arnold
Current Post-doc: A. Ruhela

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project – http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project – http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project – http://hidl.cse.ohio-state.edu/
