
Designing Efficient HPC, Big Data, Deep Learning, and Middleware for Exascale Systems

Keynote Talk at HPCAC‐China Conference by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Increasing Usage of HPC, Big Data and Deep Learning

HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning! Increasing need to run these applications on the Cloud!!

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?

[Figure: Hadoop, Spark, and Deep Learning jobs scheduled on the same physical compute infrastructure]

HPC, Big Data, Deep Learning, and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting Accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and Containers

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design opportunities and challenges across various layers. Application kernels/applications sit on top of middleware and programming models (MPI, PGAS (UPC, OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop (MapReduce), Spark (RDD, DAG), etc.), which rely on a communication library or runtime providing point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, and fault tolerance, with performance, scalability, and resilience as cross-cutting concerns; the lowest layers are networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi-/many-core architectures, and accelerators (GPU and FPGA)]

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 429,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 15th: 241,108-core Pleiades at NASA
    • 20th: 462,462-core Stampede at TACC
    • 44th: 74,520-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) to Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative downloads of MVAPICH/MVAPICH2 releases (MV 0.9.0 through MV2 2.2, plus MV2-X, MV2-GDR, MV2-MIC, and MV2-Virt) from 2004 to 2017, growing to more than 400,000 downloads]

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

[Figure: the high-performance and scalable communication runtime offers diverse APIs and mechanisms (point-to-point primitives, collectives algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, introspection & analysis), with support for modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path; transport protocols RC, XRC, UD, DC; features SR-IOV, multi-rail, UMR, ODP) and modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi, ARM, NVIDIA GPGPU; transport mechanisms shared memory, CMA, IVSHMEM, XPMEM*; features MCDRAM*, NVLink*, CAPI*). * Upcoming]

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication
  – Scalable start-up
  – Optimized collectives using SHArP and multi-leaders
  – Optimized CMA-based collectives
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Optimized MVAPICH2 for OpenPower and ARM

One-way Latency: MPI over IB with MVAPICH2

[Figure: small- and large-message one-way latency of MVAPICH2 over TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, and Omni-Path; labeled small-message latencies of 1.01, 1.04, 1.11, 1.15, and 1.19 us across these interconnects]

Bandwidth: MPI over IB with MVAPICH2

[Figure: unidirectional and bidirectional bandwidth (MBytes/sec) vs. message size on the same adapters; labeled unidirectional peaks of 3,373, 6,356, 12,083, 12,366, and 12,590 MBytes/sec and bidirectional peaks of 6,228, 12,161, 21,227, 21,983, and 24,136 MBytes/sec]

Platforms: TrueScale-QDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch; ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch; ConnectIB-DualFDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch; ConnectX-4-EDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch; Omni-Path: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with Omni-Path switch.
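For reference, the one-way latency reported above is typically measured with a simple ping-pong loop. The following is a minimal sketch of that pattern (illustration only, not the OSU micro-benchmark code, which adds warm-up iterations and more careful timing):

    /* Minimal MPI ping-pong sketch of a one-way latency measurement (assumes 2 ranks). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int size = 8, iters = 10000;
        char buf[8] = {0};
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)   /* one-way latency is half the measured round-trip time */
            printf("one-way latency: %.2f us\n", (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }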

Startup Performance on KNL + Omni-Path

[Figure: MPI_Init and Hello World time (seconds) with MVAPICH2-2.3a on Oakforest-PACS, from 64 to 64K processes]

• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2, full scale)
• 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC)
• At 64K processes, MPI_Init and Hello World take 5.8 s and 21 s, respectively (Oakforest-PACS)
• All numbers reported with 64 processes per node
• New designs available in MVAPICH2-2.3a and as a patch for SLURM-15.08.8 and SLURM-16.05.1
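For context, the "Hello World" being timed is essentially the minimal MPI program below (a sketch; the exact benchmark harness used on Stampede2 and Oakforest-PACS is not shown in the talk):

    /* Minimal MPI "Hello World" illustrating the startup cost being measured. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* dominated by job startup and connection setup at scale */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }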

Advanced Allreduce Collective Designs Using SHArP

[Figure: average DDOT Allreduce time of HPCG and mesh refinement time of MiniAMR (seconds) at (4,28), (8,28), and (16,28) (number of nodes, PPN); MVAPICH2-SHArP improves over MVAPICH2 by 12% (HPCG) and 13% (MiniAMR)]

SHArP support is available in MVAPICH2 2.3a.

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi-Leader Design, SuperComputing '17.

Performance of MPI_Allreduce on Stampede2 (10,240 Processes)

[Figure: MPI_Allreduce latency (us) vs. message size (4 bytes to 256K) for MVAPICH2, MVAPICH2-OPT, and Intel MPI (IMPI); OSU Micro Benchmark, 64 PPN]

• For MPI_Allreduce latency with 32K bytes, MVAPICH2-OPT reduces the latency by 2.4X

M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning‐based Multi‐Leader Design, SuperComputing '17.

Enhanced MPI_Bcast with Optimized CMA-based Design

[Figure: MPI_Bcast latency (us) for 1K to 4M byte messages on KNL (64 processes), Broadwell (28 processes), and Power8 (160 processes), comparing MVAPICH2-2.3a, Intel MPI 2017, OpenMPI 2.1.0, and the proposed design, which uses SHMEM for small messages and CMA for large messages]

• Up to 2x-4x improvement over the existing implementation for 1 MB messages
• Up to 1.5x-2x faster than Intel MPI and Open MPI for 1 MB messages
• Improvements obtained for large messages only
• p-1 copies with CMA, p copies with SHMEM
• Fallback to SHMEM for small messages

S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster '17, Best Paper Finalist.

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Optimized MVAPICH2 for OpenPower and ARM

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9); a minimal usage sketch follows the Graph500/Sort results below
  – http://mvapich.cse.ohio-state.edu

Application Level Performance with Graph500 and Sort

[Figure: Graph500 execution time (s) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs at 4K-16K processes, and Sort execution time (s) for MPI vs. Hybrid on 500GB-512 through 4TB-4K (input data, number of processes)]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: 51% improvement over the MPI design
  – MPI: 2408 sec (0.16 TB/min); Hybrid: 1172 sec (0.36 TB/min)
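The hybrid designs above mix MPI and OpenSHMEM calls in a single program. Below is a minimal sketch of that model (illustration only, not the Graph500 or Sort benchmark code); it assumes a unified runtime such as MVAPICH2-X so that both APIs can be used in the same job:

    /* Minimal hybrid MPI + OpenSHMEM sketch (illustration only). */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        shmem_init();                               /* both runtimes share the same job under MVAPICH2-X */

        int me = shmem_my_pe(), npes = shmem_n_pes();
        long *counter = shmem_malloc(sizeof(long)); /* symmetric heap allocation */
        *counter = 0;
        shmem_barrier_all();

        /* One-sided phase: every PE atomically increments a counter on PE 0 */
        shmem_long_inc(counter, 0);
        shmem_barrier_all();

        /* Collective phase expressed in MPI */
        long local = (long)me, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (me == 0)
            printf("counter = %ld, allreduce sum = %ld (npes = %d)\n", *counter, sum, npes);

        shmem_free(counter);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }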

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

Optimized OpenSHMEM with AVX and MCDRAM: Application Kernels Evaluation

[Figure: execution time (s) of the Heat-2D kernel (Jacobi method) and the Heat Image kernel on 16-128 processes, comparing KNL (default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]

• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
  – CUDA-aware MPI
  – GPUDirect RDMA (GDR) support
  – Control-flow decoupling through GPUDirect Async
  – Optimized GDR support
• Optimized MVAPICH2 for OpenPower and ARM

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At the sender:
    MPI_Send(s_devbuf, size, …);
At the receiver:
    MPI_Recv(r_devbuf, size, …);
(s_devbuf and r_devbuf are GPU device buffers; the data movement is handled inside MVAPICH2)

High Performance and High Productivity
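A minimal self-contained sketch of this usage (illustrative only; it assumes a CUDA-aware MPI library such as MVAPICH2-GPU/MVAPICH2-GDR that accepts device pointers, and at least 2 ranks):

    /* Minimal CUDA-aware MPI point-to-point sketch. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                 /* 1M floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        if (rank == 0) {
            /* ... fill devbuf on the GPU ... */
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);      /* device pointer passed directly */
        } else if (rank == 1) {
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }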

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
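Collectives work the same way: device pointers can be passed directly to calls such as MPI_Allreduce. A hedged sketch (illustration only; assumes a CUDA-aware build) of the gradient-averaging pattern used later in the deep-learning section:

    /* Sketch: MPI_Allreduce directly on GPU device buffers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int count = 16 * 1024 * 1024;    /* 64 MB of floats, e.g. a gradient buffer */
        float *d_grad;
        cudaMalloc((void **)&d_grad, count * sizeof(float));
        /* ... backward pass fills d_grad on the GPU ... */

        /* In-place sum reduction across all ranks, operating on device memory */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }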

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR; small-message latency is 2.18 us, with improvements of up to 10x-11x over MVAPICH2 without GDR and roughly 2x-3x over MV2-GDR 2.0b]

Platform: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPU-Direct RDMA.

Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) for HOOMD-blue with 64K and 256K particles on 4-32 processes; MV2+GDR achieves about 2X higher TPS than MV2]

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time of the Cosmo model with Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs)]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.

C.-H. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16.

MVAPICH2-GDS: Preliminary Results

[Figure: latency (us) of Kernel+Send and Recv+Kernel, and overlap (%) with host computation/communication, for Default MPI vs. Enhanced MPI+GDS, 1 byte to 4 KB messages]

• Latency-oriented: able to hide the kernel launch overhead
  – 8-15% improvement compared to the default behavior
• Overlap: communication and computation tasks are asynchronously offloaded to the queue
  – 89% overlap with host computation at 128-byte message size

Platform: Intel Sandy Bridge, NVIDIA Tesla K40c, and Mellanox FDR HCA; CUDA 8.0, OFED 3.4; each kernel is ~50 us. Will be available in a public release soon.

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2 (no GDR) vs. MV2-GDR 2.3a; small-message latency is 1.91 us (about 10x better), with up to 9x-11x higher bandwidth and bi-bandwidth]

Platform: MVAPICH2-GDR 2.3a (upcoming), Intel Haswell (E5-2687W) node with 20 cores, NVIDIA Pascal P100 GPU, Mellanox ConnectX-5 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPU-Direct RDMA.

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, UPC++, …)
• Integrated Support for GPGPUs
• Optimized MVAPICH2 for OpenPower and ARM

Intra-node Point-to-Point Performance on OpenPower

[Figure: MVAPICH2 intra-node small- and large-message latency, bandwidth, and bi-directional bandwidth vs. message size]

Platform: OpenPOWER (Power8-ppc64le) processor with a 160-core dual-socket CPU; each socket contains 80 cores.

Inter-node Point-to-Point Performance on OpenPower

[Figure: MVAPICH2 inter-node small- and large-message latency, bandwidth, and bi-directional bandwidth vs. message size]

Platform: two nodes of OpenPOWER (Power8-ppc64le) CPU using a Mellanox EDR (MT4115) HCA.

MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

[Figure: intra-node and inter-node GPU-GPU latency and bandwidth vs. message size, for intra-socket (NVLink) and inter-socket paths]

• Intra-node latency: 16.8 us (without GPUDirect RDMA); intra-node bandwidth: 32 GB/sec (NVLink)
• Inter-node latency: 22 us (without GPUDirect RDMA); inter-node bandwidth: 6 GB/sec (FDR)
• Will be available in the upcoming MVAPICH2-GDR

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and a 4X-FDR InfiniBand interconnect.

Intra-node Point-to-point Performance on ARMv8

[Figure: MVAPICH2 intra-node small- and large-message latency, bandwidth, and bi-directional bandwidth vs. message size on ARMv8; 0.74 microsec latency for 4-byte messages]

• Will be available in the upcoming MVAPICH2

Platform: ARMv8 (aarch64) MIPS processor with a 96-core dual-socket CPU; each socket contains 48 cores.

HPC, Big Data, Deep Learning, and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting Accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and Containers

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Bring HPC and Big Data processing into a “convergent trajectory”!

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Figure: applications and benchmarks run over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached) and programming models (sockets vs. RDMA, with possible upper-level changes), on top of a communication and I/O library providing point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS & fault tolerance, and performance tuning; underneath are commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40/100 GigE and intelligent NICs), and storage technologies (HDD, SSD, NVM, and NVMe-SSD)]

The High-Performance Big Data (HiBD) Project

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• Available for InfiniBand and RoCE; also runs on Ethernet
• http://hibd.cse.ohio-state.edu
• User base: 250 organizations from 31 countries
• More than 23,500 downloads from the project site

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)

[Figure: execution time (s) of RandomWriter (80-160 GB) and TeraGen (80-240 GB) with IPoIB (EDR) vs. OSU-IB (EDR); reduced by 3x and 4x, respectively. Cluster with 8 nodes and a total of 64 maps]

• RandomWriter: 3x improvement over IPoIB for 80-160 GB file size
• TeraGen: 4x improvement over IPoIB for 80-240 GB file size

Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank

[Figure: PageRank total time (sec) for IPoIB vs. RDMA on Huge, BigData, and Gigantic data sizes; 37% reduction with 32 worker nodes (768 cores) and 43% with 64 worker nodes (1536 cores)]

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

Performance Evaluation on SDSC Comet: Astronomy Application

• Kira Toolkit [1]: distributed astronomy image processing toolkit implemented using Apache Spark
• Source extractor application, using a 65 GB dataset from the SDSS DR2 survey that comprises 11,150 image files
• Compares RDMA Spark performance with the standard Apache implementation using IPoIB

[Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores; RDMA Spark is 21% faster than Apache Spark (IPoIB)]

1. Z. Zhang, K. Barbary, F. A. Nothaft, E. R. Sparks, M. J. Franklin, D. A. Patterson, S. Perlmutter. Scientific Computing meets Big Data Technology: An Astronomy Use Case. CoRR, vol: abs/1507.03325, Aug 2015.

M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016

HPC, Big Data, Deep Learning, and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting Accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and Containers

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (order of megabytes)
  – Most communication based on GPU buffers
• Existing state-of-the-art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance (for small and medium message sizes only!)
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up vs. scale-out performance of NCCL, cuDNN, cuBLAS, MPI, gRPC, and Hadoop, with the proposed co-designs targeting both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).

Efficient Broadcast: MVAPICH2-GDR and NCCL

• NCCL 1.x had some limitations
  – Only worked for a single node; no scale-out on multiple nodes
  – Degradation across IOH (socket) for scale-up (within a node)
• We propose an optimized MPI_Bcast design that exploits NCCL [1]
  – Communication of very large GPU buffers
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits:
  – CUDA-Aware MPI_Bcast in MV2-GDR for inter-node (knomial) transfers
  – NCCL broadcast (ring) for intra-node transfers
• Can pure MPI-level designs achieve similar or better performance than the NCCL-based approach? [2]

[Figure: VGG training time (seconds) with CNTK on 2-64 and 2-128 GPUs, comparing MV2-GDR, MV2-GDR-NCCL, and MV2-GDR-Opt]

1. A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning. In Proceedings of the 23rd European MPI Users' Group Meeting (EuroMPI 2016). [Best Paper Nominee]

10 10 – NCCL Broadcast for intra‐node transfers 5 5 Training Training 0 0 •Can pure MPI‐level designs be done that achieve similar 2 4 8 16 32 64 2483264128 No. of GPUs No. of GPUs [2] or better performance than NCCL‐based approach? VGG Training with CNTK 1. A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA‐Aware MPI for Deep Learning. In Proceedings of the 23rd European MPI Users' Group Meeting (EuroMPI 2016). [Best Paper Nominee]

2. A. A. Awan, C‐H. Chu, H. Subramoni, and D. K. Panda. Optimized Broadcast for Deep Learning Workloads on Dense‐GPU InfiniBand Clusters: MPI or NCCL?, arXiv ’17 (https://arxiv.org/abs/1707.09414) Network Based Computing Laboratory HPCAC‐China ‘17 45 OSU‐Caffe: Scalable Deep Learning • Caffe : A flexible and layered Deep Learning framework. GoogLeNet (ImageNet) on 128 GPUs • Benefits and Weaknesses 250 –Multi‐GPU Training within a single node –Performance degradation for GPUs across different 200 sockets – Limited Scale‐out

(seconds) 150 •OSU‐Caffe: MPI‐based Parallel Training – Enable Scale‐up (within a node) and Scale‐out (across Time 100 multi‐GPU nodes) –Scale‐out on 64 GPUs for training CIFAR‐10 network on 50 CIFAR‐10 dataset Training –Scale‐out on 128 GPUs for training GoogLeNet network on ImageNet dataset 0 8 163264128 OSU‐Caffe publicly available from No. of GPUs Invalid use case http://hidl.cse.ohio‐state.edu/ Caffe OSU‐Caffe (1024) OSU‐Caffe (2048) Network Based Computing Laboratory HPCAC‐China ‘17 46 Large Message Optimized Collectives for Deep Learning Reduce – 192 GPUs Reduce –64 MB 300 200 200 (ms) (ms) 100 100 •MV2‐GDR provides 0 0 optimized collectives for Latency 2 4 8 163264128 Latency 128 160 192 Message Size (MB) No. of GPUs large message sizes Allreduce –64 GPUs Allreduce ‐ 128 MB 300 • Optimized Reduce, 300 200 (ms) 200 (ms) Allreduce, and Bcast 100 100 0

• Good scaling with large 0 Latency Latency 16 32 64 2 4 8 163264128 number of GPUs No. of GPUs Message Size (MB) • Available in MVAPICH2‐ Bcast –64 GPUs Bcast 128 MB 100 100 GDR 2.2GA (ms) (ms)

50 50

0 0 Latency Latency 2 4 8 163264128 16 32 64 Message Size (MB) No. of GPUs Network Based Computing Laboratory HPCAC‐China ‘17 47 MVAPICH2‐GDR vs. Baidu‐allreduce •Initial Evaluation shows promising performance gains for MVAPICH2‐GDR 2.3a* compared to Baidu‐allreduce

[Figure: allreduce latency (us, log scale) vs. message size on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR (MPI_Allreduce); labeled improvements of roughly 11% at some message sizes and up to ~30X better latency at others]

*MVAPICH2-GDR 2.3a will be available soon

High-Performance Deep Learning over Big Data (DLoBD) Stacks

• Benefits of Deep Learning over Big Data (DLoBD)
  – Easily integrate deep learning components into the Big Data processing workflow
  – Easily access the data stored in Big Data systems
  – No need to set up new dedicated deep learning clusters; reuse existing big data analytics clusters
• Challenges
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band communication vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve a 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy

[Figure: epoch time (secs) and accuracy (%) over 18 epochs for IPoIB vs. RDMA; RDMA reduces epoch time by 2.6X at similar accuracy]

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA‐capable Networks, HotI 2017.

HPC, Big Data, Deep Learning, and Cloud

• Traditional HPC
  – Message Passing Interface (MPI), including MPI + OpenMP
  – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
  – Exploiting Accelerators
• Big Data/Enterprise/Commercial Computing
  – Spark and Hadoop (HDFS, HBase, MapReduce)
  – Memcached is also used for Web 2.0
• Deep Learning
  – Caffe, CNTK, TensorFlow, and many more
• Cloud for HPC and Big Data
  – Virtualization with SR-IOV and Containers

Can HPC and Virtualization be Combined?

• Virtualization has many benefits
  – Fault tolerance
  – Job migration
  – Compaction
• It has not been very popular in HPC due to the overhead associated with virtualization
• New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
• Enhanced MVAPICH2 support for SR-IOV
• MVAPICH2-Virt 2.2 supports:
  – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters?, EuroPar '14.
J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC '14.
J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid '15.

Application-Level Performance on Chameleon

[Figure: execution time of Graph500 (problem sizes (22,20) to (26,16), in ms) and SPEC MPI2007 benchmarks (milc, leslie3d, pop2, GAPgeofem, zeusmp2, lu, in s) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native]

• 32 VMs, 6 cores/VM
• Compared to Native, 2-5% overhead for Graph500 with 128 processes
• Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 processes

Network Based Computing Laboratory HPCAC‐China ‘17 52 Application‐Level Performance on Docker with MVAPICH2 100 300 Container‐Def 90 Container‐Def 73% Container‐Opt 250 80 11% Container‐Opt Native

(s) 70 200

(ms) Native 60 Time 50

Time 150

40 30 100 Execution Execution

20 50

10 BFS 0 0 MG.D FT.D EP.D LU.D CG.D 1Cont*16P 2Conts*8P 4Conts*4P NAS Scale, Edgefactor (20,16) Graph 500 •64 Containers across 16 nodes, pining 4 Cores per Container • Compared to Container‐Def, up to 11% and 73% of execution time reduction for NAS and Graph 500 •Compared to Native, less than 9 % and 5% overhead for NAS and Graph 500

Network Based Computing Laboratory HPCAC‐China ‘17 53 Application‐Level Performance on Singularity with MVAPICH2

NPB Class D Graph500 300 3000 Singularity Singularity 6% 250 2500 (ms) Native Native (s) 200 2000 Time

Time

150 1500 Execution 100 1000

Execution 7% BFS 50 500

0 0 CG EP FT IS LU MG 22,16 22,20 24,16 24,20 26,16 26,20 Problem Size (Scale, Edgefactor) • 512 Processes across 32 nodes •Less than 7% and 6% overhead for NPB and Graph500, respectively

J. Zhang, X. Lu and D. K. Panda, Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?, UCC '17.

Concluding Remarks

• Next-generation HPC systems need to be designed with a holistic view of Big Data, Deep Learning and Cloud
• Presented some of the approaches and results along these directions
• Enable the Big Data, Deep Learning and Cloud communities to take advantage of modern HPC technologies
• Many other open issues need to be solved
• Next-generation HPC professionals need to be trained along these directions

Funding Acknowledgments

Funding support by [sponsor logos not shown]
Equipment support by [vendor logos not shown]

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialists: J. Smith, M. Arnold
Current Post-doc: A. Ruhela

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins

Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as
  – Post-Doctoral Researchers (博士后)
  – PhD Students (博士研究生)
  – MPI Programmer/Software Engineer (MPI软件开发工程师)
  – Hadoop/Big Data Programmer/Software Engineer (大数据软件开发工程师)
• If interested, please contact me at this conference and/or send an e-mail to [email protected]-state.edu or [email protected]-state.edu

Thank You!

[email protected]-state.edu

Network‐Based Computing Laboratory http://nowlab.cse.ohio‐state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
