2020 OFA Virtual Workshop

STATUS OF OPENFABRICS INTERFACES (OFI) SUPPORT IN MPICH

Yanfei Guo, Assistant Computer Scientist

Argonne National Laboratory

AGENDA

▪ What is MPICH?
▪ Why OFI?
▪ Current support
• MPICH 3.3 series (CH4)
• MPICH 3.4 series (CH4)
▪ Ongoing work
• New Collective Framework
• GPU Support

WHAT IS MPICH?

▪ MPICH is a high-performance and widely portable open-source implementation of MPI
▪ It provides all features of MPI that have been defined so far (up to and including MPI-3.1)
▪ Active development led by Argonne National Laboratory and the University of Illinois at Urbana-Champaign
• Several close collaborators who contribute features, bug fixes, testing for quality assurance, etc.
• IBM, Microsoft, Intel, Ohio State University, Queen’s University, Mellanox, RIKEN AICS and others
▪ Current stable release is MPICH-3.3.2
▪ Latest release is MPICH-3.4a2
▪ www.mpich.org

MPICH: GOAL AND PHILOSOPHY

▪ MPICH aims to be the preferred MPI implementation on the top machines in the world
▪ Our philosophy is to create an “MPICH Ecosystem”

[Figure: the MPICH ecosystem — MPICH at the center, surrounded by derivative MPI implementations and tools including Intel MPI, IBM MPI, Cray MPI, MVAPICH, Tianhe MPI, Lenovo MPI, Microsoft MPI, Mellanox MPICH-MXM, CAF-MPI, GA-MPI, OpenShmem-MPI, PETSc, TAU, ADLB, MPE, HPCToolkit, DDT, Totalview, MathWorks, and ANSYS]

MOTIVATION

▪ Why OFI/OFIWG?
• Support for diverse hardware through a common API
• Actively, openly developed
  • Bi-weekly calls
  • Hosted on GitHub
• Close abstraction for MPI
  • MPI community engaged from the start
• Fully functional sockets provider
  • Prototype code on a laptop
• Strong vendor support

MPICH-3.3 SERIES

▪ Introducing the CH4 device (see the usage sketch below)
• Replacement for CH3, but we will maintain CH3 until all of our partners have moved to CH4
• Co-design effort
  • Weekly telecons with partners to discuss design and development issues
• Two primary objectives:
  • Low-instruction-count communication
    • Ability to support high-level network APIs (OFI, UCX), e.g., tag matching in hardware, direct PUT/GET communication
  • Support for very high concurrency
    • Improvements to message rates in highly threaded environments (MPI_THREAD_MULTIPLE)
    • Support for multiple network endpoints (THREAD_MULTIPLE or not)
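As an illustration (not taken from the slides), the sketch below shows the usage pattern these objectives target: MPI_THREAD_MULTIPLE with one duplicated communicator per thread, which gives an implementation such as MPICH/CH4 the freedom to map each thread's traffic onto its own network endpoint instead of serializing it.

```c
/* Sketch: per-thread communicators with MPI_THREAD_MULTIPLE.
 * Independent communicators let the MPI library map each thread's
 * traffic onto a separate network endpoint/VNI. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nthreads = 4;
    MPI_Comm comms[4];
    for (int i = 0; i < nthreads; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);   /* one communicator per thread */

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        int peer = (rank + 1) % size;
        int sendbuf = rank, recvbuf = -1;
        /* Each thread exchanges on its own communicator, so the
         * message streams stay independent of each other. */
        MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, tid,
                     &recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, tid,
                     comms[tid], MPI_STATUS_IGNORE);
    }

    for (int i = 0; i < nthreads; i++)
        MPI_Comm_free(&comms[i]);
    MPI_Finalize();
    return 0;
}
```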

MPICH WITH CH4 DEVICE OVERVIEW

[Figure: CH4 device architecture — the application calls the MPI interface; the machine-independent MPI layer (derived datatype management, group management, collectives) sits on the Abstract Device Interface (ADI); the CH4 core provides an active-message fallback, a GPU support fallback, and architecture-specific collectives; architecture-specific netmods (OFI, UCX) and shmmods (POSIX, XPMEM) plug in underneath]

MPICH PERFORMANCE

▪ Lightweight communication
• Reducing overhead in instruction count and memory usage
• Inlining libfabric with MPICH further reduces overhead
▪ Improvements in MPI one-sided communication
• Enabling HW-accelerated RMA
▪ Communication hints
• Allowing the user to tell MPI to optimize for the crucial subset of features (see the sketch after the figure below)

[Figure: BGQ LAMMPS strong scaling, MPICH/CH4 vs. original MPICH — timesteps per second and MPICH/CH4 efficiency (up to 100%) for 512 to 8192 nodes (368 down to 23 atoms per core)]
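As an illustration of such hints (not from the slides), the sketch below passes standard MPI-3 window info keys at window creation; whether these actually enable a hardware-accelerated RMA path depends on the MPI implementation and the network.

```c
/* Sketch: communication hints at window creation so the MPI library
 * can choose a hardware-accelerated RMA path.  The info keys used
 * here ("no_locks", "accumulate_ordering", "same_op_no_op") are
 * standard MPI-3 window hints. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "no_locks", "true");            /* no passive-target locks   */
    MPI_Info_set(info, "accumulate_ordering", "none"); /* no accumulate ordering    */
    MPI_Info_set(info, "same_op_no_op", "true");       /* restricted accumulate ops */

    double *base;
    MPI_Win win;
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                     info, MPI_COMM_WORLD, &base, &win);
    MPI_Info_free(&info);

    /* ... MPI_Put/MPI_Get/MPI_Accumulate epochs would go here ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```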

MULTIPLE VIRTUAL NETWORK INTERFACE (VNI)

▪ Virtual Network Interface (VNI)
• Each VNI abstracts a set of network resources
• Some networks support multiple VNIs: InfiniBand contexts, scalable endpoints over Intel Omni-Path
• A traditional MPI implementation uses a single VNI
  • Serializes all traffic
  • Does not fully exploit network hardware resources
▪ Utilizing multiple VNIs to maximize independence in communication (see the sketch after the figure below)
• Separate VNIs per communicator or per RMA window
• Distribute traffic between VNIs with respect to ranks, tags, and generally out-of-order communication
• M-N mapping between work queues and VNIs

[Figure: threads T0-T3 each driving a separate communicator (Comm[0]-Comm[3]); MPI maps the communicators onto multiple VNIs, which map onto the hardware]
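A minimal sketch (illustrative, not from the slides) of how an application can expose this independence: post nonblocking streams on separate duplicated communicators, so that a multi-VNI implementation may route each communicator's traffic through a different VNI. The actual communicator-to-VNI mapping is left to the implementation.

```c
/* Sketch: spreading traffic across duplicated communicators so a
 * multi-VNI MPI implementation can route each stream through a
 * different VNI instead of serializing it on one network context. */
#include <mpi.h>

#define NSTREAMS 4
#define COUNT    (1 << 16)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm comms[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);

    static double sbuf[NSTREAMS][COUNT], rbuf[NSTREAMS][COUNT];
    MPI_Request reqs[2 * NSTREAMS];
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* Post all streams at once; each uses its own communicator. */
    for (int i = 0; i < NSTREAMS; i++) {
        MPI_Irecv(rbuf[i], COUNT, MPI_DOUBLE, prev, 0, comms[i], &reqs[2 * i]);
        MPI_Isend(sbuf[i], COUNT, MPI_DOUBLE, next, 0, comms[i], &reqs[2 * i + 1]);
    }
    MPI_Waitall(2 * NSTREAMS, reqs, MPI_STATUSES_IGNORE);

    for (int i = 0; i < NSTREAMS; i++)
        MPI_Comm_free(&comms[i]);
    MPI_Finalize();
    return 0;
}
```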

MPI+THREAD HYBRID PROGRAMMING PERFORMANCE

[Figure: multithreaded data transfer models — with current MPI (3.1), the application's threads all funnel through a single user endpoint and hardware context; with the work-queue data transfer model over MPI endpoints, the application exposes parallelism through communicators/tags and each thread drives its own context. Message-rate chart (messages/s, x10^6, vs. message size in bytes) comparing MPI_THREAD_SINGLE, MPI_THREAD_MULTIPLE with MPI_COMM_WORLD, and MPI_THREAD_MULTIPLE with separate COMMs]

UPCOMING MPICH-3.4 AND FUTURE PLANS

▪ New Collective Framework
• Optimizing collectives based on communication characteristics and the availability of HW acceleration
• JSON configuration generated by an external profiler
▪ GPU Support
• Communication using GPU-resident buffers
• Non-contiguous datatypes

NEW COLLECTIVE FRAMEWORK

▪ Thanks to Intel for the significant work on this infrastructure
▪ Two major improvements:
• C++-template-like structure (still written in C)
  • Allows collective algorithms to be written in template form
  • Provides “generic” top-level instantiation using point-to-point operations
  • Allows device-level, machine-specific optimized implementations (e.g., using triggered operations for OFI or HCOLL for UCX)
• Several new algorithms for a number of blocking and nonblocking collectives (performance tuning still ongoing)

Contributed by Intel (with some minor help from Argonne)

SELECTING COLLECTIVE ALGORITHM

▪ Choose optimal collective algorithms
• Optimized algorithm for a given communicator size and message size
• Optimized algorithm using HW collective support
• Decision made on each collective call
▪ Generated decision tree (an illustrative sketch follows below)
• JSON file describing which algorithms to choose under which conditions
• JSON file created by profiling tools
• JSON parsed at MPI_Init time and applied to the library

Contributed by Intel (with some minor help from Argonne)
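Purely for illustration, and not the actual schema of MPICH's generated tuning files, a decision tree of this kind could look roughly like the following, choosing an allreduce algorithm by communicator size and message size. All key and algorithm names here are hypothetical.

```json
{
  "_comment": "Hypothetical example only; not the MPICH tuning-file schema.",
  "collective": "allreduce",
  "conditions": [
    { "comm_size_max": 16,   "msg_size_max": 1024,   "algorithm": "recursive_doubling" },
    { "comm_size_max": 16,   "msg_size_min": 1025,   "algorithm": "reduce_scatter_allgather" },
    { "comm_size_min": 17,   "msg_size_max": 65536,  "algorithm": "tree" },
    { "comm_size_min": 17,   "msg_size_min": 65537,  "algorithm": "ring" }
  ],
  "fallback": "generic_pt2pt"
}
```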

GPU SUPPORT PLAN

▪ Internode
• Native GPU support through Libfabric and UCX
• Developing a fallback path when there is no native GPU support
▪ Intranode
• GPU support in SHM
▪ Intranode
• Supporting non-contiguous datatypes for GPU buffers
• Packing/unpacking using host/device buffers
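To make the intended usage concrete, here is a minimal sketch (an illustrative example, not code from the slides) of sending the Y-Z plane of a GPU-resident 3D array with a derived datatype. It assumes an MPI library built with CUDA-aware GPU support, such as MPICH with its GPU path enabled; without that support, the device buffer would have to be staged through host memory first.

```c
/* Sketch: communication from a GPU-resident buffer with a
 * non-contiguous datatype.  Assumes a GPU-aware (CUDA) MPI build. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 3D array of doubles resident on the GPU, x fastest-varying. */
    double *d_grid;
    cudaMalloc((void **)&d_grid, (size_t)NX * NY * NZ * sizeof(double));

    /* Y-Z plane at fixed x: NY*NZ elements, each 1 double, stride NX. */
    MPI_Datatype plane;
    MPI_Type_vector(NY * NZ, 1, NX, MPI_DOUBLE, &plane);
    MPI_Type_commit(&plane);

    /* Pass the device pointer directly to MPI (run with 2 ranks). */
    if (rank == 0)
        MPI_Send(d_grid, 1, plane, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_grid, 1, plane, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&plane);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}
```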

Partnership with Intel, Cray, Mellanox, NVIDIA and AMD

CURRENT STATE OF GPU SUPPORT

▪ Native internode
• With Libfabric, UCX and supported GPUs
▪ Yaksa datatype engine (https://github.com/pmodels/yaksa)
• Supports H2D, D2H and D2D transfers
• Packs into an appropriate GPU or CPU staging buffer for either the native or the fallback route
• 1.0 release with CUDA backend
• Intel is contributing the Intel Xe backend

[Figure: the Yaksa datatype engine — a datatype frontend (vector, indexed, struct, ...) over CPU, CUDA (NVIDIA GPU), HIP (AMD GPU) and ZE (Intel GPU) backends; chart of packing time (msec) for the Y-Z plane of a 3D matrix vs. the number of integers in the Z dimension, Yaksa H2H vs. Yaksa D2D]

MPICH-3.4 ROADMAP

▪ CH4 is already available at http://github.com/pmodels/mpich
▪ MPICH-3.4 GA coming out this summer
• Multi-VNI support
• Collective Selection Framework

2020 OFA Virtual Workshop

THANK YOU

Yanfei Guo, Assistant Computer Scientist
Argonne National Laboratory