Status of OpenFabrics Interfaces (OFI) Support in MPICH

2020 OFA Virtual Workshop
STATUS OF OPENFABRICS INTERFACES (OFI) SUPPORT IN MPICH
Yanfei Guo, Assistant Computer Scientist, Argonne National Laboratory

AGENDA
▪ What is MPICH?
▪ Why OFI?
▪ Current support
  • MPICH 3.3 series (CH4)
  • MPICH 3.4 series (CH4)
▪ Ongoing work
  • New Collective Framework
  • GPU Support

WHAT IS MPICH?
▪ MPICH is a high-performance and widely portable open-source implementation of MPI
▪ It provides all features of MPI that have been defined so far (up to and including MPI-3.1)
▪ Active development led by Argonne National Laboratory and the University of Illinois at Urbana-Champaign
  • Several close collaborators who contribute features, bug fixes, testing for quality assurance, etc.
  • IBM, Microsoft, Cray, Intel, Ohio State University, Queen's University, Mellanox, RIKEN AICS, and others
▪ Current stable release is MPICH-3.3.2
▪ Latest release is MPICH-3.4a2
▪ www.mpich.org

MPICH: GOAL AND PHILOSOPHY
▪ MPICH aims to be the preferred MPI implementation on the top machines in the world
▪ Our philosophy is to create an "MPICH Ecosystem"
[Diagram: the MPICH ecosystem, with MPICH at the center surrounded by derived MPI implementations (Intel MPI, IBM MPI, Cray MPI, MVAPICH, Tianhe MPI, Lenovo MPI, Microsoft MPI, Mellanox MPICH-MXM) and libraries/tools built on MPI (PETSc, GA-MPI, CAF-MPI, OpenShmem-MPI, ADLB, MPE, TAU, HPCToolkit, DDT, Totalview, MathWorks, ANSYS)]

MOTIVATION
▪ Why OFI/OFIWG?
  • Support for diverse hardware through a common API
  • Actively, openly developed
    • Bi-weekly calls
    • Hosted on GitHub
  • Close abstraction for MPI
    • MPI community engaged from the start
  • Fully functional sockets provider
    • Prototype code on a laptop
  • Strong vendor support

MPICH-3.3 SERIES
▪ Introducing the CH4 device
  • Replacement for CH3, but we will maintain CH3 until all of our partners have moved to CH4
  • Co-design effort
    • Weekly telecons with partners to discuss design and development issues
  • Two primary objectives:
    • Low instruction-count communication
      • Ability to support high-level network APIs (OFI, UCX), e.g., tag matching in hardware, direct PUT/GET communication
    • Support for very high thread concurrency
      • Improvements to message rates in highly threaded environments (MPI_THREAD_MULTIPLE)
      • Support for multiple network endpoints (THREAD_MULTIPLE or not)

MPICH WITH CH4 DEVICE OVERVIEW
[Diagram: CH4 layering. The application calls the MPI Interface; the machine-independent MPI layer provides collectives, derived datatype management, and group management over the Abstract Device Interface (ADI); the CH4 device contains the CH4 core with active-message fallback, GPU support fallback, and architecture-specific collectives, sitting on netmods (OFI, UCX) and shmmods (POSIX, XPMEM)]

MPICH PERFORMANCE AND SCALABILITY
▪ Lightweight communication
  • Reducing overhead in instruction count and memory usage
  • Inlining Libfabric with MPICH further reduces overhead
▪ Improvements in MPI one-sided communication
  • Enabling HW-accelerated RMA
▪ Communication hints
  • Allowing the user to tell MPI to optimize for the crucial subset of features
[Chart: BGQ LAMMPS strong scaling, MPICH/CH4 vs. MPICH/original; timesteps per second and CH4 efficiency for 512 to 8192 nodes (368 to 23 atoms per core)]

MULTIPLE VIRTUAL NETWORK INTERFACE (VNI)
▪ Virtual Network Interface (VNI)
  • Each VNI abstracts a set of network resources
  • Some networks support multiple VNIs: InfiniBand contexts, scalable endpoints over Intel Omni-Path
  • Traditional MPI implementations use a single VNI, which serializes all traffic and does not fully exploit network hardware resources
▪ Utilizing multiple VNIs to maximize independence in communication
  • Separate VNIs per communicator or per RMA window
  • Distribute traffic between VNIs with respect to ranks, tags, and generally out-of-order communication
  • M:N mapping between work queues and VNIs
[Diagram: threads T0-T3 each driving Comm[0]-Comm[3]; MPI maps the communicators onto multiple hardware VNIs]
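The following is a minimal sketch (not taken from the slides) of the application-side pattern that multi-VNI MPICH is designed to exploit: MPI_THREAD_MULTIPLE with one duplicated communicator per thread, so that each thread's traffic is independent and can be kept on its own VNI. The thread count, ring exchange, and tags are invented for the example, and whether traffic actually lands on separate VNIs depends on how MPICH was built and configured and on the underlying provider.

```c
/* Sketch: MPI_THREAD_MULTIPLE with one communicator per thread.
 * Build (for example): mpicc -fopenmp threads_per_comm.c -o threads_per_comm */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NTHREADS 4

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Comm comms[NTHREADS];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE is not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* One communicator per thread: traffic on different communicators is
     * independent, which is what allows the library to spread it over VNIs. */
    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]);

    #pragma omp parallel num_threads(NTHREADS)
    {
        int tid = omp_get_thread_num();
        int to = (rank + 1) % size;          /* simple ring exchange */
        int from = (rank - 1 + size) % size;
        int sendval = rank * NTHREADS + tid, recvval;

        /* Each thread communicates only on its own communicator. */
        MPI_Sendrecv(&sendval, 1, MPI_INT, to, tid,
                     &recvval, 1, MPI_INT, from, tid,
                     comms[tid], MPI_STATUS_IGNORE);
    }

    for (int i = 0; i < NTHREADS; i++)
        MPI_Comm_free(&comms[i]);
    MPI_Finalize();
    return 0;
}
```

This corresponds to the "MPI_THREAD_MULTIPLE with separate COMMs" series in the figure on the following slide.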
MPI+THREAD HYBRID PROGRAMMING PERFORMANCE
[Figure: message rate (x 10^6 messages/s) vs. message size (B) for MPI_THREAD_SINGLE, MPI_THREAD_MULTIPLE with MPI_COMM_WORLD, and MPI_THREAD_MULTIPLE with separate COMMs; companion diagrams contrast the current MPI-3.1 multithreaded transfer model with a work-queue data transfer model using MPI endpoints, where the user exposes parallelism with communicators and tags]

UPCOMING MPICH-3.4 AND FUTURE PLANS
▪ New Collective Framework
  • Optimizing collectives based on communication characteristics and the availability of HW acceleration
  • JSON configuration generated by an external profiler
▪ GPU Support
  • Communication using GPU-resident buffers
  • Non-contiguous datatypes

NEW COLLECTIVE FRAMEWORK
▪ Thanks to Intel for the significant work on this infrastructure
▪ Two major improvements:
  • C++-template-like structure (still written in C)
    • Allows collective algorithms to be written in template form
    • Provides "generic" top-level instantiation using point-to-point operations
    • Allows device-level, machine-specific optimized implementations (e.g., using triggered operations for OFI or HCOLL for UCX)
  • Several new algorithms for a number of blocking and nonblocking collectives (performance tuning still ongoing)
Contributed by Intel (with some minor help from Argonne)

SELECTING COLLECTIVE ALGORITHM
▪ Choose optimal collective algorithms
  • Optimized algorithm for a given communicator size and message size
  • Optimized algorithm using HW collective support
  • Decision made on each collective call
▪ Generated decision tree (see the illustrative sketch after the GPU slides below)
  • JSON file describing which algorithm to choose under which conditions
  • JSON file created by profiling tools
  • JSON parsed at MPI_Init time and applied to the library
Contributed by Intel (with some minor help from Argonne)

GPU SUPPORT PLAN
▪ Internode
  • Native GPU support through Libfabric and UCX
  • Developing a fallback path for networks with no native GPU support
▪ Intranode
  • GPU support in SHM
  • Supporting non-contiguous datatypes for GPU buffers
  • Packing/unpacking using host/device buffers
Partnership with Intel, Cray, Mellanox, NVIDIA, and AMD

CURRENT STATE OF GPU SUPPORT
▪ Native internode
  • With Libfabric, UCX, and supported GPUs
▪ Yaksa datatype engine (https://github.com/pmodels/yaksa)
  • Supports H2D, D2H, and D2D transfers
  • Packing to an appropriate GPU or CPU staging buffer for either the native or the fallback route
  • 1.0 release with a CUDA backend
  • Intel is contributing the Intel Xe backend
[Diagram and chart: the Yaksa datatype engine has a datatype frontend (vector, indexed, struct, ...) over CPU, CUDA (NVIDIA GPU), HIP (AMD GPU), and ZE (Intel GPU) backends; the chart shows packing time (msec) for the Y-Z plane of a 3D matrix (2x2x<dims>) vs. the number of integers in the Z dimension, comparing Yaksa H2H and D2D]
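As a usage-level illustration of the GPU support described above (not code from the slides), the sketch below passes a CUDA device buffer directly to MPI and uses a non-contiguous vector datatype, which the datatype engine packs from device memory. It assumes an MPICH build with CUDA-aware GPU support; the matrix size and ring exchange are invented for the example.

```c
/* Sketch: communication from a GPU-resident buffer with a non-contiguous
 * (column) datatype.  Assumes MPICH built with CUDA support so device
 * pointers can be passed directly to MPI.  Build roughly as:
 *   mpicc gpu_column.c -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A 128x128 row-major matrix of ints resident in GPU memory. */
    const int n = 128;
    int *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * n * sizeof(int));
    cudaMalloc((void **)&d_recv, n * n * sizeof(int));
    cudaMemset(d_send, 1, n * n * sizeof(int));

    /* Non-contiguous datatype: one matrix column (n blocks of 1 int,
     * stride of one row).  MPICH packs/unpacks it from device memory. */
    MPI_Datatype column;
    MPI_Type_vector(n, 1, n, MPI_INT, &column);
    MPI_Type_commit(&column);

    int to = (rank + 1) % size;
    int from = (rank - 1 + size) % size;

    /* Device pointers are handed to MPI directly. */
    MPI_Sendrecv(d_send, 1, column, to, 0,
                 d_recv, 1, column, from, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

On the intranode path the same call goes through SHM, and where the network has no native GPU support the fallback path stages through host buffers, as described on the GPU support plan slide.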
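Looping back to the collective selection framework above: the generated JSON decision tree essentially encodes choices like the one sketched below, picking an algorithm from the communicator size, the message size, and whether hardware collectives are available. The algorithm names and thresholds here are invented for illustration; the real tuning values come from the profiler-generated JSON file parsed at MPI_Init.

```c
/* Purely illustrative: the kind of decision a generated tuning file encodes
 * for collective selection.  Names and thresholds are invented. */
#include <stddef.h>
#include <stdio.h>

typedef enum {
    ALLREDUCE_RECURSIVE_DOUBLING,    /* small messages: latency-bound        */
    ALLREDUCE_REDUCE_SCATTER_GATHER, /* large messages: bandwidth-bound      */
    ALLREDUCE_HW_TRIGGERED           /* offloaded, if the netmod supports it */
} allreduce_algo_t;

static allreduce_algo_t select_allreduce(int comm_size, size_t msg_bytes,
                                         int have_hw_collectives)
{
    if (have_hw_collectives)
        return ALLREDUCE_HW_TRIGGERED;
    if (msg_bytes <= 2048 || comm_size <= 8)
        return ALLREDUCE_RECURSIVE_DOUBLING;
    return ALLREDUCE_REDUCE_SCATTER_GATHER;
}

int main(void)
{
    /* 64 ranks, 64 KiB messages, no HW collectives -> reduce-scatter+gather */
    printf("selected algorithm id: %d\n",
           select_allreduce(64, 64 * 1024, 0));
    return 0;
}
```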
MPICH-3.4 ROADMAP
▪ CH4 is already available at http://github.com/pmodels/mpich
▪ MPICH-3.4 GA coming out this summer
  • Multi-VNI support
  • Collective Selection Framework

2020 OFA Virtual Workshop
THANK YOU
Yanfei Guo, Assistant Computer Scientist
Argonne National Laboratory
