Scalability-Driven Approaches to Key Aspects of the Message Passing Interface for Next Generation Supercomputing

by

Ayi Judicael Zounmevo

A thesis submitted to the

Department of Electrical and Computer Engineering

in conformity with the requirements for

the degree of Doctor of Philosophy

Queen’s University

Kingston, Ontario, Canada

May 2014

Copyright Ayi Judicael Zounmevo, 2014

Abstract

The Message Passing Interface (MPI), which dominates the supercomputing programming environment, is used to orchestrate and fulfill communication in High Performance Computing (HPC). How far HPC programs can scale depends in large part on the ability to achieve fast communication, and to overlap communication with computation or communication with communication.

This dissertation proposes a new asynchronous solution to the nonblocking Rendezvous protocol used between pairs of processes to transfer large payloads. On top of enforcing communication/computation overlapping in a comprehensive way, the proposal outperforms existing network device-agnostic asynchronous solutions by being memory-scalable and by avoiding brute-force strategies.

Achieving overlapping between communication and computation is important, but each communication is also expected to generate minimal latency. In that respect, the processing of the queues meant to hold messages pending reception inside the MPI middleware is expected to be fast. Currently though, that processing slows down as program scales grow.

This research presents a novel scalability-driven message queue whose processing skips altogether large portions of queue items that are deterministically guaranteed to lead to unfruitful searches. Because it has little sensitivity to program sizes, the proposed message queue maintains very good performance, on top of displaying a low and flattening memory footprint growth pattern.

Due to the blocking nature of its required synchronizations, the one-sided communication model of MPI creates both communication/computation and communication/communication serializations. This research fixes these issues and the latency-related inefficiencies documented for MPI one-sided communications by proposing completely nonblocking and non-serializing versions of those synchronizations. The improvements, meant for consideration in a future MPI standard, also allow new classes of programs to be expressed more efficiently in MPI.

Finally, a persistent distributed service is designed over MPI to show its impacts at large scales beyond communication-only activities. MPI is analyzed in situations of resource exhaustion, partial failure and heavy use of internal objects for communicating and non-communicating routines. Important scalability issues are revealed and solution approaches are put forth.

Acknowledgments

Thanks to God for past and current graces, for the wisdom he offers for free; and for watching over me and my future.

I am deeply indebted to my supervisor, Dr. Ahmad Afsahi, for his unfailing support and for constantly pushing me toward the completion of this research. My thanks go to him for his guidance for high quality research, understanding and praiseworthy patience. I am also indebted to Dr. Dries Kimpe, Dr. Pavan Balaji and Dr. Rob Ross of the U.S. Argonne National Laboratory, IL. I undeniably grew with their advice and insight during my Ph.D. studies. To the members of my thesis committee, Dr. Ying Zou, Dr. Ahmed Hassan, Dr. Michael Korenberg and Dr. Jesper Larsson Träff, thanks for the priceless feedback on this research.

To my co-workers from the Parallel Processing Research lab, Dr. Ying Qian, Dr. Mohammad Rashti, Dr. Ryan Grant, Dr. Reza Zamani, Grigory Inozemtsev, Iman Faraji and Hessam Mirsadeghi, thanks for your support, the constructive discussions, the jokes and the laughter. The experience would have been lonely without you. To Ryan, additional thanks for that spontaneous and extremely helpful advice in my last year. To Iman and Hessam, thanks for now being two irreplaceable friends for life. Thanks to Xin Zhao of the University of Illinois at Urbana-Champaign for the collaboration and insightful discussions on MPI one-sided communications.

Thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting this research through grants to my supervisor. Thanks to the Electrical and Computer Engineering (ECE) Department of Queen’s University as well as the School of Graduate Studies for the financial support. Thanks to the IT service and office staff of the ECE Department of Queen’s University. Special thanks to Ms. Debra Fraser and Ms. Kendra Pople-Easton for being so nice, so supportive and so available for me. My gratitude also goes to Pak Lui and the HPC Advisory Council for the supercomputing resources. Thanks to the Argonne Laboratory Computing Research Center for the supercomputing resources.

To all my friends in Montreal, I cannot acknowledge enough your support. In alphabetical order, Narcisse Adjalla, Santos Adjalla, Sami Abike Yacoubou-Chabi Yo, Sani Chabi Yo, Florent Kpodjedo, Ablan Joyce Nouaho and Fatou Tao, thank you. As for Florent, my improvised psychologist in difficult and lonely moments, additional thanks; you actually watched over me.

Lastly, I would like to acknowledge my family for always believing in me. My gratitude goes to my brothers and sisters, Ida, Cyr, Hervé, Eric, Lydia, Annick and Laure for their support and prayers. To my father Fabien and my mother Jeanne, you are the absolute mental strength that I carry with myself; I owe you so much.

Table of Contents

Abstract i

Acknowledgments iii

Table of Contents v

List of Tables ix

List of Figures x

List of Equations xiii

List of Code Listings xiv

Acronyms xv

Chapter 1: Introduction ...... 1
1.1 Motivation ...... 3
1.2 Problem Statement ...... 7
1.3 Contributions ...... 8
1.4 Dissertation Outline ...... 11

Chapter 2: Background ...... 13
2.1 Message Passing Interface (MPI) ...... 14
2.1.1 Datatypes and Derived Datatypes in MPI ...... 16
2.1.2 Two-sided Communications ...... 17
2.1.3 Collective Communications ...... 19
2.1.4 One-sided Communications ...... 19
2.1.5 The Nonblocking Communication Handling in MPI ...... 23
2.1.6 Generalized Requests and Info Objects ...... 24
2.1.7 Levels ...... 25
2.2 High Performance Interconnects ...... 26
2.2.1 InfiniBand: Characteristics, Queueing Model and Semantics ...... 28
2.2.2 The Lower-level Messaging Middleware ...... 29
2.3 Summary ...... 30

Chapter 3: Scalable MPI Two-Sided Message Progression over RDMA ...... 31
3.1 Introduction to Rendezvous Protocols ...... 34
3.1.1 RDMA Write-based Rendezvous Protocol ...... 34
3.1.2 RDMA Read-based Rendezvous Protocol ...... 36
3.1.3 Synthesis of the RDMA-based Protocols and Existing Improvement Proposals ...... 36
3.1.4 Hardware-offloaded Two-sided Message Progression ...... 40
3.1.5 Host-based Asynchronous Message Progression Methods ...... 41
3.1.6 Summary of Related Work ...... 43
3.2 Scenario-conscious Asynchronous Rendezvous ...... 44
3.2.1 Design Objectives ...... 45
3.2.2 Asynchronous Execution Flows inside a Single Thread ...... 47
3.3 Experimental Evaluation ...... 49
3.3.1 Microbenchmark Results ...... 50
3.3.2 Application Results ...... 54
3.4 Summary ...... 58

Chapter 4: Scalable Message Queues in MPI ...... 60
4.1 Related Work ...... 62
4.2 Motivations ...... 63
4.2.1 The Omnipresence of Message Queue Processing in MPI Communications ...... 63
4.2.2 Performance and Scalability Concerns ...... 64
4.2.3 Rethinking Message Queues for Scalability by Leveraging MPI’s Very Characteristics ...... 67
4.3 A New Message Queue for Large Scales ...... 69
4.3.1 Reasoning about Dimensionality ...... 69
4.3.2 A Scalable Multidimensional MPI Message Sub-queue ...... 70
4.3.3 Receive Match Ordering Enforcement ...... 75
4.3.4 Size Threshold Heuristic for Sub-queue Structures ...... 76
4.4 Performance Analysis ...... 80
4.4.1 Runtime Complexity Analysis ...... 80
4.4.2 Memory Overhead Analysis ...... 90
4.4.3 Cache Behaviour ...... 93
4.5 Practical Justification of a New Message Queue Data Structure ...... 94
4.6 Experimental Evaluation ...... 98
4.6.1 Notes on Measurements ...... 99
4.6.2 Microbenchmark Results ...... 99
4.6.3 Application Results ...... 111
4.6.4 Message Queue Memory Behaviours at Extreme Scales ...... 116
4.7 Summary ...... 119

Chapter 5: Nonblocking MPI One-sided Communication Synchronizations ...... 121
5.1 Related Work ...... 123
5.2 The Burden of Blocking RMA Synchronization ...... 124
5.2.1 The Inefficiency Patterns ...... 124
5.2.2 Serializations in MPI One-sided Communications ...... 126
5.3 Opportunistic Message Progression as Default Advantage for Nonblocking Synchronizations ...... 128
5.4 Impacts of Nonblocking RMA Epochs ...... 130
5.4.1 Fixing The Inefficiency Patterns ...... 131
5.4.2 Unlocking or Improving New Use Cases ...... 135
5.5 Nonblocking API ...... 138
5.6 Semantics, Correctness and Progress Engine Behaviours ...... 139
5.6.1 Semantics and Correctness ...... 140
5.6.2 Info Object Key-value Pairs and Aggressive Progression Rules ...... 145
5.7 Design and Realization Notes ...... 150
5.7.1 Separate and New Progress Engine ...... 150
5.7.2 Deferred Epochs and Epoch Recording ...... 152
5.7.3 Epoch Ids, Win traits and Epoch Matching ...... 153
5.7.4 Notes on Lock and Fence Epochs ...... 155
5.8 Experimental Evaluation ...... 156
5.8.1 Microbenchmark ...... 157
5.8.2 Unstructured Dynamic Application Pattern ...... 176
5.8.3 Application Test ...... 178
5.9 Summary ...... 183

Chapter 6: MPI in Extreme-scale Persistent Services: Some Missing Scalability Features ...... 186
6.1 Related Work ...... 188
6.2 Service Overview: The Distributed Storage ...... 188
6.3 Transport Requirements ...... 192
6.4 MPI as a Network API ...... 193
6.4.1 Connection/Disconnection ...... 194
6.4.2 The RPC Listener ...... 198
6.4.3 I/O Life Cycles ...... 199
6.4.4 Cancellation ...... 201
6.4.5 Client-server and Failure Behaviour ...... 206
6.4.6 Object Limits ...... 209
6.5 Some Missing Scalability Features ...... 210
6.5.1 Fault Tolerance ...... 210
6.5.2 Generalization of Nonblocking Operations ...... 212
6.5.3 Object Limits and Resource Exhaustion ...... 217
6.6 Summary ...... 218

Chapter 7: Conclusion and Future Work ...... 219
7.1 Summary of Findings ...... 220
7.2 Future Work ...... 224

Bibliography ...... 229

List of Publications ...... 240

List of Tables

4.1 Runtime complexity summary ...... 89
4.2 Number of pointer operations required to reach the last queue item for different communicator sizes (case of a single communicator) ...... 90
4.3 Number of pointer operations required to reach the last queue item for different communicator sizes (case of multiple communicators) ...... 90
4.4 Runtime object sizes ...... 92
4.5 Comparative memory overhead of the array-based approach and the 4D approach ...... 93
4.6 Comparative speed test (in seconds) between the 4D design and a red-black tree design ...... 97
4.7 Memory overhead (in KB) of the 4D and red-black tree designs compared to the linked list design (1 message per rank) ...... 97
4.8 PRQ behaviour data for Radix and Nbody ...... 112
4.9 UMQ behaviour data for Radix and Nbody ...... 112
4.10 Average memory overhead measurements of the applications (in KB) ...... 117
4.11 Memory consumption overhead extrapolation on Sequoia and ...... 119

6.1 Behaviours in case of isolated server abortion ...... 207
6.2 Storage job behaviours after client job failure ...... 208
6.3 Object limits ...... 210

List of Figures

2.1 HPC cluster: Intra-node and out-of-node memory view ...... 14
2.2 HPC cluster: software stack view ...... 14
2.3 Two-sided message transfer protocols ...... 18
2.4 MPI RMA synchronizations (inspired by [105]) ...... 21
2.5 Zero-copy and CPU involvement (Adapted from [66]) ...... 27

3.1 Serialized and overlapped communication/computation ...... 33
3.2 RDMA Write-based Rendezvous protocol ...... 35
3.3 RDMA Read-based Rendezvous protocol ...... 37
3.4 Receiver-initiated Rendezvous protocol ...... 38
3.5 Parasite execution flow-based message progression at receiver side for two-sided communications ...... 46
3.6 Receiver-side latency overhead of interrupt-based threading (MVAPICH-ASYNC) vs. PEF (RGET is the reference) ...... 51
3.7 Communication/computation overlapping of RGET (MVAPICH), interrupt-based threading (MVAPICH-ASYNC) and PEF ...... 53
3.8 Impact at receive side on the communication time of a job that resorts to blocking receives ...... 55
3.9 Nonblocking receive-based application benchmark results ...... 56

4.1 4-dimensional sub-queue with dimensionSpan = 32 ...... 72
4.2 New overall MPI message queue design ...... 77
4.3 Showing threshold accuracy effect on performance ...... 79
4.4 Array-based queue design for a communicator of size n ...... 81
4.5 Array-based queue search in absence of MPI_ANY_SOURCE ...... 83
4.6 Array-based UMQ search for MPI_ANY_SOURCE receive ...... 83
4.7 Array-based PRQ search when at least one MPI_ANY_SOURCE receive is pending ...... 84
4.8 Possible insertion points in the array-based structure ...... 84
4.9 4D structure queue search in absence of MPI_ANY_SOURCE ...... 86
4.10 4D structure PRQ search in presence of MPI_ANY_SOURCE ...... 87
4.11 Possible insertion points in the 4D structure ...... 87
4.12 PRQ operation over the linked list, with 1 communicator ...... 100
4.13 UMQ operation speedup over the linked list, with 1 communicator ...... 101
4.14 PRQ average memory overhead compared to the linked list, with 1 communicator ...... 103
4.15 UMQ average memory overhead compared to the linked list, with 1 communicator ...... 104
4.16 Different match orders for multi-communicator tests: example with 10 communicators of 256 distinct ranks and 5 pending messages per ...... 105
4.17 PRQ operation speedup over the linked list, with multiple communicators ...... 106
4.18 PRQ average memory overhead compared to the linked list, with multiple communicators ...... 107
4.19 PRQ L1 data cache improvement over the linked list, with multiple communicators ...... 108
4.20 PRQ L1 instruction cache improvement over the linked list, with multiple communicators ...... 109
4.21 PRQ last level (L3) cache improvement over the linked list, with multiple communicators ...... 110
4.22 Radix speedups, memory overhead and cache statistics for the array-based and 4D designs over the linked-list approach ...... 113
4.23 Nbody speedups, memory overhead and cache statistics for the array-based and 4D designs over the linked-list approach ...... 114

5.1 Parallel RMA in serialized epochs ...... 127
5.2 Serialized vs. non-serialized communications ...... 127
5.3 Opportunistic message progression ...... 129
5.4 Parallel epochs ...... 130
5.5 Impact of nonblocking epochs on Late Complete ...... 133
5.6 Mapping of application-level nonblocking epochs to MPI middleware-level handling ...... 142
5.7 Comparative overall GATS epoch latency of MVAPICH, blocking and nonblocking versions of the new design ...... 159
5.8 Comparative overall fence epoch latency of MVAPICH, blocking and nonblocking versions of the new design ...... 160
5.9 Comparative overall lock epoch latency of MVAPICH, blocking and nonblocking versions of the new design ...... 161
5.10 GATS comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design ...... 164
5.11 Fence comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design ...... 165
5.12 Passive target comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design ...... 166
5.13 Mitigating the Late Post inefficiency pattern: observing delay propagation in an origin process ...... 168
5.14 Mitigating the Late Complete inefficiency pattern: observing delay propagation in a target process ...... 169
5.15 Mitigating the Early Fence inefficiency pattern: observing communication latency propagation in a target process ...... 170
5.16 Mitigating the Wait at Fence inefficiency pattern: observing delay propagation in a target process ...... 171
5.17 Mitigating the Late Unlock inefficiency pattern: observing delay propagation to a subsequent lock requester ...... 172
5.18 Out-of-order GATS access epoch progression with MPI_WIN_ACCESS_AFTER_ACCESS_REORDER ...... 174
5.19 Out-of-order lock epoch progression with MPI_WIN_ACCESS_AFTER_ACCESS_REORDER ...... 175
5.20 Out-of-order GATS epoch progression with MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER ...... 175
5.21 Out-of-order GATS epoch progression with MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER ...... 176
5.22 Out-of-order GATS epoch progression with MPI_WIN_EXPOSURE_AFTER_ACCESS_REORDER ...... 177
5.23 Performance of a dynamic unstructured transactional communication pattern ...... 178
5.24 Wall-clock CPU time spent in communication wait phases ...... 181
5.25 Overall wall-clock CPU time spent in communication activities ...... 182
5.26 Overall wall-clock CPU time spent solving LU ...... 183

6.1 Distributed storage service ...... 189
6.2 I/O transfer protocol ...... 190
6.3 Zombie receive cleanup latency ...... 205

7.1 Theoretical optimal message queue dimensionality in terms of communicator sizes ...... 227

List of Equations

3.1 Overall duration of serialized communication and computation ...... 52
3.2 Overall duration of overlapped communication and computation ...... 52
3.3 Overall duration of overlapped communication and computation for C_p > C_m ...... 52
4.1 Rank as linear combination of powers of dimensionSpan ...... 68
4.2 dimensionSpan in terms of communicator size ...... 70
4.3 Purely dimensional decomposition speedup trend ...... 74
4.4 Additional speedup brought by array-based dimension modeling ...... 74
4.5 Communicator size threshold for small/large sub-queue ...... 76
4.6 Linked list search complexity without rank wildcard ...... 82
4.7 Linked list search complexity with rank wildcard ...... 82
4.8 Insertion/deletion complexity ...... 82
4.9 Array search complexity without rank wildcard ...... 82
4.10 Array and 4D UMQ search complexity with rank wildcard ...... 83
4.11 Array PRQ search complexity with rank wildcard ...... 84
4.12 4D search complexity without rank wildcard ...... 85
4.13 4D PRQ search complexity with rank wildcard ...... 85
4.14 Linked memory consumption ...... 90
4.15 Array memory consumption ...... 91
4.16 4D memory consumption ...... 91
4.17 Number of JumpPoints ...... 91
4.18 Number of Cubes ...... 91
4.19 Memory overhead of the array with respect to the linked list ...... 91
4.20 Memory overhead of the 4D data structure with respect to the linked list ...... 91
4.21 Memory overhead of the array with respect to the red-black tree ...... 96
5.1 Earliest data transfer time ...... 129
5.2 Effect of Late Post on next activity with blocking epochs ...... 131
5.3 Effect of Late Post on next activity with nonblocking epochs ...... 131
5.4 Effect of Early Fence on next activity with blocking epochs ...... 132
6.1 Number of simultaneous work items in terms of number of CPU cores ...... 213
6.2 condition ...... 214
6.3 Deadlock avoidance condition ...... 214
6.4 Absolute safety condition ...... 214
6.5 Concise absolute safety condition ...... 214
6.6 Opposite of concise absolute safety condition ...... 215
6.7 Infinite simultaneous concurrent work items ...... 215

List of Code Listings

6.1 Client-side connection code snippet ...... 196
6.2 Server-side connection code snippet ...... 196
6.3 Client-side disconnection code snippet ...... 197
6.4 Server-side disconnection code snippet ...... 197
6.5 Server-side nonblocking disconnection code snippet ...... 216

Acronyms

AEF Application Execution Flow. 45

API Application Programming Interface. 14

ARMCI Aggregate Remote Memory Copy Interface. 5

ASL Average Search Length. 111

CPU . 1

CQ Completion Queue. 28

CQE Completion Queue Entry. 28

CTS Clear To Send. 34

DDR Double Data Rate. 28

EDR Enhanced Data Rate. 28

FDR Fourteen Data Rate. 28

FLOPS Floating Point Operations Per Second. 2

GATS General Active Target Synchronization. 20

HCA Host Channel Adapter. 28

HPC High Performance Computing. 1

IBA InfiniBand Architecture. 28

IBTA InfiniBand Trade Association. 28

IETF Internet Engineering Task Force. 34

MASL Max Average Search Length. 111

MPI Message Passing Interface. 2

MQI Message Queue Item. 60

MQL Max Queue Length. 111

MTBF Mean Time Between Failures. 7

MX Myrinet Express. 30

NAS NASA Advanced Supercomputing. 53

NIC Network Interface Card. 26

OFED OpenFabrics Enterprise Distribution. 30

OS Operating System. 13

PAMI Parallel Active Messaging Interface. 29

PCIe Peripheral Component Interconnect Express. 27

PEF Parasite Execution Flow. 45

PGAS Partitioned Global Address Space. 2

POSIX Portable Operating System Interface. 47

PRQ Posted Receive Queue. 60

QDR Quad Data Rate. 28

QP Queue Pair. 28

RDMA Remote Direct Memory Access. 26

RMA Remote Memory Access. 3

RPC Remote Procedure Call. 189

RTR Ready To Receive. 36

RTS Ready To Send. 34

SDR Single Data Rate. 28

SRQ Shared Receive Queue. 29

TCP Transmission Control Protocol. 26

uGNI user Generic Network Interface. 30

UMQ Unexpected Message Queue. 60

WQE Work Queue Entry. 28

Chapter 1

Introduction

High Performance Computing (HPC) or supercomputing is the use of computing systems of large computational power to solve complex problems. HPC aims for speed of computation. Among other things, supercomputers are used for completing in a few hours computing activities that would take decades or centuries on mere workstations. There was a time when supercomputers were reserved only for governments, powerful corporations and select academic institutions. Back then, the immediate benefits of advances in HPC were reserved for a tiny fraction of the scientific community. Recently, however, HPC has evolved from fulfilling the sole needs of genomics, physics, ocean or astronomy research to become a powerful tool for just about any scientific research field and for industry as well. Psychologists are relying on supercomputers for knowledge retrieval and translation. Supercomputers are also routinely used for high frequency stock trading. More generally, any individual who is concerned about the next cancer cure, daily weather forecasting or fraud detection by credit card companies is already depending on HPC at unprecedented levels. Supercomputing has become a fast-paced research field that propels science, engineering, economy and even entertainment.

Like 84.6% of the 500 most powerful systems as of November 2013, the typical supercomputer is a cluster [106]; that is, a distributed system made of multiple compute nodes linked by a high speed network. Each node is usually a standalone multicore compute element. An HPC program, usually called a job at runtime, is made of multiple processes distributed over the Central Processing Unit (CPU) cores of the cluster. The distributed nature of clusters establishes communication as one of their key aspects. Communication is a collaboration mechanism which allows the tasks in an HPC job to frequently exchange data or intermediate results with their peers. Communication is actually the single element that determines performance when computation must cross node boundaries. As a consequence, the efficiency of the communication subsystem of clusters tends to get a lot of attention in supercomputing research. Armed with decades of experience, and with performance at the core of its concerns, the HPC community has maintained programming paradigms, referred to as message passing, which allow explicit communication between the numerous CPU cores of clusters. The most widespread embodiment of these paradigms is the Message Passing Interface (MPI) [68]. MPI, which is a standardized paradigm whose specification is maintained by the MPI Forum [68], is the default programming tool delivered with today's top supercomputers. MPI is also believed to remain a key player in supercomputing [9, 33]. In fact, because it is so widely supported on existing supercomputers and because it is the implementation vehicle of the massive amount of existing HPC code, MPI remains a support even for newly emerging programming paradigms [9, 64, 98] such as the Partitioned Global Address Space (PGAS) languages [114]. Justifiably, MPI is a place to bring major and immediately impactful software-level improvements in supercomputing.

The performance of a supercomputer is measured in Floating Point Operations Per Second (FLOPS); and a major goal of HPC research is to squeeze out larger and larger amounts of FLOPS from supercomputers. With the large number of CPU cores found in supercomputers, parallelism becomes the means of creating performance. However, parallelism tends to suffer a diminishing-returns effect. For a given job, the 10000th CPU core does not necessarily add the same number of FLOPS as the first and the second processors. This concern is just one of the many issues that are considered scalability-related. Scalability is concerned with how to create sustained performance growth via hardware and software evolution. Scalability is also concerned with how to get existing and new HPC jobs to leverage the performance growth.

Today's top supercomputers [106], such as Tianhe-2, Titan (Cray XK7) or Sequoia, are petaflops-capable machines; that is, their performance can reach the order of 10^15 FLOPS. Nevertheless, just as today's supercomputers allow feats that were impossible a decade ago, there remain goals for which even the impressive 33.86 petaflops of Tianhe-2 are inadequate. Justifiably, the HPC community is seeking to push the scales even further, with the goal of reaching an exaflop (10^18 FLOPS) by the end of the current decade. MPI and its various implementations have a key mandate in reaching the exascale goal; and for that purpose, their scalability effort must become less lenient on some seemingly benign serialization issues. A comprehensive scalability approach must also encompass non-communication-related aspects of HPC jobs at large scales. Examples of non-communication-related concerns are resiliency and memory consumption not linked to data transfer. This dissertation focuses on specific changes required for MPI to successfully fulfill its share of the challenges inherent to creating sustainable performance growth on today's and future system scales.

1.1 Motivation

Any MPI activity that can generate and propagate latency is a potential scalability issue. Any MPI activity whose resource consumption pattern is linear or superlinear is a potential scalability issue as well. Latency propagation can create a snowball effect whose overall impact depends on how many processes are involved. In general, every MPI communication can have impacts beyond the sole processes that it involves. A single delay in any communication can transitively impact all the other processes of a job by propagating to subsequent communications or computations. As scales grow, latency propagation tends to become more and more harmful.

MPI offers three programming models: two-sided communications, also called point-to-point communications; one-sided communications, also called Remote Memory Access (RMA); and collective communications. These three communication models are defined in depth in Chapter 2; but as a quick introduction, a two-sided communication involves a sender and a receiver both issuing a communication call. The parameters of both calls must match for the send operation to be consumed at the receiver side. One-sided operations occur between two processes. However, unlike two-sided operations, they do not have any concept of reception. Conceptually, only one process does the communication. The one-sided initiator can remotely load or store data into another peer; sometimes unbeknownst to the process which owns the memory region manipulated by the initiator. Collective communications involve multiple peers. A collective communication either allows a process to send to or receive from all the peers in a certain group of processes, or allows all the processes in a given group to simultaneously send to and receive from one another.

The MPI one-sided communication model is an example of a major MPI feature that inherently bears serialization and latency propagation in its very specification. MPI RMA occurs inside epochs, which are critical section-like regions enclosed by synchronization calls. The synchronization calls that close the epochs are blocking; and because their execution can involve internal communication between multiple peers, they can generate and propagate latency, on top of explicitly creating communication/communication or communication/computation serialization.
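As an illustration of the epoch structure just described, the sketch below shows a minimal fence-bounded access epoch in standard MPI C. The buffer size and the choice of the neighbouring target rank are hypothetical and only serve to show where the blocking synchronization calls sit; this is not the design proposed later in this dissertation.

#include <mpi.h>

/* Minimal sketch of a fence-bounded RMA epoch (hypothetical buffer sizes).
 * The two MPI_Win_fence calls delimit the epoch; both are blocking and
 * collective over the window, which is the serialization discussed above. */
void fence_epoch_sketch(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double local_window[1024];           /* memory exposed to remote peers      */
    double contribution = (double)rank;  /* payload pushed to the neighbour     */
    MPI_Win win;

    MPI_Win_create(local_window, 1024 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);               /* opens the epoch (blocking)          */
    MPI_Put(&contribution, 1, MPI_DOUBLE,
            (rank + 1) % size,           /* target rank                          */
            0,                           /* displacement in the target window    */
            1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);               /* closes the epoch; blocks until all
                                            RMA on the window has completed      */
    MPI_Win_free(&win);
}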

For more than 15 years, the one-sided communication model of MPI has failed to generate enthusiasm and adoption. Nevertheless, the adoption level of MPI as a whole has never ceased to increase; meaning that MPI has done quite well even without an accepted one-sided communication model. So, why would it matter now to worry about the MPI one-sided communication model? First, one-sided communications in general are supposed to be less subject to latency propagation than two-sided communications. Since only the initiator is involved in the data transfer, one-sided communications are decoupled from the availability of the receiving end. As a result, they offer more potential for preventing initiator-generated latency from delaying the remote peer, and vice versa. This reduced level of interaction is scalability-friendly. As proof of the adequacy of one-sided communication for current and future HPC concerns, one can notice its strong prevalence in modern supercomputing network technologies. In fact, modern network technologies [6, 44, 54, 57] put forth prominent one-sided communication features. Some of those modern interconnection technologies even offer only one-sided models of communication at the core of their operating modes. InfiniBand [44], which is the most prevalent networking family in the 500 most powerful supercomputers [106], is one such example. Similarly, the Portals networking API [6], which is used by the Cray supercomputers, is built around one-sided semantics. The emerging PGAS family of HPC programming languages [114] is also built over one-sided communication frameworks such as the Aggregate Remote Memory Copy Interface (ARMCI) [105] and GASNet [99]. Furthermore, new classes of programs which perform unstructured communications [115] are extremely challenging to realize efficiently in situations where each sent message must be matched by a receive. In these situations, one-sided data transfers become the viable communication model; opening up the path for a whole set of dynamic, loosely synchronous patterns of computation.

The one-sided communication model is not an alternative to two-sided communication; it brings its own features. By virtue of MPI being the most portable and the default programming framework uniformly delivered with new supercomputers, MPI RMA is the most accessible one-sided programming model available. There is an important advantage to portability in supercomputing because of the disparity in system architectures [106, 120]. In particular, there is an advantage to being able to run the same cancer research HPC code on Tianhe-1, Tianhe-2, the Japanese K computer, any Blue Gene system or any Cray architecture; and even on future, yet unknown supercomputer architectures [120]. Currently, such a source-level portability guarantee is the exclusivity of MPI. As a result, MPI is the place to truly leverage the many positive characteristics of one-sided communications for scalable HPC and the support of emerging communication patterns. The very specification of MPI one-sided communications needs a scalability-driven improvement.

Sometimes, latency propagation and serialization issues are not the consequence of how the communication is specified in the MPI standard; they are peculiar to the concrete realization of the specification. It is customary in MPI to enforce parallelism between communication and computation via nonblocking communication routines. A nonblocking communication is broken into an initiation phase, which is nonblocking, and a completion phase, which must block until the data transfer completes. The initiation phase, which corresponds to the actual communication routine call, could potentially exit before the data transfer completes. The data transfer is then expected to occur in parallel with any computation that is inserted between the two phases. The resulting parallelization between the data transfer and the inserted computation is called communication/computation overlapping. Communication/computation overlapping is meant to mitigate the communication latency. Thus, for a communication that lasts C_m and a computation that lasts C_p, the overall duration of both activities is the maximum of C_m and C_p instead of the sum of both, as is the case when serialization occurs. However, in many cases, even when communication/computation overlapping is expressed with the means provided by MPI at the application level, the data transfer is simply deferred internally to the completion phase and occurs after the computation; leading to the very serialization that the specified behaviour was intended to avoid. In particular, when the payload of a two-sided communication is beyond a certain size, a handshake-based protocol called Rendezvous is used to negotiate buffer availability before the data transfer occurs. In nonblocking communications, the Rendezvous protocol can fail to achieve communication/computation overlapping. Several solutions put forth various designs of the protocol, but each of them misses at least one scenario in which the workaround that it proposes fails to be applicable [16, 65, 80, 85, 95, 96]. Another category of solutions fixes the problem in a comprehensive way, but the very designs used by these solutions bear scalability issues because of their resource consumption patterns [56, 102]. The problem in this case resides in coming up with an adequate design of the Rendezvous protocol that is scalable and that fulfills two-sided communications without serialization.
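To make the overlap pattern concrete, the sketch below shows the canonical nonblocking two-sided idiom in standard MPI C; the message size, tag, partner rank and the compute_something() placeholder are hypothetical. Whether the transfer actually progresses while compute_something() runs is exactly the Rendezvous progression question discussed above.

#include <mpi.h>

extern void compute_something(void);   /* placeholder application kernel */

void overlap_sketch(MPI_Comm comm, double *sendbuf, double *recvbuf,
                    int count, int partner)
{
    MPI_Request reqs[2];

    /* Initiation phase: both calls return before the transfers complete. */
    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, /*tag=*/0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, /*tag=*/0, comm, &reqs[1]);

    /* Computation inserted between initiation and completion. Ideally the
     * payload moves in the background, so the total time is max(C_m, C_p)
     * rather than C_m + C_p; with a poorly progressed Rendezvous, the large
     * payload only moves inside MPI_Waitall and the two costs serialize.   */
    compute_something();

    /* Completion phase: block until both transfers are done. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}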

Another major source of scalability concerns in MPI resides in its message queues. Message queues are not particularly linked to any specified feature of MPI. However, they are internally very involved in the fulfilment of MPI communications. The impossibility for a process to consume in a timely fashion all the messages sent by the numerous peers running on other CPU cores, inside the same or on remote nodes, is just one of the many reasons why messages must be queued at receiving ends. Message queues are actually not specific to MPI; they are a fundamental mechanism used to work around the limited nature of resources such as memory buffers and other objects required for message reception and processing [6]. Message queues are used so frequently in MPI that they have been qualified as its most crucial data structures [49]. In fact, the performance of certain communication-intensive HPC jobs is simply determined by the underlying MPI message queue processing. Long message queue processing accounted for up to 60% of the communication latency in tests presented in [109]. It has been observed that MPI message queues grow with job sizes and scales [12, 13, 14, 49]. With the current and upcoming system scales, it has become imperative to approach MPI message queues in ways that account for the massive number of peers each process can potentially communicate with. In particular, a modern and scalability-conscious message queue must not only remain reasonably fast at extreme scales, but it must also account for the amount of memory that each CPU core can afford to reserve for an exponentially growing number of remote CPU cores to interact with.

Beyond all these specific, communication-related issues that could be a hindrance to scalability, one must also be concerned about all the non-communicating aspects of MPI. Systems running with MPI are delivering quadrillions of FLOPS nowadays. As the HPC community plans for exascale by the end of the current decade, there is consensus among authors that profound architectural and programming behaviour shifts will be required [22]. MPI has gone through many major scalability-induced shifts in the past; but there are challenges that have never been encountered in any of the previous shifts. An example of one such challenge is the reduced Mean Time Between Failures (MTBF) issue that is anticipated at exascale [22, 50]. What happens if, on average, large MPI-based HPC jobs cannot run to completion before the next failure occurs? Would those jobs then become impossible to run to completion on average? Another issue is resource exhaustion management. How does MPI impact the HPC application when it runs out of internal memory? In fact, the scalability challenges of MPI are not limited to how to reach larger scales; they also include what happens to HPC programs once they can run at those large scales. Certain corner cases of past and present scales will undoubtedly become normal operating conditions and will have to be dealt with, not as exceptional situations, but as regular HPC program execution events. Those situations are tagged “corner cases” because they are rare, difficult to reproduce and occur only in extreme conditions. For MPI to remain viable at extreme scales, those corner cases must nevertheless be revealed ahead of reaching exascale and provisioned for.

1.2 Problem Statement

Hardware manufacturers are realizing the amazing feat of periodically releasing supercomputers with faster network fabrics and larger CPU core counts. To the question of how to run an HPC program on those powerful and massive systems, the answer largely lies in MPI. MPI is undoubtedly changing along with the systems it runs on; but how timely or complete are those changes for the growing concern that is scalability? At a finer granularity, this dissertation answers the following questions:

1. How to approach the Rendezvous protocol of data transfer in a comprehensive way and without introducing new scalability issues?

2. Message queues being a central aspect of the MPI middleware, how to make sure that they remain fast when job sizes grow? What data structure design approach can fulfill that speed goal without exhibiting an impractical pattern of memory consumption?

3. How does MPI evolve its one-sided communication model to make it viable and attractive for HPC program use? In particular, how to solve the issue of blocking synchronization and remove the resulting serialization and latency propagation without introducing consistency hazards in the memory regions touched by one-sided communications?

4. Beyond its obvious duty of transferring data between the processes of an HPC job, what other forward-looking scalability-related traits does MPI still lack; and how to reveal those traits and justify their importance? For instance, what happens when MPI runs out of memory to create its internal objects? What happens when an isolated peer becomes unresponsive in a very large group of processes? What facility does MPI provide for the custom resiliency needs of programs whose lifetimes are expected to include various kinds of partial failures? How do the non-communication-related MPI features impact scalability?

1.3 Contributions

This dissertation addresses specific MPI issues which impact scalability. In Chapter 3, we show how efficient two-sided message progression in MPI falls short of expectations in spite of the existence of autonomously progressing HPC network devices. The discussion is based on two-sided data transfers, which is the main communication model covered by the MPI literature on the message progression issue. The message progression issue shows up only in certain scenarios. On the one hand, there are the existing solutions which always miss at least one of the problematic cases. On the other hand, the solutions which cover all the problematic scenarios tend to adopt a brute-force approach where their proposed fixes are applied even in the issue-free cases. As a consequence, the overhead associated with the fixes is incurred by the HPC application even when it is not justified. In Chapter 3, we approach the two-sided message progression with a proposal that brings the following contributions:

• A solution which can deterministically detect the specific scenarios where message progression fails to occur in a timely fashion. By applying a solution only in these scenarios, the overhead associated with our proposed fix is incurred only if it is justified.

• Memory scalability, thanks to the absence of a dedicated message progression thread for every single MPI process.

In Chapter 4, we show that message queue processing is very prevalent in MPI. Then we describe the impact of next-generation HPC performance-oriented features on how often MPI message queues are solicited. We expose the scalability shortcomings of mainstream message queue designs and show how they render the resources available per CPU core inadequate for MPI processes at large scales. In particular, the chapter documents previous message queue designs which quickly become unusable for large jobs because they are scalable with respect to only one of speed of operation or memory consumption, but not both. Then, we present a novel multidimensional MPI message queue design which solves the aforementioned scalability problem by considerably mitigating the effects of job size on both speed of operation and memory consumption. To the best of our knowledge, the work of Chapter 4 represents the first two-fold scalable message queue architecture of MPI. The proposed message queue puts forth the following contributions:

• Considerable reduction of the length of linear searches in MPI message queues. We actually ensure that linear searches grow considerably slower compared to job sizes.

• Deterministic detection and skipping of very large portions of queue items which are guaranteed to be irrelevant during searches. This dimensionality-based narrowing down of the search leads to very slow search performance degradation as system and job sizes tend towards extreme scales. A special case of this contribution leads to the early detection of unfruitful searches. Therefore, major latency reductions can be achieved by preventing a receiving process from incurring potentially large search penalties to no avail.

• Amortized memory consumption and a decreasing rate of memory consumption growth with respect to the queue size.

Chapter 5 presents an analysis of why MPI RMA falls short of meeting the one-sided communication expectations of the HPC community. We show the specific burdens that the MPI RMA synchronization model, which supports the epochs, puts on HPC applications. Chapter 5 brings the following contributions:

• Proposal of the first known MPI RMA improvement that completely removes the epoch serialization constraint, allowing one-sided communication lifetimes to effectively be nonblocking from start to finish.

• True decoupling between processes participating in MPI one-sided communications.

• First proposed correction to classes of latency issues inherent to MPI RMA. The serialized nature of MPI RMA epochs was the cause of six kinds of latency propagation issues first documented in [55] as the “MPI one-sided inefficiency patterns”. Four out of the six inefficiency patterns had no remedy or workaround prior to the work covered in Chapter 5.

• Discovery of a new class of latency issue inherent to MPI RMA as specified in MPI-3.0, the latest version of the MPI standard. The new issue, which is also an inefficiency pattern, is documented for the first time by the work covered in Chapter 5. A solution is brought to this new inefficiency pattern as well.

• Possibility to more efficiently realize in MPI a new class of HPC programs. In concrete terms, the proposed nonblocking RMA synchronizations allow HPC programs with dynamic and unstructured communication patterns to be expressed in MPI without the large performance hit created by high contention levels.

Finally, in Chapter 6, we create unusual conditions of use and reveal many of the limits of MPI at large scales. The experience leads to proposals to overcome these limits. The contributions of Chapter 6 are as follows:

• Presentation of the first known use of MPI as the network transport for extreme-scale distributed storage services.

• Demonstration of the scalability and safety tradeoff imposed by the blocking nature of non-communicating MPI routines. The proof leads to establishing the importance of nonblocking flavours of non-communicating MPI routines as the solution to certain cases of scalability issues that cannot be solved even if CPU cores were available in abundance in each node.

• Discussion of how the ability to cancel communications can be a powerful primitive for custom resiliency mechanisms in software running on top of MPI. By making a practical, experimentation-backed case for a cancellation-friendly MPI, we provide a conceptual approach to mitigating the MTBF issue anticipated for exascale systems.

• Creation by the storage service of conditions that naturally demonstrate at moderate scale the behaviour of major MPI implementations in situations of resource exhaustion. From the observed behaviours, we construct recommendations for MPI to sustain resource-demanding HPC jobs at large scales.

1.4 Dissertation Outline

The rest of this dissertation is organized as follows. Chapter 2 presents background material on the typical software and hardware layers that support HPC jobs. An emphasis is put on MPI, in its general structure as well as its specific features that are exploited later in the document. HPC network technologies are covered briefly as well. Chapter 3 gives an in-depth survey of Rendezvous-based message progression approaches in MPI and discusses methods that we investigated for improving message progress engines in certain scenarios. Chapter 4 provides a deep analysis of MPI message queues and shows why they are not adapted to current and future job scales. Then, a new message queue design, with scalability as a primary concern, is described. Chapter 5 discusses MPI one-sided communication as it is currently specified. Then, the chapter presents the proposals brought by this dissertation to close the gap between MPI and the HPC community's expectations. Chapter 6 presents our experience in using MPI in unconventional settings. The experience culminates in revealing a use case-backed list of scalability-oriented features that MPI currently lacks. Finally, Chapter 7 concludes and introduces future work.

Chapter 2

Background

An HPC cluster is composed of a large number of compute nodes interconnected via a high speed network. From a hardware perspective (Figure 2.1), the memory hierarchy of a cluster has one additional level compared to a standalone workstation. For each node, the additional memory level is the remote memory of each of the other nodes; and since it is external, it is reached via the high speed network. Data transfer between memories of distinct nodes is expressed in the form of communication to the application-level messaging middleware (Figure 2.2), such as MPI, which is generic and network-device agnostic. The communication is fulfilled by the network device which is reached via a lower-level network-specific messaging middleware. In the stack, every level can resort to generic system software mechanisms such as memory allocation.

Each node can host one or several tasks, depending mostly on the number of CPU cores that it possesses. The unit of task execution in an MPI context is a process. From an Operating System (OS) point of view though, the unit of execution is a thread: an entity that requires exactly a single CPU core to execute without contention. By default, a process is single-threaded; that is, it possesses a single thread and requires a single CPU core. Processes can also be multithreaded, in which case they host several threads and can exhibit parallelism internally. The following subsections describe the components of the stack as well as HPC network technologies.


Figure 2.1: HPC cluster: Intra-node and out-of-node memory view


Figure 2.2: HPC cluster: software stack view

2.1 Message Passing Interface (MPI)

Application-accessible messaging middleware is used to shield the HPC programmer from many system- and network-level intricacies. In supercomputing, messaging middleware typically uses the message passing model. MPI, which is the convergence of a few early messaging models, drives the advances in supercomputing thanks to its large prevalence as the HPC programming paradigm.

MPI is a standardized paradigm. The specification, maintained by the MPI Forum [68], defines a language-independent Application Programming Interface (API) focused on portability. A binding is explicitly specified for C and Fortran, but implementations are possible in other languages. The current version 3.0 of the MPI specification was released in September 2012 to supersede the previous MPI 2.2 standard. MPI-3.0 is the maturation of 18 years of effort following the release of the first MPI standard in 1994. MPI enforces data locality-conscious programming by allowing only explicit communications between processes.

MPI manages processes in sets called groups. Groups are not usually manipulated in communications; instead, the programmer is presented with communicators, which define a communication scope or universe where each belonging process is uniquely identified by a rank. A rank is meaningful only in the scope of a communicator. In fact, the same process can have as many distinct ranks as the number of communicators it belongs to. Communicators can be of two types: intracommunicators and intercommunicators. An intracommunicator is conceptually no different from the group mapped to it. An intercommunicator is made of two groups that act exclusively as source or destination universes in a communication. The MPI application programmer can create communicators on demand; and each user-created communicator can contain any non-empty subset of the existing processes. However, MPI provides the following predefined communicators, which always exist by default in any job:

• MPI_COMM_WORLD: A communicator whose group encompasses all the processes.

• MPI_COMM_SELF: A communicator whose group is made only of the invoking process. Each process has its own MPI_COMM_SELF.

• MPI_COMM_NULL: The empty communicator.

An important mechanism of any MPI implementation is the progress engine, which ensures that any issued communication is handled to completion. The MPI progress engine takes care of transferring data as well as detecting and expressing transfer completion. Three MPI implementations have found widespread acceptance in the HPC community for being freely available at source level for research and production. They are MPICH [69], MVAPICH [71] and Open MPI [79]. MPICH, maintained by the Argonne National Laboratory, is the reference implementation and the base source code for many other MPI implementations optimized for some specific high performance interconnects. Custom versions of MPICH are used on the large Blue Gene [28] and Cray supercomputers [36]. Many commercial MPI implementations such as Microsoft-MPI and Intel-MPI are also MPICH-derivatives. MVAPICH, maintained by the Ohio State University, is an MPICH-derivative which targets openly-specified modern network technologies such as iWARP [86] and InfiniBand [44]. Open MPI is an attempt to concentrate in a single distribution the advantages of three previous MPI implementations, namely LAM/MPI, LA-MPI and FT-MPI [26]. Although it has been around only since 2006, Open MPI has found an important level of adoption due to the widespread use of some of its building block implementations such as LAM/MPI [25].

This convention uses all uppercase letters for language-agnostic MPI functions and constants.

For instance the communicator creation routine is written MPI_COMM_CREATE. When a C binding is explicitly used, only thefirstfive letters of routines are capitalized. For instance, the C version of the communicator creation routine is written MPI_Comm_create. Constants are still all uppercase in the C binding.

2.1.1 Datatypes and Derived Datatypes in MPI

Communications in MPI transfer arrays of data. The payload is thus always described by a datatype and a count. MPI provides many predefined datatypes which map to regular datatypes in programming languages such as C or Fortran. Examples of predefined MPI datatypes are MPI_CHAR, MPI_WCHAR, MPI_SHORT, MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_

LONG, MPI_UNSIGNED_LONG, etc. MPI even defines an address-sized type, MPI_AINT, that can be used in memory offset computations.

Sometimes however, certain messages cannot nicely be described as arrays of simple datatypes.

Thus, MPI allows the programmer to define arbitrary datatypes by composing existing ones.

Those custom types are referred to as derived datatypes. MPI allows the creation of contigu- ous and noncontiguous datatypes to describe messages with arbitrary gaps in their spanning memory addresses. A derived datatype can be made of other derived datatypes at an arbitrary level of nesting. Derived datatypes can be expressed as structures, vectors, indexed or hin- dexed types. An indexed datatype allows replication of an existing datatype, derived datatype included, into a sequence of blocks. The offset of each block in the new derived datatype is expressed as a multiple of the composing datatype size. An hindexed datatype is similar to an

16 indexed datatype; except that the displacements are expressed in bytes.

2.1.2 Two-sided Communications

A two-sided communication, simply termed point-to-point or two-sided, involves only two

processes and a send/receive semantics. A point-to-point message is uniquely identified at

the sender side by the triple communicator, receiver rank, tag ; and by communicator, � � � sender rank, tag at the receiver side. The triple provided by the receiver call must match � the one provided by the sender for a two-sided communication to complete. However, in-

stead of providing a specific source rank or tag, the receiver can provide MPI_ANY_SOURCE or

MPI_ANY_TAG which are wildcards meant to respectively match any sender process in the same

communicator or any tag. Wildcards allow the expression in MPI of communication patterns

required by certain applications.

MPI specifies blocking and nonblocking send/receive operations. A blocking send (e.g.,

MPI_SEND) will not return until the data buffer being sent is safe to modify without compro-

mising the content meant to be delivered. Safe to modify might mean that the data is buffered

into some user-inaccessible system or middleware buffer and still pending transmission. Non-

blocking sends/receives allow an MPI application to express communication/computation over-

lapping. A nonblocking send (e.g., MPI_ISEND) returns immediately but the programmer is

responsible for guaranteeing that the application’s sent data buffer is not modified until it is

safe to do so. A blocking receive (MPI_RECV) blocks until the data is received. A nonblocking

receive (MPI_IRECV) exits immediately but transfers to the programmer the responsibility of

guaranteeing the integrity of the receive buffer until the reception completes. For both send

and receive, the process can check, if required, when a pending nonblocking transfer has com-

pleted. Nonblocking two-sided must always be paired with one of the test or wait family

of routines described in Section 2.1.5. Both blocking and nonblocking send operations are

available in differentflavours in order to either control system buffer use or to minimize the

out-of-sync problems inherent to matching send/receive requests.

MPI implementations usually put forth two protocols for point-to-point message transfers. The first one, the Eager protocol, allows the sender to buffer the message and proceed without waiting for a late receiver (Figure 2.3(a)). More generally, a message is said to be sent eagerly when the receiver is not notified of its arrival ahead of time. The Eager protocol is suitable only for small messages because it does an intermediate copy (step b in Figure 2.3(a)). Buffer copy is expensive for large messages. Additionally, middleware buffers are limited and cannot accommodate large messages. As a consequence, a handshake (steps x and y in Figure 2.3(b)) is required for large messages so that the peer taking care of the transfer knows both the source and final destination addresses and can perform a single and final message copy (step z in Figure 2.3(b)). The arrows x and y in Figure 2.3(b), called control messages, can bear various information such as the aforementioned buffer addresses. A control message is an internal message generated to ease MPI operation; it does not transfer application-level data. The handshake-based approach of Figure 2.3(b) defines the second point-to-point protocol, named Rendezvous.

Figure 2.3: Two-sided message transfer protocols. (a) Eager two-sided transfer; (b) Rendezvous two-sided transfer. Legend: a = sender-side buffering; b = Eager message transfer and receive-side buffering; c = copy to final destination; x, y = handshaking; z = direct Rendezvous data transfer.

Figure 2.3(a) purposely shows a late receiver so as to emphasize the usefulness of the Eager protocol. The receiver being early does not change the steps of the Eager protocol. As for Figure 2.3(b), it depicts a very simplified version of the Rendezvous mechanism; the goal here is simply to distinguish between the two protocols. The actual engineering of the Rendezvous protocol can be complex; a deeper coverage of the Rendezvous-based point-to-point communication is deferred to Chapter 3. We emphasize that both Eager and Rendezvous data transfer approaches are middleware-level protocols; they are not visible at the application level.

2.1.3 Collective Communications

Collective operations, simply called collectives, involve all the processes in the communicator over which they occur. Collectives define three different semantics, namely, collective communication-computation (e.g., MPI_REDUCE), data distribution (e.g., MPI_BCAST) and synchronization (e.g., MPI_BARRIER). A collective communication/computation performs some distributed computation along with the data distribution. A collective synchronization does not transfer any application-level data; it simply guarantees that all the processes of a communicator have reached a certain point in the execution. Collective operations are also offered in different flavours characterized by whether the same amount of data is sent to all the participating processes. Some collective operations can be rooted, in which case a single process, called root, sends to or receives from all the other ones in the communicator. If a collective operation is not rooted, all the participating processes are senders and receivers at the same time. Collectives can also be blocking or nonblocking. Finally, a class of MPI communications, called neighbourhood collectives, allows collectives to occur over a group of processes whose layout or topology is known. The goal of neighbourhood collectives is to allow the MPI middleware to optimize the scheduling of the rounds of messages that make up the collective.
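The short C sketch below illustrates the three collective semantics plus a nonblocking variant; the root rank and the payloads are arbitrary choices made only for the example.

    #include <mpi.h>

    void collective_examples(MPI_Comm comm)
    {
        int rank, value, sum, bcast_buf = 0;
        MPI_Comm_rank(comm, &rank);
        value = rank;

        /* Data distribution: the root (rank 0) broadcasts to every process. */
        if (rank == 0) bcast_buf = 123;
        MPI_Bcast(&bcast_buf, 1, MPI_INT, 0, comm);

        /* Communication-computation: rooted reduction of one int per process. */
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, comm);

        /* Synchronization: no application-level data is transferred. */
        MPI_Barrier(comm);

        /* Nonblocking variant: must be completed with a wait/test routine. */
        MPI_Request req;
        MPI_Ibarrier(comm, &req);
        /* ... independent computation ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }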

2.1.4 One-sided Communications

The MPI standard introduced its one-sided communication model in 1997. MPI RMA communications happen over objects called windows which define 1) the memory region that each process intends to expose for remote access and 2) the communication scope encompassing a set of processes.

In MPI RMA, the process reading or writing the remote memory is called the origin while the one whose memory is being acted on is called the target. MPI RMA occurs in critical section-like regions called epochs. An epoch is defined by a pair of opening and closing synchronization constructs. When an epoch is over, all its contained RMA communications are guaranteed to have completed for the concerned process. In one-sided communications, the origin process specifies all the communication parameters while the target process remains passive.

MPI RMA Synchronizations

MPI defines two classes of RMA synchronizations, namely active target where the target explicitly opens an epoch to match origin-side epochs; and passive target where the target does not even need to make epoch calls. There are two kinds of active target epochs; and they are Fence and General Active Target Synchronization (GATS). All three kinds of epochs are shown in Figure 2.4. The MPI_PUT and MPI_GET calls shown in Figure 2.4(a) are examples of RMA communication calls; they are covered later in this section. In a fence epoch (Figure 2.4(a)), each participating process is simultaneously origin and target and can issue RMA communications to any peer in the communication universe covered by the window object. A single fence can also serve as both closing an epoch and opening the next one. For instance the sequence Fence0 − Epoch0 − Fence1 − Epoch1 − Fence2 is correct; and means that Fence1 closes Epoch0 and opens Epoch1 at the same time. For the sake of reference later in the document, we designate such a unique call which separates two adjacent fence epochs as a middle fence. MPI allows assertions meant to guarantee that a fence call is not a middle fence. MPI_MODE_NOPRECEDE, when provided with a fence call, informs the middleware that the routine only opens a new epoch. MPI_MODE_NOSUCCEED tells the middleware that the current fence call only closes a previously opened epoch.
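A minimal sketch of a fence-delimited sequence is given below; the window, the peer rank and the buffer are assumed to have been created elsewhere, and the assertions mirror the middle-fence discussion above.

    #include <mpi.h>

    /* Two adjacent fence epochs separated by a middle fence. */
    void fence_epochs(MPI_Win win, int *local_buf, int peer)
    {
        /* This fence only opens an epoch: no RMA call precedes it. */
        MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
        MPI_Put(local_buf, 1, MPI_INT, peer, /*target_disp=*/0, 1, MPI_INT, win);

        /* Middle fence: closes the first epoch and opens the second one, so
           no assertion restricting preceding/succeeding RMA can be given. */
        MPI_Win_fence(0, win);
        MPI_Get(local_buf, 1, MPI_INT, peer, /*target_disp=*/0, 1, MPI_INT, win);

        /* This fence only closes the last epoch: no RMA call follows it. */
        MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
    }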

In a GATS synchronization, in a given epoch, a process is either origin or target but not both. In Figure 2.4(b), Process0 and Process2 are origins. Each of them opens an access epoch towards a group of targets. In Figure 2.4(b), the target group is only made of Process1 which opens an exposure epoch for a group of origins. In a GATS synchronization, an origin can only communicate with the group of processes specified at epoch opening time (MPI_WIN_START).

Figure 2.4: MPI RMA synchronizations (inspired by [105]). (a) Fence epoch; (b) General active target synchronization (GATS) epoch; (c) Lock/unlock epoch.

Similarly, a target can only be accessed by the processes specified in its epoch opening routine (MPI_WIN_POST). Those groups must be non-empty subsets of the overall group encompassed by the window object. The origin-side epoch is closed with MPI_WIN_COMPLETE while the target-side epoch is closed with MPI_WIN_WAIT. These epoch-closing routines are blocking and do not exit until all data transfers are completed locally for the concerned process. However, the target can test for completion with MPI_WIN_TEST which is nonblocking. A few assertions allowed for MPI_WIN_START and MPI_WIN_POST can enable the middleware to trigger some optimizations. For instance the MPI_MODE_NOPUT assertion, meant for MPI_WIN_POST, guarantees that the target epoch will only be read from; not written into.
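The following sketch mirrors a GATS epoch in the spirit of Figure 2.4(b); the window, the process groups and the target rank are assumed to be created elsewhere and are purely illustrative.

    #include <mpi.h>

    void gats_epoch(MPI_Win win, MPI_Group origin_grp, MPI_Group target_grp,
                    int is_target, int target_rank, int *buf)
    {
        if (is_target) {
            /* Exposure epoch; MPI_MODE_NOPUT could be asserted here if the
               origins were only going to read the window. */
            MPI_Win_post(origin_grp, 0, win);
            /* ... local computation while the origins access the window ... */
            MPI_Win_wait(win);                  /* blocks until exposure is over */
        } else {
            MPI_Win_start(target_grp, 0, win);  /* access epoch toward the targets */
            MPI_Put(buf, 1, MPI_INT, target_rank, /*target_disp=*/0, 1, MPI_INT, win);
            MPI_Win_complete(win);              /* blocks until locally complete  */
        }
    }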

Passive target epochs emulate shared memory semantics. Since the target does not issue any synchronization call at all, it is truly unaware of being accessed for RMA (Process1 in Figure 2.4(c)). An origin process in a passive target epoch can lock a remote target either exclusively or in a shared fashion. An exclusive lock to a target is granted only if the target was not already locked via the same RMA window. Once granted, an exclusive lock does not allow any other process to lock the same target in the same window object. Shared locks allow a target to be locked by multiple peers simultaneously. A process can lock/unlock a single process with a pair of MPI_WIN_LOCK, MPI_WIN_UNLOCK calls. It is also possible to lock/unlock all the processes in an RMA window with a pair of MPI_WIN_LOCK_ALL, MPI_WIN_UNLOCK_ALL calls. Unlike MPI_WIN_LOCK, MPI_WIN_LOCK_ALL can only request shared locks.

Prior to MPI-3.0, the completion of RMA communications could not be detected before the closing synchronization call of the hosting epoch exits. In MPI-3.0, RMA communications occurring inside a passive target epoch can be completed without closing the epoch, by using one of a set of flush routines. MPI-3.0 also adds to passive target the support of request-based RMA communications whose completion can be detected inside the epoch via the nonblocking completion mechanism described in Section 2.1.5.
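A minimal passive target sketch follows; the window, the target rank and the buffer are illustrative, and the flush and request-based calls assume an MPI-3.0 library.

    #include <mpi.h>

    void passive_target_epoch(MPI_Win win, int target, int *buf)
    {
        MPI_Request req;

        MPI_Win_lock(MPI_LOCK_SHARED, target, /*assert=*/0, win);

        MPI_Put(buf, 1, MPI_INT, target, /*target_disp=*/0, 1, MPI_INT, win);
        /* MPI-3.0: complete the put without closing the epoch. */
        MPI_Win_flush(target, win);

        /* Request-based communication: completion can be detected selectively. */
        MPI_Rput(buf, 1, MPI_INT, target, 0, 1, MPI_INT, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Win_unlock(target, win);
    }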

MPI RMA Communication Routines

There are three categories of MPI one-sided communications to allow an origin process to respectively store, load and accumulate data in the target. MPI_PUT and MPI_GET respectively fulfill the first two categories. An accumulate function uses the data provided by the origin as an operand; and the data at the destination in the target as a second operand. At the end of the accumulation, the operand in the target is overwritten with the result of a specified operation. The functions in the accumulate category are as follows:

• MPI_ACCUMULATE: This function performs an accumulation on operands which are vectors of arbitrary size. The function must specify the operation to perform from a set of predefined operations allowed by MPI. This function can be atomic when certain conditions are met.

• MPI_GET_ACCUMULATE: This function first fetches the unmodified target operand into an origin buffer which is disjoint from the origin operand buffer; and then performs an accumulate as described for MPI_ACCUMULATE.

• MPI_FETCH_AND_OP: This function behaves like MPI_GET_ACCUMULATE; but the operands must be of a predefined MPI datatype and the count must be 1. MPI_GET_ACCUMULATE can operate on data of arbitrary size and is therefore heavyweight. In certain situations, the programmer might want to perform the equivalent of the C "+=", "-=", "++", "--", etc. on a single datum of a primitive type such as int or double. In those cases where the datum is hosted remotely in the address space of the target, MPI_FETCH_AND_OP is provided as a faster, lightweight alternative to MPI_GET_ACCUMULATE (a sketch follows this list). This function is always atomic.

• MPI_COMPARE_AND_SWAP: This function swaps the target data with an operand provided by the origin only if the target value meets a certain condition. The function is atomic as well and can only operate on a subset of the predefined MPI datatypes, such as the integer and byte families of types. The count must be 1 as well.
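The sketch announced in the MPI_FETCH_AND_OP item follows; the arguments are illustrative and it assumes the window exposes at least one MPI_INT at displacement 0 on the target.

    #include <mpi.h>

    void remote_atomics(MPI_Win win, int target)
    {
        int one = 1, fetched = 0;
        int compare = 0, swap_val = 7, old_val = 0;

        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        /* Equivalent of an atomic remote "+= 1" that also returns the old value. */
        MPI_Fetch_and_op(&one, &fetched, MPI_INT, target, /*disp=*/0, MPI_SUM, win);

        /* Write swap_val into the target only if the target currently holds the
           value in "compare"; the previous target value is returned in old_val. */
        MPI_Compare_and_swap(&swap_val, &compare, &old_val, MPI_INT,
                             target, /*disp=*/0, win);

        MPI_Win_unlock(target, win);
    }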

For passive target epochs only, MPI provides request-based communication routines. Those routines are MPI_RPUT, MPI_RGET, MPI_RACCUMULATE and MPI_RGET_ACCUMULATE. MPI RMA communication functions are implemented as nonblocking. So, after MPI_PUT exits for instance, there is no guarantee that the data transfer has even started, let alone completed. As a consequence, the origin buffer being operated on by MPI_PUT cannot yet be safely altered. The completion is guaranteed only after the epoch hosting the RMA communication is closed or after at least a subsequent flush function exits. Request-based RMA routines relax that constraint by returning a request which can allow selective completion detection by resorting to one of the test or wait family of routines (Section 2.1.5).

2.1.5 The Nonblocking Communication Handling in MPI

MPI nonblocking communications occur in two phases, namely initiation and completion. The initiation corresponds to the moment where the operation is issued. It is always nonblocking and returns a REQUEST object which is used later to detect completion. Completion can be detected in a blocking way by any of the wait family of MPI functions as follows:

• MPI_WAIT: Waits for a single request to complete.

• MPI_WAITANY: Waits for any single request to complete. This function takes an array of requests associated with a set of previously initiated operations. If more than a single request has completed, it arbitrarily reports one of them.

• MPI_WAITSOME: Waits for at least one request to complete. It reports all the completed requests. It takes an array of requests as well.

• MPI_WAITALL: Waits for the completion of all the requests passed in argument.

The nonblocking completion detection is provided as well through the test family of MPI functions. The test functions exit immediately and return a flag which reports the completion status. The test functions (MPI_TEST{ANY|SOME|ALL}) are provided in the same flavours as their wait equivalents. The test and wait family of functions are used uniformly for all kinds of request-based nonblocking functions of MPI.
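A minimal completion-detection sketch follows; the array of requests is assumed to have been filled by previously initiated nonblocking operations.

    #include <mpi.h>

    void complete_requests(MPI_Request reqs[], int n)
    {
        int all_done = 0;

        /* Nonblocking detection: interleave polling with useful work. */
        while (!all_done) {
            MPI_Testall(n, reqs, &all_done, MPI_STATUSES_IGNORE);
            /* ... a slice of computation between polls ... */
        }

        /* Blocking equivalent: a single call that only returns once every
           request has completed (redundant here, shown for comparison). */
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }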

2.1.6 Generalized Requests and Info Objects

The MPI specification provides external interfaces meant for extending the features natively offered by the standard. The main feature of the external interfaces is the generalized requests mechanism which allows the creation of arbitrary routines at application level by leveraging other MPI routines or functions which are not natively specified by MPI. For instance, a programmer can craft a new collective adapted for some uncommon use case by piecing together other existing collectives or two-sided MPI functions. The custom routines can even be made nonblocking and leverage the usual wait and test family of MPI functions described in Section 2.1.5. The generalized request mechanism can be seen as a skeleton that can be instantiated and completed with user-specified hooks or function pointers to build an add-on. The add-on can then be used by the MPI progress engine to drive the custom functions created by the programmer.

MPI offers many other features meant for providing flexibility. One such feature, the info object, is used to provide optimizing hints to MPI routines. Certain MPI routines have default behaviours which are justified most of the time. In a few cases, these default behaviours are too constraining and can therefore be overridden with a relaxing hint that is provided with an info object. An info object is of type MPI_INFO. It contains an unordered set of key-value pairs. The key can be used to define the name or type of a hint while the value is used to specify the actual hint or activation condition for the hint. Routines accepting info objects are predefined in the MPI specification. For each of those routines, the specification can sometimes provide predefined key-value pairs for the info object as well. The specification does not require an MPI implementation to support a predefined key-value pair. However, it does require implementations to either 1) silently ignore predefined key-value pairs if they are not supported or 2) provide the exact behaviour specified for them if they are supported. Stated differently, a predefined key-value pair cannot be implemented with a behaviour different from the one specified in the MPI standard; and the MPI middleware cannot abort because an unsupported key-value pair is provided. A specific MPI implementation can also define new info object key-value pairs to provide optimizations only relevant to the specific network device it runs on.
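As an illustration, the sketch below attaches the predefined no_locks key to a window creation; the window arguments are illustrative only.

    #include <mpi.h>

    void create_window_with_hint(void *base, MPI_Aint size, MPI_Comm comm,
                                 MPI_Win *win)
    {
        MPI_Info info;

        MPI_Info_create(&info);
        /* Hint: the application promises never to use passive target epochs
           on this window, which may let the middleware skip some bookkeeping. */
        MPI_Info_set(info, "no_locks", "true");

        MPI_Win_create(base, size, /*disp_unit=*/1, info, comm, win);

        /* The info object can be freed once it has been consumed. */
        MPI_Info_free(&info);
    }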

2.1.7 Thread Levels

By default HPC jobs initialize the MPI runtime with the function MPI_INIT, which assumes that each process has a single thread. When processes in an HPC job require multithreading, a separate routine MPI_INIT_THREAD must be used instead to initialize the MPI runtime.

The MPI specification defines four thread levels that the application can request from the MPI runtime at initialization time with MPI_INIT_THREAD. The thread levels inform the MPI runtime of the threading expectations of the application with respect to communications. The thread levels are monotonic and specified in increasing order as follows:

• MPI_THREAD_SINGLE: Each process has only one thread.

• MPI_THREAD_FUNNELED: Each process can be multithreaded but only the thread which initialized the MPI runtime can make MPI calls.

• MPI_THREAD_SERIALIZED: A process can be multithreaded but threads can only take turns to issue MPI calls.

• MPI_THREAD_MULTIPLE: Processes are free to multithread their executions as they see fit. MPI guarantees safety even in situations of concurrent MPI calls. However, the application programmer is responsible for preventing races when several threads make conflicting MPI calls.

A high quality MPI implementation is expected to implement all the thread levels; but it is not a requirement. As a result, MPI_INIT_THREAD receives as input argument the thread level requested by the application; and outputs the actual thread level provided by the middleware.

When an MPI implementation does not support the requested thread level, a call to MPI_INIT_THREAD provides the highest supported level right below the requested one.
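A minimal initialization sketch follows; requesting MPI_THREAD_MULTIPLE and merely printing a warning on a downgrade are illustrative choices.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            /* The middleware only offers a lower level; the application must
               restrict its threading behaviour accordingly (or abort). */
            fprintf(stderr, "only thread level %d is provided\n", provided);
        }

        /* ... application ... */

        MPI_Finalize();
        return 0;
    }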

2.2 High Performance Interconnects

High performance interconnects link the cluster nodes and allow them to communicate. They must offer low latency and high bandwidth. The latency of an interconnect is usually viewed as the fixed starting cost of transmitting anything from one end of the network to the other. It is common to judge the 0-byte or 1-byte latency of the network by measuring how long it takes to ship an empty packet or a 1-byte datum respectively. The bandwidth represents how many bits or bytes, on average, the network can transmit per second. There are two bandwidth metrics. The signaling rate is the raw bandwidth achieved on the physical link. The data rate is the bandwidth without the control headers meant for management tasks such as switching, routing or packet reassembly. The data rate is the maximum possible bandwidth that the application can experience. Latency is a key metric in HPC interconnects because most communications sent by HPC applications do not contain enough data to saturate the link for the bandwidth to matter. For instance, the performance of atomic operations like MPI_FETCH_AND_OP or MPI_COMPARE_AND_SWAP is very latency-sensitive. In order to provide low latency, HPC network technologies minimize as much as possible 1) the time it takes for the data to leave the application memory and reach the wire at the data origin side; and 2) the time it takes for the data to leave the wire and reach the application buffer at the receiving side. For that purpose, HPC networks provide mechanisms like Remote Direct Memory Access (RDMA) which accomplish zero-copy and OS-bypass. Zero-copy avoids intermediate copies of the data by giving the Network Interface Card (NIC) direct access to virtual memory (Figure 2.5(b)). In comparison, the widespread Transmission Control Protocol (TCP) first copies the data from the application buffer to some internal OS buffer; then into some device buffer before reaching the wire (Figure 2.5(a)). OS-bypass avoids the costly context switch associated with using kernel-level drivers to reach the NIC. Furthermore, RDMA-enabled NICs usually do not require CPU involvement to stream data on the link. The node CPU, usually designated as the host CPU, is therefore available for other computations; which could potentially be overlapped with ongoing communications.

Figure 2.5: Zero-copy and CPU involvement (Adapted from [66]). (a) Conventional network device; (b) Autonomous network device.

Interconnects come in proprietary and commodity forms. A proprietary interconnect is built for a specific supercomputer or family of supercomputers. It usually does not have an open specification. Examples of proprietary interconnects are TH Express-2 [21], the Gemini network [101], the Blue Gene/Q interconnect [15] and the Tofu network [67]; respectively for the four fastest supercomputers as of November 2013. Commodity interconnects support network devices that can be used in any machine which has regular workstation peripheral expansion buses like the Peripheral Component Interconnect Express (PCIe). Commodity NICs can require specific versions of those expansion buses (e.g., PCIe 3.x) for their full potential to be delivered. The common commodity interconnects in use in HPC are 10 Gigabit Ethernet (10 GigE), Myrinet [72], iWARP [86] and InfiniBand [44]. InfiniBand is the most prevalent commodity interconnect in the 500 most powerful supercomputers as of November 2013 [106]. Since InfiniBand is the main interconnect used in the rest of the document, it deserves a bit of in-depth coverage.

2.2.1 InfiniBand: Characteristics, Queueing Model and Semantics

The InfiniBand Trade Association (IBTA) was created in 1999 to maintain the InfiniBand Architecture (IBA) specification [44]. The IBA specification covers everything from the physical interconnect to the routines exposed in software to interact with the IBA. The standard does not impose any API; instead, it defines an abstract description of a set of functionalities that must be present to interact with the network interface. These abstractions, called verbs, can be represented with a combination of software and hardware means at the convenience of the manufacturer. IBA is characterized by an end-to-end latency that can be lower than 1 µs. An InfiniBand network can operate at Single Data Rate (SDR), Double Data Rate (DDR), Quad Data Rate (QDR), Fourteen Data Rate (FDR) or Enhanced Data Rate (EDR) for respectively 2 Gb/s, 4 Gb/s, 8 Gb/s, 13.64 Gb/s and 25 Gb/s per lane. Additionally, the network can operate over 1, 2, 4 or 12 lanes for each data rate; pushing the maximum data rate to 300 Gb/s for 12-lane EDR.

InfiniBand allows the user to queue up Work Queue Entries (WQEs) that the network device executes when it becomes available. A WQE is posted in either a send queue or a receive queue depending on the transfer direction it describes. Send and receive queues are always created and managed as a single entity called a Queue Pair (QP). The network device, called Host Channel Adapter (HCA), executes the WQEs in the order in which they are posted; but it does not wait for the completion of a WQE before consuming another one. Thus, the execution of multiple WQEs from the same or distinct queues can be simultaneously pending. When a WQE is completely progressed, the HCA might generate a Completion Queue Entry (CQE) that it posts in the adequate Completion Queue (CQ). The client associates each send or receive queue with any CQ at will, without restriction of sharing. CQs are typically polled to test for completion but the client can request an event generation upon the posting of select CQEs. Completion notifications can thus be consumed in either a polling-based or an interrupt-based fashion.
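As an illustration of this queueing model, the following sketch drains a CQ with the verbs interface; the CQ is assumed to have been created elsewhere (e.g., with ibv_create_cq) and the batch size is arbitrary.

    #include <infiniband/verbs.h>

    static int drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc[16];   /* batch of work completions */
        int n, i;

        /* Polling-based consumption: returns immediately, possibly with 0 CQEs. */
        n = ibv_poll_cq(cq, 16, wc);
        for (i = 0; i < n; i++) {
            if (wc[i].status != IBV_WC_SUCCESS) {
                /* the corresponding WQE failed */
                return -1;
            }
            /* wc[i].wr_id identifies the WQE that completed */
        }

        /* Interrupt-based alternative: arm the CQ so that the next CQE raises
           an event on the completion channel instead of requiring polling. */
        ibv_req_notify_cq(cq, /*solicited_only=*/0);
        return n;
    }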

Each receive WQE has a buffer associated with it to get the incoming data. With each QP having its own receive queue, a process must deal with n separate receive queues if it has n InfiniBand peers. It is not always possible to predict the number of incoming messages from each remote peer; and each process must either conservatively post receive WQEs in excess in each receive queue or risk exhaustion of the queue. To circumvent that issue, InfiniBand introduced the Shared Receive Queue (SRQ) mechanism that allows several QPs of a given process to pool receive buffers in the same receive queue; eliminating the need to guess the amount of transmission coming from each specific peer.

InfiniBand proposes two communication semantics. In the channel semantics embodied by send/receive operations, a party pushes data toward the other party which is thus responsible for determining where the data ultimately lands. The channel semantics requires the receiver to prepost receive descriptors (WQEs) in either the receive queue or the SRQ. The receive descriptor contains a buffer address, among other things. With n receive WQEs posted by the receiver, the sender can initiate a maximum of n channel semantics communications on the fly without any further receive-side buffer address information. The receiver cannot predict which receive descriptor will be used by each expected communication. As a consequence, it is not easy at application level to predict the address where the data of a specific communication will first be received in the destination virtual memory. In the memory semantics, the remote peer must first pin down the physical frames of a virtual address range meant to be exposed; and then communicate an access key to the initiating party. The memory semantics communication occurs one-sidedly. With a single key, the initiating party can issue any number of communications as long as the involved remote addresses are in the limits of the pinned memory frames associated with the key. This second operating mode, called RDMA, offers RDMA Write and RDMA Read respectively for storing to and loading from the remote memory. With the memory semantics, the non-initiating party knows at application level the range of virtual addresses where the data will first be copied to or from.

2.2.2 The Lower-level Messaging Middleware

MPI is layered on top of lower-level messaging middleware. These lower-level middleware expose the features of the interconnect hardware as a software interface. Examples of such middleware are the Parallel Active Messaging Interface (PAMI) [57] for Blue Gene/Q, the user Generic Network Interface (uGNI) [101] for Gemini networks, Myrinet Express (MX) [73] for Myrinet or the OpenFabrics Enterprise Distribution (OFED) [78] for iWARP and InfiniBand. Low-level messaging middleware can also be implementations of openly specified interfaces or behaviours that are not network technology-specific. For instance, although Portals [6] can be directly implemented in hardware, it can also have implementations on top of other lower-level messaging middleware such as OFED or MX.

2.3 Summary

This chapter covers HPC cluster architectures. We present the memory hierarchy as viewed by each process of an HPC job. We show how communication intervenes to allow access to the level of the memory hierarchy which is just below the physical memory of each node. The software stack which allows communication is also presented. We define MPI to the extent required by the upcoming chapters. HPC interconnection networks and their supporting messaging middleware are introduced as well.

We emphasize that clusters are specifically covered in this dissertation only for being the typical and dominant supercomputer architecture. The research presented in the following chapters is not tied to clusters; in fact all the upcoming proposals apply to any supercomputer capable of running MPI, that is, any contemporary supercomputer.

Chapter 3

Scalable MPI Two-Sided Message Progression over RDMA

Two-sided communications perform explicit matching, even at application level. A send in any process Pi requires a receive in some other process Pj, and vice versa. Message progression in situations of matched communications between two peers must involve both progress engines.

With the Eager protocol of two-sided message transfer, the sender process does not wait for the receiver; the message is sent to some temporary internal buffer at the receiver side. As a result, the sender does not suffer any delay or latency propagation because of the receiver. For eagerly sent messages, this absence of receiver impact on the sender exists for both blocking and nonblocking communications. More rigorously, this absence of impact exists as long as the receiver-side middleware does not run out of internal buffers to temporarily host eagerly transferred data. In certain MPI libraries, internal buffer exhaustion could lead to job abortion. The MPI library could also set a flow control policy that forces the sender to wait for the receiver side to consume some of the backlog of eagerly stacked messages. In general, a sender waiting for a receiver with the Eager protocol can reasonably be considered a corner case. As for a receiver that resorts to the Eager protocol, no internal interaction occurs with the sender either, unless the aforementioned corner case occurs. Under normal conditions, the receiver just picks the message from inside the middleware if the message arrives before the receive-side process reaches the call site. If the receive call is made before the message arrives, the receiver must wait if the call is blocking. In summary, the Eager protocol does not require both progress engines to interact for the communication to complete at both ends.

The constraints are different for the Rendezvous protocol. In fact, because the Rendezvous protocol contains a form of handshaking, it requires at least one internal communication in each direction for the application-level communication to complete. As a result, the two progress engines are not simply involved in the communication, they must interact. The immediate consequence of this interaction is an increased potential for both processes to delay each other.

In the blocking Rendezvous, both processes are trapped in their respective middleware until the communication completes. Each progress engine can react as soon as conditions are met for any piece of data transfer to occur. As a result, the different chunks of handshake messages and actual payload transfer occur in a timely fashion. With the nonblocking Rendezvous, the CPU is only briefly available to the middleware at the communication call site. If the handshaking cannot occur for the payload transfer to be triggered before the CPU returns to the application level, then the data transfer ends up being deferred unless a network device-based mechanism is available to remedy the need for control message arrival checking.

In Figure 3.1, we show an experiment where a communication is followed by a given amount of computation. The network device is assumed to be RDMA-enabled. As a reminder, RDMA allows the autonomous transfer of communication data by the network device; leaving the CPU available for the application. Without loss of generality, the communication is chosen to be a receive; that is, MPI_RECV or MPI_IRECV which is its nonblocking equivalent. The experiment is reproduced three times. The first time, MPI_RECV is used and as expected, the communication and the computation are serialized. In general, computation can occur only if the CPU is available at application level. In the first case of Figure 3.1, the CPU idles inside the middleware while the data transfer occurs via RDMA. In the second case, MPI_IRECV is used. However, because the data transfer did not start before the CPU went back to the application level, the communication ends up being serialized after the computation. The network device idles between the exit of MPI_IRECV and the entrance of MPI_WAIT even though there is a data transfer pending. Then, between the entrance and exit of MPI_WAIT, the CPU idles inside the middleware while the data transfer occurs. Finally, in the third case, the data transfer starts before MPI_IRECV exits. Then, while the CPU goes back to application level to perform computation, the data transfer continues and occurs in the same time frame, leading to communication/computation overlapping. As a consequence, the aggregate duration of the communication and the computation is lower.

Figure 3.1: Serialized and overlapped communication/computation. Legend: R = MPI_RECV, IR = MPI_IRECV, W = MPI_WAIT; durations 1, 2 and 3 correspond to the three cases described above.

The goal of any performance-conscious MPI distribution is to ensure as much as possible that the required phases of handshaking occur in a timely fashion; so as to realize the third communication/computation scenario shown in Figure 3.1. In reality, that goal is still usually not achieved. In this dissertation, we survey the various methods put forth to ensure communication/computation overlapping in the nonblocking two-sided Rendezvous protocol over RDMA-enabled network technologies. We discuss why some of these methods fail to cover all scenarios; and why some could easily become a scalability issue because of their resource consumption patterns. Then we introduce a comprehensive solution [117] that consumes resources only if required. The proposal is shown to be both effective and scalable.

3.1 Introduction to Rendezvous Protocols

3.1.1 RDMA Write-based Rendezvous Protocol

The most intuitive strategy is to have the sender RDMA-write the message into the receiver’s destination buffer as shown in any of the diagrams of Figure 3.2. The process goes through steps a to d (Figure 3.2(a)) as follows:

• Step a: The sender side expresses its readiness by sending a Ready To Send (RTS) [85] control message to the receiver.

• Step b: Upon reception of RTS, the receiver side responds with a Clear To Send (CTS) [85] control message in which it specifies the address of the destination buffer.

• Step c: The sender side transfers the actual message to the receiver using RDMA Write.

• Step d: Finally, the sender side RDMA-writes a FIN control message to inform the receiver that the actual message is done transferring. The FIN message is an end-of-transfer notification.

Both InfiniBand [44] and iWARP [32] guarantee arrival ordering between several RDMA Write transfers posted back-to-back; and as per the RDMA specification of the Internet Engineering Task Force (IETF) [87], this ordering is to be expected from RDMA Write in general. As a result, the arrival of the FIN message necessarily means that the last byte of the actual message has reached the receiver.

Figure 3.2 shows the RDMA Write-based Rendezvous protocol with two different sender-receiver arrival scenarios and the wait patterns associated with each of them. Figure 3.2 shows that the sender can never accomplish any communication/computation overlapping because the actual message transfer always happens in its wait call. At the receiver side, communication/computation overlapping can occur if and only if the sender comes first and issues its wait call before the receiver does. This scenario is shown in Figure 3.2(b) where the receiver-side overlapping is depicted to be the gap between the entrances of the two wait calls. When the receiver comes first, communication and computation are strictly serialized for both peers (Figure 3.2(c), Figure 3.2(d)). We emphasize that the scenario described in Figure 3.2(b) can easily correspond to a complete receive-side communication/computation overlapping if the receiver’s MPI_WAIT occurs substantially later than the sender’s wait call. In such a case, the receiver would have no message transfer left to wait for; and its wait call would return immediately.

Figure 3.2: RDMA Write-based Rendezvous protocol. Legend: a = Ready To Send (RTS); b = Clear To Send (CTS); c = message transfer; d = FIN; e = wait entrance; f = wait exit. (a) Sender arrives first and receiver issues wait call before sender does; (b) Sender arrives first and receiver issues wait call after sender does; (c) Receiver arrives first and receiver issues wait call before sender does; (d) Receiver arrives first and receiver issues wait call after sender does.

The scenario depicted by Figure 3.2(b) does not make the RDMA Write-based Rendezvous any more compelling. In fact, the ability to proactively create communication/computation overlapping is correlated with the ability to exploit the window of opportunity between a nonblocking call and its associated wait call. The RDMA Write-based Rendezvous offers no such opportunity. In particular, none of the two peers can create communication/computation overlapping for itself.

3.1.2 RDMA Read-based Rendezvous Protocol

Figure 3.3 shows the RDMA Read-based Rendezvous. Instead of responding to the sender’s RTS message with a CTS, the receiver directly pulls the actual payload using RDMA Read. The RTS control message conveys the address of the sender-side buffer. One of the advantages of the RDMA Read-based Rendezvous is the elimination of one control message; giving both peers the possibility to experience the actual message transfer before any of the wait calls is issued (Figure 3.3(a)). That ideal scenario occurs when the sender comes first; meaning that RTS awaits the receiver, which can start the transfer before MPI_IRECV exits. If the receiver comes before the sender (Figure 3.3(b), Figure 3.3(c)), the receiver experiences strict serialization. The sender can still experience overlapping if its wait call occurs later than the receiver-side one (Figure 3.3(b)); otherwise, the sender is serialized as well (Figure 3.3(c)).

3.1.3 Synthesis of the RDMA-based Protocols and Existing Improvement Proposals

In all the previous Rendezvous schemes, the first control message is issued by the sender. A few proposals broke away from that traditional sender-initiated scheme by having the receiver send the first control message of the handshaking. In a lightweight MPI implementation [80] meant for the Cell Broadband Engine [35], a receiver-initiated Rendezvous was proposed where the receiver makes the first move of the handshaking by sending a Ready To Receive (RTR) control message which contains the receiver-side buffer address. Then, in response, the sender directly RDMA-writes the message as shown in Figure 3.4. This approach can allow full overlapping when the receiver comes first (Figure 3.4(a)). Otherwise, no message progression happens until the sender enters its wait routine (Figure 3.4(b), Figure 3.4(c)); in which case there is no overlapping for the sender. In this case, the receiver can still experience overlapping if its wait call occurs after the sender-side one (Figure 3.4(c)). Notice that the receiver-initiated Rendezvous always uses RDMA Write because resorting to RDMA Read would always be suboptimal in this case.

In another proposal, both RDMA Read and RDMA Write-based protocols are multiplexed depending on the arrival order of the peers [84, 85]. We refer to these mixed protocols as hybrid.

Figure 3.3: RDMA Read-based Rendezvous protocol. Legend: a = Ready To Send (RTS); b = Clear To Send (CTS); c = message transfer; d = FIN; e = wait entrance; f = wait exit. (a) Sender arrives first; (b) Receiver arrives first and receiver issues wait call before the sender does; (c) Receiver arrives first and receiver issues wait call after the sender does.

When the sender comes first, the sender-initiated RDMA Read protocol is used (Figure 3.3(a)); and when the receiver comes first, the receiver-initiated scheme (Figure 3.4(a)) is used; guaranteeing an effective message progression in most situations. The Gravel library [16] also proposes the receiver-initiated Rendezvous among a set of other protocols that the user statically chooses from at job startup.

Figure 3.4: Receiver-initiated Rendezvous protocol. Legend: a = Ready To Receive (RTR); b = message transfer; c = FIN; d = wait entrance; e = wait exit. (a) Receiver arrives first; (b) Sender arrives first and sender issues wait call before receiver does; (c) Sender arrives first and sender issues wait call after the receiver does.

On top of mixing RDMA Read and RDMA Write as well as both kinds of initiators, a set of Rendezvous protocols [95, 96] was proposed where the sender can be wait-free when the receiver is too late. In particular, when the sender-side MPI_WAIT call is made and the receiver has not even issued MPI_IRECV yet, the sender locally buffers the message, notifies the receiver of a change of protocol and exits. Then, upon calling MPI_IRECV, the receiver RDMA-reads the data from the sender-side internal temporary buffer. In another proposal [65], the buffering is done at the receive side by RDMA-writing the data into a pre-allocated receiver-side buffer when MPI_IRECV comes after the sender-side MPI_WAIT. In [65, 95], a message of any size can be buffered. While these bufferings allow the sender to shorten its overall communication time, they pose the issue of memory exhaustion at middleware level, especially for jobs made of a large number of processes. Considering the large size of the messages transferred with the Rendezvous protocol, these approaches raise a severe version of the concern expressed in [11] about memory exhaustion for Eager message buffering on systems meant to run large jobs. As a reminder, Eager messages are always small in size.

The receiver-initiated Rendezvous can only be speculative; meaning that the receiver can only assume that the communication will be based on the Rendezvous protocol. In fact, while the send call must specify the exact size of the data to be sent, the MPI specification only requires the receiver to specify a size big enough to get the message; that is, the receive size is allowed to be bigger than the actual message size. In certain applications, the receiver can only guess a lower bound of the message sizes that will come. As a result, a provided receive size can be beyond the Eager/Rendezvous threshold while the corresponding sender-side size is below that threshold. When that missed guess occurs, the receiver initiates a Rendezvous for a communication that ends up being eagerly performed.

Furthermore, which one of the sender or the receiver arrives first can safely be known only if the gap between the arrival times of the two peers is reasonably substantial. Actually, the receiver knows that the sender came first if it sees RTS waiting; otherwise, it assumes the opposite. However, RTS might be on its way when the receiver comes. In that case, the receiver mistakenly assumes that the other peer is late. The hybrid proposals refer to this scenario as similar arrival times for both peers. As mentioned in [95], none of the existing Rendezvous protocols is ideal when both peers arrive at similar times. In order to cope with this undesired case, the hybrid Rendezvous proposals can become entirely over-synchronizing [85, 95]. In particular, each control message has to be acknowledged to avoid race hazards in [85]. Having an oracle to predict the arrival order of each peer could help. A similar approach is considered in [65, 95] where a profile-driven mechanism is used to select the best protocols based on the relative timing of the calls of send/receive/wait routines. The proposal in [95] is based on the assumption that the MPI application at hand can be traced beforehand; so as to provide the various send/receive/wait relative call times. In [65] however, the profiling is done online inside the very application execution that is expected to use the profiling data. For that purpose, arrival times are gathered in each process and piggybacked to the remote processes in subsequent MPI calls. The profiling is done over two iterations, after which the predicted profile is decided.

The profile-based proposals can suffer from non-deterministic events in the system running the application unless system noise [17, 59] is a tamed parameter; an unrealistic assumption. System noise is the set of perturbations caused by various activities that occur in the OS. These activities, created by OS facilities and services or potentially other applications, share the CPU, create cache disturbance or fill the translation lookaside buffer, among other things. As a result, they introduce unexpected and random delays in any program running on the same host. System noise makes it unrealistic to predict when a given instruction will be executed. Furthermore, the MPI program can and will usually have a different timing behaviour depending on the presence or absence of profiling instructions in the execution path. As a result, the profiling data might mispredict the process arrival times that prevail when profiling is turned off. This last issue resembles the well-known change of behaviour observed when the same application is run in the same environment with and without debugging information in its binary.

3.1.4 Hardware-offloaded Two-sided Message Progression

All the previously mentioned works belong to the protocol improvement category. Hardware approaches have also been put forth to solve the Rendezvous issue. The TupleQ mechanism [53] associates with each InfiniBand SRQ a tuple ⟨communicator, rank, tag⟩. This tuple is the same used by MPI to match messages between senders and receivers. If a sender is sending a message using a certain tuple for the first time, it requests the corresponding SRQ address from the receiver; the SRQ is created if it is not already available. The SRQ addresses are cached at the sender side for future use. Then, each subsequent message is sent directly into the right SRQ without any control message. Of course, since RDMA is used, the scheme uses the HCA to progress the message. As a result, when a receive buffer is posted before a message sending is initiated, TupleQ achieves full overlapping at both sides. The key achievement here is the complete absence of control messages. The HCA itself blocks when the receiver buffer is not yet pre-posted and as such a software handshake becomes unnecessary.

The first issue with TupleQ is the aforementioned blocking behaviour for late receivers. The second issue is that TupleQ cannot handle point-to-point wildcards. In those cases, a mechanism temporarily shuts TupleQ down and switches to the traditional inefficient point-to-point policies. When too many wildcards are used in an MPI program, this switching becomes expensive and TupleQ is definitively replaced by a non-hardware-based point-to-point protocol. TupleQ is not a comprehensive solution. Additionally, it presents a serious scalability issue because the number of SRQs required could explode. As the same tuple might come up in a subsequent communication, TupleQ does not free SRQs once they have been created. It is worth recalling that InfiniBand queues are expensive resources whose abundant creation per process is documented as not scalable [52].

TupleQ is not the only InfiniBand-based message progression offloading. Some InfiniBand devices natively offer a feature named CORE-Direct [31], meant to offload collective message progression from the CPU onto the NIC. As described in [45], CORE-Direct can be used for the Rendezvous protocol; but it requires a duplication of all the InfiniBand queue pairs in each process.

Message progression offloading exists beyond InfiniBand as well. With Myrinet [111] and the now discontinued Quadrics Elan [81], the offloading of message progression simply leverages a thread that runs right on the NIC.

3.1.5 Host-based Asynchronous Message Progression Methods

The asynchronous message progression resorts to an entity that progresses the message independently of the application thread that initiates the communication. The offloading approaches of Quadrics and Myrinet are actually examples of asynchronous message progression. When the NIC does not have a programmable thread, asynchronous means resort to vanilla host threads that run on the CPU.

In general, the frequent reference to threads as a silver bullet solution for message progression is mentioned in [39]. In particular, the experiments exposed in [39] depict the use of polling threads dedicated to spare CPU cores as a very promising solution to the message progression problem. However, the paper also recognizes that real life scenarios tend to discard the spare core avenue. In concrete terms, the spare core avenue implies that a node possessing n CPU cores hosts at most n/2 processes; so as to allow each process to have two threads, each hosted on its own CPU core. Those two threads per process are the default application thread and the dedicated message progression thread. However, common sense dictates that the user always host n application threads on the n CPU cores of each node so as to maximize the parallelism of the job. As a result, the system ends up being oversubscribed with 2n threads competing for n CPU cores. Oversubscription has been shown to be detrimental, especially if the progress thread polls. As a consequence, while it performs worse than a polling thread enjoying a spare core, the use of interrupts becomes the other thread-based alternative when a CPU core cannot be dedicated. If the MPI processes are multithreaded instead, then the oversubscription is less severe than the 2n threads for n CPU cores case; but it is still present.

The interrupt-thread approach is based on RDMA Read, which is the transfer method for which it is optimal [102]. In fact, the interrupt is generated upon control message arrival and works best with the minimal possible number of those control messages. RDMA Read-based approaches are the only ones to allow a single control message before data transfer for sender-initiated Rendezvous. RDMA Write can allow a single control message as well if it is receiver-initiated; in which case it suffers all the issues related to the impossibility of handling MPI wildcards.

The usual criticism directed towards interrupts relates to their relatively high latency. With InfiniBand and OFED [78], there is also the impossibility of selectively generating the interrupts only in the cases where the message progression would not naturally happen. With RDMA Read, there are three possible receive scenarios, namely, 1) blocking receive; 2) nonblocking receive with the RTS control message coming before the receive call is issued; and 3) nonblocking receive with RTS coming after the receive call exits. Scenario 3 is the only case where the progression thread is required in order to prevent the message transfer from being deferred to the wait call. The interrupt happens at the receive side but it is triggered from the send side by setting a flag in the header of the RTS control message. Since the sender can predict neither the arrival time of the receiver nor the blocking nature of the receive call, it always sets the interrupt flag. As a result, the receive-side progression thread gets woken up for every receive operation. Triggering the progress thread means incurring 1) an interrupt cost, 2) a context switch when the thread wakes up, and 3) either a lock to get into the progress engine, or another interrupt (signal) to the application thread to execute the progress engine [56]. That succession of events associated with each waking of the progression thread is too expensive to allow it to happen even in situations where it deterministically has no chance of producing any improvement.

3.1.6 Summary of Related Work

As stated in [39], and from what precedes, the message progression techniques can be grouped in a few categories: the protocol improvement approaches [16, 65, 80, 85, 95, 96], the hardware approaches [45, 53, 81, 111], and the host-based asynchronous progression methods [56, 102]. The protocol improvement approaches cannot deal with MPI wildcards and they can be over-synchronizing [85] because of the increase of control messages. They can also fail to be effective depending on the arrival pattern of the processes. The profile-guided versions of the protocol improvement approaches [65, 95] assume strict regularity in communication timings; an assumption that does not align with the behaviour of non-real-time operating systems. In general, the main and persistent shortcoming of the protocol improvement approaches is that they do not offer comprehensive solutions. As for hardware approaches, they are either non-scalable [53] or restricted to a specific network device [45, 53, 81, 111].

Finally, asynchronous approaches that use host threads can be generic and applicable to any network device. They also offer comprehensive solutions. When they resort to polling, they lead to oversubscription [39]. When they resort to interrupts, they suffer from the impossibility of distinguishing between the problematic scenarios and those where overlapping naturally occurs without any help [102, 56]. This brute-force approach does not scale either.

3.2 Scenario-conscious Asynchronous Rendezvous

As a solution to the absence of message progression in the Rendezvous protocol, we investigate an asynchronous Rendezvous approach that is completely scenario-conscious. For being asynchronous, the approach covers all possible scenarios. Plus, it requires no specialized hardware and is not linked to any particular network device like the offloaded methods. As shown in Figure 3.2, Figure 3.3 and Figure 3.4, not all combinations of process arrival patterns suffer a lack of autonomous message progression. The solution proposed in this section selectively operates in the only cases where the progression does not naturally start.

In order to introduce the design, we establish that a sufficient condition for all-time effective autonomous communication progression is the presence of three mechanisms:

• A carousel meant to ferry messages from one end to another without any external propelling force.

• A watchdog to check the availability of control messages or transfer conditions at either end of the carousel.

• A trigger to kick off the carousel. This includes dropping/picking up messages on/from the carousel if applicable.

Streaming the data, as done by the carousel, is the most expensive aspect of the communication. With RDMA being the carousel, only small amounts of CPU activity are required to fulfill the watchdog and the trigger. Consequently, we depart from the use of dedicated threads and put forth a solution that leverages existing ones. In fact, we try to steal those small amounts of CPU from a thread that is supposed to run at application level inside an MPI process.

Our design [117] is built around RDMA Read. As previously mentioned, RDMA Read fulfills the Rendezvous protocol with the least amount of control messages before the start of the data transfer. In particular, only the receiver gets a control message. Additionally, it is the receiver that triggers the data transfer with RDMA Read. Consequently, only the receive side would require a watchdog-trigger tandem to operate. As for how to sneak into the application thread, source code transformation before or during the compilation phase is not a viable option because it would give no control over the watchdog activity frequency. In fact, no information is available on individual instruction execution durations. Unfortunately, breaking down each source-level instruction into smaller ones is not always possible; especially when they involve deeply nested library function calls. At the binary level, a certain control can be expected over instruction execution times; but tracking data sharing and race issues at that level is extremely challenging. Thus, we elect not to tamper with the instructions.

Instead, we start our investigation with a fiber-like [88] construct that would coexist with the regular instruction flow inside the thread. Fibers are distinct execution flows that can be hosted in the same thread. They take turns non-preemptively to use the same CPU as long as the hosting thread is running. Fibers can be viewed as the system-level implementation of coroutines, which are language-level concepts. Fibers are natively supported on Windows NT but not on Linux.
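Although Linux has no native fiber API, the idea of two cooperative execution flows sharing one thread can be emulated with the POSIX ucontext primitives (getcontext, makecontext, swapcontext). The minimal sketch below is purely illustrative of that cooperative turn-taking; it is not the mechanism ultimately retained in this design, which, as explained later, has to be semi-preemptive.

    /* Illustration only: two cooperative flows sharing a single thread,
     * in the spirit of fibers, built on the POSIX ucontext API. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_flow, side_flow;
    static char side_stack[64 * 1024];          /* stack for the second flow */

    static void side_body(void)
    {
        for (int i = 0; i < 3; i++) {
            printf("side flow, turn %d\n", i);
            swapcontext(&side_flow, &main_flow); /* yield back cooperatively */
        }
    }

    int main(void)
    {
        getcontext(&side_flow);
        side_flow.uc_stack.ss_sp   = side_stack;
        side_flow.uc_stack.ss_size = sizeof(side_stack);
        side_flow.uc_link          = &main_flow; /* resume here if side_body returns */
        makecontext(&side_flow, side_body, 0);

        for (int i = 0; i < 3; i++) {
            printf("main flow, turn %d\n", i);
            swapcontext(&main_flow, &side_flow); /* hand the CPU to the other flow */
        }
        return 0;
    }

Each flow only runs when the other explicitly yields, which is exactly the property that cannot be guaranteed for arbitrary application code, as discussed next.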

3.2.1 Design Objectives

At this point, we know that the watchdog will poll, the Rendezvous protocol would be RDMA Read-based, and that a separate instruction flow would be required to realize the watchdog. As shown in Figure 3.3(a), the RDMA Read-based Rendezvous protocol is already known to allow independent progress and overlapping when the RTS message reaches the receiver before MPI_IRECV is issued [85, 95, 102]. Thus, the design needs to care only about the case where the receive call is issued before RTS arrives.

We designate by Application Execution Flow (AEF) the flow made of the default instructions of the thread; and by Parasite Execution Flow (PEF) the flow made of the instructions we are sneaking into the thread at the middleware level to realize the watchdog and the trigger.

Before MPI_IRECV is issued, the thread only has AEF. Then, if and only if MPI_IRECV does not find RTS in its unexpected message queue, PEF is spawned to periodically execute the watchdog until RTS is found and the RDMA Read transfer is triggered. Activating PEF only in this case is a form of scenario-consciousness. This whole mechanism is depicted in Figure 3.5.

One can notice that the proposed solution is used only when it is required by a nonblocking receive because, unlike the OFED and InfiniBand interrupt approach, the triggering is decided at the receive side. Furthermore, the Rendezvous is sender-initiated. As a result, the receiver does not have to know or guess the source rank or the tag of the communication. There is therefore no issue with wildcards.

Figure 3.5: Parasite execution flow-based message progression at the receiver side for two-sided communications

PEF and AEF behave like fibers and take turns exclusively to consume CPU cycles. For a single pending MPI_IRECV, PEF is active for at most the duration of the possible overlapping period. The overlapping period ends when any of the wait family of routines is called for the receive request. It also ends when the transfer completes; even if the wait call has not been issued yet. As shown in Figure 3.5, PEF is guaranteed to die at most at its next turn after RTS arrives. There is only a single PEF in the application process no matter the number of pending nonblocking receives. If it is not already active, PEF is spawned by the next MPI_IRECV that misses its RTS. Then, any subsequent MPI_IRECV that executes before PEF dies just adds a progress request to the queue of currently watched receive requests. PEF dies when the last expected RTS is found or when any of the wait family of routines is called for the last pending request. In order to allow only nonblocking receives to trigger PEF, we propagate a flag at the entrance of MPI_IRECV calls. By so doing, we realize another form of scenario-consciousness. In comparison, the interrupt-based approaches that are decided at the sender side operate even for blocking receives.

3.2.2 Asynchronous Execution Flows inside a Single Thread

A thread can be considered created with a single default fiber just as a process is created with a single default thread. That default fiber (AEF) runs the instructions that are programmed for the thread at application level. The default fiber does not share its thread with any other instruction flow until a new fiber (PEF) is explicitly created in due time by a call to MPI_IRECV.

In an ordinary multi-fiber environment, AEF and PEF would take turns to consume CPU cycles by willingly yielding the processor from time to time. Fibers are thus inherently cooperative and synchronous. Switching from one fiber to another is done by issuing an instruction whose position is deterministically known in the execution flow. Yielding the CPU when PEF is running is trivial because the specific and exact watchdog instructions that it runs are known by the MPI middleware. The same is not true for AEF because its instructions are provided and decided by the MPI application programmer; and those instructions are arbitrary. There is no easy means of inserting CPU-yielding instructions in AEF without raising the same issues associated with the code transformation avenues mentioned earlier. Consequently, there are two issues with considering AEF as a fiber: first, it will never yield by itself; and second, it will not yield at the desired frequency or moment. An asynchronous means is thus required to disrupt the flow of AEF. The use of timer interrupts appears to be an immediate candidate.

The timer would periodically preempt AEF to transfer the CPU to PEF. Then, when done, PEF willingly re-transfers the CPU to AEF. By resorting to this semi-preemptive mechanism, we depart from the pure fiber concept; and at the same time we lose some of the beauties of fibers. In particular, disrupting AEF takes away the peace of mind associated with the complete absence of preemption that fibers offer.

The design is realized on x86_64 Linux. Hosting PEF and AEF in the same thread is a concept that does not map well onto the features offered by Linux. In particular, sneaking PEF into the thread seems feasible only by altering the OS kernel. With the regular features offered by Linux, we single out signal delivery as a potential means for performing the AEF-to-PEF transition.

A signal is an event-like interprocess communication mechanism common on the broad Unix family of operating systems. Signals are in theory supported by all Portable Operating System Interface (POSIX)-compliant operating systems. A process can send a signal to another process or to itself. A signal can also be initiated by the OS or by various events such as keyboard combinations (e.g., Control+C). There are many kinds of signals and each has an associated default action. A process can either set a mask to defer the delivery of a specific signal, define a custom handler in order to take a user-defined action upon the delivery of the signal, or simply set a flag to ignore the signal even if it is delivered. This freedom of management is possible with most signal types. When a signal is sent to a process, the OS delivers it asynchronously and interrupts the execution flow of one of its threads. In certain cases, a signal can be delivered to a specific thread in a target process.

For the AEF to PEF transition, we resort to SIGALRM, which is delivered periodically after a custom timer expires. For performance reasons, the disruption must be lock-free and avoid all contention between AEF and PEF. PEF executes inside a defined SIGALRM handler.

There is no programmer-accessible data access issue because PEF is disabled when AEF is about to call the progress engine, which is the only critical section of interest. The disabling is done by setting an ignore status for SIGALRM. There is therefore no risk of AEF being preempted when inside the progress engine. As far as Linux signal handlers are concerned, the hidden corruptible data consist of the internal state of I/O functions as well as data manipulated by dynamic memory allocations. That data is not touched in PEF.
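A heavily simplified sketch of this plumbing follows; check_unexpected_queue_for_rts, trigger_rdma_read and the pef_* helpers are hypothetical stand-ins for the middleware internals, and the snippet only illustrates the SIGALRM arming and masking pattern described above, not the actual implementation.

    /* Sketch only: a periodic SIGALRM drives a PEF-like watchdog turn, and the
     * signal is ignored while AEF is inside the progress engine. */
    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>

    static volatile sig_atomic_t rts_found = 0;       /* stand-in middleware state */

    static int  check_unexpected_queue_for_rts(void)  { return rts_found; }
    static void trigger_rdma_read(void)               { /* RDMA Read posted here */ }

    static void pef_handler(int sig)                  /* one PEF turn */
    {
        (void)sig;
        if (check_unexpected_queue_for_rts())
            trigger_rdma_read();
    }

    /* Called when MPI_IRECV misses RTS: first turn after 'phase' microseconds,
     * then one turn every 'period' microseconds. */
    static void pef_spawn(long phase_us, long period_us)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = pef_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;                     /* do not break interrupted syscalls */
        sigaction(SIGALRM, &sa, NULL);

        struct itimerval t;
        memset(&t, 0, sizeof(t));
        t.it_value.tv_usec    = phase_us;
        t.it_interval.tv_usec = period_us;
        setitimer(ITIMER_REAL, &t, NULL);
    }

    /* AEF brackets its progress-engine calls with these, so PEF can never
     * preempt the only critical section of interest. */
    static void pef_block(void)   { signal(SIGALRM, SIG_IGN); }
    static void pef_unblock(void) { pef_spawn(2 /*us*/, 10 /*us*/); }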

We provide the following environment variables to control the AEF/PEF alternations: a period, ppef_period, to specify the time between two consecutive turns of PEF; a phase, ppef_phase, to specify the delay before PEF takes its first turn after a new request is queued; a frequency decay, ppef_freq_decay, to specify a multiplicative factor applied to the period after each turn; and a turn limit, ppef_max_turns_per_req, to specify the maximum number of turns after which PEF gives up progressing a request. All those parameters are reinitialized and reapplied every time a new request is posted for an already existing PEF. The tuning space defined by the aforementioned parameters is very large. Without prior knowledge of the application at hand, tuning is thus a difficult task, and the subject of future studies.

However, we propose a two-step rule of thumb that serves as the default set of parameters.

The first step is optimistic. It assumes that the application is well-behaved; meaning that the peers are balanced enough to exhibit only small gaps in communication call times; and a small value of the phase (e.g., 2µs or 5µs) should help offset that gap. If RTS does not arrive in the time frame of the phase, the rule enters its second step where it becomes more and more pessimistic after each PEF turn that does not hit. The pessimistic step is realized by a period that is multiplied by ppef_freq_decay after each turn. The pessimistic step gives up a bit of reactiveness each time in order to limit the overhead imposed on the receiver by very late RTS control messages. On our test system, the rule uses a phase of 2µs, a period of 10µs and a frequency decay of 2.
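Under those defaults, the interval before each PEF turn could be derived as sketched below; this is an illustrative helper, not middleware code.

    /* Sketch of the two-step rule of thumb: an optimistic phase followed by a
     * geometrically decaying polling frequency. */
    static long next_pef_interval_us(int  turns_taken,
                                     long phase_us,     /* ppef_phase             */
                                     long period_us,    /* ppef_period            */
                                     long freq_decay,   /* ppef_freq_decay        */
                                     int  max_turns)    /* ppef_max_turns_per_req */
    {
        if (turns_taken >= max_turns)
            return -1;                      /* give up progressing this request */
        if (turns_taken == 0)
            return phase_us;                /* optimistic step */
        long interval = period_us;          /* pessimistic step: period * decay^(turns-1) */
        for (int i = 1; i < turns_taken; i++)
            interval *= freq_decay;
        return interval;
    }

With the default phase of 2µs, period of 10µs and decay of 2, successive turns would thus be scheduled roughly 2µs, 10µs, 20µs, 40µs, and so on apart, until ppef_max_turns_per_req is reached.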

3.3 Experimental Evaluation

Our experimental setup is a four-node InfiniBand cluster. Each node is equipped with two quad-core 2GHz AMD Opteron 2350 processors and 8GB of memory. The nodes run Linux kernel 2.6.32. The network devices are Mellanox ConnectX QDR cards and switches. The MPI distribution is MVAPICH-1.2rc1, which is a version with thread-based asynchronous Rendezvous progression. The Eager/Rendezvous protocol threshold is 9180 bytes. The tests compare the MVAPICH-ASYNC method, which implements the interrupt-thread approach, with our proposed PEF method. RGET designates the default RDMA Read-based Rendezvous in MVAPICH. RGET does not resort to any asynchronous agent.

We emphasize that microbenchmarks are used to test and observe a very specific behaviour in isolation. As a result, it is common practice to report microbenchmark results over large numbers of iterations, so as to abstract away the differences that might occur from execution to execution. In comparison, applications cannot be executed in controlled environments; and it is neither necessary nor realistic to seek to remove differences in the behaviour of an application across multiple executions. It is common practice in the field to report application results over small numbers of execution iterations.

3.3.1 Microbenchmark Results

Overhead Tests

Figure 3.6 shows the latency overhead of MVAPICH-ASYNC and PEF compared to RGET.

Barriers are used to force arrival orders for nonblocking tests. No computation is inserted between the nonblocking calls and their associated wait calls; thus RGET does not suffer any message progression issue. As a result, RGET always yields the smallest possible latency and can therefore be used to compute the latency overhead of the two other Rendezvous protocols.

Each latency data point is measured by averaging 1000 pingpong tests. A pingpong test measures the duration of a round-trip communication between two peers. All overheads are computed relative to RGET. The scenarios "blocking receive" (Figure 3.6(a)) and "nonblocking receive, sender arrives first" (Figure 3.6(b)) show that the overhead of PEF oscillates around 0µs while MVAPICH-ASYNC exhibits on average 10µs and 20µs per message transfer, respectively. These two cases do not present any message progression deficiency. The scenario-consciousness of PEF justifies its absence of overhead in these scenarios where MVAPICH-ASYNC is triggered even though there is no need and no room for improvement. The measurements have a small noise component, which explains why Figure 3.6(b) shows a slightly negative overhead at 16KB.

When the receiver is nonblocking and early compared to RTS (Figure 3.6(c)), progression help is required to avoid deferring the message transfer to the wait call. Figure 3.6(c) shows that PEF exhibits an overhead of approximately 5µs in this scenario. MVAPICH-ASYNC exhibits on average 12µs in this scenario. While the different overheads shown by the PEF curves in the three cases can be explained by its scenario-consciousness, the reason for the difference in overheads might be less obvious for the MVAPICH-ASYNC method. Recall that the interrupt-thread mechanism, as implemented by MVAPICH-ASYNC, goes through three steps when it triggers. First, the interrupt delivery wakes up the progression thread. Then, in the second step, the progression thread sends a signal to the main application thread, which handles the message progression; it then goes back to waiting for another interrupt in step three. However, the signal sent by the MVAPICH-ASYNC thread is temporarily ignored application-wide every time the application thread is inside the MPI middleware, as in the case of blocking receives. Ignoring the signal is important to avoid race conditions. The aforementioned second step is thus less expensive in Figure 3.6(a) than in Figure 3.6(b) and in Figure 3.6(c); and so is the communication latency of MVAPICH-ASYNC. We emphasize that the signal is still sent in Figure 3.6(a), but it triggers no handler because it is ignored.

Figure 3.6: Receiver-side latency overhead of interrupt-based threading (MVAPICH-ASYNC) vs. PEF (RGET is the reference). (a) Blocking receive; (b) Nonblocking receive, sender arrives first; (c) Nonblocking receive, receiver arrives first.

The latency of RGET is approximately the same in Figure 3.6(a) and Figure 3.6(b). Due to the lower latency of MVAPICH-ASYNC in Figure 3.6(a), its overhead is smaller than in Figure 3.6(b) where the signal is delivered. As for Figure 3.6(c), its associated sequence of calls is MPI_IRECV, MPI_BARRIER, and MPI_WAIT. RGET, in Figure 3.6(c), yields a higher latency compared to Figure 3.6(b) because its message transfer starts in MPI_WAIT instead of MPI_IRECV. This higher RGET latency mitigates the overhead of MVAPICH-ASYNC in this third scenario, making it lower than in Figure 3.6(b). The low 5µs overhead of PEF in Figure 3.6(c) can also be attributed to that mitigation.

Communication/computation Overlapping

We designate by ε the duration of a nonblocking two-sided communication whose application-level payload has size 0. ε can be seen as the fixed cumulative latency of issuing MPI_ISEND and MPI_WAIT, or MPI_IRECV and MPI_WAIT. We also consider a single activity made of a communication of duration c_m and a computation of duration c_p. The overall duration d of the activity when there is no communication/computation overlapping is expressed as:

d = ε + c_m + c_p    (3.1)

If communication/computation overlapping is achieved, the overall duration is instead expressed as:

d = ε + max(c_m, c_p)    (3.2)

As per Equation 3.2, if a constant c_p is chosen to be always larger than c_m for various message sizes, then d is constant and becomes:

d = ε + c_p    (3.3)
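The measurement logic behind these equations can be sketched as follows for the receive side; the buffer, peer and tag are illustrative, and the busy-loop helper is a stand-in for the application-level computation.

    /* Sketch of the receive-side measurement behind Equations 3.1-3.3: a constant
     * computation c_p is overlapped with a nonblocking receive, and the overall
     * duration d is returned in microseconds. */
    #include <mpi.h>

    static void compute_for_microseconds(double us)
    {
        double end = MPI_Wtime() + us * 1e-6;
        while (MPI_Wtime() < end)
            ;                                  /* keep the CPU busy at application level */
    }

    double measure_receive_duration(void *buf, int count, int peer, double cp_us)
    {
        MPI_Request req;
        double start = MPI_Wtime();

        MPI_Irecv(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        compute_for_microseconds(cp_us);       /* c_p, chosen larger than any c_m */
        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* completes the communication */

        return (MPI_Wtime() - start) * 1e6;    /* d: flat across sizes if overlap occurs */
    }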

The communication/computation overlapping test results are shown in Figure 3.7. c_p = 1000µs so as to be larger than the largest c_m, which is less than 750µs. The computation is done simultaneously on both the sender and receiver sides to keep the CPU busy at application level in both peers. Communication/computation overlapping is assessed by observing whether the curve of d is decoupled from the message size (Equation 3.3) or whether it shows a positive slope when the message size increases (Equation 3.1). The two scenarios of the sender coming first (Figure 3.7(a) and Figure 3.7(b)) and the receiver coming first (Figure 3.7(c) and Figure 3.7(d)) are represented.

One can notice, for both the sender and the receiver and for both arrival orders, that the interrupt-thread and PEF approaches allow communication/computation overlapping; and PEF is on par with the interrupt-thread approach in spite of being better in terms of overhead generation (Figure 3.6). In comparison, the default RGET protocol does not allow communication/computation overlapping. In fact, it is already known that RGET can proactively create communication/computation overlapping only when the sender comes first (Figure 3.3(a)); so the observations in Figure 3.7(c) and Figure 3.7(d) match expectations, but the observations in Figure 3.7(a) and Figure 3.7(b) do not. The lack of communication/computation overlapping in Figure 3.7(a) and Figure 3.7(b) is due to a choice made in the MVAPICH implementation. An arrived control message could wait in the message queue at middleware level. A control message could also be waiting at InfiniBand verb level, in which case it has to be retrieved by the progress engine. The InfiniBand verb-level retrieval is not done for the RTS control message by MVAPICH with the RGET implementation.

Figure 3.7: Communication/computation overlapping of RGET (MVAPICH), interrupt-based threading (MVAPICH-ASYNC) and PEF. (a) Sender, sender first; (b) Receiver, sender first; (c) Sender, receiver first; (d) Receiver, receiver first.

3.3.2 Application Results

The results in this section are generated with the NASA Advanced Supercomputing (NAS) application suite [74]. A NAS application is executed in a given class that defines the size of the dataset that it manipulates. Classes are named with alphabet letters. In the application graphs, we use the convention X.Y.n to designate the NAS application X, run in class Y in an MPI job of n processes.

We stated that PEF is better than the interrupt-thread mechanism (MVAPICH-ASYNC) by being scenario-conscious. In particular, blocking receives and nonblocking receives for which RTS arrives before the receiver are scenarios where the interrupt-thread mechanism inflicts an undue penalty on the application. This means that, unlike PEF, it is risky to use the interrupt-thread mechanism on HPC applications whose communication profiles are not known. Such a restriction is very limiting, especially because HPC consumers do not usually care about middleware or communication profiles.

For our test purposes, it is not easy to force an arrival order-based scenario in applications. Nevertheless, we can present the impact of both PEF and the interrupt-thread mechanism on blocking receives. The results, using the NAS LU application, are shown in Figure 3.8. NAS LU is a Lower-Upper Gauss-Seidel solver that uses only blocking receives. Figure 3.8 shows that the interrupt-thread (MVAPICH-ASYNC) mechanism degrades the communication by more than 8% for 4 processes and by more than 22% for 32 processes. In comparison, PEF shows a degradation of less than 0.2% and 0.9% for 4 and 32 processes respectively. Apart from a mere branch to check the blocking nature of the receive, our proposal does not execute any instruction in this scenario where PEF is not triggered at all. As a result, those tiny percentages are more attributable to noise than to any actual penalty. From the trend observed in Figure 3.8, we can conjecture that the interrupt-thread mechanism would inflict larger overheads on larger jobs; exacerbating the issue at larger scales.

Figure 3.8: Impact at the receive side on the communication time of a job that resorts to blocking receives

The other application tests for nonblocking receives are performed for:

• NAS BT: Block Tri-diagonal solver

• NAS CG: Conjugate Gradient

• NAS MG: Multi-Grid on a sequence of meshes

• NAS SP: Scalar Penta-diagonal solver

BT and SP can only be executed over square numbers of processes; and can reach a maximum of 25 processes on our 32-core cluster. All the tests are averaged over 6 executions. Figure 3.9(a) and Figure 3.9(b) respectively show the overall wait and communication times of the executions.

We first observe that both mechanisms yield similar performance for some of the applications; namely, BT.B.4, CG.B.4, CG.C.32, SP.B.4 and SP.C.25. We also observe that MVAPICH-ASYNC shows better results for BT.C.25, while PEF performs better for MG.B.4 and MG.C.32. Once again, the key observation remains the absence of a general trend of PEF paying for its scenario-consciousness by underperforming compared to the interrupt-thread mechanism.

In general, it is important to remember that the actual nonblocking communication progression is handled by RDMA. The two means being compared here just help trigger the transfer in a timely fashion. As a consequence, when the RTS lateness is reasonable, the progression performance of nonblocking communications over RDMA tends to behave like a two-state variable, swinging between the maximum possible performance and zero. This explains well the behaviour observed for BT.B.4, CG and SP in Figure 3.9(a) and Figure 3.9(b).

Figure 3.9: Nonblocking receive-based application benchmark results. (a) Wait time improvement over RGET (%); (b) Communication time improvement over RGET (%); (c) Memory footprint overhead compared to RGET (MB).

However, application behaviours can sometimes create some differences in performance, as shown by BT.C.25, MG.B.4 and MG.C.32. If RTS arrives mostly early, MVAPICH-ASYNC would tend to perform poorly compared to PEF. PEF, on the other hand, would tend to be outperformed by MVAPICH-ASYNC when RTS mostly exhibits slightly large delays. In these cases, PEF tends to be less reactive than MVAPICH-ASYNC. One can notice that very large delays are a performance killer, no matter the Rendezvous mechanism used, as they end up voiding the reactiveness advantage as well. In particular, very large delays, when they exist, tend to be the dominant communication latency component; making the actual transfer time insignificant.

Finally, Figure 3.9(c) shows the memory footprint overhead of each mechanism. The memory footprint overhead is how much each mechanism adds to the resident set size of each application when compared to MVAPICH-RGET. For all the applications, PEF shows an insignificant overhead, reaching 128KB in the worst case for 32 processes. In comparison, the interrupt-thread mechanism shows more than 320MB in the worst case for 32 processes. Each x86_64 Linux thread adds 10MB to the resident set size. Once again, PEF appears to be the better mechanism when scalability is a concern. In Figure 3.9(c), the memory overhead of PEF is not visible because it is too small compared to the memory overhead of the interrupt-thread approach.

With respect to the memory overhead of the interrupt-thread approach, potential ways of diminishing, to some extent, the amount of memory that a Unix/Linux thread adds to the working set of a process come to mind. One approach is to alter the stack size of all the threads created by the user with Linux resource limit management facilities such as the ulimit command. This approach makes a restrictive assumption about the resource needs of all the threads created by the user on all the nodes. Another approach is to alter the code of the MPI middleware to control the creation parameters of the interrupt thread. This second approach is not always possible either, because it assumes the availability of the sources of the MPI middleware as well as some knowledge of its internal functioning. In practice, there are only limited and impractical solutions to the memory footprint issue associated with the interrupt-thread approach.
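As an illustration of the second avenue, the interrupt thread could be created with an explicitly reduced stack as sketched below; this assumes access to the middleware sources and uses a placeholder thread body, so it is a sketch rather than actual MVAPICH code.

    /* Sketch: creating the progression thread with a small stack instead of the
     * platform default (typically 8 MB on Linux), which dominates the per-thread
     * footprint discussed above. */
    #include <pthread.h>

    static void *progression_thread_body(void *arg)
    {
        (void)arg;          /* would wait for interrupts and signal the main thread */
        return NULL;
    }

    int spawn_progression_thread(pthread_t *tid)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256 * 1024);   /* 256 KB stack request */
        int rc = pthread_create(tid, &attr, progression_thread_body, NULL);
        pthread_attr_destroy(&attr);
        return rc;
    }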

3.4 Summary

The purpose of nonblocking communications is to mitigate the latency of data transfer. However, for that mitigation to occur, the data transfer must not be deferred to the blocking wait call that completes the nonblocking communication. In two-sided communications, the deferral can occur on Rendezvous-based data transfers if certain control messages of the handshaking are not discovered before the nonblocking call exits. A category of previous work proposed various alterations of the Rendezvous protocol to solve the issue; but they always miss at least one scenario where data transfer deferral can occur. Hardware solutions exist that solve the issue in a comprehensive manner, but they are tied to specific network devices. As a network-agnostic comprehensive solution, a host thread can be leveraged to check for control message availability. That approach, named asynchronous message progression, can either resort to a polling thread or to a thread that is awakened upon interrupt generation. Since MPI users do not leave spare cores to dedicate to middleware-level activities, the polling approach leads to oversubscription. The interrupt approach leverages non-dedicated CPUs; but it is oblivious to the cases where it is not required to trigger; making its associated overhead difficult to justify in many scenarios. The asynchronous proposal investigated in this dissertation seeks to avoid these unjustifiable overheads by being scenario-conscious.

We realize our proposal by forcing the application thread itself to handle Rendezvous-backed point-to-point RDMA operations even when it is already plunged in a computation. We do so by having a Parasite Execution Flow (PEF) coexist with the Application Execution Flow in the same thread. While the interrupt-thread mechanism is triggered from the sender side, our mechanism is entirely controlled by the receiver itself. Because it has precise knowledge of whether progression help is needed, the PEF mechanism can be triggered only when it is required. As a consequence, and unlike its interrupt-thread counterpart, every overhead that the PEF mechanism imposes on the application is associated with a situation that requires improvement. In addition, PEF is substantially lighter than the interrupt-thread mechanism in terms of memory footprint overhead. On top of its positive overhead control and memory footprint behaviours, the PEF proposal is on average as efficient as the interrupt-thread mechanism in terms of performance improvement for the applications studied in the dissertation.

Each Rendezvous protocol discussed in this chapter sends at least two control messages, including the FIN notification. In the nonblocking versions of the protocol, the first set of control messages, RTS, CTS and RTR, can all reach their destination without the concerned process actively waiting for them inside the MPI middleware. Like the middleware-level control messages, application-level messages can also reach a peer that is not actively waiting for them. What happens to those messages that are not being waited for? The answer is simple; they are queued! With many communications expressed at application level, the middleware could have to deal with proportionately many payload messages and proportionately many control messages; and when scales grow those queued messages become less trivial to process. Retrieving a specific message in a queue that services a hundred processes is very easy for a CPU. The task could still be trivial with a thousand processes; but with jobs of hundreds of thousands or even millions of processes, the scalability issues quickly become obvious. If retrieving a message queue item is part of internally fulfilling a communication, then the communication can noticeably slow down. Knowing that resources per process are limited, how does an MPI process host and manage large message queues anyway? These questions are investigated in the next chapter, where we show that message queues are a central piece of MPI middleware; a piece that grows even further in importance when scales grow.

Chapter 4

Scalable Message Queues in MPI

Sequoia, one of the most powerful petascale systems [106], has 1,572,864 CPU cores hosted in 98,304 compute nodes. These resource levels are expected to grow even larger in the exascale computing era. Little issues which are benign or unnoticeable on small systems can become unforgiving at large scales. An example of one such issue resides in the processing of message queues in MPI.

Message queues are required in MPI libraries to cope with the unavoidable out-of-sync communications [4, 49]. A minimum of two message queues are required, both at the receive side, to allow MPI communication operations. They are the Unexpected Message Queue (UMQ) and the Posted Receive Queue (PRQ). When a new message arrives, the PRQ must be traversed to locate the corresponding receive queue item, if any. If no match is found, a Message Queue Item (MQI) is queued in the UMQ. Similarly, when a receive call is made, the UMQ must be traversed to check whether the requested message has not already (unexpectedly) arrived. If no match is found, a new MQI is posted in the PRQ. These message queues are required because senders are not required to synchronize with receivers before sending small (Eager type) or control messages.

In MPI, the minimal search key in any message queue is the tuple ⟨contextId, rank, tag⟩. In certain implementations such as Open MPI [79], it is a proper superset of that tuple. The contextId designates the communicator inside MPI. The rank is the expected source rank for a PRQ item, and the rank of the sender of the arrived message for a UMQ item. As for the tag, it is useful whenever the same process (sender or receiver) can have more than a single pending message or MQI in any queue. Internal predefined tags are also used by MPI libraries even for communication models, such as collectives and RMA, which do not require a tag argument in the API.
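A minimal sketch of such a matching key, including the wildcard rules that posted receives may use, is given below; the struct layout is illustrative and does not reflect any particular MPI library.

    /* Sketch: the minimal <contextId, rank, tag> key of a message queue item,
     * with wildcard handling as allowed on posted receives. */
    #include <stdbool.h>
    #include <mpi.h>                 /* only for MPI_ANY_SOURCE / MPI_ANY_TAG */

    typedef struct mqi_key {
        int context_id;              /* identifies the communicator  */
        int rank;                    /* peer rank used for matching  */
        int tag;
    } mqi_key;

    /* 'posted' comes from the PRQ (may carry wildcards); 'arrived' from the UMQ. */
    static bool mqi_match(const mqi_key *posted, const mqi_key *arrived)
    {
        if (posted->context_id != arrived->context_id)
            return false;
        if (posted->rank != MPI_ANY_SOURCE && posted->rank != arrived->rank)
            return false;
        if (posted->tag != MPI_ANY_TAG && posted->tag != arrived->tag)
            return false;
        return true;
    }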

It has been observed that the message queue length can grow in proportion to the job size [12, 13, 14, 49]. Message queues are solicited in point-to-point, collective and even modern RDMA-based implementations of RMA operations [89], which marginally resort to point-to-point operations at middleware level. Actually, the UMQ is used so frequently that it has been qualified as the most crucial data structure in MPI [49]. Message queue operations must therefore be fast at large scales as they are on the critical path of most MPI communications. This performance requirement is a condition for MPI to remain a viable and scalable HPC communication middleware. Scalability, however, is two-fold. With the growing processor core density per node, and the expected lower memory density per core at larger scales, a queue mechanism that is blind to memory consumption behaviour could be as harmful as one that quickly becomes unacceptably slow when job sizes grow. If a message queue architecture can remain reasonably fast at 1 million CPU cores while its memory consumption becomes prohibitive at 0.1 million CPU cores, then its scalability is effectively limited to 0.1 million CPU cores.

In this dissertation, we propose a multidimensional MPI message queue management mechanism which exploits rank decomposition to considerably mitigate the effects of job size on both speed of operation and memory consumption [118, 119]. We compare the behaviour of the proposed design with two other reference message queue approaches which perform extremely well with respect to either memory or speed scalability but could quickly become unusable for large jobs because they are not two-fold scalable. We show in particular that the proposed design is able to not only offer unbounded message queue processing speedup, but in certain situations even outperform the reference memory-scalable message queue approach in terms of memory footprint.

4.1 Related Work

Various flavours of message queue implementations exist outside the MPI middleware. MPI over Portals implementations are mentioned for which the UMQ and the PRQ exist in NIC memory [14]. The Cray MPI [30] implementation offloads most of its receive-side matching operations onto the Portals network infrastructure [6], on which it is layered. However, according to the Portals specification draft [6], a resource exhaustion risk exists for large message queues. MPI libraries over previous versions of Portals abort the application when queue resource exhaustion occurs. In the current version 4 of Portals, this situation is handled by dropping packets; and the end result is a slowdown.

Quadrics [81], which is a discontinued commodity HPC network device, and Myrinet [107] also support MPI message matching by offloading it onto their NICs. They resort to on-NIC threads and memory. For the InfiniBand [44] interconnect, TupleQ [53] has been proposed, where a verb-level SRQ is created for each ⟨contextId, rank, tag⟩ tuple to match messages in hardware. As previously mentioned in Section 3.1.4, TupleQ was not meant for queue processing but for handling the point-to-point Rendezvous protocol in hardware. For a receiver process P, the SRQ requirements of TupleQ grow in O(c × r × t), with c, r, and t respectively being the number of contextIds, ranks and tags which have ever reached P. The issue with TupleQ resides in both the cubic growth pattern and the resources consumed by SRQs, which are kept forever even when they are never reused. TupleQ is therefore not suitable for large applications, even when there is no queue buildup.

A hardware proposal has also been reported in [110], where a NIC has been modified to process the queue operations with an associative list processing unit. The NIC is equipped with a local static random access memory. The tests presented in that work are only simulations made on a custom stripped-down implementation of MPI-1.2. Finally, a NIC-associated accelerator has been proposed [103] to store message headers or entire messages (in the case of small messages) in a low-latency dedicated buffer. Unfortunately, the time to process long UMQs and PRQs on embedded processors has been reported to be substantially longer than with regular host CPU-based implementations, simply because these embedded processors are slower [14, 110, 109]. More importantly, NIC or accelerator memory is usually a very limited resource which can become a scalability barrier when large queues build up [103].

As a purely software solution, hash tables have been proposed to handle the message queue operations [73, 93]. They are however reported to have prohibitive insertion times [110]. The work in [103] even mentions that hashing can actually have a fairly negative impact on communication latency in almost all situations. Linked lists are used in MPICH [69] and many of its numerous derivatives. Linked lists do not scale over large jobs when lengthy searches occur. An array-based approach is used in Open MPI [79]. While arrays are fast, they are not memory-scalable.

4.2 Motivations

4.2.1 The Omnipresence of Message Queue Processing in MPI Communications

In order for communication to proceed without creating any message queue items, the concept of reception must be totally absent; that is, messages should be able to reach their final destination without any action from the receiving process. In fact, as soon as reception is required, messages will necessarily be queued up because it becomes impossible, without synchronizing before each communication, to make sure that each receive is posted ahead of the arrival of its matching sent message.

The prevalence of message queues in two-sided communications is obvious from a purely semantic point of view. Two-sided communications explicitly necessitate receptions; and synchronization is not always required between the communicating peers. As for collective communications, prior to MPI-3.0, they were necessarily blocking and therefore auto-synchronizing between any two-peer subset of involved processes. As a result, they offered room for limiting or even avoiding message queue buildup. As of MPI-3.0, the introduction of nonblocking collectives has removed the aforementioned auto-synchronization.

The MPI one-sided operations form the only communication model that comes close to avoiding the message queues. In theory, MPI RMA could entirely bypass message queue buildup because of the total absence of reception in its semantics. However, certain MPI libraries such as MVAPICH internally leverage their existing two-sided implementations for one-sided control message transfers; leading to message queue buildup even for MPI RMA.

A trend that is raising message queue prevalence even further is the need to move towards multithreaded MPI and hybrid use of MPI with other paradigms and languages [114] at large scales [3, 8]. An example of one such paradigm is the Intel Threading Building Blocks [46]. Message progression in an MPI process is a global activity; leading to multithreaded MPI processes having centralized message queues shared among all the threads. Therefore, multithreaded MPI communications increase the occurrence of out-of-order message discovery, which increases the probability of a given thread discovering from the network device more messages meant for other threads. Those messages must be put in the UMQ until they are claimed by the right thread.

4.2.2 Performance and Scalability Concerns

MPI performance tuning strategies must be concerned with message queues, which accounted for up to 60% of communication latency in certain tests [109]. While ensuring fast communications is an obvious requirement for MPI, memory consumption is another issue that matters at any scale. At runtime, MPI is viewed as a layer mostly meant to merely transfer the data manipulated by the actual HPC application. As a result, the HPC process does not expect to lose a substantial amount of its available memory to its communication substrate. MPI gains by being lean at large scales to give room to the potentially large amount of application data.

In order to speed up message queue processing, hardware-based approaches have been attempted. Unfortunately, accelerator or NIC processors have been reported to be slower than host CPUs [14, 110, 109] for message queue processing. More importantly, hardware solutions are inherently non-scalable. The main issue is resource limits [103]. Unlike host-based approaches, which can access a virtually limitless amount of memory, embedded processors are limited to their accessible physical memory, which is usually far smaller than host RAM. In situations of embedded memory exhaustion, hardware solutions become a scalability bottleneck. Solutions like embedded memory over-provisioning are expensive; but more importantly, they do not scale across generations of system sizes.

The fastest possible software approach could host the MQIs in a fixed-size array for O(1) accesses. One can trivially notice that resizable arrays suffer performance degradation due to reallocation and data copies. Therefore, any mention of array refers to the C-style fixed-size variant. Array indexing requires a contiguous key whose maximum value is known ahead of its allocation. Out of the message queue search key tuple, only the rank fulfils these conditions. The rank always ranges contiguously from 0 to n − 1 for any communicator of size n. In comparison, the MPI standard does not specify any starting and ending value for contextIds. As for the tag, it bears no contiguity constraint but, more importantly, it ranges from 0 to MPI_TAG_UB, which can be prohibitively large because it reaches INT_MAX in many MPI libraries [79, 71].

The resource issues inherent to allocating an array of two billion slots per communicator per process are obvious; tags are therefore not exploitable. In any case, empirical observations show that message queues grow mostly with job sizes [12, 13, 14, 49], making a rank-based optimization the most promising.

Unfortunately, the once-for-all allocation scheme of arrays corresponds exactly to a linear degradation of memory consumption and is therefore not scalable. This allocation scheme is not even correlated with the actual message queue needs; but with the communicator size.

Unless each process in the job exhibits a fully connected communication pattern, where each process talks to every other process in the communicator, most indices of the array will never be used. We emphasize that well-crafted MPI programs avoid the fully-connected communication patterns as much as possible to avoid their inherent scalability problems. Instead, most good MPI programs prioritize localized small sub-groups of communications. Furthermore, the current and next generation HPC programs are becoming more and more demanding in terms of development time and performance tuning. These programs are crafted to solve more complex problems. As a consequence, as much as possible, new HPC programs are composed with existing time-tested, feature-rich and highly tuned domain-specific reusable HPC parallel libraries such as PETSc [83] and Blast [100]. A strongly recommended good practice [41] requires those reusable HPC libraries to duplicate MPI_COMM_WORLD in order to isolate their internal communications from the application and other libraries in the same program. As a reminder, MPI_COMM_WORLD is a predefined communicator that always exists by default in every MPI job; and its size is the job size. Depending on the number of active parallel libraries, a simple MPI job can therefore have a significant number of job-size communicators; each having at least one contextId. As a result, very large petascale and upcoming exascale MPI jobs will waste a large amount of scarcely used memory if array-based message queues are adopted.

However, system growth is generally being accompanied by a certain reduction of the amount of memory per core [3]. It is acceptable for the overall memory consumption of the whole job to grow linearly with the system size; this is a case of a growing demand coupled with a growing resource. However, it is not practical for the memory consumption per process or per CPU core to grow linearly with the system size. For instance, consider an HPC job of n processes, whose overall dataset memory requirement D is the maximum the system can reasonably host in its aggregated RAM. Assuming that the data is distributed equally over the processes, each process needs an amount of memory d = D/n. When both the job size and the dataset grow 10 times, each process is still expected to host approximately the same d = (10D)/(10n). The memory consumption pattern of an array-based message queue nullifies this expectation because the available RAM per process becomes smaller and smaller than d when the job size grows.

Purely on-demand memory allocation achieves the best memory scalability behaviour.

Linked list-based approaches achieve this second form of scalability as much as possible. However, they exhibit a linear degradation with respect to speed of processing. For instance, it is reported in [4] that the linear traversal of a message queue of 4095 items took up to 140 ms on a Blue Gene/P system. Linear searches are simply not acceptable at petascale and exascale. With O(n) searches and at fixed CPU core speed, linked list-based MPI message queue searches, and consequently communications, simply get slower on the exact same CPU core when the system size increases. Aggravating matters is the need for next generation supercomputers to become substantially more energy-efficient. Power issues are one of the major challenges mentioned in [8] for the roadmap towards exascale systems. Prominent supercomputer builders such as IBM have even long opted for slower but more energy-efficient CPUs [4]. This trend means that slow queue processing will get even slower in HPC processes doing exactly the same job.

4.2.3 Rethinking Message Queues for Scalability by Leveraging MPI’s Very Characteristics

We present a new MPI message queue design which is simultaneously fast and resource-conscious even at large scales. Our approach draws its advantages from a few domain-specific observations. In particular, on top of observing the properties already mentioned for the rank, we noticed that MPI message queue operations strictly follow the following patterns:

1. Search PRQ and Delete MQI if found; otherwise Insert new MQI into UMQ.

2. Search UMQ and Delete MQI if found; otherwise Insert new MQI into PRQ.

3. Search UMQ and Delete MQI if found.

4. Search PRQ and Delete MQI if found.

5. Search UMQ (and simply return search status; do not delete).

The first two scenarios occur for regular communications, including those, such as control message communications, that happen inside MPI without being initiated by the application. They are by far the most common cases. Scenario 3 can happen when MPI_CANCEL is called from the sender side. MPI_CANCEL is a routine which tries to cancel an already issued communication. When MPI_CANCEL is called, control messages meant for the sender are retrieved, if they exist, and discarded. In the new MPI-3.0, scenario 3 can also happen when MPI_MPROBE and MPI_IMPROBE are invoked. Scenario 4 happens when MPI_CANCEL is issued from the receive side. Scenario 5 occurs for MPI_PROBE and the new MPI-3.0 MPI_IPROBE. The probe family of MPI routines allows a receive-side process to peek into the MPI middleware in order to check for the arrival of a particular message, without retrieving the message to the application level.

Our design only has to be optimized for the above-listed patterns. We first observe that isolated insertions and deletions never occur in MPI message queue operations. We also observe that searches are part of every single use case. Consequently, a goal of the design strategy is to fundamentally leverage the search that invariably precedes the other operations as a strength for those operations. For instance, use cases 1 and 2 prompt us to avoid using two separate data frameworks for the PRQ and the UMQ. By hosting them in the same searchable object, a search in the UMQ (use case 2) can serve as a free ride towards the insertion point where a new MQI would be put in the PRQ if the match is not found in the UMQ. The same reasoning applies for use case 1.
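As a plain illustration of that free ride, a single structure can host both lists so that a UMQ miss directly becomes a PRQ insertion, without a second traversal. The skeleton below is illustrative only; it omits wildcards and the actual multidimensional layout, in which the shared object is reached through the rank decomposition introduced in Section 4.3.

    /* Sketch of use case 2: one searchable object hosts both queues, so a UMQ
     * miss immediately becomes a PRQ insertion (wildcards omitted for brevity). */
    struct mqi {
        struct mqi *next;
        int rank, tag;          /* contextId omitted: one such object per contextId */
        /* ... payload / request bookkeeping ... */
    };

    struct msg_queues {
        struct mqi *umq;        /* unexpected messages */
        struct mqi *prq;        /* posted receives     */
    };

    /* Called when a receive is posted: returns the matched unexpected MQI, or
     * NULL after the new receive has been inserted into the PRQ. */
    struct mqi *post_receive(struct msg_queues *q, struct mqi *recv)
    {
        for (struct mqi **pp = &q->umq; *pp; pp = &(*pp)->next) {
            if ((*pp)->rank == recv->rank && (*pp)->tag == recv->tag) {
                struct mqi *found = *pp;
                *pp = found->next;          /* search UMQ and delete if found */
                return found;
            }
        }
        recv->next = q->prq;                /* otherwise insert into the PRQ */
        q->prq = recv;
        return NULL;
    }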

Our main speed of processing goal is to perform localized searches by skipping altogether large portions of the message queue for which the search is guaranteed to yield no result. The new queue design is therefore focused on making those unfruitful portions easy to identify deterministically, without error. To achieve that purpose, we use a multidimensional approach which allows the exploitation of a coordinate mechanism. The coordinate mechanism makes jumps that successively narrow down the search space after the traversal of each dimension. A jump is the mechanism of reaching a coordinate along a dimension. In n dimensions, the rank is transformed into the coordinate (c_{n−1}, c_{n−2}, ..., c_0). We define dimensionSpan as the maximum number of distinct coordinate positions on each dimension.

We use the same dimensionSpan on all the dimensions. The rank is expressed as:

rank = Σ_{i=0}^{n−1} c_i × dimensionSpan^i    (4.1)

To explain the efficiency of the jump mechanism, let us consider a communicator with 1000 ranks. The ranks range from 0 to 999. A linear search could have to scan all 1000 ranks before finding an MQI of interest. If the dimensionality is 3, then dimensionSpan is 10. Any rank can be found by scanning at most 10 positions on each dimension; leading to 10 + 10 + 10 = 30 rank traversals in the worst case; instead of 1000 for a vanilla linear linked list.

Our secondary speed of processing goal is to ensure that any chunk of linear traversal required in the proposed data structure remains very short. In the previous example with 1000 ranks, linear searches never exceed 10. Linear searches, as used in the linked list-based design, are actually not unacceptable when they are short; they become harmful only when they are long.

Our memory consumption behaviour goal is to steer away from any fixed full-provision memory allocation scheme, as used in the array approach, which becomes impractical at large scales. We seek to allocate structural objects to organize hosted MQIs in a search-friendly manner. Those structural objects are not only allocated on demand but they are used in a way that provides a quick amortization of the memory that they consume.

4.3 A New Message Queue for Large Scales

This section exposes our new scalable message queue design for MPI. In the rest of the section, we use the term queue to designate any data structure meant to host MPI message queue items.

As a reminder, there are few constraints on contextIds and tags; making them difficult to reason about for efficient resource management. The reasoning which sustains our proposal is mostly built over the rank because 1) it is the principal message queue growth factor [12, 13, 14, 49]; and 2) it bears the already mentioned constraints of contiguity and upper value. For message queue management purposes, we represent a contextId with a ContextIdLead object which is attached to its own message queue object.

4.3.1 Reasoning about Dimensionality

High dimensionality tends to favour, in theory, larger speedups for extremely large searches.

A concrete example with a communicator of fixed size 10^16 follows. 10^16 is not a realistic communicator size; it is purely meant to show very easily the effect of higher dimensions. We choose only power-of-two dimensionalities so as to make the computations straightforward. With 2 dimensions, dimensionSpan is 10^8. The worst-case traversal cost for this 2D structure would be dimensionSpan + dimensionSpan = 2 × 10^8; and the speedup is 10^16/(2 × 10^8). With 4D, the speedup is 10^16/(4 × 10^4); and with 8D, the speedup is 10^16/(8 × 10^2). We notice that the speedup in the worst case increases very fast with the dimensionality. There are however a few problems with using high dimensionality. The first one is that higher dimensionality has a higher fixed search cost. The fixed search cost is the portion of the search latency which must be incurred without regard to where the sought item is located and how sparsely the queue is populated. Because of that fixed cost, high dimensionality can be detrimental for shallow search depths or searches in sparsely-populated queues. When the dimensionality increases, chunks of linear traversals are shorter but there are more of those chunks; and unfortunately, each dimension must always be visited. With each dimension represented as an ordered linked list, the rank 1, for instance, is expressed as 0 × dimensionSpan^1 + 1 × dimensionSpan^0 in 2D; and this expression means that the axis representing dimensionSpan^1 must nevertheless be visited at position 0. In n dimensions, all the axes representing dimensionSpan^{n−1} down to dimensionSpan^1 must still be traversed for rank 1. As a reminder, each dimension ranges from 0 to dimensionSpan − 1; meaning that a coordinate of 0 on an axis does not mean that the dimension is skipped during the traversal. Furthermore, how many coordinate objects are deleted or created following insertions or deletions depends on the dimensionality. Actually, each (discrete) coordinate on each dimension is an object. For instance, if a rank decomposes into (c_{n−1}, ..., c_4, c_3, c_2, c_1, c_0) in an n-dimension space, its insertion will lead to the creation of an object at position c_i along the axis representing dimensionSpan^i if no MQI having that coordinate for that dimension currently exists in the queue; and this must happen for every dimension. The deletion of an MQI must also trigger the deletion of every coordinate object that the disappearing MQI was the only one to bear. Therefore, we chose 4 dimensions to considerably mitigate these effects on shallow searches, insertions and deletions. As mentioned in Chapter 7, the exploration of dimensionalities other than 4 is planned for future work.
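In general terms, with N ranks and k dimensions, each dimension being an ordered linked list of at most N^(1/k) positions, the arithmetic of the example above generalizes to:

scans_max(k, N) = k × N^(1/k),    speedup(k, N) = N / (k × N^(1/k))

For N = 10^16 this gives 2 × 10^8 scans in 2D, 4 × 10^4 in 4D and 8 × 10^2 in 8D, matching the figures above, while the fixed cost of visiting every one of the k dimensions grows with k.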

4.3.2 A Scalable Multidimensional MPI Message Sub-queue

The 4D data structure described in this section is only meant for large message queues. We split the rank into 4 slices (Figure 4.1(a)). dimensionSpan is a power of two in order to allow fast bitwise decompositions. dimensionSpan is 4, 8, 16 or 32 for the intervals [1, 256], [257, 4096], [4097, 65536] and [65537, 1048576], respectively. The encoding of dimensionSpan takes respectively 2, 3, 4 and 5 bits for each of those intervals. The intervals can keep increasing. In general, each additional bit of dimensionSpan increases the maximum covered communicator size 16-fold.

The rank is decomposed into the coordinate quadruple (c_3, c_2, c_1, c_0) such that:

rank = c_3 × dimensionSpan^3 + c_2 × dimensionSpan^2 + c_1 × dimensionSpan + c_0

For a communicator of size size, dimensionSpan is defined by the formula:

dimensionSpan = 2^⌈log_2(size^(1/4))⌉    (4.2)

In Equation 4.2, the component inside the ceiling function tends towards the number of bits of encoding of dimensionSpan when the communicator size tends towards the upper limit of the interval covered by dimensionSpan. The secondary speed goal is built upon dimensionSpan.

For any given communicator size, dimensionSpan determines the typical length of chunks of linear searches inside our proposed data structure. Since dimensionSpan is always reasonably small and grows very slowly when the communicator size increases, linear search lengths are kept under control.
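A sketch of Equation 4.2 and of the bitwise decomposition it enables is given below; the function names are illustrative, and the integer loop is simply an exact way of evaluating the ceiling in Equation 4.2.

    /* Sketch: dimensionSpan per Equation 4.2, computed exactly with integers,
     * and the bitwise split of a rank into (c3, c2, c1, c0). */
    static unsigned dimension_span(unsigned comm_size)
    {
        unsigned long long span = 4;          /* smallest span used by the design */
        while (span * span * span * span < comm_size)
            span <<= 1;                       /* one more bit covers 16x more ranks */
        return (unsigned)span;
    }

    static void decompose_rank(unsigned rank, unsigned span, unsigned c[4])
    {
        unsigned bits = 0;
        while ((1u << bits) < span)
            bits++;                           /* bits = log2(dimensionSpan) */
        unsigned mask = span - 1;

        c[0] = rank & mask;                   /* least significant slice (not a jump) */
        c[1] = (rank >> bits) & mask;
        c[2] = (rank >> (2 * bits)) & mask;
        c[3] = (rank >> (3 * bits)) & mask;   /* selects the Cube */
    }

For instance, with dimensionSpan = 32, rank 70000 decomposes into (c_3, c_2, c_1, c_0) = (2, 4, 11, 16), since 2 × 32^3 + 4 × 32^2 + 11 × 32 + 16 = 70000.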

We had two design choices. In the first approach, all 4 coordinate components are used in the jump process to reach a pair of limited-length UMQ and PRQ where all the MQIs bear the same rank and only the tag varies. This first approach is beneficial if processes usually maintain several pending messages, resulting in several MQIs having the same rank value and different tag values. Actually, when the same rank can appear numerous times in the MQIs, a linked list in which only the tag varies can already be long enough to justify one such list per rank. If each rank has a very limited number of pending MQIs, a reasonable number of distinct ranks can share the same linked list without creating long and expensive linear traversals. The second approach uses that observation and puts all the MQIs whose ranks vary only by the least significant coordinate in the same linked list of the UMQ or PRQ. As a reminder, each slice of the rank decomposition can only take on dimensionSpan distinct values; meaning that even for a communicator of size 1048576, a maximum of only 32 distinct ranks can be in the aforementioned linked lists.

In this second approach, which we adopt because it is more memory-efficient, the three most significant coordinates c_3, c_2, c_1 of the rank participate in the jump process (Figure 4.1(b)); the least significant coordinate is ignored. We define a Cube as the object reached by the most significant coordinate c_3, represented in black in Figure 4.1. We define a JumpPoint as the object uniquely addressed by the triplet (c_3, c_2, c_1). A JumpPoint is an object which contains a pair of limited-length PRQ and UMQ where ranks can vary only by their c_0 coordinates.
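The structural objects just described could be laid out along the following lines; this is an illustrative skeleton only, with a linked list per jump dimension as in the baseline design (the array variant discussed later in this section would replace some of those lists).

    /* Illustrative skeleton: a Cube is reached by c3, a JumpPoint by (c3, c2, c1),
     * and each JumpPoint carries the short UMQ/PRQ pair in which only c0 and the
     * tag still vary. */
    struct mqi;                               /* message queue item (see earlier sketch) */

    struct jump_point {
        unsigned c1;
        struct jump_point *next;              /* list along the c1 axis           */
        struct mqi *umq;                      /* short lists: ranks differ by c0  */
        struct mqi *prq;
    };

    struct c2_coord {
        unsigned c2;
        struct c2_coord *next;                /* list along the c2 axis           */
        struct jump_point *jump_points;
    };

    struct cube {
        unsigned c3;
        struct cube *next;                    /* list of allocated Cubes          */
        struct c2_coord *c2_coords;
    };

    /* At most dimensionSpan positions are scanned on each of the three jump axes. */
    static struct jump_point *find_jump_point(struct cube *cubes,
                                              unsigned c3, unsigned c2, unsigned c1)
    {
        for (struct cube *k = cubes; k; k = k->next)
            if (k->c3 == c3)
                for (struct c2_coord *s = k->c2_coords; s; s = s->next)
                    if (s->c2 == c2)
                        for (struct jump_point *j = s->jump_points; j; j = j->next)
                            if (j->c1 == c1)
                                return j;
        return NULL;
    }

Every one of these structural objects would be allocated only when a first MQI bearing the corresponding coordinate is inserted, in line with the on-demand allocation goal stated earlier.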

Figure 4.1: 4-dimensional sub-queue with dimensionSpan = 32. (a) Rank decomposition; (b) 4D data structure.

Speedup and Dimension Representation

For the sake of conciseness, we choose a single dimensionSpan (e.g., 32) to explain the rest of the design. With each dimension represented with a linked list, let us consider a search in a queue of 1048576 distinct ranks; each having a single pending message. Even in the worst case, the sought item is found in 32+32+32+32 = 128 scans; leading to a speedup of 1048576/128

= 8192. Such a large speedup is entirely tributary to the dimensional decomposition. If processes have more than a single pending message, thefirst three dimensions are still searched in a maximum of 32+32+32 scans; the last one could be searched in more than 32 scans.

Furthermore, every coordinate on each dimension is allocated only if at least one MQI exists whose decomposition bears that coordinate value on the dimension; leading to a complete on-demand memory allocation scheme.

We can further notice that by representing any of thefirst three dimensions with an array, its scan time could drop from 32 to 1, leading to further speed improvements; this time with a certain memory trade-off. As a reminder, the last dimensionc 0 cannot be represented with an array because it does not participate in the jump process. In particular, using 3 out of the 4

72 coordinates allows the least significant dimension to represent the same coordinate value more than once in situations where the same rank appears with different tag values. The dimension- wise (not overall) speedup yielded by the array use on any dimension depends on how many distinct coordinatesn are usually in use on that dimension. That speedup is actuallyn. If all thefirst three dimensions bear the same number of in-use coordinate values, as far as speed is concerned, the dimension to represent with an array does not matter. In particular, if the search must scan the same numbern(n 32) of positions on each dimension, then representing ≤ anyone of thec 3,c 2 orc 1 dimensions with with arrays respectively lead to (1 +n+n+n), (n+1+n+n) or (n+n+1+n) overall scans.

The likelihood of a dimension having more in-use coordinate values than another one is hard to generalize; it depends on the application and how sparsely or densely its ranks are represented in the queue. It helps to notice that two ranks separated by at least 323 falls in two distinct Cubes. If the separation spacing is at least 322 without being a multiple of 323, the ranks are guaranteed to have two distinctc 2 coordinates. Finally, if the separation is at

3 2 least 32 without being a multiple of 32 or 32 , then the ranks have two distinctc 1 coordinate values. However, while all 4D queues will always have all their dimensionSpan distinctc 2 and c1 coordinates present, most will usually have less than dimensionSpan distinctc 3 values. The number of Cubes present in the 4D data structure is linked to how close the communicator size is to the upper limit of its interval. For instance, with dimensionSpan = 32, the smallest communicator size is 65537; and it requires 65537/(32 3) distinct values ofc , which is 3. This � � 3 means that the communicator is big enough to completelyfill Cubes of coordinatesc 3 = 0 and c3 = 1 and then overflow into the Cube of coordinatec 3 = 2. This also means that all the 32

2 distinctc 2 coordinates and all the 32 distinct c1 objects are in use in Cubes 0 and 1.c 2 and c1 are thus the two dimensions that could potentially benefit from an array representation due to their frequent occurrences.

The choice of array use on any dimension must also consider the resulting memory degra- dation impact. Lower dimensions bear a large memory penalty. For instance, there is a total

3 2 of 32 (that is, 32768)c 1 objects and only 32 (that is, 1024)c 2 objects for a communicator

73 of 1048576 distinct ranks. Therefore, we elect to represent thec 2 dimension with arrays be- cause it bears the smallest memory penalty.c 3 andc 1 are represented with ordered linked lists. The worst case search can then be performed in 32 + 1 + 32 + 32 = 97 scans; for a speedup of 1048576/97 = 10810. We emphasize that while the use of an array for one of the dimensions yields some additional speedup, the bulk of the improvement is still provided by the break-down into dimensions. In this particular case, the array brings an additional speedup of 10810/8192 = 1.32. In general, when the communicator size (size) grows, the speedupS p associated with just the dimensional decomposition tends towards: size lim Sp = lim size size 4 dimensionSpan →∞ →∞ × 3 ( log2(size) log2(4)) lim Sp = lim e 4 × − size size →∞ →∞

lim Sp = (4.3) size ∞ →∞ In comparison, the additional speedup yielded by representing a dimension with an array is asymptotically capped byS p ar cap such that: 4 dimensionSpan 4 Sp ar cap = lim × = = 1.33 (4.4) size 3 dimensionSpan+1 3 →∞ ×

Search Explanation and Example of Early Optimization

To understand how this new message queue achieves localized searches, let us consider an MQI having the rank 80000 in a communicator of 1048576 ranks. The required dimensionSpan is 32.

80000 decomposes into 00010—01110—00100—00000; meaning that its (c3, c2, c1) coordinate is (2, 14, 4). The external search ofc 3 checks the coordinates along the most significant dimension until itfinds the Cube having the coordinate 2. Then, the internal search of the

Cube of coordinate 2first translates into an external search ofc 2 object for coordinate 14; and then for an external search ofc 1 object for coordinate 4. External searches over ordered coordinates (c3 andc 1 ) optimize unfruitful searches. They stop right after the search goes past the sought value and conclude right away that a queue item does not exist. For instance, if coordinate 3 is reached while searching for the Cube; the whole search stops. As for the internal searches, they are performed on at most a single position of each dimension; skipping altogether

74 a large number of irrelevant queue items. In this particular case, 31 32 3 + 31 32 2 + 31 32 × × × items (that is, 1048544 items) are skipped from the irrelevant 31 Cubes, then the irrelevant 31 c2 objects of Cube 2 andfinally the irrelevant 31c 1 objects of (c3, c2) = (2, 14).

Queueing MPI ANY SOURCE

As a reminder, MPI_ANY_SOURCE is a wildcard rank which matches all the sender ranks in the same communicator. When an MPI_ANY_SOURCE receive searches the UMQ, all the ranks must be probed. Then if a match is not found, because the MPI_ANY_SOURCE MQIs do not have a rank that can be decomposed and positioned in the 4D structure, they are positioned in a separate linear sub-queue attached to the ContextIdLead.

4.3.3 Receive Match Ordering Enforcement

The MPI standard requires receivers to match messages coming from the same sender in posted receive order. In order to clarify the ordering constraint, let us designate:

recv (i) as thek th receive call. The message is expected from the senderi. • k

m (i) as thel th incoming message; and its sender isi. • l

x y as a match occurring between itemsx andy. • ↔

If a receiver posts recv0(0), recv1(0), recv2(MPI ANY SOURCE), recv3(0), then the follow- ing examples explain the ordering constraint:

m (0), m (0), m (0), m (0) = the set of matches • 0 1 2 3 ⇒ recv m (0), recv m (0), recv m (0), recv m (0) occurs. { 0 ↔ 0 1 ↔ 1 2 ↔ 2 3 ↔ 3 }

m (0), m (8), m (0), m (0) = the set of matches • 0 1 2 3 ⇒ recv m (0), recv m (8), recv m (0), recv m (0) occurs. { 0 ↔ 0 2 ↔ 1 1 ↔ 2 3 ↔ 3 }

Linked list-based queues naturally enforce the receive matching order when they are always inserted into from the tail pointer and always searched from the head pointer. For our 4D proposal, that order is kept as well, as long as there is no MPI_ANY_SOURCE item. As a reminder,

75 for the 4D design, queue items of the same ranks are always in the same common short linked lists of the same JumpPoint object. For the general case, we add a monotonically increasing sequence number to the key tuple of all the queue items of the same contextId.

Then, each incoming message searches both the ranked PRQ list in the 4D data structure; and the MPI_ANY_SOURCE sub-queue if present. If both yield a match, the ordering is enforced by picking the one with smaller sequence number.

4.3.4 Size Threshold Heuristic for Sub-queue Structures

There is a smallfixed cost for searching any queue item in the 4D structure; this cost is made of decomposing the rank as well asfinding the right Cube and JumpPoint. For insertion, this cost is made of creating a new Cube and/or a new JumpPoint if they were not present before the new queue item could be inserted. Finally, for deletion, a JumpPoint that becomes empty must be removed from the 4D queue; same goes for a Cube that becomes empty. The

4D structure becomes more interesting than the linked list only when searches are (or could potentially be) long enough to overbalance all those smallfixed costs. As a reminder, the 4D container is for large jobs; so this constraint is reasonable.

We cannot predict the average search lengths before searches actually happen. However, we always know how big a communicator is at the time of its creation. The communicator size is the maximum number of distinct ranks that the queue can hold; it is also the maximum queue size if each process has a single pending message. Since search lengths are unknown ahead of time, we use those maximum possible queue sizes. We define a threshold below which the sub-queue associated with a ContextIdLead is merely a linked list (Figure 4.2). If the communicator size is above that threshold, the 4D data structure is instead used as sub-queue

(Figure 4.2). We define the P ointerT raversalLengthT hreshold of a contextId as the number of MQIs of distinct ranks after which the last item in a linear sub-queue would incur more pointer traversals than searching any item in the corresponding 4D container. Then we define the threshold as:

T hreshold= P ointerT raversalLengthT hreshold Adjustment (4.5) ×

76 For communicator of large For communicator of small size, sub-queue is 4D size, sub-queue is linked list

ContextIdLead0 ContextIdLead1 . . . ContextIdLeadn-1

MQI MQI . . . .

MQI ...... MQI Sub-queue ......

Figure 4.2: New overall MPI message queue design

Adjustment incorporates aspects that are unpredictable such as how often small objects

(Cubes, JumpPoints) are created and freed. Furthermore, a linked list traversal is made of a simple loop whose content is quite linear. In comparison, with an equal number of pointer traversals, the 4D structure requires more complex and diverse branchings. Consequently, the threshold cannot be lesser than P ointerT raversalLengthT hreshold. As a result, the minimum value of Adjustment must be 1.0.

For all our tests in this chapter, we used an Adjustment value of 2.0. The value is em- pirically based on microbenchmark-level trial-and-error with average message queue search depth being half the message queue size. Concrete values of the threshold are computed as follows. For instance, when the communicator size is in the interval [65537, 1048576] with dimensionSpan being 32, searching an item would require in the worst case 32 + 1 + 32 + 32 for Cube, c2 object, c1 object and JumpPoint queue scanning. P ointerT raversalLengthT hre- shold is thus 97 and the threshold is 194. That threshold becomes 26, 50 and 98 respec- tively for communicators whose sizes are respectively in the intervals [1, 256], [257, 4096] and

[4097, 65536].

P ointerT raversalLengthT hreshold is not user-modifiable; it is a characteristic of the 4D structure for each communicator size. P ointerT raversalLengthT hreshold is computed auto- matically from the communicator size. Adjustment is user-modifiable. However, it incorpo- rates aspects which are difficult to generalize even with a given MPI library. It accounts for

77 variables such as how often the message queue gets empty. Another variable that could influ- ence Adjustment could be CPU speed. For instance, a 2.2 GHz Xeon is probably less sensitive to a certain length of linear traversals than an 850 MHz PowerPC. As a result, everything else being equal (including memory access latency, cache miss ratio, object allocation frequency, etc.) the PowerPC gains by using a low Adjustment so as to switch to the 4D structure earlier.

In general, Adjustment is tied to both the specific application and the specific architecture.

Though, it is important to notice that Adjustment does not really matter at large scales.

Figure 4.3 shows the threshold relation to performance with hypothetical examples. The

CPU consumption growth pattern shown in Figure 4.3 does not matter for the purpose of the example; the curves ”Slow at scale” (SAS) and ”Fast at scale” (FAS) could have been anything but linear. We defineS c ,T a andT h respectively as the actual communicator size, the accurate threshold value and the heuristic-approximated threshold value. As a reminder, all the thresholds are expressed in communicator size; andT h is the approximation expressed by Equation 4.5. In Figure 4.3 , lower CPU consumption is better because it means faster queue operation. The aim is to:

Choose SAS (or the linked list in our particular case) ifS < T . • c a

Choose FAS (or the 4D in our particular case) ifS T . • c ≥ a

SinceT a is not known, the choice of the sub-queue type depends onT h. The following scenarios can then happen:

1.T h < Ta: Consequences:

(a) IfS c < Th: The right sub-queue (SAS) is chosen becauseS c < Ta is true.

(b) IfT S < T : FAS is chosen while SAS should have been chosen. h ≤ c a (c) IfT S : The right sub-queue (FAS) is chosen. a ≤ c

2.T h > Ta: Consequences:

78 n a aetefloigobservations: following the make can One S 2. S when only choice wrong the to leads threshold inaccurate An 1. | T extent The 3. CPU cycle consumption per CPU core T S because cators; siiae oo aacdotb noise. by out balanced or to assimilated T If (b) S If (a) T If (c) a c nwihcs efrac aitosdet h hieo A rFScnee be even can FAS or SAS of choice the to due variations performance case which in T between falling FAS better than SAS performs Region where h a c Variation ofCPUcycle consumption CPUcore per for two c nperformance on effect accuracy threshold Showing 4.3: Figure T < S ≤ S < Slow at scale (e.g. linkedlist) (SAS) Heuristic triestochoosethresholdT threshold in theneighbourhoodofaccurate a a c c h ih u-uu SS schosen. is (SAS) sub-queue right the : h ih u-uu FS schosen. is (FAS) sub-queue right The : T − T < c T both beyond far be will h h o h hehl ncuaywudntmte o ag communi- large for matter not would inaccuracy threshold the of | A scoe hl A hudhv enchosen. been have should FAS while chosen is SAS : a hypothetical message designs queue T and threshold T Accurate h Threshold accuracymatterslesserand S that implies usually Region whereFASperformsbetterthanSAS Communicator size a 79 h

a T and Fast at (e.g.4D) scale (FAS) h c S is, that ; ssili h egbuho of neighbourhood the in still is c T between falls c ilntfl between fall not will a T and h . the two threshold values. Threshold inaccuracies will therefore never have any impact

for the very use-case for which the 4D is being proposed.

4.4 Performance Analysis

This section analyzes the runtime complexity, memory consumption behaviour and pointer traversal intensiveness for the new proposal. For comparison purposes, we establish the same analyses for an array-based message queue design that we consider as the theoretical optimal approach in terms of speed of processing; and a linked list-based approach that we consider as the optimal design in terms of memory consumption behaviour for the same number of MQIs.

The linked list design is simply made of a pair of linked lists; one for the PRQ and another one for the UMQ. Figure 4.4 shows the array-based design which, because it is built over rank numbers, is a per-communicator approach too. The array-based design resorts to ContextI- dLeads as well. Each ContextIdLead associated with a communicator of sizen has an array of n RankHead data structures. The RankHead at positioni in the array is associated with the

MQIs coming from or meant for the process of ranki in the communicator. Each RankHead possesses two pairs of pointers for the head and tails of its UMQ and PRQ. In each of these

PRQ and UMQ, only the tag varies. For instance, for a ContextIdLead whose contextId isc, all the MQIs hosted in the PRQ head of RankHeadi are such that MQI.contextId =c and

MQI.rank =i; but MQI.tag can vary. MQIs bearing MPI_ANY_SOURCE are hosted in a separate linked list-based sub-queue because their rankfield cannot be used as array index. The MPI message ordering constraint is once again enforced with a sequence number approach.

4.4.1 Runtime Complexity Analysis

In this section, we provide a comparative asymptotic analysis of the linked list, the array and our proposed 4D message queue data structures. As stated, the linked list design is the default approach used in MVAPICH [71]. We also implemented the array-based method in MVAPICH for comparison purposes.

ContextIdLeads, Cube and JumpPoint are small objects allocated and freed on demand

80 Pointer toward next ContextIdLead ContextIdLead Pointer toward queue of MPI_ANY_SOURCE

0 1 . . . . . n-1

PRQ head UMQ tail PRQ tail UMQ head

Figure 4.4: Array-based queue design for a communicator of sizen at the frequency of communicator creation/destruction and queue length changes. Cubes and

JumpPoints are the only runtime objects that implement entirely the 4D data structure. There are no other separate runtime objects for each coordinate component.

In all the analysis, we separate search and deletion complexities. Though, we bear in mind that a deletion is always preceded by a search in the very queue which is being deleted from.

Furthermore, insertion in any MPI PRQ always happens after the UMQ has been searched in vain and vice versa. This last behaviour allows our proposed complex data structures to offer insertion inO(1) without breaking its structure. For instance, a queue item of rankr about to be inserted in the PRQ means that the Cube and JumpPoint object corresponding to the coordinate decomposition ofr have already been reached when searching the UMQ. Even if the

Cube or JumpPoint object were not present, their insertion coordinates were at least reached while searching the UMQ.

We designate by:

k the number of currently active contextIds in the considered process. •

n the number of distinct valid ranks in the contextIdj. • j

t the number of queue items associated with the same ranki in any queue of the same • i contextId. In the UMQ,t i designates the number of simultaneously pending unexpected messages coming from the process with ranki and meant for the current process. In the

PRQ,t i is the number of posted receives from the current process and meant to match messages coming from the process with ranki. 81 a the number of MPI_ANY_SOURCE queue items associated with contextIdj. • j

Asymptotic Complexities

The Linked List-based Queue: Because the scanning goes through all thet i queue items of all then j ranks of all thek contextIds, searching the linked list happens in:

k 1 nj 1 − − O t (4.6)  i �j=0 �i=0 This search complexity is essentially made of the Cartesian product of the involved param- eters. If there are MPI_ANY_SOURCE, the complexity becomes:

k 1 nj 1 − − O t +a (4.7)   i j �j=0 �i=0 Both Equation 4.6 and Equation 4.7 are strictly equivalent and express the exact same complexity. It is possible to considera j as thet i associated with the special rank MPI_ANY_

SOURCE. In that case, MPI_ANY_SOURCE is considered as one of then j ranks; forcing Equation 4.7 to degenerate to Equation 4.6.

The worst possible search scenario for the linked list approach happens when the sought queue item is at the tail of the queue. It also occurs when the queue item does not exist in the queue; the queue in this case must be traversed entirely just to realize that the search is unfruitful. Deletion always happens at the found position and is therefore:

O(1) (4.8)

Insertion isO(1) as well because it always happens at the tail of the queue.

The Array-based Queue: As shown in Figure 4.5, a search always requires the traversal of the ContextIdLeads inO(k), the access to the rank slot in the array inO(1) and then the traversal of the queue items associated with the same ranki inO(t i). The UMQ and PRQ search complexity is thus:

O(k+t i) (4.9)

In the data structure shown in Figure 4.5, we assume for each rank slot that the left linked list is the PRQ and the right one is the UMQ.

82 Whenever an MPI_ANY_SOURCE receive is issued, all the UMQ queue items associated with all the ranks must be probed (Figure 4.6). The UMQ search complexity is thus: n 1 j − O k+ t (4.10)  i �i=0  

As mentioned earlier, MPI_ANY_SOURCE queue items build a separate queue of lengtha j attached to the ContextIdLead. In that case, for any incoming message, the MPI_ANY_SOURCE sub-queue must be searched on top of the linked list associated with the rank of the message

… …. … ContextIdj ContextIdj+1 ContextIdk-1

0 1 . . . . . i . . . . . n-1

PRQ/UMQ all bearing same rank and same contextId

Figure 4.5: Array-based queue search in absence of MPI ANY SOURCE

… …. … ContextIdj ContextIdj+1 ContextIdk-1

0 1 ...... n-1

Figure 4.6: Array-based UMQ search for MPI ANY SOURCE receive

83 envelope (Figure 4.7). The PRQ search complexity thus becomes:

O(k+t i +a j) (4.11)

MPI_ANY_SOURCE sub-queue

… …. … ContextIdj ContextIdj+1 ContextIdk-1

0 1 . . . . . i . . . . . n-1

Figure 4.7: Array-based PRQ search when at least one MPI ANY SOURCE receive is pending

Deletion always happens at the position where a sought queue item is found. Deletion is thereforeO(1). Insertion, no matter where it happens, isO(1) as well because new queue items are simply appended to mere linked lists as shown in Figure 4.8.

MQI being inserted in MPI_ANY_SOURCE sub-queue

… …. … ContextIdj ContextIdj+1 ContextIdk-1

0 1 . . . . . i . . . . . n-1

MQI being inserted

Figure 4.8: Possible insertion points in the array-based structure

84 The 4D Data Structure Queue: The search procedure in absence of MPI_ANY_SOURCE is

depicted in Figure 4.9. The right ContextIdLead is found inO(k). Then the rank is decomposed

in four slices of dimensionSpan each. As a reminder (Equation 4.2), dimensionSpan=

log2( √4 nj ) 2� � ; withn j being the size of the communicator associated with the contextIdj. The

log2( √4 nj ) right Cube is found inO 2� � (Figure 4.9(a)). Then, the rightc 2 coordinate is found � � inO(1) because it is an array index in the Cube (Figure 4.9(b)). Thec 1 coordinate is found

log2( √4 nj ) inO 2� � (Figure 4.9(c)); and the JumpPoint is reached. For both the PRQ and the UMQ,� the JumpPoint� contains a maximum of dimensionSpan distinct ranks as well; with the

ranki havingt i queue items (Figure 4.9(d)). The JumpPoint traversal therefore happens in log ( 4 n ) 2� 2 √ j � 1 O i=0 − ti . The overall search complexity for both the UMQ and the PRQ is thus: � � � log ( 4 n ) 2� 2 √ j � 1 − log2( √4 nj ) log2( √4 nj ) O k+2 � � + 1 + 2� � + t  i �i=0 or simply  

log ( 4 n ) 2� 2 √ j � 1 − O k+ t (4.12)  i �i=0   When a posted receive bears MPI_ANY_SOURCE, once the ContextIdLead is found, the UMQ

of all existing JumpPoints must be probed. This corresponds to searching all thec 2 andc 1 coordinates of all the Cubes. While the complexity of that UMQ search for MPI_ANY_SOURCE

can be derived from Figure 4.9 as well, one can notice that searching all the UMQs of the data

structure comes down to searching all thet i UMQ items of all then j ranks in the communicator.

nj 1 Such a search simply has a complexityO i=0− ti . The UMQ search complexity when an � nj� 1 MPI_ANY_SOURCE receive is posted is thusO �k+ i=0− ti ; which is the same as Equation 4.10 for the array-based data structure. � � �

Finally, when at least one MPI_ANY_SOURCE queue item is pending, the PRQ search performs

all the operations depicted in Figure 4.9 and then searches the existing MPI_ANY_SOURCE sub-

queue inO(a j) (Figure 4.10). The PRQ search complexity in that case thus becomes:

log ( 4 n ) 2� 2 √ j � 1 − O k+ t +a (4.13)   i j �i=0     85 … … … ContextIdj ContextIdj+1 . ContextIdk-1

dimensionSpan-1 …. 11111111111111111111 c3 …. 1 0

(a) Reaching the rightc 3 coordinate

c2 coordinates are mere array indices

11111111111111111111 c3 0 1 . . . c2 . . . dimensionSpan-1

(b) Reaching the rightc 2 coordinate

11111111111111111111 c3 0 1 . . . c2 . . . dimensionSpan-1 0

c1

dimensionSpan-1

(c) Reaching the rightc 1 coordinate

PRQ c3 0 1 . . . c2 . . . dimensionSpan-1 0

c1 Zoon in Y Jump point

dimensionSpan-1 UMQ

(d) Finding the MQI in limited linked lists of JumpPoint object

Figure 4.9: 4D structure queue search in absence of MPI ANY SOURCE

86 MPI_ANY_SOURCE sub-queue

… …. … ContextIdj ContextIdj+1 ContextIdk-1

4-D

Figure 4.10: 4D structure PRQ search in presence of MPI ANY SOURCE

Deletion happens at the point where the queue item is found and is thereforeO(1). Insertion of a new queue item isO(1) as well, and can be explained as follows (Figure 4.11). MPI_ANY_

SOURCE queue items are simply appended to the end of the MPI_ANY_SOURCE sub-queue. If the queue item has a specific rank instead, its insertion always happens at the point where the previous search failed. The previous unfruitful search failed because the rank decomposition showed that either 1) the sought Cube is inexistent; 2) the Cube exists but the JumpPoint object is inexistent because thec 2 orc 1 coordinates yielded by the decompositions are currently unused; or 3) both the Cube and JumpPoint exist but the queue item does not. In thefirst case, a new Cube and JumpPoint object are created and inserted before the queue item. In the second, the JumpPoint object is inserted and then the queue item is. In the third case, the queue item is appended right away to the right PRQ or UMQ of the JumpPoint where the search failed. Each of the aforementioned operations isO(1).

MQI being inserted in MPI_ANY_SOURCE sub-queue

… …. … ContextIdj ContextIdj+1 ContextIdk-1

4D New Cube if required MQI being inserted

New JumpPoint object if required

Figure 4.11: Possible insertion points in the 4D structure

Runtime Complexity Analysis Summary: The runtime complexities are summarized in

Table 4.1. A few observations stem from the complexities: 87 1. When there are more than a single contextId (i.e.,k> 1), the runtime search com-

plexity of the linked list design has a cubic behaviour in terms of the used parameters

( Equation 4.6, Equation 4.7). In comparison, the array-based approach and the 4D

proposal remain entirely quadratic without regard to the presence of MPI_ANY_SOURCE

(Equation 4.9 – Equation 4.13).

2. When there is no MPI_ANY_SOURCE, the parametern j which usually impacts the queue operation times the most is absent from the array-based approach (Equation 4.9) for

both the UMQ and the PRQ. In the 4D approach, this parameter has a logarithmic

behaviour (Equation 4.12), meaning that its impact on the cost grows more and more

slowly when scale increases.

3. When there are MPI_ANY_SOURCE, the PRQ search in the linked list must potentially

scan all the MPI_ANY_SOURCE queue items of all the existing contextIds (Equation 4.7).

In comparison, the PRQ search of the array (Equation 4.11) and the 4D structure (Equa-

tion 4.13) grows only by the number of MPI_ANY_SOURCE queue items pending for the

sole contextId of interest.

4. More generally, as shown by the parameterk for searches, each new communicator adds a

single pointer visitation to the array (Equation 4.9 – Equation 4.11) and the 4D structure

nj 1 (Equation 4.10, Equation 4.12, Equation 4.13) while it adds the quadratic term i=0− ti

nj 1 or i=0− ti +a j to the linked list (Equation 4.6, Equation 4.7). � �� � Finally, none of the complexities is impacted by the use of MPI_ANY_TAG.

Intuitive Grasp of Runtime Costs

Asymptotic complexities do not always allow an intuitive grasp of comparative costs. We

therefore propose to quickly pass the three approaches through a test performed in [4]. The

original test, based on 4096 processes, searches for an item which is the last one in a queue

of 4095 items. We use a variant of the test where each process has a single queue item in the

queue. Assuming a single communicator, the number of pointer operations required for the

search is shown in Table 4.2 for each of the message queue architectures. 88 Table 4.1: Runtime complexity summary

PRQ/UMQ search without k 1 nj 1 O − − t Eq. 4.6 MPI ANY SOURCE j=0 i=0 i � � Linked list PRQ/UMQ search with �k 1 � nj 1 O − − t +a Eq. 4.7 MPI ANY SOURCE j=0 i=0 i j Insertion O(1)�� ��� � �� Eq. 4.8 Deletion O(1) Eq. 4.8 PRQ/UMQ search without O(k+t ) Eq. 4.9 MPI ANY SOURCE i Array UMQ search with n 1 O k+ j − t Eq. 4.10 MPI ANY SOURCE i=0 i PRQ search with � � O(k+t �+a ) Eq. 4.11 MPI ANY SOURCE i j Insertion O(1) Eq. 4.8 Deletion O(1) Eq. 4.8 log ( 4 n ) PRQ/UMQ search without 2� 2 √ j � 1 O k+ − t Eq. 4.12 MPI ANY SOURCE i=0 i � � 4D UMQ search with �nj 1 O k+ − t Eq. 4.10 MPI ANY SOURCE i=0 i � log �( 4 n ) PRQ search with � 2� 2 √ j � 1 O k+ − t +a Eq. 4.13 MPI ANY SOURCE i=0 i j � � � � Insertion O(1) � Eq. 4.8 Deletion O(1) Eq. 4.8

In the linked list case the search would always scan through all the items. In the array-

based case, the search always scans through three pointers; one for the ContextIdLead, one

for the position of the rank by indexing the array, and the last one for the queue item being

looked for. In the 4D structure, the search scans through one pointer tofind the Contex-

tIdLead, dimensionSpan pointers tofind the Cube, one pointer to reach thec 2 coordinate,

dimensionSpan pointers tofind thec 1 coordinate and dimensionSpan pointers to reach the queue item in the JumpPoint. The goal of this analysis is to show intuitively how slowly the

4D approach degrades compared to the linked list when scales grow. The array-based approach

is communicator size-agnostic.

Table 4.3 shows the same test for 10 communicators, each having the same size. In this case,

the number of pointer operations for both the array-based approach and the 4D proposal grows

exactly by the number of additional communicators. In comparison, the number of pointer

operations for the linked list-based approach grows by the sum of all the items associated with

each of these communicators.

89 Table 4.2: Number of pointer operations required to reach the last queue item for different communicator sizes (case of a single communicator) Communicator size Linked list Array 4D 4096 4095 3 26 65536 65535 3 50 1048576 1048575 3 98

Table 4.3: Number of pointer operations required to reach the last queue item for different communicator sizes (case of multiple communicators) Communicator size Linked list Array 4D 4096 9 4095 + 4095 9 + 3 9 + 26 × 65536 9 65535 + 65535 9 + 3 9 + 50 × 1048576 9 1048575 + 1048575 9 + 3 9 + 98 ×

4.4.2 Memory Overhead Analysis

The memory overhead of the proposed 4D approach is difficult to model concisely in analytic form because it increases by steps, according to dimensionSpan and according to the dis- tribution of the ranks in the queues. In fact the memory overhead of the 4D design has a deterministic trend only when the ranks are represented in a contiguous manner; or when all the ranks have at least one item in the queue.

The memory overhead compared to the linked list is computed as how much more memory is required by the array-based method and the proposed 4D structure to host the same number of queue items. It is important to mention that only the number of distinct ranks potentially impact the memory overhead for both the array and the 4D approach. As a reminder, both the array and the 4D are built over the rank; not the tag. As a result, the memory overhead behaviour is easier to compute only when each process has a single pending message; so as to avoid the same rank showing up more than once. In fact the same rank appearing twice does not increase the number of RankHeads, Cubes or JumpPoints. Consequently, when the same rank appears more than once, the computation has to account for it only once to be accurate.

We designate byn MQI the number of MQIs. For a communicator of sizeS, we proceed to compute the memory consumptionM ll ,M ar andM 4D of the linked list-based, array-based and 4D designs respectively. The linked list design only uses chains of MQIs; thus:

M =n sizeof(MQI) (4.14) ll MQI ×

90 The array design (see Figure 4.4) is made of the ContextIdLead object pointing to an array ofS RankHead objects. Thus:

Mar =sizeof(ContextIdLead)+S sizeof(RankHead)+ × (4.15) n sizeof(MQI) MQI × The 4D design (Figure 4.1 and Figure 4.2) is made of the ContextIdLead object, the numbern JP of JumpPoint objects required to host all the MQIs, and the numbern c3 of Cubes required to host all then JP JumpPoint objects. As a reminder,c 2 andc 1 are actually not runtime objects; they are tied to the Cube and JumpPoint objects. Thus:

M4D =sizeof(ContextIdLead)+n c3 sizeof(Cube)+ × (4.16) n sizeof(JumpP oint) +n sizeof(MQI) JP × MQI × A JumpPoint can contain up to dimensionSpan distinct MQIs; thus, for contiguously dis- tributed ranks:

n = n /dimensionSpan (4.17) JP � MQI � JumpPoint objects have afixed size because they simply host pointers to linked lists of PRQ and UMQ. A Cube can contain up to dimensionSpan3 distinct MQIs; thus, for contiguously distributed ranks:

n = n /dimensionSpan3 (4.18) c3 � MQI �

For the samen MQI , the respective memory overheadsO ar andO 4D of the array and 4D designs compared to the linked list approach are expressed as follows:

O =M M ar ar − ll

O = sizeof(ContextIdLead) +S sizeof(RankHead) (4.19) ar ×

O =M M 4D 4D − ll

O4D =sizeof(ContextIdLead)+n c3 sizeof(Cube)+ × (4.20) n sizeof(JumpP oint) JP × Since thec 2 dimension hosted in Cubes is an array of dimensionSpan slots, Cubes are entities whose sizes depend on dimensionSpan (see Figure 4.1). We designate by Cube(d) the

91 Table 4.4: Runtime object sizes Object Size in bytes ContextIdLead of array 56 RankHead 32 ContextIdLead of 4D 56 Cube(8) 80 Cube(16) 144 Cube(32) 272 JumpPoint 48

programmatic Cube associated with dimensionSpan=d. The runtime sizes of the objects on a 64-bit Linux system are showed in Table 4.4.

Table 4.5 shows the values ofO ar andO 4D for communicators of different sizes. For each size, the results are presented for a few different valuesn MQI of the number of queue items. Wefirst observe that the array-based approach can waste a lot of memory compared to the

4D approach when there are a few MQIs in the queues. An example of such a situation occurs when there are only 1000 queue items in a 1048576-rank communicator queue. The tests also show two interesting behaviours for our 4D method. First, the memory overhead increase is sub-linear in terms of the number of queue items. This behaviour can be observed at afixed communicator size. Second, its increase rate tends toflatten considerably when the communicator size grows. The memory overhead growth rate is smaller for the communicator of size 65536 compared to the communicator of size 4096. That growth is even smaller for the communicator of size 1048576. The two behaviours are very consistent with the scalable comportment we expect to provide. The case when all the queue items of the communicator are represented, the test labelled ”full”, is meant to show the maximum memory overhead for the 4D design. Even in that case, the 4D approach is 5.19 times lighter on memory than the array-based design for a 4096-rank communicator. This ratio becomes 10.54 and 21.19 for

65536 and 1048576 communicator sizes respectively.

We must point out that the previous test assumed contiguous ranks. The 4D mechanism does not behave similarly regarding memory consumption when the ranks are sparsely dis- tributed. In particular, two queue items whose ranks are separated by at least dimensionSpan3 are located in two distinct Cubes; meaning that the maximum number of Cubes of the container

92 Table 4.5: Comparative memory overhead of the array-based approach and the 4D approach Communicator size (S) Number of MQIs (nMQI ) Array overhead (Oar) 4D overhead (O4D) 1 128.05 KB 184 B 100 128.05 KB 760 B 4096 1000 128.05 KB 6.07 KB 4096(i.e., full) 128.05 KB 24.68 KB 1 2.00 MB 248 B 100 2.00 MB 536 B 65536 1000 2.00 MB 3.15 KB 65536(i.e., full) 2.00 MB 194.30 KB 1 32.00 MB 376 B 100 32.00 MB 520 B 1048576 1000 32.00 MB 1.82 KB 1048576(i.e., full) 32.00 MB 1.51 MB can be required for no more than dimensionSpan distinct ranks. Similarly, two ranks sepa- rated by at least dimensionSpan will require two distinct JumpPoints. The maximum memory overhead is reached with dimensionSpan3 ranks each separated by dimensionSpan. When that happens, all the JumpPoints are allocated in the 4D container; this means implicitly that all the Cubes are allocated as well to host those JumpPoints. However, once that maximum is reached, the overhead stops increasing when the other (dimensionSpan 1) dimensionSpan 3 − × additional distinct ranks are added. For instance, with a communicator of size 1048576, the memory overhead of the 4D container reaches 1.51 MB for no more than 323 = 32768 distinct ranks if these ranks are distributed in the fashion 0, 32, 64, 96, . . . , 1048544. How- ever, once that maximum is reached, the memory overhead stops increasing for all the other

31 32 3 = 1, 015, 808 distinct ranks. ×

4.4.3 Cache Behaviour

Cache behaviour can generally depend on locality of access. However, there is little guarantee of any such locality occurring in any of the three message queue architectures. For the 4D data structure, even though Cube and JumpPoint objects are allocated in pools where objects of the same nature (Cube or Jumpoint) occupy contiguous virtual memories, there is no guarantee for two Cubes or two JumpPoints being adjacent when they are actually used in a 4D queue.

Arrays are relatively cache-friendly; however, there is no potential for locality exploitation for a given search becausefinding the rank in the array occurs in a single jump.

In fact, it is difficult to accurately model the message queue cache impact because the

93 HPC job does other data accesses for its own application data and for other middleware-level operations that are not necessarily queue-related. However, everything else kept equal, the absolute number of cache misses in general depends on the number of memory accesses. A certain level of cache performance prediction can therefore be made if an application execution is known to perform intensive message queue activities. The number of queue-related memory accesses tends to be proportional to the number of pointer scans (Table 4.4 and Table 4.5).

In fact, each pointer traversal is potentially a cache miss; especially when the previous pointer is not adjacent in terms of virtual address. Then, dereferencing each pointer to check for its contained data is potentially another cache miss. As the linked list message queue does sub- stantially more memory accesses (pointer scans) for long enough queue searches, it is expected to perform much less than the array and 4D message queues for the same MPI application. The difference of cache performance between the array and the 4D might not be easily noticeable in practice as shown by their small gaps in terms of pointer scans (Table 4.4 and Table 4.5).

4.5 Practical Justification of a New Message Queue Data Struc-

ture

Wefirst remind that hash tables have not been envisioned because they have been proposed before [73, 93] and deemed prohibitively slow for insertions [110]. There are several other existing data structures that allow fast searches. In particular, the large family of tree-based data structures were initially considered as candidates. We did a quite thorough analysis of several of them before considering a totally new container. Non self-balancing trees suffer severe degradations and are not considered in our analysis. While self-balancing trees offer very good performance, they all render insertion and deletion less trivial, for the sake of the balancing that yield the good search performance.

For the sake of justification, we compare how red-black trees, which are a special case of a B-tree, perform compared to our 4D data structure. The choice of red-black trees resides in their well-known performance that justifies their popularity. For instance, they are heavily used in the Linux kernel for schedulers, the high-resolution timer, the ext3filesystem and

94 the virtual memory areas tracking [37]; just to name a few examples. We implement both the 4D and the red-black tree-based message queues offline (that is, not inside MPI). Then, we test both designs by reproducing the message queue activity patterns found in MPI (see

Section 4.2.3). We do not reproduce pattern 2 because it is similar to pattern 1; and we do not reproduce pattern 4 due to its similarity with pattern 3. No MPI_ANY_SOURCE is used because it is not a differentiator between the two designs. In order to encompass all the situations where an approach might have a distinctive advantage over the other one, we generate the following rank insertion and deletion patterns:

Insert in increasing rank order; remove in increasing rank order. •

Insert in increasing rank order; remove in decreasing rank order. •

Insert in decreasing rank order; remove in decreasing rank order. •

Insert in decreasing rank order; remove in increasing rank order. •

Insert iteratively from both ends and remove in the same order. For instance, for 256 • ranks, we insert 0, 255, 1, 254, 2, 253,. . . , 127, 128; and we remove in the same

sequence.

Insert iteratively from both ends and remove in the reverse order. For instance, for 256 • ranks, we insert 0, 255, 1, 254, 2, 253. . . , 127, 128; and we remove in the sequence 128,

127, 129, 126,. . . , 255, 0.

The tests are performed on a node having two quad-core 2 GHz AMD Opteron 2350 processors and 8 GB of RAM. The binaries are created with GNU GCC 4.4.4 and run on an x86 64 Linux kernel 2.6.32. We realize that for both designs, the performance is not impacted by the data insertion and deletion sequence. Consequently, we report only the outcome of thefirst sequence in Table 4.6. A different binary is generated for memory tests to avoid instrumentation impacts on the latency results. Each test is averaged over 3 iterations. For each latency test, we performn insertions, followed byn deletions or searches depending on the pattern; withn being the number of ranks. The tests are performed for one andfive

95 messages per rank to simulate the cases of single and multiple tags respectively. All the tests are performed for a single contextId because contextIds are not a differentiator between the two approaches. In fact, contextIds are searched before any of the queues (4D or tree) is reached.

We can notice that the 4D design is always better than the tree design. More importantly,

Table 4.7 shows that the 4D design is far more memory-efficient than the tree. The memory overhead of the 4D container is computed at runtime with Equation 4.20. However, since the ranks are not necessarily linearly inserted in the queue, the numbersn c3 andn JP of Cubes and JumpPoints are not computed with Equation 4.18 and Equation 4.17; they are rigorously tallied in real-time during execution by means of simple incrementation/decrementation upon allocation/deallocation. With the same reasoning of Section 4.4.2, the memory overhead of the tree is computed as how much more memory it adds, compared to the linked list, to hold the same numbern MQI or queue items. Itsfinal formula is as follows:

O = sizeof(EmptyT reeObject) +n sizeof(T reeNode) (4.21) rb t MQI × For reasons already mentioned in Section 4.4.2, the memory overhead tests are performed for a single message per rank. Trees are better than arrays because they allocate memory only on demand; meaning that they would not waste large amounts offixed memory for large communicators which are sparsely used with respect to message queue. However, they do generate an amount of memory overhead strictly proportional to the number of distinct ranks in the queue. Furthermore, no matter the tree design, simply for the existence of left, right and possibly parent pointers, tree nodes are more expensive memory-wise than each RankHead object used for the array. Consequently, when all the ranks are represented in the message queue, trees, without regards to their particular design, are even more expensive than the array-based approach.

It is important to mention that we did make a design choice for the tree design. Each rank is represented with a tree node; yielding the fastest possible tree-based representation of the message queue. We could have decomposed the rank; and used only a certain number of bits as the tree key. The other ignored bits would lead to linear traversals; as we do with the c0 coordinate inside JumpPoint objects in the 4D design. This approach would have resulted

96 Table 4.6: Comparative speed test (in seconds) between the 4D design and a red-black tree design 4D Tree 1 msg 5 msgs 1 msg 5 msgs Pattern 1 4.95E-04 1.28E-02 2.78E-02 4.00E-02 256 ranks Pattern 3 5.15E-04 1.20E-02 2.51E-02 3.85E-02 Pattern 5 5.39E-04 1.22E-02 1.45E-03 1.33E-02 Pattern 1 1.64E+00 8.71E+01 4.85E+00 9.05E+01 16,384 ranks Pattern 3 1.62E+00 8.73E+01 4.70E+00 9.05E+01 Pattern 5 1.62E+00 8.70E+01 3.07E+00 8.90E+01 Pattern 1 2.33E+02 5.99E+03 4.62E+02 6.21E+03 131,072 ranks Pattern 3 2.33E+02 5.98E+03 3.40E+02 6.09E+03 Pattern 5 2.33E+02 5.99E+03 4.66E+02 6.22E+03

Table 4.7: Memory overhead (in KB) of the 4D and red-black tree designs compared to the linked list design (1 message per rank) 4D Tree Peak Average Peak Average 256 ranks 3.242188 1.695312 14.05469 7.082031 16,384 ranks 48.61719 24.42969 896.0547 448.082 131,072 ranks 193.117188 96.742188 7168.055 3584.082031 in less memory consumption but would have slowed down the tree. The usual strength of binary search trees resides in their speed; and we wanted to compare the 4D design against that strength at its best. In any case, we claim the superiority of the 4D design in terms of memory consumption degradation even if only part of the rank bits is used as tree key.

The 4D data structure makes jumps to skip many MQI having ranks that are guaranteed not to be the sought one. These jumps bring in mind skip lits [112]; another search data structures based on jumps. In fact, after showing that the 4D approach outperforms red-black trees, it is technically redundant to present a full comparison with skip lists. The advantage of skip lists over balanced binary search trees such as red-black and AVL is the simplicity of the data structure both at conceptual level and at implementation time [70]. While red- black trees deterministically haveO(log 2(n)) runtime complexity, skip lists probabilistically haveO(log 2(n)) runtime complexity. In the worst case, a quite unlikely scenario, the runtime complexity of a skip list is simplyO(n) because it degenerates to a doubly-linked list. As stated in [70], skip lists do not provide afirm logarithmic upper-bound guarantee; that guarantee, if required, should be sought with binary search trees such as red-black and AVL. It is possible as shown in [70] to have skip lists with a deterministic structure and speed behaviour; with

97 added costs in memory consumption as well as insertion/deletion complexity.

Skip lists have layers of doubly-linked lists. The lowest level, where the actual data can reside is made of leaf nodes. The upper levels are meant for jumps only and do not contain any useful data. Each level is an ordered doubly-linked list; with each node of every upper level additionally pointing down to a node of a lower level so as to allow the jumps that characterize skip lists. The worst case in terms of speed for the skip list is the best case in terms of memory; and that case occurs when only the leaf level exists. Even in that best case of memory consumption, a skip list is more expensive than an array when all the ranks of a communicator are represented. In general and without respect to the structure it ends up having, as long as it does not degenerates to a doubly-linked list of leaf nodes only, a skip list is always more expensive than the equivalent red-black tree in terms of memory consumption. In the red-black tree, every node at the exception of the root (possibly) can be used to store short linked lists of UMQ and PRQ. In the skip list, only the leaf nodes can be used for that purpose.

With a skip list whose structure approaches the structure of a balanced binary search tree,

O(nlog2(n)) nodes is required. In comparison, a balanced binary search tree such as red-black requires no more thann orn + 1 nodes to achieve the same goal.

4.6 Experimental Evaluation

The experimental setup is an 11-node InfiniBand cluster. Each node is a quad-socket AMD

Opteron 6276 (Interlagos), having a total of 64 processor cores and 128 GB of memory. There- fore, the cluster has a total of 64 11 = 704 processor cores and 11 128 GB = 1.375 TB of × × memory. The cache of each CPU is made of 8 64 KB of L1 instructions, 16 16 KB of L1 × × data, 4 2 MB of L2 and 2 8MB of L3. The nodes are equipped with Mellanox ConnextX- × × 3 QDR HCAs linked through 36-port Mellanox InfiniBand switches. All the nodes run the

2.6.32 version of the Linux kernel and MVAPICH2-1.7 over OFED 1.5.3.3. In all the tests, the linked list design, which is the default MVAPICH implementation, is used as the reference for speedup, memory overhead and cache performance results.

98 4.6.1 Notes on Measurements

We measure the memory consumption of the actual queue operations inside each process for each queue architecture. The MVAPICH2 code implementation of each queue design has been instrumented for that purpose. Each allocation of objects such as RankHead, ContextIdLead,

Cube, JumpPoint and MQI updates the average and maximum memory consumptions as- sociated with message queue activities. For each message queue design, we use a separate

MVAPICH build for memory tests; in order to avoid any instrumentation-induced impact on the latency results. Finally, we gather cache statistics with the Linux ”perf stat” utility.

The results are gathered for the whole execution of each process. The cache miss improve- ment for the array for instance is computed as (cacheMissLinkedList cacheMissArray) − × 100/cacheMissLinkedList. The same formula is used for the 4D design. A 75% cache miss ratio improvement for the array for instance means that cacheMissArray is only 25% of cacheMissLinkedList for the same execution.

4.6.2 Microbenchmark Results

The microbenchmark program contains a single receiver andn 1 senders in ann-process − job setting. The goal is to build a certain length of PRQ or UMQ before the receiver starts searching through them. We use a special tag value to pass a synchronization token that enforces a ring ordering of send or receive message posting. To build a long PRQ, the token isfirst given to the receiver which posts all its receives before the token circulates. To build a long UMQ, the token is given to the receiver only after all the senders are done with it. All the microbenchmark metrics are gathered over the receiver process only; that is, the one that builds the long queues.

Single Communicator Tests

Wefirst perform the tests with a single active communicator (i.e., one contextId) in the job.

Figure 4.12 and Figure 4.13 show the queue operation time speedup yielded by the array- based and our 4D approaches for the PRQ and the UMQ respectively. When the queue items are consumed from the bottom of the queue (reverse search) as in the test described in 99 256 procs 512 procs 704 procs 45 5 40 4.5 35 4 30 3.5 25 3 2.5

20 list list 2 15 1.5 10 1 5 0.5

0 0 Speedup over Speedupthe over linked Array 4-D Array 4-D Array 4-D Speeduplinkedober the Array 4-D Array 4-D Array 4-D 1 5 10 1 5 10 Number of pending messages per process Number of pending messages per process (a) Reverse search 0% MPI ANY SOURCE (b) Reverse search 25% MPI ANY SOURCE

4.5 3 4 2.5 3.5 3 2 2.5 1.5

2 list list 1.5 1 1 0.5 0.5

0 0 Speedup over Speedupthe over linked Speedup over Speedupthe over linked Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D 1 5 10 1 5 10 Number of pending messages per process Number of pending messages per process (c) Reverse search 50% MPI ANY SOURCE (d) Forward search

Figure 4.12: PRQ operation speedup over the linked list, with 1 communicator

[4], the speedups are always above 1. For 10 pending messages per sender, the PRQ search is accelerated by more than 40 times by the array-based design and more than 30 times by our 4D design (Figure 4.12(a)). The UMQ is accelerated even more by both the array and the 4D design (Figure 4.13(a)). The reverse search is the worst possible case for the linked list design because the amount of traversal is maximized. In contrast, the linked list leaves no room for performance improvement for forward searches where the item of interest is always at the top (Figure 4.12(d), Figure 4.13(d)). It can even be noticed that the speedups of the 4D and the array approaches are sometimes slightly below 1.0 for forward searches because their

fixed multi-level traversal cost might not be compensated for.

Those two forward and reverse search tests are the extremes. There are two kinds of in-between situations. Thefirst one, which is not presented because it behaves like certain cases of the second one, is when a fraction of the matches occurs towards the bottom and

100 256 procs 512 procs 704 procs 70 18 60 16 14 50 12 40 10 30 8 6 20 4

10 2 Speedup over the listlinkedover Speedup 0 listlinked ober Speedup the 0 Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D 1 5 10 1 5 10 Number of pending messages per process Number of pending messages per process (a) Reverse search 0% MPI ANY SOURCE (b) Reverse search 25% MPI ANY SOURCE

14 1.6 12 1.4 10 1.2 1 8 0.8 6 0.6 4 0.4

2 0.2 Speedup over the listlinkedover Speedup Speedup over the listlinkedover Speedup 0 0 Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D Array 4-D 1 5 10 1 5 10 Number of pending messages per process Number of pending messages per process (c) Reverse search 50% MPI ANY SOURCE (d) Forward search

Figure 4.13: UMQ operation speedup over the linked list, with 1 communicator another fraction occurs towards the top. The second situation occurs when a fraction of the receives are MPI_ANY_SOURCE (Figure 4.12(b) and Figure 4.12(c) for PRQ; Figure 4.13(b) and

Figure 4.13(c) for UMQ). MPI_ANY_SOURCE matches tend to produce three effects. First, they reproduce the forward search scenario in the PRQ as their matches occur at most after the number of pending tags per sender; that is, 1, 5 or 10 queue items in Figure 4.12, Figure 4.13.

Second, because MPI_ANY_SOURCE sub-queues are strictly linear no matter the queue design, they increase the fraction of linear searches. Then in the UMQ, MPI_ANY_SOURCE receives tend to perform complete searches against all the existing ranks. While those combined effects quickly decrease the search improvements, the faster approaches remain better than the linked list unless the fraction of MPI_ANY_SOURCE reaches 100%; in which case the speedups disappear.

Forward search results with fractions of MPI_ANY_SOURCE are omitted because they add little information to the set of tests already presented in Figure 4.12, Figure 4.13. All forward

101 searches tend to behave like Figure 4.12(d) for the PRQ and Figure 4.13(d) for the UMQ.

Figure 4.14 and Figure 4.15 show the average memory overheard generated by each of the array and 4D approaches compared to the linked list. We omit the peak memory tests because they present the same trends as in Figure 4.14 and Figure 4.15. We can observe that the 4D approach is always the one that adds the smaller memory overhead to the linked list memory requirements. Even in the worst case, the 4D approach is approximately twice more memory efficient than the array design. We expect that ratio to keep improving with job sizes.

We gathered cache miss improvement ratios as well. However, for a single communicator, no specific trend is observed: the linked list behaves better or worse than the array and the 4D approaches in a seemingly random fashion. Based on the cache results that we show later for the multiple-communicator tests, we can conclude that a clear trend builds up only after a threshold that the single-communicator tests have not reached in our microbenchmarks. In theory, all other things being equal, the cache behaviour is strongly impacted by the search depths. Consequently, even for a single communicator, we would expect a clear trend for job sizes larger than the ones used in our tests.

Multiple Communicator Tests

We repeat the tests for 10 and 50 active communicators. The behaviour of the message queues in the presence of MPI_ANY_SOURCE can be generalized from the single-communicator tests. As a consequence, for the sake of conciseness, we do not repeat the tests with various percentages of MPI_ANY_SOURCE. Furthermore, in the complete absence of MPI_ANY_SOURCE, both the PRQ and the UMQ behave very similarly. We therefore stick to only one of these queues, namely the PRQ. All the observations made for the PRQ with 10 and 50 communicators in the absence of MPI_ANY_SOURCE can be generalized to the UMQ for the same numbers of communicators. All the tests are done with 5 pending messages per process. For each of the tests, we provide the results for forward and reverse traversal of the communicators. The forward traversal uses the same communicator traversal order for both senders and receivers; the reverse traversal uses opposite orders for senders and receivers. These communicator traversal orders are distinct from the forward and reverse searches that happen inside each communicator. Figure 4.16 depicts the various matching sequences yielded by the combinations of those two distinct order parameters. As shown in Figure 4.16, the matches happen entirely for one rank before moving on to the next one.

[Figure 4.14: PRQ average memory overhead compared to the linked list, with 1 communicator. Panels: (a) reverse search, 0% MPI_ANY_SOURCE; (b) reverse search, 25% MPI_ANY_SOURCE; (c) reverse search, 50% MPI_ANY_SOURCE; (d) forward search; memory overhead in KB for 1, 5 and 10 pending messages per process.]

We first notice that the speedup increases with the number of active communicators (Figure 4.17): the higher the number of communicators, the larger the speedups. The PRQ reverse search speedup with one communicator (Figure 4.12) varies between 1.7 and 40. With 10 communicators, it varies between 42 and 175.8 (Figure 4.17(a)). With 50 communicators, it varies between 100 and 556 (Figure 4.17(c)). The large speedups observed for the multiple-communicator tests can be explained as follows. The array and 4D approaches always skip all the queue items of any communicator that is not of interest for the search in progress. In comparison, the linked list approach always scans the queue linearly and goes over all the queue items of the irrelevant communicators if the communicator of interest happens to occur later in the queue.

[Figure 4.15: UMQ average memory overhead compared to the linked list, with 1 communicator. Panels: (a) reverse search, 0% MPI_ANY_SOURCE; (b) reverse search, 25% MPI_ANY_SOURCE; (c) reverse search, 50% MPI_ANY_SOURCE; (d) forward search; memory overhead in KB for 1, 5 and 10 pending messages per process.]

As shown in Figure 4.16, the linked list always begins with queue item (1, 0) of communicator 0, goes all the way to the end of that communicator, then jumps to queue item (1, 0) of communicator 1 and progresses the same way until the match occurs. The magnitude of the speedups observed in these tests is an embodiment of observation number 4 at the end of Section 4.4.1. The same information is conveyed by Table 4.3 in Section 4.4.1.

The average memory overhead tests (Figure 4.18) show the expected trend of the array being more memory-greedy than the 4D approach. Once again, we omit the peak memory overhead tests because they present the exact same trend as the ones shown in Figure 4.18.

We notice substantial cache miss ratio improvements (Figure 4.19, Figure 4.20, Figure 4.21).

[Figure 4.16: Different match orders for multi-communicator tests: example with 10 communicators of 256 distinct ranks and 5 pending messages per process. Panels: (a) reverse search and reverse communicator traversal; (b) reverse search and forward communicator traversal; (c) forward search and reverse communicator traversal; (d) forward search and forward communicator traversal.]

As expected, the L1 data cache miss improvement is higher with 50 communicators (Figure 4.19(c), Figure 4.19(d)) than with 10 communicators (Figure 4.19(a), Figure 4.19(b)). For the L1 instruction cache miss improvements, the tests with 10 communicators (Figure 4.20(a), Figure 4.20(b)) show no clearly consistent trend, even though the general observation is an improvement. With 50 communicators (Figure 4.20(c), Figure 4.20(d)), the trend for the L1 instruction cache miss improvement is clear and the improvement ratios are higher as well.

Finally, the last-level cache ratios (Figure 4.21) follow the same trend of increase from 10 to 50 communicators. The tests in Figure 4.21(c) and Figure 4.21(d) correspond to high rates of back-and-forth between the L3 cache and physical memory for the linked list-based queue; a very detrimental situation that both the array and the 4D approaches fix.

[Figure 4.17: PRQ operation speedup over the linked list, with multiple communicators. Panels: (a) reverse search, 10 communicators; (b) forward search, 10 communicators; (c) reverse search, 50 communicators; (d) forward search, 50 communicators; results shown per communicator traversal order (reverse, forward) for 256, 512 and 704 processes.]

[Figure 4.18: PRQ average memory overhead compared to the linked list, with multiple communicators. Panels: (a) reverse search, 10 communicators; (b) forward search, 10 communicators; (c) reverse search, 50 communicators; (d) forward search, 50 communicators; memory overhead in KB per communicator traversal order.]

[Figure 4.19: PRQ L1 data cache improvement over the linked list, with multiple communicators. Panels: (a) reverse search, 10 communicators; (b) forward search, 10 communicators; (c) reverse search, 50 communicators; (d) forward search, 50 communicators; cache miss improvement in %.]

[Figure 4.20: PRQ L1 instruction cache improvement over the linked list, with multiple communicators. Panels: (a) reverse search, 10 communicators; (b) forward search, 10 communicators; (c) reverse search, 50 communicators; (d) forward search, 50 communicators; cache miss improvement in %.]

[Figure 4.21: PRQ last-level (L3) cache improvement over the linked list, with multiple communicators. Panels: (a) reverse search, 10 communicators; (b) forward search, 10 communicators; (c) reverse search, 50 communicators; (d) forward search, 50 communicators; cache miss improvement in %.]

4.6.3 Application Results

The tested applications are Radix [92] and Nbody [2, 92]. Radix is a sorting algorithm; the application kernel used in the tests sorts 64-bit integers. Nbody is a particle interaction simulation. Radix and Nbody are interesting for the queue issue because they exhibit the two cases of superlinear and linear queue growth, respectively. We emphasize that we did test our implementations for correctness and for the assessment of ordering constraints with applications that use MPI_ANY_SOURCE; those applications, however, have very shallow search depths and present no room for search optimization. We experimented with NAS LU [74] and AMG2006 [36] as they build reasonably long queues. NAS LU and AMG2006 are linear algebra applications. Unfortunately, AMG2006 over 512 processes has average search depths of only 4.29 and 3.56 for the UMQ and the PRQ, respectively. These metrics are 4.04 and 0.67 for NAS LU class E [74] over 512 processes. In general, and regardless of the presence of MPI_ANY_SOURCE, applications with shallow search depths like AMG2006 and NAS LU offer no room for improvement for the array or 4D message queues compared to the linked list. Very small queue depths correspond to the forward search microbenchmark scenarios and can even lead to a slight degradation when the linked list message queue is not used.

For Radix, the radix parameter is 16. The application is run to sort 2^28, 2^29 and 2^30 64-bit integers, respectively, as indicated in the first column of Table 4.8 and Table 4.9. All three executions are done over 512 processes because our experiments show that the job size has less of an impact than the data size for Radix. Nbody is run over 256, 512, and 704 processes. The first column of Table 4.8 and Table 4.9 shows the job sizes appended to the name of Nbody. For the PRQ and the UMQ, we provide in Table 4.8 and Table 4.9, respectively, some queue behaviour metrics that are used in the analysis of the performance results. The values in the tables are averages computed over three iterations; this explains why certain maximum lengths are not integers. The Max Queue Length (MQL) represents the maximum queue length reached in the application. The UMQ MQL of Radix reaches several times its job size. As for Nbody, both its PRQ and UMQ grow almost linearly with the job size. Its PRQ MQL is actually exactly JobSize - 1 (Table 4.8). The Max Average Search Length (MASL) column shows the largest of the average search lengths returned by each process. The Average Search Length (ASL) shows the average search length over the whole job. The bigger these two metrics, the closer the search behaviour gets to the single-communicator reverse search scenario presented in the microbenchmarks (Section 4.6.2); the smaller they are, the closer it gets to the forward search scenario. Table 4.9 shows that the UMQ MASL of Radix reaches several thousands, making this application a good example of a situation where fast queue traversal can make a noticeable difference.

Table 4.8: PRQ behaviour data for Radix and Nbody

Application and   Max Queue      Average Queue   Max Average Search   Average Search
parameters        Length (MQL)   Length (AQL)    Length (MASL)        Length (ASL)
Radix.n28         4              4               0.998                0.817
Radix.n29         4              4               0.998                0.823
Radix.n30         4              4               0.979                0.832
Nbody.256         255            254.953         0.981                0.926
Nbody.512         511            508.775         0.984                0.867
Nbody.704         703            682.646         0.982                0.796

Table 4.9: UMQ behaviour data for Radix and Nbody

Application and   Max Queue      Average Queue   Max Average Search   Average Search
parameters        Length (MQL)   Length (AQL)    Length (MASL)        Length (ASL)
Radix.n28         3061.3         484.905         1510.058             46.547
Radix.n29         6110           492.449         3418.630             54.468
Radix.n30         13228          533.837         7208.657             74.740
Nbody.256         249.333        86.863          27.228               6.041
Nbody.512         498.000        370.342         32.363               17.718
Nbody.704         693.667        551.003         50.863               21.876

Application results are gathered per process; then averaged over all the processes of the job.

For instance, peak memory represents the average of the peak memory reported by each individual process; and average memory is averaged over the average memory reported by each process. Figure 4.22 and Figure 4.23 present the performance results for Radix and Nbody, respectively. For Radix, the queue operation speedup reaches 8.46 for the array-based approach and 4.99 for the 4D approach (Figure 4.22(a)). More importantly, the queue operation speedups are conveyed almost entirely to the communication and execution time improvements (Figure 4.22(b), Figure 4.22(c)). For 2^30 integers in particular, the array-based and the 4D approaches tremendously improve the overall execution time of Radix.

[Figure 4.22: Radix speedups, memory overhead and cache statistics for the array-based and 4D designs over the linked-list approach. Panels: (a) queue operation time speedup; (b) communication time speedup; (c) execution time speedup; (d) average memory overhead; (e) peak memory overhead; (f) L1 data cache miss improvement; (g) L1 instruction cache miss improvement; (h) last-level cache miss improvement; results for radix.28, radix.29 and radix.30.]

[Figure 4.23: Nbody speedups, memory overhead and cache statistics for the array-based and 4D designs over the linked-list approach. Panels: (a) queue operation time speedup; (b) communication time speedup; (c) execution time speedup; (d) average memory overhead; (e) peak memory overhead; (f) L1 data cache miss improvement; (g) L1 instruction cache miss improvement; (h) last-level cache miss improvement; results for nbody.256, nbody.512 and nbody.704.]

These large speedups in communication and execution times are due to the process of rank zero being a bottleneck and propagating the resulting latencies when the linked list queue is used. The MASL of Radix comes from its rank 0 and completely overshadows its ASL as the key factor in its performance.

For Nbody (Figure 4.23(a)), the 4D approach accelerates the queue processing time by close to 2 times. The array-based approach degrades the processing time instead. The array approach actually incurs an overhead when the queues become completely empty; the deallocation that ensues can be costly when it is frequent. These effects might not be compensated for when the search lengths are not large enough. The 4D approach is shielded against that deallocation effect because it uses very small objects that are created and freed from various pools. Figure 4.23(b) and Figure 4.23(c) show that the 4D approach improvement and the array-based design degradation do not noticeably impact the communication and execution times of Nbody. Nbody is very balanced, as shown by the very small differences between the max metrics (MQL, MASL) and the average metrics (AQL, ASL) in Table 4.8 and Table 4.9.

As for the memory overheads, there is a clear trend that the array-based approach is always more memory-intensive than the 4D design (Figure 4.22(d), Figure 4.22(e) and Figure 4.23(d), Figure 4.23(e)). The memory overheads, both average and peak values, are very consistent with the expectations. For Nbody (Figure 4.23(d)), there is a clear increase of the memory overhead of the array when the job size increases. In comparison, the 4D approach does not show that linear memory overhead growth. With Radix (Figure 4.22(d)), the 4D approach shows a decreasing average memory overhead when the number of sorted integers grows. There is a negative memory overhead effect when there is less queue build-up due to the speed of processing.

The memory overhead of a faster approach can end up being more than compensated for because the average number of items in the queues is kept low. The negative memory overhead effect is fairly present with Radix.n30 (Figure 4.22(d)) when executed with the 4D approach.

The array approach can yield the negative memory overhead to a lesser extent than the 4D approach because of its fixed allocation scheme that depends on the communicator size. We must emphasize a key difference between the applications and the microbenchmarks for the memory behaviours. With the microbenchmarks, the queues of the linked list, the array and the 4D container all have the same number of items at each step of the execution; the microbenchmarks do simple back-to-back communications for the sole purpose of message queue processing. The applications, on the other hand, mix computations and communications in a totally uncontrolled fashion as far as message queue operations are concerned. In particular, the three kinds of message queue do not necessarily have the same number of queue items at the same steps of the application execution. The negative memory overhead and the difference of trends between the peak and average memory overheads are a consequence of the uncontrolled environment exhibited by the applications. The observation being made and explained here is that the memory behaviour of the 4D approach is more advantageous in non-controlled environments; the very environment that characterizes applications.

Finally, Radix generally improves a lot on the L1 data cache misses. The instruction cache miss rate improvement is substantial as well, especially when sorting 2^30 values. As expected, the last-level cache gets noticeably improved for Radix.n30. The comparative cache miss variations are not pronounced for Nbody. There is generally a slight improvement for the L1 data cache and a generally slight degradation, of less than 0.8% in the worst case, for the last-level cache. These small improvements and degradations are due to the relatively shallow search depth shown by Nbody. The L1 instruction cache for Nbody, however, shows serious degradations for both the array and the 4D containers. Nbody is known to be very computation-intensive [2]. As a result, its instruction cache is usually filled with non-message queue-related code. The linked list-based message queue instructions are simpler and less diversified than the array and 4D-based queue operation instructions. Consequently, when the queue operation instructions must get room in the L1 instruction cache or leave it for computation instructions to occupy the space, less eviction happens with the linked list than with the other two message queue types.

4.6.4 Message Queue Memory Behaviours at Extreme Scales

While the speed differences between the 4D and the linked list-based queues are obvious even at 512 and 704 processes, the memory values obtained convince only by their trends (see Table 4.10). In fact, the cluster that we used for these tests has 2 GB of memory per CPU core; and the array average memory overhead for Nbody at 704 processes represents only 0.0008% of the memory available for the process.

Table 4.10: Average memory overhead measurements of the applications (in KB)

App         Linked list   Array     4D        Overhead Array   Overhead 4D
Radix.28    97.16         111.66    97.51     14.50            0.36
Radix.29    108.47        114.99    107.44    6.53             -1.030
Radix.30    141.07        138.61    137.01    -2.44            -3.81
Nbody.256   79.71         86.06     81.05     6.35             1.34
Nbody.512   82.11         93.56     82.88     11.44            0.76
Nbody.704   107.05        123.91    108.57    16.86            1.53

However, the scales targeted by this work go far beyond 704 CPU cores. It helps to examine the percentage of process memory consumed by message queues on two petascale supercomputers, namely the Titan-Cray XK7 and the Blue Gene Sequoia [106]. Radix was run at the fixed size of 512 processes and its results are difficult to extrapolate to larger system sizes. Nbody was run over increasing job sizes and is therefore a good candidate for extrapolation. Each of the 1,572,864 CPU cores of Blue Gene Sequoia has no more than 1 GB of RAM available, while each of the 560,640 CPU cores of the Titan-Cray XK7 has less than 1.27 GB. Let us also consider a large hypothetical HPC application A_H which reuses about 2 libraries built on top of MPI. That application would have a total of 3 job-size communicators [41], leading to a total of 6 job-size contextIds in an MPI library such as MVAPICH or MPICH where the same communicator has two contextIds [1]; one for collectives and another for non-collective communications. We are interested in seeing the impact of running Nbody or the hypothetical job A_H on the Titan-Cray XK7 and the Blue Gene Sequoia. We show in Table 4.11 the absolute message queue-related memory consumption and the percentage of the physical memory represented by that consumption for each process. Our 704-CPU core test system is also represented ("Current") in Table 4.11 for comparison purposes.

Anticipating the same linear growth observed for the average memory overhead of Nbody for the array (Table 4.10), we extrapolate that same amount of overhead to Titan and Sequoia. For the 4D design, the extrapolation is a more difficult task because the memory overhead growth is not linear (both in theory and experimentally). For instance, the 4D queue consumes less memory for 1,048,576+1 processes than for 1,048,576 because of the change in dimensionSpan (see Table 4.5 for more such examples). Nevertheless, we show for the application A_H what would happen. The results shown for A_H are computed from the data in Table 4.4, along with Equation 4.19 and Equation 4.20. For Nbody, the fixed memory overhead of the array would represent more than 3% of the available RAM for each process on Sequoia. That overhead exists even when the message queues host only 1 MQI. However, the process must fit, in that 1 GB, its binary and all the other MPI internal data, on top of hosting its useful application data; hopefully without swapping. It helps to mention that MPI itself already needs a substantial amount of system buffers to function. Those buffers are crucial for the unavoidable Eager and control messages.

The requirement for these buffers is even higher for RDMA-enabled network technologies such as InfiniBand and iWARP. For those particular network technologies, the less system buffer memory MPI can access, the stricter the flow control mechanism it must use to recycle those buffers. The effects of this flow control mechanism are wait and delay propagation. It also helps to realize that the fear of memory consumption at large scale in MPI is not necessarily viewed from the point of view of a single MPI aspect such as message queues; it is the aggregated effect of many aspects that is worrisome. This approach of reducing the memory footprint from different perspectives is already being pursued in the community; and it justifies ideas like replacing the comprehensive enumeration of process ranks by ranges to save on the memory consumed by, not array-based message queue RankHeads, but mere 8-byte integers [3]. All those considerations show that the 28% fixed memory overhead of the array approach for A_H, compared to less than 1% for the 4D, would be unsustainable for Sequoia, let alone for future supercomputers. We emphasize that the description of A_H is completely realistic. Actually, we even omitted from its definition the many smaller communicators that an informed MPI programmer might create in such large jobs to perform localized communications in smaller subgroups of processes. Those smaller communicators would worsen even further than described in Table 4.11 the impact of array-based message queues on memory per CPU core at large scales.

Table 4.11: Memory consumption overhead extrapolation on Sequoia and Titan

           Nbody                                              Application A_H
           Array overhead in GB     4D overhead in GB         Array overhead in GB    4D overhead in GB
           (% of RAM per process)   (% of RAM per process)    (% of RAM per process)  (% of RAM per process)
Current    1.608E-05 (8.04E-4%)     1.456E-6 (7.278E-5%)      1.26E-4 (6.30E-3%)      2.5E-5 (1.20E-3%)
Titan      0.0128 (1.01%)           -                         0.100251 (7.89%)        4.727E-3 (0.37%)
Sequoia    0.036 (3.59%)            -                         0.281250 (28.10%)       6.610E-3 (0.66%)

4.7 Summary

With more and more jobs expected to run on millions of CPU cores in the petascale and exascale computing eras, MPI distributions must fix many little details that sink performance and increase memory consumption at scale. The MPI message queue mechanism is one of those details. The message queues are on the critical path of MPI communications; and a good implementation cannot afford to leave them unscalable.

From a speed-of-operation point of view only, it is possible to adopt an array-based design for which the usually expensive operations happen in O(1). Unfortunately, this comes at the expense of increased memory consumption at large scales. As physical memory is a shrinking resource per CPU core, the array-based approach is a solution that bears another scalability issue at large scales because it wastes a lot of memory for large communicators.

We propose in this chapter a novel message queue mechanism that is scalable with respect to both speed and memory consumption. After separating the queue items by communicator identifier, our design decomposes the rank into four slices and uses the three most significant slices to build a 3-dimensional jump coordinate that substantially accelerates searches in situations of large queues. A fourth dimension, made of the least significant slice, is exploited to build small-length PRQs and UMQs. Our 4-dimensional design not only considerably limits linear traversals but also optimizes unfruitful searches by detecting them early in the 3-dimensional jump process. We investigated well-known data structures such as binary search trees, but they proved to be slower and substantially memory-greedier than the proposed 4-dimensional message queue mechanism. The new message queue proposed in this chapter is network technology-agnostic; it can therefore be ported to any MPI distribution that resorts to message queues in software.
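As a purely illustrative aid, the sketch below shows one way the rank decomposition could be expressed in C. The 8-bit slice width (i.e., a span of 256 per level) is an assumption made for readability only; the actual widths in our design are derived from the dimensionSpan parameter discussed earlier in this chapter, and the helper names are not part of any MPI library.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decomposition of an MPI rank into four slices.
     * Assumption: 8-bit slices, i.e., a span of 256 per dimension. */
    typedef struct {
        unsigned d3, d2, d1; /* three most significant slices: the 3-D jump coordinate */
        unsigned d0;         /* least significant slice: indexes the short PRQ/UMQ     */
    } jump_coord_t;

    static jump_coord_t decompose_rank(uint32_t rank)
    {
        jump_coord_t c;
        c.d3 = (rank >> 24) & 0xFF;
        c.d2 = (rank >> 16) & 0xFF;
        c.d1 = (rank >> 8)  & 0xFF;
        c.d0 = rank & 0xFF;
        return c;
    }

    int main(void)
    {
        jump_coord_t c = decompose_rank(705432u); /* an arbitrary example rank */
        printf("jump coordinate = (%u, %u, %u), sub-queue slice = %u\n",
               c.d3, c.d2, c.d1, c.d0);
        return 0;
    }

With such a decomposition, a search first jumps along the three coordinates and only then walks a short per-slice queue, which is what bounds the length of any linear traversal.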

We mentioned in Section 4.2.1 that RMA is the only MPI communication model that comes close to avoiding message queue buildup. In fact, since one-sided communications bear no concept of reception, they are in theory shielded against the message queue issue. Furthermore, unlike the two-sided model where two progress engines must be involved in fulfilling the communication, one-sided communications alleviate or avoid coupling between communicating peers. Such a characteristic is the very essence of their one-sided nature. In general, one-sided communications represent a paradigm whose advantages tend to nullify the issues discussed in both Chapter 3 and Chapter 4. However, the vast majority of MPI-based HPC applications are based on two-sided and collective communications. The MPI one-sided communication model is little used in the HPC community, and one reason for that lack of adoption lies in its inherent synchronization. In the next chapter, we show how one-sided communications, in spite of being well-suited for next generation supercomputing, fail to keep their promises in MPI. Then, we put forth a proposal to concretize the compelling one-sided communication traits that MPI-RMA lacks.

Chapter 5

Nonblocking MPI One-sided Communication Synchronizations

MPI is ubiquitous in the HPC community. However, in more than 15 years of existence, the one-sided communication model of MPI, also called MPI RMA, has not received the expected adoption. This situation is unfortunate because the one-sided communication model has an important semantic advantage over its two-sided counterpart. In fact, MPI RMA does not require receive-side message matching; and as such, one-sided communications are less dependent on the availability of the destination process than two-sided message passing. This characteristic allows certain uses of MPI RMA that point-to-point and collective communications cannot facilitate [97, 120]. By allowing a single peer, the initiator, to carry out the communication, one-sided is the only MPI communication model suitable for unstructured dynamic communication patterns. MPI one-sided is therefore not just a more or less better-performing alternative to two-sided communications; it brings its own additional features and flexibility that allow certain classes of computations to happen over MPI. Furthermore, the rising popularity of PGAS languages [19] and their suitability for certain classes of processing (e.g., [75]) further support the reasonable assumption that the weak adoption of MPI RMA is not representative of its current and future importance.

MPI one-sided communications must happen inside critical section-like regions called epochs. An epoch is started by one of a set of RMA synchronization calls and ended by a matching synchronization call. MPI-2.0 RMA was criticized for its synchronization burden [10] and for various constraints which make it unusable in certain situations. In that regard, MPI-3.0 represents an undeniable improvement: first, it alleviates quite a few of the aforementioned constraints; and second, it introduces many new features which provide mechanisms to avoid frequent synchronization invocations. For instance, the introduction of the flush family of completion routines and of request-based one-sided communications in MPI-3.0 RMA voids the need for the communication initiator to always make a synchronization call to detect RMA completion at the application level. However, these improvements are still not enough to make MPI RMA an equally compelling option compared to two-sided communications. For instance, when it must communicate with disjoint sets of targets, the communication initiator cannot avoid resorting to multiple distinct epochs, meaning that the new flush family of routines and the request-based communication functions cannot help. Serialization thus ensues because epoch-ending routines are blocking. The blocking nature of MPI one-sided communication synchronizations is actually the cause of six kinds of latency issues first documented in [55] as the MPI-2 RMA inefficiency patterns. These latency issues are the consequence of late peers propagating latencies to remote processes blocking on RMA synchronizations. Four of the inefficiency patterns are issues for which the specification of MPI-2.0 RMA offers no workaround. All four inefficiency patterns are still left unaddressed by the new MPI-3.0 specification.
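To make the epoch notion concrete, the following self-contained program shows a typical blocking general active target (GATS) epoch written against the current standard. Only standard MPI-3.0 calls are used; the payload, ranks and window layout are arbitrary choices made for illustration.

    #include <mpi.h>
    #include <stdio.h>

    /* A minimal blocking GATS epoch: rank 0 (origin) puts one double into the
     * window exposed by rank 1 (target). */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double buf = 0.0;
        MPI_Win win;
        MPI_Win_create(&buf, sizeof(buf), sizeof(buf), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        MPI_Group world_grp, peer_grp = MPI_GROUP_NULL;
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

        if (rank == 0 && size > 1) {            /* origin side */
            int target = 1;
            double value = 3.14;
            MPI_Group_incl(world_grp, 1, &target, &peer_grp);
            MPI_Win_start(peer_grp, 0, win);    /* open the access epoch           */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_complete(win);              /* blocking epoch closing          */
        } else if (rank == 1) {                 /* target side */
            int origin = 0;
            MPI_Group_incl(world_grp, 1, &origin, &peer_grp);
            MPI_Win_post(peer_grp, 0, win);     /* open the exposure epoch         */
            MPI_Win_wait(win);                  /* blocks until the origin completes */
            printf("target received %f\n", buf);
        }

        if (peer_grp != MPI_GROUP_NULL) MPI_Group_free(&peer_grp);
        MPI_Group_free(&world_grp);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

A delayed MPI_Win_post on the target makes the origin stall inside MPI_Win_complete, while a delayed MPI_Win_complete on the origin makes the target stall inside MPI_Win_wait; these are two of the inefficiency patterns detailed in Section 5.2.1.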

In this chapter, we effectively address the synchronization burden of MPI-3.0 RMA through a proposal which makes the one-sided communications nonblocking from start to finish, if desired [122]. Because entire MPI-RMA epochs can be nonblocking, MPI processes can issue them and move on immediately. Asynchronicity is increased and conditions are created for enhanced communication/computation and communication/communication overlapping; but more importantly, the effects of late peer arrival can now be mitigated to a much higher extent by the application. The new proposal solves all four inefficiency patterns, plus a fifth one introduced and documented in this dissertation. Additionally, it allows various kinds of aggressive communication scheduling meant to reduce the overall completion time of multiple epochs initiated from the same process. As long as the invocation order of equivalent synchronization calls is maintained, a job written with the proposed nonblocking synchronizations could see the same write ordering in every process as if it were written with the current MPI-3.0 blocking synchronizations. To the best of our knowledge, this work represents the first attempt to remove all wait phases from the lifetime of MPI RMA epochs.

5.1 Related Work

The MPI specification gives a lot of leeway to implementers about when to force the waits in an epoch. In particular, an access epoch opening synchronization call does not have to block if the corresponding exposure epoch is not opened yet. This freedom is used in [7, 29, 104] to mitigate the apparent effects of RMA synchronization by deferring the actual internal execution of both synchronization and communication to the epoch-closing routine execution. This approach is termed lazy. Some early work on MPI one-sided communications did elect to block the origin at epoch-opening until the target is exposed or ready to be accessed [108].

While this approach imposes a wait on the origin, it avoids any buffering or queuing meant to defer the RMA communications. In [90], a design of the fence epoch is proposed where communication/computation overlapping occurs inside the epoch. However, since the proposal makes every single fence call blocking, the overlapping comes at the price of potentially substantial idleness at both the opening and the closing of each epoch.

Purely intra-node RMA issues have been addressed in [58, 63]. An approach to the hybrid design of MPI one-sided communication on multicore systems over InfiniBand is presented in [89]; that work described a way of migrating passive target locks between network atomic operations and CPU-based atomic operations. The use of RDMA for one-sided communications was presented in [48, 61, 91]. Designs of the computational aspects of MPI_ACCUMULATE were proposed in [47, 76]. In [116], a strategy is proposed to adaptively switch between lazy and eager modes for RMA communications to achieve overlapping. An MPI-3.0 RMA library implementation for Cray Gemini and Aries systems is described in [27]. However, to the best of our knowledge, none of the previous works altered the epoch-closing routines of MPI RMA to render the one-sided communication lifetime nonblocking from start to finish. Thus, the work presented in this chapter pioneers entirely nonblocking MPI one-sided synchronization proposals and designs.

5.2 The Burden of Blocking RMA Synchronization

5.2.1 The Inefficiency Patterns

The MPI one-sided inefficiency patterns [38, 55] are situations that force some unproductive wait or idleness on a peer involved implicitly or explicitly in RMA communications. The inefficiency patterns are a consequence of the blocking nature of RMA synchronization routines.

They are listed as follows:

• Late Post: It is a GATS-related inefficiency where MPI_WIN_COMPLETE or MPI_WIN_START must block because the target is yet to issue MPI_WIN_POST. As a reminder, in a GATS epoch, MPI_WIN_START and MPI_WIN_COMPLETE respectively open and close the origin-side epoch. MPI_WIN_POST is invoked by the target to create an exposure epoch.

• Early Transfer: It occurs when an RMA communication call blocks because the target epoch is not yet exposed.

• Early Wait: It occurs when MPI_WIN_WAIT is called while the RMA transfers are not completed yet. As a reminder, MPI_WIN_WAIT does not exit until all the RMA transfers directed towards the calling process complete.

• Late Complete: It is the delay between the last RMA transfer and the actual invocation of MPI_WIN_COMPLETE. That delay propagates to the target as an unproductive wait. The target-side MPI_WIN_WAIT must block as long as MPI_WIN_COMPLETE is not invoked by the origin; and when that invocation is delayed for reasons other than RMA transfers, Late Complete occurs.

• Early Fence: It is the wait time associated with an epoch-closing fence call that occurs before the RMA transfers complete. Early Fence could impact both origin and target processes in a fence epoch.

• Wait at Fence: A closing fence call in any process must block until the same call occurs in all the processes in the group over which the RMA window is defined. If any process delays its call to fence while it has no, or no more, RMA communication to transfer, then it inflicts a Wait at Fence inefficiency on the other processes that issue their epoch-closing fence earlier for the same RMA window. Wait at Fence is a superset of Early Fence.

There is another inefficiency pattern not documented in [38, 55]. Since a passive target epoch does not involve an explicit epoch on the target side, situations of inefficient latency transfers are less obvious. However, when one considers at least two origins, we can define a new inefficiency pattern inflicted by current lock holders on subsequent lock requesters. That inefficiency, which we trivially name Late Unlock, can occur if (a sketch of a passive-target epoch follows the list):

1. The current holder possesses the lock exclusively. All the RMA transfers of the epoch have already completed, but the holder delays the call to MPI_WIN_UNLOCK while any number of requesters are willing to acquire the same lock (exclusively or not).

2. The current holders possess the lock in a shared fashion. All the RMA transfers are completed for all the holders at a time t. At least one holder keeps holding the lock beyond t while an exclusive lock requester is waiting.
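The fragment below, which uses only standard MPI-3.0 calls, shows where the pattern arises in practice; do_unrelated_work() is a hypothetical placeholder for arbitrary application code, and the counter layout is an assumption made for the example.

    #include <mpi.h>

    static void do_unrelated_work(void)
    {
        /* placeholder for arbitrary application work */
    }

    /* Passive-target epoch from an origin process. If do_unrelated_work() runs
     * between the last (flushed) transfer and MPI_Win_unlock, the lock is held
     * longer than necessary and any process waiting for the same lock suffers
     * the Late Unlock pattern described above. */
    void update_remote_counter(MPI_Win win, int target_rank, long increment)
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
        MPI_Accumulate(&increment, 1, MPI_LONG, target_rank,
                       0 /* displacement */, 1, MPI_LONG, MPI_SUM, win);
        MPI_Win_flush(target_rank, win);  /* the update is now complete at the target */

        do_unrelated_work();              /* delaying the unlock past this point only  */
                                          /* serves to stall the next lock requester   */
        MPI_Win_unlock(target_rank, win);
    }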

Which inefficiency patterns can occur in an application might depend on the MPI implementation it is using. Actually, the MPI standard does not specify the blocking or nonblocking nature of most epoch-opening synchronizations. The specification simply requires RMA operations to avoid accessing non-exposed remote epochs, and epoch-closing routines not to exit until all the RMA transfers originating from or directed towards the epoch are done transferring, at least locally. In practice however, most modern and major MPI libraries [69, 71, 79] provide nonblocking epoch-opening routines; and rightfully so, because a blocking design of those phases of one-sided communication is documented as suboptimal [7, 104]. As a consequence, most MPI libraries avoid incurring Late Post on MPI_WIN_START invocation. Late Post can still be incurred at MPI_WIN_COMPLETE though.

Even though MPI-2.2 did not strictly specify RMA communication calls as nonblocking, it is customary for implementations to only provide nonblocking versions of those routines, for the same reasons described in [7, 104]. Thus, Early Transfer is generally avoided altogether. Additionally, MPI-3.0 has now explicitly specified the RMA calls as nonblocking, making the Early Transfer pattern nonexistent as per the standard itself. Furthermore, Early Wait can be mitigated by the use of MPI_WIN_TEST. However, Late Post, Late Complete, Early Fence and Wait at Fence are still as much of a problem in MPI-3.0 RMA as they were in MPI-2.0 RMA.
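For completeness, this is roughly what the MPI_WIN_TEST-based mitigation of Early Wait looks like on the target side; compute_a_chunk() is a hypothetical placeholder for a slice of local work, and the polling granularity is an application choice.

    #include <mpi.h>

    static void compute_a_chunk(void)
    {
        /* placeholder for a slice of local computation */
    }

    /* Instead of blocking in MPI_Win_wait while origins are still transferring,
     * the target polls the exposure epoch with MPI_Win_test and keeps computing
     * in between; only standard MPI calls are involved. */
    void expose_and_overlap(MPI_Win win, MPI_Group origin_group)
    {
        int done = 0;
        MPI_Win_post(origin_group, 0, win);   /* open the exposure epoch          */
        while (!done) {
            compute_a_chunk();                /* useful work while transfers flow */
            MPI_Win_test(win, &done);         /* nonblocking completion check     */
        }
    }

No comparable workaround exists for Late Post, Late Complete, Early Fence or Wait at Fence, which is precisely what motivates the nonblocking synchronizations proposed in this chapter.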

One can notice that, with Early Transfer not being an actual issue, it is not useful to define a "Late Fence" inefficiency pattern to mirror Late Post in the case of fence epochs. In fact, all the inefficiency patterns harm by creating and propagating delays to some activity occurring after the immediately concerned epochs. If Early Transfer were an actual issue, then the delay would impact what happens inside the immediately concerned epochs as well. In the case of GATS, the Late Post impact is felt at the origin-side epoch-closing routine, that is, MPI_WIN_COMPLETE. Similarly, a Late Fence impact, if any, would be felt at the epoch-closing fences. All the fence-related issues felt at epoch closing can only lead to either Early Fence or Wait at Fence. Consequently, one cannot isolate the effects of a so-called Late Fence from those of either Early Fence or Wait at Fence.

5.2.2 Serializations in MPI One-sided Communications

MPI one-sided communications are always nonblocking. However, for a given process and RMA window, unless the RMA communications occur in the same epoch, they do not really provide the benefits of nonblocking messaging. In Figure 5.1 for instance, any communication such as RMAb_0 that belongs to the second epoch is strictly serialized with any communication such as RMAa_0 that belongs to the first epoch. However, an MPI process cannot be constrained to issuing all its RMA communications inside a single epoch, especially if it aims to communicate with disjoint sets of targets.

In a two-sided-based application, unless data dependencies dictate otherwise, a performance-conscious programmer always opts for the approach in Figure 5.2(b). The obvious advantage of this nonblocking scheme is the parallel transfer of the communications, with a potentially reduced overall latency. Additionally, if the transfer of any MPI_IRECV suffers from late peer arrival, the resulting latency does not propagate to any of the subsequent communication calls.

[Figure 5.1: Parallel RMA in serialized epochs. Each epoch carries its own set of RMA communications (RMAa_0 ... RMAa_n, RMAb_0 ... RMAb_m, ..., RMAz_0 ... RMAz_l).]

[Figure 5.2: Serialized vs. non-serialized communications. Panels: (a) blocking two-sided (MPI_Recv0 ... MPI_Recvn, serialized); (b) nonblocking two-sided (MPI_Irecv0 ... MPI_Irecvn, overlapped); (c) one-sided (Epoch0 ... Epochn, serialized; OE = open epoch, CE = close epoch).]

In the scheme presented in Figure 5.2(a), the consequences of late peer arrival propagate in a chain to the other operations inside the receiver, and to the senders of each of the subsequent MPI_RECV calls as well if the communications happen to resort to the Rendezvous protocol. The blocking scheme of Figure 5.2(a) is obviously very inefficient, and the programmer opts for it only when it is unsafe to do otherwise. Every time one-sided communications resort to multiple epochs, the current specification of MPI RMA (Figure 5.2(c)) confines the programmer into an equivalent of the communication scheduling shown in Figure 5.2(a) (see Section 5.4.2). As shown in Figure 5.1, two RMA communications in two distinct epochs of the same process fall into the scenario of Figure 5.2(c).
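In two-sided terms, the contrast between Figure 5.2(a) and Figure 5.2(b) is the familiar one below; the peer count, message sizes and tags are arbitrary illustration values.

    #include <mpi.h>

    #define NPEERS 4
    #define COUNT  1024

    /* Figure 5.2(a): serialized blocking receives. A late sender in iteration i
     * delays every subsequent receive. */
    void receive_blocking(double bufs[NPEERS][COUNT], const int senders[NPEERS])
    {
        for (int i = 0; i < NPEERS; i++)
            MPI_Recv(bufs[i], COUNT, MPI_DOUBLE, senders[i], 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Figure 5.2(b): nonblocking receives. All transfers may progress in
     * parallel and a late sender only delays its own completion. */
    void receive_nonblocking(double bufs[NPEERS][COUNT], const int senders[NPEERS])
    {
        MPI_Request reqs[NPEERS];
        for (int i = 0; i < NPEERS; i++)
            MPI_Irecv(bufs[i], COUNT, MPI_DOUBLE, senders[i], 0,
                      MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(NPEERS, reqs, MPI_STATUSES_IGNORE);
    }

The point of this chapter is that MPI RMA currently offers no equivalent of the second form when several epochs are required: the epochs serialize just as the calls in receive_blocking() do.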

Serial epochs limit the ability of the HPC application programmer to express parallelism. There is no reason why epochs towards disjoint sets of targets should be serialized if the programmer knows with certainty that all the RMA writes touch different memory regions. There is, in fact, no justification for the unavoidable serialization of any epoch at all. The need for massive concurrency in the current petascale and the upcoming exascale computing eras requires MPI to leave the serialization decisions to the HPC application programmer, just as is already the case for two-sided and collective communications.

5.3 Opportunistic Message Progression as Default Advantage for Nonblocking Synchronizations

To visualize a generic benefit of parallel epochs, one can map the MPI API calls made by the HPC application to the actions fulfilled by the MPI middleware. The MPI progress engine does not necessarily trigger a communication that has been initiated. Just like the Rendezvous protocol discussed in Chapter 3, many communications require preconditions to be met before their data transfers can start. A simple precondition in the case of passive target one-sided communications, for instance, could be the availability of the remote lock.

When a precondition is not fulfilled, an initiated communication is kept inside the MPI middleware in a pending state. Figure 5.3 shows c_0, c_1 and c_2 as examples of communications that were pending in the MPI middleware. In order to ensure that communications whose preconditions were not met at the time they were initiated still get a chance to complete before their corresponding completion detection calls, the MPI progress engine behaves in an opportunistic way. Every time it gets a chance to use the processor for a communication c_i, the progress engine tries to progress all previously pending communications whose preconditions were met in the meantime. For instance, Figure 5.3 shows the progress engine leveraging the CPU time of communication c_3 to progress communications c_0 and c_2. Similarly, the pending communication c_4 is progressed when communication c_5 is issued.
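The toy sketch below captures the opportunistic behaviour just described; it is not taken from any real MPI implementation, and every name in it is illustrative.

    #include <stdbool.h>
    #include <stddef.h>

    /* Toy model of a progress engine's pending list. */
    typedef struct pending_comm {
        struct pending_comm *next;
        bool (*precondition_met)(struct pending_comm *self); /* e.g., remote lock granted */
        void (*start_transfer)(struct pending_comm *self);   /* hand the data to the NIC  */
    } pending_comm;

    static pending_comm *pending_head = NULL;

    /* Called whenever the middleware gets CPU time for some communication c_i:
     * besides progressing c_i itself (not shown), the engine opportunistically
     * triggers every pending communication whose precondition has been
     * fulfilled since the last visit. */
    static void progress_engine_poll(void)
    {
        pending_comm **link = &pending_head;
        while (*link != NULL) {
            pending_comm *c = *link;
            if (c->precondition_met(c)) {
                c->start_transfer(c);
                *link = c->next;   /* transfer started: remove from the pending list */
            } else {
                link = &c->next;   /* still pending: revisit on a later poll         */
            }
        }
    }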

[Figure 5.3: Opportunistic message progression. The figure plots, against a time axis and a met-precondition axis, communications c_0 to c_5 as they are issued at the application level, kept pending inside the MPI middleware, and eventually transferred on the network device.]

One can make a few important observations from how application-level activities map to their middleware-level equivalents and how the CPU is shared between the two levels. In Figure 5.3 for instance, precondition 5 had been fulfilled since t_1 on the time axis, but communication c_5 had to wait until about t_8 to start transferring. Similarly, communication c_4 could not start right after it was initiated between t_5 and t_6 because its precondition is fulfilled later, at about t_7. For any communication with initiation time t_c and precondition realization time t_pc, the earliest transfer starting time t_et is:

    t_et = max(t_c, t_pc)                                                    (5.1)

A communication which completes early allows both the local and the involved remote peers to move on earlier and minimize wait times. Thus, as much as possible, t_et should be minimized. As a result, both t_c and t_pc must be minimized. An epoch closed with blocking synchronizations does not allow the HPC programmer to minimize t_c in subsequent epochs. Incidentally, serial epochs make a suboptimal use of the opportunistic behaviour of the progress engine because they do not leverage early precondition fulfillment times, just like communication c_5 in Figure 5.3. In fact, replacing each communication in Figure 5.3 by an epoch, as shown in Figure 5.2(c), leads to the complete absence of pending items inside the progress engine. As a result, no transfer is ever opportunistically triggered, regardless of the currently realized preconditions.

The progress engine can only trigger the transfers that it is aware of; that is, those that have transitioned from the application level to the middleware level. As soon as any preventing condition is removed at the application level, a communication is always better off pending inside the middleware, regardless of when its preconditions are fulfilled. Therefore, the latency propagation-minimizing version of the situation described by Figure 5.1 is the one depicted in Figure 5.4, where parallelization can increase the number of pending epochs inside the middleware and, incidentally, the effects of opportunistic progression.

[Figure 5.4: Parallel epochs. The same epochs as in Figure 5.1, each carrying its RMA communications (RMAa_0 ... RMAa_n, RMAb_0 ... RMAb_m, ..., RMAz_0 ... RMAz_l), now progress concurrently.]

5.4 Impacts of Nonblocking RMA Epochs

Nonblocking epochs can be fulfilled with the nonblocking handling mechanism that already exists in MPI (Section 2.1.5). Epoch closing, opening and flushing routines would first be initiated in a nonblocking way, and then completed with any of the test or wait family of routines.
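As a purely hypothetical illustration of that idea, the prototypes and usage below show how such routines could plug into MPI's existing request machinery. The names MPIX_Win_istart and MPIX_Win_icomplete, their argument lists and the implied ordering semantics are placeholders invented for this sketch; they are not standard MPI and are not necessarily the interface detailed later in this chapter.

    #include <mpi.h>

    /* Hypothetical request-producing epoch routines (prototypes only). */
    int MPIX_Win_istart(MPI_Group group, int assert, MPI_Win win, MPI_Request *req);
    int MPIX_Win_icomplete(MPI_Win win, MPI_Request *req);

    /* Sketch of an epoch whose opening and closing are both nonblocking; the
     * process can overlap work and detect completion with the usual wait/test
     * calls. do_work() stands for arbitrary application computation. */
    void nonblocking_epoch_sketch(MPI_Win win, MPI_Group targets,
                                  double *value, void (*do_work)(void))
    {
        MPI_Request reqs[2];

        MPIX_Win_istart(targets, 0, win, &reqs[0]);   /* initiate epoch opening */
        MPI_Put(value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPIX_Win_icomplete(win, &reqs[1]);            /* initiate epoch closing */

        do_work();                                    /* overlap with the epoch */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);    /* standard completion    */
    }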

With most HPC network devices being RDMA-enabled, one obvious advantage of having nonblocking communications is an increased level of communication/computation overlapping.

In the case of nonblocking epochs, a few additional positive impacts are described in the following subsections.

5.4.1 Fixing The Inefficiency Patterns

We mentioned in Section 5.2 that the Late Post, Late Complete, Early Fence, Wait at Fence and Late Unlock inefficiency patterns are yet to be fixed. All five inefficiency patterns find a sound solution with nonblocking epochs.

Late Post

We define t_0 as the invocation time of the origin-side epoch-closing routine. We assume that MPI_WIN_POST is late and occurs D_p after t_0. The data transfer duration of the RMA communications in the epoch is D_tr; and the time needed to close the origin-side epoch if all transfers are already completed is ε. In these conditions, the earliest the next origin-side activity can start after MPI_WIN_COMPLETE is:

    t_1,blocking epoch = t_0 + D_p + D_tr + ε                                (5.2)

With a nonblocking origin-side epoch closing, the RMA transfer duration does not propagate to the next activity. More importantly, any delay created by the target not exposing its epoch on time does not propagate to the next origin-side activity. Consequently, the earliest the next activity can start becomes:

    t_1,nonblocking epoch = t_0 + ε                                          (5.3)

Equation 5.3 shows that nonblocking epochs allow the origin process to completely mitigate the Late Post situation if the next activity does not depend on data produced in the epoch. Then, in the absence of a data dependency, since the next activity starts sooner than t_1,blocking epoch, it overlaps at least partially with the delay and the RMA transfer of the previous epoch even if it occurs after the epoch is over, allowing the overall completion time of both activities to be reduced. Needless to say, the earliest starting time of the next activity determines the earliest starting time of any other subsequent activity.
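For a concrete sense of scale, take purely illustrative values t_0 = 0, D_p = 50 µs, D_tr = 200 µs and ε = 1 µs: Equation 5.2 gives t_1,blocking epoch = 0 + 50 + 200 + 1 = 251 µs, whereas Equation 5.3 gives t_1,nonblocking epoch = 0 + 1 = 1 µs. The 250 µs difference is time during which the origin can compute or issue further epochs instead of sitting inside MPI_WIN_COMPLETE.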

Early Fence

We define t_0 here as the time when MPI_WIN_FENCE or its nonblocking equivalent is invoked. Using the previous definitions for all the other relevant time variables, the earliest moment the next activity can start with a blocking fence is:

    t_1,blocking epoch = t_0 + D_tr + ε                                      (5.4)

With a nonblocking version of MPI_WIN_FENCE, the next activity can also start at t_1,nonblocking epoch as expressed by Equation 5.3, with the same positive consequences mentioned above.

Late Complete

Blocking epochs offer no clear winning strategy as to the right synchronization call timing to maximize performance. Actually, they offer conflicting strategies for latency avoidance or mitigation. Epochs are critical section-like regions and, as such, they should be kept as short as possible (scenario 1 of Figure 5.5(a)). At the same time, CPU idling should be avoided. Therefore, useful work should be overlapped with the communication of an epoch if that communication is known to have a non-negligible transfer time (scenario 3 of Figure 5.5(a)). The performance-savvy MPI programmer resorts to this second approach to avoid the CPU idleness of scenario 1 in Figure 5.5(a). It is unrealistic to expect the occurrence of scenario 2 of Figure 5.5(a) because the application cannot calibrate its work length to be exactly the length of its data transfer.

The critical section thinking means that MPI_WIN_COMPLETE is invoked as quickly as possible. For an access epoch, there is even no guarantee that the corresponding exposure epoch will be opened on time. In a GATS setting, this means that an early blocking access epoch-closing call increases both the risk and the magnitude of Late Post suffered by origin processes. As a reminder, the origin-side epoch-closing routine blocks as long as the RMA transfers do not complete. However, as long as the target epoch is not open, the RMA transfer cannot even start.

Overlapped work work RMA transfer MPI_WIN_COMPLETE MPI_WIN_COMPLETE MPI_WIN_WAIT entrance exit earliest exit

TrEnd Blocking time RMA call Initiation of nonblocking Completion of End of RMA data transfer version of nonblocking version of MPI_WIN_COMPLETE MPI_WIN_COMPLETE

Origin Time Scenario 1

Target Time

Origin Time Scenario 2 Target Time

Origin Time Scenario 3

Target Time LateLate CompleteComplete

TrEnd (a) Latency transfer tradeoff leading to Late Complete

Origin Time

Target Time

TrEnd (b) Solving Late Complete with nonblocking version of MPI WIN COMPLETE

Figure 5.5: Impact of nonblocking epochs on Late Complete

The Late Complete inefficiency results from the hunt for communication/computation overlapping, and could therefore be the consequence of applying recommended HPC programming practices. The situation leading to Late Complete (scenario 3 in Figure 5.5(a)) is also a selfish attempt made by the origin process to avoid stalling while its own RMA transfers are in progress. By doing so, the origin process is better off, but it potentially transfers an unjustified wait time to the target. The alternative is the critical section thinking which, as shown in scenario 1 of Figure 5.5(a), guarantees the absence of the Late Complete inefficiency pattern at the expense of the origin process. The two (realistic) scenarios 1 and 3 are therefore the two sides of an unavoidable tradeoff where one peer potentially suffers some undesirable wait. With a nonblocking version of MPI_WIN_COMPLETE (Figure 5.5(b)), the tradeoff disappears because it becomes possible to keep the origin in scenario 3 while simultaneously having the target in scenario 1 of Figure 5.5(a). The origin can adopt the critical section thinking by closing the epoch as soon as possible, avoiding any unjustified wait for the target. At the same time, since the RMA transfer continues even after the epoch is closed in a nonblocking way, the origin can both avoid stalling and overlap some work with the communication progression.

Wait at Fence as the Risky Remedy for Early Fence

Early Fence and Wait at Fence are the two unfortunate outcomes of a blind timing decision. In order to avoid Early Fence with blocking epoch routines, a process should issue its fence call later. Unfortunately, by doing so, it might inflict Wait at Fence on the other participant processes. Once again, nonblocking epoch routines allow every participant to be selfish without inflicting any inefficiency on the other participants. With the right design in place, the nonblocking fences are issued early by every participant; and then the subsequent calls to test or wait can be delayed as much as needed without any latency transfer to remote peers.
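A corresponding sketch for fence synchronization, assuming a C binding named MPI_Win_ifence for the proposed MPI_WIN_IFENCE (do_local_work() is an illustrative placeholder): every participant issues the fence early, so neither Early Fence nor Wait at Fence is inflicted on remote peers, and completion detection is deferred for as long as needed.

#include <mpi.h>

extern void do_local_work(void);

static void fenced_work(MPI_Win win)
{
    MPI_Request fence_req;

    /* ... RMA calls of the current fence epoch ... */
    MPI_Win_ifence(0, win, &fence_req);       /* issued early by every participant */
    do_local_work();                          /* no latency transferred to peers   */
    MPI_Wait(&fence_req, MPI_STATUS_IGNORE);  /* detect completion when convenient */
}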

Late Unlock

Late Unlock can be analyzed with Figure 5.5 by replacing the “origin” with the “current lock holder” and the “target” with the “subsequent lock requester”. There is no incentive for the current lock holder to issue MPI_WIN_UNLOCK early, as it might experience the same stalling as the origin in scenario 1 of Figure 5.5(a). Unfortunately, by putting itself in scenario 3 of Figure 5.5(a), the current lock holder could inflict an undesirable wait time on a subsequent lock requester. Just as in the case of Late Complete, a nonblocking version of MPI_WIN_UNLOCK completely voids the painful tradeoff.

5.4.2 Unlocking or Improving New Use Cases

Working Around the Impossible “Epoch Inside Epoch” Issue

The possibility of opening an epoch without waiting for the completion of the previous one is a powerful feature. To explain how nonblocking epochs allow new use cases, we define two distinct groups G0 and G1 of target processes. Many reasons could require the creation of distinct groups for disjoint sets of targets. For instance, a group might accept an optimization (e.g., MPI_MODE_NOPUT in the matching target epoch) while the other one could not. We also define rj^Gi as the j-th RMA communication towards a target in Gi. With a single RMA window w, it is not allowed for an origin process to issue the sequence of calls:

MPI_WIN_START(G0, w), r0^G0, r1^G0, ..., rk^G0,
MPI_WIN_START(G1, w), r(k+1)^G1, r(k+2)^G0, ...,
MPI_WIN_COMPLETE(w), MPI_WIN_COMPLETE(w)

One obvious problem is that it is not clear which invocation of MPI_WIN_COMPLETE closes the epoch over G0 or G1. The only way to have those two distinct access epochs from the same origin process is to serialize them; i.e., close the one opened over G0 before opening the one over G1. However, as mentioned earlier, blocking epoch serialization is a hindrance for both performance and scalability. The other solution is to perform each of the epochs on distinct RMA windows. Unfortunately, this second approach works only for reasonably structured execution flows in which the number of epochs can be known before starting the computational loops which perform the RMA operations. If the execution flow is very unstructured instead, and the amount of required RMA communications and epoch characteristics are determined by runtime dynamic conditions, resorting to multiple RMA windows could be at best inefficient, if not impossible. Creating the RMA windows on demand introduces barrier-like collective calls in the middle of the computations, with a potentially increased effect of late peer arrival. Sometimes, such an approach could even be difficult to orchestrate without deadlock risks. Even if the MPI programmer can guess an upper limit on the number of required RMA windows, statically creating them all ahead of time is a poor workaround, especially if the program can require a large number of those windows.

With nonblocking epochs, the same origin process can use a unique RMA window to issue as many epochs as needed without blocking. It then becomes possible to fulfill the goal of “epoch inside epoch” in two different ways. In the first approach, the two epochs over G0 and G1 would still be issued serially. The epoch E1 over G1 is not issued until the epoch E0 over G0 is entirely issued; but thanks to nonblocking synchronizations, E1 does not wait for the completion of the data transfer of E0 before being issued. Instead of having epochs inside epochs, the application realizes epoch data transfers occurring inside previous epoch data transfers; another way of reaching the same goal in terms of overall latency. In the second approach, the application could break the epoch over G0 into two epochs E0 and E2 as follows (a code sketch follows the list):

• Open an epoch E0 over G0.

• When the need arises to open the epoch E1 over G1 while E0 is still open, close E0 in a nonblocking fashion and open E1. Issue the RMA calls of E1 and close it in a nonblocking fashion.

• If the need arises to do additional RMA transfers over G0 again, the application can simply open another epoch E2 over G0. By doing so, the application effectively realizes the equivalent of the “epoch inside epoch” scenario that is not possible with blocking synchronizations.
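A sketch of this second approach follows, assuming C bindings named MPI_Win_istart and MPI_Win_icomplete for the proposed MPI_WIN_ISTART and MPI_WIN_ICOMPLETE; group0 and group1 stand for G0 and G1, w is the single RMA window, and the RMA calls themselves are elided.

#include <mpi.h>

/* Sketch only: MPI_Win_istart and MPI_Win_icomplete are the proposed
 * nonblocking GATS routines and are not part of MPI-3.0. */
static void epoch_inside_epoch(MPI_Group group0, MPI_Group group1, MPI_Win w)
{
    MPI_Request reqs[6];
    int n = 0;

    MPI_Win_istart(group0, 0, w, &reqs[n++]);   /* open E0 over G0                     */
    /* ... RMA calls towards targets in G0 ... */
    MPI_Win_icomplete(w, &reqs[n++]);           /* close E0 without waiting            */

    MPI_Win_istart(group1, 0, w, &reqs[n++]);   /* open E1 over G1 while E0's transfers
                                                   may still be in flight              */
    /* ... RMA calls towards targets in G1 ... */
    MPI_Win_icomplete(w, &reqs[n++]);

    MPI_Win_istart(group0, 0, w, &reqs[n++]);   /* E2: further transfers towards G0    */
    /* ... RMA calls towards targets in G0 ... */
    MPI_Win_icomplete(w, &reqs[n++]);

    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);  /* detect completion of all epochs     */
}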

Improving the Inefficient Massively Unstructured Atomic Communication Pattern

Algorithms whose communication patterns perform massive transactions can be challenging to realize efficiently in MPI. The communication pattern goes as follows. At any given time, a set of peers {Pi} can update another set of peers {Pj}. {Pi} and {Pj} are not disjoint sets; any configuration is possible. Processes do not know ahead of time how many updates they will get; nor can they determine where these updates will originate from or what addresses they will target. As a consequence, it is very inefficient to resort to two-sided communications where the updates would require matching receives. Finally, the updates must be atomic; meaning that no one-sided epoch type but exclusive MPI_WIN_LOCK is adequate. We define:

• S as the set of processes.

• n as the number of times a process performs updates.

• i as the rank of the target process currently being updated.

• u as an atomic update operation. An update is an MPI_ACCUMULATE.

MPI_ACCUMULATE guarantees atomicity only when it is invoked with a predefined datatype for the target and a single item of such a datatype is being accumulated. If multiple predefined datatype elements must be accumulated, atomicity is guaranteed only on a per-element basis. As a transaction can accumulate more than a single element per update, it requires an exclusive lock to guarantee atomicity. Thus, we also define MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE, i) and MPI_WIN_UNLOCK(i) as the calls which respectively lock exclusively and unlock the process of rank i. A single transaction is:

Ui = MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE, i), u, MPI_WIN_UNLOCK(i)

The communication pattern from the point of view of each process is therefore a large number of repetitions of the sequence Ui. For two distinct updates, even if they are directed towards the same target, they must occur in two distinct epochs; otherwise, they are no longer guaranteed to be atomic. As a result, the number of epochs generated by a process is strictly equivalent to the number of transactions that it performs. With blocking epochs, this communication pattern could perform poorly because every single one of the massive number of transactions is strictly serialized.

However, one can notice for any given origin process that two transactions towards two distinct targets need not be serialized, because atomicity matters only for multiple transactions modifying the same data. For a given process, if we define Ui^k as the k-th transaction towards target i, then all the updates in the sequence Ua^0, Ub^0, Uc^0, Ua^1, Ud^0 can occur in parallel except for Ua^0 and Ua^1. The current specification of MPI-3.0 RMA epochs serializes them all at the application level. In comparison, nonblocking epochs allow them all, including the two that are meant for target a, to be issued simultaneously. Then, because the lock of process a is obtained exclusively anyway, the progress engine takes care of serializing Ua^0 and Ua^1 internally; voiding any hazard associated with those two updates being concurrently active. In the meantime, all the other updates are issued without delay at the application level.
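The sketch below shows what a single transaction Ui could look like, assuming C bindings named MPI_Win_ilock and MPI_Win_iunlock for the proposed MPI_WIN_ILOCK and MPI_WIN_IUNLOCK; the datatype, element count and displacement are illustrative. Transactions towards distinct targets can then be issued back to back at the application level, while the progress engine serializes only those aimed at the same target.

#include <mpi.h>

/* Sketch only: one atomic update transaction Ui using the proposed
 * nonblocking lock/unlock; req[0] and req[1] must later be completed
 * with the test or wait family of routines. */
static void issue_transaction(int target, const double *operand, int count,
                              MPI_Aint target_disp, MPI_Win win,
                              MPI_Request req[2])
{
    MPI_Win_ilock(MPI_LOCK_EXCLUSIVE, target, 0, win, &req[0]);
    MPI_Accumulate(operand, count, MPI_DOUBLE, target,
                   target_disp, count, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_iunlock(target, win, &req[1]);
}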

5.5 Nonblocking API

For each potentially blocking epoch function MPI_WIN_FUNC(LIST_OF_PARAM), a nonblocking version of the form MPI_WIN_IFUNC(LIST_OF_PARAM,REQUEST) is provided. LIST_OF_PARAM is the list of parameters of the blocking version while REQUEST is an output parameter used to detect completion as described in Section 2.1.5.
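For illustration, the C bindings could take the following shape; the prototypes below only convey the naming and argument convention (the parameters of the blocking routine followed by an output request) and are not a finalized interface.

int MPI_Win_ipost(MPI_Group group, int assert, MPI_Win win, MPI_Request *request);
int MPI_Win_istart(MPI_Group group, int assert, MPI_Win win, MPI_Request *request);
int MPI_Win_ifence(int assert, MPI_Win win, MPI_Request *request);
int MPI_Win_ilock(int lock_type, int rank, int assert, MPI_Win win, MPI_Request *request);
int MPI_Win_ilock_all(int assert, MPI_Win win, MPI_Request *request);
int MPI_Win_icomplete(MPI_Win win, MPI_Request *request);
int MPI_Win_iwait(MPI_Win win, MPI_Request *request);
int MPI_Win_iunlock(int rank, MPI_Win win, MPI_Request *request);
int MPI_Win_iunlock_all(MPI_Win win, MPI_Request *request);
int MPI_Win_iflush(int rank, MPI_Win win, MPI_Request *request);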

The nonblocking epoch-opening API is composed of MPI_WIN_IPOST, MPI_WIN_ISTART,

MPI_WIN_IFENCE, MPI_WIN_ILOCK and MPI_WIN_ILOCK_ALL. While modern MPI libraries tend to provide nonblocking epoch-opening routines [7, 69, 71, 79, 104], the MPI standard is not specific about the blocking or nonblocking nature of all these functions; meaning that their behaviour is implementation-dependent. The API provided in this section enforces the uni- form ambiguity-free nonblocking nature of its functions. For instance, while a blocking im- plementation of MPI_WIN_START is as standard-compliant as a nonblocking implementation,

MPI_WIN_ISTART can only be nonblocking. We emphasize that MPI_WIN_POST was already specified as nonblocking in MPI-3.0; as a consequence, MPI_WIN_IPOST is provided solely for uniformity and completeness.

The nonblocking epoch-closing API is composed of MPI_WIN_IWAIT, MPI_WIN_ICOMPLETE, MPI_WIN_IFENCE, MPI_WIN_IUNLOCK and MPI_WIN_IUNLOCK_ALL. As a reminder, MPI_WIN_FENCE is used for both epoch opening and epoch closing. It is also worth recalling that MPI-3.0 already provides MPI_WIN_TEST as the nonblocking equivalent of MPI_WIN_WAIT. However, the new MPI_WIN_IWAIT that we propose remains relevant. In fact, compared to MPI_WIN_TEST, the combination of MPI_WIN_IWAIT with the test family of completion routines is more powerful because it allows the asynchronous and wait-free initiation of subsequent epochs. MPI_WIN_TEST detects the completion of the current exposure epoch in a nonblocking manner; but since no other exposure epoch can be opened until the completion actually occurs, it does not prevent application-level epoch serialization. The only merit of MPI_WIN_TEST is to prevent the CPU from idling while waiting for the currently active exposure epoch to complete.

Finally, the nonblocking flush API is composed of MPI_WIN_IFLUSH, MPI_WIN_IFLUSH_LOCAL, MPI_WIN_IFLUSH_ALL and MPI_WIN_IFLUSH_LOCAL_ALL. With these new routines, a passive target epoch can remain entirely nonblocking even if it hosts some function calls from the flush family.

5.6 Semantics, Correctness and Progress Engine Behaviours

In general, dealing with a communication which completes exactly at the call site has the advantage of allowing the programmer to reason about the exact time when a buffer is modified at the application level. In comparison, a nonblocking communication only allows the programmer to reason about “the latest moment” when a buffer is done being modified. Between the moment the nonblocking communication call exits and the moment the buffer safety guarantees are provided, there is a window of application-level instructions which leaves room for various kinds of data corruption hazards. In the specific case of one-sided communications, the closing synchronization calls are meant to guarantee memory consistency as far as the buffers touched by RMA are concerned. Thus, if the closing synchronization ceases to provide that guarantee because of its nonblocking nature, the programmer is left with a tool which could be seen as potentially error-prone. Nevertheless, one can make a case for such a tool by considering the options of 1) giving the HPC programmer performance-restricted features meant to prevent mistakes or 2) giving the informed HPC programmer powerful tools. It helps to point out that the second option is not unusual in supercomputing, and in MPI for that matter. The only requirement when that option is chosen is to clearly spell out the semantics, correctness rules and hazardous situations that might be inherent to the tool being given to the programmers. This requirement, which we fulfill in this section, is an integral part of the novel proposal.

For an epoch, we distinguish between application-level lifetime and internal lifetime. The beginning and ending of the application-level lifetime respectively correspond to the application-level opening and closing of the epoch. We use the terms “open” and “closed” to define the boundaries of the application-level lifetime of an epoch. The internal lifetime takes place inside the middleware; it starts when the epoch is internally activated for progression by the progress engine; and it ends when the epoch is done being progressed and all the internal closing notifications have been sent to all the relevant peers. The terms “activated” and “completed” define the boundaries of the internal lifetime of an epoch. An epoch that is still alive internally is said to be active.

For RMA synchronizations as defined by the MPI-3.0 standard, application-level and internal lifetimes occur strictly at the same time; and the distinction is not even meaningful. With the nonblocking synchronizations being proposed in this dissertation, the two lifetimes must be distinct because they do not necessarily occur at the same time. In fact, the most important concept that distinguishes blocking and nonblocking epoch closings is the distinction between “closing” and “completion”. An epoch closed with a blocking synchronization always completes when the synchronization routine exits. In comparison, an epoch closed with a nonblocking synchronization might not complete until completion is explicitly detected with one of the test or wait family of functions. Figure 5.6 shows the distinction between the two levels of lifetime.

We designate by deferred epoch an epoch that cannot be activated as soon as it is opened at the application level. In Figure 5.6, all the application-level epochs but e2 lead to deferred epochs. A deferred epoch is different from an epoch that is not yet granted access. In fact, all kinds of epochs can be deferred, including exposure epochs for which access acquisition is not meaningful.

5.6.1 Semantics and Correctness

The semantics of nonblocking epochs complements the existing blocking epochs of the MPI-3.0 specification. Additionally, it modifies the completion conditions to account for the split-phase wait of nonblocking epoch closing. The correctness specification describes conditions for the nonblocking synchronizations to guarantee the same level of safety as the existing blocking synchronizations. The correctness rules also cover the hazard-free practices for creating epochs which exist in overlapping time frames. Finally, the rules spell out the behaviour to expect from the progress engine when nonblocking synchronizations are used. We define the following application-level events and entities:

• OEi: Open the i-th epoch in a potentially blocking fashion.

• IOEi: Open the i-th epoch in a deterministically nonblocking way.

• CEi: Close the i-th epoch in a blocking way.

• ICEi: Close the i-th epoch in a nonblocking way.

• rk: Issue the k-th RMA communication.

• AEi: The i-th access epoch.

• Ei: The i-th epoch. The access or exposure nature of Ei is not of importance.

In the rules that follow, only the last of a succession of epochs is potentially still open. Thus, unless otherwise specified, the succession of nonblocking epochs Ei, Ei+1 never means that Ei and Ei+1 overlap at the application level. In fact, any series (Ei, Ei+1) is equivalent to either (IOEi, ICEi, IOEi+1, ICEi+1) if both epochs are already closed; or (IOEi, ICEi, IOEi+1) if the last of the two epochs is still open. The following rules, meant to define constraints, clarifications and recommendations, apply:

1. RMA communications happen inside nonblocking epochs with the exact same rules and constraints they are currently subject to in the MPI-3.0 specification. The situations that were specified as undefined and erroneous in the MPI-3.0 standard still hold with the addition of nonblocking epochs.

2. For an epoch which is closed in a nonblocking fashion, there is a clear distinction between the end of the epoch, which is commanded by a call to ICE, and its completion, which requires a call to the test or wait functions. For instance, in Figure 5.6, epoch e0 is closed before the time step t1; but its completion at the application level occurs at t4. The access to the buffers involved in the RMA communications of an epoch which is closed in a nonblocking manner remains unsafe until completion is successfully detected. In Figure 5.6, the application can assume that it is safe to modify the buffers touched by epoch e0 only after t4. However, the buffers touched by e1 are not safe to modify even at t5, because the completion detection attempt is not successful. One can notice the contrast with blocking epoch-closing synchronization routines, which always guarantee completion and buffer safety by the time they exit.

Figure 5.6: Mapping of application-level nonblocking epochs to MPI middleware-level handling. The figure shows, on a common time axis (t0 to t9), when each application-level epoch ei is opened and closed, when it is activated and completed inside the middleware, and when completion detection attempts succeed or fail.

3. Any issued RMA communication belongs to the most recently opened access epoch. For instance, in the sequence (IOE0, r0, r1, ..., rk, ICE0, IOE1, rk+1, rk+2, ..., rk+m), rk+1 to rk+m belong to AE1 even if AE0 has not completed yet.

4. It is possible to open an epoch in a nonblocking fashion and close it in a blocking fashion, and vice versa. In fact, any combination of blocking and nonblocking routines can be used for the synchronization routines that make an epoch.

5. An epoch is guaranteed to be entirely nonblocking only if its opening and its closing are all nonblocking. If it is a passive target epoch, any flush call made inside the epoch must be nonblocking as well for the epoch to remain nonblocking.

6. The use of nonblocking routines to open, close or flush an epoch does not impact the ordering of its RMA operations. As a consequence, blocking and nonblocking epochs have the exact same intra-epoch ordering and atomicity rules. These rules are the ones currently defined in the MPI-3.0 specification.

7. In active target, access and exposure epochs match one another in a FIFO manner. The oldest deferred access epoch is always the next to match the oldest deferred exposure epoch, and vice versa. If either side does not have deferred epochs, then the match occurs between the oldest deferred epoch on one side and the next opened epoch on the other side. This rule maintains the possibility of reasoning about the deterministic write ordering when multiple nonblocking epochs are used over the same memory region. The determinism of reasoning applies to both implementers and end-users.

8. For a given process, epochs are always activated serially in the order in which they become pending inside the progress engine. As a consequence, if for any reason Ek cannot be activated yet, then Ek+1 cannot be activated either. This rule means that epochs are not skipped. This rule only defines activation constraints and does not mean that Ek must be internally completed before Ek+1 is activated.

9. MPI_WIN_IFENCE entails a barrier semantics wherever MPI_WIN_FENCE does; that is, whenever the fence call ends an epoch. In particular, if a call to MPI_WIN_IFENCE must close epoch Ek and open epoch Ek+1, the progress engine of each process must internally delay activating Ek+1 until it receives the Ek completion notifications from all the other peers encompassed by the RMA window. The delay impacts how long it takes for completion to be fulfilled for MPI_WIN_IFENCE, but it must produce no noticeable blocking call at the application level.

10. Test or wait functions must always eventually be invoked at the application level to detect the successful completion of each request, even if the completion of the request can be inferred by other means. Failure to do so leads to resource leaks. In particular:

(a) If the completion of a passive target epoch is noticed at the application level via a test or wait function, the requests associated with all the flush calls issued inside the epoch are of course guaranteed to have completed as well. Nevertheless, if not yet invoked, wait or test routines must still be called on these flush requests to clean up their associated internal objects.

(b) By the time MPI_WIN_FREE returns, all the epochs associated with the concerned RMA window are guaranteed to have completed. Nevertheless, wait or test calls are still required for each request whose completion detection has not yet been performed at the application level.

11. All the aforementioned rules define the dynamics of epoch management for a given RMA window. When multiple RMA windows expose overlapping memory regions, the current MPI-3.0 specification rules for multiple RMA windows apply. It is the application programmer's responsibility to provide means of memory consistency hazard avoidance when multiple RMA windows have simultaneous epochs that alter overlapping ranges of virtual addresses.

For an MPI one-sided communication programmer who already has some experience with writing nonblocking communication code with either two-sided or collective routines, the new nonblocking synchronizations add a very reasonable level of complexity. The actual complexity of this proposal lies in the realization and implementation inside the middleware. In fact, Rule 1 and Rule 11 state constraints or recommendations that already existed in the MPI-3.0 specification for one-sided communications. They simply establish that those recommendations or constraints are unchanged by the introduction of nonblocking synchronizations.

Rule 3 is a clarification. Rule 4, Rule 5 and Rule 6 are also clarifications; but they additionally explain the various kinds of flexibility that are offered to the MPI application programmer with nonblocking synchronizations. The recommendation expressed by Rule 10 is the equivalent, for nonblocking synchronizations, of a recommendation that already exists in MPI. Requests must be explicitly freed in MPI in all circumstances, even if they can be inferred to have completed.

The constraints expressed by Rule 7, Rule 8 and Rule 9 are entirely implementation-related.

These rules also serve as clarification for what should be expected at application level.

Rule 2 is the only new correctness burden put on the application programmer. For a given RMA window, if Rule 2 is respected, then any epoch or succession of epochs built with nonblocking synchronizations can be rewritten with an epoch or a succession of epochs built with the existing MPI-3.0 synchronizations. Rule 2 means that if any load or store is allowed only after an epoch built with blocking epoch-closing synchronizations, then it is allowed only after the wait or successful test that completes the equivalent epoch built with nonblocking epoch-closing synchronizations. One can notice that Rule 2 is not any more complex than the constraint that forbids the access of a buffer manipulated by an MPI_IRECV or MPI_ISEND until their associated wait or test call completes successfully.

5.6.2 Info Object Key-value Pairs and Aggressive Progression Rules

In order to guarantee correctness by default, the progress engine does not activate an epoch while another one is still active. One can notice that this default behaviour does not defeat the purpose of nonblocking epochs or synchronizations. In fact, for any given RMA window, nonblocking synchronizations allow any combination of multiple epochs to be simultaneously pending inside the progress engine. As a result, the epochs can be activated at the earliest possible time thanks to opportunistic message progression. In comparison, at application level, blocking synchronizations allow a maximum of one origin-side epoch and one target-side epoch for any given RMA window. As a consequence, blocking synchronizations lead to the following three suboptimal situations:

• Complete absence of opportunistic progression for a set of origin-side-only epochs.

• Complete absence of opportunistic progression for a set of target-side-only epochs.

• A maximum of one epoch being opportunistically progressed for any set of mixed origin-side and target-side epochs.

Even though it surpasses blocking synchronizations, the default behaviour of nonblocking epochs can still be limiting in situations where the programmer knows of the absence of correctness hazards for a given target. Obviously, further optimizations are possible beyond the sole opportunistic message progression advantage. Thus, in addition to the API, we provide the following info object key-controlled boolean flags that the programmer can associate with an RMA window:

• MPI_WIN_ACCESS_AFTER_ACCESS_REORDER: If its value is 1, then the progress engine can activate and progress any origin-side epoch even if an immediately preceding origin-side epoch is still active.

• MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER: If its value is 1, then a subsequent origin epoch can be activated and progressed concurrently with an immediately preceding and still active exposure epoch.

• MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER: If its value is 1, then the progress engine can activate and progress any target-side epoch even if an immediately preceding target-side epoch is still active.

• MPI_WIN_EXPOSURE_AFTER_ACCESS_REORDER: If its value is 1, then a subsequent target epoch can be progressed concurrently even if an immediately preceding origin-side epoch is still active.
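As an illustration, the flags could be attached to a window at creation time through the standard info-object mechanism; only the key names come from this proposal, while MPI_Info_create, MPI_Info_set and MPI_Win_create are standard MPI calls. The base, size and disp_unit variables are assumed to be set up by the application.

MPI_Info info;
MPI_Win  win;

MPI_Info_create(&info);
/* Proposed keys; the value "1" enables the corresponding aggressive progression. */
MPI_Info_set(info, "MPI_WIN_ACCESS_AFTER_ACCESS_REORDER", "1");
MPI_Info_set(info, "MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER", "1");

MPI_Win_create(base, size, disp_unit, info, MPI_COMM_WORLD, &win);
MPI_Info_free(&info);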

The consequence of any of these four flags being enabled is that the RMA communications of a subsequent epoch Ei+1 can end up being transferred before those of a previous epoch Ei. If the epochs contain any MPI_GET, MPI_RGET, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE or MPI_FETCH_AND_OP, write reordering can occur in the origin address space with respect to the chronology of Ei and Ei+1. Similarly, if the epochs contain MPI_PUT, MPI_RPUT, MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP or MPI_COMPARE_AND_SWAP, write reordering can occur in some targets with respect to the chronology of Ei and Ei+1. In general, a read reordering in a target can lead to a write reordering in some origin. Write reordering is not a desirable outcome because it bears hazards. Justifiably, all these flags are disabled by default. It is assumed that the HPC programmer can guarantee, via knowledge of the data access pattern of the application, that the RMA activities of concurrently progressed epochs involve strictly disjoint memory regions.

The flags set by these new info object key-value pairs all apply to a given RMA window and do not bear any meaning across distinct window objects; that is, the flags operate at window level and independently for each window. The flags allow the progress engine to perform aggressive message progression by completing epochs out of order if required. They lead to new semantics relaxation rules. However, we have to be cautious in order to 1) allow HPC programmers to still reason about data modification outcomes and 2) avoid modifications that introduce ambiguity into the existing MPI-3.0 semantics. As a result, the progress engine optimization flags do not apply and have no effect on epochs opened by MPI_WIN_LOCK_ALL, MPI_WIN_FENCE and their respective nonblocking equivalents.

For MPI_WIN_LOCK_ALL and MPI_WIN_ILOCK_ALL, the reasons for not supporting the progress engine optimization flags are as follows. The MPI specification does not allow a process to create an exposure epoch if it is already locked. Similarly, an already locked process cannot be exposed in GATS. These restrictions apply at the application level and are therefore the responsibility of the HPC application programmer; the middleware does not have to enforce or check them. However, even if the programmer respects the constraint, MPI_WIN_LOCK_ALL or MPI_WIN_ILOCK_ALL could still reproduce the forbidden scenarios inside the middleware if we allowed any of the four optimizations to apply to them. The middleware cannot be allowed to create a violation that the application programmer has carefully and purposely avoided at the application level. It is possible to put on the programmer the extra burden of making sure that the progress engine never gets into a situation where it is performing some optimizations while MPI_WIN_LOCK_ALL or MPI_WIN_ILOCK_ALL is potentially pending; but we foresee no compensation for putting such a burden on the programmer.

Furthermore, applying the optimizations to fence epochs could either violate some MPI-3.0 constraints or create too much room for hazardous situations for the HPC programmer. First, two adjacent fence epochs must be progressed serially because a middle fence entails a barrier semantics. In fact, epoch-closing fences always entail a barrier semantics. As for two adjacent epochs of which only the second one is fence-based, there is the increased difficulty of reasoning about disjoint memory accesses because each process of a fence epoch is potentially simultaneously an origin and a target. There is possibly some room for designing a simultaneous progression in this last case; but the complexity and required caution call for further careful analysis.

In summary, the optimization flags lead to the following additional rules:

12. If MPI_WIN_ACCESS_AFTER_ACCESS_REORDER is set, then a GATS epoch created with MPI_WIN_START or MPI_WIN_ISTART is immediately activated if all the following conditions are simultaneously true:

(a) The immediately preceding epoch is already activated. As per Rule 8, this condition actually means that all the preceding epochs are already activated.

(b) The immediately preceding epoch was created by either MPI_WIN_START, MPI_WIN_ISTART, MPI_WIN_LOCK or MPI_WIN_ILOCK.

13. If MPI_WIN_ACCESS_AFTER_ACCESS_REORDER is set, then an epoch created with MPI_WIN_LOCK or MPI_WIN_ILOCK is immediately activated if all the following conditions are simultaneously true:

(a) The rank intended to be locked is not currently a target. This condition is required to 1) avoid violating the constraint that forbids an RMA window from being simultaneously locked and exposed; and 2) prevent unintended recursive shared RMA lock acquisition by the progress engine.

(b) The immediately preceding epoch is already activated.

(c) The immediately preceding epoch was created by either MPI_WIN_START, MPI_WIN_ISTART, MPI_WIN_LOCK or MPI_WIN_ILOCK.

14. If MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER is set, then a GATS epoch created with MPI_WIN_START or MPI_WIN_ISTART is immediately activated if all the following conditions are simultaneously true:

(a) The immediately preceding epoch is already activated.

(b) The immediately preceding epoch was created by either MPI_WIN_POST or MPI_WIN_IPOST.

15. If MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER is set, then an epoch created with MPI_WIN_LOCK or MPI_WIN_ILOCK is immediately activated if all the following conditions are simultaneously true:

(a) The rank intended to be locked is not currently a target.

(b) The immediately preceding epoch is already activated.

(c) The immediately preceding epoch was created by either MPI_WIN_POST or MPI_WIN_IPOST.

16. If MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER is set, then an epoch created with MPI_WIN_POST or MPI_WIN_IPOST is immediately activated if all the following conditions are simultaneously true:

(a) The local RMA window is not currently locked.

(b) The immediately preceding epoch is already activated.

(c) The immediately preceding epoch was created by either MPI_WIN_POST or MPI_WIN_IPOST.

17. If MPI_WIN_EXPOSURE_AFTER_ACCESS_REORDER is set, then an epoch created with MPI_WIN_POST or MPI_WIN_IPOST is immediately activated if all the following conditions are simultaneously true:

(a) The local RMA window is not currently locked.

(b) The immediately preceding epoch is already activated.

(c) The immediately preceding epoch was created by either MPI_WIN_START, MPI_WIN_ISTART, MPI_WIN_LOCK or MPI_WIN_ILOCK.

18. For any two adjacent epochs, if at least one is MPI_WIN_FENCE or MPI_WIN_IFENCE, the second is not activated until the first one completes.

19. For any two adjacent epochs, if at least one is MPI_WIN_LOCK_ALL or MPI_WIN_ILOCK_ALL, the second is not activated until the first one completes.

We re-emphasize that the guarantee of hazard avoidance when the optimization flags are used is the responsibility of the programmer. It is the same responsibility that he bears when an MPI_IRECV or MPI_ISEND is issued while a previous MPI_IRECV or MPI_ISEND is still pending. If the programmer does not separate both calls by a barrier or a wait call, then it is assumed that he knows for sure that both nonblocking calls do not overlap in the memory regions that they modify. Similarly, if the programmer enables a flag for the progress engine to activate an epoch while a previous one is still active, then he is supposed to have the certainty that both epochs do not modify overlapping memory regions at either end of their communications.

5.7 Design and Realization Notes

As a proof of concept of the nonblocking synchronizations and epochs, we realize a concrete design that we implement in MVAPICH. The initial design strategies of RMA in the existing MVAPICH library were not adequate to use as a starting point. Consequently, instead of altering the MVAPICH RMA implementation, we went with a wholesale redesign. The new design does not simply cover nonblocking synchronizations; it covers the one-sided communication model as a whole and includes blocking synchronizations as well. The RMA communications have to be redesigned too. This section documents the key decisions and choices made for the design. The proof of concept is implemented over InfiniBand.

5.7.1 Separate and New Progress Engine

Nonblocking synchronization calls add a layer of complexity to the progress engine. In the existing MVAPICH design, RMA communication progression is too tightly coupled with the epoch-closing synchronizations to allow the design alterations required by our proposal. In particular, while a closing synchronization call must still mean that no more RMA communication is allowed in a given epoch, there is now a need to decouple the exit of the synchronization routine from the completion of the RMA calls that came before. Such a requirement is a necessary condition for nonblocking epochs. Furthermore, the need to have multiple origin-side or target-side epochs simultaneously pending on a single RMA window is not compatible with any design that realizes the current MPI-3.0 RMA. Finally, new concepts are introduced that cannot be easily patched onto the existing means of handling MPI RMA inside MVAPICH.

We designate by default progress engine the progress engine that exists in the vanilla MVAPICH. MVAPICH has generic ways of transmitting data between processes. These generic ways define a uniform and higher-level abstraction for accessing the network device. Instead of using the existing network access abstraction, our new RMA progress engine has its own distinct and direct access to the network device. We found the existing middleware-level control message routing of MVAPICH heavyweight and potentially slow for our needs. As a result, the new progress engine puts in place its own lightweight and faster control message management. Examples of control messages in the case of one-sided communications are lock requests, lock release notifications, exposure granting notifications and intermediate operand buffer negotiation for accumulate operations that deal with large payloads. The distinct access to the network device does not duplicate any of the already created InfiniBand queues; both progress engines still use the same InfiniBand queues. However, if the default progress engine is the first to pick from the HCA a communication completion notification that belongs to the new progress engine, it forwards it; the new progress engine does the same if it discovers any InfiniBand completion element that belongs to the default progress engine. Finally, the new progress engine does use the already existing collective communication facilities internally if required. Such uses occur only at RMA window creation and freeing times. We explain later how the new one-sided communication design avoids resorting to collective communications even in fence epochs. It matters to emphasize that, with the exception of window creation and freeing time, the new progress engine never generates any MPI message queue item; this is done on purpose to avoid latencies. In comparison, because it internally uses generic two-sided operations for some of its communications, the default RMA design of MVAPICH generates message queue items.

The MPI middleware now has two distinct entry points into its progress engines. All one-sided communication-related calls delve into the new RMA progress engine. All the other MPI calls delve into the default progress engine. The routines shared by both one-sided and non-one-sided communications delve into the default progress engine as well. Examples of such shared calls are the test and wait functions, which uniformly complete two-sided and collective communications as well as the new nonblocking one-sided synchronizations. However, an MPI library conceptually has a unique centralized progress engine in order to allow generalized opportunistic message progression, among other things. For instance, when a nonblocking two-sided communication is pending, any MPI call that delves into the middleware for other unrelated purposes such as collective or RMA communications is expected to progress the pending two-sided communication if possible. As a result, our new RMA-only progress engine and the default one must behave, from the point of view of the application, as if they were the same. In order to realize that illusion, the one-sided communication progress engine always invokes the default progress engine as part of its own execution. Similarly, the default progress engine always invokes the new one-sided progress engine unless there is no RMA window object alive. Complex recursive invocations are controlled or prevented in various situations.
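A minimal sketch of the mutual invocation, with a trivial guard against unbounded recursion, is given below. All function and variable names are hypothetical and are not actual MVAPICH internals; they only illustrate the structure described above.

/* Hypothetical sketch: two cooperating progress engines that present the
 * illusion of a single one. The helper routines are assumed to exist. */
extern void poll_rma_completions_and_control_messages(void);
extern void poll_two_sided_and_collective_completions(void);
extern int  rma_windows_alive(void);

static int in_rma_progress = 0;

void default_progress_engine(void);

void rma_progress_engine(void)
{
    in_rma_progress = 1;
    poll_rma_completions_and_control_messages();  /* RMA-only work                 */
    default_progress_engine();                    /* opportunistic progression of
                                                     two-sided/collective traffic  */
    in_rma_progress = 0;
}

void default_progress_engine(void)
{
    poll_two_sided_and_collective_completions();
    if (!in_rma_progress && rma_windows_alive())  /* skip when no window exists and
                                                     avoid recursive re-entry       */
        rma_progress_engine();
}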

5.7.2 Deferred Epochs and Epoch Recording

Deferred epochs were already introduced in Section 5.6. An epoch is deferred when its immediate activation would violate a semantics rule. Deferred epochs are middleware-level concepts. A deferred epoch is recorded until it is closed at the application level. Then, when it becomes activated later, it is replayed internally. Recording does not concern epochs that are activated at the same time as their application-level opening. We resort to two kinds of recordings, depending on how fast the subsequent replay can be fulfilled. The first type of recording, which is used for epoch-opening routines, saves almost exactly the application-level arguments. The replay of such a recording reproduces almost the exact activity that would have occurred if the application-level call had been fulfilled in real time by the progress engine. The second kind of recording saves the application-level calls in various middleware-friendly forms. All the RMA communication calls, the epoch-closing functions as well as the other passive target intra-epoch routines are saved with the second recording strategy.

An epoch can remain deferred even until it is closed at the application level, in which case it is internally flagged as closed. Deferred epochs are hosted in a deferred epoch queue attached to their RMA windows. Every time an active epoch is completed internally for a given RMA window, the RMA progress engine scans all the existing deferred epochs of that RMA window and activates in sequence all those that can be activated. The scan stops when the first deferred epoch that fails the activation conditions is encountered.
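The scan-and-activate step can be pictured as follows; the data structure and callbacks are purely illustrative and do not correspond to the actual MVAPICH objects.

/* Hypothetical sketch of the deferred-epoch queue scan that runs whenever an
 * active epoch completes on a window. */
typedef struct deferred_epoch {
    struct deferred_epoch *next;
    int  (*can_activate)(const struct deferred_epoch *); /* checks Rule 8 and Rules 12-19 */
    void (*replay)(struct deferred_epoch *);             /* replays the recorded calls    */
} deferred_epoch_t;

static void activate_deferred_epochs(deferred_epoch_t **queue_head)
{
    deferred_epoch_t *e = *queue_head;

    /* Activate in sequence; stop at the first epoch that fails the conditions. */
    while (e != NULL && e->can_activate(e)) {
        e->replay(e);          /* the epoch becomes active (it may complete later) */
        *queue_head = e->next; /* it leaves the deferred queue                     */
        e = *queue_head;
    }
}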

5.7.3 Epoch Ids, Win traits and Epoch Matching

Accurate epoch pairing proved to be one of the most challenging aspects of the new design. With nonblocking synchronizations, processes are allowed to just issue an epoch as a whole and move on. Thus, epochs can pile up; and a certain peer can be far ahead of another one at the application level. In particular, a target process could have issued its 100th epoch before a given origin process, which is concerned by all these 100 exposure epochs, opens its first access epoch towards the same target; and vice versa. The problem is actually more complex than the linear gap between origins and targets in terms of epoch numbers.

For a more comprehensive statement of the problem, let us consider an RMA window W created over n processes. We designate by Pr the process of rank r in the group over which W is defined; and by Ti some set of processes all acting as targets and all belonging to the group over which W is created. The various sets of targets Ti are not necessarily disjoint. We consider GATS epochs as the basis of the explanation. Let us consider three processes P0, P1 and P2. P1 is defined to belong to T0, T1, T2, T3 and T5. P2 belongs to T4 and T5. P0 is an origin that opens six epochs successively towards T0 to T5, in order. For P0, knowing that the current access epoch is the 6th is not very useful for epoch matching because all targets do not appear in all target sets. Instead, one can notice that it is more important for P0 to know how many accesses it has issued so far towards each distinct target. In particular, the 6th access epoch of P0 is the 5th towards P1 and the 2nd towards P2. The same reasoning can be made from targets to origins to show that exposure counters must be known for each individual origin, not for an exposure epoch as a whole. One can further notice that, due to the nonblocking synchronizations, P2 could open its 2nd exposure epoch far ahead of P0 opening its overall 6th access epoch to match that second exposure of P2. Thus, there is a need to keep the history of all the granted accesses between any two peers so that the most recent granted access Id does not overwrite a previous one that is not yet consumed. In concrete terms, when a target grants access to an origin that is several epochs late, the granted access notification must persist for the origin to see it when it catches up.

The epoch Id matching calls for an unscalable design where a history of granted accesses from all possible targets must persist until each access notification is discovered and consumed. Notification queuing is reminiscent of message queuing; a situation that the design strives to avoid. We come up with an approach that keeps and manages the epoch notification history in O(1) for both running time and space cost. Each RMA window over n processes has n win_traits. win_traits are numbered from 0 to n−1; and the win_trait bearing the number r describes the equivalent instance of the current RMA window in the process of rank r. If r happens to be the rank of the local process in the group spanned by the window, then win_trait r describes the local RMA window that hosts it. For conciseness, let us designate win_trait r by WTr. win_traits do not exist solely for notification history management; they are required anyway for each RMA window in a given process to know a few pieces of information about the corresponding RMA window in the other processes encompassed by the window. For instance, WTr contains the start address of the remote memory exposed by process r. Among other things, WTr also allows the current process to know if it has locked the corresponding window in process r. In summary, WTr is the object that facilitates, for the local process, one-sided access to process r.

win_traits are created at window creation time. All the concerned processes exchange information via an internal Allgather call to update the win_traits of one another. An Allgather is a collective call that allows a group of processes to gather data from one another. One of the exchanged pieces of information is the virtual address of the remote RMA windows. After the exchange, WTr in the local process has the value of the pointer to the RMA window in process r. This address is not used in the local process; it is meant to be sent to process r in notification packets if required. This approach allows process r, when it receives a notification, to retrieve in a single operation the local RMA window the notification is meant for; avoiding the need for scanning queues of multiple RMA windows.

To solve the notification management issue, a win_trait WTr in the local process Pl has an ordinal counter Ce that contains the Id of the last granted access received from process r. That number, which is a single 64-bit unsigned integer, is sufficient to hold the whole history of granted accesses from Pr to the local process Pl. Ce is directly updated by Pr without regard to how many accesses Pl has already requested. If Pr is on a remote node, the update occurs via RDMA. If Pr is in the same node, it occurs via inter-process shared memory. When Pl requests the access number i towards Pr, if i ≤ Ce, then Pl knows that the access is granted. i > Ce means that Pr is yet to activate the exposure epoch that will grant the requested access. One can notice that this easy management of the notification history is favoured by the FIFO activation of epochs, as stated in Section 5.6.1. We emphasize that the FIFO activation specified in Section 5.6.1 was not meant to ease this aspect of the design; it was meant to avoid memory consistency hazards.
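The counter-based check can be summarized with the sketch below; the structure layout and field names are illustrative rather than the actual implementation.

#include <stdint.h>

/* Hypothetical sketch of a win_trait and of the O(1) granted-access test. */
typedef struct win_trait {
    void    *remote_win_ptr;     /* address of the peer's RMA window object        */
    void    *remote_base;        /* start of the memory exposed by the peer        */
    uint64_t granted_access_id;  /* Ce: Id of the last access granted by the peer,
                                    updated by the peer via RDMA or shared memory  */
} win_trait_t;

/* Origin side: has the i-th access towards this peer been granted yet? */
static int access_granted(const win_trait_t *wt, uint64_t i)
{
    return i <= wt->granted_access_id;  /* i <= Ce: granted; i > Ce: the matching
                                           exposure epoch is not activated yet     */
}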

Passive target epochs do not have any target-side exposure epoch. However, the same epoch-matching mechanism is still required internally. An RMA window that grants a lock to a process Pl updates Ce inside Pl as if it were granting an access from an exposure epoch. No internal exposure epoch object is created in this case though.

When an origin-side epoch completes, it sends a done packet to each concerned target. A done packet must match the right exposure epoch as well. This is done trivially by sending back the ordinal Id number used by the completing origin-side access. This number matches the ordinal Id of the target-side exposure that granted the access. For lock releases, a different kind of done packet is sent, so that the RMA window can update its lock state.

5.7.4 Notes on Lock and Fence Epochs

Explicit RMA lock requests and lock releases are sent to the process hosting the window to lock or unlock. While the whole passive target synchronization involves both peers internally, the lock granting does occur one-sidedly. This approach is uniformly used for MPI_WIN_LOCK, MPI_WIN_LOCK_ALL and their respective nonblocking equivalents.

We mentioned in Section 2.1.4 that a single fence call can close a previous fence epoch, if any, and immediately open a new one. Such a fence was termed a middle fence. If a middle fence is not intended, MPI defines MPI_MODE_NOPRECEDE as an assertion that can be passed to the fence call to hint that the call is not closing any previous epoch; it is only opening a new one. Similarly, MPI_MODE_NOSUCCEED is provided to mean that the current fence call closes the previous fence epoch but does not open a new one. The new progress engine never deals with middle fences directly. Instead, it internally breaks each middle fence into two separate fence calls; the first one bearing MPI_MODE_NOSUCCEED to only close the previous epoch; and the second one bearing MPI_MODE_NOPRECEDE to solely open a new fence epoch. We designate the closing and the newly opening epochs by F0 and F1 respectively for the sake of naming. A middle fence entails a barrier between the two epochs F0 and F1 that it separates. Our design bypasses any explicit barrier call. First of all, for any given participating process, the predicate that fulfills Rule 12 to Rule 19 in the design forbids the activation of F1 until F0 completes. The rules were stated in Section 5.6.2. For a given participating process, F0 completes and terminates normally when it has completed all its RMA towards all the involved peers and it has received a done packet from all the other peers. F0 can complete at different times on different participating processes. If F0 has already completed on process Pi while it is still active on a subset S of processes that does not include Pi, then all the processes in S are still not free to grant access to or to request access from Pi. Consequently, there is no hazard in activating F1 right away on Pi; and this can be done without any barrier. There is no risk for any RMA communication originating from F1 in Pi to reach any process of S, and there is no risk for any process of S trying to reach F1 in Pi.
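The internal decomposition of a middle fence can be sketched as follows; enqueue_fence_epoch() is a hypothetical helper standing for the recording of a fence epoch in the progress engine, while MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED are the standard fence assertions.

#include <mpi.h>

extern void enqueue_fence_epoch(MPI_Win win, int assert);

/* Hypothetical sketch: a middle fence is split into a closing-only fence (F0)
 * and an opening-only fence (F1); Rule 18 keeps F1 deferred until F0 completes
 * locally, so no explicit barrier is needed. */
static void handle_fence(int assert, MPI_Win win)
{
    if (!(assert & MPI_MODE_NOPRECEDE))                  /* a previous fence epoch exists  */
        enqueue_fence_epoch(win, MPI_MODE_NOSUCCEED);    /* F0: only closes it             */

    if (!(assert & MPI_MODE_NOSUCCEED))                  /* a new fence epoch is requested */
        enqueue_fence_epoch(win, MPI_MODE_NOPRECEDE);    /* F1: only opens it              */
}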

5.8 Experimental Evaluation

We show comparison tests between the one-sided communication model provided by MVAPICH2-1.9 and the new proof-of-concept implementation presented in Section 5.7. We term “MVAPICH”, “New” and “New nonblocking” the three test series whose results are respectively obtained with the vanilla MVAPICH RMA, the new design with blocking synchronizations and the new design with nonblocking synchronizations. The experimental setup is an 8-node cluster. Each node has two Nehalem 2.6 GHz Xeon CPUs with hyperthreading disabled, 36 GB of memory and a Mellanox ConnectX QDR InfiniBand HCA. The result of each microbenchmark test is the average over 100 iterations. For the application pattern and application tests, each result is the average obtained over 5 iterations.

5.8.1 Microbenchmark

Generic Latency and Communication/computation overlapping

As discussed in Section 5.4, the claim of this chapter is the nonblocking synchronizations and epochs, as well as the aggressive epoch progression favoured by the info object-based flags. However, we deem it important to first show that the realization of the aforementioned concepts does not come at the expense of some penalty or compromise on the performance of MPI one-sided communications in generic situations. Those generic situations occur when there is no scenario of inefficiency pattern and no situation where nonblocking synchronizations would trump what already exists in MPI-3.0 RMA.

The series of tests required to strictly compare two designs of MPI-3.0 RMA is large. As the research presented in this chapter is not about an alternative design of the existing MPI-3.0 RMA, we only present a few cases to show specific latency and communication/computation overlapping comparisons; and then we move on to concretely show the improvement brought by the nonblocking synchronizations and their accompanying new concepts.

In a series of tests meant for “generic latency” observations, we first compare GATS epochs (Figure 5.7), fence epochs (Figure 5.8) and lock epochs (Figure 5.9). Epochs based on MPI_WIN_LOCK represent the behaviours of both kinds of passive target (MPI_WIN_LOCK and MPI_WIN_LOCK_ALL) for the sake of limiting redundant figures. For each kind of synchronization, tests are provided for epochs containing MPI_PUT, MPI_GET and MPI_ACCUMULATE. Each test is performed between one origin and one target. Even for fence epochs, only one process acts as origin and issues RMA calls; the other one behaves as a target. Finally, in order to cross-analyze communication latency and communication/computation overlapping in the presence of RDMA, origins and targets are on remote nodes.

One can notice for both small and large messages that the latency behaviours are similar for all three test series. For all three synchronizations, one can also notice for target processes that MPI_ACCUMULATE is more expensive for messages of 16KB or more; and the observation uniformly applies to all three of the MVAPICH, New and New nonblocking test series. For instance, epochs containing both MPI_PUT and MPI_GET take about 100µs to complete for messages of 256KB. For messages of 1MB, the epochs complete in less than 350µs. For MPI_ACCUMULATE in GATS and fence settings, while the origin-side epochs still last about 100µs and less than 350µs for 256KB and 1MB respectively (Figure 5.7(e), Figure 5.8(e)), the target-side epochs last 500µs and 1800µs respectively for 256KB and 1MB (Figure 5.7(f), Figure 5.8(f)). These increased target-side accumulate latencies are mostly due to a more noticeable computation overhead when the operand is bigger. In both MVAPICH and the new design, the target always does the computation of MPI_ACCUMULATE, especially when it is on a remote node. While the target must complete the computation before completing its epoch, the origin moves on as soon as its side of the operands is transferred to the target; explaining why the origin has a much lower latency. To a much lesser extent, buffer negotiation can add to the large-payload MPI_ACCUMULATE latency as well. In fact, for small-payload accumulates, the origin eagerly sends its side of the operands. The eagerly sent operand is received in generic pre-posted target-side buffers meant for eager communications. However, as already mentioned for the Rendezvous protocol in Section 2.1.2, system buffer is a scarce resource and cannot be pre-allocated in large chunks. As a result, intermediate buffers for large accumulate operands must be allocated on demand. There is therefore a back-and-forth of control messages between the origin and the target for the target to create a temporary intermediate operand buffer whose remote access information is sent to the origin. This Rendezvous-like activity is the buffer negotiation.

Figure 5.7: Comparative overall GATS epoch latency of MVAPICH, blocking and nonblocking versions of the new design. Panels: (a) MPI_PUT, origin; (b) MPI_PUT, target; (c) MPI_GET, origin; (d) MPI_GET, target; (e) MPI_ACCUMULATE, origin; (f) MPI_ACCUMULATE, target. Each panel plots the overall epoch latency (µs) against the message size (4B to 1MB) for the MVAPICH, New and New nonblocking test series.

For passive target (Figure 5.9(c)), the increase in MPI_ACCUMULATE latency for large operands is noticed at origin-side. As per the MPI specification, the origin-side completion of passive target epochs cannot occur until the associated RMA communication is guaranteed to have

159 MVAPICH New New nonblocking

350 350 s) s) 300 300 250 250 200 200 150 150 100 100 50 50

0 0

4B

4B

64B

16B

64B

16B

4KB 1KB

4KB 1KB

( epoch latency Overall m

( epoch latency Overall m

1MB

1MB

256B

256B

16KB

64KB

64KB

16KB

256KB 256KB Message size Message size

(a) MPI PUT origin (b) MPI PUT, target

500 500

s) s) 400 400 300 300 200 200 100 100

0 0

4B 4B

64B

64B

16B 16B

1KB

4KB

4KB 1KB

( epoch latency Overall m

( epoch latency Overall m

1MB

1MB

256B

256B

64KB

16KB

64KB

16KB

256KB 256KB Message size Message size

(c) MPI GET, origin (d) MPI GET, target

400 2000

s) s) 300 1500

200 1000

100 500

0 0

4B 4B

64B

64B

16B 16B

4KB

1KB

4KB

1KB

( epoch latency Overall m

( epoch latency Overall m

1MB

1MB

256B

256B

64KB

16KB

64KB

16KB

256KB 256KB Message size Message size

(e) MPI ACCUMULATE, origin (f) MPI ACCUMULATE, target

Figure 5.8: Comparative overall fence epoch latency of MVAPICH, blocking and nonblocking versions of the new design

[Figure 5.9, panels (a) MPI_PUT origin, (b) MPI_GET origin, (c) MPI_ACCUMULATE origin: overall epoch latency (µs) vs. message size (4B to 1MB).]

Figure 5.9: Comparative overall lock epoch latency of MVAPICH, blocking and nonblocking versions of the new design

completed, not just at the origin side as is the case for the other synchronizations, but at the target side as well. The observance of that rule leads to the origin incurring the target-side latency of MPI_ACCUMULATE as well. In practice, for a passive target epoch E_i, neither the vanilla MVAPICH design nor our new design delays the origin until the operation actually

completes on the target side. This is an optimization meant to allow the origin to move on earlier. However, if the origin must open a subsequent epoch E_i+1, the activation of E_i+1 must

first check that the previous E_i has already completed on the target. Because the test is averaged over multiple iterations, the effect of the origin moving on hastily in a given epoch is paid for in the subsequent epoch, and the actual target latency ends up being felt in the origin-side measurements.

We assess communication/computation overlapping by using the same methods as in Section 3.3. An activity made of an RMA communication and a computation is hosted inside an epoch. The computation time c_p is chosen to be big enough to entirely cover the communication data transfer latency for all the chosen message sizes; c_p is set to 2000µs so as to account for the high latency of large-operand accumulates. Communication/computation overlapping occurs if, as per Equation 3.3, the overall latency d of the activity is independent of the message size and is about the length of the computation. In the absence of communication/computation overlapping (Equation 3.1), the overall latency of the activity increases with the message size. In Chapter 3, the fixed portion ε of the communication duration was created by two-sided calls. In this case, it is created by the synchronization calls, the equivalent of the RMA call with a zero-byte payload, and, in the case of nonblocking synchronizations, the wait call.
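As an illustration, the following is a minimal sketch of the origin-side measurement for a GATS epoch under this methodology; the compute() busy-wait helper, the window setup and the choice of target rank are assumptions made for the example.

```c
#include <mpi.h>

extern void compute(double seconds);   /* hypothetical busy-wait helper */

/* Minimal sketch of the origin-side overlap probe for a GATS epoch;
 * "win" and "target_group" are assumed to be set up elsewhere. */
static double overlap_probe(void *buf, int bytes, MPI_Win win,
                            MPI_Group target_group)
{
    const double cp = 2000e-6;              /* computation time, seconds  */
    double t0 = MPI_Wtime();

    MPI_Win_start(target_group, 0, win);    /* open the access epoch      */
    MPI_Put(buf, bytes, MPI_BYTE, /*target=*/0, /*disp=*/0,
            bytes, MPI_BYTE, win);          /* nonblocking RMA call       */
    compute(cp);                            /* host CPU kept busy for c_p */
    MPI_Win_complete(win);                  /* complete/close the epoch   */

    double d = MPI_Wtime() - t0;
    /* Overlap is achieved when d stays near c_p (plus the fixed portion ε
     * of the synchronization calls) for every message size; otherwise d
     * grows with the message size. */
    return d;
}
```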

For GATS (Figure 5.10) and fence (Figure 5.11), the computation of length c_p is inserted in both the origin-side and the target-side epochs, rendering the CPU unavailable on both sides. For MPI_PUT and MPI_GET in the cases of both GATS (Figure 5.10(a), Figure 5.10(b),

Figure 5.10(c), Figure 5.10(d)) and fence (Figure 5.11(a), Figure 5.11(b), Figure 5.11(c), Figure 5.11(d)), one can observe that the curves oscillate between 2000µs and 2020µs, that is, c_p plus some small increment that represents ε. Communication/computation overlapping is therefore achieved for MPI_PUT and MPI_GET in GATS and fence settings for all three test series.

For MPI_ACCUMULATE, communication/computation overlapping is not achieved from 16KB onwards (Figure 5.10(e), Figure 5.10(f), Figure 5.11(e), Figure 5.11(f)); this observation is the result of intermediate operand buffer negotiation. The back-and-forth of control messages required by the buffer negotiation requires CPU intervention for the detection of each incoming control message and the triggering of the transfer of each outgoing control message. The

conclusion for GATS and fence is that the new design achieves communication/computation overlapping just as well as the existing MPI-3.0 RMA in MVAPICH.

For passive target, while the same buffer negotiation issue prevents communication/computation overlapping for large-operand accumulates for all three test series (Figure 5.12(c)), one can notice that only the two flavours of the new design realize communication/computation overlapping for MPI_PUT and MPI_GET. MVAPICH-RMA simply does not achieve any communication/computation overlapping for passive target because it uses a lazy lock acquisition scheme, an issue for which a workaround was proposed for MPICH in [116]. In Figure 5.12(a) and Figure 5.12(b), no communication/computation overlapping is achieved even for message sizes below 16KB; the figures simply do not reveal the lack of overlapping of MVAPICH for 4KB and below because of the low communication latencies of these small message sizes. In lazy lock acquisition, the origin does not internally issue the lock request when the application calls

MPI_WIN_LOCK. As a result, any RMA call issued inside the epoch is always locally queued and deferred even if the lock is available on the target. Then, when the application issues

MPI_WIN_UNLOCK, the actual lock request occurs and the RMA communications are fulfilled.

The lazy lock acquisition internally degenerates the whole epoch into MPI_WIN_UNLOCK, voiding any chance of overlapping any computation hosted inside the epoch with any communication issued inside the same epoch.
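The affected user-level pattern looks like the following minimal sketch; the compute() busy-wait helper is an assumption. With lazy lock acquisition, both the lock request and the queued MPI_Put are only issued inside MPI_Win_unlock, so the computation placed inside the epoch cannot overlap the data transfer.

```c
#include <mpi.h>

extern void compute(double seconds);   /* hypothetical busy-wait helper */

/* Passive target epoch as written by the application; under lazy lock
 * acquisition the lock, the transfer and its completion all happen inside
 * MPI_Win_unlock, defeating the intended overlap. */
static void lock_epoch(void *buf, int bytes, int target_rank, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win); /* may be deferred */
    MPI_Put(buf, bytes, MPI_BYTE, target_rank, 0, bytes, MPI_BYTE, win);
    compute(1000e-6);                  /* intended overlap window             */
    MPI_Win_unlock(target_rank, win);  /* lock, transfer and completion here  */
}
```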

We mentioned for Figure 5.9(c) that origin processes incur the target-side latency of the MPI_ACCUMULATE computation because of the need for origins to wait for target-side completion in passive target epochs. However, in Figure 5.12(c), the overall epoch latencies of messages of size 16KB, 64KB, 256KB and 1MB do not seem to reach the values observed in Figure 5.10(f) and Figure 5.11(f). In fact, as previously discussed for Figure 5.9(c) in the "generic latency" tests, the target-side completion of epoch E_i is still checked by a subsequent epoch E_i+1. However, that extra target-side completion latency associated with epoch E_i is overlapped by the work inserted in E_i+1 in the case of MVAPICH. In the case of the new design, the extra latency of the target-side completion is overlapped with the work inserted inside E_i.

[Figure 5.10, panels (a)/(b) MPI_PUT origin/target, (c)/(d) MPI_GET origin/target, (e)/(f) MPI_ACCUMULATE origin/target: overall epoch latency (µs) vs. message size (4B to 1MB); legend: MVAPICH, New, New nonblocking.]

Figure 5.10: GATS comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design

[Figure 5.11, panels (a)/(b) MPI_PUT origin/target, (c)/(d) MPI_GET origin/target, (e)/(f) MPI_ACCUMULATE origin/target: overall epoch latency (µs) vs. message size (4B to 1MB); legend: MVAPICH, New, New nonblocking.]

Figure 5.11: Fence comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design

[Figure 5.12, panels (a) MPI_PUT origin, (b) MPI_GET origin, (c) MPI_ACCUMULATE origin: overall epoch latency (µs) vs. message size (4B to 1MB).]

Figure 5.12: Passive target comparative communication/computation overlapping of MVAPICH, blocking and nonblocking versions of the new design

The Inefficiency Patterns

Late Post: The test setting is made of an origin process, a target process and a third process meant for two-sided communications. The target process is 1000µs late in opening its exposure

epoch; leading to the Late Post situation. The origin first opens an access epoch towards the target, completes the epoch and then performs a subsequent activity. The subsequent activity is chosen to be a nonblocking two-sided communication towards the third process; however, it could have been anything else, including another epoch. The access epoch hosts a single

MPI_PUT just for the sake of doing some RMA communication. Both the MPI_PUT of the first activity and the two-sided communication of the subsequent activity ship 1B of data for the small message case and 1MB of data for the large payload case.

We show in Figure 5.13 how long the access epoch takes to complete, how long the subsequent activity takes to complete and how long the origin takes to complete both activities.

All three measurements share the same time origin, taken at 0.

The delay of the late post cannot be avoided by the origin-side epoch trying to access the late exposure. As shown in Figure 5.13, the access epoch length is at least 1000µs for all three test series. However, while the blocking synchronizations propagate that delay to the subsequent activity as well, the nonblocking synchronization does not allow it to go beyond the sole access epoch that it directly impacts. As proof, one can notice that the two-sided activity only takes about 1.74µs for the 1B case (Figure 5.13(a)); and 334.32µs for the 1MB case (Figure 5.13(b)) with nonblocking synchronizations. In comparison, it takes more than

1000µs (Figure 5.13(a)) and more than 1660µs (Figure 5.13(b)) respectively for the 1B and

1MB cases for the two test series that use blocking synchronizations. The large payload case

(Figure 5.13(b)) is particularly interesting because it shows that the two-sided activity is delayed by both the Late Post inefficiency and the data transfer latency of the access epoch when blocking synchronizations are used. In the nonblocking synchronization case, the activities get internally reordered in the progress engine; and the two-sided activity ends up incurring only the latency of its own data transfer. In fact, the subsequent activity gets overlapped with the delay of the Late Post inefficiency. Consequently, the cumulative latency of both activities in the case of nonblocking synchronizations ends up simply being the overall latency of the access epoch alone; as if the second activity occurred for free and without any latency at all. It has to be mentioned that if the second activity bears any data dependency with the first one, then the programmer has to ensure that it occurs after the epoch completes. In that case, the use

of blocking synchronizations becomes mandatory. Alternatively, the second activity can be set to start only after the exit of the wait call associated with the first activity if nonblocking synchronizations must be used.

It is possible to show with more than a single subsequent activity that blocking synchronizations could endlessly propagate a single Late Post inefficiency delay from one subsequent activity to the next throughout the entire lifetime of an HPC job. In comparison, nonblocking synchronizations offer a means of absorbing these delays from one subsequent activity to another until they disappear entirely.
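For illustration only, the following sketch shows how an origin might express the Late Post test with request-based nonblocking synchronizations; the MPIX_ names and signatures are hypothetical placeholders and do not correspond to the actual interface of the proposal.

```c
#include <mpi.h>

/* Hypothetical request-based variants of the GATS synchronizations; the
 * MPIX_ names and signatures are assumptions used only for illustration. */
extern int MPIX_Win_start_nb(MPI_Group g, int assert, MPI_Win w, MPI_Request *r);
extern int MPIX_Win_complete_nb(MPI_Win w, MPI_Request *r);

static void late_post_tolerant_origin(void *rma_buf, void *msg_buf, int bytes,
                                      int third_rank, MPI_Group target_grp,
                                      MPI_Win win, MPI_Comm comm)
{
    MPI_Request epoch_req, send_req;

    MPIX_Win_start_nb(target_grp, 0, win, &epoch_req);   /* does not block  */
    MPI_Put(rma_buf, bytes, MPI_BYTE, 0, 0, bytes, MPI_BYTE, win);
    MPIX_Win_complete_nb(win, &epoch_req);               /* epoch closed,   */
                                                         /* not yet complete*/
    /* Independent subsequent activity: it is not delayed by a Late Post. */
    MPI_Isend(msg_buf, bytes, MPI_BYTE, third_rank, 99, comm, &send_req);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);

    /* Any later work that depends on the epoch must be placed after this
     * wait instead. */
    MPI_Wait(&epoch_req, MPI_STATUS_IGNORE);
}
```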

[Figure 5.13, panels (a) small data (1B) and (b) large data (1MB): latency (µs) of the access epoch, the two-sided activity and the cumulative total; legend: MVAPICH, New, New nonblocking.]

Figure 5.13: Mitigating the Late Post inefficiency pattern: observing delay propagation in an origin process

Late Complete: The test setting is made of a single origin and a single target. The origin issues a single MPI_PUT and then, in order to attempt communication/computation overlapping, it works for 1000µs before the blocking call that completes the epoch. As a reminder, epoch completion and epoch closing both correspond to the closing synchronization in the blocking case.

In the nonblocking synchronization case, epoch completion is separate from epoch closing and is fulfilled by invoking one of the wait or test family of routines. This test is performed for multiple message sizes. Since even the largest payload, which is 1MB, takes less than 350µs to transfer, a work that lasts 1000µs delays the epoch beyond the actual length of the RMA transfer; leading to the Late Complete inefficiency. Figure 5.14 shows the impact of that delay on the target, which is the victim. One can notice that none of the delay is propagated to the target when nonblocking synchronizations are used. A payload of 4B takes 9.05µs while

a payload of 1MB takes 336.1µs, as if the origin performed no delaying work. In comparison, the two test series using blocking synchronizations transfer the totality of the delay to the target. In this case, due to communication/computation overlapping, the target must only wait for the 1000µs of origin-side computation time, which is the larger of the communication and computation latencies.

[Figure 5.14: overall epoch latency (µs) vs. message size (4B to 1MB); legend: MVAPICH, New, New nonblocking.]

Figure 5.14: Mitigating the Late Complete inefficiency pattern: observing delay propagation in a target process

Early Fence: The test setting is made of two processes and a fence epoch. One of the processes acts like the origin and issues the RMA communications while the other one acts like a pure target and does not issue any communication. The origin process issues a single

MPI_PUT of either 256KB or 1MB. The message sizes must all be large in this case because the

Early Fence inefficiency occurs when the target issues an epoch-closing fence while the RMA transfers are still in progress. As a result, the test requires the transfers to last.

Unlike the Late Post inefficiency, the Early Fence inefficiency is not the result of any delay.

In fact, an Early Fence inefficiency can occur even when all the involved processes are on time and no peer creates any non RMA-related delay. However, since the wait created by an early epoch-closing fence call corresponds to an idling CPU core, the Early Fence situation is still inefficient from an HPC point of view and should therefore be mitigated.

The early fence situation cannot be avoided by the immediate epoch that it impacts. Thus, the wait mitigation can be achieved by occupying the CPU with some subsequent activity.

The observation in this test can be similarly done from either the origin side or the target

side as they are both impacted by the Early Fence inefficiency in exactly the same way. We simply choose the target as the side where the transfer latency mitigation is to occur. Thus, the target does some CPU-bound work right after the epoch. We use a synthetic work of

1000µs. As shown in Figure 5.15, the overall latency of the fence and the subsequent activity is the sum of both activities' latencies for the blocking synchronization tests. However, with the nonblocking synchronization test, the wait of the early fence is completely mitigated by the subsequent activity, leading to the target incurring barely more than the 1000µs of the subsequent activity for both activities. One should notice that the Y axis in Figure 5.15 starts at 1000µs so as to emphasize the overall duration of both activities in the case of nonblocking synchronizations. We show in this test that nonblocking synchronizations allow any RMA transfer that is pending after an early fence call to be overlapped with CPU-bound activities performed after the epoch.

It is worth recalling that the same wait mitigation performed by the nonblocking synchronization cannot be achieved by inserting the work before the epoch-closing fence in an attempt to let even the blocking synchronizations achieve communication/computation overlapping.

In fact, inserting the work before the epoch-closing fence creates the risk of the Wait at Fence inefficiency; and is not a risk-free way of mitigating the Early Fence situation. Inserting the work after the fence does not bear any risk of leading to another inefficiency pattern.

[Figure 5.15: overall latency (µs) of the epoch and the subsequent work for 256KB and 1MB messages; the Y axis starts at 1000µs; legend: MVAPICH, New, New nonblocking.]

Figure 5.15: Mitigating the Early Fence inefficiency pattern: observing communication latency propagation in a target process

Wait at Fence: The Wait at Fence inefficiency pattern can be seen as the fence epoch equivalent of the Late Complete inefficiency pattern for GATS epochs. The test setting for

Wait at Fence is therefore the same one used for Figure 5.14. We resort to one origin and one target, with the origin working longer than required before completing its epoch. The work length is

1000µs. The opening and closing fences use MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED respectively. As expected, one can notice in Figure 5.16 that nonblocking synchronizations allow the target to avoid incurring any of the non RMA-related delay created by the origin-side work. The situation is different for the blocking synchronization tests, where the target must wait for the larger of the data transfer latency and the origin-side work duration.
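For reference, the origin-side fence epoch used in this test follows the pattern below; the compute() busy-wait helper is an assumption, while the fence asserts are the standard MPI-3.0 ones.

```c
#include <mpi.h>

extern void compute(double seconds);   /* hypothetical busy-wait helper */

/* Origin-side fence epoch of the Wait at Fence test: the opening fence
 * carries MPI_MODE_NOPRECEDE and the closing one MPI_MODE_NOSUCCEED. */
static void wait_at_fence_origin(void *buf, int bytes, int target, MPI_Win win)
{
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);            /* open the epoch   */
    MPI_Put(buf, bytes, MPI_BYTE, target, 0, bytes, MPI_BYTE, win);
    compute(1000e-6);       /* origin works longer than the transfer lasts */
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);            /* close the epoch  */
}
```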

[Figure 5.16: overall epoch latency (µs) vs. message size (4B to 1MB); legend: MVAPICH, New, New nonblocking.]

Figure 5.16: Mitigating the Wait at Fence inefficiency pattern: observing delay propagation in a target process

Late Unlock: This test requires two origin processes O_0 and O_1 and a single passive target.

Both origins request the same RMA lock from the target, but we ensure that O_0 sends its lock request first. Then O_0 tries to overlap 1000µs of work with the data transfer of its epoch. O_1 does not perform any work inside its epoch. Each origin issues a single 1B or 1MB MPI_PUT.

The results shown in Figure 5.17 present how long the first lock epoch, issued by O_0, took to complete and how long the second lock epoch, issued by O_1, took to complete. We want to observe the second lock epoch to see if it is impacted by the late unlock of O_0. One can first notice that MVAPICH does not suffer from the Late Unlock issue. In fact, thanks to the lazy lock acquisition, O_0 does not actually send the lock request in MPI_WIN_LOCK. All lock requests are actually issued internally in MPI_WIN_UNLOCK. Thus, when O_0 gets plunged

in the delaying work, the lock is still available for O_1 to get without wait. However, as previously mentioned for Figure 5.12, lazy lock acquisition comes at the cost of a complete absence of communication/computation overlapping. In Figure 5.17(b), the absence of communication/computation overlapping of MVAPICH is, once again, visible because the overall latency of the

first epoch is 1375.61µs. In comparison, the two new design test series keep the overall latency of the first lock epoch a bit above 1000µs thanks to communication/computation overlapping.

[Figure 5.17, panels (a) small data (1B) and (b) large data (1MB): overall latency (µs) of the first and second lock epochs; legend: MVAPICH, New, New nonblocking.]

Figure 5.17: Mitigating the Late Unlock inefficiency pattern: observing delay propagation to a subsequent lock requester

In the new design, because the lock is effectively granted to O_0 before it gets plunged into a computation, the blocking synchronization inflicts a Late Unlock inefficiency on the second epoch. In Figure 5.17(a), the second epoch lasts more than 1000µs to transfer just 1B of data.

In Figure 5.17(b) as well, the latency of the second epoch is far beyond 1000µs, still due to the delay of the first epoch being transferred to the second one.

Finally, one can notice that the nonblocking synchronization fixes the Late Unlock issue on top of allowing communication/computation overlapping. In both Figure 5.17(a) and Figure 5.17(b), the second epoch lasts well below 1000µs, meaning that the work delay in the first epoch is not propagated to the second one. One can nevertheless notice that the second epoch in the case of the nonblocking synchronization still lasts longer than in the case of MVAPICH.

This difference in latency is due to the second epoch waiting for the RMA communication of the first lock to complete. Thus, starting from time zero, the overall latency of the second epoch in the new nonblocking synchronization design is the sum of the duration of the first

epoch, without the delaying work, and the duration of the second epoch from the point where it actually gets a hold of the lock.

The Progress Engine Optimization Flags

All the tests in this subsection are based on nonblocking synchronizations only. The results compare how fast the progress engine deals with two adjacent epochs, both issued in nonblocking ways, with and without a specific optimization enabled.
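The flags are conveyed through MPI info objects attached to the window. The following sketch shows how such a flag could be enabled at window creation; the exact info key string is an assumption made for the example, only the flag concept comes from the proposal.

```c
#include <mpi.h>

/* Sketch of enabling one of the proposed progress-engine optimizations at
 * window creation; the info key string is a hypothetical lowercase form of
 * the flag name. */
static MPI_Win create_reordering_window(void *base, MPI_Aint size, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Win  win;

    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_win_access_after_access_reorder", "true");
    MPI_Win_create(base, size, /*disp_unit=*/1, info, comm, &win);
    MPI_Info_free(&info);
    return win;
}
```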

MPI WIN ACCESS AFTER ACCESS REORDER: MPI_WIN_ACCESS_AFTER_ACCESS_REORDER (A_A_A_R) is an origin-side optimization that concerns GATS and lock epochs. For the GATS test, we consider a single origin O and two targets T_0 and T_1. The origin opens an access epoch towards T_0 first and then towards T_1. In each epoch, a single MPI_PUT of 1B or 1MB is issued. The exposure of T_0 is 1000µs late, leading to a Late Post situation. The observations of interest are the latency of the epoch towards T_1 as well as the overall latency of both epochs at the origin. Without A_A_A_R enabled, the progress engine does not activate the second access epoch until the first one completes. The delay that is propagated to the second access epoch impacts the second target T_1 as well as the cumulative time required for the origin to complete both access epochs. The delay propagation is more visible with the 1MB payload (Figure 5.18(b)) where the second target epoch takes 1654.73µs to complete. Knowing that 1MB takes about

330µs, the second target effectively incurs the latency of its own RMA data transfer as well as the latency of the RMA transfer of the first target epoch and the 1000µs delay. The cumulative origin-side latency, which is 1663.69µs, is impacted by the 1000µs delay as well. Enabling

A_A_A_R allows the progress engine to activate the second origin-side access epoch and even complete it ahead of the first one. The delay is not propagated; and the second target only incurs its own epoch latency. The latency of the second epoch is overlapped with the delay of the first Late Post, leading to a smaller cumulative latency for the origin. This test scenario is similar to the one used for Late Post. In this case, we replace the subsequent activity by another epoch. Due to the default FIFO progression of epochs inside the progress engine, the

[Figure 5.18, panels (a) small data (1B) and (b) large data (1MB): overall epoch latency (µs) for the target T_0 epoch, the target T_1 epoch and the origin cumulative, with A_A_A_R off vs. on.]

Figure 5.18: Out-of-order GATS access epoch progression with MPI_WIN_ACCESS_AFTER_ACCESS_REORDER

subsequent activity must still wait inside the middleware in spite of the use of nonblocking synchronizations, unless A_A_A_R is enabled.

To test A_A_A_R with lock epochs, we consider two origins O_0, O_1 and two targets T_0, T_1.

O_0 gets a lock from T_0 and works 1000µs before releasing the lock. Right after O_0 gets the lock, O_1 requests the same lock from T_0. Then O_1 requests a subsequent lock from T_1. The observation of interest is the cumulative duration of both epochs of O_1. In each epoch, O_1 issues a single MPI_PUT of 1B or 1MB. Once again, the activation of A_A_A_R allows O_1 to complete its second epoch, the one towards T_1, out-of-order and not suffer the delay of its

first access, which must wait 1000µs to get a hold of the lock towards T_0 (Figure 5.19). As a consequence, the latency of the second epoch is not visible in the cumulative latency of both epochs because it overlaps the delay incurred by the first epoch. The Y axis in Figure 5.19 starts at 1000µs on purpose to emphasize the difference in latency with and without A_A_A_R activated. The difference in latency for the 1B cumulative results is not easily noticeable; but these results are nevertheless provided for the sake of completeness.

MPI WIN ACCESS AFTER EXPOSURE REORDER: The test setting of MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER (A_A_E_R) is made of three processes P_0, P_1 and P_2. P_0 is an origin and P_1 is a target. P_2 behaves as a target for P_0 and then as an origin for P_1, in that order. P_0 is 1000µs late. Each access epoch issues a single MPI_PUT of 1B or 1MB. As

[Figure 5.19: cumulative latency (µs) of both epochs of O_1 for small (1B) and large (1MB) epoch communications, with A_A_A_R off vs. on; the Y axis starts at 1000µs.]

Figure 5.19: Out-of-order lock epoch progression with MPI_WIN_ACCESS_AFTER_ACCESS_REORDER

[Figure 5.20, panels (a) small data (1B) and (b) large data (1MB): overall epoch latency (µs) per process, including P_2 (target first, then origin), with A_A_E_R off vs. on.]

Figure 5.20: Out-of-order GATS epoch progression with MPI_WIN_ACCESS_AFTER_EXPOSURE_REORDER

shown in Figure 5.20, by default, the delay of P_0 is transferred to both P_1 and P_2. However, when P_2 enables A_A_E_R, its second epoch, meant for P_1, is progressed and completed before the first epoch, leading to P_1 not incurring the delay created by P_0. As a consequence of its second epoch overlapping the delay of the first epoch, the cumulative latency of P_2 is lowered as well.

MPI WIN EXPOSURE AFTER EXPOSURE REORDER: The test setting of MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER (E_A_E_R) is made of two origins O_0, O_1 and a target. O_0 is 1000µs late. The first exposure of the target is meant for O_0 and the second for O_1. Each access epoch issues a single MPI_PUT of 1B or 1MB. Figure 5.21 shows that the delay

[Figure 5.21, panels (a) small data (1B) and (b) large data (1MB): overall epoch latency (µs) for origin O_0, origin O_1 and the target cumulative, with E_A_E_R off vs. on.]

Figure 5.21: Out-of-order GATS epoch progression with MPI_WIN_EXPOSURE_AFTER_EXPOSURE_REORDER

of O_0 is transferred to O_1 by default; and the cumulative latency experienced by the target is the sum of that delay and the latency of both its epochs towards O_0 and O_1. However, when the target enables E_A_E_R, the delay propagation is prevented. Thanks to the second epoch overlapping with the delay, the cumulative latency experienced by the target is only made of the overall latency of the first epoch.

MPI WIN EXPOSURE AFTER ACCESS REORDER: The test setting of MPI_WIN_EXPOSURE_AFTER_ACCESS_REORDER (E_A_A_R) is made of three processes P_0, P_1 and P_2. P_0 is a target and P_1 is an origin. P_2 behaves as an origin for P_0 and then as a target for P_1. P_0 is 1000µs late. Each access epoch issues a single MPI_PUT of 1B or 1MB. Once again, unlike the default case, the activation of E_A_A_R by P_2 prevents the propagation of the delay of P_0 to P_1 and allows the second epoch of P_2 to overlap the delay, leading to a lower cumulative latency for P_2 (Figure 5.22).

5.8.2 Unstructured Dynamic Application Pattern

We reproduce in this section the massively unstructured atomic communication pattern described in Section 5.4.2. The processes in the job host some distributed data that is the target of multiple atomic updates or transactions. In order to guarantee a large number of in-flight

[Figure 5.22, panels (a) small data (1B) and (b) large data (1MB): overall epoch latency (µs) per process, including P_2 (origin first, then target), with E_A_A_R off vs. on.]

Figure 5.22: Out-of-order GATS epoch progression with MPI_WIN_EXPOSURE_AFTER_ACCESS_REORDER

transactions for the test, the transaction size is chosen to be small so as to avoid the communications from being dominated by payload transfer latency. Each transaction updates 256B of data. Each result is averaged over 5 iterations. Four test series are considered, namely

MVAPICH, New, New nonblocking and New nonblocking + A_A_A_R. The last three test series all use the new design; and the last one enables MPI_WIN_ACCESS_AFTER_ACCESS_REORDER so as to allow the progress engine to aggressively progress and complete concurrent transactions meant for distinct targets. The tests are performed over 16, 32 and 64 processes (Figure 5.23).

Figure 5.23(a) shows the average transaction throughput per process, and Figure 5.23(b) shows the transaction throughput for the whole job. First of all, one can notice that all three test series based on the new design perform better than MVAPICH. However, the actual observation of interest concerns how the nonblocking synchronization test series compare with the blocking synchronization ones. For a fair comparison, we restrict the analysis to the New,

New nonblocking and New nonblocking + A_A_A_R test series only, as they are all based on the same design. For all the job sizes, New nonblocking performs better than New, and New nonblocking + A_A_A_R performs even better. In fact, New nonblocking + A_A_A_R yields

91,972.20 transactions/s more than New for jobs of 16 processes. This difference becomes

135,058.36 and 247,230.9 transactions/s for 32 and 64 processes respectively. In terms of percentages, these three numbers respectively represent improvements of 8.93%, 10.75% and

13.14%; a trend that denotes scalability.

[Figure 5.23, panels (a) average transaction throughput per process and (b) transaction throughput of the whole job, both in thousands of transactions/s, for job sizes of 16, 32 and 64 processes; legend: MVAPICH, New, New nonblocking, New nonblocking + A_A_A_R.]

Figure 5.23: Performance of a dynamic unstructured transactional communication pattern

5.8.3 Application Test

We finally test a one-sided implementation of the Lower-Upper Gauss-Seidel solver (LU) 1 and show results over 16, 32 and 64 processes. The algorithm is entirely based on one-sided communications. Unlike the test in Section 5.8.2, no progress engine optimization is enabled for these

1 The one-sided LU used in this chapter was developed by Xin Zhao of the University of Illinois Urbana-Champaign.

tests; the implementation is based on fence epochs. Each result is averaged over 5 executions.

We show for each test:

• How much time the CPUs spent waiting for communication completion (Figure 5.24). This measurement is part of the time that the CPUs spent inside the MPI middleware. For the blocking test series, this measurement is made of the time the job spent in the epoch-closing calls. For the nonblocking test series, the measurement is how long the job spent in MPI_WAIT calls.

• How much time the CPUs spent in total for communication activities (Figure 5.25). This measurement, which is a superset of the time spent waiting for communication, is the overall time spent by the CPUs inside the MPI middleware. For the blocking test series, this measurement is made of the time spent by the CPUs in all synchronization calls and all RMA communication calls. For the nonblocking test series, this measurement is made of the time spent by the CPUs in all synchronization calls, RMA communication calls and all MPI_WAIT calls.

• The overall CPU time spent solving LU (Figure 5.26). This measurement is made of both the time spent in communication and the time spent in computation. It is important to realize that because the LU algorithm strives to achieve communication/computation overlapping, this third measurement is less than the sum of the communication and computation times. For instance, if the communication time is 60% of the overall LU solving time, the computation time could nevertheless be substantially more than 40%.

We remind the reader that RMA calls are always nonblocking; and the epoch-opening fences are nonblocking in all test series as well because they resort to MPI_MODE_NOPRECEDE. The CPU time spent waiting for communication (Figure 5.24) and the overall communication time (Figure 5.25) are measured by PMPI-level instrumentation of the vanilla and new MVAPICH builds. PMPI

[68] is an add-on interface that allows the insertion of hooks into MPI calls. Similarly to the measurements of all the previous tests, the overall LU time (Figure 5.26) is measured from inside the application.
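As an illustration of the instrumentation, the following minimal PMPI wrapper accumulates the wall-clock time spent in MPI_Wait; the same wrapper pattern is applied to the synchronization and RMA calls measured for the blocking test series.

```c
#include <mpi.h>

/* Minimal PMPI sketch: intercept MPI_Wait to accumulate the wall-clock time
 * the CPUs spend waiting for communication completion. The wrapper simply
 * forwards to the real routine through the PMPI entry point. */
static double wait_time = 0.0;

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Wait(request, status);   /* forward to the real MPI_Wait */
    wait_time += MPI_Wtime() - t0;
    return rc;
}
```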

The test setups are as follows:

• Matrices of sizes 2048×2048 and 4096×4096 over 16 processes.

• Matrices of sizes 4096×4096 and 8192×8192 over 32 processes.

• Matrices of sizes 8192×8192 and 16384×16384 over 64 processes.

One can observe that the blocking version of the new design is better than MVAPICH for all cases; and for wait, communication and overall LU resolution times. The blocking new design being better than the vanilla MVAPICH confirms the optimized nature of the design choices made; and shows that, for the sake of realizing our nonblocking synchronization proposal, we made no compromise in what MPI-3.0 RMA already offers.

Furthermore, the comparison between the New and New nonblocking test series also shows that nonblocking synchronizations noticeably speed up wait, communication and overall LU execution times. The improvements come from the total avoidance of Early Fence and Wait at Fence inefficiencies in the New nonblocking test series. Additionally and incidentally, a higher potential for communication/computation overlapping is achieved with the nonblocking synchronizations because the RMA data transfer can be overlapped with the computation even beyond the nonblocking epoch-closing fences. LU is very computation-intensive; and the algorithm is iterative. In each iteration, all the peers open a fence epoch; and based on runtime conditions, a single peer does the data transfers of the iteration. In each iteration, all the peers compute as well; but the peer acting as origin computes before and after its

RMA communications. In the nonblocking test series, no computation taking place inside the fence epochs delays the exit of the epoch-closing fence of remote peers. Additionally, the computation that takes place after the RMA transfer of the peer acting as origin can start faster because it does not wait for the epoch to complete.

[Figure 5.24, panels (a) 16 processes, (b) 32 processes, (c) 64 processes: time (s) per matrix size; legend: MVAPICH, New, New nonblocking.]

Figure 5.24: Wall-clock CPU time spent in communication wait phases

[Figure 5.25, panels (a) 16 processes, (b) 32 processes, (c) 64 processes: time (s) per matrix size; legend: MVAPICH, New, New nonblocking.]

Figure 5.25: Overall wall-clock CPU time spent in communication activities

[Figure 5.26, panels (a) 16 processes, (b) 32 processes, (c) 64 processes: time (s) per matrix size; legend: MVAPICH, New, New nonblocking.]

Figure 5.26: Overall wall-clock CPU time spent solving LU

5.9 Summary

One-sided communications have gained importance in supercomputing for their low level of latency propagation and their consequent suitability for scalable communications. However, the one-sided communication model provided by MPI is still too constrained for the sake of correctness and memory consistency. The blocking nature of MPI one-sided communication synchronizations is not only the source of performance issues; because it generates and propagates latency among the processes, it even defeats one of the key positive characteristics of the communication model.

The latency issues of MPI-3.0 one-sided communications are documented and categorized in six inefficiency patterns, of which four are impossible to work around with the current MPI-3.0 specification. We introduce in this dissertation a new inefficiency pattern that was undocumented before. Then we propose entirely nonblocking synchronizations for MPI one-sided communications. We show that all four inefficiency patterns as well as the newly discovered one become solved issues with the proposed one-sided communication synchronizations.

The nonblocking epochs put forth by the proposal also allow new use cases of HPC communications to be more efficiently handled in MPI one-sided communications. In particular, we show that HPC jobs with unstructured and dynamic communication patterns can now be implemented in MPI with less contention thanks to nonblocking epochs. Since nonblocking synchronizations and epochs bring additional complexities, we spell out their semantics, the hazardous situations and the behaviours to expect from the progress engine. The semantics and correctness rules are an integral part of the proposal because the new design is meant to be presented to the MPI Forum for consideration in a future version of the MPI standard.

Since nonblocking epochs allow the data transfer of an epoch to occur after the epoch is closed, there is an increased potential for communication/computation overlapping. At larger and larger scales, the removal of this additional serialization, and of its propagation, can have a significant impact on the overall execution time growth pattern of HPC applications.

The proposal in this chapter has two important scalability-related impacts. First, it allows the

HPC application developer to express scalable behaviours. Second, it empowers the progress engine to realize scalable behaviours. The second point is not just the middleware-level realization of the first point, which is expressed at application level. The progress engine improvements that we propose are not limited to the passive fulfillment of nonblocking epochs; they go beyond immediate positive consequences such as increased opportunistic message progression.

Thanks to the optimizations enabled by the info object flags, the progress engine can make bold decisions and reschedule the data transfers of epochs to shrink the overall communication

184 time.

The nonblocking MPI one-sided synchronization proposal, along with the resulting nonblocking epochs put forth in this dissertation, is network technology-agnostic. While our proof-of-concept implementation is based on InfiniBand, the proposals are not tied to any

InfiniBand-specific feature.

In general, there are many more issues that could be associated with MPI; and they are not even necessarily communication-related. In the next chapter, we use MPI with persistent services, a type of program that exhibits regular HPC program requirements on top of being more demanding in terms of resources and failure handling. The persistent service resorts to one-sided communications to perform bulk data transfers. However, the important aspect of the next chapter is that the unusual and extreme conditions created by the persistent service allow us to investigate new important scalability aspects of MPI.

Chapter 6

MPI in Extreme-scale Persistent Services: Some Missing Scalability Features

HPC distributed services, such as storage systems, are hosted in servers that span several nodes. They interact with clients that connect and disconnect on demand; and they require network transports that offer high bandwidth and low latency. These services are typically written in user space and require user-space networking APIs. However, for performance reasons, contemporary HPC systems typically employ custom network hardware and software. In order to reduce porting efforts, distributed services benefit from using a portable network API. The most likely low-level networking API for general-purpose programming is the ubiquitous 30-year-old BSD socket API. While BSD sockets are supported on HPC networks, they are not typically used because of their lower bandwidth and higher latencies when compared with native networking libraries. Instead, the HPC community has seen a myriad of network technologies, many of which have been short-lived. With proprietary HPC system manufacturers in particular, there is no guarantee that an existing network API will be adopted by the next generation supercomputer. For example, the LAPI [5], DCMF [54] and PAMI [57] network APIs were all released by a single company for its Scalable POWERParallel [5] and Blue Gene [57] series

of supercomputers. Unfortunately, none of those network APIs is portable across all or even most systems of their common manufacturer.

However, all recent HPC systems do include an implementation of MPI as part of their software stack. Thus, MPI shields the HPC community from the aforementioned volatility in network technologies and from low-level details such as flow control and message queue management [119] (see Chapter 4). MPI has, in effect, become the BSD socket of HPC programming. Plus, since MPI is one of the primary ways of programming supercomputers, the bundled MPI implementation is typically well-tuned and routinely delivers optimal network performance [62].

Currently though, MPI is not typically used in components of the software stack that extend beyond a single application, such as distributed services. Plus, while it is acknowledged that resiliency will be a major concern for exascale systems [22], today's MPI applications still tolerate outright job abortion. Unlike application-type HPC programs, whose executions are relatively short-lived, distributed services are often persistent because they bear no concept of job completion. The persistence required of distributed services implies stricter fault-handling approaches because the consequences of service abortion usually extend beyond a single job, unlike in computing applications, where they are usually confined to one. As a consequence, distributed services inherently require from the communication substrate the forward-looking resiliency mechanisms that are anticipated for application-type HPC programs at exascale.

In this dissertation, we evaluate the use of MPI as a high-performance network portability layer for cross-application services. We proceed with an analysis of the challenges, the workarounds and the aspects of the design that are rendered easy, robust or difficult by the semantics of MPI. In particular, we analyze how the service design, which is resiliency-conscious and geared towards scalability with respect to the number of allowed concurrent clients, maps to the features offered by MPI. Similarly, we provide various analyses which show aspects of MPI that can be improved to better serve some use cases backed by the service design. We go beyond the sole resiliency concerns to reveal other MPI scalability weaknesses that application-type HPC programs do not bring up [120, 121]. The other uncommon use cases provided by the distributed service create conditions of internal MPI object depletion, resource exhaustion, communication cancellation

and non-communicating routine scalability needs. Our observations make the case for feature additions or improvements in MPI. We argue that these feature proposals are timely and are even relevant to mainstream MPI-based HPC applications on future machines.

6.1 Related Work

MPI has been used in a few previous works outside the usual HPC computational programs.

A study of the possibility of MPI adoption for the storage architecture of PVFS2, and for parallel persistent services in general, was presented in [60]. The study stated that MPI could be used for a broader range of parallel utilities such as system monitoring daemons. In particular,

MPI has been used for file staging and parallel shell design [18]. The I/O delegate proposal

[77] allows an MPI job to route its I/O requests through another MPI job linked to the target filesystem. It resorts to dynamic process management, but it is not a pure client-server design. The compelling aspects of MPI have attracted other data-oriented distributed services and runtimes as well. MapReduce, for instance, has been studied and layered on top of MPI [40, 62, 82]. Moreover, the PGAS languages have considered MPI for its wide adoption, performance, and richness of programming models [10]. Except for the PVFS2 study [60], none of the cited works are examples of unrelated MPI jobs linked by a client-server relation; and to the best of our knowledge, no implementation of PVFS2 over MPI exists yet.

6.2 Service Overview: The Distributed Storage

As a concrete example of distributed services, this section presents the highly-available distributed storage system used as discussion support in this chapter. The storage system, depicted in Figure 6.1, is made of I/O servers which export a unified view of the underlying storage to a set of clients. The clients are mostly compute nodes running application software.

Clients typically connect to the storage service at the start of a job, periodically issue I/O requests, and disconnect when the job ends. In our system, the storage service relies on replication to ensure that data remains available even when an I/O server fails. In addition, data is striped across multiple servers, increasing performance by providing parallel access to the

underlying storage devices. The client is oblivious to striping and replication; the I/O library on the client side only exposes traditional read/write semantics.

[Figure 6.1: clients connect through a switch to service nodes (e.g., I/O nodes) backed by physical storage; the service nodes as a whole constitute the service.]

Figure 6.1: Distributed storage service

All the I/O requests start with a Remote Procedure Call (RPC) initiated from a client to a single server which executes the request and then sends a response back to the client. Because of striping, a single, sufficiently large I/O access can span several I/O servers. However, to limit client-side complexity and simplify failure handling, a client always contacts a single server for any I/O request. That server manages striping and replication on behalf of the client and provides a single response for client-side I/O requests. The contacted server relays the request of the client to any secondary server involved, for purposes of striping or replicas, and aggregates the response of each server before responding back to the client. The single server contacted by a client is not fixed; it is decided by a placement policy running in the upper layers of the storage system. The placement policy can lead the same client to contact two distinct I/O servers for two distinct I/O requests even if they both target the same data.

For any given I/O request, we designate the contacted I/O server as primary server or server 0; all the other involved servers are referred to as secondary servers.

We distinguish between small and large I/O payloads. A small payload usually resides on a single physical storage device managed by a single server; in that case, its I/O lifetime includes only steps a and c of Figure 6.2(a). When a non-contiguous small payload has large enough gaps in its spanning address range to reside on several physical storage devices, it can involve several I/O

(from) the other concerned servers; as shown in stepb of Figure 6.2(a).

[Figure 6.2(a), small payload: step a, RPC_request from the client to server 0; step b, RPC_request/response between server 0 and the other involved servers, if required; step c, RPC_response back to the client. One-sided communications towards clients (put/get) are distinguished from RPC communications.]

[Figure 6.2(b), large payload, step labels as referenced in the text. Client: a register memory, b publish memory, c serialize mem_info, d RPC_request, q RPC_response, r wait, s unpublish memory, t unregister memory, free mem_info. Server 0: e RPC listener, f deserialize mem_info, g compute response span, h RPC_requests to secondary servers (if required), i1 start epoch, j1 one-sided transfers, k1 end epoch, o1 aggregate completions, p RPC_response, free mem_info. Secondary servers: i2 RPC listener, j2 deserialize mem_info, k2 start epoch, l2 one-sided transfers, m2 end epoch, n2 completion notification.]

Figure 6.2: I/O transfer protocol

For large messages however, the intermediate copies associated with the use of server 0 as a payload router can be noticeable. The protocol in Figure 6.2(b), meant for bulk data transfers, is therefore used to allow multiple servers to one-sidedly copy chunks of the I/O payload into or from the client's memory. The lifetime of a large-payload I/O starts with the client registering an I/O buffer in step a. A subset or the entirety of the registered buffer is then made available for remote access in step b; the buffer is said to be published. The distinction between buffer registration and publishing is a performance-conscious choice for network transports where a single registration cost can be amortized over multiple I/Os. As a consequence, if the buffer or any of its spanning address supersets was already registered, the lifetime of a large-payload

I/O could start at step b and end at step s. Step t undoes registration. Ideally, registration does not grant remote access. Remote access granting starts with publishing in step b. Then, in step c, the remote access information pointed to by the publishing handle is serialized into the RPC packet meant to initiate the I/O. The access information is not tailored for a specific remote server. As a consequence, server 0 can resend it to any other servers involved in the associated I/O, so that they can seamlessly access the client one-sidedly.

The I/O request sent by the client in step d is caught by the RPC listener of server 0 in step e. The server reproduces the memory access information in step f; and in step g, it determines all the other (secondary) servers which must be seamlessly involved in fulfilling the I/O. Server 0 involves the secondary servers by dispatching the new intermediate RPC requests (step h).

Then, in steps i1, j1 and k1, server 0 transfers its chunk of the I/O payload to or from the client.

In parallel, the secondary servers fulfill steps i2 to m2 to transfer their respective partitions of the I/O payload. Server 0 blocks in step o1 until it receives a completion acknowledgement from each of the involved secondary servers. Server 0 sends a final RPC response to the client in step p.

For certain network transports, the completion of step q at the client side could guarantee that the whole I/O payload has been transferred. However, since the one-sided payload transfers of the secondary servers and the RPC response which contains their completion acknowledgements have different destinations, they might complete out of order, leading to the final reply from server 0 to the client potentially completing before some payload transfers. For network transports where such an out-of-order completion might occur, a custom safe completion sub-protocol can be implemented by sending the right control data in the RPC response in steps p-q and by checking it in step r. Step r is blocking. After the memory region is unpublished in step s, it is expected to become inaccessible to remote servers.

The I/O protocol is network-agnostic. It is therefore generic enough for certain steps to be either redundant or irrelevant for certain network transports. Redundant or non-applicable steps exit immediately because they are implemented as noop routines. For instance, step r of Figure 6.2(b) is a noop in an MPI implementation of the protocol because the one-sided transfers are fulfilled in passive target epochs. Consequently, by the time they complete at the

server, they are guaranteed, as per the MPI specification, to have completed in the client's memory as well. The acknowledgement of each server is thus guaranteed to be issued only after its payload chunk has entirely reached the client.
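For illustration, the per-server one-sided transfer (steps j1/l2) can be sketched as follows; the window, client rank and displacement are assumed to come from the deserialized access information of the RPC request, and the completion notification format is a simplification.

```c
#include <mpi.h>

/* Sketch of a server pushing its payload chunk to the client in a passive
 * target epoch: once MPI_Win_unlock returns, the data is guaranteed to have
 * reached the client's memory, so the completion acknowledgement (step n2)
 * can be sent safely. */
static void server_push_chunk(const void *chunk, int bytes, int client_rank,
                              MPI_Aint disp, MPI_Win win,
                              int server0_rank, MPI_Comm comm)
{
    MPI_Win_lock(MPI_LOCK_SHARED, client_rank, 0, win);   /* passive target */
    MPI_Put(chunk, bytes, MPI_BYTE, client_rank, disp, bytes, MPI_BYTE, win);
    MPI_Win_unlock(client_rank, win);   /* completes at origin AND target   */

    int done = 1;                       /* simplified completion notification */
    MPI_Send(&done, 1, MPI_INT, server0_rank, /*tag=*/0, comm);
}
```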

The protocol presented in Figure 6.2(b) reduces the client-side protocol complexity since the client is not logically involved in the transfers. As the client is oblivious to all the secondary servers, the protocol simplifies the handling of client and server failures. It also defers flow control to the I/O servers and eases the burden of the high ratio of compute nodes to I/O nodes in current and future HPC systems. The work on Mercury [97], which is centered around RPC over the same protocol, provides in-depth comparisons with other existing RPC frameworks and protocols.

6.3 Transport Requirements

The set of network features required by the service are as follows.

Connection-disconnection: The network transport must be able to efficiently handle the random addition and removal of clients.

Autonomous data transfer: All network operations need to support autonomous data transfer. Host threads cannot be dedicated to message progression because they are a scarce resource in the server nodes.

Two-sided communications: RPC invocations are built on top of two-sided communication semantics. A tag mechanism is required to differentiate among messages between the same two peers. The tag space must be big enough to render message collisions unlikely by preventing any service-level message from reusing a tag currently in use.

One-sided communications: For the client to be effectively oblivious of striping, replication and the secondary servers in particular (Figure 6.2(b)), it has to be a passive target in a true one-sided communication scheme.

Scalability: The scalability of a distributed service is determined, among other things, by the scalability of its underlying network. In the particular case of this storage system, the network is expected to allow a large number of simultaneous clients; each potentially having

multiple concurrent I/O requests.

Failure mitigation: The failure of an isolated server node should not halt the service.

Replication actually makes every server hot-replaceable; and the network transport is expected to ease the replacement by keeping the healthy servers both alive and connected. Additionally, the service should not be brought down by a failed client; as many other clients are potentially connected to the servers at any given time. A failed client should also not impact other unrelated clients connected to the same servers. For the service to proactively deal with these resiliency concerns, some support is required from the network to detach failed nodes without global impact on the healthy server nodes. Additionally, when a client or server process stops responding, their targeting peers should be able to cancel any pending network operation. For clients, the cancelled operation can be retried to a different healthy or more available server.

For servers, resources can be freed and redirected towards other healthy clients.

6.4 MPI as a Network API

In this section, we provide a bit of implementation detail in order to clearly show why certain aspects of MPI are subject to the call for improvements that we introduce later in the chapter.

As MPI is an interface, it is important to show how any new API addition stemming from a recommendation can be used; but first, it is useful to show how any current routine that we seek to improve with a new proposal is used in the current API of MPI.

When running over MPI, the storage server is an MPI job created in MPI_THREAD_MULTIPLE mode. The MPI-based server can only interact with other MPI jobs as clients. As a consequence, even a client made of a single standalone process must be a single-process MPI job.

Each server node hosts exactly one process of the overall distributed storage job. In order to handle large numbers of simultaneous requests in a scalable manner, each I/O process has:

• A dedicated connection thread which listens on a connection channel and executes the procedure required for first contacts from soon-to-be clients.

• A dedicated listener thread which receives all the requests from already existing clients and transforms them into work items which are enqueued for a threadpool to handle. This thread fulfills the role of the RPC listener.

• A threadpool made of non-dedicated worker threads which handle the potentially heavy tasks from the work queue. Unless otherwise specified in a configuration file, the size of the threadpool on any node is exactly n − 2 if the node can run n hardware threads simultaneously. Oversubscribing is thus avoided.

The two dedicated threads do very little except for receiving connection and RPC requests; the goal being to be as reactive as possible to concurrent requests.
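The work item abstraction referred to throughout the rest of this chapter (work items, handlers, re-enqueuing) can be pictured with the following minimal sketch. The work_t layout, worker_main() and work_queue_pop() are hypothetical names of ours, not the actual implementation.

#include <stddef.h>

typedef struct work work_t;
struct work {
    void (*handler)(work_t *work);   /* protocol step currently attached to this item */
    void *args;                      /* step-specific arguments (connection object, requests, ...) */
};

work_t *work_queue_pop(void *pool_state);   /* assumed helper: blocks until an item is available */

/* Loop executed by every non-dedicated worker thread of the threadpool. */
void *worker_main(void *pool_state)
{
    for (;;) {
        work_t *work = work_queue_pop(pool_state);
        if (work == NULL)
            break;                   /* the pool is shutting down */
        work->handler(work);         /* a handler may re-enqueue the same item with a new handler */
    }
    return NULL;
}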

We present observations of certain limitations of MPI or MPI implementations in this section. The executions are done with OpenMPI-1.6.5, MPICH-3.0.4, and MVAPICH2-1.9.

MPICH is the only implementation which can run the fully-fledged server, as neither of the other two MPI distributions simultaneously provides a working implementation of MPI-3.0 one-sided communications, support for MPI_THREAD_MULTIPLE, and the client-server connection features of MPI.

However, we did observe the behaviour of OpenMPI and MVAPICH for isolated requirements of the design. Our first test system (C1) is a 4-node cluster with each node having two quad-core 2GHz AMD Opteron 2350 processors and 8GB of memory. The second system (C2) is a 310-node cluster with each node having two Nehalem 2.6 GHz Pentium Xeon CPUs with hyperthreading disabled, and 36 GB of memory. Both systems have Mellanox ConnectX QDR InfiniBand and Gb Ethernet. The following subsections describe how specific aspects of the client-server orchestration are fulfilled in MPI.

We emphasize that the tests performed in this chapter are not performance evaluations.

These tests are the basis for various observations made in the chapter. They are consequently not grouped in a dedicated single section, as is the case in the previous chapters; instead, they are performed in context, where required, to support the specific observations they are meant for.

6.4.1 Connection/Disconnection

At initialization time, the storage server opens a port with MPI_OPEN_PORT, and then the connection thread of each individual server process waits in MPI_COMM_ACCEPT. No server spawns any client and no client spawns any server. In fact, the service is expected to be already running by the time any client attempts connection; and of course, it would not make sense for the service to spawn its clients. In a multi-process MPI program, the job as a whole could create a single I/O connection, but any subset of its processes could create distinct I/O connections as well. When a set of processes connect as a single client group, they must also disconnect as a whole. The following MPI functions are used to fulfill the connection/disconnection procedures:

• MPI_COMM_DUP: The routine duplicates a communicator. The child communicator has the exact same characteristics as the original but its communications are isolated from the ones of its parent.

• MPI_INTERCOMM_MERGE: This routine converts an intercommunicator into an intracommunicator.

• MPI_COMM_CONNECT: This routine emulates the connection mechanism of BSD sockets. It allows an unrelated job acting as a client to connect to another MPI job acting as a server.

• MPI_COMM_ACCEPT: This routine is the server-side counterpart of MPI_COMM_CONNECT. It allows an MPI job to accept incoming connections from other MPI jobs.

• MPI_COMM_DISCONNECT: This routine undoes a connection created by either MPI_COMM_CONNECT or MPI_COMM_ACCEPT.

• MPI_WIN_CREATE_DYNAMIC: This routine creates an RMA dynamic window. Unlike regular RMA windows created with MPI_WIN_CREATE for instance, the RMA buffer of a dynamic window does not have to be defined once and for all at creation time. Any number of RMA buffers can be attached to and detached from a dynamic window during its lifetime.

• MPI_WIN_LOCK_ALL: This routine, which was briefly described before, allows a process to lock all the peers of an RMA window in a passive target epoch.

• MPI_WIN_UNLOCK_ALL: This routine undoes MPI_WIN_LOCK_ALL.

• MPI_WIN_FREE: This routine frees the RMA window object.

The client connection routine is shown in Listing 6.1. At the server side, the connection thread blocks on MPI_COMM_ACCEPT in a loop which breaks out only when the server enters termination mode. Every time MPI_COMM_ACCEPT returns, a work item with the handler shown in Listing 6.2 is queued to be executed by the threadpool.

1  void client_connect(MPI_Comm comm, client_connection_t **conn_obj)
2  {
3      read_port(OUT port);  // read server connection port
4      *conn_obj = allocate(...);
5      MPI_Comm_dup(comm, &(*conn_obj)->app_comm_dup);
6      MPI_Comm_connect(port, ..., (*conn_obj)->app_comm_dup, &(*conn_obj)->connect_comm);
7      MPI_Intercomm_merge((*conn_obj)->connect_comm, FALSE, &(*conn_obj)->io_comm);
8      MPI_Win_create_dynamic(..., (*conn_obj)->io_comm, &(*conn_obj)->win);
9      /* ... other irrelevant instructions ... */
10 }

Listing 6.1: Client-side connection code snippet

1  void server_connect_work_handler(work_t *work)
2  {
3      server_connection_t *conn_obj = allocate(...);
4      conn_obj->connect_comm = get_comm_from_work_item(work);
5      MPI_Intercomm_merge(conn_obj->connect_comm, TRUE, &conn_obj->io_comm);
6      MPI_Win_create_dynamic(..., conn_obj->io_comm, &conn_obj->win);
7      MPI_Win_lock_all(0, conn_obj->win);
8      enqueue_new_connection_obj(conn_obj);
9      notify_request_listener_for_new_client();
10     /* ... other irrelevant instructions ... */
11 }

Listing 6.2: Server-side connection code snippet
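For completeness, the accept loop run by the dedicated connection thread (described at the beginning of this subsection) could take the following shape. This is a sketch under assumptions of ours: publish_port(), server_in_termination_mode(), make_connect_work() and the use of MPI_COMM_SELF as the accepting communicator are hypothetical; the actual service accepts with a designated root over the servers' own communicator.

#include <mpi.h>

typedef struct work work_t;                        /* from the threadpool sketch above */
extern void    enqueue_work(work_t *work);
extern work_t *make_connect_work(MPI_Comm newcomm);
extern int     server_in_termination_mode(void);
extern void    publish_port(const char *port);     /* e.g., write the port name to a well-known location */

void *connection_thread_main(void *arg)
{
    char port[MPI_MAX_PORT_NAME];
    (void)arg;

    MPI_Open_port(MPI_INFO_NULL, port);
    publish_port(port);

    while (!server_in_termination_mode()) {
        MPI_Comm newcomm;
        /* Blocks until a client group calls MPI_Comm_connect with this port. */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &newcomm);
        enqueue_work(make_connect_work(newcomm));   /* handled by Listing 6.2 */
    }
    MPI_Close_port(port);
    return NULL;
}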

MPI_COMM_CONNECT and MPI_COMM_ACCEPT return an intercommunicator. However, an intracommunicator is required to create the one-sided communication window; hence the need to invoke MPI_INTERCOMM_MERGE to create the I/O communicator io_comm. After the client is connected, RPC and I/O occur over the io_comm communicator of the connection object.

1  void client_disconnect(client_connection_t *conn_obj)
2  {
3      clean_zombie_receives(conn_obj);
4      MPI_Barrier(conn_obj->app_comm_dup);
5      notify_server_for_disconnection();  // This is an RPC
6      MPI_Win_free(&conn_obj->win);
7      MPI_Comm_free(&conn_obj->io_comm);
8      MPI_Comm_disconnect(&conn_obj->connect_comm);
9      MPI_Comm_free(&conn_obj->app_comm_dup);
10     free(conn_obj);
11 }

Listing 6.3: Client-side disconnection code snippet

1  void server_disconnect_work_handler(work_t *work)
2  {
3      server_connection_t *conn_obj = get_conn_obj_from_work_item(work);
4      clean_zombie_receives(conn_obj);
5      MPI_Win_unlock_all(conn_obj->win);
6      MPI_Win_free(&conn_obj->win);
7      MPI_Comm_free(&conn_obj->io_comm);
8      MPI_Comm_disconnect(&conn_obj->connect_comm);
9      free(conn_obj);
10     free(work);
11 }

Listing 6.4: Server-side disconnection code snippet

The client-side disconnection executes the procedure shown in Listing 6.3. clean_zombie_receives (line 3) is described in Section 6.4.4. A barrier is required (line 4) to ensure that all the processes in the client group have entered their disconnection routine before the server is contacted. Since the server does not actively wait for clients to disconnect, it has to be notified over the RPC listening channel. A process in the client group sends the notification at line 5 of Listing 6.3. The rest of client_disconnect simply undoes what was created in client_connect. After client_disconnect exits, the client job becomes totally detached from the server.

At the server side, after it receives the disconnection request, the RPC listener removes the concerned connection object from the list of endpoints to listen on, and enqueues a threadpool work item with the handler shown in Listing 6.4. With the exception of clean_zombie_receives, server_disconnect_work_handler mostly undoes what each server process did to create the connection object in the first place.

6.4.2 The RPC Listener

In each server node, the RPC listener is a dedicated thread which listens to RPC requests coming from either the clients or the other servers. Since each connection object has its own I/O communicator (io_comm at line 7 of Listing 6.1), the RPC listener must post one MPI_IRECV per connection object. All the MPI_IRECV of the listener are posted with MPI_ANY_SOURCE and MPI_ANY_TAG so as to allow the server to accept unexpected requests from unknown sources. With all these pending receives, the listener waits in MPI_WAITSOME and gets activated only when at least one request has reached its hosting server node. MPI_WAITSOME is a blocking communication completion routine described in Section 2.1.5. The listener decodes the arrived requests returned by MPI_WAITSOME. Out of each received request, the listener makes a work item that is appended to the queue of the threadpool. Before going back to waiting in MPI_WAITSOME, the listener posts a new MPI_IRECV for each of the connection objects on which a request has just arrived, so that it can continue listening to the clients associated with each of those objects. However, this step is skipped for requests asking for disconnection.

At initialization time, the RPC listener listens on only one connection object, which is the channel where the server nodes talk to one another inside the service. Then, after any new connection is created, the connection thread sends a NOTIFY_SELF request to its own node (line 9 of Listing 6.2); allowing the listener to get out of MPI_WAITSOME and add the new connection object to its list of client groups to listen to.
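A condensed sketch of that listener loop is given below. The buffer size cap, the static bound on the number of connections and the helper functions (make_rpc_work(), is_disconnection_request(), enqueue_work()) are assumptions of ours; the handling of the NOTIFY_SELF channel and of dynamically added connection objects is omitted for brevity.

#include <mpi.h>

#define MAX_CONN      1024
#define RPC_MSG_SIZE  512      /* assumed RPC size cap, always below the Eager threshold */

typedef struct work work_t;
extern void    enqueue_work(work_t *work);
extern work_t *make_rpc_work(const char *buf, const MPI_Status *status);
extern int     is_disconnection_request(const char *buf);

void rpc_listener_loop(MPI_Comm io_comms[], int nconn)
{
    static char bufs[MAX_CONN][RPC_MSG_SIZE];   /* one receive buffer per connection object */
    MPI_Request reqs[MAX_CONN];
    int         indices[MAX_CONN];
    MPI_Status  statuses[MAX_CONN];

    for (int c = 0; c < nconn; c++)             /* one wildcard receive per connection object */
        MPI_Irecv(bufs[c], RPC_MSG_SIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  io_comms[c], &reqs[c]);

    for (;;) {
        int outcount;
        MPI_Waitsome(nconn, reqs, &outcount, indices, statuses);
        for (int k = 0; k < outcount; k++) {
            int c = indices[k];
            /* A real implementation would copy the request before re-posting the receive. */
            enqueue_work(make_rpc_work(bufs[c], &statuses[k]));
            if (!is_disconnection_request(bufs[c]))   /* keep listening on this object */
                MPI_Irecv(bufs[c], RPC_MSG_SIZE, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                          io_comms[c], &reqs[c]);
        }
    }
}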

6.4.3 I/O Life Cycles

RPC and MPI Two-sided Communications

RPCs are central to the I/O life cycle; so it helps to describe them in a bit more depth before going any further. RPCs are the main service-level primitive and are implemented on top of MPI two-sided communications. Each RPC is a round-trip communication made of a request and a response. For performance reasons, we ensure that the size of an RPC never exceeds the threshold of MPI Eager protocol message sizes. The RPC header bears the address of the sending endpoint; in this case, the rank of the process. A tag mechanism is used to distinguish between multiple simultaneously pending requests from the same client. The tag is a service-level concept but it trivially maps to MPI tags as used in two-sided communications. The service has to guarantee that requests never collide; as a consequence, the tag space for each connection object is expected to be big enough to be considered infinite during the lifetime of a typical client. The MPI standard defines the valid per-communicator tag space to be 0..MPI_TAG_UB, where MPI_TAG_UB is required to be at least 32,767. Our tests indicate values of 2^31 − 1, 2^30 − 1, and 2^31 − 1 for OpenMPI, MPICH and MVAPICH respectively, which make sufficiently big tag spaces. Communications between servers must use the RPC route as well in order for all control message communications to benefit from the resiliency policies implemented in the service. The resiliency policy is uniformly implemented at the RPC level, on top of any transport, such as MPI, that the service is running on.
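The actually available tag space can be checked at run time through the MPI_TAG_UB attribute. The sanity check below is a small sketch of ours, not part of the service code.

#include <mpi.h>
#include <stdio.h>

/* Returns nonzero if the communicator offers at least 'needed' distinct tags. */
int tag_space_is_sufficient(MPI_Comm comm, int needed)
{
    int *tag_ub, flag;

    MPI_Comm_get_attr(comm, MPI_TAG_UB, &tag_ub, &flag);   /* attribute value is a pointer to int */
    if (!flag)
        return 0;                                          /* not expected from a compliant MPI */
    printf("MPI_TAG_UB = %d\n", *tag_ub);
    return *tag_ub >= needed;
}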

Bulk Data Transfer and MPI One-sided Communications

The MPI implementation of the memory registration/deregistration steps of the bulk data transfer protocol (steps a and t in Figure 6.2(b)) is a noop. Only memory publishing/unpublishing (steps b and s) is implemented, via MPI_WIN_ATTACH and MPI_WIN_DETACH. At the server side, the one-sided payload transfers resort to passive target epochs and request-based one-sided communications (MPI_RPUT and MPI_RGET). The Start epoch step of the protocol does not map to any MPI equivalent, as an MPI epoch was already opened over the clients at connection time. As for the End epoch step, it maps to testing at MPI level the completion of the requests returned by MPI_RPUT and MPI_RGET. In the protocol, End epoch (steps k1 and m2 in Figure 6.2(b)) is supposed to be blocking. In the actual design, the worker threads cannot afford to block, even in portions of the protocol, such as steps k1, o1 and m2, where the protocol dictates a wait. In general, we achieve any blocking step of the protocol by reposting in the threadpool its associated work item with the same handler. The handler determines the step; thus, there is no advancement in the protocol as long as the handler of the next step is not attached to the work item. This approach allows a few worker threads to handle any number of I/O operations. As every request-based blocking completion routine of MPI (e.g. MPI_WAIT) has a nonblocking equivalent (e.g. MPI_TEST), designing a nonblocking handler for blocking steps was straightforward.
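The re-posting technique used for the End epoch step can be summarized by the following handler sketch; work_t, io_op_t and the helper names are hypothetical simplifications of the actual data structures.

#include <mpi.h>

typedef struct work work_t;               /* as in the threadpool sketch of Section 6.4 */
struct work { void (*handler)(work_t *); void *args; };
typedef struct { MPI_Request rma_req; } io_op_t;

extern void     enqueue_work(work_t *work);
extern io_op_t *get_io_op_from_work_item(work_t *work);
extern void     next_step_work_handler(work_t *work);

void end_epoch_work_handler(work_t *work)
{
    io_op_t *op = get_io_op_from_work_item(work);
    int done = 0;

    /* op->rma_req was returned by MPI_Rput()/MPI_Rget() in an earlier protocol step. */
    MPI_Test(&op->rma_req, &done, MPI_STATUS_IGNORE);
    if (!done) {                 /* transfer still in flight: re-post the same step */
        enqueue_work(work);
        return;
    }
    work->handler = next_step_work_handler;   /* advance to the next protocol step */
    enqueue_work(work);
}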

The design of the I/O protocol predates the availability of the MPI-3.0 specification. As a result, we had previous implementations based on MPI-2.2 RMA and on two-sided emulation of the one-sided communications [97]. MPI-3.0 substantially alleviates the challenges encountered with MPI-2.2.

In MPI-2.2, the constraint of a single epoch per one-sided window, the collective nature of window creation and the impossibility of completing a one-sided communication without closing an active epoch rendered the design very challenging. These constraints limited how many concurrent I/O operations the same client process could have pending in MPI-2.2, as each bulk data transfer required its own separate RMA window. Unfortunately, creating these windows on demand would lead to having collective calls in the middle of each large-payload I/O; defeating the purpose of hiding the layout of the storage from clients, on top of hindering performance. As a result, a certain number of one-sided windows was created at connection time, and then used by the server as flow-control credit for the maximum number of concurrent I/O operations to handle per client. With the new MPI-3.0 specification, request-based RMA routines allow any number of distinct I/O operations to complete in a nonblocking way in the same RMA epoch; voiding the need for multiple epochs or windows for concurrent I/O. Furthermore, the new MPI_WIN_LOCK_ALL allows the same RMA epoch to target all the processes in a client group. In summary, a single RMA window can now permit any number of concurrent I/O operations for a whole client group. Even though window creation is still collective in MPI-3.0, it is no longer a burden in the design.

Furthermore, memory management in the bulk data transfer protocol was an issue with MPI-2.2. We had to preallocate a large pool of memory on each client and resort to complex custom allocators to fulfil buffer registration and publishing at service level (steps a, b, s and t of Figure 6.2(b)). By resorting to MPI-3.0 one-sided dynamic windows, we avoid the fixed preallocated buffer. Memory publishing can now be performed on the fly with MPI_WIN_ATTACH.

At the time of doing this work, there was no known working and reasonably bug-free open-source MPI-3.0 one-sided implementation over RDMA-enabled or autonomously progressing networks such as InfiniBand. However, the performance results of this RPC-based I/O protocol over two-sided emulation of MPI one-sided routines are available for the Cray interconnect and InfiniBand in [97].

6.4.4 Cancellation

No RPC, and consequently no I/O, is allowed to block forever. As a result, all RPCs bear a timeout after which they must expire. As a reminder, a client-side expired RPC is retried, possibly to another server decided by the placement policy. A server does not retry expired RPCs; it relies on clients to re-initiate the whole operation which led to the server issuing the RPC in the first place.

MPI_CANCEL and Cancellation Outcomes

The RPC cancellation implementation is built over MPI's own cancellation mechanism, namely MPI_CANCEL. MPI_CANCEL has an atomic behaviour with respect to a two-sided communication: the communication is either entirely cancelled or entirely completed (page 72 of the MPI-3.0 specification [68]). From the sender side, the transmission finishes if it has started before the cancellation becomes effective. At the receiver side, the destination buffer either receives the whole data or is untouched. This behaviour is welcome in the service design because it considerably eases the management and analyses of post-cancellation consistency in the service.

The only important information a server holds about clients is the connection objects. This choice favours resiliency by making any server easily replaceable. It is not harmful to have temporary I/O objects in an inconsistent state inside server processes when an I/O is cancelled; servers discard or recycle those objects without danger upon cancellation at their own level.

In summary, the objects whose consistency matters in situations of cancellation are the initial and final payload buffers. In particular, we need to care about the client-side buffer and the storage buffer in each server. The intermediate buffer that server 0 uses for small I/Os (Figure 6.2(a)) is a disposable object; its state becomes irrelevant as soon as the I/O it is being used for is completed, successfully or via cancellation. Cancellation of any RPC which, like disconnection requests, does not involve any payload is always without consequence.

When an RPC is cancelled by its initiator, both the request, which maps to MPI_ISEND, and the response, which maps to MPI_IRECV, are cancelled. From the other end, cancellation, if initiated by the replier, maps to cancelling MPI_ISEND only. The RPC-level cancellations can lead to four immediate MPI-level outcomes (a code sketch of the initiator-side mapping follows the list):

1. Effective at request transmission: The cancellation occurs before the request is even sent by MPI. In this case, both the send and the receive are effectively cancelled at MPI level. If the request was meant to initiate an I/O, the I/O is never seen by any server. Any subsequent retry occurs as if the previously cancelled I/O never existed. This cancellation outcome can occur if the request takes longer than required to be sent. Since RPCs use Eager data transfers at MPI level, this scenario is rare. However, with certain MPI implementations such as MVAPICH, a flow control mechanism puts a cap on the number of unmatched messages that each process can send to a given peer. As a result, an Eager send from a given client process to a specific server can be delayed if the server already has an important backlog of receives to retrieve from the same client.

2. Effective at reply reception: The cancellation occurs after the request is sent but before the reply reaches the RPC initiator. In this case, only the receive is effectively cancelled. Since the request has already been sent, the reply arrives anyway. With no more receive to consume it, it sits at the receiver side inside the MPI middleware message queue. We name those messages zombie receives.

3. Effective at reply transmission: This scenario always occurs inside a server. If the server RPC was created as an intermediate one to fulfil an initial client request, then the client RPC ends up timing out as well and gets cancelled.

4. Ineffective: The cancellation is issued too late to be effective on any send or receive. The immediately initiating RPC completes successfully. If that RPC is an intermediate one, then the initial client RPC can end up completing as well.
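On the initiator side, this mapping could be sketched as follows. The rpc_t structure is a simplification of ours, and the blocking MPI_WAIT calls stand in for the MPI_TEST-based work-item handlers that the service actually uses.

#include <mpi.h>

typedef struct {
    MPI_Request send_req;   /* request message (MPI_Isend) */
    MPI_Request recv_req;   /* response message (MPI_Irecv) */
} rpc_t;

void rpc_cancel(rpc_t *rpc, int *send_cancelled, int *recv_cancelled)
{
    MPI_Status status;

    MPI_Cancel(&rpc->send_req);
    MPI_Cancel(&rpc->recv_req);

    /* MPI_Cancel is local and returns immediately; the requests must still be completed. */
    MPI_Wait(&rpc->send_req, &status);
    MPI_Test_cancelled(&status, send_cancelled);   /* not cancelled => the request actually went out */

    MPI_Wait(&rpc->recv_req, &status);
    MPI_Test_cancelled(&status, recv_cancelled);   /* not cancelled => the reply was consumed, not a zombie */
}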

Cancellation can then lead to two kinds of undesirable consequences, namely zombie receives and partially modified buffers. Persistent partially modified buffers are very infrequent. As long as a successful client-side retry ends up happening for an RPC which got cancelled, the final buffer meant to be written into, or at least an equivalent of that buffer, ends up getting exactly the expected data. For a large-payload read, the client-side buffer could be left partially modified by cancelled RPCs between the servers. Transfers between two peers are still subject to the all-or-nothing pattern, but all servers might not transfer their partitions of the payload if some of them perform any kind of effective cancellation. In this case however, the client buffer ends up being overwritten by the full payload upon the first successful retry of the initial requester. Then, the buffer content remains unchanged even if it is overwritten by portions of the same data by any pending cancelled I/O. For a write, a retry can involve replicas, in which case the same server-side buffer is not modified but its equivalent in a replica server is; leading to the last stored data being the right one.

Dealing with Zombie Receives

A zombie receive is essentially an MPI-level unexpected message queue item [119] which is abandoned forever. It could also be sitting inside the network device, waiting to be extracted by the MPI middleware. The tag management guarantees, for a given client, that any response to a cancelled RPC will no longer be matched inside MPI for approximately MPI_TAG_UB subsequent RPC requests, failed ones included. On top of the resource consumption issue, this situation can create the following confusion in the servers. Message matching in MPI identifies a communicator by its contextId (Chapter 4). When a communicator is destroyed in a server after the disconnection of a client group C_i, MPI can recycle and reuse its contextId for a new communicator created for a subsequent client group C_j. Then, a server can mistakenly consume zombie C_i requests as if they were sent by the new C_j client group. In order to avoid that situation, all the zombie receives must be matched and removed from inside the MPI middleware buffers before the destruction of their associated communicator in the disconnection routines.

For a connection object, it is challenging to safely distinguish zombie receives from legitimate ones as long as the client is still issuing RPCs. However, right after the disconnection request is received by the servers and the reply reaches the client, each end has the guarantee that no more MPI-level two-sided communication will arrive over the connection object about to be disconnected.

In a loop, each zombie receive is discovered via MPI_IPROBE with MPI_ANY_SOURCE and MPI_ANY_TAG and matched with a dummy MPI_RECV. MPI_IPROBE is a nonblocking routine that peeks into the MPI middleware to check for the arrival of a two-sided message without retrieving it. The cleanup stops when no more items are found by MPI_IPROBE. The cleanup has de facto local semantics: since all the cleaned-up receives were messages already sitting locally inside the MPI middleware, no network communication is involved. Figure 6.3 shows the latency of zombie receive cleanup for RPCs ranging from a small control message (8B) to fairly large buffers containing embedded I/O payload (up to 8KB). Each test is performed over 100 iterations on the system C2. As shown in Figure 6.3(a), Figure 6.3(b) and Figure 6.3(c), less than 2 milliseconds is required to remove up to 1024 zombie receives, even when there is a payload of close to 8KB. In comparison, the average rotational latency of the fastest mechanical hard drive in existence to date (15,000 RPM) [113] is 2 milliseconds for a single access. Assuming that 1) no peer is abruptly killed by any unforeseen fault and 2) no MPI one-sided communication is involved in servicing the RPCs of interest, these results show that there is an insignificant penalty to managing the only negative consequence of resorting to MPI_CANCEL for RPC resiliency.
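A possible shape for clean_zombie_receives() is sketched below; the actual routine takes the connection object rather than a bare communicator, and the buffer size bound is an assumption of ours.

#include <mpi.h>

#define ZOMBIE_BUF_SIZE 8192   /* assumed to be >= the largest Eager-sized RPC reply */

void clean_zombie_receives(MPI_Comm io_comm)
{
    char dummy[ZOMBIE_BUF_SIZE];
    int pending = 1;
    MPI_Status status;

    while (pending) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, io_comm, &pending, &status);
        if (pending)   /* match the zombie with the exact envelope reported by the probe */
            MPI_Recv(dummy, ZOMBIE_BUF_SIZE, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
                     io_comm, MPI_STATUS_IGNORE);
    }
}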

[Figure 6.3 plots the zombie receive cleanup time (in milliseconds) against the number of zombie receives to clean up (2 to 1024), for payloads of 8B, 1KB, 4KB and 8KB, on (a) MPICH, (b) MVAPICH and (c) OpenMPI.]

Figure 6.3: Zombie receive cleanup latency

Concluding Remarks on MPI Adequacy for Service-level Cancellation

Cancellation as provided by MPI for two-sided communications is adequate to fulfil the needs of the service. It is worth mentioning that MPI_CANCEL helps fulfill a key scalability trait of the service because it always exits immediately, potentially before the actual cancellation occurs. As a result, clients can, in a timely fashion, issue retries to other potentially more available servers and take advantage of replication whenever possible.

Unfortunately, the situation is different for one-sided communications. Once an I/O request times out and its associated RPC gets cancelled, the protocol (Figure 6.2(b)) requires the access to the client buffer to be revoked for the cancelled I/O. Service-level buffer access removal maps to MPI_WIN_DETACH, which cannot be safely invoked until any one-sided data transfer targeting the attached buffer is completed. Furthermore, one-sided communications cannot be cancelled at MPI level; leading to the impossibility of deallocating resources in a timely fashion. For any given peer, such a situation can become unmanageable when any of the involved remote peers becomes indefinitely unavailable due to a fault for instance. In conclusion, MPI provides none of the one-sided communication-related mechanisms required for implementing the resiliency policy of the service.

6.4.5 Client-server and Failure Behaviour

By default, MPI aborts a job when an error occurs inside the middleware. This behaviour corresponds to the MPI_ERRORS_ARE_FATAL mode. MPI also defines MPI_ERRORS_RETURN as another error mode where failing MPI routines return an error code to the application instead of aborting. The error mode is specified per communicator or per RMA window.

The storage server always runs with MPI_ERRORS_RETURN set for communicators and windows. We consider two kinds of failures: abort and crash. In the abort failure, the faulty process is still alive but behaves inappropriately. It must leave and rejoin the storage system after its problem is fixed. Thus, after the possible cleanup, the process terminates with MPI_ABORT(MPI_COMM_SELF). MPI_ABORT aborts all the processes of the communicator it is given as argument. The crash failure, simulated with a provoked segmentation fault, is a brutal death caused by a node shutdown, for instance. The tests in this subsection are run on the cluster C2, with a single process per node. The experiments described in this section test isolated features or their equivalents. For instance, because MPI-3.0 one-sided communication is not supported by all three MPI implementations, MPI_WIN_LOCK is used where the fully-fledged server uses MPI_WIN_LOCK_ALL.
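Switching a communicator and a window to non-fatal error reporting, and checking return codes afterwards, only takes a few calls; the wrapper function below and the barrier used as an example are ours.

#include <mpi.h>

void make_errors_nonfatal(MPI_Comm comm, MPI_Win win)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

    /* With MPI_ERRORS_RETURN, every MPI call returns a code that must be checked, e.g.: */
    int rc = MPI_Barrier(comm);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        /* ... mark the peer/connection as suspect instead of aborting ... */
    }
}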

Server Failures

Ideally, a server or an I/O node that fails should not bring down the service. We would like such a server to rejoin the storage system when fixed. The observations are as follows:

Abort failure: Our observations for the abort failure are presented in Table 6.1. OpenMPI aborts the job if any subset of MPI_COMM_WORLD is aborted. The job survives in MPICH and MVAPICH, and communications between the survivors (healthy_comm) proceed without issue. Eager sends to deceased peers complete in both MPICH and MVAPICH. All other communications trap the survivor in the progress engine in MVAPICH. In MPICH, we qualify the behaviour as Undefined because it varies. In most cases, the communication returns immediately with an error message; this is the ideal case. This behaviour is observed for two-sided, collective, and even one-sided communications, except for MPI_WIN_UNLOCK which gets blocked forever. In a few cases, receive calls get blocked, and so do send calls for Rendezvous-size messages.

Crash failure: In the case of crash failure simulations in a process, all three MPI implementations simply kill the job.

Table 6.1: Behaviours in case of isolated server abortion

                                            OpenMPI   MPICH       MVAPICH
No communication                            WJA       GCE         TF
Communication over healthy_comm:
  Any communication                         N/A       Success     Success
Communication over MPI_COMM_WORLD:
  2-sided; Eager, survivor is sender        N/A       SSS+GCE     SSS+TF
  2-sided; Rendezvous, survivor is sender   N/A       Undefined   TC
  2-sided; survivor is receiver             N/A       Undefined   TC
  1-sided                                   N/A       Undefined   TC
  Collectives                               N/A       Undefined   TC

WJA: Whole job abortion; GCE: Graceful continuation and exit; TF: Trapped in MPI_FINALIZE;
TC: Trapped in communication; SSS: Sender-side success

Client Failures

The experiments in this subsection have been done only with OpenMPI and MPICH because we have not been able to successfully use the port and connection-disconnection functions in MVAPICH. We run three servers on three nodes and a single client on a fourth node. As a reminder, each client shares an intercommunicator connect_comm and an intracommunicator io_comm with the storage system. We observe that both crash and abort failures in client jobs produce the same behaviours in the storage job. The results are presented in Table 6.2.

In both OpenMPI and MPICH, the storage job survives client job failures (crashes or abortions). No communication attempt with a deceased client brings down the storage job. The communication behaviours are similar for both the intercommunicator and the intracommunicator. In general, in OpenMPI, except for Eager sends, the storage processes get trapped in any explicitly or implicitly communicating routine. Except for Rendezvous sends and MPI_WIN_UNLOCK, where it gets trapped, MPICH returns an error message, and the execution continues gracefully. The immediate return and error messages allow the storage job to free resources in a timely fashion and to detect the failed client without custom additional mechanisms.

Table 6.2: Storage job behaviours after client job failure

                                          OpenMPI       MPICH
No communication                          TCD           GCE
Internal storage job communications:
  Any communication                       Success+TCD   Success+GCE
Communication over io_comm and connect_comm:
  2-sided; Eager, server is sender        SSS+TCD       SSS+GCE
  2-sided; Rendezvous, server is sender   TC            TC
  2-sided; server is receiver             TC            ER+GCE
  1-sided (only over io_comm)             TC            TWU
  Collectives                             TC            ER+GCE

GCE: Graceful continuation and exit; TCD: Trapped in MPI_COMM_DISCONNECT; TC: Trapped in
communication; TWU: Trapped in MPI_WIN_UNLOCK; SSS: Sender-side success; ER: Error return

6.4.6 Object Limits

Each client group leads to the creation of two explicit communicators in each server process.

Depending on the implementation, a third implicit communicator might be created for the one-sided communication window. Furthermore, in large systems, servers will have to maintain a very large number of handles to support nonblocking two-sided operations. The same is true for derived datatypes. Noncontiguous I/O operations require unique hindexed types for each I/O. In fact, an I/O transfer from a server has to resort to two separate derived datatypes because the noncontiguity layout at the source is different from that of the target. In summary, we estimate that a server process needs 3 communicators (including the implicit one-sided window one if required), 1 one-sided window, 3 derived datatypes and 2 pending nonblocking point-to-point operations to service a single instance of any kind of I/O to a client. We verify through emulation tests whether a server can service a million-process client job by creating 3,000,000 communicators, 1,000,000 one-sided windows, 2,000,000 posted nonblocking point-to-point operations and 3,000,000 hindexed derived datatypes. Each category of object is created in a separate test.

The results are presented in Table 6.3. To detect resource exhaustion limits, we run each test on both systems C1 and C2, each time with a single process per node, even for the client job. In addition to the observed limits, we report how each MPI implementation behaves after the limit is reached. For the point-to-point tests, the limit is determined by whichever of MPI_ISEND or MPI_IRECV fails first. In Table 6.3, we omit the cluster name in the result when the implementation behaves similarly on both. In the observations, "Resource exhaustion" might include situations related to pinning or mapped memory, for instance.
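The communicator portion of the emulation test essentially boils down to the sketch below; assuming MPI_COMM_DUP as the creation routine, and with function name and reporting of our own making. The window, datatype and point-to-point tests follow the same pattern with their respective creation or posting routines.

#include <mpi.h>
#include <stdio.h>

#define TARGET_COMMS 3000000

int count_creatable_communicators(void)
{
    static MPI_Comm comms[TARGET_COMMS];   /* kept alive on purpose: we want to hit the limit */
    int n;

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    for (n = 0; n < TARGET_COMMS; n++) {
        int rc = MPI_Comm_dup(MPI_COMM_WORLD, &comms[n]);
        if (rc != MPI_SUCCESS) {
            printf("communicator creation failed after %d objects (error %d)\n", n, rc);
            break;
        }
    }
    return n;   /* observed limit, or TARGET_COMMS if none was hit */
}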

All three implementations have a hard limit on the number of communicators (Table 6.3). The limit seems to be a design choice because it does not depend on the amount of memory available on the node. For one-sided windows, OpenMPI succeeds in creating all the required objects on C2 but hits a limit on C1. One can notice that OpenMPI and MVAPICH can create more one-sided windows than communicators. Derived datatypes and point-to-points seem to be limited only by available memory. OpenMPI tends to block forever at both by-design and resource-imposed limits. MVAPICH either continues gracefully or crashes after returning an error code. MPICH has successfully created all the derived datatypes and nonblocking communications required to service a million clients. However, in order to observe its behaviour in situations of resource exhaustion, we increased substantially the number of objects. We observe that it behaves similarly to MVAPICH when resources are exhausted.

Table 6.3: Object limits

                              OpenMPI                    MPICH         MVAPICH
Communicators                 65532+UB                   2045+ER+GCE   2018+ER+GCE
1-sided windows               65532+UB on C1; NL on C2   2045+ER+GCE   2042+ER+GCE
Derived datatypes             NL                         NL            RE+ER+GCE
Pending nonblocking 2-sided   RE+crash on C1; NL on C2   NL            RE+crash

NL: No limit; UB: Unlimited blocking; ER: Error return; RE: Resource exhaustion;
GCE: Graceful continuation and exit

Resource exhaustion-induced limits are not an issue per se; it is how the MPI implementation reacts to them that can make a difference, especially for the required persistence of the service.

6.5 Some Missing Scalability Features

In this section, we highlight a number of problem areas and formulate some recommendations for the MPI Forum and MPI implementors. While some of these recommendations follow from our desire to use MPI in a non-traditional setting, this does not preclude the usefulness of our recommendations for more mainstream MPI applications. Furthermore, we believe that as these applications evolve to support the fundamentally different environment presented by future exascale systems, some of the features described in this section will be required for all HPC software domains.

6.5.1 Fault Tolerance

Enforcing MPI_ERRORS_RETURN

For most contemporary HPC applications, the reasonable action in case of failures is to restart the job. However, when failed components can be reconstructed (for example from a replica), restarting is not always the best solution because other applications might depend on the same service. The issue of MPI jobs surviving the crash of a subset of MPI_COMM_WORLD has been previously studied [23]; recent similar proposals [9, 42] were also put forth during the standardization efforts of MPI-3.0. Unfortunately, none of these proposals made it into the standard. However, even without any change to the current specification, by simply providing a standard-compliant implementation of MPI_ERRORS_RETURN, MPI implementations can already enable various workarounds to keep jobs alive after an isolated crash. When those crashes occur, and mechanisms are put in place to detect deceased processes, new healthy subsets of the surviving processes can be built incrementally from MPI_COMM_SELF by using bottom-up approaches similar to the non-collective communicator creation described in [20].

While the optimal fault-tolerance strategy at extreme scale is still being debated, we urge the research community to provide some mechanism to continue MPI functionality in the presence of failures, given the already-large demand for this capability [9].

A Plea for a More Cancellation-friendly MPI

From crash-resilient jobs [9, 23, 42] to MPI rank replication on spare hardware [24], fault tolerance in MPI has gained more momentum recently. Checkpoint-recovery [43], which is the most widespread fault-tolerance approach in MPI, is even offered in many MPI implementations.

However, as stated in [34], the optimal fault-tolerance strategy at extreme scales is still not clear. In fact, even if MPI were to become fault-tolerant in the most idealistic way, programs and libraries built with MPI might still have their own additional requirements for resiliency. On top of any integrated fault-tolerance mechanism, the approach of exposing MPI features as resiliency primitives should be encouraged to allow MPI applications, libraries and frameworks to easily realize their own fault-tolerance needs. The second of the three fault management concepts mentioned in [9] aligns with this suggestion; that concept advocates a flexible API meant to build fault-tolerant models as external libraries.

Cancellation could be a very useful resiliency primitive. Most large-scale filesystems, such as PVFS2 [60] or Lustre [51], support request timeout. The same goes for most network technologies that could be used as transport for those systems. For instance, BSD sockets and InfiniBand support communications with timeout. With respect to MPI, the third of the three fault management concepts advocated in [9] recommends the absence of indefinite deadlock or blocking in case of failure. While timeout-enabled MPI communications would certainly find substantial adoption, cancellation could be even more useful. Cancellation is an even more fundamental primitive than timeout; and as such, it gives a bit more flexibility.

First, timeout can easily be built on top of cancellation. Second, cancellation allows a timed communication to be killed before any associated timeout expires. An example use case is the situation where two redundant resources are solicited for the same operation, and the first completed operation renders the second one useless. Cancellation is a crucial mechanism for reclaiming resources when faults or other exceptional circumstances arise.
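As an illustration of the first point, a timeout for a pending two-sided operation can already be layered on MPI_CANCEL with nothing more than the existing API. The helper below is a sketch of ours, with a busy-wait loop that a real implementation would temper.

#include <mpi.h>

/* Returns 1 if the pending receive completed (possibly after the deadline), 0 if it was cancelled. */
int recv_with_timeout(MPI_Request *req, double timeout_sec)
{
    double deadline = MPI_Wtime() + timeout_sec;
    int done = 0, cancelled = 0;
    MPI_Status status;

    while (!done && MPI_Wtime() < deadline)
        MPI_Test(req, &done, &status);       /* nonblocking completion check */

    if (!done) {
        MPI_Cancel(req);                     /* returns immediately; a retry can be issued right away */
        MPI_Wait(req, &status);              /* a cancelled request still has to be completed */
    }
    MPI_Test_cancelled(&status, &cancelled);
    return !cancelled;
}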

While MPI-3.0 extended the use of request objects (for example, for nonblocking collectives), it is still erroneous to issue MPI_CANCEL on any native nonblocking MPI function not associated with two-sided communications. Cancellation for network operations is widely known to be challenging. However, we showed in Section 6.4.4 that cancellation does not have to be perfect to be useful. MPI does offer means of cancelling custom functions built with generalized requests, but this feature is definitely less powerful compared with natively cancellable functions. In fact, generalized requests cannot transform non-cancellable existing MPI functionalities into cancellable ones. Efforts should be invested in making most, if not all, MPI routines cancellable.

6.5.2 Generalization of Nonblocking Operations

The current connection-disconnection handling at the server side is non-scalable (Listing 6.3, Listing 6.4). The problem is not the connection or RPC listener threads being the single entry points for those respective procedures; the issue is the blocking nature of the involved non-communicating MPI collective routines. Currently, if server S_i is connecting or disconnecting the set {c_0, c_1, ..., c_n} of connection objects, no server S_j can connect or disconnect a different set of connection objects without potentially introducing a hazard. More importantly, no server S_j can connect or disconnect the elements of the exact same set of connection objects in a different order without introducing a hazard. Thus, connections and disconnections are not only serialized, they must occur in the exact same order on all the servers. That order, in the case of connections, is imposed by the root process specified in MPI_COMM_ACCEPT. In the case of disconnections, we impose it by having the root of any client group contact the exact same server every time (line 5 of Listing 6.3). Imposing the exact same server as the root to contact for those procedures hurts our resiliency, load balancing and scalability efforts, which allow a client to retry any RPC with an alternative server node when one is not being responsive.

Currently, the root server is a single point of failure in the service because, unlike the other servers, it does not have any strictly equivalent replica.

To understand why we have to resort to serialization and ordering of the connections and disconnections, let us assume the absence of those constraints and analyze how the service as a whole behaves as a consequence. We use disconnection for the analysis. We define:

• n as the number of distinct clients simultaneously asking for disconnection.

• w_i as the maximum number of disconnection work items that could be simultaneously handled by the server i.

• t_i as the number of CPU cores on the server i. We avoid oversubscribing the servers; so the upcoming analysis assumes that each server can run more than two hardware threads simultaneously.

• S as the set of servers.

• D_i as the set of connection objects being simultaneously disconnected by the server node i.

With the blocking routines offered by the current MPI specification for the disconnection procedures, since two threads are dedicated to connection and RPC listening, we obviously have

∀i ∈ S, w_i = t_i − 2    (6.1)

The situation modelled by Equation 6.2 results in the complete deadlock of the service as a whole, along with all the clients which had already started their disconnection procedure.

|D_i ∩ D_j| = 0 simply means that servers i and j are waiting for each other in collective calls over sets of mutually exclusive service-level disconnections. In concrete terms, they all block in their respective first collective call encountered in the disconnection procedure (line 6 of Listing 6.4). Then (|D_i| = w_i) ∧ (|D_j| = w_j) means that all the threads of each of the servers are already occupied by the blocking server-level disconnection work items. There is no worker left for any of the two servers to, by chance, pick a work item that could transform |D_i ∩ D_j| = 0 into |D_i ∩ D_j| > 0.

∃{i, j} ⊂ S such that (|D_i| = w_i) ∧ (|D_j| = w_j) ∧ |D_i ∩ D_j| = 0    (6.2)

The aforementioned deadlock is guaranteed to be avoided only if the number of simultaneous disconnection requests is strictly less than the sum of the numbers of work items that any two servers can handle in parallel (Equation 6.3).

∃{i, j} ⊂ S such that (|D_i| = w_i) ∧ (|D_j| = w_j), n < w_i + w_j    (6.3)

Since any pair of servers could end up in that saturated state, the avoidance condition must hold for every pair of servers (Equation 6.4).

∀{i, j} ⊂ S, n < w_i + w_j    (6.4)

MPI_FINALIZE is the routine that is called after all MPI activities are ended in an MPI job.

Since w_i = t_i − 2 for every server (Equation 6.1), Equation 6.4 can be rewritten in terms of CPU core counts as Equation 6.5.

∀{i, j} ⊂ S, n < t_i + t_j − 4    (6.5)

One can notice that no amount of CPU cores per server node can solve the problem created by Equation 6.2. The opposite of Equation 6.5 is expressed by Equation 6.6, which is impossible to overcome. In fact, assuming that Equation 6.6 is already true, every time t_i or t_j is increased by 1, one just needs to increase n by 1 to maintain the deadlock condition.

∃{i, j} ⊂ S such that n ≥ t_i + t_j − 4    (6.6)

Assuming the existence of nonblocking versions of some of the functions involved in the disconnection procedures, consider how the hypothetical approach of Listing 6.5 voids all the aforementioned problems. The disconnection work could be executed by resorting to the nonblocking collective cleanup functions on lines 7, 8 and 9 of Listing 6.5. Then, on lines 10-12, a completion work item is enqueued with the nonblocking handler at lines 14-28 of Listing 6.5.

We emphasize that by the time the disconnection is initiated, the clients associated with the connection object must have already completed all their I/O, meaning that no one-sided communication is still in flight. As a result, MPI_WIN_UNLOCK_ALL will return immediately at line 6. clean_zombie_receives at line 5 will also return without waiting because it has local semantics. With the possibility of handling the disconnection in a nonblocking way, even a single thread can progress any number of disconnections to completion. In fact, Equation 6.1 is now replaced by Equation 6.7, leading to Equation 6.4 always being true for any number of clients.

∀i ∈ S, w_i = ∞    (6.7)

In general, blocking routines can be a hindrance to both performance and scalability. In some cases, concurrent requests are required in order to extract maximum hardware efficiency.

At the same time, not all large HPC systems support the creation of an unlimited number of threads; and on systems that do, thread resource consumption typically prohibits creating a large number of threads. Having a thousand blocking routines pending requires no less than a thousand threads. In comparison, a threadpool of just a very few threads can do the same job if nonblocking routines are available.

1  void server_idisconnect_work_handler(work_t *work)
2  {
3      MPI_Request reqs[3];
4      server_connection_t *conn_obj = get_conn_obj_from_work_item(work);
5      clean_zombie_receives(conn_obj);          // This has local semantics
6      MPI_Win_unlock_all(conn_obj->win);        // Will return immediately
7      MPI_Win_ifree(&conn_obj->win, &reqs[0]);                   /* nonblocking version of MPI_Win_free */
8      MPI_Comm_ifree(&conn_obj->io_comm, &reqs[1]);              /* nonblocking version of MPI_Comm_free */
9      MPI_Comm_idisconnect(&conn_obj->connect_comm, &reqs[2]);   /* nonblocking version of MPI_Comm_disconnect */
10     work->handler = server_idisconnect_completion_work_handler;
11     update_work_args_with_reqs(work, reqs);
12     enqueue_work(work);
13 }
14 void server_idisconnect_completion_work_handler(work_t *work)
15 {
16     MPI_Request reqs[3]; int flag; MPI_Status statuses[3];
17     fill_reqs_from_work_item(work, reqs);
18     MPI_Testall(3, reqs, &flag, statuses);
19     if (flag)
20     {
21         server_connection_t *conn_obj = get_conn_obj_from_work_item(work);
22         free(conn_obj);
23         free(work);
24         return;
25     }
26     update_work_args_with_reqs(work, reqs);
27     enqueue_work(work);
28 }

Listing 6.5: Server-side nonblocking disconnection code snippet

In fact, it is useful to re-emphasize that, as shown by the previous disconnection deadlock hazard analysis, a higher degree of parallelism is not an alternative to the provision of nonblocking routines, and vice versa. Both concepts offer overlapping benefits but are not interchangeable. Furthermore, any blocking operation can be made nonblocking as long as consistency constraints are respected. Plus, by keeping blocking versions of potentially problematic nonblocking operations, users always have a safe alternative API at hand to accomplish the same task in situations where consistency can be difficult to reason about. Similarly, a user can elect to use the blocking version of a functionality if a fast reaction is required.

Nonblocking operations seem to be a necessary condition for efficient cancellation as well.

If cancellation were possible at all with blocking operations, two threads would be required for it to be effective on any single routine: one for the blocking call, and a second one to issue the cancellation. The MPI_TEST family of routines, coupled with nonblocking versions of most routines, would open up multiple possibilities for algorithms, programming paradigms (e.g. event-driven coding), performance and scalability decisions in code running on top of MPI.

We recommend continuing the effort to extend the set of nonblocking functions in MPI.

6.5.3 Object Limits and Resource Exhaustion

HPC compute jobs rarely need thousands of communicators or RMA windows. The situation could be very different for persistent distributed services. Similarly, derived datatypes and nonblocking operations might exist in large quantities in large compute jobs, but their numbers are usually reasonable in each process of the job. As shown by our design, any single process in persistent services can manage very large quantities of these objects at large scales. As a consequence, by-design limits should be raised substantially to allow a broader set of use cases and to get MPI implementations ready for extreme-scale uses. Even at extreme scales, most programs will be naturally bound in many respects by architecture limits, such as word or pointer sizes in C. Thus, more natural by-design limits could be INT_MAX, UINT_MAX or LONG_MAX, ULONG_MAX for instance.

Furthermore, resource exhaustion is one of the most trivial expectations in very large programs. As a result, it is planned for and is therefore rarely an unmanageable issue. In fact, our storage service has a robust flow-control policy that makes clients wait and retry in situations of resource exhaustion. Such a policy, which delays selected clients without disrupting the service for everybody, is much better than a sudden crash which renders the service totally unavailable. Ideally, MPI should provide upper layers with non-fatal methods of discovering resource exhaustion, so that these upper layers can decide on and trigger workarounds if possible.

6.6 Summary

Portability and high performance are two attractive traits of the Message Passing Interface. As MPI runs on top of the network fabric of its hosts, the aforementioned traits allow programmers to uniformly and efficiently target a very disparate set of supercomputers, each with its own architecture and network API. With the same concerns of portability and performance, we implement in MPI the network layer of a resilient, scalability-conscious distributed HPC storage system. We notice that a few features introduced in the recent MPI-3.0 specification substantially facilitate the mapping of the service semantics to MPI. Noteworthy are the request-based one-sided communication routines and dynamic windows. These features actually help in fulfilling key scalability design strategies. However, in certain areas, workarounds and design concessions were needed. The service exhibits use cases where nonblocking versions of non-communicating MPI routines would make a sound difference for performance, scalability and even safety. Another scalability issue created by the adoption of MPI resides in the runtime limit imposed on certain objects such as communicators. Furthermore, unlike regular MPI jobs, services are persistent and less tolerant of unplanned termination. As a consequence, the fault handling of MPI is also a major issue, about which we provide analyses and recommendations. Finally, we provide an analysis of the suitability of MPI_CANCEL for the service design and discuss how it can be generalized to allow custom application- or service-level resiliency design.

Chapter 7

Conclusion and Future Work

Communication is a central aspect of achieving high performance in supercomputing. In fact, a fraction of the data manipulated by the CPUs of supercomputers must be accessed via communications to remote nodes. How much wait occurs in an HPC application due to data dependency and data access is thus determined by the performance of the communication subsystem; a mandate that essentially falls on MPI in contemporary supercomputers. In this dissertation, we propose novel features or solutions to important aspects of MPI in order to avoid long CPU involvement in communication activities to the detriment of computation, and to avoid unintended resource idling. The proposed solutions, concepts or redesigns are scalability-driven in the sense that an emphasis is put on the avoidance of latency propagation and its cumulative negative effects that grow with scale. With the same scalability concern in mind, an emphasis is also put on the memory consumption growth pattern. Optimized and scalable communications, as discussed in Chapter 3, Chapter 4 and Chapter 5, allow MPI jobs to reach larger and larger scales. However, once those scales are reached, other issues such as resource exhaustion and failure management must be provisioned for. Chapter 6 covers those issues by analyzing the ability to run HPC applications to completion in situations of high demand and partial failure. The analysis performed in Chapter 6 leads to resiliency and resource management proposals.

7.1 Summary of Findings

It can be shown that a dedicated thread or CPU core can efficiently overcome all the independent progress issues for large message transfers performed with the nonblocking Rendezvous protocol used in MPI middleware. However, CPU core dedication for an entity that is not a worker process is usually both a luxury and a counterintuitive move for the end-user, who invariably correlates higher performance with maximum parallelism. In this situation where CPU core dedication is not an option, the other viable alternative is to resort to the interrupt-based threading mechanism to get around the need for tight polling. However, the interrupt-threading approach deals an undue penalty to the applications, especially for blocking calls, as well as for nonblocking calls which always progress messages without additional help. The undue degradation caused by the interrupt-thread approach increases when the scale grows. Its memory behaviour also presents a scalability issue, as its footprint grows quickly with the number of MPI processes hosted in each node. In order to avoid those issues, we propose a new approach to Rendezvous message progression in Chapter 3. In the presence of RDMA, we notice that efficient Rendezvous-backed message progression only requires small amounts of CPU cycles to fulfill the data transfer. Thus, we create a Parasite Execution Flow (PEF) that we sneak into an existing application thread to periodically steal those small amounts of CPU cycles for the purpose of checking Rendezvous control message arrivals and triggering data transfers. On top of being as efficient as the interrupt-thread method, the PEF approach has an insignificant memory footprint and does not present the aforementioned scalability issues. Furthermore, PEF is scenario-conscious and operates only when required. As a result, it does not inflict any unjustified overhead on the application.

As shown with the Rendezvous protocol, nonblocking communication fulfillment can suffer from a lack of autonomous message progression even in the presence of an RDMA-enabled network device. Nevertheless, compared to blocking communications, nonblocking communications are such an important performance-improving strategy that the HPC community strives to generalize their use. However, those nonblocking communications tend to create and increase backlogs of messages. As a result, MPI middleware must deal with intense message queue processing. As job sizes increase, message queues grow as well and can become slow to process, slowing in turn the communications that they support. We presented in Chapter 4 of this dissertation a new message queue tailored for the scalability needs of MPI. Adapted to the specific operations encountered in MPI message queue processing, the new 4D container that we put forth proved faster and far more memory-efficient than more generic search-directed data structures like binary trees. The two message queue architectures with the lowest memory footprint and the lowest CPU consumption have, respectively, a linked list-based and an array-based design. Assessment against those two reference architectures shows that the 4D design is not only faster at scale than the linked list, but it can even be more memory-efficient. On very large systems, the array-based design can become impractical for consuming a large percentage of the physical memory available for each process. As a consequence, unlike the 4D approach proposed in this dissertation, the array-based queue does not allow MPI to realistically run jobs on supercomputers at petascale and beyond. The 4D message queue fulfills a few goals. It ensures that any linear traversal occurring during the searches is kept short. As a consequence, larger queue searches do not suffer noticeable degradation. Then, by narrowing down the search region along each dimension, the 4D message queue can determine and skip altogether very large queue portions which are guaranteed not to contain the sought item. Another major search optimization brought by the 4D message queue is the early detection of unfruitful searches in order to prevent useless traversals. With respect to memory behaviour, the 4D message queue allocates small structural objects whose memory footprints are amortized when the queue size grows. We built the message queue over strictly increasing intervals of communicator sizes. In a given interval, the memory consumption growth rate decreases when the communicator size grows. From a smaller interval to a larger one, the aforementioned growth rate flattens even further, leading to an overall scalable pattern of memory consumption.

The large body of existing MPI-based HPC programs uses two-sided and collective communications; as such, those programs are prone to generating substantial numbers of message queue items. Because they bear no concept of reception, one-sided communications intrinsically do not queue messages even though they are nonblocking. One-sided communications are further shielded against the inter-CPU interactions inherent to communications such as those fulfilled with the Rendezvous protocol. In particular, one-sided communications embody an HPC data transfer paradigm that depends less on the availability of remote CPUs; making them suitable for latency propagation avoidance. However, the MPI one-sided communication model suffers from the blocking nature of its unavoidable synchronization routines. By proposing an entirely nonblocking approach to these synchronizations, we effectively realize the latency propagation reduction that is expected in theory from one-sided communications. The epochs enclosed by the synchronizations are no longer serialized. The MPI progress engine is empowered to concurrently schedule groups of one-sided communications belonging to the same window. The resulting out-of-order communication completion shrinks the overall duration of sets of one-sided communications; leading to an overall shorter execution for RMA-based HPC jobs. The improvements that we brought with the nonblocking synchronizations also fix for the first time a set of inefficiency patterns that had been documented for MPI one-sided communications since MPI-2.0. Additionally, we discovered and documented a new inefficiency pattern during this work; and show that its fix resides in nonblocking synchronizations as well. The RMA proposal in Chapter 5 does not just resolve performance and scalability issues; it also makes it possible to express in MPI new dynamic unstructured communication patterns that are inefficient in the current MPI-3.0 specification.
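To illustrate the flavour of the proposal, the sketch below contrasts a blocking fence epoch with a hypothetical nonblocking closing synchronization. MPIX_Win_ifence is an illustrative name invented here; it is not part of the MPI standard, and the interface advocated in Chapter 5 may differ in its details.

    #include <mpi.h>

    /* Illustrative only: a hypothetical nonblocking counterpart of MPI_Win_fence. */
    int MPIX_Win_ifence(int assert, MPI_Win win, MPI_Request *req);

    void epoch_example(MPI_Win win, double *buf, int count, int peer)
    {
        MPI_Request close_req;

        MPI_Win_fence(0, win);                     /* open the epoch            */
        MPI_Put(buf, count, MPI_DOUBLE, peer, 0,
                count, MPI_DOUBLE, win);           /* nonblocking data transfer */

        /* Hypothetical nonblocking close: the epoch-ending synchronization no
         * longer serializes the caller, so the transfer can be progressed while
         * unrelated work proceeds. */
        MPIX_Win_ifence(0, win, &close_req);

        /* ... computation overlapped with the closing epoch ... */

        MPI_Wait(&close_req, MPI_STATUS_IGNORE);   /* epoch closure completes   */
    }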

Finally, the evolution of MPI to adapt to changing HPC realities is partially fuelled by real-life use cases which reveal and back the need for supercomputing features required to achieve higher performance or reach higher scales. Thus, one of the major contributions of Chapter 6 is to uncover forward-looking MPI requirements that mainstream HPC programs cannot yet reveal. We used MPI as the communication substrate of an extreme-scale distributed storage system. The service differs from regular HPC programs in a number of ways. First, it does not complete; it runs continuously. Second, the service along with its collection of clients makes up a heterogeneous mix of unrelated and loosely coupled applications. As a result, our experience in Chapter 6 uses MPI in a cross-application setting where some jobs are created and terminated dynamically while another program, the service, is persistent. In comparison, regular HPC jobs are typically standalone applications. Furthermore, the processes of the service are expected to be hot-replaceable; meaning that MPI is expected to keep the storage job alive when isolated nodes become faulty. In comparison, typical HPC applications do not dynamically adapt to a variable number of computing nodes for the purpose of resiliency. With respect to scalability, a large HPC job can create, in aggregate, large numbers of communicators or RMA windows; but these objects exist in moderate numbers in each process of the job. In the storage service, each node can potentially interact with any number of clients. Thus, large numbers of the aforementioned objects can exist in any single process of the service; straining MPI to extents usually not reached by regular HPC applications. The unusual conditions created by the storage service led to important observations about how MPI implementations behave in situations of failure, resource exhaustion and massive per-process communication.

All those situations represent challenges that MPI has to face in its evolution toward exascale supercomputers. As solution proposals, we show that communication cancellation can be an important primitive for custom resiliency policies in software layers running on top of MPI.
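A minimal sketch of such a policy is shown below: an upper layer posts a receive and cancels it when the peer is presumed failed, so that the pending request can be reclaimed instead of blocking the job forever. The timeout policy is an assumption made for illustration; it is not the resiliency scheme of the storage service itself.

    #include <mpi.h>
    #include <unistd.h>

    int recv_with_timeout(void *buf, int count, int src, int tag,
                          MPI_Comm comm, int timeout_ms)
    {
        MPI_Request req;
        MPI_Status  status;
        int flag = 0;

        MPI_Irecv(buf, count, MPI_BYTE, src, tag, comm, &req);

        for (int waited = 0; waited < timeout_ms && !flag; waited++) {
            MPI_Test(&req, &flag, &status);
            usleep(1000);                    /* 1 ms between progress checks   */
        }
        if (flag)
            return 0;                        /* the message arrived in time    */

        MPI_Cancel(&req);                    /* peer presumed failed           */
        MPI_Wait(&req, &status);             /* complete the cancelled request */
        return -1;                           /* caller applies its own policy  */
    }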

Then, we show that nonblocking approaches can be beneficial for non-communicating MPI routines as well. Nonblocking approaches are demonstrably a scalability strategy; but we show in the case of non-communicating routines that they can also prevent execution hazards.

We finally put forth error communication from MPI to upper layers as a systematic resource exhaustion management strategy at extreme scale. This last recommendation contrasts with the current disparate policies that lead to the sudden death of the HPC job when object limits are exceeded in various MPI implementations.
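The sketch below illustrates the recommended posture with standard MPI calls: the error handler is switched to MPI_ERRORS_RETURN so that object-creation failures are reported to the upper layer instead of aborting the job. Which error class a given MPI library actually returns on object exhaustion remains implementation-dependent.

    #include <mpi.h>
    #include <stdio.h>

    int try_dup(MPI_Comm parent, MPI_Comm *out)
    {
        MPI_Comm_set_errhandler(parent, MPI_ERRORS_RETURN);

        int rc = MPI_Comm_dup(parent, out);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "communicator creation failed: %s\n", msg);
            return -1;    /* the upper layer decides what to do; the job survives */
        }
        return 0;
    }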

Each of the proposals made in this dissertation is implemented so as to show via experimentation that the ideas put forth lead to the promised improvements. We would like to reemphasize that the improvements claimed in this research owe to the concepts, not the implementations. In Chapter 3, our proposed Rendezvous protocol enhancement is implemented with signal delivery, which is a heavyweight mechanism with a noticeable latency. The improvements that we reported in spite of the use of signal delivery show that the concept per se bears sound promises for performance and resource consumption gains. Chapter 4 makes a clear distinction between the theoretical performance analysis (Section 4.4) and the experimental evaluation (Section 4.6). The theoretical performance analysis, which is implementation-independent, describes the improvement pattern expected from a purely conceptual point of view. Finally, in Chapter 5, we show in Section 5.4 how nonblocking RMA synchronizations can substantially decrease contention, avoid epoch serialization, avoid latency propagation and mitigate peer delays by overlapping communication with delay or computation with delay. These improvements were discussed either analytically or with use cases that are all implementation-independent.

We note that Chapter 6 does not put forth performance-oriented proposals; instead, it reveals MPI behaviours that can be improved and formulates recommendations.

7.2 Future Work

Asynchronous Message Progression without Dedicated Threads

We mentioned in Section 3.2.2 the parameters ppef_period, ppef_phase, ppef_freq_decay and ppef_max_turns_per_req, which control the period, the phase, the frequency decay and the maximum number of AEF/PEF transitions before PEF gives up watching for Rendezvous control message arrival. The tuning space defined by these parameters is very large. As future work, we intend to come up with heuristics for a default, generally optimal set of values for these parameters.
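One conceivable way of exposing these knobs while the heuristics are being developed is through environment variables with heuristic defaults, as sketched below; the variable names and default values are invented for illustration and do not come from Chapter 3.

    #include <stdlib.h>

    typedef struct {
        long   period_usec;        /* ppef_period            */
        long   phase_usec;         /* ppef_phase             */
        double freq_decay;         /* ppef_freq_decay        */
        int    max_turns_per_req;  /* ppef_max_turns_per_req */
    } pef_params;

    static long env_long(const char *name, long dflt)
    {
        const char *v = getenv(name);
        return v ? atol(v) : dflt;
    }

    static pef_params pef_load_params(void)
    {
        pef_params p;
        p.period_usec       = env_long("PEF_PERIOD_USEC", 200);
        p.phase_usec        = env_long("PEF_PHASE_USEC", 0);
        p.max_turns_per_req = (int)env_long("PEF_MAX_TURNS_PER_REQ", 8);
        p.freq_decay        = 2.0;
        const char *d = getenv("PEF_FREQ_DECAY");
        if (d) p.freq_decay = atof(d);
        return p;
    }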

Furthermore, we mentioned that a sufficient condition for effective message progression is the availability of a carousel, a trigger and a watchdog. In the proposal of Chapter 3, PEF fulfills both the watchdog and the trigger. For a given communication, once the trigger is fulfilled, PEF stops or dies; meaning that the watchdog is the requirement that could force PEF to intervene more than once for a single communication. Both the watchdog and the trigger require very little CPU activity. In fact, a carefully designed watchdog could boil down to simply checking the value of an integer. Nevertheless, whenever PEF preempts the default execution flow of the thread that it silently rides, the whole state of the thread must be saved, as per the default mechanism used by the OS. As the watchdog activity can be required several times for a single communication, the cost of that in-thread context-switch could be incurred several times. Resorting to a mechanism as heavy as signal delivery to check the state of an integer obviously leaves a lot of room for improvement. As future work, we seek to replace signal delivery with a mechanism that could be made so lightweight that PEF would become almost entirely penalty-free. We rule out compile-time alteration because of the possible use of system calls and third-party libraries whose source code might not be available. We plan to investigate how the OS itself can be altered to fulfill a lightweight PEF.

Instead of saving the whole CPU context as well as the thread stack on each watchdog activity, a very limited number of select CPU registers could be manually saved. That lightweight preemption could be realized by designing a custom trap-like mechanism that is deprived of all the automatic state saving that the OS usually does. Then, if the watchdog does not infer the arrival of the control message from the value of the integer or flag, the instructions simply restore the manually saved limited set of registers and branch back to the regular execution of the thread. This whole watchdog activity can be coded in a few assembly instructions and execute very fast. If a watchdog activity detects control message arrival, then a heavy context-switch is performed once and the trigger is fulfilled. The whole proposal would still be network device-agnostic.
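The fast path of that watchdog, stripped of the register bookkeeping, is essentially the following; rendezvous_ctrl_arrived is a hypothetical flag that the control message (or the NIC, via an RDMA write) would set. In the envisioned lightweight trap, only the handful of registers clobbered by this check would be saved and restored around it, and the full OS-level context switch would be paid only on the rare true outcome.

    #include <stdatomic.h>
    #include <stdbool.h>

    extern _Atomic int rendezvous_ctrl_arrived;   /* hypothetical middleware flag */

    static inline bool pef_watchdog_fast_path(void)
    {
        /* A single load: if the control message has not arrived, the borrowed
         * thread resumes its regular execution almost immediately. */
        return atomic_load_explicit(&rendezvous_ctrl_arrived,
                                    memory_order_acquire) != 0;
    }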

Alternatively, the watchdog activity can be totally offloaded to the OS and occur inside the kernel; the application thread is then interrupted only when the watchdog discovers the control message. The watchdog in this case would not poll; it would execute only once by simply being reactive to control message arrival. For this approach to work, each MPI_IRECV for which the problematic Rendezvous scenario occurs could notify the OS of the need to interrupt the application flow when the control message arrives. Since RDMA-enabled network devices tend to bypass the operating system for communications, a co-design of OS and network device driver is required for this last approach.

Scalable Message Queue

The 4D message queue has a fixed number of dimensions; and it varies dimensionSpan according to the communicator size. As a reminder, dimensionSpan is the maximum number of distinct coordinate values hosted by each dimension. One of the goals of the multidimensional approach is to keep chunks of linear traversal short by keeping dimensionSpan short. The 4D data structure does achieve that goal thanks to the very slow increase of dimensionSpan when the communicator size grows. However, it would be interesting to study the effect of setting an absolute cap on the length of linear traversals by fixing dimensionSpan to a specific value (for instance 16); and then increasing the number of dimensions dynamically as required by the communicator size.
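The sketch below illustrates that exploration, under the assumption that coordinates are simple base-dimensionSpan digits of the source rank: dimensionSpan is capped (here at 16) and the number of dimensions grows with the communicator size instead. It is an exploration aid, not the dissertation's implementation.

    /* Cap dimensionSpan at a fixed value and grow the number of dimensions
     * with the communicator size instead. */
    #define SPAN_CAP 16
    #define MAX_DIMS 16        /* 16^16 ranks: far beyond any foreseeable job */

    static int dims_needed(int comm_size)
    {
        int dims = 1;
        long long capacity = SPAN_CAP;
        while (capacity < comm_size && dims < MAX_DIMS) {
            capacity *= SPAN_CAP;
            dims++;
        }
        return dims;           /* e.g., 3 dims for 4096 ranks, 5 for 1M ranks */
    }

    static void rank_to_coords_capped(int rank, int dims, int coords[])
    {
        for (int d = 0; d < dims; d++) {
            coords[d] = rank % SPAN_CAP;   /* each linear traversal <= SPAN_CAP */
            rank /= SPAN_CAP;
        }
    }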

We also observe in Figure 7.1 that the optimal message queue dimensionality for each communicator size can vary in a non-linear fashion. For instance, Figure 7.1(a) shows that a 10-dimension design is theoretically best for communicator sizes up to 1000. Then 6-dimension and 7-dimension queue designs are best up to about 4000 and from 4000 onward, respectively. The same observation can be made for a larger range of communicator sizes in Figure 7.1(b). In fact, these non-linear changes in optimal dimensionality continue as communicator sizes grow towards infinity. We plan to investigate this observation further and see how it translates into a concrete message queue architecture where both the dimensionality and dimensionSpan change dynamically to provide the best speed of processing for various communicator size intervals.

We are also in the process of observing the behaviour of MPI message queues, and the 4-dimensional message queue in particular, on Intel Xeon Phi processors [94]. These many-core CPUs have the particularity of not supporting out-of-order execution. Consequently, compared to mainstream CPUs, Xeon Phi processors have a different sensitivity to data access patterns, and possibly to message queue architectures.

Nonblocking MPI One-Sided Communication Synchronization

We showed that RMA epochs can complete out-of-order even with the default behaviour that we designed for the progress engine. We also provide the MPI application programmer with various info object key-value pairs to control the level of concurrent epoch progression that is deemed safe for their specific needs. However, the internal opening of RMA epochs at middleware level must still occur in a FIFO fashion. Removing that constraint introduces substantially more hazards at application level. At middleware level, it also adds quite some complexity.

However, such an option should still be given to the bold MPI programmer, as it allows the progress engine to perform the most aggressive form of epoch progression. As future work, we intend to carefully analyze and define the semantics of a total out-of-order middleware-level epoch opening and progression; and then come up with a proposal that still guarantees correctness when certain application-level conditions are met.

Figure 7.1: Theoretical optimal message queue dimensionality in terms of communicator sizes. (a) Small range; (b) Large range.

One well-known HPC optimization consists of aggregating small distinct messages into a single data transfer in order to amortize the communication startup cost and make a better utilization of the bandwidth. This practice is termed message coalescing or message aggregation.

In MPI one-sided communication, message aggregation is deferred to either the epoch-closing routine or to a flush call. Since blocking synchronizations force all the RMA data transfers to occur inside the epoch, no communication/computation overlapping is possible for coalesced messages. In fact, there is an unavoidable tradeoff between 1) reaping the benefit of message aggregation and incurring a communication/computation serialization and 2) avoiding the serialization and giving up the benefits of message aggregation. With nonblocking synchronizations, the deferral of the aggregation ceases to be an issue because the data transfer can occur outside the epoch. In particular, any transfer internally initiated in the nonblocking epoch-closing synchronization call necessarily occurs after the epoch is closed; allowing the application to still overlap the communication of the closed epoch with a subsequent activity. As a consequence, the aforementioned tradeoff disappears. With the negative side effects out of the way, we intend to propose as future work a very flexible hint-based one-sided message aggregation. The hint, provided via info objects to the RMA window, could cover aspects such as message kinds, aggregation count or epoch types.
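As an illustration of how such hints might be conveyed, the sketch below attaches aggregation-related key-value pairs to a window at creation time. The key names mpix_rma_aggregation and mpix_rma_aggregation_count are invented for this example; they are defined neither by the MPI standard nor by Chapter 5. Since implementations ignore info keys they do not recognize, hints of this kind can be added without breaking portability.

    #include <mpi.h>

    MPI_Win create_aggregating_window(void *base, MPI_Aint size, MPI_Comm comm)
    {
        MPI_Info info;
        MPI_Win  win;

        MPI_Info_create(&info);
        MPI_Info_set(info, "mpix_rma_aggregation", "true");        /* illustrative key */
        MPI_Info_set(info, "mpix_rma_aggregation_count", "32");    /* batch size hint  */

        MPI_Win_create(base, size, 1 /* disp_unit */, info, comm, &win);
        MPI_Info_free(&info);
        return win;
    }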

Service Orientation and Provision for the Revealed Missing Scalability Features

We intend to work with both the MPI Forum and implementers to ensure that MPI remains a driving force for future software and hardware architectures. We plan to pursue research on the possibility for service-oriented and multi-application HPC programs to leverage the power of MPI on today’s and future supercomputers. We are also currently working on nonblocking designs for non-communicating MPI routines which, in their blocking form and like the functions used in the connection-disconnection activities of the storage service of Chapter 6, could lead to scalability issues that even higher CPU core counts could not work around.

Bibliography

[1] Argonne National Laboratory. Communicators and Context IDs. http://wiki.mcs.anl.gov/mpich2/index.php/Communicators_and_Context_IDs. Online; accessed 2013-12-14.

[2] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf, 2006. Technical Report UCB/EECS-2006-183. Online; accessed 2013-12-14.

[3] P. Balaji, D. Buntinas, D. Goodell, W. Gropp, T. Hoefler, S. Kumar, E. Lusk, R. Thakur, and J. L. Träff. MPI on Millions of Cores. Parallel Processing Letters (PPL), 21(1):45–60, Mar. 2011.

[4] P. Balaji, A. Chan, W. Gropp, R. Thakur, and E. Lusk. The Importance of Non-Data-Communication Overheads in MPI. International Journal of High Performance Computing Application (IJHPCA), 24(1):5–15, Feb. 2010.

[5] M. Banikazemi, R. Govindaraju, R. Blackmore, and D. Panda. Implementing Efficient MPI on LAPI for IBM RS/6000 SP Systems: Experiences and Performance Evaluation. In Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP), pages 183–190, 1999.

[6] B. Barrett, R. Brightwell, S. Hemmert, K. Pedretti, K. Wheeler, K. Underwood, R. Riesen, A. Maccabe, and T. Hudson. The Portals 4.0.1 Network Programming Interface. http://www.cs.sandia.gov/Portals/portals401.pdf. Online; accessed 2013-12-06.

[7] B. Barrett, G. Shipman, and A. Lumsdaine. Analysis of Implementation Options for MPI-2 One-Sided. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 4757 of Lecture Notes in Computer Science, pages 242–250. Springer Berlin Heidelberg, 2007.

[8] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/exascale_final_report_100208.pdf, 2008. Online; accessed 2013-12-14.

[9] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. Dongarra. An Evaluation of User-Level Failure Mitigation Support in MPI. In Recent Advances in the Message Passing Interface, volume 7490 of Lecture Notes in Computer Science, pages 193–203. Springer Berlin Heidelberg, 2012.

[10] D. Bonachea and J. Duell. Problems with Using MPI 1.1 and 2.0 as Compilation Targets for Parallel Language Implementations. International Journal of High Performance Computing and Networking (IJHPCN), 1(1-3):91–99, Aug. 2004.

[11] R. Brightwell, B. Barrett, K. Hemmert, and K. Underwood. Challenges for High-Performance Networking for Exascale Computing. In Proceedings of the 2010 International Conference on Computer Communications and Networks (ICCCN), pages 1–6, 2010.

[12] R. Brightwell, S. Goudy, and K. Underwood. A Preliminary Analysis of the MPI Queue Characteristics of Several Applications. In Proceedings of the 2005 International Conference on Parallel Processing (ICPP), pages 175–183. IEEE Computer Society, 2005.

[13] R. Brightwell, S. P. Goudy, A. Rodrigues, and K. D. Underwood. Implications of Application Usage Characteristics for Collective Communication Offload. International Journal of High Performance Computing and Networking (IJHPCN), 4(3/4):104–116, Aug. 2006.

[14] R. Brightwell and K. D. Underwood. An Analysis of NIC Resource Usage for Offloading MPI. In Proceedings of the 2004 International Parallel and Distributed Processing Symposium (IPDPS), pages 183–, 2004.

[15] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker. The IBM Blue Gene/Q Interconnection Network and Message Unit. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 26:1–26:10. ACM, 2011.

[16] A. Danalis, A. Brown, L. Pollock, M. Swany, and J. Cavazos. Gravel: A Communication Library to Fast Path MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 5205 of Lecture Notes in Computer Science, pages 111–119. Springer Berlin Heidelberg, 2008.

[17] K. Davis, A. Hoisie, G. Johnson, D. Kerbyson, M. Lang, S. Pakin, and F. Petrini. A Performance and Scalability Analysis of the BlueGene/L Architecture. In Proceedings of the 2004 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 41–, Nov 2004.

[18] N. Desai, R. Bradshaw, A. Lusk, and E. Lusk. MPI Cluster System Software. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241 of Lecture Notes in Computer Science, pages 277–286. Springer Berlin Heidelberg, 2004.

[19] J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju. Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication. In Proceedings of the 2012 International Parallel and Distributed Processing Symposium (IPDPS), pages 739–750, 2012.

[20] J. Dinan, S. Krishnamoorthy, P. Balaji, J. Hammond, M. Krishnan, V. Tipparaju, and A. Vishnu. Noncollective Communicator Creation in MPI. In Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 282–291. Springer Berlin Heidelberg, 2011.

[21] J. Dongarra. Visit to the National University for Defense Technology Changsha, China. http://www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf. Online; accessed 2013-12-13.

[22] J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert, S. Matsuoka, P. Messina, T. Moore, R. Stevens, A. Trefethen, and M. Valero. The International Exascale Software Project: A Call To Cooperative Action By the Global High-Performance Community. International Journal of High Performance Computing Applications (IJHPCA), 23(4):309–322, Nov. 2009.

[23] G. E. Fagg and J. J. Dongarra. Building and Using a Fault-Tolerant MPI Implementation. International Journal of High Performance Computing Applications (IJHPCA), 18(3):353–361, 2004.

[24] K. Ferreira, J. Stearley, J. Laros, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. Bridges, and D. Arnold. Evaluating the Viability of Process Replication Reliability for Exascale Systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–12, 2011.

[25] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of the 1994 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 379–386, 1994.

[26] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241 of Lecture Notes in Computer Science, pages 97–104. Springer Berlin Heidelberg, 2004.

[27] R. Gerstenberger, M. Besta, and T. Hoefler. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. In Proceedings of the 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 53:1–53:12. ACM, 2013.

[28] M. Giampapa, T. Gooding, T. Inglett, and R. Wisniewski. Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene’s CNK. In Proceedings of the 2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 1–10, 2010.

[29] M. Golebiewski and J. L. Träff. MPI-2 One-Sided Communications on a Giganet SMP Cluster. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2131 of Lecture Notes in Computer Science, pages 16–23. Springer Berlin Heidelberg, 2001.

[30] R. L. Graham, R. Brightwell, B. Barrett, G. Bosilca, and J. Pjesivac-Grbovic. An Evaluation of Open MPI’s Matching Transport Layer on the Cray XT. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 4757 of Lecture Notes in Computer Science, pages 161–169. Springer Berlin Heidelberg, 2007.

[31] R. L. Graham, S. Poole, P. Shamis, G. Bloch, N. Bloch, H. Chapman, M. Kagan, A. Shahar, I. Rabinovitz, and G. Shainer. ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations. In Proceedings of the 2010 IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGRID), pages 53–62. IEEE Computer Society, 2010.

[32] R. E. Grant, M. J. Rashti, P. Balaji, and A. Afsahi. RDMA Capable iWARP over Datagrams. In Proceedings of the 2011 IEEE International Parallel Distributed Processing Symposium (IPDPS), pages 628–639, 2011.

[33] W. Gropp. MPI 3 and Beyond: Why MPI Is Successful and What Challenges It Faces. In Recent Advances in the Message Passing Interface, volume 7490 of Lecture Notes in Computer Science, pages 1–9. Springer Berlin Heidelberg, 2012.

[34] W. Gropp and E. Lusk. Fault Tolerance in Message Passing Interface Programs. International Journal of High Performance Computing Application (IJHPCA), 18(3):363–372, Aug. 2004.

[35] M. Gschwind, D. Erb, S. Manning, and M. Nutter. An Open Source Environment for Cell Broadband Engine System Software. Computer, 40(6):37–47, 2007.

[36] S. Gutierrez, N. Hjelm, M. Venkata, and R. Graham. Performance Evaluation of Open MPI on Cray XE/XK Systems. In Proceedings of the 2012 Annual Symposium on High-Performance Interconnects (HOTI), pages 40–47, 2012.

[37] Z. Haiyang and L. Qiaoyu. Red-Black Tree Used for Arranging Virtual Memory Area of Linux. In Proceedings of the 2010 International Conference on Management and Service Science (MASS), pages 1–3, 2010.

[38] M.-A. Hermanns, M. Geimer, B. Mohr, and F. Wolf. Scalable Detection of MPI-2 Remote Memory Access Inefficiency Patterns. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 5759 of Lecture Notes in Computer Science, pages 31–41. Springer Berlin Heidelberg, 2009.

[39] T. Hoefler and A. Lumsdaine. Message Progression in Parallel Computing - to Thread or Not to Thread? In Proceedings of the 2008 IEEE International Conference on Cluster Computing (Cluster), pages 213–222, 2008.

[40] T. Hoefler, A. Lumsdaine, and J. Dongarra. Towards Efficient MapReduce Using MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 5759 of Lecture Notes in Computer Science, pages 240–249. Springer Berlin Heidelberg, 2009.

[41] T. Hoefler and M. Snir. Writing Parallel Libraries with MPI - Common Practice, Issues, and Extensions. In Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 345–355. Springer Berlin Heidelberg, 2011.

[42] J. Hursey, R. Graham, G. Bronevetsky, D. Buntinas, H. Pritchard, and D. Solt. Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance. In Recent Advances in the Message Passing Interface, volume 6960 of Lecture Notes in Computer Science, pages 329–332. Springer Berlin Heidelberg, 2011.

[43] J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI. In Proceedings of the 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–8, 2007.

[44] InfiniBand. http://www.infinibandta.org/. Online; accessed 2013-12-04.

[45] G. Inozemtsev and A. Afsahi. Designing an Offloaded Nonblocking MPI Allgather Collective Using CORE-Direct. In Proceedings of the 2012 IEEE International Conference on Cluster Computing (Cluster), pages 477–485, 2012.

[46] Intel. Intel threading building blocks. https://www.threadingbuildingblocks.org/. Online; accessed 2013-12-14.

[47] W. Jiang, J. Liu, H.-W. Jin, D. Panda, D. Buntinas, R. Thakur, and W. Gropp. Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241 of Lecture Notes in Computer Science, pages 68–76. Springer Berlin Heidelberg, 2004.

[48] W. Jiang, J. Liu, H.-W. Jin, D. Panda, W. Gropp, and R. Thakur. High Performance MPI-2 One-sided Communication over InfiniBand. In Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pages 531–538, 2004.

[49] R. Keller and R. L. Graham. Characteristics of the Unexpected Message Queue of MPI Applications. In Recent Advances in the Message Passing Interface, volume 6305 of Lecture Notes in Computer Science, pages 179–188. Springer Berlin Heidelberg, 2010.

[50] K. Kharbas, D. Kim, T. Hoefler, and F. Mueller. Assessing HPC Failure Detectors for MPI Jobs. In Proceedings of the 2012 Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 81–88, 2012.

[51] R. Knepper, S. Michael, W. Johnson, R. Henschel, and M. Link. The Lustre File System and 100 Gigabit Wide Area Networking: An Example Case from SC11. In Proceedings of the 2012 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 260–267, 2012.

[52] M. Koop, J. Sridhar, and D. Panda. Scalable MPI Design over InfiniBand Using eXtended Reliable Connection. In Proceedings of the 2008 IEEE International Conference on Cluster Computing (Cluster), pages 203–212, 2008.

[53] M. Koop, J. Sridhar, and D. Panda. TupleQ: Fully-Asynchronous and Zero-Copy MPI over InfiniBand. In Proceedings of the 2009 IEEE International Symposium on Parallel Distributed Processing (IPDPS), pages 1–8, 2009.

[54] M. Krishnan, J. Nieplocha, M. Blocksome, and B. Smith. Evaluation of Remote Memory Access Communication on the IBM Blue Gene/P Supercomputer. In Proceedings of the 2008 International Conference on Parallel Processing (ICPP), pages 109–115, 2008.

[55] A. Kühnal, M.-A. Hermanns, B. Mohr, and F. Wolf. Specification of Inefficiency Patterns for MPI-2 One-Sided Communication. In Proceedings of the 2006 Euro-Par Conference, volume 4128 of Lecture Notes in Computer Science, pages 47–62. Springer, August - September 2006.

[56] R. Kumar, A. R. Mamidala, M. J. Koop, G. Santhanaraman, and D. Panda. Lock-Free Asynchronous Rendezvous Design for MPI Point-to-Point Communication. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 5205 of Lecture Notes in Computer Science, pages 185–193. Springer Berlin Heidelberg, 2008.

[57] S. Kumar, A. Mamidala, D. Faraj, B. Smith, M. Blocksome, B. Cernohous, D. Miller, J. Parker, J. Ratterman, P. Heidelberger, D. Chen, and B. Steinmacher-Burrow. PAMI: A Parallel Active Message Interface for the Blue Gene/Q Supercomputer. In Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 763–773, 2012.

[58] P. Lai, S. Sur, and D. Panda. Designing Truly One-sided MPI-2 RMA Intra-node Communication on Multi-core Systems. Computer Science - Research and Development, 25(1-2):3–14, 2010.

[59] C. Lameter. Shoot First and Stop the OS Noise. https://www.kernel.org/doc/ols/2009/ols2009-pages-159-168.pdf. Linux Foundation. Online; accessed 2014-02-12.

[60] R. Latham, R. Ross, and R. Thakur. Can MPI Be Used for Persistent Parallel Services? In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 4192 of Lecture Notes in Computer Science, pages 275–284. Springer Berlin Heidelberg, 2006.

[61] J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen. Design and Implementation of MPICH2 over InfiniBand with RDMA Support. In Proceedings of the 2004 IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2004.

[62] X. Lu, B. Wang, L. Zha, and Z. Xu. Can MPI Benefit Hadoop and MapReduce Applications? In Proceedings of the 2011 International Conference on Parallel Processing Workshops (ICPPW), pages 371–379, 2011.

[63] M. Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, and D. Panda. High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2. In Proceedings of the 2010 International Conference on Parallel Processing Workshops (ICPPW), pages 377–386, 2010.

[64] E. Lusk and A. Chan. Early Experiments with the OpenMP/MPI Hybrid Programming Model. In OpenMP in a New Era of Parallelism, volume 5004 of Lecture Notes in Computer Science, pages 36–47. Springer Berlin Heidelberg, 2008.

[65] A. Marathe, D. Lowenthal, Z. Gu, M. Small, and X. Yuan. Profile Guided MPI Protocol Selection for Point-to-Point Communication Calls. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), pages 733–739, 2011.

[66] Mellanox Technologies Inc. Why Compromise? A Discussion on RDMA versus Send/Receive and the Difference Between Interconnect and Application Semantics. http://www.mellanox.com/pdf/whitepapers/WP_Why_Compromise_10_26_06.pdf. White Paper. Online; accessed 2013-12-13.

[67] H. Miyazaki, Y. Kusano, H. Okano, T. Nakada, K. Seki, T. Shimizu, N. Shinjo, F. Shoji, A. Uno, and M. Kurokawa. K Computer: 8.162 PetaFLOPS Scalar Supercomputer Built with over 548k Cores. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 192–194, 2012.

[68] MPI Forum. The Message Passing Interface. http://www.mpi-forum.org/. Online; accessed 2013-12-04.

[69] MPICH: High Performance Portable MPI. http://www.mpich.org/. Online; accessed 2013-12-11.

[70] J. I. Munro, T. Papadakis, and R. Sedgewick. Deterministic Skip Lists. In Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’92, pages 367–375, Philadelphia, PA, USA, 1992. Society for Industrial and Applied Mathematics.

[71] MVAPICH. http://mvapich.cse.ohio-state.edu/. Online; accessed 2013-12-11.

[72] Myrinet. http://www.myri.com/. Online; accessed 2013-12-13.

[73] Myrinet Express (MX): A High Performance, Low-level, Message-passing Interface for Myrinet. http://www.myri.com/scs/MX/doc/mx.pdf, 2003. Online; accessed 2013-12-17.

[74] NASA Advanced Supercomputing Division. NAS Parallel Benchmarks. http://www.nas.nasa.gov/resources/software/npb.html. Online; accessed 2013-12-14.

[75] J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease, and E. Aprà. Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit. International Journal of High Performance Computing Application (IJHPCA), 20(2):203–231, May 2006.

[76] J. Nieplocha, V. Tipparaju, and E. Aprà. An Evaluation of Two Implementation Strategies for Optimizing One-Sided Atomic Reduction. In Proceedings of the 2005 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 215.2–, 2005.

[77] A. Nisar, W.-k. Liao, and A. Choudhary. Scaling Parallel I/O Performance through I/O Delegate and Caching System. In Proceedings of the 2008 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 9:1–9:12, 2008.

[78] OpenFabrics Alliance. http://www.openfabrics.org/. Online; accessed 2013-12-13.

[79] Open MPI. http://www.open-mpi.org/. Online; accessed 2013-12-11.

[80] S. Pakin. Receiver-initiated Message Passing over RDMA Networks. In Proceedings of the 2008 IEEE International Parallel Distributed Processing Symposium (IPDPS), pages 1–12, 2008.

[81] F. Petrini, W. chun Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics Network: High-Performance Clustering Technology. Micro, IEEE, 22(1):46–57, 2002.

[82] S. J. Plimpton and K. D. Devine. MapReduce in MPI for Large-scale Graph Algorithms. Journal of Parallel Computing, 37(9):610–632, Sept. 2011.

[83] S. Ramalingam, M. Hall, and C. Chen. Improving High-Performance Sparse Libraries Using Compiler-Assisted Specialization: A PETSc Case Study. In Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 487–496. IEEE Computer Society, 2012.

[84] M. J. Rashti and A. Afsahi. Improving Communication Progress and Overlap in MPI Rendezvous Protocol over RDMA-enabled Interconnects. In Proceedings of the 2008 International Symposium on High Performance Computing Systems and Applications (HPCS), pages 95–101, 2008.

[85] M. J. Rashti and A. Afsahi. A Speculative and Adaptive MPI Rendezvous Protocol Over RDMA-enabled Interconnects. International Journal of Parallel Programming, 37(2):223–246, 2009.

[86] RDMA Consortium. http://www.rdmaconsortium.org. Online; accessed 2013-12-11.

[87] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia. A Remote Direct Memory Access Protocol Specification. http://tools.ietf.org/search/rfc5040. Online; accessed 2014-01-13.

[88] J. M. Richter and C. Nasarre. Windows via C/C++. Microsoft Press, 5th edition, 2007.

[89] G. Santhanaraman, P. Balaji, K. Gopalakrishnan, R. Thakur, W. Gropp, and D. K. Panda. Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand. In Proceedings of the 2009 IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pages 380–387. IEEE Computer Society, 2009.

[90] G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala, and D. Panda. Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication for InfiniBand Clusters. In Proceedings of the 2009 IEEE International Conference on Cluster Computing and Workshops (Cluster), pages 1–9, 2009.

[91] G. Santhanaraman, S. Narravula, and D. Panda. Designing Passive Synchronization for MPI-2 One-sided Communication to Maximize Overlap. In Proceedings of the 2008 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–11, 2008.

[92] H. Shan, J. P. Singh, L. Oliker, and R. Biswas. Message Passing and Shared Address Space Parallelism on an SMP Cluster. Parallel Computing, 29(2):167 – 186, 2003.

[93] P. Shivam, P. Wyckoff, and D. Panda. EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing. In Proceedings of the 2001 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pages 49–49, 2001.

[94] M. Si, Y. Ishikawa, and M. Takagi. Direct MPI Library for Intel Xeon Phi Co-Processors. In Proceedings of the 2013 IEEE International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), pages 816–824, 2013.

[95] M. Small, Z. Gu, and X. Yuan. Near-Optimal Rendezvous Protocols for RDMA-Enabled Clusters. In Proceedings of the 2010 International Conference on Parallel Processing (ICPP), pages 644–652, 2010.

[96] M. Small and X. Yuan. Maximizing MPI Point-to-point Communication Performance on RDMA-enabled Clusters with Customized Protocols. In Proceedings of the 2009 International Conference on Supercomputing (ICS), pages 306–315. ACM, 2009.

[97] J. Soumagne, D. Kimpe, J. A. Zounmevo, M. Chaarawi, Q. Koziol, A. Afsahi, and R. Ross. Mercury: Enabling Remote Procedure Call for High-Performance Computing. In Proceedings of the 2013 IEEE International Conference on Cluster Computing (Cluster), 2013.

[98] T. Sterling. HPC in Phase Change: Towards a New Execution Model. In High Performance Computing for Computational Science VECPAR 2010, volume 6449 of Lecture Notes in Computer Science, pages 31–31. Springer Berlin Heidelberg, 2011.

[99] H. Su, B. Gordon, S. Oral, and A. George. SCI Networking for Shared-memory Computing in UPC: Blueprints of the GASNet SCI Conduit. In Proceedings of the 2004 Annual IEEE International Conference on Local Computer Networks (LCN), pages 718–725, 2004.

[100] S.-J. Sul and A. Tovchigrechko. Parallelizing BLAST and SOM Algorithms with MapReduce-MPI Library. In Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), pages 481–489, 2011.

[101] Y. Sun, G. Zheng, L. Kale, T. Jones, and R. Olson. A uGNI-based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect. In Proceedings of the 2012 IEEE International Parallel Distributed Processing Symposium (IPDPS), pages 751–762, 2012.

[102] S. Sur, H.-W. Jin, L. Chai, and D. K. Panda. RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits. In Proceedings of the 2006 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 32–39. ACM, 2006.

[103] N. Tanabe, A. Ohta, P. Waskito, and H. Nakajo. Network Interface Architecture for Scalable Message Queue Processing. In Proceedings of the 2009 International Conference on Parallel and Distributed Systems (ICPADS), pages 268–275, 2009.

[104] R. Thakur, W. Gropp, and B. Toonen. Optimizing the Synchronization Operations in Message Passing Interface One-Sided Communication. International Journal of High Performance Computing Application (IJHPCA), 19:119–128, 2005.

[105] V. Tipparaju, W. Gropp, H. Ritzdorf, R. Thakur, and J. L. Träff. Investigating High Performance RMA Interfaces for the MPI-3 Standard. In Proceedings of the 2009 International Conference on Parallel Processing, pages 293–300. IEEE Computer Society, 2009.

[106] Top500. http://www.top500.org/. Online; accessed 2013-12-04.

[107] B. Tourancheau and R. Westrelin. Support for MPI at the Network Interface Level. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 2131 of Lecture Notes in Computer Science, pages 52–60. Springer Berlin Heidelberg, 2001.

[108] J. L. Träff, H. Ritzdorf, and R. Hempel. The Implementation of MPI-2 One-sided Communication for the NEC SX-5. In Proceedings of the 2000 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Washington, DC, USA, 2000. IEEE Computer Society.

[109] K. Underwood and R. Brightwell. The Impact of MPI Queue Usage on Message Latency. In Proceedings of the 2004 International Conference on Parallel Processing (ICPP), pages 152–160, 2004.

[110] K. D. Underwood, K. S. Hemmert, A. Rodrigues, R. Murphy, and R. Brightwell. A Hardware Acceleration Unit for MPI Queue Processing. In Proceedings of the 2005 IEEE International Symposium on Parallel Distributed Processing (IPDPS), pages 96.2–. IEEE Computer Society, 2005.

[111] A. Wagner, H.-W. Jin, D. Panda, and R. Riesen. NIC-based Offload of Dynamic User-defined Modules for Myrinet Clusters. In Proceedings of the 2004 IEEE International Conference on Cluster Computing (Cluster), pages 205–214, 2004.

[112] D. Wang and J. Liu. Peer-to-Peer Asynchronous Video Streaming using Skip List. In 2006 IEEE International Conference on Multimedia and Expo, pages 1397–1400, July 2006.

[113] L. Xiao and T. Yu-An. TPL: A Data Layout Method for Reducing Rotational Latency of Modern Hard Disk Drive. In Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering (WCECS), volume 7, pages 336–340, 2009.

[114] K. Yelick, D. Bonachea, W.-Y. Chen, P. Colella, K. Datta, J. Duell, S. L. Graham, P. Hargrove, P. Hilfinger, P. Husbands, C. Iancu, A. Kamil, R. Nishtala, J. Su, M. Welcome, and T. Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (PASCO), pages 24–32. ACM, 2007.

[115] X. Zhao, D. Buntinas, J. A. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, and W. Gropp. Toward Asynchronous and MPI-Interoperable Active Messages. In Proceedings of the 2013 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 87–94, 2013.

[116] X. Zhao, G. Santhanaraman, and W. Gropp. Adaptive Strategy for One-Sided Communication in MPICH2. In Recent Advances in the Message Passing Interface, volume 7490 of Lecture Notes in Computer Science, pages 16–26. Springer Berlin Heidelberg, 2012.

[117] J. A. Zounmevo and A. Afsahi. Investigating Scenario-Conscious Asynchronous Rendezvous over RDMA. In Proceedings of the 2011 IEEE International Conference on Cluster Computing (Cluster), pages 542–546, 2011.

[118] J. A. Zounmevo and A. Afsahi. An Efficient MPI Message Queue Mechanism for Large-scale Jobs. In Proceedings of the 2012 International Conference on Parallel and Distributed Systems (ICPADS), pages 464–471, 2012.

[119] J. A. Zounmevo and A. Afsahi. A Fast and Resource-conscious MPI Message Queue Mechanism for Large-scale Jobs. Journal of Future Generation Computer Systems, 30(0):265–290, 2014.

[120] J. A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi. Using MPI in High-Performance Computing Services. In Proceedings of the 2013 European MPI Users’ Group Meeting (EuroMPI), pages 43–48. ACM, 2013.

[121] J. A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi. Extreme-Scale Computing Services over MPI: Experiences, Observations and Features Proposal for Next Generation Message Passing. International Journal of High Performance Computing Applications (IJHPCA), 2014. Under review.

[122] J. A. Zounmevo, X. Zhao, P. Balaji, W. Gropp, and A. Afsahi. Nonblocking Epochs for MPI One-Sided Communications. 2014. Submitted to the 2014 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

List of Publications

Journal Publications

J. A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi. Extreme-Scale Computing Services over MPI: Experiences, Observations and Features Proposal for Next Generation Message Passing. International Journal of High Performance Computing Applications (IJHPCA), 2014. Recommended for publication with minor revisions.

J. A. Zounmevo and A. Afsahi. A Fast and Resource-conscious MPI Message Queue Mechanism for Large-scale Jobs. Journal of Future Generation Computer Systems, 30(0):265–290, 2014.

Conference Publications

J. A. Zounmevo and A. Afsahi. Intra-Epoch Message Scheduling to Exploit Unused or Residual Overlapping Potential. Submitted to the 2014 EuroMPI/Asia Conference, 2014. Under review.

J. A. Zounmevo, X. Zhao, P. Balaji, W. Gropp, and A. Afsahi. Nonblocking Epochs for MPI One-Sided Communications. Submitted to the 2014 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014. Under review.

J. Soumagne, D. Kimpe, J. A. Zounmevo, M. Chaarawi, Q. Koziol, A. Afsahi, and R. Ross. Mercury: Enabling Remote Procedure Call for High-Performance Computing. In Proceedings of the 2013 IEEE International Conference on Cluster Computing (Cluster), 2013.

X. Zhao, D. Buntinas, J. A. Zounmevo, J. Dinan, D. Goodell, P. Balaji, R. Thakur, A. Afsahi, and W. Gropp. Toward Asynchronous and MPI-Interoperable Active Messages. In Proceedings of the 2013 IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 87–94, 2013.

J. A. Zounmevo, D. Kimpe, R. Ross, and A. Afsahi. Using MPI in High-Performance Computing Services. In Proceedings of the 2013 European MPI Users’ Group Meeting (EuroMPI), pages 43–48. ACM, 2013.

J. A. Zounmevo and A. Afsahi. An Efficient MPI Message Queue Mechanism for Large-scale Jobs. In Proceedings of the 2012 International Conference on Parallel and Distributed Systems (ICPADS), pages 464–471, 2012.

J. A. Zounmevo and A. Afsahi. Investigating Scenario-Conscious Asynchronous Rendezvous over RDMA. In Proceedings of the 2011 IEEE International Conference on Cluster Computing (Cluster), pages 542–546, 2011.
