TECHNISCHE UNIVERSITÄT DRESDEN FAKULTÄT INFORMATIK

INSTITUT FÜR TECHNISCHE INFORMATIK PROFESSUR FÜR RECHNERARCHITEKTUR

Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

Dissertation submitted in fulfillment of the requirements for the academic degree Doktor rerum naturalium (Dr. rer. nat.)

presented by

Surname: Domke   First name: Jens   Born on: 12.09.1984 in: Bad Muskau

Date of submission: 30.03.2017   Date of defense: 16.06.2017

Reviewers: Prof. Dr. rer. nat. Wolfgang E. Nagel, TU Dresden, Germany; Professor Tor Skeie, PhD, University of Oslo, Norway

Multicast loops are bad since the same multicast packet will go around and around, inevitably creating a black hole that will destroy the Earth in a fiery conflagration. — OpenSM Source Code


Abstract

In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which either are irregular by design or, most often, a result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.

The fail-in-place strategy, a well-established method in storage systems of repairing only critical component failures, is a feasible solution for current and future HPC interconnects as well as for other large-scale installations such as data center networks. However, an appropriate combination of topology and routing is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs.

The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing the novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions do.


Contents

List of Figures

List of Tables

List of Algorithms

List of Abbreviations and Symbols

1 Introduction
   1.1 Motivation
   1.2 Contributions
   1.3 Thesis Organization

2 Related Work
   2.1 State-of-the-Art for High-Performance Interconnect Designs
      2.1.1 Resiliency of Networks, Topologies, and Routings
      2.1.2 HPC Routing Strategies and Network Management
      2.1.3 Simulation Frameworks for Network Analyses
   2.2 State-of-the-Art for Deadlock-free Routing Approaches
      2.2.1 Topology-aware Routing Algorithms
      2.2.2 Topology-agnostic Routing Algorithms

3 Background, Assumptions, and Definitions
   3.1 Interconnection Networks and Routing Algorithms
      3.1.1 Network-related Definitions
      3.1.2 Routing-related Definitions
      3.1.3 Topology and Routing Metrics
   3.2 Common Network Topologies and Production HPC Systems
      3.2.1 Meshes and Tori
      3.2.2 Fat-Trees and eXtended Generalized Fat-Trees
      3.2.3 Kautz Graph
      3.2.4 Dragonfly Topologies
      3.2.5 Random Topologies
      3.2.6 Real-world HPC Systems
   3.3 Selection of Routing Algorithms for HPC
      3.3.1 MinHop Routing
      3.3.2 Single-Source Shortest-Path Routing (SSSP)
      3.3.3 Dimension Order Routing (DOR)
      3.3.4 Up*/Down* and Down*/Up* Routing
      3.3.5 Fat-Tree Routing
      3.3.6 Layered Shortest Path Routing (LASH)
      3.3.7 Torus-2QoS Routing
      3.3.8 Deadlock-Free SSSP Routing (DFSSSP)

4 Fail-in-Place High-Performance Networks
   4.1 Failures Analysis for Real HPC Systems
   4.2 Building Blocks for Fail-in-Place Networks
      4.2.1 Resilient Topologies
      4.2.2 Resilient Routing Algorithms
      4.2.3 Resiliency Metrics
   4.3 Interconnect Simulation Framework
      4.3.1 Toolchain Overview
      4.3.2 InfiniBand Model in OMNeT++
      4.3.3 Traffic Injection
      4.3.4 Simulator Improvements
   4.4 Simulation Results
      4.4.1 Initial Usability Study
      4.4.2 Influence of Link Faults on Small Topologies
      4.4.3 Influence of Link Faults on Large Topologies
      4.4.4 Case Study of the TSUBAME2 Supercomputer
      4.4.5 Case Study of the Deimos HPC System

5 Utilization Improvement through SAR
   5.1 Conceptual Design of SAR for shared HPC Environments
   5.2 Example Implementation of the Scheduling-Aware Routing
      5.2.1 Hardware and Software Building Blocks
      5.2.2 Scheduling-Aware Routing Optimization
      5.2.3 Property Preserving Network Update for InfiniBand
   5.3 Evaluation
      5.3.1 Current Limitations and Problems
      5.3.2 Theoretical Evaluation of Network Metrics
      5.3.3 Practical Evaluation on a Production System

6 Routing on the Channel Dependency Graph
   6.1 Limitations of Multi-Step Routing Approaches
   6.2 Nue Routing
      6.2.1 Complete Channel Dependency Graph
      6.2.2 Escape Paths for the Static Nue Routing
      6.2.3 Choosing Root Node for the Spanning Tree
      6.2.4 Dijkstra's Algorithm for the complete CDG
      6.2.5 Nue Routing Function
      6.2.6 Optimizations for Nue Routing
      6.2.7 Correctness, Completeness & Complexity
   6.3 Evaluation of Nue Routing
      6.3.1 Path Length and Edge Forwarding Index
      6.3.2 Throughput for Regular and Irregular Topologies
      6.3.3 Runtime and Practical Considerations

7 Summary, Conclusion, and Future Work

Bibliography


List of Figures

1.1 Interconnect technology statistics of the TOP500 list of November 2016

3.1 5-node interconnection network I = G(N,C)
3.2 Using a shortest-path, counter-clockwise routing for network I induces the channel dependency graph D
3.3 Example visualization of a 3-dimensional mesh and torus topology without terminals
3.4 Example visualization of a k-ary n-tree topology and an extended generalized fat-tree topology
3.5 Example visualization of a Kautz graph and a dragonfly topology
3.6 Topology configurations of three real-world HPC systems

4.1 Network-related hardware failures of three HPC systems
4.2 Whisker plot for lost connections while removing links or switches from a 10-ary 3-tree without performing a rerouting
4.3 Toolchain to evaluate the network throughput of a fail-in-place network
4.4 Whisker plots of consumption bandwidth for uniform random injection
4.5 Histograms for exchange patterns with error bars, showing mean value and the 95% confidence interval for ten seeds
4.6 Histograms for exchange patterns executed on networks with up to 8% link failures
4.7 Histograms for exchange patterns for different failure types using DFSSSP, fat-tree and Up*/Down* routing on TSUBAME2
4.8 Histograms for exchange patterns for different failure types using MinHop, DFSSSP and Up*/Down* routing on Deimos

5.1 Batch job history of two petascale HPC systems for one month of operation
5.2 Comparison of effective EFI for inter-switch links: Oblivious routing vs. scheduling-aware routing for three equal-sized batch jobs on a 2-level fat-tree
5.3 Qualitative comparison of different routing and network optimizations for supercomputers
5.4 SLURM architecture
5.5 Open MPI's modular component architecture establishes an interface between application and hardware to enable inter-process communication
5.6 Flowchart of a filtering tool
5.7 Sequence diagram of a five-phase update protocol to achieve property preserving network updates of InfiniBand networks
5.8 Replay of job history (Figure 5.1) for two HPC systems
5.9 Heat map of the EFIs for inter-switch links of TSUBAME2 for the 200-node batch job on day 16 of Feb. '15; Used routing algorithm: fat-tree
5.10 Heat map of the effective EFIs for inter-switch links of TSUBAME2 for the 200-node batch job on day 16 of Feb. '15; Used routing algorithm: SAR
5.11 Histograms of effective EFI for inter-switch links: Fat-tree routing vs. scheduling-aware routing for the 200-node batch job on TSUBAME2
5.12 Runtime measurement for MPI_Alltoall (with 1 MiByte in each send buffer) on 28 compute nodes using three different routing approaches
5.13 Visualization of the network update protocol (without QP draining) and path migration between two inter-switch links on a testbed during high MPI load

6.1 Using a shortest-path routing for 5-node ring network I induces the CDG D
6.2 Simulated throughput for an all-to-all operation and required VCs for deadlock-freedom for different routing algorithms
6.3 Complete CDG D for the 5-ring network with shortcut; All channels are in unused state
6.4 Sketching the process of finding paths between network nodes within the complete CDG D while avoiding closing any cycles in the induced CDG D
6.5 Acyclic escape paths D^τ for the 5-node ring network with shortcut
6.6 Arrows show the initial channel dependencies of the escape paths for the node set N_i^d = {n1, n2, n3} and root node n5 or root node n2, respectively
6.7 State change of D after five steps with Algorithm 6.2, starting from c_{n1,n2}
6.8 Impasse of Algorithm 6.2 to reach n4 based on previously placed routing restrictions for the channel dependencies (c_{n1,n3}, c_{n3,n4}) and (c_{n7,n5}, c_{n5,n4})
6.9 States of the channel dependency graph D of the subnetwork I∗ of Figure 6.8a
6.10 Averaged edge forwarding index metrics for 1,000 random topologies with 125 switches, 1,000 terminals, and 1,000 switch-to-switch channels
6.11 Simulated throughput for an all-to-all operation on five standard and three real-world topologies, as configured according to Table 6.1
6.12 Runtime comparison of deadlock-free routings for 3-dimensional tori (without channel redundancy) of various topology sizes with 1% injected link failures

List of Tables

2.1 Qualitative comparison of existing topology-aware and topology-agnostic routing algorithms for supercomputers

3.1 Vertex-connectivity for the investigated network topologies of Sections 3.2.1 – 3.2.4 (except for random and Cascade)

4.1 Comparison of network-related hardware and software failures, MTBF/MTTR, and annual failure rates for three production supercomputers
4.2 Topology configurations for a balanced (and unbalanced) number of HCAs with link redundancy r_C′ to allow a fair comparison between topology types
4.3 Usability of topology and routing combinations for networks listed in Table 4.2
4.4 Intercept [in Gbyte/s], slope, and R² for balanced topologies of Table 4.2 for the best-performing routing algorithm
4.5 Intercept [in Gbyte/s], slope, and R² for the large topology/routing combination shown in Figure 4.6c
4.6 Intercept, slope, and R² for TSUBAME2 and Deimos shown in Figure 4.7c and Figure 4.8c

5.1 Improvements by the scheduling-aware routing compared to DFSSSP, fat-tree, and Up*/Down* routing for the Taurus HPC system
5.2 Improvements by the scheduling-aware routing compared to DFSSSP, fat-tree, and Up*/Down* routing for the TSUBAME2 HPC system

6.1 Topology configurations, with link redundancy r_C′, used for throughput simulation results visualized in Figure 6.11


List of Algorithms

3.1 SSSP: Global balancing of routes with single-source shortest-path routing
3.2 Fat-Tree: Setting LFTs towards all compute nodes
3.3 Fat-Tree: Assign-down-going-port-by-ascending
3.4 Fat-Tree: Assign-up-going-port-by-descending
3.5 LASH: Deadlock-free path calculation by layered shortest path routing
3.6 DFSSSP: Search cycles in the CDGs and remove potential deadlocks

5.1 Scheduling-aware DFSSSP routing

6.1 Convex subgraph for node set N^d of an unweighted graph G(N,C)
6.2 Modified Dijkstra's algorithm within the complete CDG D
6.3 Nue routing calculates all paths within a network I = G(N,C) for a given number of virtual channels k ≥ 1
6.4 Search for cyclic used subgraphs in the complete CDG D
6.5 Local backtracking finds paths into islands
6.6 Finding shortcuts through island nodes


List of Abbreviations and Symbols

ACK Acknowledgement signal to acknowledge the receipt of data

API Application programming interface

APM Automatic path migration

BFS Breadth-first search

BGP Border Gateway Protocol

BTL Byte transfer layer of Open MPI

CDG Channel dependency graph

CEE Converged enhanced Ethernet

DDR Double data rate – 5 Gb/s per lane (or 20 Gb/s for 4x link width)

DFS Depth-first search

DFSSSP Deadlock-free single-source shortest-path routing

DOR Dimension order routing

EDR Enhanced data rate – 25 Gb/s per lane (or 100 Gb/s for 4x link width)

EFI Edge forwarding index

FDR Fourteen data rate – 14 Gb/s per lane (or 56 Gb/s for 4x link width)

HCA Host channel adapter (InfiniBand’s network interface card)

HPC High-performance computing

IB InfiniBand

IBTA InfiniBand Trade Association

LASH Layered shortest path routing

LFT Linear forwarding tables

LID Local identifier

LMC LID mask control

MAD Management datagram packet of InfiniBand

MCA Modular component architecture

MPI Message Passing Interface

MTBF Mean time between failure

MTTR Mean time to repair

MTU Maximum transfer unit

MUD Multiple Up*/Down* routing

NIC Network interface card

NoC Network-on-Chip

PCI Peripheral Component Interconnect

QDR Quad data rate – 10 Gb/s per lane (or 40 Gb/s for 4x link width)

QP Queue pair

RC Reliable connection

SAR Scheduling-aware routing

SDN Software defined network

SDR Single data rate – 2.5 Gb/s per lane (or 10 Gb/s for 4x link width)

SLURM Simple Linux Utility for Resource Management

SQ Send queue

SQD Send queue draining

SSSP Single-source shortest-path routing

UC Unreliable connection

VC Virtual channel

VL Virtual lanes (InfiniBand’s virtual channel)

WR Work request

XGFT eXtended Generalized Fat-Tree

χ Network state, i.e., the entirety of all forwarding tables of all switches

∆ Maximum degree of the interconnection network, i.e., maximal switch radix

γ(I,R) Edge forwarding index of the network I as defined by the routing R

γ^e(I,R,J) Edge forwarding index of the network I as defined by the routing R and job set J

γ^e_{R,J}(c) Effective edge forwarding index of a channel c ∈ C w.r.t. job set J

γ_R(c) Edge forwarding index of a channel c ∈ C defined by the routing R

κ/λ Vertex-/edge-connectivity for a graph or network

P^s_{n_u,n_v} Shortest path between two network nodes n_u, n_v ∈ N

ω(·) Function assigning a unique identifier to a cycle-free subgraph of D_i

D_i Complete channel dependency graph G(C_i, E_i) of the i-th virtual layer L_i

E_i Set of edges (⊆ C_i × C_i) of the complete channel dependency graph D_i

τ(I) Spanning tree of network I

θ_{R,J} Dark percentage of unused channels for intra-job routes as induced by the routing R

c̃ Virtual channel of a network channel c ∈ C

C Set of network channels or links

c Network link or channel of the channel set C

C∗ Set of switch-to-switch channels

C_i Set of network channels of the i-th virtual layer L_i

C_L Set of faulty switch-to-switch channels

C_B(n) Betweenness centrality value of a network node n ∈ N

D_i^τ Escape paths defining a cycle-free subgraph of D_i

h Path length

H_i Convex subgraph G(N_i^H, C_i^H) for a set N_i^d ⊆ N of network nodes

h_R Routing-induced path length or routing-based hop count between two nodes

I Interconnection network with I := G(N,C)

J Set of batch jobs running simultaneously on a supercomputer

k Maximum number of supported virtual channels (except when used for k-ary n-trees)

N Set of network nodes composed of switches and terminals, i.e., N = S ∪ T

n Network node of the node set N

N_i^d Set of destination nodes being routed with the i-th virtual layer L_i

N_j Set of network nodes belonging to the same batch job j ∈ J

P_{n_u,n_v} Path between two network nodes n_u, n_v ∈ N

Q Queue data structure used within algorithms

R Routing function

R∗ Perfectly balanced routing function (with respect to the edge forwarding index)

R² Coefficient of determination for a linear regression

r_{C′} Link redundancy for the augmented network I′ := G(N, C′)

S Set of switches in the network (subset of N)

T Set of terminal nodes (subset of N)

us Sequence of atomic updates applied to forwarding tables

1 Introduction

Ensuring efficient communication within a supercomputer is an important task to enable progress of other scientific fields. High-speed lossless interconnection networks, such as InfiniBand, are the de facto standard for the fastest computers in the world, see Section 1.1. This thesis advances the state-of-the-art, as listed in Section 1.2, by introducing new concepts for operating these networks, both in terms of hardware and software. Section 1.3 provides the list of preceding articles published by the author of this thesis, and outlines the high-level structure and content of the following chapters.

1.1 Motivation

With Moore's law slowly coming to an end, many information technology domains started to scale out. This global trend is visible from small many-core systems-on-chip, such as the 256-core Kalray MPPA-256 chip [Din+13] or the 1024-core Epiphany-V developed by Adapteva [Olo16], over large-scale data centers and data warehouses [Mic15], to the world's most powerful supercomputers, such as the 40,960-node Sunway TaihuLight system hosting more than 10 million compute cores [Don16]. All these systems, and many more (e.g., wireless sensor networks, the Internet with the emerging Internet-of-Things, etc.), have one common property: the requirement for a network connecting the individual devices. The network, connecting these compute nodes, machines, or appliances, plays a crucial role for successful operation. For high-performance computing (HPC) systems, it can be expected that the number of network endpoints will grow significantly [KBB08; Luc+14], which emphasizes the role of the interconnection network as one of the most critical components in a supercomputer even more.

These HPC networks, as well as the other previously mentioned networks, are largely controlled by routing algorithms which determine how to forward packets. Routing algorithms have to balance multiple, partially conflicting, requirements. For example, they shall provide the best forwarding strategy to guarantee a certain quality of service (throughput, latency, etc.), which is an NP-hard problem in general, while minimizing the runtime of the routing in order to quickly react to failures of network components.

Most supercomputers listed in the TOP500 list of November 2016 [Str+16], 84.6% to be precise, rely on standardized and off-the-shelf interconnection network components. The deployed interconnection technologies are either different high-speed versions of Ethernet [IEE16] (207 systems), different speeds of InfiniBand [Inf15] (188 systems), or Intel's 100 Gbit/s Omni-Path technology [Bir+15] (28 systems), see Figure 1.1a.


Both the system share and the performance share shown in Figure 1.1 are adjusted to include the current No. 1 system, the Chinese 93 Pflop/s Sunway TaihuLight supercomputer, in the InfiniBand (IB) shares, since it utilizes Mellanox's EDR InfiniBand switch chips and host channel adapters [Don16]. Otherwise, these numbers would unfairly belittle the significance of InfiniBand for the performance of the top-tier supercomputers.

[Figure 1.1: Interconnect technology statistics of the TOP500 list of November 2016; (a) interconnect family system share, (b) performance share [in Pflop/s]; Ethernet distribution: 1x 100G, 178x 10G, 28x 1G; InfiniBand distribution: 15x EDR, 149x FDR, 24x QDR; shares adjusted for the IB-based Sunway TaihuLight (No. 1)]

Besides Ethernet or InfiniBand interconnects, a subset of HPC systems makes use of either custom or proprietary interconnects, such as Cray's Gemini interconnect used in Titan [ARK10] (No. 3) or the Tofu interconnect of the K computer [Aji+12a] (No. 7).

To meet the vast data transfer demands of supercomputers in terms of low latency and high bandwidth, these network architectures increasingly embrace functions that enable a tighter interaction between networking hardware and programming models. The best examples are Remote Direct Memory Access (RDMA) in combination with a kernel bypass, and lossless Layer 2 networking, all tailored for these demands. The former enables the programmer to directly instruct the network interface to read and write remote memory without having to issue a kernel call to the operating system, while the latter enables the reliable transport within the network which is needed for RDMA. Such advanced features have been prevalent in the high-speed networking area, such as InfiniBand [SWM03] or Cray's Cascade [Faa+12], but also find their way into Ethernet, which was recently extended with Priority Flow Control at Layer 2 to support RDMA [Inf10; Zhu+15].

Unfortunately, as network topologies grow, failures of switches and connectors or cables become common. As opposed to network endpoints, the wiring complexity and infrastructural demands of cabling, e.g., the arrangement of cable trays, make maintaining complex networks challenging and expensive. Thus, network structures are often kept in place over years and multiple generations of machines while other components such as CPUs or main memory are upgraded. Today, many networks are based on the concept of over-provisioning the hardware, such as installing spare cables in the cable trays or having an inventory of spare parts. Alternatively, they are operated in a deferred repair mode, in which a failed component will not be replaced instantaneously but within a reasonable time frame, such as a business day.

Fail-in-place strategies are common in storage systems when maintenance costs exceed maintenance benefits, such as in large-scale data centers with millions of hard drives. For example, Microsoft owned more than one million servers in 2015 [Mic15], i.e., even an optimistic failure rate of 1% per year and two hypothetical hard drives per server would result in a mean time between failures of 26 minutes (2 million drives at 1% per year yield roughly 20,000 failures per year, i.e., one failure every ≈26 minutes). Instead of replacing the hard drives of a server, the storage system, such as IBM's Flipstone [Ban+08], uses RAID arrays for reliability and a software approach to disable failed hard drives and to migrate the data, until a critical component failure disables the entire server. Hence, a fail-in-place approach for interconnection networks could be similarly beneficial, if supported by the correct combination of network topology and applied routing algorithm.

Routing algorithms for HPC systems have been the topic of many studies, ranging from topology-specific routing algorithms [Rod+09; Zah12] through general deadlock-free algorithms [DS87; Sch+91] and more advanced deadlock-free algorithms balancing the routes [DHN11], to advanced path-caching for quick failover [VSR15]. A good overview is provided by Flich et al. [Fli+12]. Many advanced approaches for application-specific [Pal+06; Kin+09] or topology-specific [Sub+12; Pri+13b; Jok+15] routing and mapping assume idealized conditions such as a regular topology without faulty components, isolated bulk-synchronous applications communicating in synchronized phases, and the absence of system noise. Unfortunately, these assumptions are rarely true in practice. Presumably, integrating additional information about the current state of the supercomputer, such as the state of the batch system (i.e., the current job mix) or application demands, can help guide the network manager in assigning optimized routes on demand, which potentially increases utilization and achievable throughput.

Due to the previously addressed hardware failures and these on-demand software adjustments, topology-aware routing algorithms become increasingly inapplicable in modern supercomputers. Hence, avoiding deadlock situations [CES71; DT03], a requirement for lossless Layer 2 networking, becomes much harder to achieve. A network deadlock arises when a group of packets cannot be forwarded because of unavailable resources, such as buffers or channels, and the packets in the group circularly depend on each other, which disables parts of or even the entire network. Topology-aware routings usually avoid deadlocks algorithmically for many regular topologies, e.g., by restricting the routing to use only a subset of all available channel dependencies, as it is implemented by dimension-order routing (DOR) [RR10]. In contrast, many topology-agnostic routing techniques for HPC and on-chip networks (NoC) use virtual channels to break deadlocks [SLT02; BM06; Shi+09; DHN11]. Yet, all these concepts have limitations:

(1) Topology-aware routing algorithms usually assume perfect topologies and often do not support switch or link failures, see Chapter 4 of this thesis,

(2) Cycle-avoiding routing algorithms often cannot balance routes across the available links and thus limit the globally achievable throughput [SRD00a], and

(3) Routing algorithms based on virtual channel isolation fail when the required number of virtual channels is not available [Fli+12].


1.2 Contributions

This thesis contributes to multiple frontiers of the design and operation of interconnection networks of modern and future HPC systems, and proposes a novel approach to solve the deadlock-free routing problem of these networks, which will greatly influence the next generation of routing algorithms.

Historical failure data of three production HPC systems allows for an analysis which subsequently enables a realistic performance model for a wide range of HPC networks. Initial connectivity simulations reveal that it is insufficient to consider fault properties of the topology or the resiliency of the routing algorithm in isolation. For fail-in-place networks, both properties, and hence the combination of topology and routing, must meet the requirements. The developed flit-level simulation toolchain for InfiniBand networks determines the achievable throughput accurately. An exhaustive study, using the simulator for several real-world topologies and two real HPC systems, shows that the choice of the routing algorithm influences the number of disconnected paths in case of a failure and the quality of service after a reinitialization of the network with missing links or switches. Depending on the underlying topology and routing method, a low number of failing links already reduces the overall throughput by up to 30% (e.g., for small-scale fat-trees), or an HPC system can be operated in fail-in-place mode with almost zero performance degradation, respectively. Overall, the study reveals that fail-in-place network designs can be accomplished with an appropriate combination of topology and routing algorithm, while assuming that a performance degradation up to a predefined threshold is tolerable. Changing the used routing method of the two investigated HPC systems would not only increase their network throughput by up to 2.1x or 3.1x for the fault-free network, respectively, but also improve their fail-in-place characteristics.

The influence of the degrading fail-in-place interconnect on the achievable application throughput can be further reduced by optimizing the routing algorithm to exploit the knowledge about the momentary resource allocation in the supercomputer. The developed scheduling-aware routing (SAR), an enhancement of one of the most appropriate routings for fail-in-place networks, frequently reconfigures the switches' routing tables for intra-job communication of concurrently running applications. This low-overhead routing algorithm has been deployed on a production petascale supercomputer for over one year and improves the system's utilization and its amount of scientific output. Two newly introduced metrics, the effective edge forwarding index and the dark fiber percentage, show the effectiveness of SAR in comparison to the state-of-the-art oblivious routing mechanisms for HPC systems, using simulations with historical resource allocations of two InfiniBand-based supercomputers. The former metric measures the well-established edge forwarding index with respect to the running applications, both within and between allocations, to estimate a theoretical upper bound for the worst-case network congestion. The latter metric, the dark fiber percentage, quantifies the utilization of the network. Unfortunately, some network architectures, such as InfiniBand, rely on in-order packet delivery, a prerequisite for unobstructed operation, which can be compromised by frequent reconfigurations.
Hence, the formalism of per-packet consistent updates is extended for in-order delivery and deadlock-freedom, and a new network update protocol is introduced to achieve the required consistency for InfiniBand hardware.
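As a minimal illustration of the (effective) edge forwarding index used above, the following Python sketch counts how many routes cross each directed channel; the route representation, the intra-job filtering, and the example values are simplifying assumptions for illustration and not the implementation used in this thesis.

from collections import Counter

def edge_forwarding_index(routes):
    """Count, per directed channel (u, v), how many source/destination routes
    traverse it; the maximum over all channels is the EFI of the network."""
    load = Counter()
    for path in routes:                          # path = [n0, n1, ..., nk]
        for u, v in zip(path, path[1:]):
            load[(u, v)] += 1
    return max(load.values(), default=0), load

def effective_efi(routes, jobs):
    """Effective EFI: only count routes whose endpoints belong to the same
    batch job (a simplified notion of intra-job traffic)."""
    intra = [p for p in routes
             if any(p[0] in nodes and p[-1] in nodes for nodes in jobs)]
    return edge_forwarding_index(intra)[0]

# Hypothetical 4-node chain 0-1-2-3 with four shortest-path routes and one job.
routes = [[0, 1, 2], [2, 1, 0], [1, 2, 3], [3, 2, 1]]
jobs = [{0, 1, 2}]                               # node 3 currently runs no job
print(edge_forwarding_index(routes)[0])          # 2: channels (1,2) and (2,1) carry two routes
print(effective_efi(routes, jobs))               # 1: only the two routes inside the job count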


Even more important than preventing deadlocks during the reconfiguration of the network is the deadlock avoidance while a single network configuration is active. A thought experiment visualizes the limitations of state-of-the-art multi-step routing approaches, i.e., the impossibility to achieve deadlock-freedom for shortest-path routing with a limited number of supported virtual channels on arbitrary or faulty topologies. Consequently, a novel routing approach is developed based on the concept of routing within the complete channel dependency graph, which overcomes all limitations (1) – (3) mentioned in Section 1.1. Hence, this routing, called Nue, is capable of distributing the paths across the network to increase the throughput, while reliably avoiding deadlocks. Flit-level simulations with the developed toolchain show that the destination-based Nue routing works on any arbitrary topology, especially fail-in-place networks, with all possible numbers of available virtual channels, including network technologies without support for virtual channels. The conducted complexity analysis, as well as runtime measurements with an implementation of Nue in the InfiniBand subnet manager, demonstrate Nue's competitiveness and broad applicability for InfiniBand and other network architectures.

1.3 Thesis Organization

This thesis, especially Chapters 4 – 6, focuses on flow-oblivious, static, destination-based, unicast routing for supercomputers, and is based on the following three peer-reviewed publications, which have been reformatted and/or partially rewritten; the author's contributions are as follows:

• J. Domke, T. Hoefler, and S. Matsuoka. „Fail-in-Place Network Design: Interaction between Topology, Routing Algorithm and Failures“. In: Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC14). SC ’14. New Orleans, LA, USA: IEEE Press, Nov. 2014, pp. 597–608. ISBN: 978-1-4799-5500-8. DOI: 10.1109/SC.2014.54 The author of this thesis is the main contributor to this publication. In particular, he performed the analysis of failure data, and designed and implemented the simulation framework. Subsequent simulations, analysis of the results, and preparation of the manuscript were done by the author.

• J. Domke and T. Hoefler. „Scheduling-Aware Routing for Supercomputers“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’16. Piscataway, NJ, USA: IEEE Press, 2016, 13:1–13:12. ISBN: 978-1-4673-8815-3. URL: http://dl.acm.org/citation.cfm?id=3014904.3014922 The author is the main contributor to this publication. Besides identifying the disadvantages of the current state-of-the-art HPC routing methods, he was responsible for algorithm and protocol design (including their implementations) to overcome the challenges. The tasks of benchmarking existing systems and conducting wide-ranging simulations with subsequent data analysis, as well as drafting the publication, were carried out by the author.


• J. Domke, T. Hoefler, and S. Matsuoka. „Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing“. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. HPDC ’16. New York, NY, USA: ACM, 2016, pp. 3–14. ISBN: 978-1-4503-4314-5. DOI: 10.1145/2907294.2907313 The author is the main contributor to this published work. Both the new conceptual approach for solving the deadlock problem and its model implementation were contributed by the author. Additionally, he analyzed the results of the executed simulations, proved mathematical correctness and characteristics of the routing function, and prepared the manuscript for publication.

The remaining chapters comprise the following: Chapter 2 reviews the state-of-the-art with respect to network simulation frameworks and routing algorithms. In particular, fault-tolerant and deadlock-free routing algorithms are of interest. Chapter 3 lays the groundwork for the following chapters by providing network-related and routing-related definitions, by introducing numerous network topologies, and by illustrating the inner workings of various routing algorithms, which will be used for later comparisons. Thereafter, Chapters 4, 5, and 6 outline the main contributions in the field of high-speed interconnects by: (1) showing that fail-in-place network operation is feasible and the corresponding throughput degradation manageable, (2) improving the utilization and communication efficiency through scheduling-aware routing, and (3) developing an invariably working deadlock-free routing algorithm. In conclusion, Chapter 7 summarizes the thesis and provides prospects for future work.

2 Related Work

The following two sections review the related work relevant to this thesis in the research domain of high-performance interconnection networks. Section 2.1 outlines state-of-the-art interconnection architectures, topologies commonly used in HPC systems, management and resiliency considerations, as well as simulation frameworks to analyze the effectiveness of the interconnect. Subsequently, Section 2.2 gives a summary of the state-of-the-art deterministic routing algorithms tailored for the application in supercomputers or for the usage in related network architectures, such as Network-on-Chip. In particular, the algorithms' characteristics with respect to achievable latency, throughput, and deadlock-freedom, as well as their applicability to current real-world architectures, are of interest. Table 2.1, provided at the end of Section 2.2, summarizes the attributes of a subset of these algorithms. Readers unfamiliar with the basics of HPC interconnects and routing algorithms are encouraged to review Section 3.1 prior to continuing with this section.

2.1 State-of-the-Art for High-Performance Interconnect Designs

The interconnection networks of the supercomputers in the TOP500 list of November 2016 are essentially dominated by seven different network architectures, e.g., InfiniBand, Gigabit Ethernet, and Intel's Omni-Path, comprising a total of 423 systems. The remaining systems are connected by either IBM's interconnect for the Blue Gene systems [Adi+05; Che+12], Cray's 3-dimensional tori-based Gemini networks and more recent Aries networks [ARK10; Alv+12], the TH Express interconnect used in MilkyWay 2 systems [Pan+14], or Fujitsu's Tofu interconnect [Aji+12a]. Hence, the prevalent topologies used by these supercomputers are fat-trees [PV97] (occasionally deployed as dual-rail fat-tree [Hat11] or tapered fat-tree [Leó+16]), 5-dimensional or 6-dimensional tori in case of Blue Gene systems or Tofu-based systems, respectively, and dragonfly topologies in case of Aries-based HPC systems. Due to the scale of these production systems, in terms of compute node count, fault-free and regular topologies are an exception rather than the norm [SG10], because both network and compute hardware fail frequently.

2.1.1 Resiliency of Networks, Topologies, and Routings

Resiliency of network topologies has been studied extensively in the past. Xu et al. [Xu10] analyzed numerous interconnection topologies, such as De Bruijn graphs, Kautz graphs, meshes, and butterfly networks. However, their analysis of routing algorithms is based on the edge forwarding index, and not on the actual performance delivered for a traffic pattern.

Another comprehensive survey of network fault tolerance was performed by Adams et al. [AAS87]. The authors describe methods to increase fault tolerance through hardware modification of existing topologies, such as adding additional links, ports, or switches. Other attempts to enhance the fault tolerance properties of the network have been performed analytically. One example is the multi-switch fault-tolerant modified four tree network, which aims to improve the single-switch fault-tolerant four tree network [SB02]. Another example is the attempt to solve the problem heuristically with genetic algorithms [Szl06].

Co-design strategies have been proposed, such as the F10 network [Liu+13], where the interconnection topology is optimized for the routing algorithm and vice versa. This includes an increase in alternative paths by using an AB Fat-Tree instead of an ordinary fat-tree, and enhanced fault recovery properties through local rerouting. The routing algorithm for F10, besides optimizing for global load balancing, manages the failure detection and propagates this information to a subset of the nearest neighbors of the failure. This subset is smaller for F10 than in an ordinary fat-tree, allowing for fast recovery, and therefore decreases the potential for packet loss. Okorafor et al. [OL05] proposed a fault-tolerant routing algorithm for an optical interconnect, which assumes a fail-in-place 3-dimensional mesh topology and enhances the throughput. The IBM Intelligent Bricks project [Fle+06] investigates the feasibility of building fail-in-place storage systems and includes considerations about the performance loss in the network due to failed bricks. The throughput degradation in the 6x6x6 3-dimensional torus is simulated for up to 60% failed bricks. However, no other routing algorithm besides the IceCube mesh routing has been tested.

Fault-tolerant topology-aware routing algorithms have been developed and studied in the past, especially for mesh and torus topologies and in the field of Network-on-Chip, e.g., [Suh+95; DYL02; RFR09; Man+11; AA13], but the usability of these algorithms for other topologies has not been tested. Flich et al. [Fli+12] performed a survey of topology-agnostic routing algorithms and compared them not only analytically, but also evaluated their performance with a flit-level simulator, which realizes its message forwarding based on the smallest logical unit of network information, called flit, required for flow control [DT03, 12.1]. The simulations have been conducted on small 2-dimensional mesh and torus networks, up to a size of 16x8. In addition to these regular topologies, faulty meshes and tori with injected link failure rates of 1%, 3%, and 5% were investigated.

2.1.2 HPC Routing Strategies and Network Management

The state-of-the-art routing method for supercomputers is static and flow-oblivious routing, where the packet route is uniquely defined by the source-destination pair, e.g., [Aji+12b; Inf15], but a few alternatives, such as adaptive routing strategies [Ari+10; Alv+12; Bir+15], exist. Switches of adaptively routed networks are capable of adjusting the path towards a destination dynamically in response to changing local or global conditions, such as link load [DT03, p. 189]. Research has shown that oblivious routing can result in local congestion, which increases application runtime, when the routes and the communication pattern do not match properly [HSL08].

Yet, its simplicity and practicality make it the most-used routing mechanism, for example, in all InfiniBand and Ethernet networks, which combined provide ≈58% of the performance of all systems in the TOP500 list. The objective of these flow-oblivious routings, such as Up*/Down* [Sch+91] and fat-tree routing [Zah+10], or the deadlock-free single-source shortest-path routing algorithm (DFSSSP) [DHN11], is to distribute the routes evenly across the interconnection network. However, this static and global balancing approach is only effective when the system is used by a single parallel application.

Optimizing communication performance for multi-user HPC systems has been studied in the past, resulting in many different approaches. Mellanox's proprietary traffic-aware routing, short TARA [Mel16], monitors active applications and the network port counters, and attempts path adjustments when congestion is identified. However, TARA is a reactive optimization and depends on the time to collect and analyze all port counters. Furthermore, TARA's ability to enforce property preserving network updates of the forwarding rules is unknown, hence out-of-order packet delivery might be possible. Application-aware routings, e.g., [ZL92; Kin+09], have been developed, but require detailed knowledge about traffic patterns and injection rates of each application. Additionally, this optimization approach has no notion of the timing behavior of the applications, and might assign too few links to each application assuming they communicate simultaneously.

Optimizing the job-to-node assignment via the batch system, while considering the network topology [YCM06; HS11; BDM15], is feasible for regular networks, but assumes an idealized routing or ignores it completely. Performing a routing-aware job-to-node mapping [ATB14] by the batch job scheduler has been proposed as an alternative. However, the computational complexity to solve the mapping problem is usually infeasible for an online scheduling of batch jobs, or fast and imprecise heuristics need to be used. In addition, this job-to-node mapping might increase the system's fragmentation, or it is hindered by the logical separation of the system through batch queues.

Changing the forwarding rules of the entire supercomputer, either in reaction to a hardware failure or due to desired optimizations, can violate certain network properties, such as in-order delivery. Multiple network update protocols tackling these problems have been developed in the past, but the solutions are usually specific to certain technologies and their properties, such as for the Border Gateway Protocol (BGP) [Fra+07; RZC11] or for software-defined networking (SDN) [Rei+12; McC+15], to name but a few. Yet, no property preserving network update protocol tailored to the conditions and requirements found in lossless InfiniBand fabrics exists, which is the focus network architecture of this thesis.
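To make the destination-based, flow-oblivious forwarding model discussed at the beginning of this subsection concrete, the following minimal Python sketch contrasts an InfiniBand-style linear forwarding table (one egress port per destination) with a hypothetical source/destination-based table; the switch, port numbers, and table contents are illustrative assumptions and not taken from any real system.

# Hypothetical 4-port switch; destinations are addressed by LIDs 1..4.
lft = {1: 0, 2: 1, 3: 2, 4: 2}   # destination-based (LFT-style): one egress port per destination

def forward_destination_based(dst):
    # The route is fixed by the destination alone, independent of the source.
    return lft[dst]

# Source/destination-based forwarding: the egress port may differ per source,
# which enables finer balancing but exceeds what a plain LFT can express.
sd_table = {(1, 3): 3, (2, 3): 2}

def forward_source_destination_based(src, dst):
    return sd_table.get((src, dst), lft[dst])

print(forward_destination_based(3))            # port 2, regardless of the source
print(forward_source_destination_based(1, 3))  # port 3, only for packets from source 1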

2.1.3 Simulation Frameworks for Network Analyses

Academia has developed various discrete event simulation frameworks to model network protocols or to design and gauge network architectures, such as ns-3 [YHZ09], OMNeT++ [VH08], and ROSS [CBP00], to name but a few. Subsequently, researchers utilize these frameworks to construct modules for individual architectures or degrees of accuracy. For example, a flit-level accurate simulation model for InfiniBand [Ope07], developed by Mellanox and based on OMNeT++, is used to analyze and improve congestion control in IB networks [Rei+11].

Similarly, Yebenes et al. [Yeb+13] tried to implement an efficient packet-level IB simulator for very large topologies used in exascale supercomputers based on OMNeT++, but only showed results for 28x28 meshes and tori. Alternatively, less accurate, but highly efficient packet-level simulations are enabled by the CODES network and storage simulator (based on ROSS) [Cop+11]. Multiple regular topologies, such as (multi-rail) fat-trees using adaptive and deterministic routings [Liu+15; Wol+17], Slim Flies [BH14; Wol+16], or tori [Mub+14] with millions of compute nodes, have been studied using the CODES simulator. Furthermore, many scientists implemented their own simulator, e.g., the cycle-accurate interconnection simulator for Network-on-Chip called BookSim [Jia+13], or the IB simulator used by Sancho et al. [San+02].

The OMNEST framework, a commercial version of OMNeT++, is used by Denzel et al. [Den+08; BRM12] to design a parallel network simulator, interfacing with Dimemas and Paraver, which is able to read and replay communication traces of MPI applications. Despite being a flit-level simulator, their Venus toolchain is not tailored to a specific interconnection technology, such as InfiniBand. Both the simulated regular topologies with up to ≈4,000 compute nodes, i.e., 2/3-dimensional mesh/torus, hierarchical full mesh, and a fat-tree, and the forwarding rules used for these networks are generated by tools comprised in the Venus toolchain. Multiple routing strategies, such as random routing, source routing, and distributed routing, have been implemented.

Chen et al. [Che+16] compare state-of-the-art and emerging topologies, e.g., single-rail 3-level fat-trees, Mellanox's Dragonfly+, and Hewlett Packard's HyperX topology [Ahn+09], etc. These network topologies are validated by means of synthetic workloads, such as nearest neighbor and all-to-all traffic patterns, and realistic application communication patterns collected for selected mini-apps. Despite assuming different routing strategies (deterministic shortest-path routing, random routing, and adaptive and Valiant-like routing [DT03]), the authors ignored the influence of network component failures. Jain et al. [Jai+16] evaluate the achievable performance of multi-application workloads simultaneously running on a 5-dimensional torus, dragonfly, and fat-tree topology, respectively. The authors combined the CODES framework with BigSim [Zhe+05], which allows the simulations to replay the exact communication pattern of the individual applications, as well as to model their time spent in computation phases. The implemented adaptive routing algorithms have been tailored for the specific topologies, which prevents comparisons. Furthermore, limiting the analysis to certain applications only, and limiting the switch radix to 13 ports for the torus, while allowing the fat-tree to utilize a radix of 72, might not result in generally valid conclusions with respect to the available design space for interconnection networks.

2.2 State-of-the-Art for Deadlock-free Routing Approaches

As outlined in Chapter 1, avoiding deadlocks in the interconnection network, induced by the routing algorithm, is one of the most crucial properties of these algorithms or the underlying topology, respectively. However, not all routing algorithms, such as MinHop [Mel13] or the single-source shortest-path routing [HSL09], take deadlocks into consideration.

Therefore, methods for deadlock prevention, avoidance, and recovery for networks have been studied extensively in the past, classifiable into four categories:

1. Solving deadlocks through architecture features,

2. Analytical deadlock-prevention, either through appropriate topology design or routing,

3. Deadlock-preventing routings using virtual channels, and

4. Alternative solutions, usually inapplicable to existing architectures.

For example, for routing algorithms which basically ignore the problem, the interconnection architecture defines a packet lifetime, e.g., InfiniBand [Inf15, 12.9.8.4], as a fallback for category 1. Clearly, the disadvantage of this approach is that it only resolves a single deadlock, which might form again and again, and decreases the achievable throughput. Furthermore, it complicates the design of networking hardware, such as switches and routers, which have to monitor and enforce packet lifetimes.

Representative routings of category 2 are essentially derivatives of Up*/Down* routing [Sch+91], which prohibit certain turns within the topology to avoid the creation of a routing deadlock. Alternatively, the correct combination of chosen topology and (usually) deadlock-prone routing, such as any loop-free routing on binary trees or dimension order routing on n-dimensional meshes [RR10], ensures deadlock-free transmission in the network. However, prohibiting turns, and therefore restricting the number of possible routes, increases path lengths, hence latency, and overloads nodes, resulting in diminished throughput [SRD04].

The algorithms of the third category rely on virtual channels [DS87] or virtual layers [Lys+06], respectively, and assign paths (partially or fully) to different virtual channels to create an acyclic, routing-induced channel dependency graph. While overcoming the path length and throughput constraints of the previous category, this approach requires more complex switch and router designs [San+02], and is potentially unbounded in the number of required virtual channels [DHN11; Fli+12] to achieve deadlock-freedom. However, current interconnection hardware only supports a limited number of virtual channels (VCs). For instance, obtainable InfiniBand network components support at most eight VCs for data traffic, called virtual lanes, while the InfiniBand standard caps the theoretically supported number to 15 data VCs and one VC for subnet management traffic [Inf15, 7.6]. The following Sections 2.2.1 and 2.2.2, as well as Table 2.1, review and summarize a subset of the wide variety of (deadlock-free) deterministic topology-aware and topology-agnostic routing algorithms, respectively.

Category 4 comprises multiple different approaches to enforce deadlock-freedom within the interconnection network. Examples are the “controller principle” [Tou80], where a global entity oversees all network components, and controls packet injection/consumption rates and the forwarding of individual packets to avoid forming a deadlock. Obviously, this approach is unsuitable for modern large-scale interconnects, as well as for the latency requirements in high-performance computing. Other approaches for deadlock-freedom are usually inapplicable due to missing features of state-of-the-art network architectures, such as “destination renaming” [LFD01], which easily exceeds the available memory to store forwarding tables in the switches, or “bubble flow control” [Pue+99; WCP13], which aims at controlling the filling level of port buffers to ensure that at least one packet can be forwarded at any given time.
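As a small illustration of the turn prohibition used by Up*/Down*-style routings of category 2, the following Python sketch checks whether a path ever takes an "up" channel after a "down" channel with respect to levels derived from a spanning-tree root; the level assignment, the tie handling for same-level links, and the example topology are simplifying assumptions and not the exact rule set of any published algorithm.

def updown_allows(path, level):
    """Reject any path that uses an 'up' channel after a 'down' channel.
    'level' maps a node to its distance from the spanning-tree root
    (smaller level = closer to the root); same-level hops count as 'down'."""
    went_down = False
    for u, v in zip(path, path[1:]):
        going_up = level[v] < level[u]
        if going_up and went_down:
            return False        # forbidden down -> up turn
        if not going_up:
            went_down = True
    return True

# Hypothetical 2-level tree: root r, switches a and b, terminals x and y,
# plus an assumed cross link between x and y.
level = {"r": 0, "a": 1, "b": 1, "x": 2, "y": 2}
print(updown_allows(["x", "a", "r", "b", "y"], level))  # up, up, down, down -> allowed
print(updown_allows(["a", "x", "y", "b"], level))       # down, down, up -> rejected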


2.2.1 Topology-aware Routing Algorithms

Many of the proposed topologies for supercomputers and other interconnected systems come with a fitting (deadlock-free) topology-aware deterministic routing. For example, Zahavi et al. [Zah10] proposed D-Mod-K routing for parallel port fat-trees. Kim et al. [KBD07] suggested dimension-order routing as a possibility for flattened butterfly NoC topologies. The randomized partially-minimal routing has been tailored for 3-dimensional meshes [RL08] to improve the worst-case throughput, but requires 2 virtual channels for deadlock-freedom. The three topology-aware routings investigated in this thesis, i.e., dimension-order routing (DOR) [RR10] for meshes, Torus-2QoS routing [Mel13], and fat-tree routing [Zah+10], will be explained further in Section 3.3.

Generally, the disadvantage of topology-aware routings is their limited fault-tolerance, and therefore limited applicability to large-scale networks, where a number of faults impedes the algorithm's ability to identify the remaining graph as a supported topology. However, most of these algorithms calculate the deadlock-free forwarding rules for the network faster in comparison to topology-agnostic algorithms, while providing highly optimized routing for low latency and high message throughput.

2.2.2 Topology-agnostic Routing Algorithms

A well-known example of algorithms that avoid creating cycles in the channel dependency graph (CDG) is Up*/Down* routing [Sch+91]. Up*/Down* prohibits a route from using an “up” direction after a “down” direction. This approach does not necessarily use shortest paths or load-balance routes efficiently. Indeed, the root often becomes a bottleneck in practice. The algorithms UD_DFS routing [SRD00a], L-turn routing [Koi+01], and segment-based routing (SR) [Mej+06] are based on Up*/Down* and try to reduce or balance the routing restrictions to increase the path balancing across the network. For network technologies where the next channel in each routing step is chosen based on the source and destination node, the Multiple Up*/Down* routing (MUD) [LS01; Fli+02] can increase the path balancing. For similar network technologies without virtual channel support, the Tree-turn routing [ZC12] or FX routing [SRD00b] can be used. For example, Tree-turn adds two more directions to the four directions used by L-turn routing, which reduces the number of prohibited turns further to increase the balancing.

Another set of algorithms breaks cycles in the channel dependency graph with virtual channels. The destination-based routing algorithms DFSSSP [DHN11] and LASH [SLT02] operate similarly in terms of breaking the cycles, i.e., searching for cycles in the CDG and moving individual paths to other virtual layers. Unfortunately, both algorithms might suffer from a limited number of available virtual channels; therefore, LASH-TOR [Ske+04] enhanced LASH routing to use Up*/Down* in the last virtual layer if the routes in this layer form a cycle which cannot be resolved. This can result in multiple outgoing ports at a switch for a single destination, hence LASH-TOR is not a destination-based routing in the general case, and therefore inapplicable to most network architectures, especially InfiniBand.

Kinsy et al. [Kin+09] proposed two application-aware routings, called bandwidth-sensitive oblivious routing (with minimal routes), or BSOR(M).

BSOR operates within the CDG to calculate the routes, while randomly deleting edges from the CDG to form an acyclic channel dependency graph. The algorithm solves a multi-commodity flow problem, based on the demands of the application, with a mixed-integer linear programming algorithm for small networks. For large networks, BSOR uses Dijkstra's algorithm as a heuristic on a weighted and acyclic channel dependency graph for each source/destination pair to balance the application traffic. In contrast, BSORM routing calculates the routes within the network and breaks cycles afterwards, resembling the method of DFSSSP and LASH. BSOR(M) is designed for network technologies with forwarding based on source and destination, and is therefore inapplicable to InfiniBand, for example. The same holds for smart routing [CKR96]. The approach of smart routing is to calculate the shortest paths and investigate the routing-induced channel dependency graph for cycles, while storing which path induced which edge in the CDG. A cycle search in the CDG subsequently cuts the edge of a cycle which minimizes the average path length after recalculating the paths inducing this edge. While smart routing can be used for technologies without virtual channels, the computational cost, which is O((#switches)^9), is too high for a practical use in large-scale networks.
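The cycle-breaking step shared by DFSSSP and LASH, i.e., searching the routing-induced CDG for cycles and moving individual paths into another virtual layer, can be sketched as follows in Python; the path representation, the greedy victim selection, and the use of the networkx library are simplifying assumptions and do not mirror the actual OpenSM implementations.

import networkx as nx

def cdg_of(indexed_paths):
    """Channel dependency graph: nodes are channels (u, v); a path
    ... -> u -> v -> w -> ... adds the dependency edge (u, v) -> (v, w)."""
    g = nx.DiGraph()
    for idx, p in indexed_paths:
        chans = list(zip(p, p[1:]))
        for c1, c2 in zip(chans, chans[1:]):
            if not g.has_edge(c1, c2):
                g.add_edge(c1, c2, paths=set())
            g[c1][c2]["paths"].add(idx)
    return g

def assign_virtual_layers(paths, max_layers=2):
    """Greedy, DFSSSP/LASH-like assignment: while a layer's induced CDG is
    cyclic, move one path that contributes to a cycle into the next layer."""
    layer_of = {i: 0 for i in range(len(paths))}
    for layer in range(max_layers):
        while True:
            members = [(i, paths[i]) for i, l in layer_of.items() if l == layer]
            g = cdg_of(members)
            try:
                cycle = nx.find_cycle(g)
            except nx.NetworkXNoCycle:
                break                              # this layer is deadlock-free
            c1, c2 = cycle[0][0], cycle[0][1]
            victim = next(iter(g[c1][c2]["paths"]))
            if layer + 1 >= max_layers:
                raise RuntimeError("not enough virtual layers to break all cycles")
            layer_of[victim] = layer + 1           # move the path to the next layer
    return layer_of

# Hypothetical 4-node ring with clockwise 2-hop routes: the layer-0 CDG is cyclic,
# so one of the four paths must be moved to the second virtual layer.
ring_paths = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(assign_virtual_layers(ring_paths, max_layers=2))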

Table 2.1: Qualitative comparison of existing topology-aware and topology-agnostic routing algorithms for supercomputers; highlighted in red: limitations and shortcomings. Compared are DOR, Torus-2QoS, fat-tree, MinHop, Up*/Down*, MUD, SSSP, DFSSSP, L-turn, LASH, LASH-TOR, SR, smart routing, BSOR(M), and the novel approach of this thesis with respect to the supported topology, latency, throughput, deadlock-freedom, number of needed virtual channels, fault-tolerance, and time complexity. Complexities are given for a network I = G(N,C) consisting of |N| nodes interconnected by |C| links [DHN11; Fli+12].
* Requires knowledge about the bandwidth demands of the application(s) to optimize the forwarding rules.
† Inapplicable to current interconnection hardware, such as InfiniBand, because the destination-based routing criterion might be violated.
$ Can require more virtual channels than there are available and hence renders the algorithm unusable to calculate deadlock-free routes for the entire network.
‡ This approach will be the subject of Chapter 6 and can be further tuned by applying the knowledge gained in Chapter 5, as outlined in the conclusion of this thesis.

3 Background, Assumptions, and Definitions

This chapter formalizes network-related terms, which will be used throughout the thesis to accurately design, enhance, and compare network topologies and routing algorithms. Section 3.1 defines the terms interconnection network, path, and routing function and introduces common metrics. Subsequently, Section 3.2 gives an overview of a subset of network topologies which have been proposed for use in HPC systems. Furthermore, a number of modern real-world installations and their topologies will be presented. The basic operating principles of a selection of routing algorithms will be given in Section 3.3. These algorithms are used throughout the world in production supercomputers and therefore serve as reliable baselines for later comparisons.

3.1 Interconnection Networks and Routing Algorithms

First and foremost, an exact characterization of the network is needed when working on network topology design, as described in Section 3.1.1. This graph-based specification, see Definition 3.1, forms the basis for subsequent terms, such as path or route, and the derived routing functions.

3.1.1 Network-related Definitions

Within the scope of interconnection networks for high-performance computers [DYL02], the assumption is that a network is a multigraph in which adjacent network devices are connected by one or more full-duplex channels (or links). These full-duplex links can be used simultaneously in opposing directions without interference or throughput degradation. Furthermore, the link capacity, i.e., the maximum achievable throughput, in the network is assumed to be uniform and constant over time.

Definition 3.1 (Interconnection Network). An interconnection network I := G(N,C) is a connected multigraph with the node set N and multiset C of directed (multi-)channels. A node nx ∈ N is called a terminal if and only if there exists exactly one ny with (ny,nx) ∈ C; otherwise nx is called a switch. Furthermore, let C∗ ⊆ C be the largest possible subset with ∀cnx,ny ∈ C∗ : nx,ny ∈ N being switches, denoting the set of inter-switch links of network I.

If exact device information is required, e.g., for figures, then ordered pairs (nx,ny), or cnx,ny for short, are used; otherwise a simpler ci notation denotes channels. In the following, S is used to denote the set of switches and the set T is used for terminals, hence N = S ∪ T with S ∩ T = ∅. Figure 3.1 shows an example network consisting of five switches and no terminal nodes.



Figure 3.1: 5-node interconnection network I = G(N,C); N = S ∪ T with |S| = 5, |T| = 0; full-duplex channels C allow bi-directional, simultaneous, non-interfering transmission

For a more precise delimitation, terminals can be further subdivided into compute nodes, executing scientific applications in the supercomputer, and supplementary nodes, responsible for storage, administration, etc. Nodes in the network communicate via messages of arbitrary length, which are transferred as a payload of one or more network packets from a sending node to a receiving node. The packet routes, see Definition 3.2, for all nodes to all other nodes are defined by a routing function, see Definition 3.4, within the multigraph representing the network. While the mathematical concept of cyclic paths is valid, for the remainder of the thesis the absence of cycles, or so-called routing loops [SKC14, pp. 230-231], is required, see Definitions 3.2 and 3.4. Otherwise, since network switches or routers commonly cannot distinguish between the first or additional passes of a packet, a packet might be trapped in an infinite forwarding cycle through the network.
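The multigraph of Definition 3.1 can be represented directly in software. The following minimal sketch, in plain Python and without any specific library, stores the channel multiset as a list of ordered pairs and derives the terminal/switch classification from it; the 5-switch bidirectional ring at the end is a hypothetical example in the spirit of Figure 3.1, not necessarily the exact topology shown there.

class Network:
    """Interconnection network I = G(N, C) as a directed multigraph."""

    def __init__(self, nodes, channels):
        self.nodes = set(nodes)          # node set N = S (switches) plus T (terminals)
        self.channels = list(channels)   # multiset C of directed channels (n_x, n_y)

    def is_terminal(self, n):
        # terminal iff exactly one n_y exists with (n_y, n) in C (Definition 3.1)
        return len({src for (src, dst) in self.channels if dst == n}) == 1

    def switches(self):
        return {n for n in self.nodes if not self.is_terminal(n)}

    def inter_switch_links(self):
        # C*: all channels whose endpoints are both switches
        s = self.switches()
        return [(a, b) for (a, b) in self.channels if a in s and b in s]

# Hypothetical 5-switch example: a bidirectional ring, hence no terminals attached.
ring = [("n%d" % i, "n%d" % (i % 5 + 1)) for i in range(1, 6)]
net = Network(["n%d" % i for i in range(1, 6)], ring + [(b, a) for (a, b) in ring])
assert net.switches() == net.nodes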

Definition 3.2 (Cycle-free Route). A route (or path) Pnx,ny of length h ≥ 1 from node nx to ny in the network I = G(N,C) is defined as a sequence of channels (c1,...,ch) =: Pnx,ny under the condition that {c1,...,ch} ⊆ C, c1 := (nx,·), ch := (·,ny), and if cq = (·,nz) then cq+1 = (nz,·) for all 1 ≤ q < h. The path Pnx,ny is considered to be cycle-free if for all 1 ≤ p,q ≤ h, with p ≠ q and cp = (·,nu), cq = (·,nv), it holds that nu ≠ nv. Let Pˢnx,ny denote a shortest path from nx to ny.

Depending on the routing functions, a categorization into shortest-path routing algorithms, i.e., the algorithm calculates Pˢnx,ny for all pairs of nodes, and non-shortest-path routings is possible, as outlined in Chapter 2. Many of the proposed and state-of-the-art routing algorithms optimize for shortest paths to minimize the latency observed by packets. However, the shortest-path criterion enforces limitations with respect to deadlock-freedom, which will be further analyzed in Chapter 6. For a fair comparability of the different topologies in the following chapters, especially Chapter 4, it is frequently essential to add redundant links in the network to utilize all available switch ports. Hence, the regular graph-based topology is augmented to an undirected multigraph. As a result, the number of hosts, switches, and links can be balanced across the different investigated topologies. The link redundancy rC′ , see Definition 3.3, which increases the edge-connectivity, but not the vertex-connectivity, is defined as the smallest number of parallel links between two adjacent switches. The two network metrics vertex-connectivity and edge-connectivity will be further explained in Section 3.1.3.


Definition 3.3 (Network Link Redundancy). Assuming an interconnection network I = G(N,C), then the augmented interconnection network with redundant channels or links is given by I′ := G(N,C′), with |{c′nx,ny ∈ C′}| ≥ |{cnx,ny ∈ C}| for all nx,ny ∈ S, and the (minimal) link redundancy rC′ for the topology is defined by

rC′ := min_{nx,ny ∈ S} ⌊ |{c′nx,ny ∈ C′}| / |{cnx,ny ∈ C}| ⌋

Unless otherwise stated in later chapters, the interconnection networks and topologies defined in Section 3.2 have not been altered from their original specification. Hence, the default value for all these topologies is a node/link redundancy of r = 1.
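As a small illustrative sketch (plain Python; the channel lists and the 3-switch ring are hypothetical), the minimal link redundancy rC′ of Definition 3.3 can be computed by counting parallel inter-switch channels in the original and the augmented network:

from collections import Counter

def link_redundancy(channels_orig, channels_aug, switches):
    # r_C' = min over adjacent switch pairs of
    #        floor(#parallel augmented channels / #parallel original channels)
    orig = Counter(c for c in channels_orig if c[0] in switches and c[1] in switches)
    aug = Counter(c for c in channels_aug if c[0] in switches and c[1] in switches)
    return min(aug[pair] // orig[pair] for pair in orig)

# Doubling every inter-switch link of a 3-switch ring yields r_C' = 2:
sw = {"s1", "s2", "s3"}
base = [("s1", "s2"), ("s2", "s3"), ("s3", "s1")]
assert link_redundancy(base, base * 2, sw) == 2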

3.1.2 Routing-related Definitions

Generally speaking, the routing configuration is part of the switching rules for a network, which define a set of valid forwarding rules and packet modifications along the path to the destination. The routing function R, as defined hereafter, specifies the paths from all nodes to all other nodes in the network I.

Definition 3.4 (Destination-based and Cycle-free Routing). A routing function R : C × N → C for a network I := G(N,C) assigns the next channel cq+1 of the route depending on the current channel cq and the destination node ny. The ports of the switch, connecting these two channels, are categorized into ingress port for cq and egress port for cq+1, respectively. For multigraphs, the assumption is that the next channel cq+1 is unique among the existing parallel channels of a multi-channel. Furthermore, a routing R is considered to be destination-based if the channel cq+1 is unique (denoted by the ∃! sign) at each node, i.e., ∀nx ∈ N ∃! c ∈ C : R((·,nx),ny) = c. The routing R is cycle-free if all paths induced by R are cycle-free.

The foregoing definition of the routing function R is required to express channel dependencies later on, although theoretically it is equivalent to R′ : N × N → C. Modern high-performance interconnect technologies, such as InfiniBand or Intel's Omni-Path, encapsulate these forwarding rules for each network switch in so-called linear forwarding tables (LFTs):

Definition 3.5 (Linear Forwarding Tables). The linear forwarding table of a network switch s ∈ S is a per-packet lookup table derived from the routing function R. The destination of a packet is the index for the array and the switch's egress port is stored in it, i.e., egress port = LFT[ny] defined by R((·,s),ny).
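A minimal sketch of this lookup mechanism is shown below (plain Python; the routing function, port numbering, and topology dictionary are purely illustrative and do not correspond to OpenSM's implementation). Each switch's LFT maps a destination to an egress port, and a packet is forwarded hop by hop by consulting the LFT of the switch it currently resides on; the sketch assumes the underlying routing is cycle-free so that the loop terminates.

def build_lft(switch, destinations, routing):
    # routing(switch, dst) yields the egress port of 'switch' for packets to 'dst';
    # the LFT is simply this function tabulated per destination (Definition 3.5)
    return {dst: routing(switch, dst) for dst in destinations}

def forward(dst, lfts, topology, start_switch):
    # topology[switch][port] -> node reached through that port (illustrative only)
    hops, sw = [], start_switch
    while sw != dst:
        port = lfts[sw][dst]         # single table lookup per hop
        nxt = topology[sw][port]
        hops.append((sw, port))
        sw = nxt
    return hops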

These lookup tables are usually programmed by a routing algorithm, either executed by a centralized entity, such as InfiniBand’s subnet manager [Mel13], or executed on each switch individually, resulting in a distributed routing, such as the Border Gateway Protocol (BGP) [Fra+07]. The complete set of all forwarding tables, regardless of being calculated centrally or in a distributed fashion, is called a (global) network state from here on.

Definition 3.6 (Network State). The network state is the set of all forwarding tables of network switches given by a valid routing function R for an interconnection network I.


(a) Ring with shortcut    (b) Corresponding (cyclic) CDG

Figure 3.2: Using a shortest-path, counter-clockwise routing for network I (3.2a) induces the channel dependency graph D (3.2b); Dashed channel dependencies in D form a potential deadlock, induced by paths of length h = 2 (compare to Definition 3.2) which use dashed channels of network I

A routing function R for a given network I is considered to be valid if and only if the routing has these three properties: cycle-free, destination-based, and deadlock-free, so that the routing is applicable to InfiniBand, Ethernet, and many Network-on-Chip designs. While the cycle-free and destination-based properties follow from Definition 3.4, deadlock-freedom cannot be accomplished so easily. In particular, the InfiniBand standard [Inf15, C14-62.1.2] demands the use of deadlock-free routing algorithms. Dally et al. [DS87] postulate Theorem 3.1, which states the necessary and sufficient condition for the deadlock-freedom of a destination-based routing function.

Theorem 3.1. A routing function R for an interconnection network I is deadlock-free if and only if there are no cycles in its corresponding channel dependency graph.

The corresponding channel dependency graph (CDG) is induced by the routing function R, or the network state, respectively, for a given network I as follows:

Definition 3.7 (Channel Dependency Graph). The channel dependency graph for a given network I and routing function R is defined as a directed graph D := G(C,E) whose node set consists of the channel set of I. The edge set of D, denoted by ordered channel pairs (·,·), is induced by the routing function, i.e.,

(cp,cq) ∈ E if and only if ∃ny ∈ N : R(cp,ny) = cq.
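Definition 3.7 and Theorem 3.1 translate directly into a simple check: collect the channel dependencies induced by all routes and search the resulting graph for a cycle. The following plain-Python sketch (hypothetical data structures, not the simulation toolchain used in this thesis) encodes routes as channel sequences, as introduced in Definition 3.2, and uses a depth-first search for cycle detection; the small example at the end routes a three-switch ring strictly clockwise, which closes a cycle in the CDG and is therefore not deadlock-free.

def channel_dependency_graph(routes):
    # D = G(C, E): an edge (c_p, c_q) exists iff some route uses c_q directly after c_p
    deps = {}
    for route in routes:
        for c_p, c_q in zip(route, route[1:]):
            deps.setdefault(c_p, set()).add(c_q)
    return deps

def is_deadlock_free(routes):
    # Theorem 3.1: deadlock-free iff the channel dependency graph is acyclic
    deps = channel_dependency_graph(routes)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def acyclic_from(c):
        color[c] = GRAY
        for succ in deps.get(c, ()):
            if color.get(succ, WHITE) == GRAY:
                return False                 # back edge -> cycle -> potential deadlock
            if color.get(succ, WHITE) == WHITE and not acyclic_from(succ):
                return False
        color[c] = BLACK
        return True

    return all(color.get(c, WHITE) != WHITE or acyclic_from(c) for c in deps)

# Routing a 3-switch ring strictly clockwise closes the cycle c12 -> c23 -> c31 -> c12.
c12, c23, c31 = ("n1", "n2"), ("n2", "n3"), ("n3", "n1")
assert not is_deadlock_free([[c12, c23], [c23, c31], [c31, c12]])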

Figure 3.2b shows the channel dependency graph for the network shown in Figure 3.2a assuming

that a shortest-path, counter-clockwise routing function is used. For instance, the path n2 → n5 via n1

induces the rightmost channel dependency, i.e., the edge between cn2,n1 and cn1,n5 in Figure 3.2b. The “shortest-path” property is considered the primary path selection criterion in this example. Dally et al. [DS87] also introduced the concept of virtual channels (VCs) to solve the deadlock problem for a limited set of network topologies. Actual switching hardware implements these virtual channels c˜ with separated sets of buffers within the same port or channel c ∈ C, respectively. One method for breaking a cyclic channel dependency graph is to conditionally switch between virtual channels along a route, so

called virtual channel transition, to obtain a deadlock-free routing. However, not every interconnection technology supports VC transition, despite supporting virtual channels, as mentioned in Section 1.1. An alternative way of achieving deadlock-freedom on these interconnects, such as InfiniBand, is to combine the VCs into virtual layers [LS01]. Routes are then assigned to individual layers, see Definition 3.8, so that all routes within one layer induce an acyclic CDG. Hence, all layers are acyclic and therefore the routing is deadlock-free.

Definition 3.8 (Virtual Layer). Assume each network channel c ∈ C of I can be split into k virtual channels {c̃¹,...,c̃ᵏ}, with k ∈ ℕ; then the i-th virtual layer Li := G(N,Ci) of the interconnection network I = G(N,C) is determined with Ci := {c̃ⁱ | c̃ⁱ ≅ c for c ∈ C}. The network I and virtual layer Li are identical for k = 1.

In a simplified manner, the concept of virtual channels is comparable to the link redundancy introduced in Definition 3.3, assuming k = rC′ , without the benefit of an increased bandwidth in the network. However, arbitrarily increasing the link redundancy is unfeasible due to the costs required to build high-radix switches (with radix > 128) [Kim+08]. A vast number of virtual channels is not cost-efficient either, hence high-radix switches are combined with virtual channel support. The technique of assigning routes to individual layers is used by DFSSSP [DHN11] and LASH [SLT02] routing (among others). A key challenge with virtual channels is that algorithms that rely on computing shortest paths can require more virtual channels than available, see Section 6.3.3 for further details, and an optimal assignment of routes to layers is an NP-complete problem [DHN11]. This potentially renders these algorithms inapplicable for the given hardware and topology. At the same time, fall-back algorithms, such as Up*/Down* or L-turn routing, can adhere to the available hardware, but will compute drastically worse routings. This challenge motivates the design of alternative or novel routing approaches, such as routing on the channel dependency graph presented in Chapter 6.

3.1.3 Topology and Routing Metrics

A wide range of metrics for network topologies exist that are mostly derived from graph-based metrics, such as switch radix, network diameter, or path length. Moreover, the set of metrics derived from the interconnection technology includes link bandwidth, port-to-port latency, bisection bandwidth, etc. These metrics make a statement about idealized conditions while ignoring the used routing algorithm. However, more accurate routing-based metrics, such as the edge forwarding index, see Definition 3.12, or the effective bisection bandwidth [HSL08], usually require extensive simulations or measurements on real systems. This section provides a selection of metrics which are relevant for the analyses conducted in subsequent chapters. Building resilient and fail-in-place networks requires not only a fault-resilient routing algorithm, see Chapter 4, but also highly connected topologies. One resilience measurement for topologies is the connectivity of the graph G(N,C) representing the network I [Xu10]. The connectivity determines the minimum number of network components which have to be removed in order to disconnect the network,

i.e., to arrive at the worst case scenario, but neither does it estimate the average case nor can throughput degradation be derived from it. Let a path Pnx,ny from nx to ny be a sequence of channels ci ∈ C, see Definition 3.2, connecting a sequence of nodes (nx,v1,...,vh−1,ny), with vi ∈ N and h ≥ 1. The (nx,ny)-vertex cut κ(G;nx,ny) for any non-adjacent pair (nx,ny) of nodes in N is equal to the maximum number of pairwise internally vertex-disjoint paths from nx to ny. For example, the (n1,n4)-vertex cut of the graph shown in Figure 3.2a is two, since only two vertex-disjoint paths exist, i.e., via n5 or via n2 and n3, respectively. Similarly, the (nx,ny)-edge cut λ(G;nx,ny) is equal to the maximum number of pairwise edge-disjoint paths from nx to ny, however without the non-adjacency requirement. Hence, κ(G;nx,ny) or λ(G;nx,ny) count the minimum number of nodes or channels, respectively, which have to be removed to disconnect both nodes. The vertex-connectivity and edge-connectivity of I can be derived from the vertex/edge cuts by determining the minimum among all cuts:

Definition 3.9 (Connectivity and Edge-Connectivity). The (vertex-)connectivity κ(G) of an interconnection network I = G(N,C) is defined as the minimum over all (nx,ny)-vertex cuts

κ(G) = min{κ(G;nx,ny) | ∀nx,ny ∈ N ∧ (nx,ny) ̸∈ C} and its edge-connectivity λ(G) is defined by

λ(G) = min{λ(G;nx,ny) | ∀nx,ny ∈ N}
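Assuming the networkx graph library is available, both connectivity values of Definition 3.9 can be computed directly on the inter-switch subgraph; the channel list below is a hypothetical reading of the ring-with-shortcut network of Figure 3.2a, given as undirected switch-to-switch links.

import networkx as nx

def connectivity(switch_links):
    # evaluate kappa(G) and lambda(G) on the subgraph G(S, C*) of inter-switch links;
    # parallel links are collapsed here, which leaves kappa unchanged
    g = nx.Graph()
    g.add_edges_from(switch_links)
    return nx.node_connectivity(g), nx.edge_connectivity(g)

ring_with_shortcut = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"),
                      ("n4", "n5"), ("n5", "n1"), ("n3", "n5")]
kappa, lam = connectivity(ring_with_shortcut)
assert kappa <= lam          # Whitney's inequality; both equal 2 for this example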

Whitney's inequality states that κ(G) ≤ λ(G) for any graph [Xu10, Theorem 1.5.2]. Unless mentioned otherwise, for this metric the interconnection network I is extended to terminal graphs G(N,C,T), where N \ T = S represents the set of switches in I and T represents the set of terminal nodes in I. The connectivity analysis is performed on the subgraph G(S,C∗), as otherwise κ(G) and λ(G) would always be equal to one for these types of connected graphs. Another important network metric is the observed end-to-end latency. However, this latency varies not only depending on the used routing function, but also depending on the current network load [San+02]. To reduce the degrees of freedom, routing functions are compared by the ideal latency while ignoring any background traffic. This ideal latency can be derived directly from the hop count, or path length respectively, for each routing.

Definition 3.10 (Hop Count). Assuming a given interconnection network I and routing function R, then the routing-based hop count hR for a path between nx and ny is the cardinality of the path sequence induced by the routing function:

hR(Pnx,ny ) := |{ (nx,·) =: c1, R(c1,ny) =: c2, ..., R(ci,ny) =: ci+1, ..., (·,ny) =: ch }|

which is therefore always hR(Pnx,ny ) ≥ h(Pˢnx,ny ) for any pair of nodes.


While hop count accurately quantifies the peak observable latency for communication peers, it cannot be used for more complex communication patterns, such as all-to-all, bit-reversal, or hot-spot traffic patterns [Fli+12]. Both the network topology and the used routing algorithm heavily influence the processing time of message bursts, e.g., the send-deterministic communication patterns found in HPC applications [CGS10]. Hence, analyzing the network throughput, see Definition 3.11, provides an instrument for qualitative and quantitative assertions of topology and routing combinations in later chapters.

Definition 3.11 (Network Throughput). Let ts and te be the times of injection of the first packet into the network and consumption of the last packet belonging to a message burst M, respectively; then the routing-dependent network throughput is

throughput := ( Σm∈M sizeof(m) ) / (te − ts)

the sum of message sizes in the burst divided by the runtime.

When the exact communication pattern is unknown, then another, more abstract method for routing comparisons relies on counting the number of paths passing through a network node, or channel, respectively. The edge forwarding index of individual ports, see Definition 3.12, or the network itself, estimates a theoretical upper bound for the worst-case congestion observable for any communication pattern [Chu+87; HMS89]. Therefore, the objective of all traffic-oblivious routing functions should be to minimize the network's forwarding index and balance the edge forwarding indices of all ports.

Definition 3.12 (Edge Forwarding Index). Let I = G(N,C) be the interconnection network and let R : C × N → C be the routing function that configures the forwarding tables; then the edge forwarding index γ of a channel c ∈ C is the number of routes being forwarded via this channel, i.e.,

γR(c) := |{Pnx,ny | nx,ny ∈ N ∧ R(·,ny) = c ∈ Pnx,ny }|

and accordingly, the edge forwarding index of the network is defined by γ(I,R) := max_{c∈C} γR(c). The above introduced metrics, and derivatives thereof, will be used for meaningful comparisons in Chapters 4 – 6. Other metrics might provide similar insight with respect to the quality of routing functions, but are beyond the scope of this thesis.
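To make Definition 3.12 concrete, the following short sketch (plain Python, with routes again given as channel sequences) counts how many routes traverse each channel and takes the maximum as the network-wide edge forwarding index; it is an illustration only and not the evaluation code used for the later simulations.

from collections import Counter

def edge_forwarding_indices(routes):
    # gamma_R(c): number of routes P_{nx,ny} that are forwarded via channel c
    gamma = Counter()
    for route in routes:
        gamma.update(route)
    return gamma

def network_forwarding_index(routes):
    # gamma(I, R) = max over all channels c of gamma_R(c)
    gamma = edge_forwarding_indices(routes)
    return max(gamma.values(), default=0)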

3.2 Common Network Topologies and Production HPC Systems

The most common topologies used to connect the compute nodes in modern large-scale supercomputers are based on fat-trees (Section 3.2.2) or higher-dimensional tori (Section 3.2.1). A few examples are the fat-tree used by the No. 1 supercomputer of the TOP500 list [Str+16] of November 2016. This 40,960-node system, called Sunway TaihuLight, is located at the National Research Center of Parallel Computer



(a) Mesh(3,3,2) topology (b) Torus(3,3,2) topology

Figure 3.3: Example visualization of a 3-dimensional mesh (3.3a) and torus (3.3b) topology without terminals

Engineering & Technology (NRCPC), China [Don16]. Furthermore, the Oak Ridge National Laboratory (ORNL), USA, is planning the dual-rail fat-tree based Summit HPC system [Oak14]. Last but not least, the K computer at the RIKEN research institution in Kobe, Japan, utilizes a 6-dimensional torus connecting 82,944 compute nodes [Ino10]. However, the TOP500 list also includes other emerging topologies, such as dragonfly-based networks deployed by Cray [Faa+12]. This section provides an overview of common HPC interconnection topologies which will be used in later chapters to compare existing and new routing algorithms.

3.2.1 Meshes and Tori

Meshes are primarily utilized to build Network-on-Chip architectures (NoC) [Pal+06; Man+11; Sod+16], while tori have been used to connect IBM Blue Gene systems [Adi+05; Che+12] over the last years.

A mesh network mesh(d1,...,dn), with di ≥ di+1 for i = 1,...,n−1, of dimension n is defined as an n-dimensional matrix, with each switch having at most 2 · n adjacent switches, i.e., two switches in each dimension [WC04]. Figure 3.3a illustrates a 3-dimensional mesh with d1 = d2 = 3 and d3 = 2. The set of adjacent switches Nᴹi for switch si with dimension indices i = (i1,...,in) is given by

Nᴹi (si) = { sj− ∈ S | j− = (i1,...,ik − 1,...,in) with ik − 1 ≥ 1, ∀k = 1,...,n } ∪ { sj+ ∈ S | j+ = (i1,...,ik + 1,...,in) with ik + 1 ≤ dk, ∀k = 1,...,n }    (3.1)

Similarly, an n-dimensional torus(d1,...,dn), as shown in Figure 3.3b, is defined as an n-dimensional matrix, where each switch has exactly 2 · n neighbors. For tori the boundary constraints are removed and switches are connected in a circular manner, and the set of adjacent switches Nᵀi is defined as

Nᵀi (si) = { sj− , sj+ ∈ S | j− = (i1,...,(ik − 1) mod dk,...,in) and j+ = (i1,...,(ik + 1) mod dk,...,in), ∀k = 1,...,n }    (3.2)
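The neighbour sets of Equations (3.1) and (3.2) can be written down almost verbatim; the sketch below (plain Python, hypothetical helper names) uses 0-based dimension indices ranging from 0 to dk − 1 instead of the 1-based indices used in the text.

def mesh_neighbors(idx, dims):
    # Equation (3.1): step +/-1 in every dimension, but respect the mesh boundary
    res = []
    for k in range(len(dims)):
        for step in (-1, +1):
            j = list(idx)
            j[k] += step
            if 0 <= j[k] < dims[k]:
                res.append(tuple(j))
    return res

def torus_neighbors(idx, dims):
    # Equation (3.2): same steps, but wrap around in every dimension
    res = []
    for k in range(len(dims)):
        for step in (-1, +1):
            j = list(idx)
            j[k] = (j[k] + step) % dims[k]
            res.append(tuple(j))
    return res

print(mesh_neighbors((0, 0, 0), (3, 3, 2)))   # 3 neighbours at a corner of Mesh(3,3,2)
print(torus_neighbors((0, 0, 0), (3, 3, 2)))  # 2*n entries; for d_k = 2 both steps wrap to the same switch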


(a) Full-bisection 4-ary 2-tree with 16 terminals    (b) XGFT(2;2,2;1,2) with 12 terminal nodes

Figure 3.4: Example visualization of a k-ary n-tree topology (3.4a) and an extended generalized fat-tree topology (3.4b)

The set of terminals T is equally distributed over the switch set S. Therefore, each switch has either ⌊|T|/|S|⌋ or ⌈|T|/|S|⌉ attached terminals. Unless mentioned otherwise, the terminal distribution of T is performed similarly for all topologies explained in Sections 3.2.1 – 3.2.4.

3.2.2 Fat-Trees and eXtended Generalized Fat-Trees

The k-ary n-trees [PV97] are a special form of the fat-tree networks. The parameters k and n define the number of child nodes and the height of the tree, respectively. One characteristic of k-ary n-trees is the number of switches per level, which is equal to kⁿ⁻¹, and therefore constant and independent of the specific level of the tree. Assuming each switch s ∈ S is associated with a unique identifier

(ω1,...,ωn−1, l), with ωi ∈ {1,...,k} ∀i and level l ∈ {0,...,n − 1}, then there exists a connection between switch si = (ω1,...,ωn−1, l) and s∗i = (ω∗1,...,ω∗n−1, l + 1) if and only if

∀i, 1 ≤ i ≤ n − 2, i ≠ l : ωi = ω∗i    (3.3)

The switches with level l = 0 are the root switches of the k-ary n-tree. One advantage of the k-ary n-trees is the theoretical full-bisection bandwidth between all nodes in the network. The most generic and variable form of fat-trees are the extended generalized fat-trees (XGFT) [Öhr+95]. The advantage of an XGFT network is the ability to scale the network size depending on the needs, either with focus on high bisection bandwidth or with focus on reducing the costs for switches and links in the higher levels of the tree. The informal and recursive definition of XGFT(h+1,m1,...,mh+1,ω1,...,ωh+1) with height h + 1 is the composition of mh+1 copies of the subtree XGFT(h,m1,...,mh,ω1,...,ωh) and

ω1 · ... · ωh+1 new root nodes. The original root nodes of these subtrees are then connected to the new root nodes in level h + 1 to form the full XGFT. Refer to Öhring et al. [Öhr+95] for the formal definition of the extended generalized fat-trees. Figure 3.4 shows the layout of a k-ary n-tree and XGFT topology. The terminal set T of the network I is evenly distributed among the switches of the lowest level of the tree, i.e., l = n − 1 for k-ary n-trees and h = 0 for XGFTs, respectively.
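As a small sketch of the inter-level connection rule stated in Equation (3.3) (plain Python; the identifier layout and the helper name are hypothetical), a switch (ω1,...,ωn−1, l) connects upwards to a switch at level l + 1 whose digits agree in every checked position:

def karyntree_connected(s_i, s_j, n):
    # s_i = (omega_1, ..., omega_{n-1}, l) connects to s_j at level l + 1 iff the
    # digits agree for all 1 <= i <= n - 2 with i != l (Equation 3.3)
    *w_i, l_i = s_i
    *w_j, l_j = s_j
    if l_j != l_i + 1:
        return False
    return all(w_i[i - 1] == w_j[i - 1] for i in range(1, n - 1) if i != l_i)

# In a 4-ary 2-tree (n = 2) the digit check is vacuous, i.e., every level-0 switch
# connects to every level-1 switch, yielding the complete bipartite core of Figure 3.4a.
assert karyntree_connected((1, 0), (2, 1), n=2)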



(a) Kautz(2,3) graph without terminals (b) Dragonfly(4,0,2,5) connected by complete graphs

Figure 3.5: Example visualization of a Kautz graph (3.5a) and a dragonfly topology (3.5b)

3.2.3 Kautz Graph

A Kautz(d,k) graph [LLS04] is a directed graph, where each node has d outgoing edges. Non-terminal nodes s ∈ S of Kautz(d,k) are associated with a Kautz string α as follows

α ∈ { a1 a2 ... ak | ai ∈ {0,...,d} for 1 ≤ i ≤ k ∧ ai ≠ ai+1 for 1 ≤ i < k }    (3.4)

A switch sα has an outgoing edge to an adjacent switch sβ if and only if

β = a2 ... ak b with b ∈ {0,...,d} ∧ b ≠ ak    (3.5)

The advantages of a Kautz graph are the small diameter and high fault tolerance derived from the degree of d, i.e., two desired properties for HPC interconnects. In case of an undirected Kautz graph, i.e., while using full-duplex links in an interconnection network (see Figure 3.5a for an example), the connectivity is 2 · d as shown in Table 3.1. For the remainder of this thesis, the assumption is that full-duplex links are connecting the switches in a Kautz graph. Terminals are connected in the same manner as explained for meshes in Section 3.2.1.
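A brief sketch of the adjacency rule in Equations (3.4) and (3.5) (plain Python with tuples as Kautz strings, purely illustrative): the successors of a switch are obtained by shifting its string one position to the left and appending any admissible last digit.

def kautz_successors(alpha, d):
    # Equation (3.5): a_1 a_2 ... a_k  ->  a_2 ... a_k b  with  b != a_k
    return [alpha[1:] + (b,) for b in range(d + 1) if b != alpha[-1]]

# Kautz(2, 3) as in Figure 3.5a: strings of length 3 over {0, 1, 2}
assert kautz_successors((0, 1, 2), d=2) == [(1, 2, 0), (1, 2, 1)]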

3.2.4 Dragonfly Topologies

Kim et al. [Kim+08] introduced the dragonfly network topology. The design goals are a low diameter for low latency and a reduced cost for the network, which can be accomplished by a hierarchical topology design. A set of switches is combined into a group to build a virtual high-radix switch. The groups are then connected among each other to create the entire network. For scalability and variability reasons both the intra-group network topology and the inter-group network topology are not predefined. The dragonfly(a, p,h,g) network is defined by the number of switches a in each group, the number


of inter-group links h per switch, and the number of groups g. The parameter p specifies the number of terminals that are connected to each switch. In the following, 1-dimensional flattened butterfly topologies (i.e., a complete graph) are assumed for the intra-group connectivity as well as for the inter-group connectivity [AK11, pp. 34-37]. Therefore, the number of groups is limited to a · h + 1 and the network diameter is five when terminal-to-switch connections are included, otherwise three. For instance, Figure 3.5b illustrates a switch-only dragonfly topology with five groups, a = 4 switches, and h = 2 inter-group links per switch. Another form of dragonfly topologies, which refrains from using 1-dimensional flattened butterflies, are the Aries-based Cascade networks designed by Cray, Inc. [Faa+12]. Instead of using a complete graph to connect the switches within each group, Cascade arranges these Aries switches in a 16 × 6 mesh, i.e., the parameter a = 96 is preset for Cascade networks. Switches in this mesh have an all-to-all connectivity along the x-axis, as well as the y-axis, resulting in 20 intra-group links per switch. Furthermore, each switch hosts p = 4 terminals. The maximum number of inter-group links per switch is 10, allowing a parameter h of up to 960. Since groups are connected by bundles of four links in an all-to-all manner, the Cascade topology can theoretically scale to a total of 241 groups or 92,544 compute nodes, respectively. Dragonfly networks are intended to be used with adaptive routing, but a load-balanced and deadlock-free shortest-path routing helps to avoid unnecessary non-minimal adaptive routing decisions. Hence, the investigation and comparison of different static routings for dragonflies is valuable.

With respect to the previously defined connectivity metric for interconnection networks, see Section 3.1.3, Table 3.1 summarizes the vertex-connectivity of the previously outlined topologies. Jellyfish topologies (cf. Section 3.2.5) and Cascade topologies are omitted in consequence of their complex topology structures. However, both the (vertex-)connectivity and edge-connectivity can be computed using Menger's Theorem [Xu10, pp. 30-31], if required. Increasing the vertex-connectivity of an HPC interconnection network for supercomputers is rather costly, e.g., by applying the dual-rail approach and connecting the terminals to both network planes [Wol+17]. The TSUBAME2.0 supercomputer follows this approach, see the next Section 3.2.6. However, assuming high-radix switches with empty ports, then increasing the edge-connectivity of the network via the usage of a link redundancy rC′ > 1 is less expensive. Hence, network resiliency as well as available network bandwidth improves.

Table 3.1: Vertex-connectivity for the investigated network topologies of Sections 3.2.1 – 3.2.4 (except for random and Cascade)

Interconnect Topology                          | Vertex-Connectivity | Reference
Mesh(d1,...,dn)                                | n                   | [RR10]
Torus(d1,...,dn)                               | 2 · n               | [Öhr+95]
k-ary n-tree                                   | k                   | [Öhr+95]
XGFT(h,m1,...,ω1,...)                          | ω1                  | [Öhr+95]
Kautz(d,k) as undirected graph                 | 2 · d               | [HL08, p. 118]
Dragonfly(a,p,h,g) with 1D flattened butterfly | a + h − 1           | [Kim+08]


3.2.5 Random Topologies

Random and pseudo-random topologies [Koi+03; Koi+12], also called Jellyfish topologies [Sin+12], have been proposed for data centers as well as for HPC interconnects. The construction rule used in this thesis for a random topology is to equally distribute the terminals over the switch set S and create a cyclic graph for S, to ensure initial connectivity. Afterwards, additional links, up to a specified maximum, are distributed by randomly selecting two switches and connecting them. If either of the two switches does not have a free port available, then the pair is discarded and a new pair is selected.
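The construction rule can be sketched in a few lines of plain Python; the switch radix, the link budget, and the retry limit below are illustrative parameters and not values used in the thesis, and the terminal distribution is omitted.

import random

def random_topology(num_switches, radix, extra_links, seed=0):
    rng = random.Random(seed)
    # initial cyclic graph over the switch set guarantees connectivity
    links = [(i, (i + 1) % num_switches) for i in range(num_switches)]
    used = [2] * num_switches                     # two ports consumed by the ring
    added, attempts = 0, 0
    while added < extra_links and attempts < 100 * extra_links:
        attempts += 1
        a, b = rng.sample(range(num_switches), 2)
        if used[a] < radix and used[b] < radix:   # otherwise discard the pair
            links.append((a, b))
            used[a] += 1
            used[b] += 1
            added += 1
    return links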

3.2.6 Real-world HPC Systems

Alongside the theoretical network topologies depicted in the previous section, this thesis focuses on a number of real-world supercomputers and their interconnects:

• Deimos, formerly operated by the Technische Universität Dresden (TU Dresden), Germany,

• A successor of the Deimos HPC system, called Taurus, and

• The TSUBAME2.0/2.5 supercomputer operated by the Tokyo Institute of Technology, Japan.

These HPC systems will be analyzed with respect to their failure statistics, network/routing resiliency, and utilization during production. Subsequent network/routing suggestions and enhancements developed in the following two chapters are demonstrated on these systems. The first HPC system, called Deimos [Juc08], was in operation from March 2007 to April 2012 at the Centre for Information Services and High Performance Computing, TU Dresden. The 726 compute nodes of Deimos hosted up to four AMD Opteron X85 dual-core CPUs and up to 32 GiByte of main memory, depending on the configuration. These compute nodes were connected to the InfiniBand fabric through a host channel adapter (HCA; i.e., InfiniBand's version of a network interface card) operating with single data rate (SDR) [Inf12]. The SDR IB fabric consisted of three 288-port director-class switches which were connected in series, via 30 links per pair of director switches to be precise. Hence, the two outer director switches had 258 links to attach compute nodes, while the middle director switch only hosted 228 nodes. These switches themselves are internally constructed as 24-ary 2-trees using 24-port switch building blocks, resulting in a total of 108 IB switches and 1,653 IB links in Deimos. This three-island topology, see Figure 3.6a, was originally routed by the MinHop algorithm and towards the end of its lifetime routed by (DF-)SSSP, see Section 3.3.1 and Section 3.3.8, respectively. In September 2010, the Tokyo Institute of Technology deployed the TSUBAME2.0 system [GSI12]. Its compute nodes are partitioned into three types, i.e., thin nodes, medium nodes, and fat nodes. Thin nodes host two hexa-core CPUs, as well as three GPUs, while the medium and fat nodes are designed as quad-socket plus single GPU with a main memory capacity of up to half a terabyte. All 1,408 thin nodes are connected to two separate InfiniBand fabrics through two QDR HCAs to facilitate the increased communication demand of the dense compute nodes.


(a) Three-island SDR fabric of the Deimos HPC system    (b) Full-bisection QDR fat-tree of TSUBAME2 (1st rail)    (c) Multi-island QDR/FDR fabric of the Taurus supercomputer

Figure 3.6: Topology configurations of three real-world supercomputers; Colors indicate the utilized IB link speed: blue = SDR, green = QDR, yellow = FDR

The first rail, or network plane, additionally includes connections to the medium and fat nodes, as well as supplementary I/O and administration nodes. Therefore, the total node count attached to the 1st rail is 1,555, as visualized in Figure 3.6b. Both fabrics, which together consist of 501 36-port switches and 7,005 links, utilize QDR IB network components and each assembles a full-bisection bandwidth fat-tree topology. Originally, Up*/Down* routing was used on both rails of TSUBAME2.0, which was changed to fat-tree routing later, see Sections 3.3.4 and 3.3.5. In 2013, the petascale TSUBAME2.0 HPC system was upgraded to the newest GPU architecture and renamed to TSUBAME2.5; however, the InfiniBand fabrics have not been changed. Therefore, the following chapters refer to this system as TSUBAME2. The last multi-purpose petascale supercomputer listed above is a successor of Deimos at the TU Dresden, called Taurus [ZIH13]. Taurus' InfiniBand interconnection network is designed as a multi-island topology. The initial configuration, deployed in July 2013, consisted of three islands and a total of 494 compute nodes, and has been successively upgraded to the current size of six islands with 2,014 compute nodes plus 51 additional supplementary nodes. One 180-node island, depicted by the green color in Figure 3.6c, utilizes QDR IB technology, while the other islands use the FDR technology. Multiple smaller 2-level full-bisection fat-trees are connected by a 216-port director-class switch, as shown in Figure 3.6c, resulting in a total of 211 36-port switches and 4,580 links. The node configuration varies depending on the purpose of the island, e.g., HPC nodes are equipped with two 12-core Intel Haswell CPUs, a single HCA, and up to 256 GiByte main memory. Throughput nodes utilize either Intel's Westmere or Sandybridge microarchitecture, and nodes with additional GPUs or I/O nodes are connected with two uplinks to the fabric, respectively. The entire Taurus fabric was routed with the fat-tree algorithm before switching to scheduling-aware routing, see Chapter 5, in August 2015. Furthermore, additional failure analysis will be conducted in Chapter 4 for a cluster previously hosted by the Los Alamos National Laboratory. The supercomputer is referred to by "Cluster 2" [SG10] in the following. Unfortunately, very little configuration information has been published about this NUMA-based


system, besides the system's failure data. While the deployed topology used to connect the 49 nodes is unknown, these 128-processor NUMA-nodes are attached to the network via 12 network interface cards each. The Cluster 2 at Los Alamos National Laboratory was in operation from April 1997 to August 2005.

Algorithm 3.1: SSSP: Global balancing of routes with single-source shortest-path routing
Input: Network I = G(N,C), with node/switch set N = S ∪ T and C as the set of links
Result: Linear forwarding table (LFT) for each switch
1 foreach switch si ∈ S do
2     foreach compute node nd ∈ T, with (nd,si) ∈ C do
3         Execute Dijkstra's algorithm with source node nd to calculate a spanning tree
4         Update channel weights (add number of new paths to each used channel)
5         Configure LFTs for all switches with si.LFT(nd) = port in reverse direction of the spanning tree

3.3 Selection of Routing Algorithms for HPC

The following state-of-the-art routing algorithms will be used throughout the thesis for comparisons or will serve as the basis for further enhancements, as is the case for the deadlock-free single-source shortest-path routing in Section 5.2. All these routing algorithms are implemented in the open-source subnet manager of InfiniBand (OpenSM) [Mel13]. The same holds for the algorithms developed in this thesis, see Chapters 5 and 6.

3.3.1 MinHop Routing

MinHop routing [HSL09; Mel13] simply calculates the shortest path for each source/destination pair and assigns the linear forwarding tables (LFT) of the switches accordingly. Path balancing, for an optimized network throughput, is only performed on a local level, i.e., multiple links between two switches are balanced out. As long as the network is not bisected, MinHop will be able to find a shortest path for all source/destination pairs, and therefore is able to route arbitrary topologies. However, MinHop ignores the deadlock problem completely, easily resulting in deadlocks on some topologies, which will be shown in Section 4.4.2.

3.3.2 Single-Source Shortest-Path Routing (SSSP)

SSSP routing, developed by Hoefler et al. [HSL09], is based on Dijkstra's algorithm, which builds a spanning tree rooted at one node in the network. Dijkstra's algorithm is performed once for each node in the network to generate a path for each node pair. To balance the paths globally, the SSSP Algorithm 3.1 changes the edge weights of the graph, which represents the network, after the spanning tree for a node has been calculated. Compared to the MinHop and DOR routing algorithms (Section 3.3.3), SSSP is able to achieve higher network throughput due to the local and global path balancing. However, just as MinHop, SSSP routing neglects the routing-deadlock problem, and therefore has limited applicability for lossless interconnection networks.
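The global balancing idea of Algorithm 3.1 can be condensed into the following heavily simplified plain-Python sketch (hypothetical data structures; the real OpenSM implementation differs in many details, e.g., it adds the number of new paths per channel rather than a constant): Dijkstra is rooted once per terminal, forwarding entries are set along the resulting tree, and the weight of every used channel is increased so that trees computed for later terminals prefer less loaded links.

import heapq

def sssp_like_routing(adj, terminals):
    # adj: {node: [neighbour, ...]} for a full-duplex network (symmetric adjacency)
    weight = {(u, v): 1 for u in adj for v in adj[u]}    # channel load estimate
    lft = {u: {} for u in adj}                           # lft[switch][destination] = next hop
    for dst in terminals:
        dist, parent = {dst: 0}, {}
        heap = [(0, dst)]                                # Dijkstra rooted at the destination
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue
            for v in adj[u]:
                nd = d + weight[(v, u)]                  # channel used when v forwards towards dst
                if nd < dist.get(v, float("inf")):
                    dist[v], parent[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        for v, u in parent.items():                      # v forwards packets for dst via u
            lft[v][dst] = u
            weight[(v, u)] += 1                          # penalise the used channel (simplified)
    return lft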


3.3.3 Dimension Order Routing (DOR)

The dimension order routing [RR10, pp. 47-48], or XYZ-routing, is designed for mesh and hypercube topologies and is able to create deadlock-free forwarding tables in the absence of failures. DOR, as it is implemented in the subnet manager, does not check the association of ports to dimensions and assumes a correct cabling, meaning that ports of the x-dimension must have a smaller port number than ports of the y-dimension. Therefore, it supports every type of topology, but it cannot guarantee deadlock-freedom for topologies other than meshes and hypercubes. Multiple links between two switches are balanced similarly to MinHop.

3.3.4 Up*/Down* and Down*/Up* Routing

In the first step, Up*/Down* routing [Sch+91] tries to identify the root switches within the network. This is done via a statistical method, i.e., for each switch the algorithm calculates a distance to the HCAs, i.e., the network interface cards of InfiniBand, and correlates this to the number of shortest paths which would pass through this switch. An initial breadth-first search from the root switches is used to rank non-root switches depending on their distance to the root. Afterwards, Up*/Down* routing starts a breadth-first search (BFS) from each node in the system and updates the LFTs for all switches. While performing the BFS, the algorithm must not change the direction from "down" to "up" in order to avoid potential deadlocks. Limiting the number of usable links and directions potentially increases the congestion in the network. The Down*/Up* implementation of the OpenSM is similar to its counterpart, but it supports tree networks with HCAs connected to multiple levels of the tree.

3.3.5 Fat-Tree Routing

Zahavi et al. [Zah+10] introduced a routing algorithm, called fat-tree routing, which is designed for symmetrical or almost symmetrical k-ary n-trees and optimizes the LFTs for shift all-to-all communication patterns, see Section 4.3.3.2. Similar to Up*/Down*, deadlocks are prevented by placing restrictions on links. The optimization for the shift pattern is a result of the algorithm's ability to monitor the port usage for all downward ports at each switch. This static routing algorithm primarily consists of a main loop and two recursive functions, which are summarized in Algorithms 3.2 – 3.4, to evenly distribute the paths onto the available network links. Each switch is identified by a pair (h,i) depending on the level 0 ≤ h < n in the tree and the index i within this level. Root switches are located at level 0. Algorithm 3.2 iterates over all leaf switches or attached compute nodes, respectively. The compute node in each iteration represents the destination for which the LFTs are updated. The tree-shaped network is then recursively traversed along the least loaded links by ascending one step towards the root level, and then descending down to all leaf switches which are part of the (sub-)tree, similar to a depth-first search, see Algorithms 3.3 and 3.4. Subsequently, the next step towards the root is performed. Like Down*/Up*, fat-tree routing supports HCAs in each level of the tree. Providing a list of the root switches in the network is not necessary for Up*/Down*, Down*/Up*, and fat-tree routing. However, simulations conducted for Chapter 4 showed that it improves the resulting network throughput and especially the resilience of the routing algorithms.


Algorithm 3.2: Fat-Tree: Setting LFTs towards all compute nodes
Input: Network I = G(N,C), with node/switch set N = S ∪ T and link set C
Result: Linear forwarding table (LFT) for each switch
1 foreach leaf switch s_i^h ∈ S do
2     foreach compute node n ∈ T, with (n, s_i^h) ∈ C do
3         Obtain the LID of the compute node n
4         Set s_i^h.LFT(LID) = port for the port connecting the compute node
5         Call assign-down-going-port-by-ascending() for switch s_i^h

Algorithm 3.3: Fat-Tree: Assign-down-going-port-by-ascending
Input: Network I = G(N,C), current switch s_i^h, destination LID
Result: Next switch s_j^(h−1) towards root and modified LFT for s_j^(h−1)
  /* Traverse down to leafs of (sub-)tree below s_i^h */
1 Call assign-up-going-port-by-descending() for s_i^h
2 if h = 0 then return
  /* Traverse least loaded path towards a root switch */
3 Find the least loaded port (s_j^(h−1), s_i^h) ∈ C for all switches at level h − 1
4 Assign s_j^(h−1).LFT(LID) of the remote switch to that port
5 Increase port usage counter by one
6 Call assign-down-going-port-by-ascending() for s_j^(h−1)

Algorithm 3.4: Fat-Tree: Assign-up-going-port-by-descending
Input: Network I = G(N,C), current switch s_i^h, destination LID
Result: Modified LFTs for all s_·^(h+1) attached to s_i^h
  /* Recursively descend the (sub-)tree towards the leafs */
1 foreach switch s_j^(h+1) ∈ S, with (s_j^(h+1), s_i^h) ∈ C do
2     Find the least loaded port between s_j^(h+1) and s_i^h
3     Assign s_j^(h+1).LFT(LID) to this port
4     Increase port usage counter by one
5     Call assign-up-going-port-by-descending() for s_j^(h+1)


Algorithm 3.5: LASH: Deadlock-free path calculation by layered shortest path routing
Input: Network I = G(N,C), number of supported virtual layers k
Result: Linear forwarding table (LFT) for each switch and deadlock-free path-to-virtual layer assignment
1 Use a generalized version of DOR routing, see Section 3.3.3, to calculate shortest paths for network I
2 Combine virtual channels into virtual layers
3 Distribute all node pairs into multiple groups, each deadlock-free (e.g., all paths towards one nd in a group)
  /* Assign groups to virtual layers without closing a cycle in the CDG */
4 while unassigned group exists do
5     foreach i = 1,...,k + 1 do
6         Assign new group to CDG of virtual layer Li
7         if CDG of layer Li is cycle-free then
8             break
9         else
10            Remove group from CDG and virtual layer Li, respectively
11            if i == k + 1 then
12                return false // no deadlock-free assignment possible
13 return true // all potential deadlocks removed

3.3.6 Layered Shortest Path Routing (LASH)

LASH and the following two routing algorithms are based on the method of splitting the physical link into multiple virtual channels. As described by Dally et al. [DS87], virtual channels can be used to avoid deadlocks by breaking the circular dependencies in the channel dependency graph. LASH was designed to generate a deadlock-free routing for arbitrary topologies [Lys+06]. The idea is to combine virtual channels into virtual layers and assign routes to the layers without closing a cycle in the corresponding channel dependency graph, as described in Section 3.1.2. LASH calculates the shortest paths between all switches, and tries to assign each path to the first virtual layer, as represented by Algorithm 3.5. If this would close a cycle in the CDG, then LASH checks the next virtual layer until it finds a suitable one. Since routes are calculated for switch pairs, all traffic between two switches can only utilize one single path in the network.

3.3.7 Torus-2QoS Routing

Torus-2QoS [Mel13] enhances the concept of DOR routing while being resilient to network faults and allowing deadlock-free routing on 2/3-dimensional mesh and torus networks. The dateline concept is used to assign paths to different virtual channels based on whether the route crosses a dateline in the torus or not. Torus-2QoS is able to reroute around a single link failure in each 1-dimensional ring of the torus by using the opposite direction, but is not able to reroute around multiple link failures in the same ring. Switch failures are avoided by taking illegal dimension changes with regard to the DOR algorithm. Those illegal turns are routed via a different virtual channel to avoid deadlocks. An advantage of Torus-2QoS compared to LASH and DFSSSP is that a rerouting, due to link or switch failure, does not change the virtual lane assignment for a given HCA pair and therefore decreases the overhead for connection management in the upper layers, e.g., in the Message Passing Interface (MPI), which is widely used across HPC applications.


Algorithm 3.6: DFSSSP: Search cycles in the CDGs and remove potential deadlocks
Input: Network I = G(N,C), linear forwarding tables, number of supported virtual layers k
Result: Deadlock-free path-to-virtual layer assignment
  /* Initialize first channel dependency graph */
1 foreach path Pnx,ny calculated by Algorithm 3.1 do
2     Update the CDG of virtual layer L1 with path Pnx,ny
  /* Search for cycles in the channel dependency graphs */
3 foreach i = 1,...,k − 1 do
4     repeat
5         Search for a cycle in the channel dependency graph of Li
6         if no cycle then
7             break
8         Identify weakest edge of the cycle
9         foreach path Pnx,ny inducing this weakest edge do
10            Remove path Pnx,ny from the current virtual layer Li
11            Update CDG of virtual layer Li+1 with path Pnx,ny
12    until no cycle found in the CDG of Li
13 Search for a cycle in the CDG of the last virtual layer Lk
14 if cycle found then
15     return false // no deadlock-free assignment possible
16 Balance paths on empty CDGs without additional cycle search
17 return true // all potential deadlocks removed

3.3.8 Deadlock-Free SSSP Routing (DFSSSP)

Domke et al. [DHN11] combine the globally balanced traffic property of the SSSP routing with deadlock-freedom achieved by using virtual layers, see Definition 3.8. The core of the SSSP algorithm, i.e., using Dijkstra's algorithm and updating edge weights, stays the same as described in Section 3.3.2. Since DFSSSP operates on HCA pairs instead of switch pairs, the amount of different paths is larger than for LASH, which makes the iterative approach of LASH impossible. Therefore, DFSSSP assigns all paths to the first layer and checks for cycles in the corresponding CDG afterwards. If a cycle is found, one edge in the cycle will be chosen and all paths which induced this edge are moved into the next virtual layer. The cycle search is repeated until the first layer is acyclic and the algorithm moves to the next layer. DFSSSP, as shown in Algorithm 3.6, iterates over all virtual layers until each layer is represented by an acyclic channel dependency graph.

4 Fail-in-Place High-Performance Networks

Similarly to other high-performance computing hardware, networks are not immune to failures. In this chapter the failure rates of three supercomputers will be analyzed, see Section 4.1. The subsequent Section 4.2 introduces the building blocks and metrics to construct and evaluate fail-in-place networks. Based on these analyses and based on a designed InfiniBand network simulation framework, see Section 4.3, a wide range of network simulations are performed. These simulations in Section 4.4 not only compare the throughput of different topologies, but also analyze the failure resiliency of different routing algorithms with their resulting throughput degradation. The results will be used to justify the consideration of cost-efficient fail-in-place HPC networks, assuming the right combination of topology and routing algorithm is deployed, instead of the state-of-the-art (deferred) maintenance model.

4.1 Failures Analysis for Real HPC Systems

The probability of network failures varies based on interconnect hardware and software, system size, usage of the system, and age. According to available fault statistics, network failures constitute between 2% and 10% of the total number of failures for the HPC systems at Los Alamos National Laboratory [SG10], over 20% for local area networks (LANs) of Windows NT based computers [KKI99], and up to 40% for internet services [OGP03]. Wilson et al. [Wil10] showed a fairly constant failure rate of ≈7 unstable links between switches per month and a total of ≈30 disabled network interface controllers (NIC) over the operation period of 18 months for a Blue Gene/P system. The distribution of network component failures for three production systems, analyzed in this section, shows that network-related software errors are less common compared to other hardware errors, but the actual distribution of switch failures, NIC/HCA failures, and link failures varies heavily, see Table 4.1. The first investigated HPC system, Deimos, in operation between March 2007 and April 2012 at TU Dresden, is a 728-node cluster, introduced in Section 3.2.6, with 108 IB switches and 1,653 links. The system's failure data is not publicly available, but was made accessible by the administrators of Deimos. The 49-node LANL Cluster 2 at Los Alamos National Laboratory was in operation from April 1997 to August 2005, and its failure statistics are available [Los14]. The same holds for the TSUBAME2 HPC system [Glo14]. Installed in September 2010, TSUBAME2 at Tokyo Institute of Technology uses a dual-rail QDR IB network with 501 switches and 7,005 links to connect its 1,408 compute nodes via two full-bisection bandwidth fat-trees. Further details about these systems are provided in Section 3.2.6. Figure 4.1a shows the cumulative distribution function of hardware faults over the life span of all three


(a) CDF for faulty hardware (cumulative probability over the time between failures, in days)    (b) Failures over time (number of failures over the normalized lifetime, in %); both panels show LANL Cluster 2, Deimos, and TSUBAME2

Figure 4.1: Network-related hardware failures of three HPC systems

Table 4.1: Comparison of network-related hardware and software failures, MTBF/MTTR, and annual failure rates for three production supercomputers

Fault Type        | Deimos          | LANL Cluster 2    | TSUBAME2‡
Percentages of network-related failures
  Software        | 13%             | 8%                | 1%
  Hardware        | 87%             | 46%               | 99%
  Unspecified     | —               | 46%               | —
Percentages for hardware only
  NIC/HCA         | 59%             | 78%               | 1%
  Link            | 27%             | 7%                | 93%
  Switch          | 14%             | 15%               | 6%
Mean time between failure / mean time to repair
  NIC/HCA         | X† / 10 min     | 10.2 d / 36 min   | X / 5 – 72 h
  Link            | X / 24 – 48 h   | 97.2 d / 57.6 min | X / 5 – 72 h
  Switch          | X / 24 – 48 h   | 41.8 d / 77.2 min | X / 5 – 72 h
Annual failure rate
  NIC/HCA         | 1%              | X                 | ≪ 1%
  Link            | 0.2%            | X                 | 0.9%*
  Switch          | 1.5%            | X                 | 1%

† Not enough data for accurate calculation.  ‡ Analysis limited to the lifetime of TSUBAME2.0.
* Excludes first month, i.e., failures sorted out during acceptance testing.

systems. Fitting the Weibull distribution to the data results in the following shape parameters for the different systems. For Deimos the shape parameter k is 0.76, and therefore the failure rate decreased over time [Nel04, pp. 36-39,143]. The shape parameters for TSUBAME2 (k = 1.07) and for the LANL Cluster 2 (k = 0.95) indicate a constant failure rate. While the constant failure rate applies to the LANL cluster, it does not hold for TSUBAME2. The hardware faults of TSUBAME2, see Figure 4.1b, show the expected bathtub curve for hardware reliability [VAK10, pp. 188-189].

Common practice in high-performance computing is to deactivate faulty hardware components and replace them during the next maintenance, unless the fault would threaten the integrity of the whole system. This may cause routing algorithms specifically designed for regular network topologies to fail: a failing switch can degrade the regular topology into an irregular topology which cannot be routed by such an algorithm, even though the network is still physically connected. As shown in Table 4.1, the percentage of switch failures, among hardware-related failures of the network, ranges from 6% to 15% for the three systems. This covers issues of a faulty switch as well as faulty ports. One deactivated port of a switch will degrade a regular network topology, but replacing a whole switch for one broken port can be cost and time intensive.

For all network-related failures of the LANL Cluster 2 the calculated mean time between failure (MTBF) is 100.3 h and the mean time to repair (MTTR) is 56.2 min. If broken down into MTBF and MTTR for each network component, i.e., switch, link, and network interface controller (NIC), then the numbers show that the MTTR is the smallest for NICs (36 min), followed by links (57.6 min) and switches (77.2 min). The MTBF for these components varies significantly. While, on average, every ten days one NIC experienced a failure, switches were four times as stable, i.e., their MTBF is equal to 41.8 d. The most reliable network components of LANL Cluster 2 were the links. Only 23 link failures within eight years of operation were recorded, which results in a mean time between failures of more than three months.

The system administrators of Deimos and TSUBAME2 followed a different approach of repairing faulty network components, a so-called partially deferred maintenance. While broken NICs (aka. HCAs in InfiniBand) of Deimos were replaced within an average of 10 min by spare nodes, repairing links or switches took between 24 h and 48 h. Faulty components of TSUBAME2 were disabled within minutes, but for a switch port failure only the port and not the entire switch was disabled. Those disabled parts were replaced within the range of 5 h to one business day, i.e., up to 72 h time to repair. An exception were links and ports between root switches. Those are replaced in the next maintenance period, which is scheduled twice a year.

Table 4.1 summarizes failure percentages, MTBF, MTTR, and annual failure rates of all systems. The failure data of TSUBAME2 indicates an annual failure rate of ≈1% for InfiniBand links as well as 1% for the used switches. Despite the previously mentioned varying failure rate for TSUBAME2's InfiniBand components, an assumed constant annual failure rate of 1% is a fair representative failure rate to simplify further analysis.
In the later sections, extrapolations of the fail-in-place behavior of the network will be conducted. These extrapolations are based on an assumed system lifetime of eight years as well as on the previously described annual failure rates.
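The shape-parameter analysis above can be reproduced with off-the-shelf statistics tools; the following is a minimal sketch (not the evaluation code used for this thesis) that fits a two-parameter Weibull distribution to a list of times between failures and reports the empirical MTBF. The sample data and variable names are purely illustrative.

```python
# Minimal sketch: fit a Weibull distribution to times between failures (TBF)
# and report the empirical MTBF; sample data is hypothetical.
from scipy.stats import weibull_min

tbf_hours = [12.0, 80.5, 150.2, 33.7, 420.0, 95.3, 210.8, 60.1]  # hypothetical TBF samples

# Fix the location parameter to 0 so that only shape (k) and scale are estimated.
shape_k, _, scale = weibull_min.fit(tbf_hours, floc=0)

mtbf = sum(tbf_hours) / len(tbf_hours)  # empirical mean time between failures

# k < 1: decreasing failure rate, k ~ 1: constant, k > 1: increasing failure rate
print(f"Weibull shape k = {shape_k:.2f}, scale = {scale:.1f} h, MTBF = {mtbf:.1f} h")
```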


4.2 Building Blocks for Fail-in-Place Networks

The fail-in-place property, in the context of interconnection networks, can be defined based on the differentiation between "critical" and "non-critical" network component failures. A critical component failure disconnects all paths between two hosts, whereas a non-critical failure only disconnects a subset of the paths between two hosts. The network fail-in-place strategy, as introduced in Chapter 1, is to repair critical failures only, but to continue operation by bypassing non-critical failures. Hence, fail-in-place networks are only viable if the number of critical failures, and the resulting unavailability of the HPC system, can be minimized, and if, additionally, the routing algorithm is able to circumvent non-critical failures. Therefore, the following sections identify the required hardware and software characteristics for such a network.

4.2.1 Resilient Topologies

Theoretically, from the topology point of view, failures can be tolerated up to the point where a set of critical failures disconnects the network physically. Whether non-critical failures can disconnect the network through incomplete routing, and their influence on the overall network throughput, will be the subject of the following sections. A high path redundancy in the topology enhances the fail-in-place characteristic of an interconnection network, as outlined in Section 3.1.3. From the connectivity metric in Definition 3.9 it follows that at least κ(G) switches or λ(G) links have to fail at the same time to create a critical, topology-disconnecting failure for I = G(N,C). Depending on the failure rates of these components, see Table 4.1, the likelihood of this event is slim assuming a high base connectivity of the topology. Hence, some regular topologies are more suited than others with respect to their usage for fail-in-place networks, compare to Table 3.1. If switching between topologies is impossible or undesired, such as abandoning meshes for NoCs, then using redundancy r_C′ is an option to improve resiliency, essentially multiplying κ or λ by r_C′, respectively. The former can be achieved by duplicating the network and connecting the terminal nodes to each network plane, similar to the dual-rail approach followed by TSUBAME2, see Section 3.2.6. A less costly approach is the pure increase of the edge-connectivity by using a link redundancy larger than one, see Definition 3.3. However, this is only possible within the limits of the switch radix, e.g., current InfiniBand switches scale up to 36 ports, while Intel's Omni-Path switches are available with a maximum radix of 48.
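Checking how many simultaneous switch or link failures a topology can absorb before it physically disconnects amounts to computing κ(G) and λ(G) on the switch-level graph. The sketch below does this with networkx for a small illustrative ring of switches; the example graph and the redundancy factor are assumptions for demonstration only.

```python
# Minimal sketch: vertex/edge connectivity of a switch-level topology graph,
# plus the effect of link redundancy r_C' (toy example, not the thesis toolchain).
import networkx as nx

G = nx.cycle_graph(6)            # hypothetical 6-switch ring topology
kappa = nx.node_connectivity(G)  # minimum number of switches whose failure disconnects G
lam = nx.edge_connectivity(G)    # minimum number of links whose failure disconnects G
print(f"kappa(G) = {kappa}, lambda(G) = {lam}")

# Duplicating every inter-switch link (r_C' = 2) multiplies the edge connectivity.
r_c = 2
print(f"lambda(G) with link redundancy r_C' = {r_c}: {lam * r_c}")
```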

4.2.2 Resilient Routing Algorithms

Improving the topology resiliency alone does not necessarily improve the fail-in-place characteristics of the network if the routing algorithm cannot exploit the redundancy. Two classes of deterministic routing algorithms are suitable to build fail-in-place interconnects:

• Fault-tolerant topology-aware routings, and

• Topology-agnostic routing algorithms.


Theoretically, fault-tolerant topology-aware routings are able to compensate for minor faults within a regular topology, while the latter category is fault-tolerant by design due to its support for arbitrary/random topologies. In terms of resiliency, topology-agnostic algorithms are superior to topology-aware routing algorithms when the network suffers from a high percentage of failures, as will be shown in Section 4.4.2. However, techniques to avoid deadlocks in topology-agnostic algorithms can limit their applicability for large networks, such as the 3-dimensional torus(7,7,7) studied in Section 4.4.3. Chapter 6 will take up this deadlock problem again and present a potential solution.

The InfiniBand subnet manager, called OpenSM, currently implements nine deterministic routing algorithms, as listed in Section 3.3. Five of them, more precisely DOR routing, Up*/Down* and Down*/Up* routing, fat-tree routing, and Torus-2QoS routing, are representatives of the topology-aware algorithms. The remainder, namely MinHop routing, SSSP and DFSSSP routing, and LASH routing, are topology-agnostic routing algorithms. While DOR theoretically falls into the category of topology-aware routing algorithms, its implementation in OpenSM can be considered topology-agnostic, but it may not be deadlock-free for arbitrary topologies [RR10].

Deadlock-freedom of the routing algorithm is, besides latency, throughput, and fault-tolerance, an essential property for designing fail-in-place networks. The simulations described in Section 4.4 reveal that even simple communication patterns such as all-to-all can cause deadlocks in the network. Once a deadlock is triggered, the network either stalls globally or, if the interconnect possesses mechanisms for deadlock resolution, delays the execution. Two other critical aspects of routing algorithms for fail-in-place networks are the runtime of the routing algorithm after a failure was discovered and the number of paths temporarily disconnected until the rerouting completes. Both depend on the routing algorithm and will be investigated hereafter.

4.2.3 Resiliency Metrics

To evaluate a fail-in-place network with its combination of topology and routing, common metrics such as bisection bandwidth cannot be applied directly, and analytical metrics such as the edge forwarding index might be too imprecise. Other factors, such as zero-load latency [DT03, p. 20], are less likely to be influenced by a non-critical failure, since the high path redundancy of HPC networks usually prevents a fault-induced increase of the path length, and are therefore beyond the scope of this thesis. The important metrics, in view of the fact that non-critical failures occur, are availability and throughput, and these two factors will be analyzed in the following sections.

4.2.3.1 Disconnected Paths before Rerouting

The runtime of the routing algorithm to compute a new network state depends on various factors, such as the topology, algorithm characteristics (e.g., topology-agnostic vs. topology-aware), and desired properties of the result, e.g., balanced, deadlock-free, shortest-path, etc. These properties often prevent an effective parallelization of the algorithm, as in the case of deadlock-freedom for topology-agnostic routings, which have to perform many acyclicity checks on large graphs [Bad99; Trä13]. The runtimes of state-of-the-art HPC routing algorithms range from milliseconds for small networks to minutes for larger systems with more than 4,096 terminals or HCAs, respectively [HSL09; DHN11; VSR15]. Until new LFTs are calculated, a subset of the network might be disconnected. One important question is the number of disconnected paths after a link or switch has failed. Obviously, this number is constant and equal to |N| − 1 for terminal-to-switch links. Each disconnected path potentially causes application crashes in an HPC system, if upper layer protocols fail to configure appropriate timeouts and/or retry counts.

First, the problem is analyzed analytically using the edge-forwarding indices γ_R(c), introduced in Definition 3.12, for the links in I, i.e., the number of routes passing through link c ∈ C. To simplify matters, assume the routing function R : C × T → C for a terminal graph G(N,C,T), representing the interconnection network I, maps the current channel c_q of a path towards n_d ∈ T to the next channel c_{q+1} for all switches s ∈ S. Let R* be perfectly balanced, then the edge-forwarding index of I using this routing function is assumed to be minimal:

\[
  \gamma(I) := \gamma(I, R^*) = \min_{R} \gamma(I,R) = \min_{R} \max_{c \in C} \gamma_R(c) \tag{4.1}
\]

see [Chu+87; HS97] for further reference. Further, let γ(c) be defined as γ(c) := γ_{R*}(c) for a fixed but arbitrary network I and perfectly balanced routing R*. Removing a (faulty) channel c_f from I will disconnect a path P_{n_x,n_y}, predetermined by R*, if and only if c_f ∈ P_{n_x,n_y}. A disconnected path will not contribute to any edge- or vertex-forwarding index in I. Obviously, the worst case scenario is losing the link carrying the most paths, i.e., the link with γ(I,R) for a given routing, and the best case is losing a channel c_f with γ_R(c_f) = 0. However, the latter case also denotes an imbalanced, and therefore inadequate, routing algorithm. Assuming the set of lost or disconnected links, before a new network state is calculated, is given by C_L := {c_f | c_f ∈ C ∧ c_{f_i} ≠ c_{f_j} for i ≠ j}, then the average expected number of disconnected routes for a single failure C_L = {c_f} can be calculated analytically:

Proposition 4.1 (Disconnected Paths by One Fault). The expected number of disconnected routes E(C_L) due to one channel failure, given by the set C_L = {c_f}, is

\[
  E(\{c_f\}) = \frac{1}{|C|} \cdot \sum_{c \in C} \gamma(c)
\]

with 0 < E(C_L) ≤ γ(I) ≤ γ(I,R) for any routing function R.

Unfortunately, for |C_L| > 1 the number of disconnected paths E(C_L) has to be calculated as a conditional expected value, since some routes through c_{f_i} might have already been disconnected by removing another channel c_{f_j}. Hence, E(C_L) cannot be calculated exactly without knowing the exact graph, routing algorithm, and set C_L. Thus, the following empirical analysis for a 10-ary 3-tree topology will show that this number can easily be approximated. For each number of link and switch faults, 100 topologies are created with randomly injected faults (seed ∈ {1,...,100}), and statistics of the number of lost connections are collected. To simplify matters, a simulated link failure consists of both directed channels within the link. The result, visualized by the whisker plots in Figure 4.2, shows that the measured extremes depend on the used routing algorithm, i.e., DFSSSP, fat-tree, and Up*/Down* routing. The whiskers indicate the lower and upper extremes, while each box shows the first/third quartiles and the median across all 100 simulations for one specific number of faults. Figure 4.2b on the right demonstrates the percentage of disconnected paths when switches are removed, while Figure 4.2a on the left shows the same value for faulty switch-to-switch links. The two extrapolation curves in Figure 4.2 are based on the equation given in Proposition 4.2, showing that E(C_L) can indeed be approximated.

Figure 4.2: Whisker plot for lost connections [in %] while removing links (4.2a) or switches (4.2b) from a 10-ary 3-tree without performing a rerouting; Topology contains 1,024 terminals and 2,000 links connecting the switches

Proposition 4.2 (Disconnected Paths by Multiple Faults). For a small number of faults and a large, arbitrary (but connected) network topology I, the expected number of disconnected routes E(C_L) for |C_L| =: n > 1 can be modeled by the following approximation:

\[
  E(C_L = \{c_{f_1}, \dots, c_{f_n}\}) \approx \frac{n}{|C|} \cdot \sum_{c \in C} \gamma(c)
\]

Moreover, a large number |C_L| ≫ 1 of simultaneous failures, all arising before a new network state is calculated, indicates a more severe problem with the supercomputer requiring administrative action, which is beyond the scope of fail-in-place network operation policies. Knowing the runtime of the routing algorithm and the estimated number of temporarily disconnected paths due to a network failure allows predictions of system availability to be refined. Additionally, upper layer protocols can be tuned to set appropriate timeouts and retry counters, in order to mask the temporary disconnect.
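The quality of the approximation in Proposition 4.2 can be spot-checked without the full simulation toolchain. The sketch below builds a synthetic path table, removes n channels at random many times, counts the disconnected routes, and compares the mean against n/|C| · Σ γ(c); all inputs are toy data, not the 10-ary 3-tree experiment.

```python
# Minimal sketch: Monte Carlo check of Proposition 4.2 on toy data.
# 'paths' maps each source/destination pair to the channels it traverses.
import random
from collections import Counter

random.seed(1)
channels = [f"c{i}" for i in range(20)]
paths = {(s, d): random.sample(channels, k=3)
         for s in range(8) for d in range(8) if s != d}

gamma = Counter(c for p in paths.values() for c in p)      # per-channel forwarding index
n_faults, trials = 2, 10000

approx = n_faults / len(channels) * sum(gamma.values())    # Proposition 4.2 approximation

lost = 0
for _ in range(trials):
    failed = set(random.sample(channels, k=n_faults))
    lost += sum(1 for p in paths.values() if failed & set(p))

print(f"approximation: {approx:.1f}, simulated mean: {lost / trials:.1f} disconnected routes")
```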

4.2.3.2 Throughput Degradation Measurement for Reliability Analysis

Evaluating a network can be accomplished with multiple techniques and metrics. System designers often use analytical metrics, such as bisection bandwidth, which is simple to analyze for some regular topologies, but NP-complete in the general case, e.g., when failures deform the network. During system operation, metrics based on actual benchmarks are accurate in terms of reflecting the real world, but require exclusive system access to remove the background traffic in the network. Hence, the system's availability is reduced while performing these measurements. Therefore, a modeling technique is preferred, because it can be performed during the design phase of an HPC system and in parallel to the normal operation of the system, and parameters, such as topology, failure, routing, and communication pattern, can be modified.

Additionally, for the fail-in-place network design, one option is to rely on degradation modeling and the definition of a throughput threshold as a critical value. Modeling degradation over time in conjunction with a critical value, which defines the point of an unacceptable system state, is common in reliability analysis [Cro+14]. This thesis contributes a simulator to measure degradation and to highlight how communication throughput behaves for increasing numbers of failed network components. The simulator determines the throughput according to Definition 3.11 for two distinct network configurations, and the delta between the network throughputs quantifies the degradation. Section 4.4.2 introduces a statistical method based on the throughput degradation and linear regression to accurately compare the fail-in-place characteristics of different routing algorithms.

4.3 Interconnect Simulation Framework

Testing the hypothesis of whether fail-in-place networks are a viable solution for current and future data centers and supercomputers requires further analysis. Unfortunately, neither an analytical solution nor the measurement approach is practical. The former fails due to the highly complex interaction between (faulty) topology, network technology, routing algorithm, and communication pattern, while the latter is too costly, since it would require constructing a variety of large-scale networks, benchmarking them, and deriving generally valid statements. Therefore, this thesis conducts wide-ranging network simulations as a more practical solution. The following sections illustrate how to construct this interconnect simulation framework.

4.3.1 Toolchain Overview

The first step in the simulation toolchain for network reliability modeling is the generation of the network, including network faults, see Figure 4.3. Either a regular topology, compare to Section 4.2.1, is created or an existing topology file is loaded, such as one of the production HPC systems from Section 3.2.6. The topology generator injects a number of user-defined and non-bisecting, i.e., non-critical, network component failures into the topology. These faulty components are selected uniformly at random using a seeded pseudo-random number generator to ensure repeatability. The user can also specify a fixed failure pattern reflecting an actual system state, e.g., a certain switch-to-link failure ratio or failures within a specific topology hierarchy level. Besides the actual network, the generator exports configuration files for the routing algorithms, such as the list of root switches for Up*/Down* routing.
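The fault-injection step can be pictured as the following loop: remove randomly chosen inter-switch links with a seeded PRNG and undo any removal that would disconnect the graph, so only non-critical failures remain. This is a networkx-based toy version under those assumptions, not the actual topology generator of the toolchain.

```python
# Minimal sketch of seeded, non-critical link-fault injection (illustrative only).
import random
import networkx as nx

def inject_link_faults(G: nx.Graph, num_faults: int, seed: int) -> list:
    """Remove inter-switch links at random without disconnecting the topology."""
    rng = random.Random(seed)              # seeded PRNG ensures repeatability
    removed = []
    while len(removed) < num_faults:
        u, v = rng.choice(list(G.edges()))
        G.remove_edge(u, v)
        if nx.is_connected(G):             # non-critical failure: keep it in place
            removed.append((u, v))
        else:                              # critical failure: undo and retry
            G.add_edge(u, v)
    return removed

fabric = nx.grid_2d_graph(5, 5)            # hypothetical 5x5 switch mesh
print(inject_link_faults(fabric, num_faults=3, seed=42))
```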


[Flowchart: topology generator (generate regular topology or load existing topology, inject faults without destroying connectivity) → routing engine (load topology into ibsim, run OpenSM to generate LFTs, optionally replace LFTs with an external routing) → converter (check connectivity based on LFTs, convert network/LFTs into OMNeT++ format) → simulator (generate traffic pattern, simulate traffic pattern with OMNeT++)]

Figure 4.3: Toolchain to evaluate the network throughput of a fail-in-place network

The InfiniBand network simulator, called ibsim [Mel13], in conjunction with OpenSM is used to configure the linear forwarding tables (LFTs) of all switches with the selected routing algorithm. In an intermediate step, the LFTs can be replaced to simulate a routing algorithm not included in OpenSM, such as the routing algorithm for the 6-dimensional Tofu network of the K computer [Aji+12a]. The converter engine transforms the network and routing information into an OMNeT++ readable format. A connectivity check is performed based on the calculated linear forwarding tables. Incomplete LFTs are considered a failed routing, even if OpenSM did not report any problems while generating the LFTs. The last step is the network throughput simulation with OMNeT++ [VH08] and the InfiniBand model [GR11]. OMNeT++ is one of the widely used discrete event simulation environments in the academic field. The InfiniBand model provides flit-level simulations of high detail and accuracy. The details of this existing IB model and the configured parameters for subsequent simulations are briefly explained in Section 4.3.2. Furthermore, this thesis contributes additional extensions to the model, see Sections 4.3.3 and 4.3.4, which are required to execute the throughput degradation measurements for fail-in-place networks.

4.3.2 InfiniBand Model in OMNeT++

The simulated network itself consists of InfiniBand HCAs and switches and operates on the physical layer with an 8 bit/10 bit symbol encoding. Hence, this model theoretically supports SDR, DDR, QDR, as well as FDR-10 data rates [Inf12]. For the later simulations, the network components are configured for 4X QDR (32 Gbit/s data transfer rate), while the HCAs are plugged into 8X PCIe 2.0 slots, offering a theoretical peak transfer rate of 5 GT/s and a throughput of 32 Gbit/s. The PCI bus has a simulated "hiccup" every 0.1 µs which lasts for 0.01 µs; both are configurable parameters of the IB model and have been tuned to mirror the observed real-world behavior on a small testbed, i.e., a slightly reduced consumption bandwidth at the destination nodes. This simulation environment further assumes that all links in the network are 7 m copper cables, which is the longest passive copper cable for 40 Gbit/s QDR offered by Mellanox. Therefore, the links cause a propagation delay of 43 ns for each flit, which is derived from the 6.1 ns/m delay mentioned in [BG07].

Switches are defined as a group of 36 ports connected via a crossbar which applies cut-through switching. Each switch input buffer can hold 128 flits per virtual lane, i.e., a total of 1,024 flits or 65,536 bytes, respectively. In contrast, switch output buffers can only store 78 flits. Messages are divided into chunks of message transfer units with a configured MTU size of 2,048 byte. Additionally, message header and footer, i.e., the local route header (8 byte), base transport header (12 byte), and invariant/variant CRC (4 byte and 2 byte), are added, which adds up to 2,074 byte per packet. The virtual lane arbitration unit within switches is configured for fair-share among all virtual lanes. All these parameters have been verified with a small network testbed, where it is possible to calculate the expected network throughput and consumption rate at the sinks. Furthermore, this testbed, consisting of 18 HCAs and one IB switch, is used to benchmark the InfiniBand fabric and PCI subsystem, and to tune the simulation parameters accordingly.

4.3.3 Traffic Injection

The IB model uses unreliable connection (UC) as the transport mechanism to allow arbitrary message sizes and to omit the setup of queue pairs in the simulation. Hence, while using UC, the destination will not send an acknowledgment message back to the source. However, since the model does not drop packets due to timeouts or congestion, its throughput characteristics essentially reflect the more common reliable connection (RC) communication used in HPC interconnects. Two different traffic injection patterns, i.e., uniform random injection and an all-to-all exchange pattern, will be applied to estimate the throughput degradation of the networks, as introduced in Section 4.2.3. These patterns should provide an application-agnostic approximation of the maximum achievable throughput for each combination of topology and routing.

4.3.3.1 Uniform Random Injection

The uniform random injection pattern is distinguished by the fact that all terminal nodes perform the following steps: (1) the terminal randomly selects the (next) destination, according to a uniform random distribution, and (2) the terminal sends a message to this destination, and then goes back to step (1). This injection pattern should show the maximum throughput of the network as measured at the sinks, i.e., the consumption bandwidth is extracted at each sink after the simulation reaches the steady state, see Section 4.3.4.1 for further details, or reaches a predefined wall-time limit. A seeded Mersenne Twister algorithm [MN98], supplied by the OMNeT++ framework, provides the randomness of destinations and repeatability across similar network simulations where the only difference is the used routing algorithm. The message size for this type of traffic injection is 1 MTU for the simulations conducted in Section 4.4.2.
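A terminal's behavior under this pattern reduces to the two-step loop sketched below. Python's random module also happens to use a Mersenne Twister, which makes the seeding analogy convenient, but this is only an illustration of the injection logic, not the OMNeT++ generator module.

```python
# Minimal sketch of the per-terminal uniform random injection loop (analogy only).
import random

def send_one_mtu(src: int, dst: int) -> None:
    print(f"terminal {src} -> terminal {dst}: 1 MTU")

def uniform_random_injection(my_id: int, num_terminals: int, num_messages: int, seed: int):
    rng = random.Random(seed)                           # seeded Mersenne Twister
    others = [t for t in range(num_terminals) if t != my_id]
    for _ in range(num_messages):
        dst = rng.choice(others)                        # step (1): uniform random destination
        send_one_mtu(my_id, dst)                        # step (2): send one MTU-sized message

uniform_random_injection(my_id=0, num_terminals=8, num_messages=5, seed=7)
```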

4.3.3.2 Exchange Pattern of Varying Shift Distances

The all-to-all communication pattern used in various MPI-parallelized applications [Leó+16] is another pattern which strains the interconnect and requires high network throughput. This pattern is distinguished by the fact that each terminal node sends a fixed amount of bytes to every other node in the network. One option to implement the all-to-all communication is an algorithm which uses the exchange pattern of varying shift distances. This algorithm determines the distances between all terminals for the network. Based on this distance matrix, each terminal first sends to the closest neighbors with a shift of s = 1 and s = −1 and then in-/decrements the shift distance s up to ±|T|/2 for the even case [Zah+10]. For an odd number of terminals attached to the topology, each terminal increments s up to ±(|T|−1)/2 and subsequently performs the last shift of s = (|T|+1)/2. This exchange pattern is similar to the commonly used linear exchange pattern [Pri+13a], except for the fact that it tries to optimize the traffic for full-duplex links. The assumption is that the reverse path uses the same links, and therefore incoming and outgoing messages between node pairs will not influence each other's transfer rate. For the smaller simulations with approximately 256 terminals the message size is set to 10 MTUs, and for larger networks it is reduced to 1 MTU to limit the simulation time. The network throughput for the exchange pattern is calculated based on a simplified version of Definition 3.11. Therefore, with the number of terminals, constant message size, and runtime, the throughput for the exchange pattern is given by the equation:

\[
  \text{throughput} := \frac{\#\text{terminals} \cdot (\#\text{terminals} - 1) \cdot \text{message size}}{\text{runtime of exchange pattern}} \tag{4.2}
\]
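Both the shift schedule and Equation 4.2 are easy to write down explicitly. The sketch below generates the sequence of shift distances for an even and an odd number of terminals and evaluates Equation 4.2 for invented message-size and runtime values; it is an illustrative helper, not the simulator code.

```python
# Minimal sketch: shift-distance schedule of the exchange pattern and Equation 4.2
# (message size and runtime below are made-up numbers).

def shift_schedule(num_terminals: int) -> list:
    """Shift distances as described in the text: +/-1, +/-2, ... for |T| terminals."""
    if num_terminals % 2 == 0:
        last = num_terminals // 2                      # even case: up to +/- |T|/2
        return [s * sign for s in range(1, last + 1) for sign in (1, -1)]
    last = (num_terminals - 1) // 2                    # odd case: up to +/- (|T|-1)/2 ...
    shifts = [s * sign for s in range(1, last + 1) for sign in (1, -1)]
    shifts.append((num_terminals + 1) // 2)            # ... plus a final shift of (|T|+1)/2
    return shifts

print(shift_schedule(6))   # [1, -1, 2, -2, 3, -3]
print(shift_schedule(7))   # [1, -1, 2, -2, 3, -3, 4]

# Equation 4.2: overall network throughput of the exchange pattern.
terminals, message_size, runtime = 256, 10 * 2048, 0.05   # bytes and seconds (hypothetical)
throughput = terminals * (terminals - 1) * message_size / runtime
print(f"throughput = {throughput / 1e9:.2f} Gbyte/s")
```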

4.3.4 Simulator Improvements

Simulations with the default IB model run either for a configured time or until the user terminates them, even if only flow control messages are sent between ports. Therefore, the simulator neither detects the completion of the simulated traffic pattern nor detects deadlocks if the routing algorithm is not able to create deadlock-free routing tables for the topology.

4.3.4.1 Steady State Simulation

In order to limit the simulation time, the flit-level simulator detects the steady state of a simulation via a bandwidth controller module, which is used for traffic patterns with an infinite number of flits, such as the pattern explained in Section 4.3.3.1. If the average bandwidth (of all messages up to the "current" point in time) is within a 99% confidence interval, then the sink will send an event to the global bandwidth controller and report that the attached HCA is in a steady state. The global controller waits until at least 99% of all HCAs have reported their steady state and then stops the simulation, assuming the network is in a steady state.
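The steady-state criterion can be approximated as follows: a sink keeps the bandwidth samples observed so far and reports steady state once the 99% confidence interval around their mean becomes narrow relative to the mean. The normal-approximation quantile and the 1% relative-width threshold are assumptions of this sketch, not the exact logic of the IB model's bandwidth controller.

```python
# Minimal sketch of a per-sink steady-state check (normal-approximation 99% CI and
# a 1% relative-width threshold are assumptions; not the exact OMNeT++ module).
import math
import statistics

def is_steady(bandwidth_samples: list, rel_tol: float = 0.01) -> bool:
    if len(bandwidth_samples) < 30:
        return False                                    # need enough samples for the CI
    mean = statistics.fmean(bandwidth_samples)
    sem = statistics.stdev(bandwidth_samples) / math.sqrt(len(bandwidth_samples))
    half_width = 2.576 * sem                            # z-quantile of a 99% confidence interval
    return half_width <= rel_tol * mean

samples = [3.9, 4.1, 4.0, 3.95, 4.05] * 20              # hypothetical Gbyte/s readings of one sink
print(is_steady(samples))                               # True -> report steady state to controller
```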

4.3.4.2 Send/Receive Controller

For InfiniBand simulations with finite traffic, such as the exchange pattern, the bandwidth controller is not applicable. Therefore, direct messages between the HCA generator/sink modules and a global send/receive controller are used. Each time an HCA creates a new message, it determines the destination node and sends an event to the global controller. Sinks act similarly and send an event each time the last flit of an IB message is received. The generators send a second type of event when they have created the last message. The global send/receive controller keeps track of all scheduled messages and stops the simulation when the last flit of a traffic pattern arrives at its destination.


4.3.4.3 Deadlock Controller

Deadlock detection is a complicated field, and an accurate algorithm would require the tracking of actual flits and the available buffer spaces. Hence, to ensure relatively fast executions, the IB simulator uses a low-overhead distributed deadlock detection which is similar to the hierarchical deadlock detection protocol proposed by Ho and Ramamoorthy [HR82]. A local deadlock controller observes the state of each port within a switch and periodically reports the state to a global deadlock controller. The three states of a switch are: idle, sending, and blocked (i.e., flits are waiting in the input buffers, but there is no free buffer space in the designated output port). The global deadlock controller will stop the simulation if and only if the state of each switch in the network is either blocked or idle.
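The global decision rule fits into a few lines. The sketch below additionally requires at least one blocked switch before declaring a deadlock, so that a completely idle network is not misclassified; this extra condition is an assumption of the sketch and not stated in the text.

```python
# Minimal sketch of the global deadlock-controller decision; the "at least one
# blocked switch" condition is an added assumption to exclude an all-idle network.
from enum import Enum

class SwitchState(Enum):
    IDLE = "idle"
    SENDING = "sending"
    BLOCKED = "blocked"     # flits wait in input buffers, no free space at the output port

def network_deadlocked(reported_states: dict) -> bool:
    states = set(reported_states.values())
    no_progress = SwitchState.SENDING not in states     # every switch is blocked or idle
    return no_progress and SwitchState.BLOCKED in states

snapshot = {"sw0": SwitchState.BLOCKED, "sw1": SwitchState.IDLE, "sw2": SwitchState.BLOCKED}
print(network_deadlocked(snapshot))                     # True: the simulation would be stopped
```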

4.4 Simulation Results

The previously introduced framework is used in this section to analyze and compare the fail-in-place properties of different topology and routing combinations for the InfiniBand architecture. The initial study determines whether or not a routing algorithm is applicable to a certain type of topology. Thereafter, multiple smaller and large-scale topologies are investigated with respect to their throughput degradation for a given amount of injected network link failures, see Sections 4.4.2 and 4.4.3. Finally, two of the HPC systems introduced in Section 3.2.6 are examined regarding their fault resiliency for link or switch failures, or a combination of both.

4.4.1 Initial Usability Study

Not all routing algorithms work for all topologies. The first part of this study characterizes each routing algorithm for each topology regarding the following two questions:

• Is the algorithm able to generate valid routes for the topology?

• Are these routes deadlock-free?

Table 4.2 lists all used topologies, i.e., the failure-free configurations, for this test. The topology generator of the simulation framework constructs the input graph for ibsim. Afterwards, OpenSM routes the network. If the routing algorithm creates a valid set of routes, then the network and forwarding tables are transformed into the input format for OMNeT++. Otherwise, the combination of topology and routing is marked as a failed routing (r) in Table 4.3. OMNeT++ simulates the uniform random injection workload on the interconnect and checks for deadlock situations. The result can be seen in Table 4.3. Furthermore, results for Down*/Up* routing are omitted in the following, since none of the executed simulations showed any difference in performance compared to Up*/Down* routing. As expected, all topology-aware routing algorithms are able to route their specific topologies, but usually not other types. Note that dimension order routing (DOR) did not produce a deadlock on the 3-dimensional torus(3,3,3) while the 5x5 2D torus configuration deadlocked. This is due to the one-hop paths on the 3x3x3 torus.


Table 4.2: Topology configurations for a balanced (and unbalanced) number of HCAs with link redundancy r_C′ to allow a fair comparison between topology types

Topology              Switches   HCAs        Links   r_C′
2D mesh(5,5)          25         275 (256)   240     6
3D mesh(3,3,3)        27         270 (256)   216     4
2D torus(5,5)         25         275 (256)   300     6
3D torus(3,3,3)       27         270 (256)   324     4
Kautz(2,4)            24         264 (256)   288     6
16-ary 2-tree         32         256 (270)   256     1
XGFT(1,22,11)         33         264 (256)   242     1
Dragonfly(10,5,5,4)   40         280 (256)   276     1
Random                32         256 (270)   256     1

Table 4.3: Usability of topology and routing combinations for networks listed in Table 4.2; Symbols indicate: o = deadlock-free; r = routing failed; d = deadlock detected; Fat-tree, Up*/Down*, DOR, and Torus-2QoS are topology-aware; MinHop, SSSP, DFSSSP, and LASH are topology-agnostic

Topology        Fat-tree  Up*/Down*  DOR  Torus-2QoS  MinHop  SSSP  DFSSSP  LASH
artificial topologies
2D mesh         r         r          o    o           d       d     o       o
3D mesh         r         r          o    o           d       d     o       o
2D torus        r         r          d    o           d       d     o       o
3D torus        r         r          o    o           d       d     o       o
Kautz           r         r          d    r           d       d     o       o
k-ary n-tree    o         o          o    r           o       o     o       o
XGFT            o         o          o    r           o       o     o       o
Dragonfly       r         r          d    r           d       d     o       o
Random          r         r          o    r           d       d     o       o
real-world HPC systems
Deimos          r         o          o    r           o       o     o       o
TSUBAME2        o         o          o    r           o       o     o       o

However, DOR will produce deadlocks for larger 3-dimensional tori because it is generally only deadlock-free on meshes. In contrast, all topology-agnostic algorithms were able to route all topologies. However, MinHop and SSSP do not prevent deadlocks and thus create deadlocking routes in some configurations. Hereafter, only topology/routing pairs which are marked as deadlock-free in Table 4.3 are analyzed, with the exception of the DOR routing algorithm on 3-dimensional tori.

4.4.2 Influence of Link Faults on Small Topologies

In this section, OMNeT++ and the IB model simulate uniform random injection and exchange communication patterns for all correct and deadlock-free routing algorithms. The determined consumption bandwidths at the HCAs constitute the simulation results for the uniform random injection pattern. The maximum injection and consumption bandwidth is 4 Gbyte/s, because 4X QDR is simulated. In contrast, for the exchange pattern, the time to complete the full communication is determined by measuring the elapsed time until the last flit arrives at its destination. The resulting overall network throughput for this pattern is calculated according to Equation 4.2.

The simulated failure scenarios are 1%, 3%, and 5% injected random link failures, with three different seeds for uniform random injection and ten seeds for the exchange pattern. Flich et al. [Fli+12] performed similar studies with these failure percentages. In addition, more severe failure scenarios are simulated with 10%, 20%, and 40% random link failures, as outlined in Section 4.1, to unveil long-term effects of a fail-in-place environment and to expose deficiencies of individual routing algorithms beyond the results obtained in Section 4.4.1. Whisker plots are used to show the results for uniform random injection and to illustrate the consumption bandwidth distribution among all HCAs in the network. Furthermore, the presented values are averaged across three topology configurations, given by the same failure percentage but using different seeds for the fault injection. In contrast to this, histograms with error bars are used to show the network throughput of exchange communication patterns, where ten different seeds are utilized to obtain varying topologies for the same failure percentage. The mismatch in applied seeds is due to the higher runtime required for the simulator to reach a steady state for the uniform random injection pattern.

Two configurations, a balanced and an unbalanced one in terms of the number of HCAs per switch, are simulated to see whether this has an influence on the maximum network throughput and resiliency, see Table 4.2. The number of HCAs for the unbalanced configuration is listed in brackets in the third column of Table 4.2. Figure 4.4a and Figure 4.4b show the balanced vs. unbalanced comparison while omitting results of DOR and LASH routing, because their resulting consumption bandwidth is less than 10% of DFSSSP's. Analyzing the edge forwarding index for each simulation revealed that LASH heavily oversubscribes certain links in the network while other links are not used at all. Therefore, heavy congestion leads to the aforementioned consumption bandwidth decrease, regardless of the link failure rate. A possible explanation is that the implementation of LASH in OpenSM optimizes for switch-to-switch traffic, see Section 3.3.6, rather than terminal-to-terminal traffic.


[Figure 4.4 plots: (a) Balanced 16-ary 2-tree with 256 HCAs; (b) Unbalanced 16-ary 2-tree with 270 HCAs; (c) Random topology with 32 switches, 256 HCAs — bandwidth per HCA [in Gbyte/s] vs. link failures [in %] for MinHop, SSSP, DFSSSP, fat-tree, and Up*/Down* routing (panels a/b) and DFSSSP, DOR, and LASH routing (panel c)]

Figure 4.4: Whisker plots of consumption bandwidth for uniform random injection (box represents the three quartiles of the data; end of the whisker shows the minimum and maximum of the data; x-axis is not equidistant); Shown are the average values for three seeds (seed ∈ {1,...,3}); Discussion in Section 4.4.2

This leads to poor path balancing, and therefore poor load balancing, on many of the investigated topologies.

A comparison of Figure 4.4a and Figure 4.4b reveals three general trends: First, only 1% link failures, i.e., two faulty links for the 16-ary 2-tree topology, may result in a throughput degradation of 30% for uniform random injection on a fat-tree with full bisection bandwidth if the specialized fat-tree routing is used. With DFSSSP, for example, the degradation is only 8% for the median. Thus, the conclusion is that the choice of the routing method is important to maintain high bandwidths in the presence of network failures. Second, an imbalance in the number of HCAs per switch can have a similar influence on the throughput, even without faults in the network (≈40% degradation compared to the balanced fault-free case). Hence, while the resiliency of all routing algorithms is not affected by an unequal number of HCAs, which can be caused by compute node failures or by initial network design decisions, the network throughput will be affected. Third, a curious case can be observed where the bandwidth increases with link failures, especially for DF-/SSSP in Figure 4.4b. This indicates a suboptimal routing algorithm, and the effect will be explained in detail later.

Figure 4.4c and Figure 4.5a show the two communication patterns, uniform random injection (Figure 4.4c) and exchange (Figure 4.5a), for the same random topology. A finding is that random topologies are barely affected by link failures of less than 10%. For the following analyses in Sections 4.4.3 – 4.4.5, the results for the uniform random injection are omitted. The reason is that an observed low consumption bandwidth at the sinks perfectly correlates with a runtime increase of the exchange pattern, and therefore a decrease in overall network throughput, and vice versa.

Torus performance is shown in Figure 4.5b. The exchange pattern finishes 29% faster, i.e., with 29% higher throughput, for Torus-2QoS routing compared to DFSSSP. However, DFSSSP is more resilient, because Torus-2QoS was unable to create LFTs for some of the ten cases of 20% and 40% link failures. Similarly problematic, the simulated traffic pattern caused a deadlock using DOR routing for the mesh with 20% and 40% link failures. Again, LASH is omitted due to its very low throughput for this topology.

Dragonfly topologies are usually routed with the Universal Globally-Adaptive Load-balanced routing scheme [Kim+08], which chooses between minimal path routing and Valiant's randomized non-minimal routing depending on the network load. The current architecture specification of InfiniBand mandates deterministic routing [Inf15, p. 234], but for completeness this study includes simulations for the emerging dragonfly topology as well. The reason behind this is that decreased congestion on the minimal path through a well-balanced routing, such as DFSSSP, potentially reduces the number of adaptive routing decisions via non-minimal paths in a dragonfly network. Figure 4.5c shows that both DFSSSP and LASH degrade only slowly with random network failures.

Simulating multiple failure scenarios with different seeds allows conclusions to be derived through statistical methods, as outlined hereafter. One option is the analysis of the coefficient of determination (R²), assuming a linear model, for every combination of topology and routing algorithm. This statistical analysis of the exchange-pattern data points for all investigated topologies shows a high correlation between the number of failed links in the topology and the achieved network throughput.


[Figure 4.5 plots: (a) Random topology with 32 switches, 256 HCAs; (b) Balanced 3D mesh(3,3,3) with 270 HCAs, r_C′ = 4; (c) Balanced dragonfly(10,5,5,4) with 280 HCAs — throughput [in Gbyte/s] vs. link failures [in %] for DFSSSP, DOR, LASH, and Torus-2QoS routing, with annotations where deadlocks were observed or the routing failed]

Figure 4.5: Histograms for exchange patterns with error bars (showing mean value and the 95% confidence interval for all ten seeds (seed ∈ {1,...,10})); Missing bars: deadlocks observed using DOR routing for at least one out of ten seeds, or Torus-2QoS routing failed (multiple link failures in one 1D ring); Discussion in Section 4.4.2


Table 4.4: Intercept [in Gbyte/s], slope, and R² for balanced topologies of Table 4.2 for the best performing routing algorithm

Topology        Routing      Intercept   Slope   R²
2D mesh         Torus-2QoS   263.63      -1.83   0.94
3D mesh         Torus-2QoS   276.18      -2.37   0.87
2D torus        Torus-2QoS   341.11      -1.68   0.95
3D torus        DFSSSP       508.97      -2.12   0.90
Kautz           DFSSSP       299.48      -1.45   0.88
16-ary 2-tree   DFSSSP       629.76      -3.59   0.69
XGFT            DFSSSP       527.31      -3.03   0.88
Dragonfly       DFSSSP       479.03      -2.39   0.94
Random          DFSSSP       417.53      -2.17   0.76

Thus, this thesis contributes an approach for determining the fail-in-place characteristics, as part of the design process for a new fail-in-place network, based on the following methodology: The best routing algorithm, among the investigated algorithms, for a fail-in-place network is chosen based on the intercept and slope of the linear regression. The intercept approximates the overall network throughput of the exchange pattern for the fault-free network configuration, whereas the slope predicts the throughput decrease (in Gbyte/s) for each non-critical hardware failure which is kept in place. Therefore, even hybrid approaches are possible where the system administrator switches between routing algorithms after a certain number of failures, if the regression lines of the two routing algorithms intersect at this point. Intercept and slope are the main indicators for the quality of the routing, whereas the coefficient of determination is an important metric for how well the throughput can be explained by the number of failures. The results of the coefficient of determination calculation for the previously investigated topology/routing combinations are summarized in Table 4.4. They show that the performance of the exchange pattern on small topologies can be explained almost entirely by one variable, namely link failures. Table 4.4 also lists the best performing routing algorithm based on intercept and slope for each topology.
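The intercept/slope/R² methodology can be applied to any series of (failure count, throughput) samples. The sketch below runs scipy's linregress on invented data points for two routing algorithms; the numbers do not reproduce Table 4.4, and the slope here is per percent of failed links.

```python
# Minimal sketch: fit throughput = intercept + slope * failures and compare routings.
# The sample points are hypothetical and do not reproduce Table 4.4.
from scipy.stats import linregress

failures = [0, 1, 3, 5, 10, 20, 40]                    # injected link failures [in %]
throughput = {                                         # exchange-pattern throughput [Gbyte/s]
    "DFSSSP":   [630, 622, 615, 605, 590, 560, 490],
    "Fat-Tree": [600, 580, 565, 550, 520, 470, 380],
}

for routing, samples in throughput.items():
    fit = linregress(failures, samples)
    print(f"{routing:9s} intercept = {fit.intercept:7.2f} Gbyte/s, "
          f"slope = {fit.slope:6.2f}, R^2 = {fit.rvalue ** 2:.2f}")
```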

4.4.3 Influence of Link Faults on Large Topologies

For the larger network sizes, the percentage of link failures is reduced to more realistic values found in modern HPC systems. A constant annual failure rate of 1%, derived from the analyzed failure rates of TSUBAME2 in Section 4.1, is a reasonable assumption. Hence, simulating 1%, ..., 8% link failures will give an estimate of the performance degradation of the network over the system's lifetime. Investigating larger topologies reveals three key points: First, the investigated fault-tolerant topology-aware routing algorithms are resilient for realistic link failure percentages, but their fail-in-place characteristic in terms of network throughput is worse than that of topology-agnostic algorithms. The comparison of DFSSSP routing and fat-tree routing in Figure 4.6b illustrates the issue. The second point is that not all topology-agnostic routing algorithms are applicable to every topology above a certain size.


[Figure 4.6 plots: (a) 10-ary 3-tree with 1,100 HCAs; (b) 14-ary 3-tree with 2,156 HCAs; (c) Different networks with ≈2,200 HCAs — throughput [in Tbyte/s] vs. link failures [in %] for DFSSSP, fat-tree, and Up*/Down* routing (panels a/b) and for k-ary n-tree with DFSSSP, 3D torus with Torus-2QoS, dragonfly with DFSSSP, and Kautz with LASH (panel c)]

Figure 4.6: Histograms (mean value and 95% confidence interval; ten runs with different seeds ∈ {1,...,10}) for exchange patterns for networks with up to 8% link failures; Figure 4.6c compares 14-ary 3-tree with 2,156 HCAs, 3D torus(7,7,7) with 2,058 HCAs, dragonfly(14,7,7,23) with 2,254 HCAs, and Kautz(7,3) with 2,352 HCAs (from left to right) for the best performing routing; Discussion in Section 4.4.3

Table 4.5: Intercept [in Gbyte/s], slope, and R² for the large topology/routing combinations shown in Figure 4.6c

Topology               #HCAs   Routing      Intercept   Slope   R²
3D torus(7,7,7)        2,058   Torus-2QoS   2,794.89    -2.06   0.66
Dragonfly(14,7,7,23)   2,254   DFSSSP       2,166.54    -1.50   0.34
Kautz(7,3)             2,352   LASH         1,541.58     0.09   0.02
14-ary 3-tree          2,156   DFSSSP       5,015.53    -1.85   0.76

Figure 4.6c shows the comparison of throughput degradation between four balanced large-scale topologies with ≈2,200 InfiniBand HCAs for the best performing (or solely applicable) routing. DFSSSP outperforms the rest for the dragonfly topology and the 14-ary 3-tree, but fails to create LFTs for the torus and the Kautz graph due to limitations of available virtual lanes. The same issue causes LASH routing to be unusable on the 7x7x7 torus, rendering Torus-2QoS the only applicable algorithm. And third, the low coefficient of determination R², see Table 4.5, indicates that the number of link failures is not the dominating influence on the runtime of the exchange pattern on larger topologies. Hence, a low R² for the best of the investigated routing algorithms means that the throughput does not decline heavily during the system's lifetime, which is a desired characteristic for a fail-in-place network. Presumably, the match or mismatch between the configured static routes and the communication pattern is a second explanatory variable, which will be explained further based on Figure 4.6a.

Studying Figure 4.6a exposes an anomaly: a major increase in network throughput using Up*/Down* routing is possible while the number of failed links in the network increases. This contradictory effect has two reasons. First, a comparison with DFSSSP and fat-tree routing, which enable higher throughput, indicates a serious mismatch between communication pattern and static routing algorithm in the fault-free case, as has been shown by Hoefler et al. [HSL08]. The second reason is that each failure in the fabric changes the deterministic routing, which can lead to an improvement in runtime for the same communication pattern. This change in the routing can remove congestion, because simultaneously executed point-to-point communications share fewer links/switches in the fabric.

4.4.4 Case Study of the TSUBAME2 Supercomputer

Randomized link failures are combined with switch failures for the TSUBAME2 case studies in this section. These simulations with additional switch failures, i.e., a form of localized simultaneous failures of many links, mirror the behavior of a real-world HPC system more accurately. Based on the fault data presented in Section 4.1, the calculated annual link failure rate is 0.9% and the annual switch failure rate is 1% for TSUBAME2. The same data indicates a failure ratio of 1 : 13, i.e., thirteen switch-to-switch links fail for each failing switch on average per year. In the following, a constant annual failure rate of 1% is assumed for the links and switches. Furthermore, the simulations are conducted based on the 1st rail, i.e., including all 1,555 nodes, which are connected by 258 QDR InfiniBand switches and 3,621 links.


[Figure 4.7 plots: (a) Link failures only (1% annual failure rate); (b) Switch failures only (1% annual failure rate); (c) Switch and link failures (1 : 13 ratio) — throughput [in Tbyte/s] vs. failures [in %] for DFSSSP, fat-tree, and Up*/Down* routing]

Figure 4.7: Histograms for exchange patterns for different failure types using DFSSSP, fat-tree and Up*/Down* routing on TSUBAME2 (LASH excluded because the throughput is only 2% of DFSSSP's throughput); Shown: mean value and 95% confidence interval for ten runs per failure percentage (seeds ∈ {1,...,10}); Discussion in Section 4.4.4

Three different failure cases, all shown in Figure 4.7, are simulated by the developed toolchain with ten different seeds per failure percentage. Figure 4.7a shows the simulated performance degradation for the case when only links fail within the supercomputer. Furthermore, the second case of solely failing switches is visualized in Figure 4.7b. In addition, the failure ratio of 1 : 13 is investigated in Figure 4.7c to mirror the real-world behavior and to analyze the network performance and resiliency of the routing algorithms for TSUBAME2. Going up to an 8% failure percentage results in the simulation of eight years of TSUBAME2's lifetime. Currently, the anticipated lifetime of the network used in TSUBAME2 is seven years, hence these simulations cover the entire lifespan of the system. The results for Down*/Up* routing are excluded in all figures, since they are equal to those of Up*/Down* routing. Similarly, results for LASH routing are omitted, because LASH only allowed for a two orders of magnitude lower throughput, see Table 4.6.

The default routing algorithm on TSUBAME2 was Up*/Down* before switching to fat-tree routing. This was a reasonable modification to the operation policies of TSUBAME2, since fat-tree routing performs better for an exchange pattern, as can be seen in Figure 4.7a. Nonetheless, the simulated performance degradation also suggests that DFSSSP routing outperforms the other two algorithms. Statistical analysis of the data presented in Figure 4.7c shows that the intercept for fat-tree routing is 15% lower compared to DFSSSP routing, while the slope of the linear regression line is 10% steeper compared to DFSSSP, see Table 4.6. In conclusion, changing the routing algorithm on TSUBAME2 to DFSSSP would not only lead to a higher performance of the exchange pattern on the fault-free network, but would also improve the fail-in-place characteristic of the network. The DFSSSP routing algorithm will provide the basis for the novel routing approaches outlined in Chapter 5 and Chapter 6 of this thesis.

4.4.5 Case Study of the Deimos HPC System

Similarly to the previous section, link failures and switch failures are used individually and in a combined configuration to simulate the fail-in-place characteristics of the Deimos HPC system. The annual failure rate is 0.2% for links and 1.5% for switches, as listed in Table 4.1. Hence, the switch-to-link fault ratio can be approximated with 1 : 2. For each number of link and/or switch failures, ten different faulty network configurations are simulated, while injecting randomized failures with seeds ∈ {1,...,10}.

The conclusion derived from the results in Figure 4.8 and Table 4.6 is that a change from the default routing algorithm, which was MinHop, to the deadlock-free SSSP routing increased the throughput for exchange patterns by a factor of three. Furthermore, the change from MinHop to DFSSSP routing also removed the possibility of deadlocks in the InfiniBand fabric; the simulations experimentally confirmed deadlocks for MinHop, see the missing bars in the figure. Remarkable is the fact that a fail-in-place approach for the Deimos HPC system would have had almost no effect on the achievable performance of the communication layer, as shown by the almost unchanged throughput despite injected failures. Hence, maintenance costs for Deimos' network could have been saved completely, except for repairing node-to-switch failures or critical failures which would bisect the topology.


[Figure 4.8 plots: (a) Link failures only (0.2% annual failure rate); (b) Switch failures only (1.5% annual failure rate); (c) Switch and link failures (1 : 2 ratio) — throughput [in Gbyte/s] vs. failures [in %] for MinHop, DFSSSP, and Up*/Down* routing, with annotations where deadlocks were observed]

Figure 4.8: Histograms for exchange patterns for different failure types using MinHop, DFSSSP and Up*/Down* routing on Deimos (LASH excluded because throughput is only 9% of DFSSSP's); Shown: mean value and 95% confidence interval for ten runs per failure percentage (seeds ∈ {1,...,10}); Deadlocks observed using MinHop routing for at least one out of ten seeds; Discussion in Section 4.4.5

Table 4.6: Intercept, slope, and R² for TSUBAME2 and Deimos shown in Figure 4.7c and Figure 4.8c (default routing: italic; best fail-in-place routing: bold)

HPC system   Routing     Intercept [in Gbyte/s]   Slope   R²
TSUBAME2     DFSSSP      1,393.40                 -1.33   0.62
             Fat-Tree    1,187.19                 -1.48   0.66
             Up*/Down*     717.76                 -0.08   0.01
             LASH           22.92                 -0.01   0.10
Deimos       MinHop         29.94                  -       -
             DFSSSP         93.40                 -0.15   0.09
             Up*/Down*      30.10                  0.06   0.11
             LASH            8.37                  0.00   0.04

Overall, Section 4.4 has shown that HPC networks, and not only hard drives, can be operated in fail-in-place mode to reduce maintenance costs and to utilize the remaining resources. Even though the resulting irregular topologies pose a challenge to the used routing algorithms, a low failure rate of the network components supports a fail-in-place network design strategy, which bypasses non-critical failures during the supercomputer's lifetime.

5 Utilization Improvement through SAR

Idealized conditions, e.g., a regular topology without faulty network components, a single application running on the whole system, or the absence of system noise altering the message injection pattern, usually do not apply to production supercomputers. This chapter analyzes the disadvantages of application-agnostic static routing policies in Section 5.1. Based on this analysis, a novel scheduling-aware routing approach (SAR) is presented in Section 5.2, which aims at optimizing the utilization of the interconnection network as well as the communication performance of applications. Section 5.3 shows the practicality and effectiveness of this scheduling-aware routing with simulations of real-world workloads and through its usage on a production supercomputer.

5.1 Conceptual Design of SAR for Shared HPC Environments

Many advanced approaches for application-specific [Pal+06; Kin+09] or topology-specific [Sub+12; Pri+13b; Jok+15] routing and mapping assume idealized conditions such as a regular topology without faulty components, isolated bulk-synchronous applications communicating in synchronized phases, and the absence of system noise. Unfortunately, these assumptions are rarely true in practice (Chapter 4). An analysis of the job mix on two production supercomputers emphasizes the complexity of real-world installations. Figure 5.1 shows all batch jobs that use at least two network switches (referred to as multi-switch jobs hereafter) on two InfiniBand-based petascale supercomputers introduced in Section 3.2.6, i.e., Taurus and TSUBAME2, in the time frame of one month. These plots emphasize the fact that multiple parallel applications are spread throughout the system at any time and are thus contending for bandwidth on the shared network resources. In fact, more than 66% and 50% of the compute jobs on Taurus and TSUBAME2, respectively, use multiple switches. Thus, the jobs potentially share the same network resources, meaning switches and inter-switch links, and therefore can influence each other's communication performance. At the same time, communications generally do not cross from one parallel application to another, with the exception of management and I/O traffic targeted at special endpoints in the system. Furthermore, the fact that these systems are used by multiple users simultaneously, submitting a broad set of scientific applications, inevitably results in fragmentation, where the application processes are scheduled on compute nodes distributed all over the whole system. However, the batch systems usually prioritize system utilization over job locality. Other researchers, e.g., Jokanovic et al. [Jok+15], identified similar system fragmentation on their supercomputers when the system is running continuously over a longer time period.


[Figure 5.1 plots: (a) Taurus supercomputer (multi-island design with 2014 compute nodes connected by FDR/QDR InfiniBand); (b) TSUBAME2 petascale HPC system (utilizing 1408 compute nodes connected by two full-bisection fat-tree QDR InfiniBand networks) — number of jobs [absolute; rowstacked] and nodes used for multi-switch jobs [in %], grouped by nodes per job, vs. day of the month (Feb. '15)]

Figure 5.1: Batch job history (sampled every 10 min) of two petascale HPC systems for one month of operation; Only multi-switch jobs shown; Total system utilization (including single-node and single-switch jobs) is usually higher


When computing oblivious routes for an interconnection network, as all algorithms of Section 3.3 do, the algorithm tries to balance the number of routes across all links so that each physical network cable has a similar number of routes crossing it. This number is called the edge forwarding index (EFI), see Definition 3.12, a theoretical upper bound for the worst-case congestion of a set of routes [HMS89]. The following experiment illustrates this concept: Route a two-level fat-tree network with an optimal oblivious strategy, i.e., with ideally balanced EFI, and schedule three different scientific applications (or batch jobs) to the system. The allocation of application processes to nodes logically partitions the network into three pieces which are not crossed by communications. Subsequently, the effective EFI for each link is computed by only counting routes connecting endpoints that belong to the same job. This metric extends the established edge forwarding index to differentiate between "useful" intra-job paths and paths connecting compute nodes which do not work on the same scientific problem, see Definition 5.1. In this regard, a batch job, or job j ∈ J for short, is a (parallel) application scheduled onto a subset of compute nodes, N_j ⊆ N, which collectively work on a given task during a finite time frame.

Definition 5.1 (Effective Edge Forwarding Index). The effective edge forwarding index $\gamma^{e}_{R,J}$ of a switch port or outgoing link $c_q \in C^{*}$ is the sum of intra-job routes being forwarded via this port as induced by a given routing function R, i.e.,

\[ \gamma^{e}_{R,J}(c_q) := \sum_{j \in J} \bigl|\{\, P_{n_x,n_y} \mid n_x, n_y \in N_j \wedge c_q \in P_{n_x,n_y} \,\}\bigr| \]

for all jobs j ∈ J running simultaneously on the system.
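To make Definition 5.1 concrete, the following minimal C sketch computes the effective EFI for a toy network. The data layout (a path as a list of directed link indices, a single jobID per node) is an illustrative assumption and not the data structures used in OpenSM.

/* Sketch: effective edge forwarding index (Definition 5.1) for a toy network.
 * Simplified, assumed data layout; not OpenSM's internal structures. */
#include <stdio.h>

#define NUM_NODES 4
#define NUM_LINKS 4
#define MAX_HOPS  4

/* path[s][d] lists the directed link indices traversed from node s to node d */
static int path_len[NUM_NODES][NUM_NODES];
static int path[NUM_NODES][NUM_NODES][MAX_HOPS];

/* job_of[n] = jobID of node n, or -1 if idle; one jobID per node is a
 * simplification, the thesis allows a list of jobIDs per node */
static int job_of[NUM_NODES] = { 0, 0, 1, 1 };

/* Count only routes whose endpoints belong to the same job (intra-job routes). */
static void effective_efi(long efi[NUM_LINKS])
{
    for (int l = 0; l < NUM_LINKS; l++)
        efi[l] = 0;
    for (int s = 0; s < NUM_NODES; s++)
        for (int d = 0; d < NUM_NODES; d++) {
            if (s == d || job_of[s] < 0 || job_of[s] != job_of[d])
                continue;                        /* skip non-intra-job routes */
            for (int k = 0; k < path_len[s][d]; k++)
                efi[path[s][d][k]]++;
        }
}

int main(void)
{
    /* Toy example: nodes 0,1 (job 0) and 2,3 (job 1); paths filled in by hand
     * instead of by a routing run. */
    path_len[0][1] = 1; path[0][1][0] = 0;
    path_len[1][0] = 1; path[1][0][0] = 1;
    path_len[2][3] = 1; path[2][3][0] = 2;
    path_len[3][2] = 1; path[3][2][0] = 3;
    path_len[0][2] = 2; path[0][2][0] = 0; path[0][2][1] = 2;  /* inter-job, ignored */

    long efi[NUM_LINKS];
    effective_efi(efi);
    for (int l = 0; l < NUM_LINKS; l++)
        printf("link %d: effective EFI = %ld\n", l, efi[l]);
    return 0;
}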

Figure 5.2a shows a heat map of the effective EFIs for each link in the network. Each full-duplex link has two colors assigned, one per direction, indicating the number of "useful" paths forwarded via the respective egress port on the switch. Similarly, Figure 5.2c shows the corresponding histogram of the edge forwarding indices. Two conclusions can be directly derived from the figure:

• There are hotspots of high route counts per link which can decrease communication performance of one or more jobs, and

• Some links are underutilized or not used at all, which reduces the system's efficiency.

A less obvious observation is that the total number of switches and links available for intra-job communication is suboptimal. The same example can be used to illustrate the contributed (and subsequently introduced) novel approach of scheduling-aware routing (SAR), which aims at maximizing the allocation of network resources per batch job. The underlying assumption is that the allocated, and potentially shared, network resources per individual batch job should be maximized on a production system, which is contrary to other researchers' premise to isolate jobs to achieve performance predictability [Jok+15]. Hence, using SAR, intra-job communication throughput increases due to the increase in available network hardware, which directly increases the utilization of the overall supercomputer. In other words, SAR exploits the knowledge of the batch system, i.e., the current job mix, to determine the application locality, and guides the fabric manager in assigning routing paths in the network.


[Figure 5.2a: heat map of the effective EFI under oblivious routing; Figure 5.2b: heat map under scheduling-aware routing (color scale 0-162); Figures 5.2c and 5.2d: histograms of the effective edge forwarding index per link (number of links vs. effective EFI) for 5.2a and 5.2b, respectively]

Figure 5.2: Comparison of effective EFI for inter-switch links (heat map in 5.2a and 5.2b); Oblivious routing vs. scheduling-aware routing for three equal-sized batch jobs on a 2-level fat-tree with 18 nodes attached to each of the ten switches of level 1; Jobs allocated using interleaving mapping (i.e., j1 uses nodes 1 – 30 and 91 – 120, j2 uses 31 – 60 and 151 – 180, and j3 uses 61 – 90 and 121 – 150); Two colors of individual full-duplex links in 5.2a and 5.2b indicate different EFIs for the two opposite directions; Topology is equivalent to the full-bisection 180-node QDR island of Taurus (see Figure 3.6)


[Figure 5.3: qualitative placement of oblivious routing, scheduling-aware routing, topology-aware mapping, application-aware routing, and adaptive routing along the axes runtime complexity vs. communication efficiency]

Figure 5.3: Qualitative comparison (with respect to runtime complexity of the approach and resulting communication efficiency) of different routing and network optimizations for supercomputers

Furthermore, SAR should have a similarly low runtime complexity as other oblivious routing algorithms, and broad applicability to current and future interconnection technologies and supercomputers. Figures 5.2b and 5.2d show the same three-job example as before, but using the scheduling-aware routing approach that this chapter introduces. The maximum EFI, i.e., the upper bound for congestion, is reduced from more than 160 to a mere 60, and the overall path balance is greatly improved. A second metric, derived from the previously defined effective edge forwarding index, describes the amount of superfluous links in the network, metaphorically speaking: the proportion of "dark fiber" to links usable for intra-job communication, see Definition 5.2. Please note that this latter metric does not differentiate between copper and fiber cables, and fiber is used as a synonym for link. Furthermore, this metric is only based on intra-job paths, and ignores any other traffic related to supplementary nodes, such as I/O traffic to and from a remote filesystem or administrative datapaths.

Definition 5.2 (Dark Fiber Percentage). The dark fiber percentage is the percentage of links in the system which are not used for intra-job routes, and can therefore be derived from $\gamma^{e}_{R,J}$ in the following way:

\[ \theta_{R,J} := \frac{\bigl|\{\, c_q \in C^{*} \mid \gamma^{e}_{R,J}(c_q) = 0 \,\}\bigr|}{|C^{*}|} \]
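Building on the sketch shown after Definition 5.1, the dark fiber percentage can be derived directly from the effective-EFI array; again a hedged illustration and not OpenSM code.

/* Sketch: dark fiber percentage (Definition 5.2), computed from the
 * effective-EFI array of the previous sketch: the fraction of directed
 * inter-switch links that carry no intra-job route at all. */
static double dark_fiber_percentage(const long efi[], int num_links)
{
    int unused = 0;
    for (int l = 0; l < num_links; l++)
        if (efi[l] == 0)
            unused++;                 /* link carries no "useful" route */
    return 100.0 * (double)unused / (double)num_links;
}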

The analyses in Section 5.3 utilize the dark fiber percentage, along with the effective edge forwarding index, to evaluate the effectiveness of the SAR approach compared to other state-of-the-art oblivious routing strategies. A general overview and qualitative comparison of the different network/communication optimization strategies is shown in Figure 5.3.


5.2 Example Implementation of the Scheduling-Aware Routing

The previous section sketched the idea and potential advantages of a scheduling-aware routing for supercomputers which are simultaneously used by multiple users. The realization of this approach is demonstrated in the following sections. Section 5.2.1 outlines the basic components, including relevant features and limitations, available on the petascale Taurus HPC system, which served as a testbed for SAR. Section 5.2.2 provides details on how these building blocks are extended and combined to implement the scheduling-aware routing (SAR). Finally, Section 5.2.3 introduces a novel protocol to prevent out-of-order packet delivery, potentially caused by SAR's frequent network reconfigurations.

5.2.1 Hardware and Software Building Blocks

The main building blocks needed to implement the SAR approach are the subject of this section. First, the essential details of the inner workings of InfiniBand and its subnet manager are presented in Section 5.2.1.1. Furthermore, Sections 5.2.1.2 and 5.2.1.3 provide a brief introduction to the SLURM batch system and the Open MPI communication library, respectively.

5.2.1.1 InfiniBand and OpenSM

The InfiniBand technology [Pfi02], as defined by the architecture specification (IBTAspec) [Inf15], is widely used in current HPC systems to network compute and storage nodes. Each network device in the fabric is identified by one or more local identifiers (LIDs), configured via a LID mask control (LMC) parameter, such that the total number of LIDs per device is $2^{\mathrm{LMC}}$, with 0 ≤ LMC ≤ 7. The IBTAspec defines a layer, called verbs, between the software/operating system and the InfiniBand hardware, see Figure 29 of [Inf15, p. 130]. Communication between two IB devices is implemented via queue pairs (QPs) that establish a send and receive channel between one source LID and one destination LID for unicast traffic. Multicast traffic is beyond the scope of this thesis, because it is performed unreliably in InfiniBand [Inf15, 7.10.2], i.e., by using the unacknowledged Unreliable Datagram transport service, and therefore generally not used by parallel applications of the HPC community. However, the verbs layer also allows modifications of QPs, e.g., performing path migrations (APM) or draining all outstanding work requests (SQD). These two features will be used in Section 5.2.3 to enforce property preserving network updates while changing the routing configuration. Applications or communication libraries, such as Open MPI, post work requests (WRs) to a queue pair to initiate data transport to a remote node. InfiniBand's hardware offloading and kernel-bypass characteristics allow the processing of these WRs asynchronously to the application. The channel adapter, i.e., InfiniBand's network interface card, splits a scheduled message into packets, configures packet headers, and dispatches them. This asynchronicity complicates achieving per-flow consistency, as Section 5.2.3 will elucidate further. InfiniBand ensures lossless processing of WRs for reliable channels through credit-based flow control, both on a per-link basis and end-to-end between send/receive queues.
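As a small worked example of the LMC mechanism, which Section 5.2.3 later relies on, the LID range per device follows directly from the chosen LMC value:

\[ \#\text{LIDs per device} = 2^{\mathrm{LMC}}, \qquad \mathrm{LMC} = 1 \;\Rightarrow\; \{\,\text{baseLID},\ \text{baseLID}+1\,\} = \{\,\text{baseLID},\ \text{highLID}\,\} \]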


[Figure 5.4: SLURM architecture diagram; user commands (srun, sbatch, salloc, squeue, sacct, sinfo) on login nodes; controller daemon slurmctl with backup, configuration files, and the slurmdbd database on management nodes; slurmd daemons on compute nodes]

Figure 5.4: SLURM architecture; Highlighted: squeue and configuration files, which will be used by the filtering tool in Section 5.2.2.1

However, reliable channels require in-order delivery of all packets of one WR; otherwise, the receiving IB port drops messages and requests retransmissions. Moreover, IB supports virtual lanes (VLs), which are essentially different sets of buffers within one port. Hence, InfiniBand's VLs are similar to virtual channels, and therefore can be used to construct virtual layers, see Definition 3.8. InfiniBand network ports can support up to 15 VLs for data traffic, indexed starting from 0, and one virtual lane for management traffic, i.e., VL15. However, current hardware usually supports only four or at most eight data VLs. Data virtual lanes can be used for quality of service or to avoid potential deadlock configurations in the forwarding tables of switches, as described in Section 3.1.2. Various routing approaches use this feature to combine the ith VL of each port into virtual layers and assign paths to different layers to create a deadlock-free routing configuration [SLT02; Shi+09; DHN11]. The InfiniBand fabric is controlled by a subnet manager, which, among other things, is responsible for assigning LIDs and LMCs to nodes, reacts to topological changes and failures in the fabric, and calculates the routing configuration for the fabric. An open-source version of this subnet manager is called OpenSM [Mel13], which currently implements nine flow-oblivious routing algorithms, see Sections 3.3.1 to 3.3.7, for a variety of supported network topologies. Chapter 4 unveils feasible combinations of these routing algorithms and (faulty) topologies. Once a new routing configuration is calculated, OpenSM updates the linear forwarding tables (LFTs) of all IB switches, and packets are forwarded according to these new switching rules. An LFT update usually only happens in response to a topological change in the fabric, e.g., when network components fail or new nodes are added. Definition 3.5 details forwarding tables formally.
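The linear forwarding table itself is conceptually simple: the destination LID indexes the egress port directly. The following C sketch illustrates this lookup under assumed, simplified data structures (not OpenSM's implementation); the unicast LID range is taken from the InfiniBand specification and the 64-entry MAD block size from Section 5.2.3.

/* Sketch: destination-based forwarding via a linear forwarding table (LFT).
 * Illustrative only; sizes and layout are not OpenSM's internal structures. */
#include <stdint.h>
#include <stdio.h>

#define MAX_UNICAST_LID 0xBFFF      /* unicast LIDs: 0x0001 ... 0xBFFF */

/* One LFT per switch: the destination LID indexes the egress port directly. */
struct switch_lft { uint8_t egress_port[MAX_UNICAST_LID + 1]; };

static uint8_t route_packet(const struct switch_lft *lft, uint16_t dest_lid)
{
    return lft->egress_port[dest_lid];   /* destination-based forwarding */
}

int main(void)
{
    static struct switch_lft sw;          /* zero-initialized */
    sw.egress_port[42] = 7;               /* toy rule: LID 42 leaves via port 7 */
    printf("LID 42 -> port %u\n", route_packet(&sw, 42));
    /* OpenSM distributes such a table in MAD blocks of 64 entries each,
     * i.e., ceil(max(LID)/64) blocks per switch (see Section 5.2.3). */
    return 0;
}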

5.2.1.2 Simple Linux Utility for Resource Management

Many supercomputers utilize SLURM [YJG03] as a cluster resource manager and batch system. SLURM is open-source and offers a plugin interface, which makes it a preferred solution for many HPC sites, since it provides extensibility for individual needs and the option to test research ideas. The SLURM architecture, see Figure 5.4, primarily consists of three parts: (1) user commands, (2) controller daemon, and (3) compute node daemons. The controller daemon manages most of the system's compute resources and schedules batch jobs onto free compute nodes.


[Figure 5.5a: Open MPI abstraction layers (MPI-parallelized application, OMPI/ORTE/OPAL abstraction layers, operating system, network hardware such as InfiniBand); Figure 5.5b: frameworks and components, e.g., the Byte Transfer Layer (BTL) framework with tcp and openib components and the collective operations framework with tuned and hierarch components]

Figure 5.5: Open MPI's modular component architecture establishes a message passing interface between application and hardware to enable inter-process communication; Highlighted: openib BTL for InfiniBand

However, the controller is partially or fully unaware of other resources, such as remote storage or the interconnection network. For example, while SLURM is aware of the physical topology of the supercomputer, i.e., node-to-switch mapping and inter-switch connectivity, it is unaware of the routing configuration deployed by the subnet manager. The SLURM controller can be attached to a database, see Figure 5.4, which stores the historical and current state of job allocations for a system. In Section 5.2.2, this information is used to obtain a snapshot of simultaneously running jobs and their locality.

5.2.1.3 Open MPI

The message passing interface (MPI) specification [Mes15] standardizes an API for parallel applications to perform inter-process communication, either locally on a compute node or via the interconnection network. One widely used open-source implementation of the MPI standard is Open MPI [Gab+04]. The Open MPI library assigns one rank, with rank ∈ {0,...,n − 1}, to each process within a single MPI-parallelized application, and thereby provides an abstract interface for the application to perform point-to-point or collective operations between ranks. Open MPI uses a modular component architecture (MCA) to support a variety of interconnection technologies and transport services, see Figure 5.5. The OMPI layer provides the MPI API to the applications and implements communication algorithms for point-to-point and collective operations, monitoring and profiling interfaces, etc. One framework within the MCA is the byte transfer layer (BTL), responsible for point-to-point communication between MPI processes. The BTL for IB networks is called "openib", see Figure 5.5b. This openib component interfaces with the verbs layer, see Section 5.2.1.1, to set up QPs, schedule send or receive WRs, and handle events raised by the InfiniBand hardware, among other things. When Open MPI is configured with threading support, i.e., for POSIX threads, then the openib component spawns one asynchronous event thread per MPI process. In Section 5.2.3, this asynchronous event thread is used to establish communication channels between OpenSM and MPI programs to achieve per-flow consistent network updates.


[Figure 5.6: flowchart of the filtering tool; poll squeue for scheduled batch jobs and read the terminal-to-switch mapping from SLURM's topology file; filter out running multi-switch jobs (i.e., drop single-node jobs and all jobs whose compute nodes are attached to a single InfiniBand switch) and collect the job-to-terminal mapping; if the mapping differs from the previous iteration, write it to disk and send SIGUSR2 to OpenSM to calculate the scheduling-aware routing configuration; otherwise start the next iteration after a predefined sleep interval]

Figure 5.6: Flowchart of a filtering tool; Collect information about currently running multi-switch batch jobs and initiate recalculation of LFTs via SAR

5.2.2 Scheduling-Aware Routing Optimization

This section introduces the novel scheduling-aware routing that this thesis contributes. SAR optimizes the forwarding tables for production supercomputers by filtering relevant information from the batch system and passing this information to a modified version of the DFSSSP routing engine.

5.2.2.1 Filter between Batch System and Subnet Manager

Performing a scheduling-aware routing optimization for a supercomputer is only beneficial for parallel applications when their compute nodes are attached to multiple switches. Therefore, a filtering tool optimizes the workflow, see Figure 5.6, whereby it periodically polls SLURM for the current state of the system. To be exact, information about batch jobs, including job state (pending, running, etc.), job identifier, and allocated compute nodes, is collected. This information is obtainable through SLURM's squeue command. Furthermore, the topology information is required, i.e., the terminal-to-switch mapping, which is stored among SLURM's configuration files, or alternatively can be obtained by ibnetdiscover, an IB tool [Mel13]. The filtering tool uses this topology information to assemble a list of multi-switch jobs containing all currently running applications which have their compute nodes attached to at least two switches. The list of multi-switch jobs is compared to the list of the previous iteration while ignoring the actual job identifiers. Hence, the filtering tool can avoid unnecessary calculations of LFTs, since replacing one job with another job on the same compute nodes will not result in a different routing configuration while using SAR, see the subsequent Algorithm 5.1.

If the two lists of multi-switch jobs differ, then the filtering tool writes a job-to-terminal mapping file for OpenSM, informs OpenSM about the configuration change, and replaces the previous list with the new list for the next iteration. Additionally, an extension to OpenSM's signal handler is required to process the SIGUSR2 signal, which informs OpenSM about an updated job-to-terminal mapping. A direct interface between SLURM and OpenSM would have been possible, but is less desirable for three reasons. First, portability: the loose coupling makes it easy to support other batch systems or subnet managers, such as the Portable Batch System (PBS) [Hen95] and derivatives, or the fabric manager for Intel Omni-Path [Bir+15]. Furthermore, the SLURM controller and OpenSM can run on different management nodes of an HPC system. Lastly, integrating the filtering functionality into the SLURM controller potentially increases the latency experienced by users when submitting many jobs, due to the additional computational overhead.
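A minimal sketch of the filtering loop, written in C for illustration: it polls squeue, compares the snapshot with the previous iteration, writes a mapping file, and signals OpenSM. The file paths, the PID-file mechanism, the raw-snapshot comparison, and the omission of the single-switch filtering step are assumptions for brevity and do not reflect the deployed tool.

/* Sketch of the filtering loop of Figure 5.6 (illustrative assumptions only). */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define INTERVAL_SEC 300              /* 5 min polling interval, Section 5.3.3.1 */
#define MAX_SNAPSHOT (1 << 20)

/* Read the currently running jobs (jobID and node list) from SLURM. */
static size_t snapshot_jobs(char *buf, size_t len)
{
    FILE *p = popen("squeue -h -t RUNNING -o \"%i %N\"", "r");
    size_t n = 0;
    buf[0] = '\0';
    if (!p)
        return 0;
    n = fread(buf, 1, len - 1, p);
    buf[n] = '\0';
    pclose(p);
    return n;
}

int main(void)
{
    static char prev[MAX_SNAPSHOT], cur[MAX_SNAPSHOT];
    prev[0] = '\0';

    for (;;) {
        snapshot_jobs(cur, sizeof(cur));
        /* Real tool: drop single-switch jobs using the terminal-to-switch
         * mapping and ignore jobIDs; here the raw snapshot is compared. */
        if (strcmp(cur, prev) != 0) {
            FILE *f = fopen("/var/cache/sar/job-to-terminal.map", "w"); /* assumed path */
            if (f) { fputs(cur, f); fclose(f); }

            FILE *pidf = fopen("/var/run/opensm.pid", "r");             /* assumed path */
            if (pidf) {
                long pid;
                if (fscanf(pidf, "%ld", &pid) == 1)
                    kill((pid_t)pid, SIGUSR2);  /* trigger SAR recalculation */
                fclose(pidf);
            }
            strcpy(prev, cur);
        }
        sleep(INTERVAL_SEC);
    }
    return 0;
}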

5.2.2.2 Routing Optimization with DFSSSP

The basis of SAR is the deadlock-free single-source shortest-path routing (DFSSSP) for three reasons: (1) DFSSSP is deadlock-free and topology-agnostic, and therefore supports a wide variety of network topologies [DHN11], (2) DFSSSP offers high global throughput for the complete HPC system and a well-balanced EFI across the system, see Chapter 4, and (3) DFSSSP already distinguishes three terminal types, i.e., compute, storage, and other nodes, and optimizes their routing separately. Hence, SAR inherits the characteristics listed in (1) and (3) and improves upon (2). DFSSSP routing computes the destination-based paths for the nodes of the interconnection network by applying a modified version of Dijkstra's shortest-path algorithm, optimized for multigraphs, see Sections 3.3.2 and 3.3.8. The routes from all nodes to one destination in the network are derived by reversing the paths calculated from Dijkstra's source to all vertices in the graph. DFSSSP iterates over all nodes of the network, and first applies Dijkstra's algorithm, followed by an update of the edge weights in the graph. This edge weight update, increasing the weight by +1 for every new path on a link, ensures the well-balanced EFI mentioned above in reason (2). After calculating a path for every node-to-node combination, DFSSSP assigns these paths to virtual layers to achieve a deadlock-free routing configuration, similar to other routings, such as LASH [SLT02]. The extensions of DFSSSP routing to include knowledge about batch job locations, and to enable the algorithm to optimize the path calculation for intra-job paths, are shown as pseudocode in Algorithm 5.1. The main modifications are outlined in lines 1 – 10, while subsequent lines are shown for the sake of completeness and sketch the remainder of the already existing DFSSSP implementation in OpenSM. In the first part of Algorithm 5.1, the job-to-terminal mapping given by the filtering tool is analyzed, see Section 5.2.2.1. Each node in the multigraph representing the interconnection network gets extended by a job array which stores all jobIDs currently running on this node. These nodes are sorted in descending order by the size of the largest job (in terms of node count) executed on the node, i.e., compute nodes belonging to the largest job running on the HPC system will be processed first in the following loop, see line 6. Hence, this sorting ensures that intra-job path balancing is increased and the number of overlapping paths is minimized.


Algorithm 5.1: Scheduling-aware DFSSSP routing
Input: Network I = G(N,C); Job-to-terminal mapping B := [(nodeName, jobID), ...]
Result: Scheduling-aware and deadlock-free routing configuration (P_{n_x,n_y} for all n_x, n_y ∈ N)

     /* Process job-to-terminal mapping */
 1   foreach node n ∈ N do
 2       n.jobList ← empty list []
 3       foreach pair (nodeName, jobID) ∈ B do
 4           if n.nodeName = nodeName then n.jobList.append(jobID)
     /* Compute optimized routing for compute nodes */
 5   N_sorted ← sort N in descending order by the size of the largest job executed on n ∈ N
 6   foreach node n_d ∈ N_sorted do
 7       calculate one path P_{n_x,n_d} for every pair (n_x, n_d), with n_x ∈ N, with the modified Dijkstra algorithm (details in [DHN11])
 8       foreach node n_x ∈ N do
 9           if n_x.jobList ∩ n_d.jobList ≠ ∅ then
10               increase edge weight by +1 for each link in path P_{n_x,n_d}
     /* Reset edge weights and compute optimized routing for storage nodes */
11   foreach storage node n_d ∈ N do
12       calculate one path P_{n_x,n_d} for every pair (n_x, n_d), with n_x ∈ N, with the modified Dijkstra algorithm
13       update edge weights for all links used by P_{(·,n_d)}
     /* Reset edge weights and compute optimized routing for all other nodes */
14   foreach node n_d ∈ N, n_d not processed before, do
15       calculate one path P_{n_x,n_d} for every pair (n_x, n_d), with n_x ∈ N, with the modified Dijkstra algorithm
16       update edge weights for all links used by P_{(·,n_d)}
     /* Create deadlock-free routing configuration */
17   foreach route P_{n_x,n_y} calculated above do
18       assign P_{n_x,n_y} to one virtual layer without creating a cycle in the corresponding channel dependency graph (see [DHN11] for details)

Since edge weights are only updated when actually used by an intra-job path, see lines 9 and 10 of Algorithm 5.1, the resulting paths should improve the network metrics compared to other state-of-the-art oblivious routings, which will be demonstrated in Section 5.3. The rest of the base DFSSSP algorithm is kept in place to ensure deadlock-freedom and a well-balanced routing configuration towards the remote filesystem.
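The decisive modification, lines 8 – 10 of Algorithm 5.1, can be sketched in C as follows; the node and link structures are simplified stand-ins for illustration and not OpenSM's internal types.

/* Sketch of lines 8-10 of Algorithm 5.1: after the modified Dijkstra run for
 * destination nd, only paths whose endpoints share at least one jobID
 * contribute to the edge weights. */
#include <stdbool.h>
#include <stdio.h>
#include <stddef.h>

struct link { long weight; };
struct node {
    const int   *job_list;        /* jobIDs running on this terminal        */
    size_t       num_jobs;
    struct link **path_links;     /* links of the current path towards n_d  */
    size_t       path_len;
};

/* true if the two nodes have at least one jobID in common (line 9) */
static bool share_job(const struct node *a, const struct node *b)
{
    for (size_t i = 0; i < a->num_jobs; i++)
        for (size_t j = 0; j < b->num_jobs; j++)
            if (a->job_list[i] == b->job_list[j])
                return true;
    return false;
}

/* lines 8-10: only intra-job paths add +1 per traversed link */
static void update_weights_for_destination(struct node *nodes, size_t n,
                                           const struct node *nd)
{
    for (size_t x = 0; x < n; x++) {
        if (!share_job(&nodes[x], nd))
            continue;                              /* not an intra-job path */
        for (size_t k = 0; k < nodes[x].path_len; k++)
            nodes[x].path_links[k]->weight += 1;
    }
}

int main(void)
{
    struct link l0 = { 0 };
    struct link *p0[] = { &l0 };
    const int jobs_a[] = { 7 }, jobs_b[] = { 7 }, jobs_c[] = { 8 };
    struct node nodes[] = {
        { jobs_a, 1, p0, 1 },     /* shares job 7 with the destination  */
        { jobs_c, 1, p0, 1 },     /* different job: its path is ignored */
    };
    struct node nd = { jobs_b, 1, NULL, 0 };

    update_weights_for_destination(nodes, 2, &nd);
    printf("weight of link 0: %ld\n", l0.weight);  /* prints 1, not 2 */
    return 0;
}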

5.2.3 Property Preserving Network Update for InfiniBand

One problem when performing network updates, i.e., the transition between two network states, is to preserve per-packet or per-flow consistency [Rei+12; McC+15], and thereby preserve certain properties satisfied by each network state. Usually, "atomic" state transitions for the whole network are impossible; hence, without consistency, these network updates can lead to security vulnerabilities, loss of packets, etc. A summary of the comprehensive theoretical groundwork by Reitblatt et al. [Rei+12] is outlined in the following two Definitions 5.3 and 5.4.

Definition 5.3 (Per-packet and Per-flow Consistency). Per-packet consistency of the network operation is achieved if and only if each packet injected into the network is processed by exactly one network state, see Definition 3.6. If all packets of each injected message or flow are processed by the same network state, then this is referred to as per-flow consistency.


Definition 5.4 (Per-packet Consistent Network Update). Let $u_s$ be a sequence of atomic updates $u$ applied to a set of network switches. Then, for two (distinguishable) network states $\chi$ and $\chi'$, the transition $\chi \rightarrow_{u_s} \chi'$ is called a per-packet consistent update if and only if each packet is processed by either $\chi$ or $\chi'$, but not by any inconsistent state in-between. Per-flow consistent updates are defined analogously.

The property preservation of network updates, as stated by Theorem 1 of Reitblatt et al. [Rei+12], follows directly when per-packet consistent updates are performed. However, this only holds for properties related to single packets, e.g., connectivity or loop-freedom. Other network properties, such as in-order delivery or deadlock-freedom, which are based on the correlation of multiple packets, are harder to achieve. Hence, for the purpose of scheduling-aware rerouting of HPC systems, Reitblatt's property preservation needs to be extended to be applicable to lossless and deadlock-free interconnects, see Definition 5.5.

Definition 5.5 (Property Preserving Network Update). The transition $\chi \rightarrow_{u_s} \chi'$ between two network states $\chi$ and $\chi'$ is called a property preserving network update for a lossless interconnection network if the following conditions hold:

1. Both network states, χ and χ′, are based on deadlock-free routing configurations,

2. $\chi \rightarrow_{u_s} \chi'$ is a per-flow consistent update for reliable channels of communication, and a per-packet consistent update for unreliable connections, and

3. A temporally simultaneous processing of packets, either by χ or χ′, during a transition cannot cause a deadlock.

Condition 2 of Definition 5.5 can be weakened to per-packet consistent updates when in-order delivery is not required for reliable connections. Assuming the scheduling-aware (re-)routing is a property preserving network update, a fault-free execution of parallel applications, with respect to network-related problems, is guaranteed, and therefore an uninterrupted operation of the HPC system. Further network properties based on multiple packets, such as congestion control or quality of service, are beyond the scope of this thesis. Changing the routing configuration of an InfiniBand network cannot be done atomically. OpenSM, the subnet manager of InfiniBand, splits the LFT of each switch into multiple management datagram (MAD) packets, each carrying a payload of 64 bytes, to distribute new LFTs to all switches. Hence, at most 64 forwarding rules per switch can be changed at once, because 8 bits are used to encode an egress port. This non-atomicity can cause out-of-order packet delivery for a flow, therefore preventing per-flow consistency and disrupting InfiniBand's reliable connection service, see Section 5.2.1.1, resulting in application crashes in the worst case. Multiple network update protocols for software-defined networks (SDN) have been proposed [Rei+12; McC+15]. For example, the proposed two-phase update installs passive routing configurations, which are activated when the switch identifies a certain packet, so that subsequent packets follow the same path. Unfortunately, none of these protocols are applicable to InfiniBand due to missing features, i.e.,

the ability to deploy two routing configurations simultaneously, to distinguish between flows, or to tag packets. Hence, updating the forwarding rules of a path in InfiniBand is only safe when the used method can guarantee that no packet or flow uses the entire path during the update process. A feasible and novel solution to the problem outlined above is the five-phase update protocol, introduced hereafter, which is able to guarantee property preserving network updates for reliable connections, see Definition 5.5, within the limitations of the InfiniBand architecture specification (IBTAspec) [Inf15]. Open MPI in version 1.8.4 and OpenSM in version 3.3.18 are used for the prototype implementation of this update protocol. However, the protocol can also be applied to other services in a supercomputer, e.g., for reliable communication with a remote filesystem. The sequence diagram in Figure 5.7 exemplifies the interaction between an MPI-parallel application and OpenSM, which performs the routing reconfiguration.

The initial assumptions for this update protocol to be applicable are the following, which will ensure deadlock-freedom prior to the network update:

• OpenSM assigns two LIDs, distinguishing between baseLID and highLID (the latter equals baseLID + 1), to each node in the network via LMC = 1,

• The routing configuration for baseLIDs is calculated by Algorithm 5.1 (using VLs 0, ..., n − 2),

• The routing configuration for highLIDs is calculated by Up*/Down* and uses VL = n − 1, and

• Packets are only sent from baseLIDs to baseLIDs or among highLIDs, but not a mixture of both.

During phase 1 of the five-phase update protocol shown in Figure 5.7, an MPI-parallelized application, or more precisely the asynchronous event thread (AsyncThread; see Section 5.2.1.3) of rank 0, uses InformInfo MAD packets to subscribe for event forwarding, as outlined in the IBTAspec [Inf15, 13.4.8.3, 13.4.11]. All packets for inter-process communication within the MPI application use baseLIDs for destination addressing. The MPI library uses a library of the IB software stack to send userspace management datagram (uMAD) packets to communicate with the OpenSM. The uMAD library also allows the AsyncThread of rank 0 to periodically poll for MAD packets sent by the OpenSM. The AsyncThread subscribes only for unpath and repath traps, see IBTAspec Section 14.4.12, which are part of the general concept of asynchronous notifications or alerts between IB components. Theoretically, subscriptions and forwarded traps contain individual path records; however, the subscriptions used here apply to the whole subnet's LID range, and traps contain empty path records. This increases scalability by reducing the number of MADs handled by both sides, i.e., only one MAD, with either an unpath or repath trap, has to be sent out to each subscriber. These traps are intended to inform terminals about usability changes of certain routes. Once the OpenSM calculates a new routing configuration for the baseLIDs, as a consequence of the approach outlined in Section 5.2.2, it raises an unpath trap and forwards the trap to all subscribers, but refrains from reconfiguring any LFT. The AsyncThread of rank 0 forwards the unpath trap to all other ranks of the same application by the use of MPI calls, i.e., primarily MPI_Isend, MPI_Irecv, and MPI_Test. This tree-like approach, where the OpenSM informs each rank 0 of all running MPI applications, which then forward the traps internally, is required because the uMAD library cannot be used simultaneously by multiple asynchronous event threads on one compute node.


[Figure 5.7: sequence diagram between MPI rank 0 (AsyncThread), MPI ranks 1,...,n − 1 (AsyncThreads), and OpenSM; Phase 1: subscribe for unpath/repath traps, OpenSM calculates a new routing configuration and forwards the unpath trap (all QPs using baseLIDs); Phase 2: trigger SQD for all QPs, HCA raises the "SQ drained" event, trigger APM for drained QPs (QPs migrate to the highLIDs), ranks acknowledge and rank 0 sends a localChangeTrap(ACK(unpath)); Phase 3: OpenSM distributes the new LFTs to all switches (QPs using highLIDs) and forwards the repath trap; Phase 4: equivalent to Phase 2, with repath and migration back to the baseLIDs; Phase 5: unsubscribe from unpath/repath traps (QPs using baseLIDs)]

Figure 5.7: Sequence diagram of a five-phase update protocol to achieve property preserving network updates of InfiniBand networks

In phase 2, all processes of all subscribed parallel applications modify their queue pairs, which belong to reliable connections, to enter the draining state by triggering the SQD event, see IBTAspec Section 10.3.1.5. All send work requests posted prior to the draining, as well as all receive work requests, are processed normally. While in the draining state, MPI processes can still post new WRs to the verbs layer; however, these will not be sent out before the QP reenters the "ready-to-send" state. Once the draining of a QP is complete, the IB HCA sends an event notification to the AsyncThread. After receiving this "QP is drained" notification, the AsyncThread triggers a path migration (APM) to the highLIDs, see IBTAspec Section 10.4, changes the QP back into the "ready-to-send" state, and then loads a new alternate path. The new alternate path is always the opposite of the current path, i.e., if the current path is baseLID→baseLID, then the new alternate path will be highLID→highLID. When all local QPs have been migrated to the highLID, each process informs rank 0 and acknowledges the successful migration. The AsyncThread of rank 0 uses a "local changes" trap, see IBTAspec Section 14.3.13, to inform the OpenSM about the successful draining of baseLID-related traffic, after it has received acknowledgments from all ranks.

When the OpenSM receives confirmation from all subscribers, indicating that no reliable connections are using baseLIDs in the entire network, the protocol switches to phase 3. OpenSM reconfigures all switches with the new LFTs for the baseLIDs, which were calculated during phase 1. The routing configuration for the highLIDs is never changed, since it relies on the oblivious Up*/Down* routing. These routes for highLIDs are kept in place until a failure in the network requires fault-recovery and rerouting for baseLIDs and highLIDs. Property preserving network updates during fault-recovery in an InfiniBand network are beyond the scope of this thesis.

Phase 4 starts after all InfiniBand switches have been reconfigured, and it is essentially a copy of phase 2, with the exception that the OpenSM requests the subscribers to migrate all their reliable connections from the highLIDs back to the baseLIDs. During phase 5, the subnet manager listens for applications canceling their forwarding subscriptions and waits for new scheduling-aware rerouting requests. Upon completion of the MPI-based application, the asynchronous event thread of rank 0 sends unsubscribe InformInfo MADs to the OpenSM. The whole draining and path migration process is transparent to the overlying application. At most, the application could experience a minor latency increase for a particular QP while it is draining, and reduced throughput while the highLIDs are used.
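For the per-QP part of phase 2 (and, symmetrically, phase 4), the corresponding libibverbs calls could look as follows. This is a compile-time sketch only: error handling, the surrounding uMAD/trap handling, re-arming the next alternate path (IBV_QP_ALT_PATH together with IBV_MIG_REARM), and any firmware-specific attributes for the SQD-to-RTS transition are omitted (see also the limitations in Section 5.3.1).

/* Sketch of the per-QP steps of phase 2 with plain libibverbs calls; assumes an
 * established reliable connection and APM-capable hardware. */
#include <infiniband/verbs.h>
#include <string.h>

/* Step 1: ask the HCA to drain the send queue of one reliable QP. */
static int drain_send_queue(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_SQD;               /* enter "send queue drain" */
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
}

/* Step 2: after IBV_EVENT_SQ_DRAINED arrived for this QP, trigger the
 * automatic path migration to the armed alternate path
 * (baseLID->baseLID  ==>  highLID->highLID) and return to "ready to send". */
static int migrate_and_resume(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.qp_state       = IBV_QPS_RTS;
    attr.path_mig_state = IBV_MIG_MIGRATED;    /* force APM to the alternate path */
    return ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MIG_STATE);
}

/* Simplified event loop of the AsyncThread: wait for the drain notification. */
static void wait_for_sq_drained(struct ibv_context *ctx)
{
    struct ibv_async_event ev;
    while (ibv_get_async_event(ctx, &ev) == 0) {
        int drained = (ev.event_type == IBV_EVENT_SQ_DRAINED);
        if (drained)
            migrate_and_resume(ev.element.qp);
        ibv_ack_async_event(&ev);
        if (drained)
            break;      /* in the real protocol: acknowledge towards rank 0 */
    }
}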

5.3 Evaluation

This evaluation is split into three separate parts. To begin with, Section 5.3.1 provides a list of shortcomings of the prototype implementation of the five-phase update protocol due to current circumstances. The subsequent Sections 5.3.2 and 5.3.3 assess the advantages of the scheduling-aware routing approach in comparison to other flow-oblivious routing approaches.


5.3.1 Current Limitations and Problems

When this research was conducted, minor problems were experienced while implementing the five-phase network update protocol shown in Figure 5.7. For the sake of completeness, the following list mentions these limitations without going into much detail about possible solutions. First, the communication between the asynchronous event thread of rank 0 and the OpenSM is potentially subject to packet loss. The reason is that the (u)MAD packets used to subscribe and to forward traps must use QP0 and QP1. These special QPs are configured for an unreliable transport service, see IBTAspec Sections 3.9.4 and 3.9.5. However, both the uMAD library and OpenSM usually send MADs multiple times if they are not acknowledged in time. Whenever the OpenSM reaches the maximum number of retries without receiving an acknowledgment, it assumes an erroneous termination of the application and unsubscribes it from the internally stored list of subscribers. Second, Open MPI combined with the openib component does not support simultaneous calls to the MPI API from multiple threads of one process. Manually serializing MPI calls between the main application and the AsyncThread via a pthread mutex is the currently chosen workaround. And lastly, any attempt to modify QPs into the draining state was prevented by the tested firmware versions of the IB devices of the testbed. Depending on the firmware version, either the verbs call was possible but had no effect, or the firmware rejected the verbs call. Therefore, the following sections omit practical results of the property preserving network update protocol on the petascale Taurus HPC system. However, the entire rerouting and update protocol, as outlined in Sections 5.2.2 and 5.2.3, has been tested successfully (apart from the QP draining) on a smaller testbed, which will be demonstrated in Section 5.3.3.5.

5.3.2 Theoretical Evaluation of Network Metrics

The evaluation and comparison of the effective edge forwarding index and dark fiber percentage, see Definitions 5.1 and 5.2, is based on the same batch job history presented in Figure 5.1 for the two HPC systems, Taurus and TSUBAME2. Hence, the toolchain of Chapter 4 is used to "exactly replay" the same job composition of each system, sampled every 10 min, and to investigate the effects of different routing algorithms. These algorithms are Up*/Down*, fat-tree, and DFSSSP, all implemented in OpenSM, which are compared against the SAR implementation. The contenders were chosen based on the results of Section 4.4.4. As mentioned in Section 3.2.6, the vendor of the Taurus HPC system originally suggested the usage of fat-tree routing, before SAR was deployed in the course of this research, see Section 5.3.3. Furthermore, fat-tree routing is the current default on the TSUBAME2 supercomputer.

5.3.2.1 Effective Edge Forwarding Index

This section analyzes two metrics with respect to the EFI. The first metric is $\gamma^{e}(I,R,J) := \max_{c_q \in C^{*}} \gamma^{e}_{R,J}(c_q)$, i.e., the maximum $\gamma^{e}_{R,J}$ of all links, which indicates hotspots of links shared by multiple jobs. This metric is similar to the EFI of the interconnection network, see Definition 3.12, however constrained to "useful" routes for jobs.

Equivalently, the second metric is the maximal edge forwarding index $\gamma^{e}(I,R,j)$ per job j, which should also be as low as possible, because it bounds the worst-case congestion within a job, i.e., the link with the most intra-job routes crossing it. The results for $\gamma^{e}(I,R,J)$ are shown in the top plot of Figures 5.8a and 5.8b. As one can see, the scheduling-aware routing (blue line) outperforms DFSSSP for the Taurus system and outperforms the other three routing algorithms on the TSUBAME2 supercomputer. A summary of the results is also provided in Table 5.1 for Taurus and in Table 5.2 for TSUBAME2 to ease quantitative comparisons. The results in the two tables show the improvement by SAR for each metric (given as maximum and as average across all sample points within the investigated month, as well as the improvement in percent). For example, the 279.0 and 57.3 for DFSSSP mean that the scheduling-aware routing reduced $\gamma^{e}(I,R,J)$ by 279.0 (or 50.8%) at least once compared to DFSSSP routing, and by 57.3 (or 23.3%) on average during the full month. The figures show a reduction of $\gamma^{e}(I,R,J)$ when SAR is used for both HPC systems, especially for TSUBAME2. This indicates that the worst-case link congestion is lowered. The tremendous differences between fat-tree and the scheduling-aware routing on TSUBAME2, around day 16 and day 28, indicate very unfortunate job-to-node placements, which do not match the default fat-tree routing of TSUBAME2. The day 16 outlier will be analyzed in more detail in Section 5.3.2.3. SAR slightly increases $\gamma^{e}(I,R,J)$ on Taurus in comparison to fat-tree and Up*/Down* routing. However, on average SAR assigns only one more path to the link with the highest load. This disadvantage in the worst-case bound of SAR is negligibly small, especially in the context of the other two metrics, link utilization and link availability. Please note that Taurus had a fault-free and regular topology during the investigated month, and a valid assumption is that SAR will outperform fat-tree and Up*/Down* if the system's network topology becomes more irregular over time due to network faults, see Chapter 4. The second metric, the maximal edge forwarding index per job (second plot of Figure 5.8), shows a similar behavior. The SAR approach lowers the maximum number of intra-job routes per link compared to the other routings, which should increase the achievable throughput within a job, as will be showcased in Section 5.3.3.4.

5.3.2.2 Available Links per Job and Dark Fiber Percentage

The edge forwarding index provides upper bounds for the worst-case congestion; however, a more direct metric for utilization should be evaluated as well. One possibility is to analyze the number of switch-to-switch links that are available to each job and to the overall set of jobs. Similarly, the dark fiber percentage of Definition 5.2 specifies the inverse of this link utilization, and reports how much of the interconnection hardware of the supercomputer is theoretically dispensable.

The dark fiber percentage $\theta_{R,J}$ is shown in the third plot of Figures 5.8a and 5.8b for both supercomputers, and is summarized in the third row of Table 5.1 and Table 5.2, respectively. Especially for TSUBAME2, the improvements to $\theta_{R,J}$ by the scheduling-aware routing are clearly visible in the plot.


[Figure 5.8a: Taurus petascale supercomputer; Figure 5.8b: TSUBAME2 petascale supercomputer; each panel plots, per day of the month (Feb. '15), the maximum effective EFI, the maximum EFI per job (averaged), the dark fiber percentage in %, and the number of links per job (averaged) for DFSSSP, Up*/Down*, fat-tree, and SAR]

Figure 5.8: Replay of the job history (Figure 5.1) for two HPC systems, sampled every 10 min; Four routings applied per sampling point; Metrics collected: maximal $\gamma^{e}_{R,J}$ across all links (top), $\gamma^{e}(I,R,j)$ averaged over all jobs (second plot), dark fiber percentage (third plot), and used links averaged over all jobs (bottom); Lower is better for the first three plots; Graph of fat-tree overlaps Up*/Down* in Figure 5.8a due to similar results

Table 5.1: Improvements by the scheduling-aware routing compared to DFSSSP, fat-tree, and Up*/Down* routing for the Taurus HPC system

                                      DFSSSP                     fat-tree                   Up*/Down*
Metric                                max. / in %   avg. / in %  max. / in %   avg. / in %  max. / in %   avg. / in %
max_{c_q ∈ C*} γ^e_{R,J}(c_q)         279.0 / 50.8  57.3 / 23.3  18.0 / 21.2   -1.1 / -0.6  18.0 / 21.2   -1.1 / -0.6
avg_{j ∈ J} γ^e(I,R,j)                 16.0 / 39.0   4.3 / 23.4   1.6 / 11.4   -0.3 / -2.1   1.6 / 11.4   -0.3 / -2.1
θ_{R,J} [in %]                          9.38          6.03         7.64          3.62         7.64          3.62
avg_{j ∈ J} #links(j)                  16.5 / 15.0   7.7 / 11.1  14.1 / 13.8    6.6 / 9.4   14.1 / 13.8    6.6 / 9.4

Table 5.2: Improvements by the scheduling-aware routing compared to DFSSSP, fat-tree, and Up*/Down* routing for the TSUBAME2 HPC system

                                      DFSSSP                     fat-tree                    Up*/Down*
Metric                                max. / in %   avg. / in %  max. / in %    avg. / in %  max. / in %   avg. / in %
max_{c_q ∈ C*} γ^e_{R,J}(c_q)         321.0 / 61.2  119.4 / 38.5  1186.0 / 71.2  80.5 / 29.7  354.0 / 48.7  69.9 / 26.8
avg_{j ∈ J} γ^e(I,R,j)                 18.9 / 46.0    6.7 / 30.1    38.0 / 49.8   2.7 / 14.8   10.6 / 29.9   2.4 / 13.2
θ_{R,J} [in %]                         12.06           7.63         17.74          9.99         9.24          5.51
avg_{j ∈ J} #links(j)                  49.2 / 26.7   18.3 / 17.4    75.1 / 43.9  26.5 / 27.3   22.3 / 14.9   7.4 / 6.4

SAR increases the number of links available to running batch jobs by up to 17.74% for TSUBAME2 in comparison to the default routing on this system. For both systems, SAR consistently reduces $\theta_{\mathrm{SAR},J}$ and thus enables a more efficient resource utilization. Furthermore, the scheduling-aware routing utilizes between 3.6% and 9.9% more links on the month-long average than the other oblivious routing mechanisms. Lastly, the bottom plots of Figure 5.8 show that SAR increases the number of intra-job links as well. The principle "higher is better" applies to these two plots. Theoretically, this leads to a higher effectively available bandwidth for the applications. Depending on the routing algorithm and HPC system, the SAR approach allows the average multi-switch job to use between 6 and 26 more links.

5.3.2.3 Comprehensive Outlier Analysis of Day 16

The most noticeable differences between the scheduling-aware routing and the fat-tree routing, visible in Figure 5.8b and Table 5.2, are the two outliers on day 16 and day 28. The improvement by SAR, considering all simultaneously executed batch jobs, is up to 72%, which is worth investigating.


Figure 5.9: Heat map of the EFIs for inter-switch links of TSUBAME2 for the 200-node batch job on day 16 of Feb. '15 (color scale 0 – 1,272; switch GUID labels omitted); Used routing algorithm: fat-tree (Section 3.3.5); Corresponding histogram shown in Figure 5.11a; Link coloring similar to Figure 5.2

Figure 5.10: Heat map of the effective EFIs $\gamma^{e}_{R,j}$ for inter-switch links of TSUBAME2 for the 200-node batch job j on day 16 of Feb. '15 (switch GUID labels omitted); Used routing algorithm: scheduling-aware routing (Algorithm 5.1); Corresponding histogram shown in Figure 5.11b


[Figure 5.11a: histogram for Figure 5.9; Figure 5.11b: histogram for Figure 5.10; number of links vs. effective edge forwarding index per link]

Figure 5.11: Histograms of the effective EFI $\gamma^{e}_{R,j}$ for inter-switch links; Fat-tree routing (left) vs. scheduling-aware routing (right) for the 200-node batch job on TSUBAME2; Limited to $\gamma^{e}_{R,j} \geq 200$

Analysis of the jobs and effective edge forwarding indices on day 16 of February 2015 shows that one 200-node job is the main reason for the spike. This 200-node scientific application ran for ≈24 h on the TSUBAME2 supercomputer. The nodes belonging to this job were spread across 15 leaf switches of the fabric. The node allocation in combination with the oblivious fat-tree routing yields particularly unfavorable intra-job EFIs, e.g., one port of one level 1 switch forwards a total of 1,272 paths towards the leaf switch, see the red link in Figure 5.9. Gray links indicate network hardware which is unavailable to the batch job. The histogram of links per EFI, for effective edge forwarding indices larger than 200 in Figure 5.11a, reveals two additional ports overloaded with ≈600 intra-job routes. Figures 5.10 and 5.11b show that SAR is able to reduce $\gamma^{e}(I,\mathrm{SAR},j)$ for this 200-node job j down to 376. Furthermore, the number of fabric ports available to the 200-node job increases from 1,274 to 1,555 while using SAR, which is not directly visible in Figure 5.11.

5.3.3 Practical Evaluation on a Production System

The scheduling-aware routing, as presented in Section 5.2.2.2, has now been deployed for more than one year on the petascale Taurus production system [ZIH13] at TU Dresden. To recapitulate Section 3.2.6, the system's InfiniBand fabric is designed as a multi-island network: multiple smaller 2-level full-bisection fat-trees are connected by a 216-port director switch. The following four subsections show numbers collected while the supercomputer was open to regular users. However, although the usual runtime of the SAR implementation in OpenSM fluctuates around 5 s for the 2,065-node system, detailed runtime comparisons for calculating the linear forwarding tables with SAR are omitted for the following reasons:

• SAR and DFSSSP have the same runtime complexity of O(|N|² · log|N|) for the node set N,

• Measurements revealed a negligible runtime overhead introduced by the SAR extensions, which are shown in Algorithm 5.1, and


• The competitive runtime of DFSSSP routing has been shown before.

For example, refer to related work [DHN11; SBH16] or Section 6.3.3, respectively, for detailed comparisons of the runtime of the deadlock-free SSSP routing.

5.3.3.1 Runtime of the Filtering Tool

The current filtering and reconfiguration tool runs with a 5 min interval, see Figure 5.6, to minimize the chance that SAR reroutes for batch jobs that only run for a very short time. The runtime of the tool, including polling the SLURM controller and analyzing currently running jobs, has been measured for each run. The average time spent by the filtering tool is 16 s, with a low of 0.02 s. In 99.1% of the cases the runtime was below 2 min. Only on three occasions in over a year was the runtime above 10 min. The vast majority of the runtime of the filtering tool results from the latency to obtain the list of scheduled jobs via the squeue command, which depends on the load of the SLURM controller.
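A minimal sketch of this polling step is given below (Python, illustrative only, not the actual filtering tool); it assumes a SLURM installation where squeue supports the --noheader, --states, and --format options, and the chosen format fields are an assumption rather than the tool's real query:

    import subprocess

    def poll_running_jobs():
        # Ask the SLURM controller for all currently running jobs:
        # %A = job ID, %D = number of allocated nodes, %N = compressed node list.
        result = subprocess.run(
            ["squeue", "--noheader", "--states=RUNNING", "--format=%A %D %N"],
            capture_output=True, text=True, check=True)
        jobs = []
        for line in result.stdout.splitlines():
            job_id, node_count, node_list = line.split(maxsplit=2)
            jobs.append((int(job_id), int(node_count), node_list))
        return jobs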

5.3.3.2 New Routing Configurations per Day

The number of reconfigurations of the network per day, due to a substantive change in the job mix and job locations, ranges between 0 and 57, with an average of 14 reconfigurations per day. Therefore, approximately every two hours the system's forwarding tables are adjusted to increase the performance of the running parallel applications. Only on four days in over a year of operation were no reconfigurations needed; three of these four days fell on weekends and one on a Monday. Hence, these days match the typical days of lower load on a production system in a university environment.

5.3.3.3 Runtime for Configuring LFTs of all IB Switches

The time it takes to reconfigure all switches in the IB network is of importance. Either because it defines the time window during which out-of-order packet delivery can happen if property preserving network updates are not enforced, or because it defines the time frame during which the network throughput might be decreased due to the use of the highLIDs, which are routed by the less efficient Up*/Down*. An average of 4.6 µs has been measured to send one LFT block, which contains only 64 egress ports, to a switch and receive an acknowledgment. Hence, the number of blocks per switch is ⌈max(LID)/64⌉, taking into account that the maximum LID assigned in the system by the subnet manager can be considerably larger than |N|. Instead of reconfiguring all LFT blocks of a switch at once, OpenSM iterates over the block numbers and sends out the first LFT block of each switch, before processing the second block, and so on. So, theoretically OpenSM could reconfigure one block number for all 210 switches in the Taurus system in under 1 ms. However, measurements revealed a runtime between 25 ms and 50 ms for this process. Despite this OpenSM-internal overhead, the whole LFT reconfiguration for the supercomputer is completed in ≈0.8 s on average. Hence, the likelihood for an out-of-order packet delivery, which cannot be compensated by InfiniBand's packet timeout/retry mechanism (usually configured by Open MPI to be larger than 30 s) and which therefore causes an application crash, is close to zero. In fact, no application crashes caused by SAR on the HPC system have been observed, even though the system cannot use the full network update protocol discussed in Section 5.2.3 due to limitations in the HCA firmware.
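The arithmetic behind these numbers can be sketched as follows; this is a minimal, hypothetical estimate which only models the ⌈max(LID)/64⌉ blocks per switch and the measured 4.6 µs per block, and deliberately ignores the OpenSM-internal overhead per block-number iteration (the chosen max_lid value is a placeholder):

    import math

    LFT_BLOCK_SIZE = 64      # egress ports (LIDs) covered by one LFT block
    T_BLOCK = 4.6e-6         # measured time to send one block and receive the ack [s]

    def lft_blocks_per_switch(max_lid):
        # Number of LFT blocks needed to cover all LIDs assigned by the subnet manager.
        return math.ceil(max_lid / LFT_BLOCK_SIZE)

    def reconfiguration_lower_bound(max_lid, num_switches):
        # Naive lower bound: every block of every switch is transferred back-to-back.
        return lft_blocks_per_switch(max_lid) * num_switches * T_BLOCK

    # Illustrative values only (210 switches as in Taurus, max_lid is a placeholder):
    print(reconfiguration_lower_bound(max_lid=4096, num_switches=210))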


[Plot omitted; MPI_Alltoall runtime [in ms] over the iteration counter for fat-tree routing, DFSSSP, and scheduling-aware routing; the theoretical optimal runtime of 8.7 ms, assuming no congestion (based on ib_send_bw), is marked]
Figure 5.12: Runtime measurement for MPI_Alltoall (with 1 MiByte in each send buffer) on 28 compute nodes using three different routing approaches

5.3.3.4 Effects on MPI Traffic

The optimal zero-load latency between any two nodes in the system is not affected by the SAR approach, because the underlying algorithm always calculates shortest paths. Therefore, any reconfiguration of the fabric with Algorithm 5.1 will not increase the number of hops between nodes. Changes in the observable latency for small messages are possible, i.e., worse than the optimal latency for individual messages, depending on the congestion in the system. However, from the two network metrics evaluated in Section 5.3.2, it can be deduced that the likelihood of congestion is reduced and therefore the observable latency for small messages improves on average while using SAR. Hence, latency measurements or simulations are not a subject of this thesis, since the optimal latency does not change. Instead, an MPI runtime measurement is conducted on the production HPC system to showcase the impact of the scheduling-aware routing compared to other routing approaches while trying to fully utilize the available throughput and stress the network. The benchmark emulates a bulk-synchronous application using an MPI_Alltoall collective operation followed by an MPI_Barrier to synchronize the time measurement. Every MPI process sends 1 MiByte to every other process, and the runtime of the slowest process is written into a log file. This benchmark is scheduled to 28 nodes of one full-bisection fat-tree island of Taurus. While the benchmark was running simultaneously with jobs of other users, the routing has been switched between three different routing engines. Due to natural fragmentation of the production system, the 28 nodes ended up connected to ten different switches in the Taurus island. Figure 5.12 summarizes the results: First, the default fat-tree routing configures the LFTs for the first three minutes. Then, the subnet manager switches to standard DFSSSP routing, approximately between iteration 1,600 and 2,600. The observed runtime increases by 7.1% due to the change from fat-tree routing to DFSSSP routing. For the last time segment, the filtering tool triggers the SAR approach, which initiates the scheduling-aware routing for the whole system. The path optimizations by the scheduling-aware routing decrease the runtime for the benchmark by 17.6% compared to DFSSSP routing and 11.7% compared to fat-tree routing. The plot also includes the theoretical peak performance, assuming all data can be transferred without congestion and MPI library overhead. This shows that SAR essentially halves the congestion overhead caused by the highly optimized fat-tree routing, even on a regular fat-tree.
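The measurement loop of such a benchmark can be sketched as follows; this is a minimal, illustrative reimplementation with mpi4py under the stated assumptions (1 MiByte per destination, runtime of the slowest process logged per iteration) and not the original benchmark code; the log file name and the iteration count are placeholders:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nprocs = comm.Get_size()
    MSG = 1 << 20                                   # 1 MiByte per destination process

    sendbuf = np.zeros(nprocs * MSG, dtype=np.uint8)
    recvbuf = np.empty_like(sendbuf)

    for iteration in range(5000):
        comm.Barrier()                              # synchronize the time measurement
        t_start = MPI.Wtime()
        comm.Alltoall(sendbuf, recvbuf)             # every process sends 1 MiByte to every other
        runtime = MPI.Wtime() - t_start
        # The slowest process determines the runtime of the bulk-synchronous step.
        slowest = comm.reduce(runtime, op=MPI.MAX, root=0)
        if comm.Get_rank() == 0:
            with open("alltoall_bench.log", "a") as log:
                log.write(f"{iteration} {slowest * 1e3:.3f}\n")   # runtime in ms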


[Plot omitted; throughput [in Gbyte/s] over the sample counter for Inter-Switch Link 1 and Inter-Switch Link 2]
Figure 5.13: Visualization of the network update protocol (without QP draining) and path migration between two inter-switch links on a testbed during high MPI load

5.3.3.5 Property Preserving Network Updates on Testbed

Even though draining of the queue pairs, i.e., triggering the SQD event as outlined in Section 5.3.1, currently does not work, the OpenSM is able to successfully perform path migrations and update the routing configuration. A small test system is used to showcase that the network updates are safe in practice. This testbed consists of two IB QDR switches and four nodes with IB FDR host channel adapters, with two nodes connected to each switch. Additionally, the two IB switches are connected by two links. The modified OpenSM, which supports the network update protocol, is running on an administration node. An MPI benchmark, which repeatedly performs MPI_Bcast with a 1 MiByte send buffer, is executed on two compute nodes to emulate a continuous flow of data from one node to the other. Meanwhile, the OpenSM performs a scheduling-aware rerouting, see Figure 5.13. Performance counters of the inter-switch links are queried approximately every 0.07 s while running the benchmark. These counters, in particular PortXmitData [Inf15, 16.1.3.5], are used to calculate the shown throughput. Additionally, OpenSM adds an artificial delay of 10 s between sending unpath and repath traps to highlight the path migration. The area between sample 400 and 560 shows this delay, i.e., phases 2 and 3 of the update protocol, where traffic is routed via the highLID or inter-switch link 2, respectively. The PortXmitData counter of the utilized link indicates an increment of ≈215 Mbyte per sample interval, which adds up to ≈3.1 Gbyte/s due to the 0.07 s sample rate. These values correlate with the maximal throughput of QDR links, i.e., the bottleneck between the two switches. The path/flow migration between the two links is clearly visible in Figure 5.13 and the update protocol has no negative impact on the throughput of the broadcast operation.
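The conversion of the sampled counters into the plotted throughput can be sketched as follows; a minimal illustration which assumes that PortXmitData is incremented once per four transmitted octets, as specified in [Inf15], and which leaves out how the counters are actually queried:

    def link_throughput_gbyte_s(counter_samples, interval=0.07):
        # PortXmitData counts transmitted data in units of four octets, so the delta
        # between two consecutive samples is multiplied by 4 to obtain bytes.
        rates = []
        for previous, current in zip(counter_samples, counter_samples[1:]):
            transmitted_bytes = (current - previous) * 4
            rates.append(transmitted_bytes / interval / 1e9)   # Gbyte/s per sample
        return rates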

6 Routing on the Channel Dependency Graph

Despite the advances of recent years in developing new deadlock-free routings for a variety of interconnection technologies and topologies, and in improving their efficiency, as introduced in Chapter 5, a major problem persists. No algorithmic approach is known to perform deadlock-free destination-based routing which is generally applicable to all technologies and topologies, and which delivers better performance than Up*/Down*, compare Table 2.1. Such a routing function would be usable on at least 85% of the supercomputers of the current TOP500 list, only counting the systems based on IB, Ethernet, and Omni-Path in Figure 1.1a. This chapter introduces a novel approach of routing on the channel dependency graph to overcome these issues. Section 6.1 elaborates further on the shortcomings when path finding and deadlock-freedom of the routing algorithm are considered separately. The following Section 6.2 outlines the new routing approach, called Nue, combining the applicability of Up*/Down* with the path balancing of DFSSSP. Nue will be evaluated in Section 6.3 in terms of multiple network metrics, such as path length, path balancing, and achievable throughput, and will be compared to other state-of-the-art routings. Furthermore, the general applicability of Nue and its runtime to calculate a new network state are demonstrated.

6.1 Limitations of Multi-Step Routing Approaches

Two general concepts exist for creating deadlock-free routing functions, both consisting of two phases, as described hereafter. The first concept is an analytical solution, also referred to as turn model, which basically restricts the usage of possible turns within the network as a first phase, either based on knowledge of the underlying topology or similarly to the Up*/Down* routing, before calculating the routes in the second phase. Examples are the topology-aware DOR routing, see Section 3.3.3, L-turn, or multiple Up*/Down* (MUD), listed in Table 2.1. Unfortunately, this approach generally results in non-shortest paths and poor path balancing, and therefore high latency and low throughput, respectively. In contrast, the current best practice of the second concept, e.g., as implemented by DFSSSP and LASH, see Section 3.3, is to decouple the two problems of path creation in phase 1 and deadlock-free assignment to virtual channels in phase 2. The reason is that both problems require a different graph representation of the network and routes. The deadlock-free assignment needs the channel dependency graph, recall Definition 3.7, an abstract graph induced by the routes, while the route calculation has to take place beforehand and is usually performed on a graph identical to the network. Generally, this approach cannot be bounded with respect to the number of virtual channels required to solve the deadlock problem, which is illustrated by the following Lemma 6.1 and network example.


[Two diagrams omitted: (a) 5-node ring network; (b) Corresponding cyclic CDG]
Figure 6.1: Using a shortest-path routing for the 5-node ring network I (6.1a) induces a cyclic channel dependency graph D (6.1b) with two cycles

Lemma 6.1. Let R : C × N → C be a given routing function (or more generally R′ : N × N → C), see Definition 3.4, requiring a minimum of k(I,R) ∈ N virtual channels for deadlock-freedom on an arbitrary network I = G(N,C). Furthermore, let k ∈ N be an arbitrary but limited number of virtual channels, with k ≥ 1, provided by the interconnection technology. Assuming R defines solely shortest paths P^s_{nx,ny} for any pair of nodes nx,ny ∈ N, then a network topology I′ = G(N′,C′) exists such that k(I′,R) > k, hence rendering the routing R inapplicable to I′.

Proof. Let us assume the opposite of the lemma is true, i.e., for any network I and given number of virtual channels k ∈ N there exists a deadlock-free and shortest-path routing R with k(I,R) ≤ k. For example, assuming the absence of virtual channels (equivalent to k = 1) for a five-node ring topology, as shown in Figure 6.1a, a deadlock-free and shortest-path routing should still exist. Any shortest-path routing for this network would route P^s_{n1,n3} via n2, connecting cn1,n2 with cn2,n3, route P^s_{n2,n4} via n3, connecting cn2,n3 with cn3,n4, etc. Therefore, the resulting channel dependency graph depicted in Figure 6.1b is the same for any shortest-path routing. Applying Theorem 3.1 reveals that all these routing functions are not deadlock-free, since no virtual channels are available to break the cyclic channel dependency graph.
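The cyclic CDG of Figure 6.1b can be reproduced with a short script; the following is a minimal sketch (plain Python, not the thesis toolchain) which routes all node pairs of the 5-node ring along shortest paths, collects the induced channel dependencies, and detects the cycle with a depth-first search:

    def shortest_path_channels(src, dst, n=5):
        # Walk around the n-node ring in the shorter direction (ties broken clockwise)
        # and return the list of directed channels used by this path.
        step = 1 if (dst - src) % n <= n // 2 else -1
        path, cur = [], src
        while cur != dst:
            nxt = (cur + step) % n
            path.append((cur, nxt))
            cur = nxt
        return path

    # Channel dependency graph induced by routing every node pair of the 5-node ring.
    cdg_edges = set()
    for s in range(5):
        for d in range(5):
            if s != d:
                channels = shortest_path_channels(s, d)
                cdg_edges.update(zip(channels, channels[1:]))   # consecutive channels depend on each other

    def has_cycle(edges):
        # Recursive DFS cycle detection with gray/black coloring of the visited channels.
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)
        color = {}
        def dfs(u):
            color[u] = "gray"
            for v in graph.get(u, []):
                if color.get(v) == "gray":
                    return True
                if color.get(v) is None and dfs(v):
                    return True
            color[u] = "black"
            return False
        return any(color.get(u) is None and dfs(u) for u in list(graph))

    print(has_cycle(cdg_edges))     # True: the induced CDG of the 5-node ring is cyclic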

Furthermore, determining the minimal number of required virtual channels to provide a deadlock-free path-to-virtual layer assignment is NP-complete [DHN11]. This, together with the trivial counterexample in Figure 6.1, shows that a two-phase approach, which first calculates shortest paths through the network for all node pairs, is impractical for designing a universally applicable routing which allows low latency and high throughput. Assuming it is possible to combine the information required to solve both problems within one graph, this might allow imposing routing restrictions on the path creation on demand, because the effects of a partial or full path on the CDG can be checked simultaneously. Hence, it would be possible to avoid closing cycles in the CDG while calculating the paths, instead of breaking the cycles later. Assuming this new graph represents one virtual layer, see Definition 3.8, then a graph search algorithm, such as Dijkstra's algorithm, can traverse the graph and construct routes from all nodes to all other network nodes, and the routes are deadlock-free within this layer. The type of graph search and the information assigned to this graph influence the resulting routes, e.g., source-routing or destination-based routing could be possible. Furthermore, assuming the used network technology supports an arbitrary, but fixed, number of virtual channels, k > 1, then individual destination nodes can be assigned to different virtual layers. As a consequence, the graph search algorithm within one layer is able to calculate deadlock-free routes from all source nodes to the subset of destination nodes assigned to this virtual layer. Therefore, all routes in all virtual layers are deadlock-free without exceeding the virtual channel constraint. To sketch this idea, a 3-dimensional InfiniBand torus(4,4,3) network with four terminals per switch and one failed switch, i.e., 47 switches in total, is given as an example topology. This IB network shall support four virtual channels, or IB virtual lanes, respectively. The throughput for various deadlock-free routing algorithms, as implemented in InfiniBand's subnet manager, is shown in Figure 6.2a. The shift exchange pattern of Section 4.3.3.2 with 2 KiByte messages yields the given simulation results. Additionally, Figure 6.2b shows the needed number of VCs for these deadlock-free routing algorithms. The topology-aware Torus-2QoS routing, see Section 3.3.7, enables a high throughput within the virtual channel limit, but will fail if a second switch failure occurs in the same torus ring. Up*/Down* routing and LASH, details in Sections 3.3.4 and 3.3.6, are inefficient in comparison to Torus-2QoS. The throughput of the topology-agnostic DFSSSP routing [DHN11] lies in between; however, the deadlock-free single-source shortest-path routing exceeds the given VC limit and is therefore inapplicable. The achievable throughput while using the novel routing approach, which Section 6.2 will elaborate in detail, for the 4x4x3 torus is included in Figure 6.2a for every number of virtual channels within the 4 VC limit. Hence, Nue shows resiliency to network faults and the ability to offer competitive throughput. Surprisingly, Up*/Down* routing slightly outperforms Nue routing in the absence of virtual channels. However, it is noteworthy that the induced CDG of the Up*/Down* implementation in OpenSM actually contains a cycle for this topology and is therefore not deadlock-free, even though the simulated communication pattern did not deadlock. A corrected Up*/Down* implementation would probably decrease the achievable throughput even further.


[Two bar charts omitted, x-axis: routing algorithm (Up*/Down*, LASH, DFSSSP, Torus-2QoS, Nue): (a) Throughput for an all-to-all pattern [in Gbyte/s]; (b) Required VCs for deadlock-freedom, with the maximum of 4 available VCs marked]
Figure 6.2: Simulated throughput for an all-to-all operation and required VCs for deadlock-freedom for different routing algorithms (hatched bars indicate that only 1 VC is used); Simulated network: 4x4x3 3D torus, 4 terminals per switch, 1 faulty switch, QDR InfiniBand, maximum of 4 VCs available


[Two diagrams omitted: (a) 5-node ring network with shortcut, the shortcut between n3 and n5 being a duplex channel; (b) Corresponding complete CDG D]
Figure 6.3: Complete CDG D for the 5-ring network with shortcut, see Figure 6.3a, assuming k = 1; All channels are in the unused state (⇒ no routing applied, yet)

6.2 Nue Routing

The graph capable of combining the information required to solve the path calculation as well as the virtual channel assignment is called the complete channel dependency graph. The following Section 6.2.1 will outline how to construct the complete channel dependency graph. Based on this graph, a deadlock-free, oblivious, and destination-based Nue routing is developed in Sections 6.2.2 – 6.2.5, which serves as one example for the general idea of routing within the dependency graph. Section 6.2.6 introduces potential improvements for Nue's path calculation and for increasing the achievable throughput on the network. Moreover, Section 6.2.7 proves the correctness and completeness of Nue routing, and analyzes the algorithm's time and memory complexity.

6.2.1 Complete Channel Dependency Graph

To create paths within the CDG, a complete representation of all possible channel dependencies is needed, instead of a graph solely induced by a specific routing algorithm. Therefore, the complete channel dependency graph, using the adjacency of channels, is defined as follows:

Definition 6.1 (Complete Channel Dependency Graph). Let I = G(N,C) be a network according to Definition 3.1, then the complete channel dependency graph D := G(C,E), with E ⊆ C × C, is defined by ∀(nx,ny),(ny,nz) ∈ C, nx ≠ nz : ((nx,ny),(ny,nz)) ∈ E. Further, the complete CDG D is defined to be cycle-free if the subgraph D ⊆ D is acyclic for any CDG D induced by a routing function according to Definition 3.7. Assuming the network technology supports k > 1 virtual channels, then the definition of the i-th complete channel dependency graph Di := G(Ci,Ei), with Ei ⊆ Ci × Ci, is equivalent to the definition of D.
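A direct translation of Definition 6.1 into code is straightforward; the following minimal Python sketch (illustrative only) derives the vertex set and edge set of the complete CDG from a network given as a list of duplex links:

    def complete_cdg(duplex_links):
        # Every duplex link {a, b} yields the two directed channels (a, b) and (b, a).
        channels = set()
        for a, b in duplex_links:
            channels.add((a, b))
            channels.add((b, a))
        # An edge ((nx, ny), (ny, nz)) exists for every pair of adjacent channels with
        # nx != nz, i.e., immediate u-turns are excluded, as in Definition 6.1.
        edges = {(c1, c2)
                 for c1 in channels for c2 in channels
                 if c1[1] == c2[0] and c1[0] != c2[1]}
        return channels, edges

    # 5-node ring with shortcut between n3 and n5 (nodes numbered 1..5), cf. Figure 6.3a:
    links = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (3, 5)]
    C, E = complete_cdg(links)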

The conventional channel dependency graph D induced by a routing function does not have to be stored separately; it can be saved indirectly by assigning states to the vertices, i.e., the channels of I, and to the edges of the complete CDG D. These states are unused, used, or blocked, whereby the blocked state is only used for edges. An edge e ∈ E is considered to be used if and only if e belongs to the induced CDG D, i.e., e is induced by a routing R. Furthermore, an edge e ∈ E is marked as blocked if and only if G(C, E ∪ {e}) forms a cyclic graph for an otherwise acyclic CDG D = G(C,E).


Figure 6.3 shows the complete channel dependency graph D for the 5-node ring network with shortcut, shown in Figure 6.3a, for k = 1. Each vertex/edge of D is in the unused state, i.e., no routing has been applied. This ring network is the same as used earlier in Figure 3.1 and the subsequent exemplification, and it will be used throughout the remainder of this section. Finding paths within the complete channel dependency graph D is almost as trivial as finding the routes within the original network I, as sketched hereafter. Without going into great detail how the routing algorithm maps paths into the complete CDG and where the routing starts the search, let us assume the algorithm first of all searches a route from network node n3 to n4 and finds the possible path Pn3,n4 via n5, as depicted in Figure 6.4a. Hence, both the vertices cn3,n5 and cn5,n4, and the edge in between the two, are changed into the used state. In step 2, see Figure 6.4b, the routing algorithm tries to determine a route from node n5 to n3. A solution is to start at cn5,n4 and to traverse the path Pn5,n3 = (cn5,n4, cn4,n3) in the directed graph D, changing the involved vertices and the edge into the used state. In the next iteration of the algorithm, step 3, the route from node n4 to n5 should be calculated. If the routing starts at vertex cn4,n3, as shown on the right side of Figure 6.4c, then the only possible path to reach n5 within D requires the usage of the edge (cn4,n3, cn3,n5). However, changing this edge into the used state is prohibited, since it would close a cycle in the induced CDG D, or the used-marked subgraph of D, respectively. Hence, the algorithm must assign the blocked state to the edge (cn4,n3, cn3,n5) and must search for an alternative path by starting at vertex cn4,n5, which directly yields a valid path Pn4,n5 = (cn4,n5).

[Three states of the complete CDG omitted: (a) Step one: Pn3,n4 via n5 ✓; (b) Step two: Pn5,n3 via n4 ✓; (c) Step three: Pn4,n5 via n3? ✗]
Figure 6.4: Sketching the process of finding paths between network nodes within the complete CDG D while avoiding to close any cycles in the induced CDG D

6.2.2 Escape Paths for the Static Nue Routing

Cherkasova et al. [CKR96] made an important observation during the development of their smart routing algorithm: an incremental algorithm calculating paths and adding routing restrictions at the same time, i.e., prohibiting the assignment R(cq,·) = cq+1 for any destination node, can lead to a routing impasse. Basically, an iteratively working graph algorithm which does not allow backtracking and does not have all-embracing knowledge about the graph might not traverse all nodes of the graph. Hence, further progress in the algorithm, completing the node search, is impossible due to previously added routing restrictions.

While Cherkasova et al. [CKR96] report that this state was observed only rarely for their investigated networks, the experiments conducted for this thesis showed that it is a permanent problem for larger networks. For adaptive routing algorithms it is common to avoid deadlocks by utilizing a separate set of buffers, similar to a virtual layer, which acts as "escape paths" [DT03; Gar+13]. Within this layer a fixed deadlock-free routing, such as Up*/Down*, is employed, and switches transfer a blocked packet into the escape paths for the remainder of the route to its destination. It is necessary to adapt the concept of escape paths for the oblivious Nue routing to ensure that at least one valid path, which does not induce a cycle in the corresponding channel dependency graph, exists between every given node pair. However, these escape paths are not implemented with a separate set of buffers, as done for adaptive routings, but as available fall back paths within a virtual network layer Li, see Definition 3.8. Such a fall back path is not necessarily the shortest path available through the graph. The disadvantage of fixed and predefined escape paths is the imposed channel dependencies, which cannot act as routing restrictions. This means that Nue assigns the used state to a subset of vertices and edges of D, even though these are not necessarily induced by R. The same holds for Up*/Down* routing, which predefines a much larger (and usually unnecessary) number of routing restrictions, i.e., forbidden "down"→"up" turns, which generally results in poor path balancing. Escape paths for Nue routing inevitably serve as potential imaginary paths which also influence the generation and balancing of real paths by the routing function. Therefore, the escape paths should induce as few channel dependencies as possible while minimizing the average path length across the escape paths. The former requirement theoretically increases the degree of freedom the routing function has in placing routing restrictions on-demand, while the latter ensures that the latency, or path length, respectively, is minimized in case an escape path is the only available routing option. The preferable approach of defining the escape paths in D is to use a spanning tree in I, since a spanning tree does not induce a cyclic CDG while minimizing the number of channels required to connect all nodes in I. The following Definition 6.2 details the construction of the escape paths.

Definition 6.2 (Escape Paths). Let N^d ⊆ N be a set of destination nodes for Nue within the network I = G(N,C) and let τ(I) = G(N,C^τ), with C^τ ⊆ C, be a spanning tree of network I with root node nr ∈ N. Then, the escape paths D^τ := G(C^τ, E^τ) are a subgraph of D induced by a routing R^τ : N × N^d → C^τ. Assuming k > 1, then D^τ_i for virtual layer Li and root nr,i is defined equivalently, with the substitution of C^τ by C^τ_i ⊆ Ci.

In this context, the term "spanning tree" refers to the originally undirected duplex channels of I, and therefore for any (nx,ny) ∈ C^τ it holds that (ny,nx) ∈ C^τ as well. Obviously, there always exists exactly one cycle-free path between any pair of nodes within a spanning tree. Hence, all cycle-free, destination-based routings R^τ induce the same escape paths D^τ, or D^τ_i respectively, for a given spanning tree. Figure 6.5 shows the escape paths marked in the complete channel dependency graph for N^d = N when n5 is used as the root node for τ(I).
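Deriving escape paths from a spanning tree can be sketched as follows; a minimal illustration with networkx, not the Nue implementation, which uses a BFS spanning tree as one possible choice of τ(I) and marks the channels and channel dependencies of the tree paths towards each destination as used:

    import networkx as nx

    def escape_paths(G, root, destinations):
        # One possible spanning tree tau(I): the BFS tree rooted at the chosen root node.
        tree = nx.bfs_tree(G, root).to_undirected()
        used_channels, used_dependencies = set(), set()
        for dst in destinations:
            for src in G.nodes:
                if src == dst:
                    continue
                hops = nx.shortest_path(tree, src, dst)          # unique tree path src -> dst
                path = list(zip(hops, hops[1:]))                 # directed channels along the path
                used_channels.update(path)
                used_dependencies.update(zip(path, path[1:]))    # dependencies between consecutive channels
        return used_channels, used_dependencies

    # 5-node ring with shortcut (Figure 6.3a), root n5, destination set N^d = N:
    G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (3, 5)])
    channels, dependencies = escape_paths(G, root=5, destinations=list(G.nodes))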


[Two diagrams omitted: (a) Spanning tree τ(I) rooted at n5 for I of Figure 6.3a; (b) Complete CDG with marked escape paths D^τ]
Figure 6.5: Acyclic escape paths D^τ for the 5-node ring network with shortcut, see Figure 6.3a, are marked as solid boxes/lines within the complete channel dependency graph D (6.5b), assuming k = 1, destination set N^d = N, root node nr = n5, and spanning tree τ(I) = G(N, C \ {(n1,n2),(n2,n1),(n3,n4),(n4,n3)}), see 6.5a

The escape paths for Nue routing have two functions: The first, as mentioned above, is to define an initial set of channel dependencies, which once added to the complete CDG cannot be removed to resolve cyclic states in the routing-induced CDG. Therefore, the escape paths ensure a deadlock-free path from all nodes in N to all nodes in N^d_i within a given virtual layer Li. Hence, escape paths for each layer are needed, so that at least one valid path exists which could be found by Nue. Unfortunately, despite having escape paths D^τ_i, Nue can still end up in an impasse due to the iterative nature of the path creation, which will be shown in Sections 6.2.6.2 and 6.2.6.3. Therefore, secondly and more importantly, the escape paths are to be actively used by Nue after reaching an unsolvable impasse for a destination node, later introduced as the fall back scenario in Section 6.2.6.2.

6.2.3 Choosing Root Node for the Spanning Tree

Nue routing uses the escape paths when encountering an impasse, as further illustrated in Section 6.2.6.2. Therefore, Nue should ensure that the paths in τ(I) are as short as possible. This will reduce latency and the oversubscription of individual channels with paths in case of a required fall back. Assuming N^d_i ⊂ N is a true subset, then an observation is that the number of initial channel dependencies derived from the escape paths depends on the location of the root node as well. The previously investigated 5-node ring topology with shortcut, recall Figure 6.3a, can be examined again to illustrate this property.

A first choice for the root node of the spanning tree τ(I) could be n5, since it allows for the shortest average path length in the spanning tree. However, if n5 is chosen as root node, then the escape paths for the subset N^d_i = {n1,n2,n3}, as specified in Definition 6.2, induce five initial channel dependencies, see the five arrows in Figure 6.6a. Alternatively, if nr = n2 is chosen instead, then only four initial channel dependencies are predetermined, as shown in Figure 6.6b. As a consequence, a possible approach is to use a root node which is the most central with respect to the subset N^d_i ⊂ N to reduce the initial channel dependencies within virtual layer Li. Freeman et al. [Fre77] introduced the metric of betweenness centrality, which is ideal for this purpose to determine the root node for the spanning tree.


[Two diagrams omitted, showing the 5-node ring with shortcut and the spanning tree marked in black: (a) Five channel dependencies for nr = n5; (b) Four channel dependencies for nr = n2]
Figure 6.6: Arrows show the initial channel dependencies of the escape paths for the node set N^d_i = {n1,n2,n3} and root node n5 (left) or root node n2 (right) of the spanning tree marked in black

For a given graph G(N,C), the betweenness centrality CB(n) for a node n ∈ N is defined through the ratio of all shortest paths to shortest paths crossing n, see Equation 6.1 hereafter. Let σuv := |{P^s_{u,v} | u,v ∈ N}| be the absolute number of distinguishable shortest paths between any two nodes u and v, and let σuv(n) := |{P^s_{u,v} | u,v ∈ N ∧ (·,n) ∈ P^s_{u,v}}| be the number of shortest paths which include node n; then the betweenness centrality is equal to:

    CB(n) := ∑_{n ≠ u ≠ v ∈ N} σuv(n) / σuv          (6.1)

Brandes et al. [Bra01] developed an algorithm to compute the betweenness centrality CB(n) of every node n ∈ N in the graph, to determine the most central node with respect to N. The algorithm for unweighted graphs has an O(|N| · |C|) time complexity. Unfortunately, the algorithm is not directly applicable to the previously stated problem (except for the trivial case of N^d = N), because the most central node with respect to the subset N^d_i ⊂ N is required to reduce the initial channel dependencies of the escape paths towards nodes of N^d_i. The proposed alternative for Nue is to solve a related problem, i.e., to calculate the convex subgraph for the node set N^d_i, and apply Brandes' algorithm on the convex subgraph. The following Definition 6.3 specifies the construction of the convex subgraph for N^d_i:

Definition 6.3 (Convex Subgraph). The convex subgraph Hi := G(N^H_i, C^H_i) for a set N^d_i ⊆ N includes all nodes of N^d_i, as well as some nodes of N \ N^d_i which are intermediate nodes of the shortest paths P^s_{·,·} between nodes of N^d_i. Therefore, the node set of Hi is defined as

    N^H_i := {n ∈ N | n = nx ∨ ∃ nx,ny ∈ N^d_i : (·,n) ∈ P^s_{nx,ny}}

and the edge set C^H_i of the convex subgraph is defined analogously.

In this thesis, interconnection networks are represented by an unweighted graph, see Definition 3.1, which allows the computation of the convex subgraph with a time complexity of O(|N^d_i| · (|N| + |C|)) with a breadth-first search (forward step) and an inverse traversal of the graph to find all shortest paths (backward step), as shown in Algorithm 6.1.


Algorithm 6.1: Convex subgraph for node set N^d of an unweighted graph G(N,C)
Input: I = G(N,C), node subset N^d ⊆ N
Result: Node set N^H of the convex subgraph H for N^d
1   N^H ← N^d
2   foreach node nx ∈ N^d do
3       foreach node n ∈ N do
4           n.distance ← ∞
5           n.processed ← false
        /* forward step */
6       FIFO queue Q ← {nx}
7       while Q ≠ ∅ do
8           u ← Q.dequeue()
9           foreach (u,v) ∈ C do
10              if v.distance = ∞ then
11                  v.distance ← u.distance + 1
12                  Q.enqueue(v)
        /* backward step */
13      foreach node ny ∈ N^d do
14          ny.processed ← true
15          FIFO queue Q ← {ny}
16          while Q ≠ ∅ do
17              u ← Q.dequeue()
18              foreach (u,v) ∈ C with v.processed = false do
19                  if v.distance = u.distance − 1 then
20                      if v ∉ N^H then N^H ← {v} ∪ N^H
21                      v.processed ← true
22                      Q.enqueue(v)

This algorithm is derived from Dijkstra's algorithm to compute the shortest paths from one node to all other nodes. In the forward step, the distances from nx ∈ N^d_i to all nodes in N are computed. The subsequent backward step then traverses from all ny ∈ N^d_i towards nx and adds all nodes to the node set of Hi which have not been processed before. The break at already processed nodes, see the loop condition at line 18, is possible, because it means that all shortest paths from this node towards nx have been identified earlier. The depicted pseudocode omits the layer index i, because the algorithm is identical for each virtual layer and the only difference between layers is the node set N^d_i. Section 6.2.5 elaborates the creation of these subsets of N in Algorithm 6.3.

After computing the convex subgraph Hi, Brandes' algorithm is executed on Hi instead of I to find the node nr,i ∈ N^H_i which maximizes the betweenness centrality CB(n) with respect to N^d_i. This node is used as the root node of the spanning tree for the escape paths from all nodes towards the destination nodes in N^d_i, to define the initial (non-removable) dependencies, see Section 6.2.2. Note that, for k = 1, Nue assigns all destination nodes N^d_i to one virtual layer. Hence, H1 is equivalent to the network I and Brandes' algorithm can be executed directly on I.
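The root selection for one virtual layer can be sketched as follows; a minimal illustration with networkx (whose betweenness_centrality implements Brandes' algorithm), which approximates the selection by ranking all nodes of the convex subgraph instead of computing the centrality restricted to N^d_i, and which is not the OpenSM implementation:

    import networkx as nx

    def spanning_tree_root(G, dest_subset):
        # Convex subgraph: the destinations plus every node lying on a shortest path
        # between two destinations (Definition 6.3).
        convex_nodes = set(dest_subset)
        for u in dest_subset:
            for v in dest_subset:
                if u != v:
                    for path in nx.all_shortest_paths(G, u, v):
                        convex_nodes.update(path)
        H = G.subgraph(convex_nodes)
        # Brandes' algorithm on the convex subgraph; take the most central node as root.
        centrality = nx.betweenness_centrality(H)
        return max(centrality, key=centrality.get)

    # 5-node ring with shortcut and destination subset {n1, n2, n3}, cf. Figure 6.6:
    G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (3, 5)])
    root = spanning_tree_root(G, {1, 2, 3})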

6.2.4 Dijkstra’s Algorithm for the complete CDG

Previous research [HSL09; DHN11], as well as the scheduling-aware routing optimization introduced in Chapter 5, shows the effectiveness of a routing algorithm based on Dijkstra's single-source shortest-path algorithm.


Algorithm 6.2: Modified Dijkstra's algorithm within the complete CDG D
Input: I = G(N,C), D = G(C,E), source n0 ∈ N^d
Result: Pny,n0 for all ny ∈ N (and D is cycle-free)
1   foreach node n ∈ N do
2       n.distance ← ∞
3       n.usedChannel ← ∅
4   foreach channel c ∈ C do
5       c.distance ← ∞
    /* Need a source channel c0 to start the Dijkstra's algorithm: if n0 is a terminal then use
       the unique (n0,·); if n0 is a switch then D has multiple (n0,·) => use fake channel (∅,n0)
       to connect to all */
6   if n0 is switch then c0 ← (∅,n0) else c0 ← (n0,·)
7   n0.distance ← 0
8   c0.distance ← 0
9   FibonacciHeap Q ← {c0}
10  while Q ≠ ∅ do
11      cp ← Q.findMin()
12      foreach (cp,cq) ∈ E with (cp,cq).state ≠ blocked do
            // Let ncq ∈ N be the tail of the directed channel cq
13          if cp.distance + cq.weight < ncq.distance then
14              (cp,cq).state ← used                  // modifies D
15              if D is cycle-free then               // cycle check is outlined in Section 6.2.6.1
16                  if ncq.usedChannel ≠ ∅ then
17                      Q.remove(ncq.usedChannel)
18                  Q.add(cq)
19                  cq.distance ← cp.distance + cq.weight
20                  ncq.distance ← cp.distance + cq.weight
21                  ncq.usedChannel ← cq
22              else
23                  (cp,cq).state ← blocked
    ···   // Optimizations are explained in Sections 6.2.6.2 and 6.2.6.3
    ···
70  if n0 is switch then
71      remove fake channel c0 from D and (c0,(n0,·)) from E

The effectiveness is quantified by the algorithm's ability to balance the paths across the available channels. The modified Dijkstra algorithm of SSSP, extended to work on multigraphs, see Section 3.3.2, has a low time complexity of O(|N|² · log|N|) when used with a Fibonacci heap. The destination-based (DF-)SSSP routing initializes the network channels with positive weights of |N|² to ensure the calculation of shortest paths. Subsequent path calculations from each network node to all other nodes by Dijkstra's algorithm, executed on the network graph I = G(N,C), define the routes in inverse direction, and therefore the entire network state. The balancing is achieved by updating the channel weights of the used channels after all paths towards one destination node, i.e., the source node for the modified Dijkstra algorithm, are computed.

Nue routing follows a similar approach, but within the complete channel dependency graph D, or Di respectively, instead of the actual graph representing the network I = G(N,C). Hence, Dijkstra's algorithm needs to be modified to be applicable to Di and enhanced to satisfy additional constraints. The pseudocode shown in Algorithm 6.2 outlines the modified version for the complete CDG D only, but it can easily be extended for Di due to its similarity to D, see Definition 6.1. Algorithm 6.2 computes shortest paths from one source node n0 ∈ N^d to all other nodes in the complete CDG while complying with the cycle-free constraint.


Algorithm 6.3: Nue routing calculates all paths within a network I = G(N,C) for a given number of virtual channels k ≥ 1
Input: I = G(N,C), k ∈ N
Result: Path Pnx,ny for all nx,ny ∈ N
1   Partition N into k disjoint subsets N^d_1, ..., N^d_k of destinations
2   foreach virtual layer Li with i ∈ {1,...,k} do
        // Check attached comments for details about each step
3       Create a convex subgraph Hi for N^d_i                            // Section 6.2.3
4       Identify central nr,i ∈ N^H_i of Hi via Brandes' algorithm       // Section 6.2.3
5       Create a new complete channel dependency graph Di                // Section 6.2.1
6       Define escape paths D^τ_i for spanning tree root nr,i            // Section 6.2.2
7       foreach node n ∈ N^d_i do
8           Identify deadlock-free paths P·,n                            // Section 6.2.4
9           Store these paths, e.g., in forwarding tables
10          Update channel weights in Di for these paths

However, due to this cycle-free constraint, these paths are not necessarily shortest paths with respect to the actual network I. Following these paths in the opposite direction along the used channels, i.e., the nodes of D, results in the paths for the destination-based routing. Nue routing initializes the channel weights with |N|² and updates them similarly to DFSSSP [DHN11]. However, the fact that channels are the vertices of D changes the computation, see line 13, and weights are stored at the adjacent channel instead of at the edge between two channels. The advantage of the approach of routing on the dependency graph is that channel dependencies are directly considered, see line 15, and routing restrictions can be identified instantaneously, as outlined in Section 6.1. Therefore, paths do not have to be recomputed afterwards to avoid the routing restriction, as it is the case with smart routing [CKR96], for example.

6.2.5 Nue Routing Function

Combining the knowledge of Sections 6.2.1 – 6.2.4 into one routing function, see Algorithm 6.3, allows Nue to achieve its objectives, i.e., to be able to balance the paths globally while not exceeding a given number of virtual channels, k ≥ 1, used for deadlock-freedom. In the first step, Nue routing partitions the nodes of the network I into k disjoint subsets, i.e., N^d_1, ..., N^d_k with N^d_i ∩ N^d_j = ∅ for all 1 ≤ i, j ≤ k, i ≠ j. Each subset N^d_i will denote a set of destinations for calculated paths, i.e., P·,n for n ∈ N^d_i, within virtual layer Li. While the exact partitioning of N will not influence whether Nue can calculate deadlock-free routes for I or not, the partitioning affects the path balancing. Nue routing uses a multilevel k-way partitioning algorithm [KK98] with O(|C|) time complexity to partition the network I. Moreover, random partitioning and partial clustering, i.e., all terminals connected to a switch are assigned to the same partition, have been tested for this thesis. However, Nue with the multilevel k-way partitioning outperformed the other two partitioning algorithms with respect to the evaluations carried out in Section 6.3. An optimal partitioning algorithm, i.e., a partitioning which results in a maximized path balancing and which minimizes the edge forwarding indices for the switches, is beyond the scope of this thesis and requires further research. The node set N^d_i is used to calculate a convex subgraph Hi. Brandes' algorithm is then executed on Hi to determine the betweenness centrality for each node of Hi, ensuring the selection of an appropriate root node nr,i for the escape paths, see Section 6.2.3.


After creating a complete CDG Di for virtual layer Li, which complies with Definition 6.1, Nue routing determines the escape paths D^τ_i. The acyclic escape paths are derived from a spanning tree rooted at nr,i according to Definition 6.2, i.e., the channels C^τ_i and edges E^τ_i are changed into the used state. This completes the initialization phase of the complete CDG Di to perform the graph search within Di with the modified Dijkstra algorithm, see Algorithm 6.2. Each node of N^d_i is used as a source for Algorithm 6.2. The subsequent weight update for the used channels aims for an improved global balancing of the paths.

6.2.6 Optimizations for Nue Routing

The modified version of Dijkstra's algorithm, applicable to the complete channel dependency graph, includes a cyclicality check for the graph D, which has not been looked at so far. Usually, verifying whether or not a graph contains a cycle requires a rather time-consuming depth-first search. The time complexity of a depth-first search executed on a graph G(V,E) is O(|V| + |E|) for the cycle search [Cor+01, Section 22.3]. Therefore, the following Section 6.2.6.1 outlines a possible approach to minimize the time needed to determine the cyclicality of D. Furthermore, as discussed in Section 6.2.2, Nue routing encounters routing impasses regularly for larger networks. Solving each impasse by falling back to the escape paths would result in a path balancing similarly poor as for Up*/Down* routing. Hence, Sections 6.2.6.2 and 6.2.6.3 introduce methods to reduce the number of fall backs.

6.2.6.1 Numbering of Subgraphs and Cycle Search

Algorithm 6.2 has an O(|Ci| · log|Ci| + |Ei|) time complexity, if applied on virtual layer Li, and if the search for cycles in Di is omitted. This time complexity is derived from Dijkstra's original shortest-path algorithm when used in combination with a Fibonacci heap [FT87]. However, Algorithm 6.2 potentially needs to check the graph Di for cycles every time the state of an edge (cp,cq) = e ∈ Ei changes, see line 15. The time complexity of each full cycle search in Di = G(Ci,Ei) is O(|Ci| + |Ei|), according to the previous section. A trivial proposition for directed graphs is that connecting two acyclic digraphs by one directed edge does not create a cyclic digraph. Hence, assuming it is possible to distinguish between vertex-disjoint, used subgraphs of Di, induced by a routing R as explained in Section 6.2.1, then cycle searches can be avoided by applying memorization, since connecting two disjoint, acyclic, and used subgraphs with a used edge creates a new acyclic subgraph. Therefore, Nue routing incorporates an identification number ω for used and cycle-free subgraphs of Di, which is an extension of the three states which have been utilized before, see Section 6.2.1.


[Two diagrams omitted, showing the complete CDG of the 5-node ring with shortcut annotated with subgraph numbers ω: (a) Initial state in Algorithm 6.2; (b) State after five steps]
Figure 6.7: State change of D after five steps with Algorithm 6.2, starting from cn1,n2; Figure 6.7a: initial state (line 9) before the while loop is executed; Figure 6.7b: state of D after five iterations of the loop

The function ω : Ci ∪ Ei → Z_0^+ ∪ {−1}, with

    ω(x) = −1   if Di ∪ {x} forms a cycle in Di, i.e., x is blocked,
    ω(x) = 0    if x ∉ Di ∧ x ∉ D^τ_i, i.e., x is unused,                 (6.2)
    ω(x) ≥ 1    if x is in the used state,

is used to identify the vertex-disjoint, cycle-free subgraphs and the blocked edges, i.e., ω(e) = −1. An example is shown in Figure 6.7a with ω = 1 pointing to the escape paths of the complete channel dependency graph D, i.e., ω(C^τ ∪ E^τ) = 1 for D^τ = G(C^τ,E^τ), assuming k = 1. The advantage of this numbering is the ability to identify the conditions during the routing with the modified version of Dijkstra's algorithm under which a cycle search is needed or can be omitted. Hence, at a node cp ∈ C of the complete CDG with assigned ω(cp) ≥ 1 and an adjacent node cq ∈ C, with (cp,cq) =: e ∈ E, there are four possible conditions. Fortunately, three of these conditions do not require a cycle search:

(i) ω(e) = −1 =⇒ no cycle search is needed, because the result is known already, i.e., these edges are ignored by the conditional loop in line 12 of Algorithm 6.2;

(ii) ω(e) ≥ 1 =⇒ ω(cp) = ω(cq) = ω(e) =⇒ no cycle search is needed, because e was used before and is therefore part of an acyclic subgraph;

(iii) ω(e) = 0 ∧ ω(cp) ̸= ω(cq) =⇒ no cycle search is needed, because the directed edge e connects two disjoint acyclic subgraphs and therefore cannot close a cycle;

(iv) ω(e) = 0 ∧ ω(cp) = ω(cq) =⇒ a cycle search is required, because e adds a used edge connecting two nodes of a single acyclic subgraph and might induce a cycle.

Algorithm 6.4 shows the handling of these conditions, including the cycle search performed with a depth-first search (DFS). For simplicity, the algorithm shows the procedure for D, i.e., k = 1.


Algorithm 6.4: Search for cyclic used subgraphs in the complete CDG D
Input: D = G(C,E), channels cp,cq ∈ C
Result: true if a cycle was found; false otherwise
1   if ω(cp,cq) = −1 then
2       return true                              // State described in condition (i)
3   else if ω(cp,cq) ≥ 1 then
4       return false                             // State described in condition (ii)
5   else if ω(cp) ≠ ω(cq) then
        /* Merge two disjoint subgraphs */
6       if ω(cq) = 0 then ω(cq) ← ω(cp) and return false
7       foreach x ∈ C ∪ E with ω(x) = ω(cq) do
8           ω(x) ← ω(cp)
9       return false                             // State described in condition (iii)
10  else
        /* Perform a DFS for cp in subgraph G with ω(G) = ω(cp) starting from cq, see condition (iv) */
11      if the D.DFS(cq,ω(cp)) does not find cp then
12          ω(cp,cq) ← ω(cp)
13          foreach x ∈ C ∪ E with ω(x) = ω(cq) ≠ 0 do
14              ω(x) ← ω(cp)
15          ω(cq) ← ω(cp)
16          return false
17      else
18          ω(cp,cq) ← −1
19          return true

The depth-first search is only performed within the selected subgraph of D identified by ω(cp). Since this subgraph is acyclic without (cp,cq), this edge must be part of a new cycle if one exists. Therefore, one depth-first search starting from cq and searching for cp is sufficient. Hence, Nue potentially omits traversing parts of the subgraph, which leads to a more efficient algorithm. Let us illustrate the conditions (ii) to (iv) with the previously investigated example of the ring topology with shortcut for k = 1. Initially, the first cycle-free subgraph is identified by assigning ω = 1 to the escape paths, see the light gray vertices and edges of D in Figure 6.7a. For the sake of simplicity, the assumption is that the first routing step of Algorithm 6.2 starts at vertex cn1,n2 (instead of c∅,n1 for source node n1) and assigns ω(cn1,n2) = 2 to it. This identifies the second used and cycle-free subgraph of D, as shown in Figure 6.7a with the dark gray vertex. At this stage, the algorithm has found two nodes of I, the source n1 and network node n2. Vertex cn1,n2 has only one adjacent vertex, cn2,n3, available via an unused directed edge. Due to the fact that ω(cn1,n2) ≠ ω(cn2,n3) holds, a cycle search can be omitted, see condition (iii). According to lines 5 – 9 of Algorithm 6.4, both subgraphs, with ω = 1 and ω = 2, are merged into one acyclic subgraph with ω = 2. Now, the adjacent vertices of cn2,n3 are cn3,n5 and cn3,n4, whereby the conditions (ii) and (iii) apply, respectively. Meaning, no cycle search is required for the second routing step, i.e., for both iterations of the loop in line 12 of Algorithm 6.2. Assuming this algorithm considers vertex cn3,n4 next, then the only available adjacent vertex is cn4,n5, which results in condition (iv), where a depth-first search is needed. A depth-first search, starting from cn4,n5 and searching for node cn3,n4 within the subgraph, checks a total of three nodes, i.e., cn5,n1, cn5,n3, and cn3,n2. Since the starting node is not found, it is possible to use (cn3,n4, cn4,n5) without closing a cycle, which concludes the third iteration of the while loop. Two more vertices of Q need to be processed, and the intermediate state of D during Algorithm 6.2, after these steps have been performed, is shown in Figure 6.7b. All subsequent executions of Algorithm 6.2 for other network nodes will increase ω of the source vertex by +1, hence the next source c0 gets ω(c0) = 3 assigned to it, and so forth.


[Two diagrams omitted: (a) Subnetwork I* of a larger network I, with an island node and an island cluster highlighted; (b) Intermediate state of the complete CDG D, annotated with subgraph numbers ω and blocked edges (ω = −1)]
Figure 6.8: Impasse of Algorithm 6.2 to reach n4 based on previously placed routing restrictions for the channel dependencies (cn1,n3, cn3,n4) and (cn7,n5, cn5,n4), which are shown as crossed out edges in I* and D

6.2.6.2 Solving Impasses for Isolated Nodes and Node Clusters

The approach of randomly removing channel dependencies, as explained in Section 6.2.2 and mentioned by Cherkasova et al. [CKR96], can lead to impasses during an iterative routing algorithm. Incrementally calculating routes and placing routing restrictions on-demand, as done by Nue routing, can lead to similar impasses. This means that isolated parts of the network, called islands from here on, can emerge to which no path can be assigned without creating a cycle in the CDG, based on previously calculated routes for other destinations. Even the escape paths, as introduced in Section 6.2.2, for the iterative, destination-based Nue routing function cannot prevent impasses, but the escape paths offer an expedient for Nue routing. To illustrate the problem, the previously used example network needs to be substituted with a larger network I, with a small subnetwork I* connected as a binary tree, as shown in Figure 6.8a, supporting no VCs (i.e., k = 1). The subgraph of the complete channel dependency graph D for the relevant parts of the network, i.e., for I*, is shown in Figure 6.8b. Assume the iterative Algorithm 6.3 has calculated all routes for j − 1 destinations, and is at an intermediate step to calculate the routes towards the j-th destination in the for loop (line 7). Therefore, parts of D will have ω = j assigned to them, i.e., ω = 1 for the escape paths plus j − 1 previously processed destinations. Now, Algorithm 6.2 reaches n3 and n5 on the shortest path via the channels cn1,n3 and cn7,n5, respectively. Due to previous routing decisions, the channel dependencies (cn1,n3, cn3,n4) and (cn7,n5, cn5,n4) are in the blocked state, as illustrated by the crossed out edges. Hence, the routing algorithm has reached an impasse and cannot calculate valid routes for node n4. There are multiple options to solve this impasse, and three of them are listed hereafter. The easiest among the possible solutions for the impasse is to simply fall back to the escape paths for the entire routing step, i.e., all routes to one specific destination node will use the escape paths instead of the paths calculated by Algorithm 6.2.


Algorithm 6.5: Local backtracking finds paths into islands
Input: I = G(N,C), D = G(C,E)
Result: Adds one c ∈ C to FibonacciHeap Q or resets routing to escape paths
    // This is the expansion of Algorithm 6.2 (while loop in line 10)
24  if Q is empty ∧ ∃ nx ∈ N : nx.usedChannel = ∅ then
25      foreach nx ∈ N with nx.usedChannel = ∅ do
26          List AlternativeChannels ← ∅
27          foreach ny ∈ N with (ny,nx) ∈ C do
28              foreach cp ∈ usedChannelStack(ny) do
                    /* Check, if all existing dependencies on ny are valid if cp is used */
29                  if (cp,·).state is not blocked then
30                      AlternativeChannels.add(cp)
31          AlternativeChannels.sortAscendingDistance()
32          foreach cp ∈ AlternativeChannels do
33              cq ← (ncp,nx)                            // i.e., ncq = nx
34              (cp,cq).state ← used
35              if D is cycle-free then
36                  Q.add(cq)
37                  cq.distance ← cp.distance + cq.weight
38                  nx.distance ← cp.distance + cq.weight
39                  nx.usedChannel ← cq
40                  Island.remove(nx)
41                  break                                // break outer loop, line 25
42              else
43                  Reset (cp,cq).state
44  if Q is empty then
45      Fall back to escape paths, i.e., all paths towards the input node n0 of Algorithm 6.2 use the escape paths
46      Terminate Algorithm 6.2

The design of the escape paths, recall Definition 6.2, enables this for all destinations N^d_i of the currently considered virtual layer Li. Unfortunately, falling back to the escape path for only a subset of the paths to one destination can violate the destination-based property of Nue. Hence, this first option is the least preferred, because impasses happen regularly, and the escape paths would be overloaded with routes. Another option is a backtracking algorithm starting from the current intermediate state of Nue routing and reverting previous decisions about chosen paths. However, this means that potentially all previously chosen (partial) paths to the current destination have to be changed, due to the channel dependencies. This results in a brute-force algorithm, because the algorithm has no knowledge of which "wrong decision" in the beginning leads to the impasse. The method would guarantee a solution, since at least one valid solution, i.e., the escape paths, exists, but it greatly increases the runtime, rendering the algorithm impractical. The proposed approach for Nue routing is to use a local backtracking algorithm, whereby only the surrounding nodes within a distance of two hops are checked for alternative routes to the island. This can be accomplished both time- and memory-efficiently, see Algorithm 6.5. If no alternative path can be found, which happens less frequently, then Nue falls back to the escape paths as described in the first option, see lines 44 – 46 of Algorithm 6.5. So, instead of having Algorithm 6.2 overwrite the used channel, see line 21, the used channels are stored in a stack. Hence, this stack of valid alternatives, potentially using a longer path to reach a certain node, is stored and accessible in a backtracking step. For simplicity, the handling of the stack in Algorithm 6.2 is omitted. Continuing the example from


Figure 6.8: If Algorithm 6.2 reaches the impasse at node n4, then it checks the stack of alternative routes to the nodes n3 and n5, and determines whether or not these can be used to reach n4, see lines 26 – 43 of Algorithm 6.5. For example, an alternative path to n3, stored in the stack, is to use channel cn2,n3 .

As Figure 6.8b illustrates in the upper right corner, the channels cn2,n3 and cn3,n4 , and the edge between them, already belong to the same acyclic subgraph identified by ω = j. Therefore, using the channel dependency (cn2,n3 ,cn3,n4 ) is a valid alternative for the modified version of Dijkstra’s algorithm to reach n4. If multiple valid alternatives exist, then Nue routing selects the shortest among them, i.e., with respect to the weight/distance parameters of the channels. After a valid alternative is found for one island node, Algorithm 6.2 continues to operate as before to ensure that paths into clusters of island nodes, see the highlighted nodes below n4 in Figure 6.8a, are calculated as well.

6.2.6.3 Using Formerly Isolated Nodes as Shortcut

The previous section explained how to use a local backtracking to solve routing impasses and how to find paths to an island node. Furthermore, these island nodes can be used to shorten the distance to previously discovered nodes. For instance, Algorithm 6.2 reaches an impasse for the network presented in Figure 6.8a and the local backtracking algorithm can find a path to node n4, as described in Section 6.2.6.2. However, the nodes n3 and n5 have already been discovered and have a certain distance from the source node of the current routing step performed with Algorithm 6.2. Assume the distance of node n3 from the current source node is six hops and the distance of n5 is nine hops when Algorithm 6.2 reaches the impasse. The local backtracking Algorithm 6.5 enables a valid path to n4 via network node n3. This former island node n4 can now be used as a potential "shortcut" to reach node n5, which shortens the distance of n5 to eight hops. However, existing channel dependencies have to be considered to make use of shortcuts during the routing within the complete channel dependency graph. While, in theory, it would be possible to invalidate decisions and dependencies for paths which use n5 as an intermediate node, this would increase the runtime of the routing algorithm. Since the channel dependencies are built incrementally by Algorithm 6.2, changing an intermediate dependency (cp,cq) potentially invalidates all existing dependencies (cq,·), as well as subsequent dependencies. To avoid the recalculation of paths, an approach is to incorporate the following optimization into Nue routing: using islands as shortcuts is only allowed if existing local channel dependencies can be kept in place, which prevents the routing from reconsidering subsequent dependencies. Algorithm 6.6 shows how to verify whether a shortcut is possible or not. Again, Figure 6.8a is used to exemplify the workings of Algorithm 6.6 to shorten previously calculated paths through former island nodes. Assume that, after the impasse of Algorithm 6.2 and the solution for a path to the island node n4 = ncq, the current state of D is as shown in Figure 6.9. So, the first step is to check whether n4 can be used as a shortcut to reach n5 or not, i.e., nu = n5 in line 48. Hence, the algorithm must verify that changing the channel dependency (cn3,n4, cn4,n5) into the used state does not induce a cycle in D. Assume further that n7 and n5 were used as intermediate nodes to reach network node n6 and subsequent nodes, not shown in Figure 6.8a.


Figure 6.9: The states of the channel dependency graph D of the subnetwork I* of Figure 6.8a; Figure 6.9a shows the state after solving the island problem for n4, and Figure 6.9b highlights the change to D when n4 is used as a shortcut to reach n5. [Figure: two panels showing the CDG over the channels cn1,n3, cn2,n3, cn3,n1, cn3,n2, cn3,n4, cn4,n3, cn4,n5, cn5,n4, cn5,n6, cn5,n7, cn6,n5, and cn7,n5 with their ω identifiers; (a) state of D before the shortcut, (b) state of D after the shortcut via cn4,n5 is found]

Algorithm 6.6: Finding shortcuts through island nodes

Input: I = G(N,C), D = G(C,E), (·, ncq) =: cq ∈ C
Result: A shortcut through ncq to adjacent nodes
// This is the expansion of Algorithm 6.5
47  if ncq was an island then
48      foreach (ncq, nu) ∈ C with ∅ ≠ nu.usedChannel ∧ cq.distance + (ncq, nu).weight < nu.distance do
49          cu ← (ncq, nu)                                // possible shortcut
50          (cq, cu).state ← used
51          if D is not cycle-free then
52              Reset (cq, cu).state
53              continue
54          List DependentChannels ← ∅
55          foreach nv ∈ N with (nu, nv) ∈ C do
56              if nv.usedChannel = (nu, nv) then
57                  DependentChannels.add(nv.usedChannel)
58          isShortcut ← true
59          foreach cv ∈ DependentChannels do
60              (cu, cv).state ← used
61              if D is not cycle-free then
62                  isShortcut ← false
63                  break
64          if isShortcut then
65              Reset ω/state for previous nu.usedChannel
66              nu.usedChannel ← cu
67          else
68              foreach cv ∈ DependentChannels do
69                  Reset (cu, cv).state


Assume further that n7 and n5 were used as intermediate nodes to reach network node n6 and subsequent nodes, not shown in Figure 6.8a. Therefore, the usedChannel variable of n6 is set to (n5,n6) and the channel dependency (cn7,n5, cn5,n6) is in the used state. The shortcut Algorithm 6.6 determines all dependent channels of n5, i.e., all channel dependencies (c·,n5, cn5,·) calculated in the current routing step, see lines 54 – 57. Afterwards, the algorithm checks for all dependent channels (n5,·) whether changing the dependency (cn4,n5, cn5,·) via the previous island induces a cycle or not. If no cycle is induced in D, then node n4 is a valid shortcut to reach node n5 and can shorten subsequent path decisions. The usedChannel variable for n5 can be changed to cn4,n5, and the previous changes to ω of cn7,n5 and to the state of (cn7,n5, cn5,n6) can be reversed.
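A compact sketch of this verification, mirroring lines 48 – 69 of Algorithm 6.6, may help to follow the bookkeeping. It is a hypothetical illustration, not the thesis code: the cdg and network helpers (set_state, reset_state, is_cycle_free, used_channel, neighbors, set_used_channel) are assumed names, channels are represented as node tuples, and, unlike the listing above, the tentative dependency (cq, cu) is rolled back on failure as well for clarity.

def try_shortcut(cdg, network, cq, n_island, n_u):
    """Try to reroute the already discovered node n_u through the former
    island node n_island via c_u = (n_island, n_u); keep the change only if
    the complete CDG stays cycle-free for all dependent channels of n_u."""
    c_u = (n_island, n_u)
    cdg.set_state((cq, c_u), "used")                  # lines 49-50
    if not cdg.is_cycle_free():                       # line 51
        cdg.reset_state((cq, c_u))
        return False
    # Gather channels that currently depend on n_u's used channel (54-57).
    dependent = [(n_u, n_v) for n_v in network.neighbors(n_u)
                 if network.used_channel(n_v) == (n_u, n_v)]
    changed = []
    for c_v in dependent:                             # lines 59-63
        cdg.set_state((c_u, c_v), "used")
        changed.append((c_u, c_v))
        if not cdg.is_cycle_free():
            for dep in changed:                       # lines 68-69
                cdg.reset_state(dep)
            cdg.reset_state((cq, c_u))
            return False
    network.set_used_channel(n_u, c_u)                # lines 64-66
    return True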

6.2.7 Correctness, Completeness & Complexity

In the following, Nue routing is analyzed with respect to its destination-based and cycle-free characteristics, see Definition 3.4, and the deadlock-freedom of Nue routing is proven as well. This means that the channel dependency graph D induced by the calculated paths is acyclic, and that Nue calculates paths for each pair of nodes independently of the underlying network or the predefined number of available virtual channels. Hence, Nue can be applied to any arbitrary interconnection network and to many of the HPC network technologies listed in Section 1.1, i.e., even to the trivial 5-node ring visualized in Figure 6.1, which was not deadlock-free routable by any of the shortest-path routing algorithms. Subsequently, Proposition 6.1 summarizes Nue's time and memory complexity.

Lemma 6.2. Nue routing is destination-based and cycle-free.

Proof. Assume Nue is not destination-based; therefore, the next channel cq+1 at a certain node is not unique for one destination and the following holds:

\[
\exists\, n_x, n_y \in N\ \exists\, c_p, c_q \in C,\ c_p \neq c_q : \quad R\bigl((\cdot, n_x), n_y\bigr) = \begin{cases} c_p \\ c_q \end{cases}
\]

Algorithm 6.2 calculates the paths Pny,nx for a source node nx and all other nodes in the network in the opposite direction of the spanning tree created by the modified Dijkstra algorithm, i.e., the paths follow the usedChannel variable towards the source node. The fact that the algorithm either assigns ∅ to usedChannel, see line 3, or one specific channel for each node, see line 21, contradicts the initial assumption. Hence, Nue routing is destination-based.

The fact that Nue is cycle-free follows directly from the fact that Nue is destination-based and from the use of positive channel weights: let Pnu,nv be a cyclic path, then either the node nv is part of the cycle or not. The former case implies that nv has a usedChannel ≠ ∅ assigned to it by Algorithm 6.2, i.e.,

\[
\exists\, c_p, c_q \in C : \quad c_p.\mathrm{distance} + c_q.\mathrm{weight} < 0 = n_v.\mathrm{distance}.
\]

This is only possible when weights are negative, which contradicts the fact that the initial weights are positive and that weight updates of channels are positive as well. In the latter case, i.e., node nv is not part of the cycle, at least one channel (·,nw) of the path has to contradict the destination-based property of Nue routing. Hence, Nue routing is cycle-free.


The deadlock-freedom of Nue routing follows directly from its constant verification that using a channel for a path towards a destination does not induce a cycle in the channel dependency graph, as shown by Lemma 6.3.

Lemma 6.3. Nue routing is deadlock-free.

Proof. According to Theorem 3.1, the routing function is deadlock-free if and only if the corresponding channel dependency graph is acyclic. For each virtual layer, Nue creates a new complete CDG Di and changes the states of the channels and channel dependencies of the escape paths to the used state. Since the escape paths are derived from a spanning tree, no cycle is induced in Di after adding the escape paths. The performed cycle checks, see line 15 of Algorithm 6.2 (Section 6.2.4), line 35 of Algorithm 6.5 (Section 6.2.6.2), and lines 51 and 61 of Algorithm 6.6 (Section 6.2.6.3), before any of the usedChannel variables are changed, prevent Nue routing from creating a cycle in the acyclic Di. As a result, the complete CDG Di for virtual layer Li, for all 1 ≤ i ≤ k, is cycle-free (Definition 6.1), and therefore Theorem 3.1 is applicable, ensuring the deadlock-freedom.
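The acyclicity check on which this proof relies can be sketched as a depth-first search over the dependencies in the used state. The following Python snippet is a minimal, hypothetical illustration; the data layout (a dictionary from a channel to the set of channels it may be followed by) and the function names are assumptions for this example, not the OpenSM implementation.

def is_cycle_free(used_deps):
    """Return True iff the directed graph of 'used' CDG edges is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    for start in used_deps:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(used_deps.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            advanced = False
            for succ in successors:
                state = color.get(succ, WHITE)
                if state == GRAY:
                    return False          # back edge, i.e., a cycle in D_i
                if state == WHITE:
                    color[succ] = GRAY
                    stack.append((succ, iter(used_deps.get(succ, ()))))
                    advanced = True
                    break
            if not advanced:
                color[node] = BLACK       # fully explored channel
                stack.pop()
    return True

def try_use_dependency(used_deps, c_from, c_to):
    """Tentatively mark (c_from, c_to) as used; keep it only if D stays acyclic."""
    used_deps.setdefault(c_from, set()).add(c_to)
    if is_cycle_free(used_deps):
        return True
    used_deps[c_from].discard(c_to)
    return False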

However, only ensuring deadlock-freedom for the calculated paths does not imply that Nue is actually able to calculate all required routes in the network, i.e., paths for every pair of nodes of the interconnect. Hence, Lemma 6.4 is required to show Nue's ability to ensure connectivity between any pair of network nodes, regardless of the given network topology or the provided number of virtual channels.

Lemma 6.4. The Nue routing function ensures connectivity between any pair of nodes in the interconnection network I = G(N,C), i.e., Pnu,nv ≠ ∅ for all nu,nv ∈ N, after the execution of Algorithm 6.3.

Proof. Assume there exists a pair of nodes, nu,nv ∈ N, for which the path Pnu,nv = ∅, i.e., Nue is incapable of calculating the path under the given virtual channel constraint. Therefore, either the network I is disconnected, which contradicts Definition 3.1, or the variable nu.usedChannel = ∅ and the modified version of Dijkstra's algorithm, see line 24 of Algorithm 6.5, reaches an impasse. Due to the fact that Pnu,nv = ∅, the local backtracking algorithm falls back to the escape paths, see line 45 of Algorithm 6.5 (Section 6.2.6.2). This means that, if nv ∈ Ni^d, then from Definition 6.2 it follows that for every nw ∈ N there exists a channel (nw,·) ∈ Ci^τ with R((nw,·), nv) ∈ Ci^τ, i.e., a path Pnu,nv = {(nu,·), ..., (·,nv)} exists using only channels in Ci^τ. This contradicts the initial assumption that Pnu,nv = ∅ holds; hence, Nue routing ensures full connectivity.

For the following complexity analysis, let ∆ denote the maximum degree of the interconnection network I = G(N,C); then it follows that |C| ≤ ∆·|N| and |E| ≤ ∆·|C| ≤ ∆²·|N|. Most terms of Proposition 6.1 below follow directly from the explanations provided in the previous Sections 6.2.1 – 6.2.6. After the partitioning of the destination nodes, which is calculated in a time complexity of O(|C|) with the multilevel k-way partitioning algorithm, see Section 6.2.5, Nue routing initializes the current complete channel dependency graph Di for the path calculation within virtual layer Li.


The generation of Di can be performed in O(∆·|Ci|), and the complete CDG requires O(|N| + |Ci| + |Ei|) memory, because it includes some information of the nodes of I as well.

The convex subgraph for the subset Ni^d ⊆ N of destinations, see Section 6.2.3, can be calculated with a time complexity of O(|Ni^d| · (|N| + |C|)), because each network node of I is visited only once during the forward step and once during the backward step. Assuming the worst case, i.e., Ni^d ≈ N, yields a time complexity of O(∆ · |N|²) for each virtual layer. The memory complexity to calculate the convex subgraph, with Algorithm 6.1, is O(|N| + |C|).

As mentioned in Section 6.2.3, Brandes' algorithm computes the betweenness centrality values for each node of Hi in O(|Ni^H| · |Ci^H|) and requires O(|Ni^H| + |Ci^H|) memory [Bra01, Theorem 8]. The convex subgraph Hi for Ni^d is bound by I, i.e., the network itself. Hence, the time complexity for Nue routing to identify the most central node nr,i ∈ N, with respect to Ni^d, can be approximated with O(|N| · |C|), and the memory complexity can be approximated with O(|N| + |C|).

The computation of the spanning tree rooted at nr,i for the escape paths can be accomplished with the original version of Dijkstra's algorithm for the network I. Hence, the time complexity of this step is O(|N|·log|N| + |C|) when a heap with O(1) time complexity for the "decrease-key" operation is used, e.g., a Fibonacci heap. The corresponding memory complexity is O(|N| + |C|). This completes the initialization phase for one virtual layer Li.

The time complexity, excluding the acyclicity check, for the modified version of Dijkstra's algorithm, as presented in Algorithm 6.2, is O(|Ci|·log|Ci| + |Ei|) when executed on a complete CDG Di. As previously mentioned, a heap with O(1) time complexity for the "decrease-key" operation is needed to achieve this complexity. The optimization, see Section 6.2.6.1 and conditions (i) – (iv), results in a differentiated time complexity for the acyclicity check of Di, see line 15 of Algorithm 6.2. In the best case, i.e., if condition (i) or (ii) is applied, the time complexity is O(1). The same applies to the "merge" of two acyclic subgraphs assuming that ω(cq) is zero, see line 6 of Algorithm 6.4. An actual merge, see lines 7 and 8, of two vertex-disjoint and acyclic subgraphs with differing identification numbers can only be performed |N| times throughout the execution of Nue routing. Hence, the time complexity of all merge steps is at most O(|N| · (|Ci| + |Ei|)). The time complexity for any depth-first search within a subgraph of Di is O(|Ci| + |Ei|). However, the DFS has to be executed at most once per edge e ∈ Ei and per virtual layer Li, because afterwards the state of e is either used or blocked, and condition (iv) cannot be applied again. Therefore, the aggregated time complexity for |N| executions of Algorithm 6.2, i.e., once for each destination node, is O(|N|·(|Ci|·log|Ci| + |Ei|) + (|N| + k·|Ei|)·(|Ci| + |Ei|)). The memory complexity for the modified version of Dijkstra's algorithm is O(|N| + |Ci| + |Ei|). The DFS to search for cycles can be performed within Di, and therefore does not increase the memory complexity.

Considering the entire routing function, the time complexity, as well as the memory complexity, to store the paths in forwarding tables is O(|N|²), and the time complexity to update the channel weights is O(|N|²). The summary of the complexity analysis of the individual steps for Nue routing is given in the following proposition:


Proposition 6.1. The time complexity of Nue routing for a given interconnection network I, with a fixed maximum switch radix ∆, with ∆ ∈ N, and a fixed number of supported virtual channels k, with k ∈ N, is

\[
\begin{aligned}
&\phantom{{}+{}} O(|C|) && \text{Partitioning}\\
&+ k \cdot O(\Delta \cdot |C_i|) && \text{Build complete CDG}\\
&+ k \cdot O(\Delta \cdot |N|^2) && \text{Convex } H_i\\
&+ k \cdot O(|N| \log |N| + |C|) && \text{Spanning Tree}\\
&+ O\bigl(|N| \cdot (|C_i| \log |C_i| + |C_i| + |E_i|) + k \cdot |E_i| \cdot (|C_i| + |E_i|)\bigr) && \text{Routing}\\
&+ O(|N|^2) && \text{Forwarding Tables}\\
&+ O(|N|^2) && \text{Weight Updates}\\
={}& O\bigl(|N|^2 \cdot (\Delta \log(\Delta \cdot |N|) + k \cdot \Delta^4)\bigr)\\
={}& O\bigl(|N|^2 \cdot \log |N|\bigr)
\end{aligned}
\]

while its memory complexity (including storing the result) is

\[
O\bigl(|N| + \Delta^2 \cdot |N| + \Delta \cdot |N| + |N|^2\bigr) = O\bigl(|N|^2\bigr)
\]

assuming that the number of channels of I can be approximated by ∆·|N| and that |Ei| ≤ ∆²·|N|.
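The collapse of the sum into the closed forms above is not spelled out step by step; the following derivation is only a sketch of that simplification, using the bounds |C| ≤ ∆·|N|, |Ci| ≤ |C|, and |Ei| ≤ ∆²·|N| stated in this section:

\[
\begin{aligned}
O\bigl(|N| \cdot (|C_i|\log|C_i| + |C_i| + |E_i|)\bigr) &\subseteq O\bigl(|N|^2 \cdot (\Delta\log(\Delta|N|) + \Delta + \Delta^2)\bigr),\\
O\bigl(k \cdot |E_i| \cdot (|C_i| + |E_i|)\bigr) &\subseteq O\bigl(|N|^2 \cdot (k\Delta^3 + k\Delta^4)\bigr).
\end{aligned}
\]

All remaining summands of Proposition 6.1 are dominated by these two terms, and Δ + Δ² ≤ 2·k·Δ⁴ for k, Δ ≥ 1, which yields O(|N|²·(Δ·log(Δ·|N|) + k·Δ⁴)); for a fixed switch radix Δ and a fixed number of virtual channels k this reduces to O(|N|²·log|N|).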

Therefore, Nue routing achieves one of the initial goals stated in Table 2.1, i.e., to have a time complexity less than or equal to O(|N|³). Section 6.3.3 will present Nue's runtime in comparison to other algorithms, showing the competitiveness of the developed Nue implementation.

6.3 Evaluation of Nue Routing

For the following sections, an implementation of Nue routing has been added to the InfiniBand subnet manager, which allows simulations and measurements for the InfiniBand network technology. These simulations rely on the toolchain presented in Chapter 4, enabling the collection of edge forwarding index statistics for various topologies and routings, the measurement of their runtimes, as well as flit-level accurate throughput measurements by use of the OMNeT++ framework. Section 6.3.1 compares Nue routing to other routing algorithms with respect to the path length, since Nue does not necessarily create shortest paths in the network, and with respect to the induced edge forwarding index. Subsequently, throughput simulations for the all-to-all exchange pattern, recall Section 4.3.3.2, are carried out in Section 6.3.2 for different topologies routed with routing algorithms implemented in the production-quality InfiniBand network manager, OpenSM in version 3.3.19 [Mel13]. Most of these routing algorithms either omit the calculation of switch-to-switch paths entirely, e.g., fat-tree routing (Section 3.3.5) or Up*/Down* routing (Section 3.3.4), or ignore these paths during the deadlock-avoidance calculation phase, e.g., DFSSSP (Section 3.3.8), since switch-to-switch paths are only used for management messages and not for data messages.

Therefore, for a fair comparison, switches are excluded from the set of destination nodes for Nue routing while performing the following evaluations. Furthermore, the Nue implementation in OpenSM forgoes the usage of a Fibonacci heap and uses d-ary heaps, with d = 4, instead, which outperform a Fibonacci heap in practice despite their O(log_d n) time complexity for the "decrease-key" operation [Joh75; LST14]. Finally, Section 6.3.3 shows the fault-resiliency and runtime of Nue routing, indicating its supremacy over other state-of-the-art topology-agnostic routing algorithms.
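As an illustration of the mentioned data structure, the following Python snippet sketches a minimal d-ary min-heap with a decrease-key operation (d = 4 by default); it is a generic sketch, not the heap code used inside OpenSM.

class DAryMinHeap:
    """Minimal d-ary min-heap with decrease-key (sift-up only, since keys
    only decrease); items must be hashable."""

    def __init__(self, d=4):
        self.d = d
        self.heap = []        # list of [key, item]
        self.pos = {}         # item -> index in self.heap

    def push(self, item, key):
        self.heap.append([key, item])
        self.pos[item] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def decrease_key(self, item, new_key):
        i = self.pos[item]
        if new_key < self.heap[i][0]:
            self.heap[i][0] = new_key
            self._sift_up(i)                     # O(log_d n) swaps

    def pop_min(self):
        self._swap(0, len(self.heap) - 1)
        key, item = self.heap.pop()
        del self.pos[item]
        if self.heap:
            self._sift_down(0)
        return item, key

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // self.d
            if self.heap[i][0] < self.heap[parent][0]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            first = self.d * i + 1
            if first >= n:
                break
            child = min(range(first, min(first + self.d, n)),
                        key=lambda j: self.heap[j][0])
            if self.heap[child][0] < self.heap[i][0]:
                self._swap(i, child)
                i = child
            else:
                break

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j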

6.3.1 Path Length and Edge Forwarding Index

Nue routing does not always create routes along the shortest paths, depending on the available number of virtual channels and on whether the escape paths are used or not. However, non-minimal routes increase latency and may cause higher network congestion. Hence, analyzing the path length of routes created by Nue in comparison to other shortest-path algorithms is reasonable. Additionally, the edge forwarding index γ for inter-switch ports in the network, see Definition 3.12, is investigated for these topology and routing combinations. The latter metric demonstrates the quality of the routing function in terms of path balancing. Remember that a high minimum γR(c) and a low maximum γR(c), for all c ∈ C, are indicators for a well-balanced routing algorithm.

To evaluate both metrics, the simulation toolchain is used to create 1,000 random topologies. Each topology consists of 125 switches interconnected by 1,000 channels, and eight terminals connected to each switch. Nue routing is applied, with 1 ≤ k ≤ 8 virtual channels, as well as LASH and DFSSSP routing, to create deadlock-free forwarding tables. The toolchain reports the following metrics for each topology/routing combination: minimum, maximum, and average edge forwarding index, and the standard deviation (sdv), i.e., γmin(It,R) := min_{c∈C} γR(c), and analogously γmax(It,R), γavg(It,R), and γsdv(It,R), for topology It and routing R. These metrics are then arithmetically averaged for all 1,000 topologies:

\[
\Gamma^{R}_{\min} = \frac{\sum_{t=1}^{1000} \gamma_{\min}(I_t, R)}{1000}
\]

and so forth. Figure 6.10 shows the result as box plots, with whiskers indicating Γ^R_min and Γ^R_max, and with the box indicating the average Γ^R_avg and Γ^R_avg ± Γ^R_sdv. As can be seen, Nue routing performs almost similarly to DFSSSP for k ≥ 4. It is worth mentioning that DFSSSP needs at least four virtual channels to calculate deadlock-free routes for all these topologies, or even five virtual channels in some exceptional cases. LASH's virtual channel requirement is lower compared to DFSSSP and ranges between two and four VCs. However, both Nue and DFSSSP clearly outperform LASH with respect to the edge forwarding index metric. Even though Figure 6.10 indicates that DFSSSP slightly outperforms Nue, it is evident that Nue routing is designed for arbitrary topologies and supports every given number of virtual channels. Therefore, Nue will be able to calculate deadlock-free paths even when scaling up the size of the topologies, while the other routings, such as DFSSSP, will fail due to VC limitations, see Section 6.3.3.
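A minimal sketch of this aggregation is shown below; the function and variable names, as well as the route representation, are assumptions for illustration and do not correspond to the simulation toolchain.

from collections import Counter
from statistics import mean, pstdev

def efi_metrics(routes):
    """routes: iterable of paths, each path a list of channels (edges);
    returns gamma_min/max/avg/sdv over all channels of one topology."""
    gamma = Counter()
    for path in routes:
        for channel in path:
            gamma[channel] += 1              # paths crossing this channel
    values = list(gamma.values())
    return {"min": min(values), "max": max(values),
            "avg": mean(values), "sdv": pstdev(values)}

def averaged_metrics(per_topology_metrics):
    """Arithmetic average of each metric over all topologies (Gamma^R)."""
    n = len(per_topology_metrics)
    return {key: sum(m[key] for m in per_topology_metrics) / n
            for key in ("min", "max", "avg", "sdv")}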

Figure 6.10: Averaged edge forwarding index metrics shown as whisker plots: minimum Γ^R_min and maximum Γ^R_max for the extremes, and average Γ^R_avg (± standard deviation Γ^R_sdv) for the box; simulations for 1,000 random topologies, each consisting of 125 switches, 1,000 terminals, and 1,000 switch-to-switch channels. [Figure: box plots of the edge forwarding index (logarithmic y-axis, 100 – 10,000) for Nue with 1 – 8 VCs, DFSSSP, and LASH]

The increased edge forwarding indices for k < 4 have two reasons: firstly, the concentration of paths on certain channels to bypass routing restrictions, and secondly, longer paths due to the use of the escape paths or parts of them. It is clear that a longer path changes γR(c) for more ports or channels in the network. Hence, for all 1,000 random topologies, the maximum path length in the network was determined as well. In the best case, Nue routing needs only two virtual channels to support the same maximum path length as the shortest-path algorithms DFSSSP and LASH. On average, Nue routing achieves the same maximum path length as DFSSSP, i.e., an arithmetic average of 5.3, if Nue distributes the paths among at least seven virtual layers. The worst-case length of the longest path for Nue ranges between 7 hops and 10 hops, depending on the given number of VCs, while it is 6 for DFSSSP and LASH routing.

The number of fall-backs to escape paths depends on many factors, such as topology type, size, number of virtual channels, and the chosen root node for the spanning tree. For the random topologies with no additional virtual channels, i.e., the first whisker marked Nue 1 VC, Nue fell back for 0% – 9.7% of the destinations, with an average of 0.95% across all 1,000 simulations for this case. For 8 VCs this average is below 0.006%. A general prediction of the number of times Nue uses the escape paths is beyond the scope of this thesis.

6.3.2 Throughput for Regular and Irregular Topologies

In addition to the random topologies from Section 6.3.1, the achievable throughput for four standard topologies, i.e., fat-tree, torus, Kautz graph, and dragonfly (see Section 3.2 for details about all these topologies), is measured with the simulation toolchain. Furthermore, three real-world topologies, namely Cray's Cascade (Section 3.2.4), the multi-island fat-tree of Taurus, and TSUBAME2's 2nd rail fat-tree network connecting 1,408 compute nodes (Section 3.2.6), are analyzed.


Table 6.1: Topology configurations, with link redundancy rC′ , used for throughput simulation results visualized in Figure 6.11

Topology              Switches  Terminals  Inter-switch channels  rC′
Random                     125      1,000                  1,000    1
3D torus(6,5,5)            150      1,050                  1,800    4
10-ary 3-tree              300      1,100                  2,000    1
Kautz(7,3)                 150      1,050                  1,500    2
Dragonfly(12,6,6,15)       180      1,080                  1,515    1
Cascade(2 groups)          192      1,536                  3,072    1
TSUBAME2                   243      1,408                  3,384    1
Taurus                     211      2,090                  2,490    1

The Cascade topology is configured with 192 global channels to connect two Cascade electrical groups. Furthermore, an arbitrary random topology has been chosen among the 1,000 created random topologies for this throughput measurement. All topologies accommodate approximately 1,000 terminals to allow a comparison between topologies as well. Each switch is connected to at least one terminal for all topologies, except for the two fat-tree topologies. Detailed topology configurations, utilizing 36-port switches (exception: 48 ports for Cascade), are given in Table 6.1. The simulated interconnects are assumed to be based on QDR InfiniBand with a limit of eight virtual channels. Hence, the original QDR/FDR multi-island fabric of Taurus is downgraded to QDR for this simulation, and each HCA is assumed to be a terminal, raising the number of terminals above the number of nodes due to the dual-HCA I/O and GPU nodes. The remaining simulation parameters are in compliance with the configuration used in Section 4.3.2. The redundancy rC′ listed in Table 6.1 refers to a multiplication of switch-to-switch channels with respect to the usual topology definition to increase the port usage, see Definition 3.3.

The flit-level simulator performs an all-to-all send operation with 2 KiByte message size between all terminals of the network. An exchange pattern of varying shift distances, see Section 4.3.3.2, is used at each terminal to communicate with all other terminals; a sketch of this pattern is shown below. Simulating uniform random injection traffic yields similar behaviour of Nue, and showing these results would not provide further insight. The throughput is measured for all nine routing algorithms that are available in OpenSM version 3.3.19, see Section 3.3 for the complete list. Impossible topology and routing combinations, such as Torus-2QoS routing for the 10-ary 3-tree, are ignored. Nue routing is compared for a given topology to the other usable algorithms, for every number of virtual channels 1 ≤ k ≤ 8, see Figure 6.11. Besides the simulated throughput for each topology and routing combination, the number of needed virtual channels is given atop the bars in Figure 6.11. For example, DFSSSP routing needs four VCs for a deadlock-free routing of the random topology. However, the deadlock-free SSSP routing usually uses all eight available virtual channels to optimize the path balancing across the virtual layers, a technique to slightly increase the throughput [Dal90].
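The following Python snippet sketches the assumed shift-based exchange pattern; it is an illustration only, not the traffic generator of the flit-level simulator.

def exchange_rounds(num_terminals):
    """Yield per-round lists of (sender, receiver) pairs: in round 'shift',
    every terminal sends to the terminal 'shift' ranks away, so that after
    num_terminals - 1 rounds each terminal has reached all others."""
    for shift in range(1, num_terminals):
        yield [(src, (src + shift) % num_terminals)
               for src in range(num_terminals)]

# Example: 4 terminals -> 3 rounds; in the simulated all-to-all operation
# each pair exchanges one 2 KiByte message per round.
for round_no, pairs in enumerate(exchange_rounds(4), start=1):
    print(round_no, pairs)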

Figure 6.11: Simulated throughput for an all-to-all operation on five standard and three real-world topologies, as configured according to Table 6.1; Nue routing is shown for all numbers of VCs between 1 and 8 (from left to right); the VC requirement of the other routings for deadlock-freedom is shown atop the individual bars. [Figure: eight panels (random topology, 6x5x5 3D torus, 10-ary 3-tree, Kautz graph, Dragonfly, Cascade, TSUBAME2, Taurus), each showing the throughput in Tbyte/s for Nue with 1 – 8 VCs and for the applicable routings among DFSSSP, LASH, Torus-2QoS, Up*/Down*, and fat-tree routing]

Two trends are visible for all investigated topologies. First, an increase of used virtual channels for Nue also increases the throughput for the all-to-all communication. This is a result of the decreased γmax(It,Nue) when multiple virtual channels are used, as reported in Section 6.3.1. The outliers from this pattern, e.g., the decrease in throughput for the random topology and four virtual channels, correlate with a sudden increase in fall-backs to the escape paths. In this particular example, Nue had to fall back for 14 of the 1,000 destinations, while Nue with two and with more than four virtual channels was able to route the topology without any fall-back. The second trend is that Nue shows a slight variance in throughput after reaching a certain peak, usually for about k ≥ 5 in these examples, but this generally depends on topology type and size. This behavior can be attributed to a mismatch between the static routing and the execution order of the point-to-point communications which assemble the all-to-all traffic pattern. A mismatch can cause temporary congestion in the network, which slows down the entire communication process as a result; this is a known problem, see Section 4.4.3 or Hoefler et al. [HSL08], respectively.

In general, Figure 6.11 shows that Nue routing is competitive with the best performing routing for each individual topology, i.e., it offers between 83.5% (for the 10-ary 3-tree) and 121.4% (for the Cascade network) of the throughput in comparison. Occasionally, depending on the given number of virtual channels, Nue is able to outperform the best competitor, which is Torus-2QoS routing for the torus and DFSSSP routing for all other topologies. For example, for the random topology, Nue with k ≥ 6 offers up to 15% higher throughput than DFSSSP, and for the Cascade network up to 21% higher throughput with k ≥ 3 (excluding k = 6). Furthermore, when given enough virtual channels, Nue is able to outperform the other routings, such as fat-tree routing or LASH, to a great extent. Therefore, it is evident that Nue routing is an adequate alternative to the other investigated algorithms, or is at least a suitable fall-back in case the best performing algorithm becomes inapplicable, as shown in the following Section 6.3.3.

6.3.3 Runtime and Practical Considerations

Proposition 6.1 mathematically derived an O(|N|² · log|N|) time complexity for Nue routing. To put this into perspective, the runtime of Nue, with eight virtual channels, needs to be compared to other deadlock-free routing algorithms implemented in OpenSM. Slight variations in the runtime, depending on the given number of virtual channels, are possible for Nue, as for the other routing algorithms; however, providing these numbers here would not yield substantially more insight. For the following measurement, an implementation of the Nue routing algorithm (Algorithm 6.3) has been integrated into the current OpenSM, i.e., version 3.3.19, to achieve a fair comparison.

The underlying base topology for this runtime measurement is a 3-dimensional torus, which has proven to be challenging for most investigated fault-tolerant algorithms, see Section 4.4.3, and therefore offers insight into the fault-tolerance and the applicability of Nue at the same time. The simulation toolchain is used to create 25 3D torus networks with a difference in dimension of at most one, i.e., the grid size for the switches starts at 2x2x2, 2x2x3, 2x3x3, ..., and goes up to 10x10x10. Each of these switches connects to four terminals; hence, the 10x10x10 torus accommodates 4,000 terminals.

Furthermore, the assumption for this test is that the maximum number of available virtual channels is 8, adjacent switches utilize no channel redundancy, and 1% link/channel failures are randomly injected into the topology. The 1% link failures have been chosen according to the observed annual failure rate of production HPC systems, see Section 4.1. Besides Nue, the toolchain is used to evaluate the runtime of DFSSSP, LASH, and Torus-2QoS routing to calculate deadlock-free routing tables for the same faulty topology. The testbed, which executes the subnet manager, is a dual-socket Intel Xeon E5-2620 server with 64 GiByte of main memory. The InfiniBand fabric simulator (ibsim) is pinned to socket 0 and the OpenSM is pinned to socket 1 to minimize disturbances.

Figure 6.12: Runtime comparison of deadlock-free routings for 3-dimensional tori (without channel redundancy) of various topology sizes with 1% injected link failures; four terminals per switch (e.g., smallest: 2x2x2 torus with 32 terminals, largest topology: 10x10x10 torus with 4,000 terminals); missing dots: routing failed. [Figure: runtime in seconds (logarithmic y-axis, 10^-4 s to 10^3 s) over the number of terminals per 3-dimensional torus (32 to 4,096) for Nue with 8 VCs, DFSSSP, LASH, and Torus-2QoS]

Two main conclusions can be drawn from the results shown in Figure 6.12. First, Nue is competitive in terms of runtime. Nue routing calculates the forwarding tables faster than the topology-agnostic DFSSSP, which has the same time complexity of O(|N|² · log|N|), see Table 2.1 or previously published work by the author of this thesis [DHN11], respectively. Nue outperforms the runtime of LASH routing for tori larger than 4x4x4 with 256 terminals attached. Only the topology-aware Torus-2QoS routing is on average 9x faster than Nue, which is as expected, since Torus-2QoS is able to avoid deadlocks analytically. The second important result is an applicability of 100%, i.e., Nue routing scales with the topology size. The other three deadlock-free algorithms fail, notice the missing data points in Figure 6.12, either because the algorithms run out of virtual layers, i.e., DFSSSP and LASH exceed the 8 VC limit, or because the failures prevent an analytical solution for the deadlock-free paths problem, in the case of Torus-2QoS routing. Only Nue routing is always applicable while offering good path balancing.
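To illustrate the described setup, the following Python snippet sketches how the 25 grid sizes and the 1% random link failures could be generated; it is a hypothetical helper for illustration, not the simulation toolchain.

import random

def torus_grids(start=2, end=10):
    """Yield 3D grid sizes (x, y, z) with x <= y <= z and z - x <= 1,
    i.e., 2x2x2, 2x2x3, 2x3x3, 3x3x3, ..., 10x10x10 (25 sizes in total)."""
    for x in range(start, end + 1):
        for extra in ((0, 0, 0), (0, 0, 1), (0, 1, 1)):
            dims = (x + extra[0], x + extra[1], x + extra[2])
            if max(dims) <= end:
                yield dims

def inject_failures(links, rate=0.01, seed=0):
    """Remove ~rate of the given switch-to-switch links at random."""
    random.seed(seed)
    num_failures = max(1, int(rate * len(links)))
    failed = set(random.sample(range(len(links)), num_failures))
    return [link for i, link in enumerate(links) if i not in failed]

print(len(list(torus_grids())))    # -> 25 torus configurations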

7 Summary, Conclusion, and Future Work

The most important role of each supercomputer is to provide other research domains, such as material science, meteorology and climate research, biomedical informatics, or astrophysics, with computational resources otherwise not available. The high-speed interconnection network, connecting the nodes of the HPC system, plays a crucial role in achieving scalability and throughput for these scientific applications. Key factors of the interconnection network are low latency, high bandwidth, high message throughput, deadlock-freedom, resiliency, and availability, hence factors which depend not only on the networking hardware, but also on the software used for network management. This thesis advances the state-of-the-art of network management in three main areas of interest for the HPC community, by:

• Showing that HPC interconnection networks can be operated in fail-in-place mode and providing a simulation toolchain to plan future systems and to establish operation policies,

• Developing a new and low overhead scheduling-aware routing algorithm to improve the scientific throughput and network utilization of multi-user/multi-job HPC environments, and

• Introducing a novel concept for solving the deadlock-freedom of routing algorithms, which allows the calculation of deadlock-free and path-balanced forwarding rules for arbitrary topologies within a given virtual channel constraint.

The developed simulation toolchain helps system designers and administrators in the decision making about the correct combination of topology and routing algorithm, and helps to extrapolate the performance degradation of a fail-in-place network based on estimated or known failure rates. The comprehensive empirical study of combinations of topologies, routing algorithms, and network failures shows that HPC networks, and not only hard drives, can be operated in fail-in-place mode to reduce maintenance costs and to utilize the remaining resources. Even though the resulting irregular topologies pose a challenge to the used routing algorithms, a low failure rate of the network components supports a fail-in-place network design strategy, which bypasses non-critical failures during the system's lifetime. To guarantee high overall network throughput over the lifetime of the supercomputer, a resilient topology needs to be combined with an appropriate fault-tolerant or topology-agnostic routing algorithm.

However, both types of existing deterministic routing algorithms, i.e., fault-tolerant topology-aware and topology-agnostic, show limitations. The performance degradation when using a topology-aware routing algorithm increases more with an increasing number of failures than it does for topology-agnostic routings. Furthermore, a large number of switch and link failures can deform a regular topology to a state which cannot be compensated by the topology-aware routing, because it cannot recognize the underlying topology.

In contrast, topology-agnostic routing algorithms are ideal for fail-in-place networks by design, but the investigated algorithms either show weaknesses in terms of deadlocks, such as MinHop routing, or cannot be used beyond a certain network size. The deadlock-avoidance via virtual channels employed by state-of-the-art routings can require more than the available number of virtual channels, e.g., in the case of LASH and DFSSSP for the 3-dimensional torus(7,7,7) with more than 2,000 terminals, both exceed the current hardware limit of eight VCs. The performed analysis of lost connections after a failure, and of the time before a rerouting step is finished, is valuable for other fault tolerance areas as well. The knowledge about the runtime of a rerouting step can be used to calculate appropriate retry counters and timeout values in the communication layer.

While the fail-in-place network operation increases the resiliency and availability considerably, the overall achievable throughput decreases slightly when the supercomputer is used by a single scientific application. Nonetheless, this is rarely the case, since most HPC systems located at universities or national laboratories are used by multiple users simultaneously. When the network of these systems utilizes flow-oblivious routing, e.g., as used by the IB-based supercomputers which make up 37.6% of the TOP500 list, then this causes two problems: parts of the network are underutilized, while application performance is reduced by local congestion. The developed scheduling-aware routing (SAR) approach for the interconnection network is able to mitigate these problems. The SAR approach periodically analyzes the state of the batch system, with its low-overhead filtering tool, and reroutes the network while factoring in the locality of concurrently running applications. As a result, the observable throughput for MPI-parallelized applications improves considerably, and is less sensitive to the actual locality of the compute nodes within the supercomputer. Furthermore, the amount of dark fiber, i.e., installed but unused cables, is reduced. Combining the knowledge of two different resource management domains has proven to be effective in theory, as well as in practice during over one year of usage on the Taurus HPC system. The conducted empirical study, based on the simulative replay of the historical job-to-node allocation of Taurus and TSUBAME2, reveals that SAR is able to improve the overall system utilization by up to 17.74%, and to reduce the effective edge forwarding index of individual jobs by up to 71.2%. Communication benchmarks conducted on Taurus, while the system was simultaneously used by other researchers and scientific applications, showed throughput increases of up to 17.6% after SAR optimized the forwarding rules for the locality of the benchmark job.

Unfortunately, changing the forwarding rules within an interconnection network, which is not subject to error correction, can cause unintended network packet delivery behavior, such as out-of-order delivery, packet drops, or deadlocks. Upper layer protocols, e.g., the MPI communication protocol, might fail if a lossless interconnection technology, such as InfiniBand, is not able to handle these problems. The devised five-phase update protocol, tailored for the InfiniBand architecture, achieves a property-preserving network update to ensure fault-free operation.
Even though the update protocol is currently inapplicable due to missing firmware features, users of Taurus have not experienced any application crashes due to SAR, presumably because of resiliency features built into the transport layer of InfiniBand.

Additionally, a fail-in-place network is also more prone to deadlocks when it is based on lossless interconnection technology.

The lossless transmission of messages, e.g., by using credit-based flow control in InfiniBand or Priority Flow Control (PFC) with pause frames in Ethernet, requires deadlock-free routing to function reliably. The same is true for many Network-on-Chip architectures. However, the applicability of topology-aware and topology-agnostic routings for faulty regular topologies or arbitrary topologies is limited, as discussed before, and universally applicable routings, such as Up*/Down*, are scarce and perform poorly with respect to path length and balancing. The novel approach of applying a graph search algorithm within the complete channel dependency graph instead of the actual network, presented in this thesis, is the first reliable strategy to route arbitrary topologies with a limited number of virtual channels, while achieving close to optimal path lengths and competitive path balancing. Similar to Up*/Down*, deadlock-free forwarding rules can always be calculated, even in the absence of virtual channels.

A model implementation of the new approach, called Nue routing, is tailored for the deadlock-free, oblivious, and destination-based routing needed in Converged Enhanced Ethernet (CEE) and InfiniBand, and can directly be employed for both, e.g., using InfiniBand's virtual lanes or using PFC together with the Priority Code Point for CEE. Possible applications of Nue for NoC architectures include, but are not limited to, the routing between tiles connected by virtual channel routers in a fault-tolerant manner. An extensive study, conducted for multiple regular, irregular, and real-world topologies, shows that Nue routing outperforms other topology-agnostic routings, such as LASH and DFSSSP, with respect to runtime and resiliency. Flit-level simulations indicate that Nue enables competitive or superior global throughput, i.e., between 83.5% and 121.4% throughput for an all-to-all traffic pattern, in comparison to the best performing state-of-the-art routing for the respective topology.

Another advantage is Nue's capability of arbitrarily limiting the number of used VCs, which allows the combination of deadlock-freedom and quality of service (QoS) when QoS is based on the same technology feature. For instance, InfiniBand's service levels are mapped to virtual lanes, which are all used by LASH or DFSSSP to avoid deadlocks. So, previously, the choice for IB was either having QoS with topology-aware routing or ignoring QoS and using topology-agnostic routings based on virtual lanes. All these characteristics and advantages, combined with Nue's low time complexity of O(|N|² · log|N|) and memory complexity of O(|N|²), make Nue routing a suitable algorithm to route modern large-scale supercomputers, lossless data center networks, and NoC architectures.

The developed simulation toolchain, as well as implementations of the two novel routing approaches, i.e., scheduling-aware routing and Nue routing, are freely available:

• Topology generator and fail-in-place simulation toolchain: http://spcl.inf.ethz.ch/Research/Scalable_Networking/FIP

• Scheduling-aware routing and filtering interface between SLURM and OpenSM: https://gitlab.com/domke/osm-routing-dev/tree/sar-3.3.20

• Implementation of Nue routing in OpenSM for IB-based networks: https://gitlab.com/domke/osm-routing-dev/tree/nue-3.3.19

allowing researchers, system designers, and administrators to conduct further studies or to improve their supercomputers.


Both devised routings, SAR and Nue, are opening up new research areas for the HPC community to adapt and improve the underlying ideas. These algorithms underwent extensive simulations with various topologies and real-world testing on two production supercomputers in preparation for an integration into the network manager of the OpenFabrics Enterprise Distribution, which would make the routings available on all IB-based HPC systems. Especially Nue should be evaluated against MinHop routing as the fall-back routing algorithm, since MinHop fails to provide the required deadlock-freedom. Additional future work for network hardware vendors and the HPC community consists of revising the support to drain queue pairs and of adapting the five-phase update protocol for other upper layer protocols besides MPI.

While the most accurate results are provided by the flit-level simulation framework, the underlying combination of OMNeT++ and ibmodel currently lacks scalability for networks beyond 5,000 terminals due to the single-threaded and memory-inefficient implementation. One option for future work is to parallelize the ibmodel and reduce the memory footprint considerably. An alternative is the renunciation of flit-level accuracy and the replacement of the 4th stage of the toolchain, i.e., the simulator, with the highly parallel CODES simulator, enabling packet-level simulations instead. Furthermore, the number of base topologies supported by the toolchain can be expanded to include novel designs, such as ExpressMesh, Slim Fly, flattened butterfly, and HyperX networks.

Evidently, the scheduling-aware routing algorithm decreases the theoretical upper bound for the worst-case congestion within batch jobs, and therefore improves the runtime of all-to-all communication patterns substantially. However, both the impact of SAR on the mutual interference of multiple jobs and the available throughput for deviating communication patterns remain unknown. Hence, future research is required to shed light on these questions, and to potentially incorporate additional information, either provided by users or by the system, into the routing decisions to enhance the quality of SAR. Furthermore, a yet uninvestigated field is the applicability of SAR to optimize the default paths of an adaptively routed network architecture to minimize the number of adaptations performed by switches.

A wide range of directions for future research opens up with the novel approach of routing within the complete channel dependency graph, e.g., most obviously: combining SAR and Nue routing into a unified algorithm. In addition, some interconnection architectures require source-based routing or all-to-all routing (i.e., paths are determined by source and destination), as used in Myrinet or Bull's eXascale Interconnect (BXI), respectively, which potentially benefit from the new concept of avoiding routing deadlocks. Whether this concept is applicable to architectures which utilize distributed routing is as yet unknown, as is a parallelized algorithm to verify the cycle-freedom of the complete CDG to improve Nue's runtime. Furthermore, Nue's distribution of destinations to virtual layers requires additional research to decrease the fall-backs to the escape paths, and thereby to improve latency and achievable throughput.
Logically, Nue’s ability to route arbitrary topologies in the absence of virtual channels renders it a suitable routing to calculate the escape channels for fully adaptive routings based on Duato’s protocol [DT03, 14.3.2], usable in Intel’s Omni-Path or Cray’s Cascade architectures, and the integration of Nue into the respective network manager should be evaluated. The same holds for NoCs and software-defined networking (SDN) architectures, where a comparison to the state-of-the-art routings is desired.

Bibliography

[AA13] A. B. Ahmed and A. B. Abdallah. „Architecture and Design of High-throughput, Low-latency, and Fault-tolerant Routing Algorithm for 3D-network-on-chip (3D-NoC)“. In: J. Supercomput. 66.3 (Dec. 2013), pp. 1507–1532. ISSN: 0920-8542. DOI: 10.1007/s11227-013-0940-9.
[AAS87] G. Adams III, D. Agrawal, and H. Siegel. „A Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks“. In: Computer 20.6 (June 1987), pp. 14–27. ISSN: 0018-9162. DOI: 10.1109/MC.1987.1663586.
[Adi+05] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P. Vranas. „Blue Gene/L Torus Interconnection Network“. In: IBM J. Res. Dev. 49.2 (Mar. 2005), pp. 265–276. ISSN: 0018-8646. DOI: 10.1147/rd.492.0265.
[Ahn+09] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. „HyperX: Topology, Routing, and Packaging of Efficient Large-scale Networks“. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. SC ’09. New York, NY, USA: ACM, 2009, 41:1–41:11. ISBN: 978-1-60558-744-8. DOI: 10.1145/1654059.1654101.
[Aji+12a] Y. Ajima, T. Inoue, S. Hiramoto, T. Shimizu, and Y. Takagi. „The Tofu Interconnect“. In: IEEE Micro 32.1 (2012), pp. 21–31. ISSN: 0272-1732. DOI: 10.1109/MM.2011.98.
[Aji+12b] Y. Ajima, T. Inoue, S. Hiramoto, and T. Shimizu. Whitepaper: Tofu: Interconnect for the K computer. Tech. rep. Vol. 48, No. 3. Fujitsu, July 2012, pp. 280–285.
[AK11] D. Abts and J. Kim. High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2011. URL: http://dx.doi.org/10.2200/S00341ED1V01Y201103CAC014.

[Alv+12] B. Alverson, E. Froese, L. Kaplan, and D. Roweth. Whitepaper: Cray® XC™ Series Network. Tech. rep. WP-Aries01-1112. Cray Inc., Nov. 2012, p. 28. URL: http://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf (visited on 03/16/2016).
[Ari+10] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. „The PERCS High-Performance Interconnect“. In: 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI). Aug. 2010, pp. 75–82. DOI: 10.1109/HOTI.2010.16.
[ARK10] R. Alverson, D. Roweth, and L. Kaplan. „The Gemini System Interconnect“. In: 2010 IEEE 18th Annual Symposium on High Performance Interconnects (HOTI). Aug. 2010, pp. 83–87. DOI: 10.1109/HOTI.2010.23.


[ATB14] A. H. Abdel-Gawad, M. Thottethodi, and A. Bhatele. „RAHTM: Routing Algorithm Aware Hierarchical Task Mapping“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 325–335. ISBN: 978-1-4799-5500-8. DOI: 10.1109/SC.2014.32.
[Bad99] D. A. Bader. A Practical Parallel Algorithm for Cycle Detection in Partitioned Digraphs. Technical Report AHPCC-TR-99-013. University of New Mexico: Department of Electrical and Computer Engineering, Sept. 1999.
[Ban+08] M. Banikazemi, J. Hafner, W. Belluomini, K. Rao, D. Poff, and B. Abali. „Flipstone: Managing Storage with Fail-in-place and Deferred Maintenance Service Models“. In: SIGOPS Oper. Syst. Rev. 42.1 (Jan. 2008), pp. 54–62.
[BDM15] K. Brown, J. Domke, and S. Matsuoka. „Hardware-Centric Analysis of Network Performance for MPI Applications“. In: 2015 21st IEEE International Conference on Parallel and Distributed Systems (ICPADS). Melbourne, Australia: IEEE Press, Dec. 2015, p. 8.
[BG07] D. Blake and W. Gore. Effect of Passive and Active Copper Cable Interconnects on Latency of Infiniband DDR compliant systems. Sept. 2007.
[BH14] M. Besta and T. Hoefler. „Slim Fly: A Cost Effective Low-diameter Network Topology“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 348–359. ISBN: 978-1-4799-5500-8. DOI: 10.1109/SC.2014.34.
[Bir+15] M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett, T. Rimmer, K. D. Underwood, and R. C. Zak. „Intel® Omni-path Architecture: Enabling Scalable, High Performance Fabrics“. In: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI). Santa Clara, CA: IEEE, Aug. 2015, pp. 1–9. DOI: 10.1109/HOTI.2015.22.
[BM06] T. Bjerregaard and S. Mahadevan. „A Survey of Research and Practices of Network-on-chip“. In: ACM Comput. Surv. 38.1 (June 2006). ISSN: 0360-0300. DOI: 10.1145/1132952.1132953.
[Bra01] U. Brandes. „A Faster Algorithm for Betweenness Centrality“. In: Journal of Mathematical Sociology 25 (2001), pp. 163–177.
[BRM12] R. Birke, G. Rodriguez, and C. Minkenberg. „Towards Simulations of Massively Parallel High-performance Computing Systems“. In: Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques. SIMUTOOLS ’12. ICST, Brussels, Belgium, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2012, pp. 291–298. ISBN: 978-1-4503-1510-4. URL: http://dl.acm.org/citation.cfm?id=2263019.2263065.
[CBP00] C. D. Carothers, D. Bauer, and S. Pearce. „ROSS: A High-performance, Low Memory, Modular Time Warp System“. In: Proceedings of the Fourteenth Workshop on Parallel and Distributed Simulation. PADS ’00. Washington, DC, USA: IEEE Computer Society, 2000, pp. 53–60. ISBN: 0-7695-0667-4. URL: http://dl.acm.org/citation.cfm?id=336146.336157.
[CES71] E. G. Coffman Jr., M. Elphick, and A. Shoshani. „System Deadlocks“. In: ACM Computing Surveys 3.2 (1971), pp. 67–78. ISSN: 0360-0300. DOI: 10.1145/356586.356588.


[CGS10] F. Cappello, A. Guermouche, and M. Snir. „On Communication Determinism in Parallel HPC Applications“. In: 2010 Proceedings of 19th International Conference on Computer Communications and Networks. Aug. 2010, pp. 1–8. DOI: 10.1109/ICCCN.2010.5560143.
[Che+12] D. Chen, N. Eisley, P. Heidelberger, R. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. Satterfield, B. Steinmacher-Burow, and J. Parker. „The IBM Blue Gene/Q Interconnection Fabric“. In: IEEE Micro 32.1 (Jan. 2012), pp. 32–43. ISSN: 0272-1732. DOI: 10.1109/MM.2011.96.
[Che+16] D. Chen, P. Heidelberger, C. Stunkel, and Y. Sugawara. „An Evaluation of Network Architectures for Next Generation Supercomputers“. In: Proceedings of the 7th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. PMBS ’16. New York, NY, USA: ACM, 2016, 2:11–2:21. ISBN: 978-1-5090-5218-9. DOI: 10.1109/PMBS.2016.7.
[Chu+87] F. R. K. Chung, E. G. Coffman Jr., M. I. Reiman, and B. Simon. „The forwarding index of communication networks“. In: IEEE Transactions on Information Theory 33.2 (1987), pp. 224–232. ISSN: 0018-9448. DOI: 10.1109/TIT.1987.1057290.
[CKR96] L. Cherkasova, V. Kotov, and T. Rokicki. „Fibre channel fabrics: evaluation and design“. In: Proceedings of the 29th Hawaii International Conference on System Sciences. Vol. 1. Jan. 1996, pp. 53–62. DOI: 10.1109/HICSS.1996.495447.
[Cop+11] J. Cope, N. Liu, S. Lang, C. D. Carothers, and R. B. Ross. „CODES: Enabling Co-Design of Multi-Layer Exascale Storage Architectures“. In: Workshop on Emerging Supercomputing Technologies 2011 (WEST 2011). Tucson, Arizona, USA, 2011.
[Cor+01] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. 2nd. MIT Press, 2001. ISBN: 978-0-262-03293-3. URL: http://books.google.com/books?id=NLngYyWFl_YC.
[Cro+14] C. Croarkin, P. Tobias, J. J. Filliben, B. Hembree, W. Guthrie, L. Trutna, and J. Prins, eds. NIST/SEMATECH e-Handbook of Statistical Methods. NIST/SEMATECH, Apr. 2014. URL: http://www.itl.nist.gov/div898/handbook/.
[Dal90] W. J. Dally. „Virtual-channel Flow Control“. In: Proceedings of the 17th Annual International Symposium on Computer Architecture. ISCA ’90. New York, NY, USA: ACM, 1990, pp. 60–68. ISBN: 0-89791-366-3. DOI: 10.1145/325164.325115.
[Den+08] W. E. Denzel, J. Li, P. Walker, and Y. Jin. „A Framework for End-to-end Simulation of High-performance Computing Systems“. In: 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems & Workshops. Simutools ’08. ICST, Brussels, Belgium, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2008, 21:1–21:10. ISBN: 978-963-9799-20-2. URL: http://dl.acm.org/citation.cfm?id=1416222.1416248.
[DH16] J. Domke and T. Hoefler. „Scheduling-Aware Routing for Supercomputers“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’16. Piscataway, NJ, USA: IEEE Press, 2016, 13:1–13:12. ISBN: 978-1-4673-8815-3. URL: http://dl.acm.org/citation.cfm?id=3014904.3014922.


[DHM14] J. Domke, T. Hoefler, and S. Matsuoka. „Fail-in-Place Network Design: Interaction between Topology, Routing Algorithm and Failures“. In: Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC14). SC ’14. New Orleans, LA, USA: IEEE Press, Nov. 2014, pp. 597–608. ISBN: 978-1-4799-5500-8. DOI: 10.1109/SC.2014.54.
[DHM16] J. Domke, T. Hoefler, and S. Matsuoka. „Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing“. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. HPDC ’16. New York, NY, USA: ACM, 2016, pp. 3–14. ISBN: 978-1-4503-4314-5. DOI: 10.1145/2907294.2907313.
[DHN11] J. Domke, T. Hoefler, and W. E. Nagel. „Deadlock-Free Oblivious Routing for Arbitrary Topologies“. In: Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS). Washington, DC, USA: IEEE Computer Society, May 2011, pp. 613–624. ISBN: 0-7695-4385-7.
[Din+13] B. D. d. Dinechin, R. Ayrignac, P. E. Beaucamps, P. Couvert, B. Ganne, P. G. d. Massas, F. Jacquet, S. Jones, N. M. Chaisemartin, F. Riss, and T. Strudel. „A clustered architecture for embedded and accelerated applications“. In: 2013 IEEE High Performance Extreme Computing Conference (HPEC). Sept. 2013, pp. 1–6. DOI: 10.1109/HPEC.2013.6670342.
[Don16] J. Dongarra. Report on the Sunway TaihuLight System. Tech Report UT-EECS-16-742. University of Tennessee, June 2016, pp. 1–24. URL: http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf (visited on 12/01/2016).
[DS87] W. J. Dally and C. L. Seitz. „Deadlock-Free Message Routing in Multiprocessor Interconnection Networks“. In: IEEE Trans. Comput. 36.5 (1987), pp. 547–553. ISSN: 0018-9340. DOI: 10.1109/TC.1987.1676939.
[DT03] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003. ISBN: 0-12-200751-4.
[DYL02] J. Duato, S. Yalamanchili, and N. Lionel. Interconnection Networks: An Engineering Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002. ISBN: 1-55860-852-4.
[Faa+12] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, and J. Reinhard. „Cray Cascade: a Scalable HPC System based on a Dragonfly Network“. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC ’12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, 103:1–103:9. ISBN: 978-1-4673-0804-5. URL: http://dl.acm.org/citation.cfm?id=2388996.2389136.
[Fle+06] C. Fleiner, R. B. Garner, J. L. Hafner, K. K. Rao, D. R. Kenchammana-Hosekote, W. W. Wilcke, and J. S. Glider. „Reliability of Modular Mesh-connected Intelligent Storage Brick Systems“. In: IBM J. Res. Dev. 50.2/3 (Mar. 2006), pp. 199–208. ISSN: 0018-8646. DOI: 10.1147/rd.502.0199.


[Fli+02] J. Flich, P. López, J. C. Sancho, A. Robles, and J. Duato. „Improving InfiniBand Routing Through Multiple Virtual Networks“. In: Proceedings of the 4th International Symposium on High Performance Computing. ISHPC ’02. London, UK, UK: Springer-Verlag, 2002, pp. 49–63. ISBN: 3-540-43674-X. URL: http://dl.acm.org/citation.cfm?id=646349.690713.
[Fli+12] J. Flich, T. Skeie, A. Mejia, O. Lysne, P. López, A. Robles, J. Duato, M. Koibuchi, T. Rokicki, and J. C. Sancho. „A Survey and Evaluation of Topology-Agnostic Deterministic Routing Algorithms“. In: IEEE Transactions on Parallel and Distributed Systems 23.3 (Mar. 2012), pp. 405–425. ISSN: 1045-9219. DOI: 10.1109/TPDS.2011.190.
[Fra+07] P. Francois, O. Bonaventure, B. Decraene, and P. A. Coste. „Avoiding disruptions during maintenance operations on BGP sessions“. In: IEEE Transactions on Network and Service Management 4.3 (Dec. 2007), pp. 1–11. ISSN: 1932-4537. DOI: 10.1109/TNSM.2007.021102.
[Fre77] L. C. Freeman. „A Set of Measures of Centrality Based on Betweenness“. In: Sociometry 40.1 (Mar. 1977), pp. 35–41. ISSN: 00380431. URL: http://www.jstor.org/stable/3033543.
[FT87] M. L. Fredman and R. E. Tarjan. „Fibonacci heaps and their uses in improved network optimization algorithms“. In: J. ACM 34.3 (1987), pp. 596–615. ISSN: 0004-5411. DOI: 10.1145/28869.28874.
[Gab+04] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. „Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation“. In: Proceedings, 11th European PVM/MPI Users’ Group Meeting. Budapest, Hungary, Sept. 2004, pp. 97–104.
[Gar+13] M. García, E. Vallejo, R. Beivide, M. Odriozola, and M. Valero. „Efficient Routing Mechanisms for Dragonfly Networks“. In: Proceedings of the 2013 42nd International Conference on Parallel Processing. ICPP ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 582–592. ISBN: 978-0-7695-5117-3. DOI: 10.1109/ICPP.2013.72.
[Glo14] Global Scientific Information and Computing Center. Failure History of TSUBAME2.0 and TSUBAME2.5. Apr. 2014. URL: http://mon.g.gsic.titech.ac.jp/trouble-list/index.htm (visited on 04/01/2014).
[GR11] E. G. Gran and S.-A. Reinemo. „InfiniBand congestion control: modelling and validation“. In: Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques. SIMUTools ’11. ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2011, pp. 390–397. ISBN: 978-1-936968-00-8. URL: http://dl.acm.org/citation.cfm?id=2151054.2151122.
[GSI12] GSIC, Tokyo Institute of Technology. Tsubame2.0 Architecture & Services. Nov. 2012. URL: http://www.gsic.titech.ac.jp/sites/default/files/0-tsubame.pdf.
[Hat11] T. Hatazaki. „Tsubame-2 - a 2.4 PFLOPS peak performance system“. In: 2011 Optical Fiber Communication Conference and Exhibition/National Fiber Optic Engineers Conference (OFC/NFOEC). Mar. 2011, pp. 1–3.


[Hen95] R. L. Henderson. „Job Scheduling Under the Portable Batch System“. In: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing. IPPS '95. London, UK, UK: Springer-Verlag, 1995, pp. 279–294. ISBN: 3-540-60153-8. URL: http://dl.acm.org/citation.cfm?id=646376.689372.

[HL08] L. Hsu and C. Lin. Graph Theory and Interconnection Networks. Taylor & Francis, 2008. ISBN: 978-1-4200-4482-9. URL: http://books.google.com/books?id=vbxdqhDKOSYC.

[HMS89] M. C. Heydemann, J. Meyer, and D. Sotteau. „On Forwarding Indices of Networks“. In: Discrete Appl. Math. 23.2 (May 1989), pp. 103–123. ISSN: 0166-218X. DOI: 10.1016/0166-218X(89)90022-X.

[HR82] G. Ho and C. Ramamoorthy. „Protocols for Deadlock Detection in Distributed Database Systems“. In: IEEE Transactions on Software Engineering SE-8.6 (1982), pp. 554–557. ISSN: 0098-5589. DOI: 10.1109/TSE.1982.235884.

[HS11] T. Hoefler and M. Snir. „Generic Topology Mapping Strategies for Large-scale Parallel Architectures“. In: Proceedings of the 2011 ACM International Conference on Supercomputing (ICS'11). Tucson, AZ: ACM, June 2011, pp. 75–85. ISBN: 978-1-4503-0102-2.

[HS97] G. Hahn and G. Sabidussi. Graph Symmetry: Algebraic Methods and Applications. NATO ASI series / C: NATO ASI series. Springer, 1997. ISBN: 978-0-7923-4668-5. URL: http://books.google.com/books?id=-tIaXdII8egC.

[HSL08] T. Hoefler, T. Schneider, and A. Lumsdaine. „Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks“. In: Proceedings of the 2008 IEEE International Conference on Cluster Computing. IEEE Computer Society, Oct. 2008. ISBN: 978-1-4244-2640.

[HSL09] T. Hoefler, T. Schneider, and A. Lumsdaine. „Optimized Routing for Large-Scale InfiniBand Networks“. In: 17th Annual IEEE Symposium on High Performance Interconnects (HOTI 2009). Aug. 2009.

[IEE16] IEEE Standards Association. „IEEE Standard for Ethernet“. In: IEEE Std 802.3-2015 (Revision of IEEE Std 802.3-2012) (Mar. 2016), pp. 1–4017. DOI: 10.1109/IEEESTD.2016.7428776.

[Inf10] InfiniBand® Trade Association. Supplement to InfiniBand™ Architecture Specification Volume 1 Release 1.2.1 (Annex A16: RDMA over Converged Ethernet (RoCE)). Apr. 2010.

[Inf12] InfiniBand® Trade Association. InfiniBand™ Architecture Specification Volume 2 Release 1.3 (Physical Specification). Nov. 2012.

[Inf15] InfiniBand® Trade Association. InfiniBand™ Architecture Specification Volume 1 Release 1.3 (General Specifications). Mar. 2015.

[Ino10] T. Inoue. The 6D Mesh/Torus Interconnect of K Computer. Slides. Nov. 2010. URL: http://www.fujitsu.com/downloads/TC/sc10/interconnect-of-k-computer.pdf.

[Jai+16] N. Jain, A. Bhatele, S. White, T. Gamblin, and L. V. Kale. „Evaluating HPC Networks via Simulation of Parallel Workloads“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '16. Piscataway, NJ, USA: IEEE Press, 2016, 14:1–14:12. ISBN: 978-1-4673-8815-3. URL: http://dl.acm.org/citation.cfm?id=3014904.3014923.


[Jia+13] N. Jiang, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis, and J. Kim. „A detailed and flexible cycle-accurate Network-on-Chip simulator“. In: 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Apr. 2013, pp. 86–96. DOI: 10.1109/ISPASS.2013.6557149.

[Joh75] D. B. Johnson. „Priority Queues with Update and Finding Minimum Spanning Trees“. In: Information Processing Letters 4.3 (Sept. 1975), pp. 53–57. DOI: 10.1016/0020-0190(75)90001-0.

[Jok+15] A. Jokanovic, J. C. Sancho, G. Rodriguez, A. Lucero, C. Minkenberg, and J. Labarta. „Quiet Neighborhoods: Key to Protect Job Performance Predictability“. In: 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Hyderabad: IEEE, May 2015, pp. 449–459. DOI: 10.1109/IPDPS.2015.87.

[Juc08] G. Juckeland. Introduction to High Performance Computing at ZIH, Architecture of the PC Farm (Deimos). Oct. 2008. URL: http://wwwpub.zih.tu-dresden.de/~mlieber/slides/Architecture.pdf (visited on 12/01/2016).

[KBB08] P. Kogge, K. Bergman, and S. Borkar. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. Tech. rep. TR-2008-13. Department of Computer Science and Engineering, Notre Dame, Indiana: University of Notre Dame, Sept. 2008.

[KBD07] J. Kim, J. Balfour, and W. Dally. „Flattened Butterfly Topology for On-Chip Networks“. In: Computer Architecture Letters 6.2 (Feb. 2007), pp. 37–40. ISSN: 1556-6056. DOI: 10.1109/L-CA.2007.10.

[Kim+08] J. Kim, W. J. Dally, S. Scott, and D. Abts. „Technology-Driven, Highly-Scalable Dragonfly Topology“. In: ACM SIGARCH Comput. Architecture News 36.3 (June 2008), pp. 77–88. ISSN: 0163-5964. DOI: 10.1145/1394608.1382129.

[Kin+09] M. A. Kinsy, M. H. Cho, T. Wen, E. Suh, M. van Dijk, and S. Devadas. „Application-aware deadlock-free oblivious routing“. In: Proceedings of the 36th annual International Symposium on Computer Architecture. ISCA '09. New York, NY, USA: ACM, 2009, pp. 208–219. ISBN: 978-1-60558-526-0. DOI: 10.1145/1555754.1555782.

[KK98] G. Karypis and V. Kumar. „Multilevel K-way Partitioning Scheme for Irregular Graphs“. In: J. Parallel Distrib. Comput. 48.1 (Jan. 1998), pp. 96–129. ISSN: 0743-7315. DOI: 10.1006/jpdc.1997.1404.

[KKI99] M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer. „Failure Data Analysis of a LAN of Windows NT Based Computers“. In: Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems. SRDS '99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 178–. ISBN: 0-7695-0290-3. URL: http://dl.acm.org/citation.cfm?id=829524.831049.

[Koi+01] M. Koibuchi, A. Funahashi, A. Jouraku, and H. Amano. „L-turn routing: an adaptive routing in irregular networks“. In: International Conference on Parallel Processing. Sept. 2001, pp. 383–392. DOI: 10.1109/ICPP.2001.952084.

[Koi+03] M. Koibuchi, A. Jouraku, K. Watanabe, and H. Amano. „Descending layers routing: a deadlock-free deterministic routing using virtual channels in system area networks with irregular topologies“. In: 2003 International Conference on Parallel Processing. Oct. 2003, pp. 527–536. DOI: 10.1109/ICPP.2003.1240620.


[Koi+12] M. Koibuchi, H. Matsutani, H. Amano, D. F. Hsu, and H. Casanova. „A case for random shortcut topologies for HPC interconnects“. In: SIGARCH Comput. Archit. News 40.3 (June 2012), pp. 177–188. ISSN: 0163-5964. DOI: 10.1145/2366231.2337179.

[Leó+16] E. A. León, I. Karlin, A. Bhatele, S. H. Langer, C. Chambreau, L. H. Howell, T. D'Hooge, and M. L. Leininger. „Characterizing Parallel Scientific Applications on Commodity Clusters: An Empirical Study of a Tapered Fat-tree“. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '16. Piscataway, NJ, USA: IEEE Press, 2016, 78:1–78:12. ISBN: 978-1-4673-8815-3. URL: http://dl.acm.org/citation.cfm?id=3014904.3015009.

[LFD01] P. López, J. Flich, and J. Duato. „Deadlock-Free Routing in InfiniBand through Destination Renaming“. In: Proceedings of the 2001 International Conference on Parallel Processing. ICPP '02. Washington, DC, USA: IEEE Computer Society, 2001, pp. 427–436. ISBN: 0-7695-1257-7. URL: http://dl.acm.org/citation.cfm?id=645535.656990.

[Liu+13] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. „F10: A Fault-tolerant Engineered Network“. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation. NSDI '13. Berkeley, CA, USA: USENIX Association, 2013, pp. 399–412. URL: http://dl.acm.org/citation.cfm?id=2482626.2482665.

[Liu+15] N. Liu, A. Haider, X.-H. Sun, and D. Jin. „FatTreeSim: Modeling Large-scale Fat-Tree Networks for HPC Systems and Data Centers Using Parallel and Discrete Event Simulation“. In: Proceedings of the 3rd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. SIGSIM PADS '15. New York, NY, USA: ACM, 2015, pp. 199–210. ISBN: 978-1-4503-3583-6. DOI: 10.1145/2769458.2769474.

[LLS04] D. Li, X. Lu, and J. Su. „Graph-Theoretic Analysis of Kautz Topology and DHT Schemes“. In: NPC. 2004, pp. 308–315.

[Los14] Los Alamos National Laboratory. Operational Data to Support and Enable Computer Science Research. Apr. 2014. URL: https://institute.lanl.gov/data/fdata/ (visited on 04/01/2014).

[LS01] O. Lysne and T. Skeie. „Load Balancing of Irregular System Area Networks Through Multiple Roots“. In: Proceedings of the International Conference on Communication in Computing, CIC 2001. CSREA Press, 2001, pp. 165–171.

[LST14] D. H. Larkin, S. Sen, and R. E. Tarjan. „A Back-to-Basics Empirical Study of Priority Queues“. In: Proceedings of the Meeting on Algorithm Engineering & Experiments. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, Jan. 2014, pp. 61–72. URL: http://dl.acm.org/citation.cfm?id=2790174.2790181.

[Luc+14] R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson, L. Carrington, G. Chiu, R. Colwell, W. Dally, J. Dongarra, A. Geist, G. Grider, R. Haring, J. Hittinger, A. Hoisie, D. Klein, P. Kogge, R. Lethin, V. Sarkar, R. S. Schreiber, J. Shalf, T. Sterling, and R. Stevens. DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges. Technical Report. USDOE Office of Science (SC) (United States), Feb. 2014.

[Lys+06] O. Lysne, T. Skeie, S.-A. Reinemo, and I. Theiss. „Layered routing in irregular networks“. In: IEEE Transactions on Parallel and Distributed Systems 17 (2006), pp. 51–65.


[Man+11] K. L. Man, K. Yedluri, H. K. Kapoor, C.-U. Lei, E. G. Lim, and J. M. „Highly Resilient Minimal Path Routing Algorithm for Fault Tolerant Network-on-Chips“. In: Procedia Engineering 15 (2011). CEIS 2011, pp. 3406–3410. ISSN: 1877-7058. DOI: 10.1016/j.proeng.2011.08.638.

[McC+15] J. McClurg, H. Hojjat, P. Cerny, and N. Foster. „Efficient Synthesis of Network Updates“. In: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI 2015. New York, NY, USA: ACM, 2015, pp. 196–207. ISBN: 978-1-4503-3468-6. DOI: 10.1145/2737924.2737980.

[Mej+06] A. Mejia, J. Flich, J. Duato, S.-A. Reinemo, and T. Skeie. „Segment-based routing: an efficient fault-tolerant routing algorithm for meshes and tori“. In: 20th International Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 2006, p. 10. DOI: 10.1109/IPDPS.2006.1639341.

[Mel13] Mellanox Technologies. Mellanox OFED for Linux User Manual Rev. 2.0-3.0.0. Aug. 2013. URL: http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v2.0-3.0.0.pdf (visited on 03/16/2016).

[Mel16] Mellanox Technologies. Product Brief: Unified Fabric Manager Software. Mar. 2016. URL: http://www.mellanox.com/related-docs/prod_management_software/PB_UFM_InfiniBand.pdf (visited on 03/16/2016).

[Mes15] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 3.1. June 2015. URL: http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf (visited on 03/16/2016).

[Mic15] Microsoft Corporation. Whitepaper: Microsoft's Cloud Networks. Tech. rep. 2015. URL: http://download.microsoft.com/download/9/9/A/99ADCE75-5F63-4E47-905C-F511EE7D3786/Microsofts_Cloud_Networks_Strategy_Brief.pdf (visited on 01/15/2017).

[MN98] M. Matsumoto and T. Nishimura. „Mersenne Twister: A 623-dimensionally Equidistributed Uniform Pseudo-random Number Generator“. In: ACM Transactions on Modeling and Computer Simulation 8.1 (Jan. 1998), pp. 3–30. ISSN: 1049-3301. DOI: 10.1145/272991.272995.

[Mub+14] M. Mubarak, C. D. Carothers, R. B. Ross, and P. Carns. „A Case Study in Using Massively Parallel Simulation for Extreme-scale Torus Network Codesign“. In: Proceedings of the 2nd ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. SIGSIM PADS '14. New York, NY, USA: ACM, 2014, pp. 27–38. ISBN: 978-1-4503-2794-7. DOI: 10.1145/2601381.2601383.

[Nel04] W. Nelson. Applied Life Data Analysis. Wiley Series in Probability and Statistics. Wiley, 2004. ISBN: 978-0-471-64462-0. URL: http://books.google.com/books?id=qyyjJIL40S0C.

[Oak14] Oak Ridge National Laboratory. Summit, Oak Ridge's next High Performance Supercomputer. Nov. 2014. URL: https://www.olcf.ornl.gov/summit/ (visited on 12/01/2016).

[OGP03] D. Oppenheimer, A. Ganapathi, and D. A. Patterson. „Why do internet services fail, and what can be done about it?“ In: Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems. Vol. 4. USITS '03. Berkeley, CA, USA: USENIX Association, 2003. URL: http://dl.acm.org/citation.cfm?id=1251460.1251461.


[Öhr+95] S. R. Öhring, M. Ibel, S. K. Das, and M. J. Kumar. „On generalized fat trees“. In: IPPS '95: Proceedings of the 9th International Symposium on Parallel Processing. Washington, DC, USA: IEEE Computer Society, 1995, p. 37. ISBN: 0-8186-7074-6.

[OL05] E. Okorafor and M. Lu. „Percolation routing in a three-dimensional multicomputer network topology using optical interconnection“. In: J. Opt. Netw. 4.3 (Mar. 2005), pp. 157–175. DOI: 10.1364/JON.4.000157.

[Olo16] A. Olofsson. Whitepaper: Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip. Tech. rep. 2016. URL: https://www.parallella.org/docs/e5_1024core_soc.pdf (visited on 01/15/2017).

[Ope07] OpenSim Ltd. InfiniBand model from Mellanox. July 2007. URL: https://omnetpp.org/9-articles/software/3473--sp-364490752 (visited on 12/01/2016).

[Pal+06] M. Palesi, R. Holsmark, S. Kumar, and V. Catania. „A Methodology for Design of Application Specific Deadlock-free Routing Algorithms for NoC Systems“. In: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis. CODES+ISSS '06. New York, NY, USA: ACM, 2006, pp. 142–147. ISBN: 1-59593-370-0. DOI: 10.1145/1176254.1176289.

[Pan+14] Z. Pang, M. Xie, J. Zhang, Y. Zheng, G. Wang, D. Dong, and G. Suo. „The TH Express high performance interconnect networks“. In: Frontiers of Computer Science 8.3 (2014), pp. 357–366. ISSN: 2095-2236. DOI: 10.1007/s11704-014-3500-9.

[Pfi02] G. F. Pfister. „An Introduction to the InfiniBand Architecture“. In: High Performance Mass Storage and Parallel I/O: Technologies and Applications. Wiley-IEEE Press, 2002, pp. 616–632. ISBN: 978-0-470-54483-9. URL: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5264600.

[Pri+13a] B. Prisacari, G. Rodriguez, C. Minkenberg, and T. Hoefler. „Bandwidth-optimal all-to-all exchanges in fat tree networks“. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS '13. New York, NY, USA: ACM, 2013, pp. 139–148. ISBN: 978-1-4503-2130-3. DOI: 10.1145/2464996.2465434.

[Pri+13b] B. Prisacari, G. Rodriguez, C. Minkenberg, and T. Hoefler. „Fast Pattern-Specific Routing for Fat Tree Networks“. In: ACM Trans. Archit. Code Optim. 10.4 (Dec. 2013), 36:1–36:25. ISSN: 1544-3566. DOI: 10.1145/2555289.2555293.

[Pue+99] V. Puente, R. Beivide, J. A. Gregorio, J. M. Prellezo, J. Duato, and C. Izu. „Adaptive Bubble Router: A Design to Improve Performance in Torus Networks“. In: Proceedings of the 1999 International Conference on Parallel Processing. ICPP '99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 58–67. ISBN: 0-7695-0350-0. URL: http://dl.acm.org/citation.cfm?id=850940.852882.

[PV97] F. Petrini and M. Vanneschi. „k-ary n-trees: high performance networks for massively parallel architectures“. In: 11th International Parallel Processing Symposium. Apr. 1997, pp. 87–93. DOI: 10.1109/IPPS.1997.580853.

[Rei+11] S.-A. Reinemo, E. G. Gran, T. Skeie, and O. Lysne. InfiniBand Congestion Control. Published: Contributed talk at the 2011 OpenFabrics International Workshop, Monterey, USA. Apr. 2011.


[Rei+12] M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker. „Abstractions for Network Update“. In: Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. SIGCOMM '12. New York, NY, USA: ACM, 2012, pp. 323–334. ISBN: 978-1-4503-1419-0. DOI: 10.1145/2342356.2342427.

[RFR09] A. Rezazadeh, M. Fathy, and G. Rahnavard. „Evaluating Throughput of a Wormhole-Switched Routing Algorithm in NoC with Faults“. In: International Conference on Network and Service Security. N2S '09. June 2009, pp. 1–5.

[RL08] R. Ramanujam and B. Lin. „Near-optimal oblivious routing on three-dimensional mesh networks“. In: IEEE International Conference on Computer Design. ICCD 2008. 2008, pp. 134–141. DOI: 10.1109/ICCD.2008.4751852.

[Rod+09] G. Rodriguez, C. Minkenberg, R. Beivide, R. P. Luijten, J. Labarta, and M. Valero. „Oblivious routing schemes in extended generalized Fat Tree networks“. In: IEEE International Conference on Cluster Computing and Workshops, 2009 (CLUSTER '09). Aug. 2009, pp. 1–8. DOI: 10.1109/CLUSTR.2009.5289145.

[RR10] T. Rauber and G. Rünger. Parallel Programming - for Multicore and Cluster Systems. Springer, 2010. ISBN: 978-3-642-04817-3.

[RZC11] S. Raza, Y. Zhu, and C.-N. Chuah. „Graceful Network State Migrations“. In: IEEE/ACM Trans. Netw. 19.4 (Aug. 2011), pp. 1097–1110. ISSN: 1063-6692. DOI: 10.1109/TNET.2010.2097604.

[San+02] J. C. Sancho, J. Flich, A. Robles, P. López, and J. Duato. „Analyzing the Influence of Virtual Lanes on the Performance of InfiniBand Networks“. In: IPDPS '02: Proceedings of the 16th International Parallel and Distributed Processing Symposium. Washington, DC, USA: IEEE Computer Society, 2002, p. 72. ISBN: 0-7695-1573-8.

[SB02] S. Sharma and P. K. Bansal. „A new fault tolerant multistage interconnection network“. In: 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering. Vol. 1. TENCON '02. Oct. 2002, pp. 347–350. DOI: 10.1109/TENCON.2002.1181285.

[SBH16] T. Schneider, O. Bibartiu, and T. Hoefler. „Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks“. In: 24th IEEE Annual Symposium on High-Performance Interconnects. HOTI'16. Santa Clara, CA, USA, Aug. 2016, pp. 1–8. DOI: 10.1109/HOTI.2016.015.

[Sch+91] M. D. Schroeder, A. Birrell, M. Burrows, H. Murray, R. Needham, T. Rodeheffer, E. Satterthwaite, and C. Thacker. „Autonet: A High-speed, Self-Configuring Local Area Network Using Point-to-Point Links“. In: IEEE Journal on Selected Areas in Communications 9.8 (Oct. 1991).

[SG10] B. Schroeder and G. Gibson. „A Large-Scale Study of Failures in High-Performance Computing Systems“. In: IEEE Transactions on Dependable and Secure Computing 7.4 (Dec. 2010), pp. 337–350. ISSN: 1545-5971. DOI: 10.1109/TDSC.2009.4.

[Shi+09] K. S. Shim, M. H. Cho, M. Kinsy, T. Wen, M. Lis, G. E. Suh, and S. Devadas. „Static virtual channel allocation in oblivious routing“. In: Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip. NOCS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 38–43. ISBN: 978-1-4244-4142-6. DOI: 10.1109/NOCS.2009.5071443.


[Sin+12] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. „Jellyfish: Networking Data Centers Randomly“. In: Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). San Jose, CA: USENIX, 2012, pp. 225–238. ISBN: 978-931971-92-8. URL: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/singla.

[SKC14] M. G. Solomon, D. Kim, and J. L. Carrell. Fundamentals Of Communications And Networking. 2nd. USA: Jones and Bartlett Publishers, Inc., 2014. ISBN: 1-284-06014-4 978-1-284-06014-0. URL: https://books.google.com/books?id=LVktBAAAQBAJ.

[Ske+04] T. Skeie, O. Lysne, J. Flich, P. López, A. Robles, and J. Duato. „LASH-TOR: A Generic Transition-Oriented Routing Algorithm“. In: ICPADS '04: Proceedings of the Parallel and Distributed Systems, Tenth International Conference. Washington, DC, USA: IEEE Computer Society, 2004, p. 595. ISBN: 0-7695-2152-5. DOI: 10.1109/ICPADS.2004.50.

[SLT02] T. Skeie, O. Lysne, and I. Theiss. „Layered Shortest Path (LASH) Routing in Irregular System Area Networks“. In: IPDPS '02: Proceedings of the 16th International Parallel and Distributed Processing Symposium. Washington, DC, USA: IEEE Computer Society, 2002, p. 194. ISBN: 0-7695-1573-8.

[Sod+16] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu. „Knights Landing: Second-Generation Intel Xeon Phi Product“. In: IEEE Micro 36.2 (Mar. 2016), pp. 34–46. ISSN: 0272-1732. DOI: 10.1109/MM.2016.25.

[SRD00a] J. C. Sancho, A. Robles, and J. Duato. „A new methodology to compute deadlock-free routing tables for irregular networks“. In: Network-Based Parallel Computing. Communication, Architecture, and Applications. 4th International Workshop, CANPC 2000. Proceedings (Lecture Notes in Computer Science Vol. 1797). Berlin, Germany, 2000, pp. 45–60.

[SRD00b] J. C. Sancho, A. Robles, and J. Duato. „A Flexible Routing Scheme for Networks of Workstations“. In: ISHPC '00: Proceedings of the Third International Symposium on High Performance Computing. London, UK: Springer-Verlag, 2000, pp. 260–267. ISBN: 3-540-41128-3.

[SRD04] J. C. Sancho, A. Robles, and J. Duato. „An Effective Methodology to Improve the Performance of the Up*/down* Routing Algorithm“. In: IEEE Transactions on Parallel and Distributed Systems 15.8 (2004), pp. 740–754.

[Str+16] E. Strohmaier, J. Dongarra, H. Simon, and M. Meuer. TOP500. Nov. 2016. URL: http://www.top500.org/ (visited on 12/01/2016).

[Sub+12] H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda. „Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes“. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, 70:1–70:12. ISBN: 978-1-4673-0804-5. URL: http://dl.acm.org/citation.cfm?id=2388996.2389091.

[Suh+95] Y.-J. Suh, B. V. Dao, J. Duato, and S. Yalamanchili. „Software based fault-tolerant oblivious routing in pipelined networks“. In: Proceedings of the 1995 International Conference on Parallel Processing. Aug. 1995, pp. I-101–I-105.


[SWM03] T. Shanley, J. Winkles, and MindShare, Inc. InfiniBand Network Architecture. PC System Architecture Series. Pearson Addison Wesley Prof, 2003. ISBN: 978-0-321-11765-6. URL: http://books.google.com/books?id=4s0BNIxPsdIC.

[Szl06] E. Szlachcic. „Fault Tolerant Topological Design for Computer Networks“. In: International Conference on Dependability of Computer Systems. DepCos-RELCOMEX '06. May 2006, pp. 150–159. DOI: 10.1109/DEPCOS-RELCOMEX.2006.25.

[Tou80] S. Toueg. „Deadlock- and livelock-free packet switching networks“. In: STOC '80: Proceedings of the 12th annual ACM Symposium on Theory of Computing. New York, NY, USA: ACM, 1980, pp. 94–99. ISBN: 0-89791-017-6. DOI: 10.1145/800141.804656.

[Trä13] J. L. Träff. „A Note on (Parallel) Depth- and Breadth-First Search by Arc Elimination“. In: CoRR abs/1305.1222 (2013). URL: http://dblp.uni-trier.de/db/journals/corr/corr1305.html#abs-1305-1222.

[VAK10] A. Verma, S. Ajit, and D. Karanki. Reliability and Safety Engineering. Springer Series in Reliability Engineering. Springer, 2010. ISBN: 978-1-84996-231-5. URL: https://books.google.com/books?id=OgWhCgAAQBAJ.

[VH08] A. Varga and R. Hornig. „An overview of the OMNeT++ simulation environment“. In: Proceedings of the 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems & Workshops. Simutools '08. ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2008, 60:1–60:10. ISBN: 978-963-9799-20-2. URL: http://dl.acm.org/citation.cfm?id=1416222.1416290.

[VSR15] J. C. Villanueva, T. Skeie, and S.-A. Reinemo. Whitepaper: Routing and Fault-Tolerance Capabilities of the Fabriscale FM compared to OpenSM. Tech. rep. Fabriscale, July 2015, p. 10. URL: http://fabriscale.com/wp-content/uploads/2015/07/whitepaper_isc2015.pdf (visited on 03/16/2016).

[WC04] G. Wang and J. Chen. „On fault tolerance of 3-dimensional mesh networks“. In: 7th International Symposium on Parallel Architectures, Algorithms and Networks. May 2004, pp. 149–154. DOI: 10.1109/ISPAN.2004.1300473.

[WCP13] R. Wang, L. Chen, and T. M. Pinkston. „Bubble Coloring: Avoiding Routing- and Protocol-induced Deadlocks with Minimal Virtual Channel Requirement“. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS '13. New York, NY, USA: ACM, 2013, pp. 193–202. ISBN: 978-1-4503-2130-3. DOI: 10.1145/2464996.2465436.

[Wil10] L. J. Wilson. „Managing vendor relations: a case study of two HPC network issues“. In: Proceedings of the 24th International Conference on Large Installation System Administration. LISA'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 1–13. URL: http://dl.acm.org/citation.cfm?id=1924976.1924991.

[Wol+16] N. Wolfe, C. D. Carothers, M. Mubarak, R. Ross, and P. Carns. „Modeling a Million-Node Slim Fly Network Using Parallel Discrete-Event Simulation“. In: Proceedings of the 2016 Annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation. SIGSIM-PADS '16. New York, NY, USA: ACM, 2016, pp. 189–199. ISBN: 978-1-4503-3742-7. DOI: 10.1145/2901378.2901389.


[Wol+17] N. Wolfe, M. Mubarak, N. Jain, J. Domke, A. Bhatele, C. D. Carothers, and R. B. Ross. „Preliminary Performance Analysis of Multi-rail Fat-tree Networks“. In: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. Short paper; accepted at CCGrid '17. Madrid, Spain, May 2017, p. 4.

[Xu10] J. Xu. Topological Structure and Analysis of Interconnection Networks. 1st. Springer Publishing Company, Incorporated, 2010. ISBN: 1-4419-5203-9 978-1-4419-5203-5.

[YCM06] H. Yu, I.-H. Chung, and J. Moreira. „Topology Mapping for Blue Gene/L Supercomputer“. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. SC '06. New York, NY, USA: ACM, 2006. ISBN: 0-7695-2700-0. DOI: 10.1145/1188455.1188576.

[Yeb+13] P. Yebenes, J. Escudero-Sahuquillo, P. J. Garcia, and F. J. Quiles. „Towards Modeling Interconnection Networks of Exascale Systems with OMNeT++“. In: Proceedings of the 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. PDP '13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 203–207. ISBN: 978-0-7695-4939-2. DOI: 10.1109/PDP.2013.36.

[YHZ09] X. Yu, Z. Huaxin, and S. Zhijun. „Design and implementation of switch module for NS-3“. In: Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools. VALUETOOLS '09. ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2009, 3:1–3:10. ISBN: 978-963-9799-70-7. DOI: 10.4108/ICST.VALUETOOLS2009.7658.

[YJG03] A. B. Yoo, M. A. Jette, and M. Grondona. „SLURM: Simple Linux Utility for Resource Management“. In: Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003. Revised Papers. Ed. by D. Feitelson, L. Rudolph, and U. Schwiegelshohn. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 44–60. ISBN: 978-3-540-39727-4. URL: http://dx.doi.org/10.1007/10968987_3.

[Zah+10] E. Zahavi, G. Johnson, D. J. Kerbyson, and M. Lang. „Optimized InfiniBand fat-tree routing for shift all-to-all communication patterns“. In: Concurr. Comput.: Pract. Exper. 22.2 (Feb. 2010), pp. 217–231. ISSN: 1532-0626. DOI: 10.1002/cpe.v22:2.

[Zah10] E. Zahavi. D-Mod-K Routing Providing Non- Traffic for Shift Permutations on Real Life Fat Trees. Tech. rep. Aug. 2010. URL: https://webee.technion.ac.il/publication-link/index/id/574.

[Zah12] E. Zahavi. „Fat-tree Routing and Node Ordering Providing Contention Free Traffic for MPI Global Collectives“. In: J. Parallel Distrib. Comput. 72.11 (Nov. 2012), pp. 1423–1432. ISSN: 0743-7315. DOI: 10.1016/j.jpdc.2012.01.018.

[ZC12] J. Zhou and Y.-C. Chung. „Tree-turn routing: an efficient deadlock-free routing algorithm for irregular networks“. English. In: The Journal of Supercomputing 59.2 (2012), pp. 882–900. ISSN: 0920-8542. DOI: 10.1007/s11227-010-0477-0.

[Zhe+05] G. Zheng, T. Wilmarth, P. Jagadishprasad, and L. V. Kalé. „Simulation-based Performance Prediction for Large Parallel Machines“. In: International Journal of Parallel Programming 33.2 (June 2005), pp. 183–207. ISSN: 0885-7458. DOI: 10.1007/s10766-005-3582-6.


[Zhu+15] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang. „Congestion Control for Large-Scale RDMA Deployments“. In: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. SIGCOMM '15. New York, NY, USA: ACM, 2015, pp. 523–536. ISBN: 978-1-4503-3542-3. DOI: 10.1145/2785956.2787484.

[ZIH13] ZIH, TU Dresden. Taurus - Network configuration. Jan. 2013.

[ZL92] X. Zhong and V. M. Lo. „Application-Specific Deadlock Free Wormhole Routing on Multicomputers“. In: Proceedings of the 4th International PARLE Conference on Parallel Architectures and Languages Europe. PARLE '92. London, UK, UK: Springer-Verlag, 1992, pp. 193–208. ISBN: 3-540-55599-4. URL: http://dl.acm.org/citation.cfm?id=646421.693827.
