Optimized Routing for Fat-Tree Topologies

Optimized Routing for Fat-Tree Topologies Bartosz Bogdański Thesis submitted for the degree of Philosophiae Doctor Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo January 2014 © Bartosz BogdaĔski, 2014 Series of dissertations submitted to the Faculty of Mathematics and Natural Sciences, University of Oslo No. 1521 ISSN 1501-7710 All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission. Cover: Inger Sandved Anfinsen. Printed in Norway: AIT Oslo AS. Produced in co-operation with Akademika Publishing. The thesis is produced by Akademika Publishing merely in connection with the thesis defence. Kindly direct all inquiries regarding the thesis to the copyright holder or the unit which grants the doctorate. To my family Abstract In recent years, InfiniBand has become one of the leading interconnects for high- performance systems. InfiniBand is not only the most popular interconnect used in the fastest, largest and most expensive supercomputers in the world, but it has also entered the enterprise market and today it can be found in thousands of datacenters, in database systems, in financial institutions dealing with high- frequency trading, and even in web mapping services where picture data is stitched together to form the map we see in a web browser. Many of the InfiniBand systems are built using the fat-tree topology. The fat-tree routing algorithms that are used to distribute the paths in a fat-tree have been scrutinized in recent years and there were multiple efforts that aimed at improving the network performance. This thesis, as a collection of six research papers, contributes to various routing aspects that concern fat-tree routing for InfiniBand. The research work presented here is strongly influenced by require- ments of enterprise systems, which are significantly smaller than supercomputers, have space, energy and fixed-cabling limitations and require an integrated approach that combines multiple components so the system is robust, scalable and responsive. Current fat-tree routing algorithms lack several features that are required by an optimized enterprise system. In Papers I-IV and Paper VI we propose methods that improve the properties of the fat-tree routing algorithm. The first, and most important property of an interconnection routing algorithm is performance. In Paper I we study the problem of non-optimal routing in topologies where switch ports are not fully populated with end-nodes and we propose a solution that alleviates the counter-intuitive performance drop for such fabrics. Paper II goes further and proposes a cheap alternative to congestion control. By using multiple virtual lanes, we are able to remove head-of-line blocking and parking lot problems in fat-tree fabrics, which significantly improves the performance when multiple hotspots are present in the network. Paper III focuses on V route reachability, which is another major property of a routing algorithm. We formally prove that full reachability between all devices can be accomplished in any regular fat-tree without any deadlock concern and we present an algorithm that achieves full reachability between any pair of devices in a fat-tree fabric. The last routing property we analyze is fault tolerance, which we dis- cuss in Paper IV and VI. The former paper is a study of four major routing algorithms where we compare their routing performance on various irregular fattrees. In this paper, we also propose and analyze methods to make the discovery algorithm for the fat-tree routing more fault-tolerant and scalable. Paper VI presents a multihomed routing algorithm that makes sure that no single point of failure exists in a fabric for a node that has more than one port. When InfiniBand was standardized, several major features were missing. One of such crucial features is the layer-3 routing or IB-IB routing between IB subnets. Currently, there is very little support for such technology, but the limits imposed on local addressing space, inability to logically segment fabrics, long reconfiguration times for large fabrics in case of faults, and, finally, performance issues when interconnecting large clusters, have rekindled the industry’s interest into layer-3 routing. In Paper V we examine the layer-3 routing problems in InfiniBand and we introduce two new routing algorithms for inter-subnet IB routing. We show that the features provided by the fat-tree topology make it an excellent choice for the underlying logical backbone of the fabric and the two routing algorithms reuse the original concepts fat-tree routing is based upon. VI Acknowledgements First and foremost, I wish to express my heartfelt gratitude to my supervisors, Sven-Arne Reinemo, Tor Skeie, and Olav Lysne who gave me encouragement, constructive feedback, and guidance through the whole duration of my studies. Finishing this journey would not be possible without you. I am also grateful to my friends and colleagues at Simula Research Labora- tory for making it an attractive place to do research. I am especially indebted to Wei Lin Guay, Ernst Gunnar Gran and Frank Olaf Sem-Jacobsen, who taught me so many useful things. I would also like to thank my colleagues from Oracle Corporation, Bjørn Dag Johnsen, Line Holen, Lars Paul Huse, Ola Tørudbakken and Jørn Raastad who supported me during my PhD with their excellent technical skills and motivated me in difficult moments. I would not have contemplated this journey if not for my parents, Bo˙zenna Bogdańska and Bogdan Bogdański, who infused me with a love of science and creative pursuits. To my parents, thank you. Finally, I would like to thank my wife, Beata, for her neverending patience, love and support, and my little son Jan Krzysztof, for his bright and radiant smile. VII Table of Contents Abstract V Acknowledgements VII Table of Contents IX List of Figures XI 1 Introduction 1 1.1ContextoftheWork......................... 2 1.2Contributions............................. 4 1.3 Research Methods .......................... 5 1.3.1 Exploratory Research .................... 6 1.3.2 Constructive Research .................... 7 1.3.3 Simulations.......................... 7 1.3.4 HardwareExperiments.................... 8 1.4Publishedworks........................... 8 2 Background 13 2.1InterconnectionNetworks...................... 13 2.1.1 Topologies........................... 15 2.1.1.1 Shared-BusNetworks............... 15 2.1.1.2 DirectNetworks.................. 17 2.1.1.3 IndirectNetworks................. 17 2.1.1.4 HybridNetworks.................. 18 2.1.2 Fat-Trees........................... 19 2.1.2.1 Fat-TreeVariants................. 20 IX 2.1.2.2 ConstructionCost................. 21 2.1.3 SwitchingTechniquesandFlowControl.......... 22 2.1.3.1 CircuitSwitching................. 23 2.1.3.2 PacketSwitching.................. 24 2.1.3.3 VirtualChannels.................. 25 2.1.4 Routing............................ 26 2.1.4.1 RoutingFunction................. 26 2.1.4.2 TaxonomyofRoutingAlgorithms........ 27 2.1.4.3 Fat-TreeRouting.................. 28 2.1.4.4 Deadlock...................... 30 2.2 InfiniBand Architecture ....................... 32 2.2.1 InfiniBand Enterprise Systems ............... 34 2.2.2 Routing in InfiniBand .................... 35 2.2.2.1 Min-Hop...................... 36 2.2.2.2 Up*/Down*.................... 36 2.2.2.3 DimensionOrderRouting............. 37 2.2.2.4 Torus-2QoS..................... 37 2.2.2.5 LASH........................ 37 2.2.2.6 DFSSSP...................... 38 3 Summary of Research Papers 39 3.1 Paper I: Achieving Predictable High Performance in Imbalanced FatTrees............................... 40 3.2 Paper II: vFtree - A Fat-tree Routing Algorithm using Virtual LanestoAlleviateCongestion.................... 41 3.3 Paper III: sFtree: A Fully Connected and Deadlock-Free Switch- to-SwitchRoutingAlgorithmforFat-Trees............ 42 3.4PaperIV:DiscoveryandRoutingofDegradedFat-Trees..... 43 3.5 Paper V: Making the Network Scalable: Inter-subnet Routing in InfiniBand ............................... 44 3.6 Paper VI: Multi-homed Fat-Tree Routing with InfiniBand .... 45 4 Conclusions 47 4.1FutureDirections........................... 47 4.2ConcludingRemarks......................... 49 Bibliography 51 Published Works 63 X List of Figures 2.1 The strictly orthogonal direct networks. .............. 16 2.2Multistageinterconnectionnetworks................. 18 2.3 XGFT notation cannot be used to describe full CBB [1]. ..... 22 2.4Virtualchannelflowcontrol...................... 26 2.5Unidirectionalringwithfournodes................. 31 2.6IBASystemAreaNetwork.CourtesyofIBTA[p.89][2]..... 33 XI Chapter 1 Introduction The term Super Computing was first used in 19291, however, it was not un- til the 1960s that real supercomputers were introduced. First supercomputers consisted of relatively few custom-built processors, but today, massively parallel supercomputers equipped with tens of thousands of commercial off-the-shelf processors, are a standard. This evolution is best shown by comparing the performance of the Cray Titan supercomputer, currently one of the fastest supercomputers in the world (17.6 PetaFLOPS), with the CDC 6600, the first machine considered to be a supercomputer (1 MegaFLOP) - the Titan is 1.76∗1010 faster than the CDC 6600. For comparison purposes, this is also the difference between the land speed of a garden snail (0.017 m/s) and the speed of light in a vacuum (299 792 458 m/s). However, it is not only the number

Optimized Routing for Fat-Tree Topologies

Appendix E Final-Draft.Fm Page 1 Friday, July 14, 2006 10:23 AM

Testing the Limits of Tapered Fat Tree Networks

On-Line Reconfigurable Extended Generalized Fat Tree Network-On-Chip for Multiprocessor System-On-Chip Circuits

Parallel Computing MIMD Machine (I)

Download File

Machine CM-5 Technical Summary

Energy-Efficient Interconnection Networks for High-Performance Computing

Chapter 1: Parallel Computing

Research Report Trace-Driven Co-Simulation Of

The RACE Network Architecture

Designing Hybrid Data Center Networks for High Performance and Fault Tolerance

Interconnection Networks for Parallel Computers 1613