Optimized Routing for Fat-Tree Topologies


Optimized Routing for Fat-Tree Topologies

Bartosz Bogdański

Thesis submitted for the degree of Philosophiae Doctor
Department of Informatics
Faculty of Mathematics and Natural Sciences
University of Oslo
January 2014

© Bartosz Bogdański, 2014

Series of dissertations submitted to the Faculty of Mathematics and Natural Sciences, University of Oslo, No. 1521. ISSN 1501-7710.

All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission.

Cover: Inger Sandved Anfinsen. Printed in Norway: AIT Oslo AS. Produced in co-operation with Akademika Publishing. The thesis is produced by Akademika Publishing merely in connection with the thesis defence. Kindly direct all inquiries regarding the thesis to the copyright holder or the unit which grants the doctorate.

To my family

Abstract

In recent years, InfiniBand has become one of the leading interconnects for high-performance systems. InfiniBand is not only the most popular interconnect used in the fastest, largest and most expensive supercomputers in the world, but it has also entered the enterprise market: today it can be found in thousands of datacenters, in database systems, in financial institutions dealing with high-frequency trading, and even in web mapping services where picture data is stitched together to form the map we see in a web browser.

Many InfiniBand systems are built using the fat-tree topology. The fat-tree routing algorithms used to distribute paths in a fat-tree have been scrutinized in recent years, and multiple efforts have aimed at improving network performance. This thesis, as a collection of six research papers, contributes to various aspects of fat-tree routing for InfiniBand. The research presented here is strongly influenced by the requirements of enterprise systems, which are significantly smaller than supercomputers, have space, energy and fixed-cabling limitations, and require an integrated approach that combines multiple components so that the system is robust, scalable and responsive.

Current fat-tree routing algorithms lack several features that are required by an optimized enterprise system. In Papers I-IV and Paper VI we propose methods that improve the properties of the fat-tree routing algorithm. The first, and most important, property of an interconnection routing algorithm is performance. In Paper I we study the problem of non-optimal routing in topologies where switch ports are not fully populated with end-nodes, and we propose a solution that alleviates the counter-intuitive performance drop for such fabrics. Paper II goes further and proposes a cheap alternative to congestion control: by using multiple virtual lanes, we are able to remove the head-of-line blocking and parking lot problems in fat-tree fabrics, which significantly improves performance when multiple hotspots are present in the network. Paper III focuses on route reachability, another major property of a routing algorithm. We formally prove that full reachability between all devices can be accomplished in any regular fat-tree without any deadlock concern, and we present an algorithm that achieves full reachability between any pair of devices in a fat-tree fabric. The last routing property we analyze is fault tolerance, which we discuss in Papers IV and VI. The former paper is a study of four major routing algorithms in which we compare their routing performance on various irregular fat-trees. In this paper, we also propose and analyze methods to make the discovery algorithm for fat-tree routing more fault-tolerant and scalable. Paper VI presents a multi-homed routing algorithm that ensures that no single point of failure exists in a fabric for a node that has more than one port.

When InfiniBand was standardized, several major features were missing. One such crucial feature is layer-3 routing, i.e. IB-to-IB routing between IB subnets. Currently, there is very little support for such technology, but the limits imposed on the local addressing space, the inability to logically segment fabrics, long reconfiguration times for large fabrics in case of faults, and, finally, performance issues when interconnecting large clusters have rekindled the industry's interest in layer-3 routing. In Paper V we examine the layer-3 routing problems in InfiniBand and introduce two new routing algorithms for inter-subnet IB routing. We show that the features provided by the fat-tree topology make it an excellent choice for the underlying logical backbone of the fabric, and the two routing algorithms reuse the original concepts that fat-tree routing is based upon.
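The abstract refers throughout to the central property of fat-tree routing: distributing destinations deterministically across the redundant up-links of every switch. For reference only, the snippet below is a minimal sketch of the classic destination-based (d-mod-k style) up-port selection for a k-ary n-tree. It is not the OpenSM implementation and not one of the algorithms contributed in the papers; the function and parameter names (select_up_port, num_up_ports) are illustrative assumptions.

```python
# Minimal illustrative sketch of d-mod-k-style up-port selection in a k-ary
# n-tree. Not the OpenSM fat-tree routing engine and not the algorithms
# proposed in this thesis; it only shows the load-balancing idea they build on.

def select_up_port(destination_id: int, num_up_ports: int, level: int) -> int:
    """Pick the upward port for a packet heading to `destination_id`.

    Using a different "digit" of the destination at each tree level keeps
    consecutive destinations on different up-links, which spreads the
    deterministic paths evenly across the fabric.
    """
    return (destination_id // (num_up_ports ** level)) % num_up_ports


# Example: with 4 up-ports per switch, destinations 0..7 are spread
# round-robin over ports 0..3 at the first (leaf) level.
for dst in range(8):
    print(dst, select_up_port(dst, num_up_ports=4, level=0))
```

Because every switch maps a given destination to one fixed up-port, all traffic towards that destination converges onto a single deterministic path; this is the property the papers refine for imbalanced, congested, degraded and multi-homed fabrics.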
Acknowledgements

First and foremost, I wish to express my heartfelt gratitude to my supervisors, Sven-Arne Reinemo, Tor Skeie, and Olav Lysne, who gave me encouragement, constructive feedback, and guidance throughout my studies. Finishing this journey would not have been possible without you.

I am also grateful to my friends and colleagues at Simula Research Laboratory for making it an attractive place to do research. I am especially indebted to Wei Lin Guay, Ernst Gunnar Gran and Frank Olaf Sem-Jacobsen, who taught me so many useful things.

I would also like to thank my colleagues from Oracle Corporation, Bjørn Dag Johnsen, Line Holen, Lars Paul Huse, Ola Tørudbakken and Jørn Raastad, who supported me during my PhD with their excellent technical skills and motivated me in difficult moments.

I would not have contemplated this journey if not for my parents, Bożenna Bogdańska and Bogdan Bogdański, who infused me with a love of science and creative pursuits. To my parents, thank you.

Finally, I would like to thank my wife, Beata, for her never-ending patience, love and support, and my little son Jan Krzysztof, for his bright and radiant smile.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Figures
1 Introduction
  1.1 Context of the Work
  1.2 Contributions
  1.3 Research Methods
    1.3.1 Exploratory Research
    1.3.2 Constructive Research
    1.3.3 Simulations
    1.3.4 Hardware Experiments
  1.4 Published Works
2 Background
  2.1 Interconnection Networks
    2.1.1 Topologies
      2.1.1.1 Shared-Bus Networks
      2.1.1.2 Direct Networks
      2.1.1.3 Indirect Networks
      2.1.1.4 Hybrid Networks
    2.1.2 Fat-Trees
      2.1.2.1 Fat-Tree Variants
      2.1.2.2 Construction Cost
    2.1.3 Switching Techniques and Flow Control
      2.1.3.1 Circuit Switching
      2.1.3.2 Packet Switching
      2.1.3.3 Virtual Channels
    2.1.4 Routing
      2.1.4.1 Routing Function
      2.1.4.2 Taxonomy of Routing Algorithms
      2.1.4.3 Fat-Tree Routing
      2.1.4.4 Deadlock
  2.2 InfiniBand Architecture
    2.2.1 InfiniBand Enterprise Systems
    2.2.2 Routing in InfiniBand
      2.2.2.1 Min-Hop
      2.2.2.2 Up*/Down*
      2.2.2.3 Dimension Order Routing
      2.2.2.4 Torus-2QoS
      2.2.2.5 LASH
      2.2.2.6 DFSSSP
3 Summary of Research Papers
  3.1 Paper I: Achieving Predictable High Performance in Imbalanced Fat Trees
  3.2 Paper II: vFtree - A Fat-Tree Routing Algorithm using Virtual Lanes to Alleviate Congestion
  3.3 Paper III: sFtree: A Fully Connected and Deadlock-Free Switch-to-Switch Routing Algorithm for Fat-Trees
  3.4 Paper IV: Discovery and Routing of Degraded Fat-Trees
  3.5 Paper V: Making the Network Scalable: Inter-subnet Routing in InfiniBand
  3.6 Paper VI: Multi-homed Fat-Tree Routing with InfiniBand
4 Conclusions
  4.1 Future Directions
  4.2 Concluding Remarks
Bibliography
Published Works

List of Figures

2.1 The strictly orthogonal direct networks.
2.2 Multistage interconnection networks.
2.3 XGFT notation cannot be used to describe a full CBB [1].
2.4 Virtual channel flow control.
2.5 Unidirectional ring with four nodes.
2.6 IBA System Area Network. Courtesy of IBTA [2, p. 89].

Chapter 1
Introduction

The term "Super Computing" was first used in 1929; however, it was not until the 1960s that real supercomputers were introduced. The first supercomputers consisted of relatively few custom-built processors, but today massively parallel supercomputers equipped with tens of thousands of commercial off-the-shelf processors are the standard. This evolution is best shown by comparing the performance of the Cray Titan supercomputer, currently one of the fastest supercomputers in the world (17.6 PetaFLOPS), with the CDC 6600, the first machine considered to be a supercomputer (1 MegaFLOP): the Titan is 1.76 × 10^10 times faster than the CDC 6600. For comparison purposes, this is also the difference between the land speed of a garden snail (0.017 m/s) and the speed of light in a vacuum (299 792 458 m/s). However, it is not only the number
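As a quick sanity check of the performance ratio quoted above, the two quotients do indeed agree to three significant figures. The snippet below is a back-of-the-envelope sketch using only the figures given in the text; the variable names are illustrative.

```python
# Back-of-the-envelope check of the ratios quoted in the introduction.
# The figures (17.6 PFLOPS, 1 MFLOPS, 0.017 m/s, 299 792 458 m/s) are taken
# from the text above; variable names are our own.
titan_flops   = 17.6e15        # Cray Titan: 17.6 PetaFLOPS
cdc6600_flops = 1e6            # CDC 6600: 1 MegaFLOPS
snail_speed   = 0.017          # garden snail, m/s
light_speed   = 299_792_458    # speed of light in a vacuum, m/s

print(titan_flops / cdc6600_flops)   # 1.76e+10
print(light_speed / snail_speed)     # ~1.76e+10 (about 1.7635e+10)
```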