Measuring Congestion in High-Performance Datacenter Interconnects

Saurabh Jha1, Archit Patke1, Jim Brandt2, Ann Gentile2, Benjamin Lim1, Mike Showerman3, Greg Bauer3, Larry Kaplan4, Zbigniew Kalbarczyk1, William Kramer1,3, Ravi Iyer1

1University of Illinois at Urbana-Champaign, 2Sandia National Lab, 3National Center for Supercomputing Applications, 4Cray Inc.

Abstract

While it is widely acknowledged that network congestion in High-Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network, that of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.

1 Introduction

High-speed interconnect networks (HSNs), e.g., InfiniBand [48] and Cray Aries [42], which use credit-based flow control algorithms [32, 61], are increasingly being used in high-performance datacenters (HPC systems [11] and clouds [5, 6, 8, 80]) to support the low-latency communication primitives required by extreme-scale applications (e.g., scientific and deep-learning applications). Despite the network support for low-latency communication primitives and advanced congestion mitigation and protection mechanisms, significant performance variation has been observed in production systems running real-world workloads. While it is widely acknowledged that network congestion can significantly degrade application performance [24, 26, 45, 71, 81], there has been little to no quantification of congestion on such interconnect networks to understand, diagnose, and mitigate congestion problems at the application or system level. In particular, tools and techniques that perform runtime measurement and characterization and provide runtime feedback to system software (e.g., schedulers) or users (e.g., application developers or system managers) are generally not available on production systems. Providing them would require continuous system-wide collection of data on the state of network performance, along with complex analysis that may be difficult to perform at runtime.

The core contributions of this paper are (a) a methodology, including algorithms, for quantitative characterization of congestion in high-speed interconnect networks; (b) the introduction of a deployable toolset, Monet [7], that employs our congestion characterization methodology; and (c) use of the methodology to characterize congestion using 5 months of operational data from the 3D torus-based interconnect network of Blue Waters [1, 27, 60], a 13.3-petaflop Cray supercomputer at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. The novelty of our approach is its ability to use the percent time stalled (PTs) metric1 to detect and quantitatively characterize congestion hotspots, also referred to as congestion regions (CRs), which are groups of links with similar levels of congestion.

The Monet tool has been experimentally used on NCSA's Blue Waters. Blue Waters uses a Cray Gemini [21] 3D torus interconnect, the largest known 3D torus in existence, which connects 27,648 compute nodes, henceforth referred to as nodes. The proposed tool is not specific to Cray Gemini and Blue Waters; it can be deployed on other k-dimensional mesh or toroidal networks, such as TPU clouds [3], the Fujitsu TOFU network-based [18, 20] K supercomputer [70], and the upcoming post-K supercomputer [10]2. The key components of our methodology and the Monet toolset are as follows:

Data collection tools: On Blue Waters, we use vendor-provided tools (e.g., gpcdr [35]), along with the Lightweight Distributed Metric Service (LDMS) monitoring framework [17]. Together these tools collect data on (a) the network (e.g., transferred/received bytes, congestion metrics, and link-failure events); (b) the file system traffic (e.g., read/write bytes); and (c) the applications (e.g., start/end times). We released the raw network data obtained from Blue Waters [57], as well as the associated code for generating CRs, as artifacts with this paper [7]. To the best of our knowledge, this is the first such large-scale network data release for an HPC high-speed interconnect network that uses credit-based flow control.

A network hotspot extraction and characterization tool, which extracts CRs at runtime using an unsupervised region-growth clustering algorithm. The clustering method requires specification of congestion metrics (e.g., percent time stalled (PTs) or stall-to-flit ratios) and a network topology graph, and extracts regions of congestion that can be used for runtime or long-term network congestion characterization.

A diagnosis tool, which determines the cause of congestion (e.g., link failures or excessive file system traffic from applications) by combining system and application execution information with the CR characterizations. This tool leverages outlier-detection algorithms combined with domain-driven knowledge to flag anomalies in the data that can be correlated with the occurrence of CRs.

[Figure 1: Characterization and diagnosis workflow for interconnection networks.]

[Figure 2: Cray Gemini 48-port switch.]

To produce the findings discussed in this paper, we used 5 months of operational data on Blue Waters representing more than 815,006 unique application runs that injected more than 70 PB of data into the network. Our key findings are as follows:

• While it is rare for the system to be globally congested, there is a continuous presence of highly congested regions (CRs) in the network, and they are severe enough to affect application performance. Measurements show that (a) for more than 56% of system uptime, there exists at least one highly congested CR (i.e., a CR with a PTs > 25%), and these CRs have a median size of 32 links and a maximum size of 2,324 links (5.6% of total links); and (b) highly congested regions may persist for more than 23 hours, with a median duration of 9 hours. With respect to impact on applications, we observed 1,000-node production runs of the NAMD [77] application slowing down by as much as […]

[…] in the next measurement window.

• Quick propagation of congestion can be caused by network component failures. Network component failures (e.g., network router failures) that occur in the vicinity of a large-scale application can lead to high network congestion within minutes of the failure event. Measurements show that 88% of directional link failures caused the formation of CRs with an average PTs ≥ 15%.

• Default congestion mitigation mechanisms have limited efficacy. Our measurements show that (a) 29.8% of the 261 triggers of vendor-provided congestion mitigation mechanisms failed to alleviate long-lasting congestion (i.e., congestion driven by continuous oversubscription, as opposed to isolated traffic bursts), as they did not address the root causes of congestion; and (b) vendor-provided mitigation mechanisms were triggered in 8% (261) of the 3,390 high-congestion events identified by our framework. Of these 3,390 events, 25% lasted for more than 30 minutes. This analysis suggests that augmenting the vendor-supplied solution could be an effective way to improve overall congestion management.

In this paper, we highlight the utility of congestion regions in the following ways:

• We showcase the effectiveness of CRs in detecting long-lived congestion. Based on this characterization, we propose that CR detection could be used to trigger congestion mitigation responses that would augment the current vendor-provided mechanisms.

• We illustrate how CRs, in conjunction with network traffic assessment, enable congestion diagnosis. Our diagnosis tool attributes congestion to one of the following causes: (a) system issues (such as the launch/exit of an application), (b) failure issues (such as network link failures), and (c) intra-application issues (such as changes in communication […]

Footnotes:
1 PTs, defined formally in Section 2, approximately represents the intensity of congestion on a link, quantified between 0% and 100%.
2 The first post-K supercomputer is scheduled to be deployed in 2021.
