
Quantifying the impact of network congestion on application performance and network metrics

Yijia Zhang∗, Taylor Groves†, Brandon Cook†, Nicholas J. Wright† and Ayse K. Coskun∗
∗Boston University, Boston, MA, USA; E-mail: {zhangyj, [email protected]
†National Energy Research Scientific Computing Center, Berkeley, CA, USA; E-mail: {tgroves, bgcook, [email protected]

Abstract—In modern high-performance computing (HPC) systems, network congestion is an important factor that contributes to performance degradation. However, how network congestion impacts application performance is not fully understood. As the Aries network, a recent HPC network architecture featuring a dragonfly topology, is equipped with network counters measuring packet transmission statistics on each router, these network metrics can potentially be utilized to understand network performance. In this work, through experiments on a large HPC system, we quantify the impact of network congestion on the performance of various applications in terms of execution time, and we correlate application performance with network metrics. Our results demonstrate diverse impacts of network congestion: while applications with intensive MPI operations (such as HACC and MILC) suffer from more than 40% extension in their execution times under network congestion, applications with less intensive MPI operations (such as Graph500 and HPCG) are mostly not affected. We also demonstrate that a stall-to-flit ratio metric derived from Aries network counters is positively correlated with performance degradation and, thus, this metric can serve as an indicator of network congestion in HPC systems.

Index Terms—HPC, network congestion, network counters

I. INTRODUCTION

High-performance computing (HPC) systems play an important role in accelerating scientific research in various realms. However, applications running on HPC systems frequently suffer from performance degradation [1]. Network congestion is a major cause of performance degradation in HPC systems [2]–[4], leading to job execution times as much as 6X longer than the optimal [5]. Although performance degradation caused by congestion has been commonly observed, it is not well understood how that impact differs from application to application. Which network metrics could indicate network congestion and performance degradation is also unclear. Understanding the behavior of network metrics and application performance under network congestion on large HPC systems will help in developing strategies to reduce congestion and improve the performance of HPC systems.

In this paper, we conduct experiments on a large HPC system called Cori, which is a 12k-node Cray XC40 system. We run a diverse set of applications while running network congestors simultaneously on nearby nodes. We collect application performance as well as Aries network counter metrics. Our results demonstrate substantial differences in the impact of network congestion on application performance. We also demonstrate that certain Aries network metrics are good indicators of network congestion.

The contributions of this work are as follows:

• In a dragonfly-network system, we quantify the impact of network congestion on the performance of various applications. We find that while applications with intensive MPI operations suffer from more than 40% extension in their execution times under network congestion, applications with less intensive MPI operations are negligibly affected.
• We find that applications are more impacted by a congestor running on nearby nodes that share routers with the application, and less impacted by a congestor on nodes without shared routers. This suggests that a compact job allocation strategy is preferable to a non-compact one, because sharing routers among different jobs is more common under a non-compact allocation strategy.
• We show that a stall-to-flit ratio metric derived from Aries network tile counters is positively correlated with performance degradation and indicative of network congestion.

II. ARIES NETWORK COUNTERS AND METRICS

In this section, we first provide background on the Aries network router. Then, we introduce our network metrics derived from Aries counters. The value of these metrics in revealing network congestion is evaluated in Section IV.

A. Aries network router

Aries is one of the latest HPC network architectures [6]. The Aries network features a dragonfly topology [7], where multiple routers are connected by row/column links to form a virtual high-radix router (called a "group"), and different groups are connected by optical links in an all-to-all manner. This gives the network a low-diameter property, where the shortest path between any two nodes is only a few hops.

Figure 1 shows the 48 tiles of an Aries router in a Cray XC40 system. The blue tiles include the optical links connecting different groups; the green and grey tiles include the electrical links connecting routers within a group; and the yellow tiles include the links to the four nodes connected to this router. In the following, we refer to the 8 yellow tiles as processor tiles (ptiles) and to the other 40 tiles as network tiles (ntiles).

Fig. 1: Aries router architecture in a dragonfly network.

B. Network metrics

In each router, Aries hardware counters collect various types of network transmission statistics [8], including the number of flits/packets travelling on links and the number of stalls, which represent cycles wasted due to network congestion.

TABLE I: Aries network counters used in this work [8].
  Abbreviation     Full counter name
  N_STALL_r_c      AR_RTR_r_c_INQ_PRF_ROWBUS_STALL_CNT
  N_FLIT_r_c_v     AR_RTR_r_c_INQ_PRF_INCOMING_FLIT_VCv
  P_REQ_STALL_n    AR_NL_PRF_REQ_PTILES_TO_NIC_n_STALLED
  P_REQ_FLIT_n     AR_NL_PRF_REQ_PTILES_TO_NIC_n_FLITS
  P_RSP_STALL_n    AR_NL_PRF_RSP_PTILES_TO_NIC_n_STALLED
  P_RSP_FLIT_n     AR_NL_PRF_RSP_PTILES_TO_NIC_n_FLITS

In this work, we use a stall-to-flit ratio metric derived from ntile counters. As the number of network stalls represents the number of cycles wasted in transmitting flits from one router to the buffer of another router, we expect the stall/flit ratio to be an indicator of network congestion. For each router, we define the ntile stall/flit ratio as

\[
\text{Ntile Stall/Flit Ratio} \;=\; \operatorname{Avg}_{r \in 0..4,\; c \in 0..7} \left[ \frac{\mathrm{N\_STALL\_r\_c}}{\sum_{v \in 0..7} \mathrm{N\_FLIT\_r\_c\_v}} \right]
\]

Here, N_FLIT_r_c_v is the number of incoming flits per second to the v-th virtual channel of the network tile in row r, column c, and N_STALL_r_c is the number of stalls per second across all virtual channels on that ntile. As the stalls and flits collected from a specific ntile cannot be attributed to a particular node, we take the average over all 40 ntiles (denoted "Avg") and use it as the ntile stall/flit ratio of the router. Because the 40 ntiles occupy the first five rows and all eight columns in Fig. 1, the average is taken over r ∈ 0..4 and c ∈ 0..7.
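To make the computation concrete, the sketch below evaluates the ntile stall/flit ratio for a single router, assuming the per-second counter rates have already been extracted. The array layout and function name are our own illustration, not part of the Aries or LDMS interfaces.

```python
import numpy as np

N_ROWS, N_COLS, N_VCS = 5, 8, 8  # 40 ntiles (rows 0..4, cols 0..7), 8 virtual channels each


def ntile_stall_flit_ratio(stalls, flits, eps=1e-9):
    """Ntile stall/flit ratio of one router.

    stalls: array of shape (5, 8), stalls per second per ntile (N_STALL_r_c).
    flits:  array of shape (5, 8, 8), incoming flits per second per ntile and
            virtual channel (N_FLIT_r_c_v).
    Returns the per-tile stall/flit ratio averaged over all 40 ntiles.
    """
    flits_per_tile = flits.sum(axis=2)                 # sum over virtual channels v in 0..7
    ratio_per_tile = stalls / np.maximum(flits_per_tile, eps)
    return ratio_per_tile.mean()                       # average over r in 0..4, c in 0..7


# Hypothetical usage with random rates, only to show the expected shapes.
rng = np.random.default_rng(0)
stalls = rng.uniform(0, 1e6, size=(N_ROWS, N_COLS))
flits = rng.uniform(0, 1e5, size=(N_ROWS, N_COLS, N_VCS))
print(ntile_stall_flit_ratio(stalls, flits))
```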
In comparison to the ntile counters, we also analyze ptile flits per second, collected by P_REQ_FLIT_n and P_RSP_FLIT_n, which are the request and response flits received by a node, respectively. In this paper, we always take the sum of these two metrics when we refer to ptile flits per second. Similarly, we refer to the sum of P_REQ_STALL_n and P_RSP_STALL_n as the ptile stalls per second. In these metrics, n ∈ 0..3 corresponds to the four nodes connected to the router; thus, ptile counters specify the contribution from a particular node. The full names of the counters we use are listed in Table I.

The original counters record values cumulatively, so we take a rolling difference to estimate instantaneous values. In addition, when we calculate the stall/flit ratio, we ignore samples whose stalls per second fall below a threshold.
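The preprocessing described above could look roughly like the following sketch, assuming the cumulative counter samples are loaded into a pandas DataFrame whose columns follow the abbreviations in Table I. The column names, helper functions, and threshold value are illustrative assumptions, not values prescribed by the paper.

```python
import pandas as pd

STALL_THRESHOLD = 1e3  # illustrative stalls/s cutoff; the paper does not specify a value


def counter_rates(cumulative: pd.DataFrame) -> pd.DataFrame:
    """Convert cumulative counter samples (collected at 1 Hz) into per-second
    rates by taking the difference between consecutive samples."""
    return cumulative.diff().dropna()


def ptile_totals(rates: pd.DataFrame, n: int) -> pd.DataFrame:
    """Per-node ptile flits/s and stalls/s: request plus response counters
    for the n-th node (n in 0..3) attached to the router."""
    return pd.DataFrame({
        "ptile_flits": rates[f"P_REQ_FLIT_{n}"] + rates[f"P_RSP_FLIT_{n}"],
        "ptile_stalls": rates[f"P_REQ_STALL_{n}"] + rates[f"P_RSP_STALL_{n}"],
    })


def filter_low_stall_samples(rates: pd.DataFrame, stall_col: str) -> pd.DataFrame:
    """Drop samples whose stalls/s fall below the threshold before the
    stall/flit ratio is computed."""
    return rates[rates[stall_col] >= STALL_THRESHOLD]
```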
III. EXPERIMENTAL METHODOLOGY

We conduct experiments on Cori, a 12k-node Cray XC40 system located at Lawrence Berkeley National Laboratory, USA. On Cori, network counter data are collected and managed by the Lightweight Distributed Metric Service (LDMS) tool [9]. LDMS has been running continuously on Cori for years, collecting counter data for every node at a rate of one sample per second.

To characterize job execution performance, we experiment with the following real-world and benchmark applications:

• Graph500. We run breadth-first search (BFS) and single-source shortest path (SSSP) from Graph500, which are representative graph computation kernels [10].
• HACC. The Hardware Accelerated Cosmology Code framework uses gravitational N-body techniques to simulate the formation of structure in an expanding universe [11].
• HPCG. The High Performance Conjugate Gradient benchmark models the computational and data access patterns of real-world applications that contain operations such as sparse matrix-vector multiplication [12].
• LAMMPS. The Large-scale Atomic/Molecular Massively Parallel Simulator is a classical molecular dynamics simulator for modeling solid-state materials and soft matter [13].
• MILC. The MIMD Lattice Computation performs quantum chromodynamics simulations. Our experiments use the su3_rmd application from MILC [14].
• miniAMR. This mini-application applies adaptive mesh refinement on an Eulerian mesh [15].
• miniMD. This molecular dynamics mini-application is developed for testing new designs on HPC systems [15].
• QMCPACK. This is a many-body ab initio quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids [16].

To create network congestion on HPC systems in a controlled way, we use the Global Performance and Congestion Network Tests (GPCNeT) [17], a recent tool for injecting network congestion and benchmarking communication performance. When launched on a group of nodes, GPCNeT runs congestor kernels on 80% of the nodes, while the other 20% run canary kernels that measure communication performance under the induced congestion.
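For clarity, the "extension" of execution time reported in this paper is the relative slowdown of a congested run with respect to an uncongested baseline. The snippet below only illustrates that arithmetic with made-up numbers, not measured results.

```python
def execution_time_extension(t_congested: float, t_baseline: float) -> float:
    """Relative extension of execution time under congestion, in percent."""
    return (t_congested / t_baseline - 1.0) * 100.0


# Hypothetical example: a 100 s baseline run that takes 145 s under congestion
# corresponds to a 45% extension of its execution time.
print(execution_time_extension(145.0, 100.0))  # 45.0
```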