MPICH-GQ: Quality-Of-Service for Message Passing Programs
Total Page:16
File Type:pdf, Size:1020Kb
MPICH-GQ: Quality-of-Service for Message Passing Programs Alain Roy ∗ Ian Foster ∗† William Gropp † Nicholas Karonis ‡ Volker Sander § Brian Toonen † Abstract One approach to dealing with contention is to explicitly manage the allocation of scarce resources to different pur- Parallel programmers typically assume that all resources poses. If appropriate mechanisms can be provided for ex- required for a program’s execution are dedicated to that pressing application requirements, for arbitrating between purpose. However, in local and wide area networks, con- different requirements, for enforcing allocations, and for tention for shared networks, CPUs, and I/O systems can providing feedback to applications concerning achieved result in significant variations in availability, with con- performance, then applications can, in principle, adapt sequent adverse effects on overall performance. We de- their behavior according to resource availability. How- scribe a new message-passing architecture, MPICH-GQ, ever, while such quality of service (QoS) mechanisms that uses quality of service (QoS) mechanisms to man- have been developed for low-bandwidth, constant bit-rate age contention and hence improve performance of mes- media flows with unreliable protocols [21, 4, 28, 30, 29], sage passing interface (MPI) applications. MPICH-GQ the high-performance, often bursty traffic patterns, com- combines new QoS specification, traffic shaping, QoS plex communication structures, and reliable protocols en- reservation, and QoS implementation techniques to de- countered in high-performance computing applications liver QoS capabilities to the high-bandwidth bursty flows, pose new challenges. Furthermore, the sockets-based ap- complex structures, and reliable protocols used in high- plication programming interfaces (APIs) typically pro- performance applications—characteristics very different vided for managing QoS are not appropriate for scientific from the low-bandwidth, constant bit-rate media flows applications. and unreliable protocols for which QoS mechanisms were We describe a system, MPICH-GQ, that addresses the designed. Results obtained on a differentiated services problems just listed, providing the parallel programmer testbed demonstrate our ability to maintain application with a QoS framework that supports: performance in the face of heavy network contention. • high-bandwidth (tens of Megabits per second: Mb/s) Keywords: MPI, Quality of Service, Differentiated Services, flows TCP • reliable protocols, specifically TCP/IP • a variety of different low-level QoS enforcement 1 Introduction mechanisms The performance achieved by parallel programs is often • both immediate and advance reservation, and co- adversely affected by contention for resources such as net- reservation of CPU, network, and other resources works and I/O systems. In the case of networks, even a needed for end-to-end performance small amount of contention over a critical link can play • a familiar high-level programming model, namely havoc with overall performance, particularly when an ap- the message passing interface (MPI) plication is using TCP/IP as its communication protocol: TCP/IP’s built in contention-avoidance mechanisms can Our prototype MPICH-GQ implementation combines ele- reduce communication rates drastically, potentially idling ments of the MPICH-G2 (formerly MPICH-G) wide area many processors. implementation of MPI [18, 10] and the General-purpose Architecture for Reservation and Allocation (GARA) ∗Department of Computer Science, The University of Chicago, Chicago, IL 60637, U.S.A. QoS framework [14, 15]. Experimental studies with sim- †Mathematics and Computer Science Division, Argonne National ple MPI benchmark studies demonstrate our ability to de- Laboratory, Argonne, IL 60439, U.S.A. liver high performance in the face of network contention. ‡ High-Performance Computing Laboratory, Department of Com- This work is part of a larger project focused on exploit- puter Science, Northern Illinois University, DeKalb, IL 60115, U.S.A. §Central Institute for Applied Mathematics, Forschungszentrum ing MPI as a standards-based API for advanced network Julich¨ GmbH, 52425 Julich,¨ Germany computing. This work has produced new techniques for Published in the Proceedings of the IEEE/ACM SC2000 Conference 0-7803-9802-5/2000/$10.00 (c) 2000 IEEE. November, 2000 constructing topology-aware collective operations [24] as the service marked in the packet. This approach greatly well as a first version of MPICH-G [10], a “Grid-” [12] or simplifies the task of the routers in the interior of the net- network-aware implementation of MPI that uses mecha- work. nisms provided by the Globus toolkit [11] to address se- In addition to classifying and marking the packets, edge curity, startup, remote I/O, and other issues that can hinder devices may have to perform policing and shaping. Polic- distributed execution. The PACX [16], MetaMPI [7], and ing is a mechanism to ensure that senders do not send MAGPIE [25] systems also address some of these issues, too much high-quality traffic; if the sender transmits data but not QoS. too quickly, policing will throw out traffic above a cer- tain rate. Policing is often implemented through a token bucket mechanism. The size of the token bucket controls 2 Quality of Service: A Brief Re- how quickly an application can send data: tokens are grad- view ually added to the token bucket and packets are only sent if there are tokens in the bucket. Policing is important We review briefly the state of the art in network QoS when the network provider wishes to offer not simply dif- mechanisms, focusing on methods used within Internet ferent types of service, but some sort of guarantee on a Protocol (IP)-based packet-switched networks. service: if access to the service is unrestricted, it may be In today’s Internet, the packets constituting an impossible to provide such guarantees. Shaping is impor- application-level flow pass through a series of routers as tant when application traffic is bursty. If these bursts are they travel from their source to their destination. An in- not smoothed to be less bursty, policing may cause pack- dividual router must make decisions with implications for ets to be dropped. As we explain below, shaping can be perceived QoS whenever more than one incoming packet performed either in the router or in the application. can be forwarded at the same time. The router must then The Internet Engineering Task Force (IETF) has de- choose to forward one packet before the other(s), which fined two different types of DS services. These are not are queued and hence delayed; if the number of incoming services in the “end-to-end” sense of the word, but instead packets exceeds the output rate of the router for an ex- are per hop behaviors (PHBs). That is, they define how tended period, the router must also choose which packets packets are treated at each router. One PHB that is of to discard. particular interest to us in the Expedited Forwarding (EF) Two primary approaches to QoS have been proposed behavior which says that all packets in the “expedited” within the IP context. In the Integrated Services (IS) [2] router queue are sent before any other packets are sent. approach, a reservation is made at each router between Clearly, to prevent starvation of nonexpedited flows, the the endpoints of a reservation, usually via the Resource number of expedited packets must be carefully limited. ReserVation Protocol (RSVP) [3]. Each router distin- It is possible to build a premium end-to-end service that guishes each of the reserved flows and provides each flow provides statistical bandwidth guarantees on top of the with performance guarantee, either statistical or strict. EF PHB by doing careful admission control and policing The IS approach has been criticized as being too “heavy,” at the edge routers. Normally, admission control is per- however, because each router is required to recognize and formed not by the router but by an external QoS system, treat each application-level flow separately, which may be usually referred to as a bandwidth broker. In this paper, too much of a burden to place on routers in the core of a we will not discuss admission control and the bandwidth large network such as the Internet. broker in detail, although it is part of the GARA system In the Differentiated Services (DS) [1] approach, as described below. routers that are at the “edge” of a DS network recognize In brief, then, the task of providing application-level packets that should receive better service by classifying QoS maps to that of configuring key parameters (flow the packets based on information in the header, such as rates, bucket sizes) within individual routers. This map- source and destination addresses and ports. For exam- ping is fairly straightforward for the media applications ple, a network provider may choose to give better-quality that have motivated most work on QoS, because of their service to all packets sent in video flows, or may have a simple and regular communication structures. For exam- bandwidth reservation system that allows applications to ple, an audio stream may comprise 1000 bit packets, gen- request specific amounts of bandwidth for particular ap- erated every 64th of a second, for a total bandwidth re- plication flows. Once an edge router classifies a packet as quirement of 64 Kb/s. Therefore, we configure the under- needing better service, it marks that packet in the header lying network to support a flow with premium bandwidth with a particular service. In the interior of the network, of 64 Kb/s and a token bucket depth of 1000 bits. As packets are no longer fully classified as they were by the we discuss in the following, things are more complex for edge routers, but they are treated as an aggregate based on high-performance applications. 2 55000 3 Quality of Service and MPI TCP Flow Reservation 50000 The communication structures associated with MPI appli- cations are often significantly more complex than in me- 45000 dia applications, in three principal respects: 40000 • Communication is often bursty: an application may 35000 compute for a while, then call a communication Bandwidth (Kb/s) function, then compute some more.