Adaptive Timeout Discovery using the Network Weather Service∗

Matthew S. Allen and Rich Wolski
Computer Science Department
University of California, Santa Barbara

James S. Plank
Computer Science Department
University of Tennessee, Knoxville

Abstract

In this paper, we present a novel methodology for improving the performance and dependability of application-level messaging in Grid systems. Based on the Network Weather Service, our system uses non-parametric statistical forecasts of request-response times to automatically determine message timeouts. By choosing a timeout based on predicted network performance, the methodology improves application and Grid service performance, as extraneous and overly-long timeouts are avoided. We describe the technique, the additional execution and programming overhead it introduces, and demonstrate its effectiveness using a wide-area application.

∗This work was supported, in part, by NSF grants EIA-9975015, ACI-9876895, and CAREER award 0093166.

1 Introduction

The challenge for the Computational Grid is to extract reliable, consistent performance from a fluctuating resource pool, and to deliver that performance to user applications. Dynamic schedulers [12, 14, 2] have been able to "steer" resource usage automatically so that the application can take the best advantage of the performance that is available. To do so, many rely on short-term, statistical forecasts of resource performance generated by the Network Weather Service (NWS) [17, 16], a Grid performance data management and forecasting service.

While application-level schedulers work well in Grid contexts, to date they have had to rely on intrinsic "knowledge" of the application being scheduled. That is, they are customized software components that exploit application-specific characteristics to mitigate the negative performance effects of Grid dynamism. Tools such as the AppLeS Parameter Sweep Template [6, 5] generalize this approach to incorporate information that is common to a class of applications. Little work, however, has focused on the use of dynamic techniques to optimize the performance of the Grid services that application schedulers and Grid users require.

One such Grid service that is critical to application performance is communication. Grid applications must be designed to run on a wide-area network, which poses a difficult performance problem. Communication times in such an environment fluctuate dramatically as routers become overloaded and drop packets, networks partition, and congestion control algorithms restrict communication. Many conditions can cause a transmission in progress to take much longer than expected. One way to deal with this problem is to time out communications that exceed some expected elapsed time. Choosing this value, however, is often little more than an educated guess.

When network performance is fluctuating, timeout determination can dramatically affect both application and service performance. Timeouts that are premature cause either needless failures or extra messaging overhead as the message is retried. Overly long timeouts result in degraded messaging performance as communication is delayed unnecessarily. Our methodology attempts to "learn" the best timeout value to use based on previously observed communication between host pairs.

By combining predicted message send-acknowledge times with NWS forecasting error measures, our system sets timeouts adaptively, in response to fluctuating network performance.

In this paper we present a general technique for improving communication times in the wide area. Our system "predicts" communication times using non-parametric statistical forecasting techniques. These predictions are then used to determine an appropriate application-level timeout for each send. By canceling and reissuing sends that exceed this time limit, we are able to achieve improvements in communication. This methodology's benefits are realized in a general and application-independent way.

We used a sample application to present our results. For this application, we chose a global data reduction across multiple hosts in the Grid Application Development Software (GrADS) [1] testbed. GrADS is a Grid software research project studying the problem of application development and deployment. From some of its early results [12], it is clear that the performance of the Grid services themselves is a critical component of Grid application performance. If the services respond slowly, the application using them will also suffer a performance degradation. Our choice of test application in this work is motivated by two observations. First, collective operations [7, 9, 8] are a performance-critical component of many distributed scientific and engineering applications. Second, Grid users and schedulers must frequently collect up-to-date information from widely distributed sites, often to pick out the "best" choice from a number of possibilities. For example, application schedulers often wish to determine the most lightly loaded resource(s) from a specified resource pool when making scheduling decisions.

While it may be possible to schedule the collective operations in a scientific application so that they do not span wide-area network connections, it is not generally possible (or desirable) to centralize Grid services in the same way.

As such, this paper makes two significant contributions. It will:

• describe an application-independent, high-performance, and simple methodology for adaptive timeout discovery using NWS forecasting tools, and

• demonstrate the effectiveness of this approach in a "live" Grid setting.

The rest of this paper is organized as follows. In the next section, we discuss the network conditions we hope to improve on with this technique. In Section 3 we describe the adaptive timeout discovery methodology. Section 4 describes the specific set of experiments we conducted. In Section 5 we discuss the results of our investigation, and Section 6 draws our final conclusions.
2 Network Behavior

The motivation for this research is the observation that there is a high degree of variance in communication times, especially in wide-area networks. Send times have been observed to loosely follow either a heavy-tailed or Poisson distribution [10]. Although there is some debate over which more accurately describes actual network traffic, either case indicates that there can be extremely long send times. Furthermore, these long sends are not necessarily clustered together, but often occur among more reasonably behaved communications [13]. Thus, while a send across a given link will usually take only a couple of seconds, it may occasionally take significantly longer.

Most of these delays stem from the behavior of routers at one of the hops in the transmission. For example, a temporarily overloaded router may drop packets or accidentally direct them to the wrong destination. This is particularly a problem when a TCP transmission collides with a poorly behaved UDP stream that fills up a routing queue [3]. Although overloaded routers are likely to provide consistently poor performance, other routers before them in the communication path may detect this congestion and redirect messages to another path.

Another significant problem is that routers occasionally change their routing tables because of policies or configurations, and not because they have detected a congested network. Sometimes this happens mid-transmission, causing packets to arrive out of order and causing other delays in the communication [11]. This is particularly evident in a phenomenon known as "fluttering," in which a router rapidly switches between different paths to direct packets. Where a few packets may follow a 12-hop route, the next few may follow one with 25 hops before the router returns to its original route. In the worst cases, this change can occur with every other packet [11].

It is also possible for the TCP/IP congestion control algorithms to delay a single transmission that arrives at an inappropriate time. In the face of fluctuating network traffic, these algorithms can behave chaotically and unnecessarily throttle individual transmissions [15]. This happens not as a result of an overall increase in traffic through the network, but instead as an anomaly in the algorithm. These anomalies are quickly resolved, and subsequent sends will be unaffected until the algorithm becomes unstable again.

Figure 1: Topology of the reduction (the hosts span the cs.ucsb.edu, cs.utk.edu, cs.uiuc.edu, ucsd.edu, cs.indiana.edu, wellesley.edu, eecs.harvard.edu, ecs.csun.edu, cs.vu.nl, and ncni.net domains)

3 Adaptive Timeout Discovery

Our strategy for computing dynamic timeouts uses the non-parametric techniques of the NWS forecasters. For each application-level link, the forecaster maintains a unique timeout prediction determined using measurements gathered throughout the experiment's execution. Because the forecaster is constantly updated with measurements of the link's performance, it is able to produce accurate and current predictions of the link's expected speed.

The performance of each connection is monitored constantly, and the results are used to generate timeout predictions for that link. When a connection is established, an NWS forecaster is created to monitor it. It is initialized with a generous prediction of the send time; in our case, 45 seconds. Each send across the link is followed by an immediate application-level acknowledgement. The time between the beginning of the send and the acknowledgement, the response time, is fed to that link's forecaster along with the current time.

NWS forecasting modules use a non-parametric, time-series-based technique to produce predictions based on these measurements [17, 16]. We use the prediction of the response time for the next message and the mean square forecasting error (MSE) to compute a timeout for the link. We choose this to be the prediction plus some multiple of the square root of the mean square error, using the formula:

    timeout = forecast + x * sqrt(MSE)

Although this sounds like a complex process, the NWS forecasters are implemented to be as lightweight as possible. On a 666 MHz Pentium III, the average update takes less than a tenth of a millisecond. Computing the timeout takes little more time than a few memory accesses and two floating point operations. Thus, the overhead for using this method is minimal.
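To make the bookkeeping concrete, the following sketch maintains a per-link timeout in the way this section describes. It is a minimal illustration only: a sliding-window mean stands in for the NWS predictor bank (the real forecasters select among several non-parametric predictors [17, 16]), and all identifiers (forecaster, forecaster_update(), WINDOW) are our own, not the NWS API.

```c
/* Sketch of per-link adaptive timeout discovery.  A sliding-window
 * mean stands in for the NWS forecaster; all names are illustrative. */
#include <math.h>
#include <string.h>

#define WINDOW 32                 /* arbitrary history length */

typedef struct {
    double samples[WINDOW];       /* recent response times (seconds) */
    int    count;                 /* total samples seen so far */
    double mse;                   /* running mean square forecast error */
    double forecast;              /* prediction for the next send */
} forecaster;

static void forecaster_init(forecaster *f, double initial_prediction) {
    memset(f, 0, sizeof(*f));
    f->forecast = initial_prediction;    /* e.g., a generous 45 seconds */
}

/* Feed one measured send-acknowledge (response) time to the forecaster. */
static void forecaster_update(forecaster *f, double response_time) {
    double err = response_time - f->forecast;
    f->samples[f->count % WINDOW] = response_time;
    f->count++;
    f->mse += (err * err - f->mse) / f->count;  /* running mean of err^2 */

    int n = f->count < WINDOW ? f->count : WINDOW;
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += f->samples[i];
    f->forecast = sum / n;        /* windowed mean as the next prediction */
}

/* timeout = forecast + x * sqrt(MSE) */
static double forecaster_timeout(const forecaster *f, double x) {
    return f->forecast + x * sqrt(f->mse);
}
```

With x = 2, a send is retried only when its response time exceeds the forecast by roughly two standard deviations of the forecasting error; x = 1 gives the more aggressive strategy described in Section 4.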

4 Experiment

We chose to implement a global reduction to investigate the potential impact of various timeout strategies on distributed applications. Our reduction is structured as a binary tree consisting of 31 nodes, as depicted in figure 1. Each node in the tree generates a set of n random values at the beginning of the reduction. At the end of the reduction, the hosts have computed the n maximum values of those generated by all the hosts.

The reduction starts when all leaf nodes are simultaneously sent a signal to begin. These nodes send their data through the tree to the root. Each node computes the maximum n values of its own data and that of its two children before passing the result along. Once the data reaches the root node, it sends the maximum values back down through the tree to the leaves. When the data has propagated back to the leaf nodes, all nodes in the tree contain the correct maximum values.

This model is particularly appropriate for our experiment. For one thing, the performance of collective operations often dominates overall performance in distributed applications [8, 7, 9]. For this reason, parallel programming libraries like MPI contain standard reduction primitives with optimized implementations to aid efficient programming [4]. These implementations typically take the form of a tree. Also, because the completion time of the reduction depends on many levels of network communication, the effects of our different timeout strategies are readily apparent.

To generate results appropriate for a wide-area network, we use a set of hosts that is distributed across a large number of networks. We arrange these hosts so that few of the links in the tree are between hosts in the same domain. This is obviously a contrived situation, and no optimal distribution would cross domain boundaries in this way. However, we intend for this experiment to be representative of large Computational Grid programs, where it is likely that hosts from many domains will be employed. The host distribution we have chosen represents the wider scattering we assume will be necessary for larger codes. The distribution of the hosts to which we had access for our experiments can be seen in figure 1.

Every host in the tree maintains a timeout value associated with each link to another host. Each send call in the reduction is followed by an explicit, application-level acknowledgement. If the send does not complete or the acknowledgement is not returned before the timeout expires, the send is canceled. In the event of a cancellation, the host stops the read or write, calls close() on the socket, makes a new connect() call, and resends the data. It retries the send in this manner a finite number of times, and if this number is exceeded, the reduction is abandoned.
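Continuing the sketch above, this cancel-and-retry protocol might look roughly as follows. The open_link() and now_seconds() helpers, the single acknowledgement byte, and MAX_RETRIES are illustrative assumptions, not the paper's actual code; forecaster_update() and forecaster_timeout() come from the earlier sketch.

```c
/* Sketch of the timed send / acknowledge / retry protocol. */
#include <poll.h>
#include <unistd.h>

#define MAX_RETRIES 5

int    open_link(const char *host, int port); /* assumed: socket()+connect() */
double now_seconds(void);                     /* assumed: wall-clock time */

/* Wait up to timeout_ms for the peer's application-level ACK byte. */
static int wait_for_ack(int sock, int timeout_ms) {
    struct pollfd pfd = { .fd = sock, .events = POLLIN };
    char ack;
    if (poll(&pfd, 1, timeout_ms) != 1)
        return -1;                            /* timed out (or error) */
    return read(sock, &ack, 1) == 1 ? 0 : -1;
}

/* Send buf, then wait for the ACK; on timeout, close the socket,
 * reconnect, and resend.  A full implementation would also loop on
 * partial writes. */
static int reliable_send(const char *host, int port,
                         const void *buf, size_t len, forecaster *f) {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        int timeout_ms = (int)(forecaster_timeout(f, 2.0) * 1000.0);
        int sock = open_link(host, port);
        double start = now_seconds();
        if (sock >= 0 &&
            write(sock, buf, len) == (ssize_t)len &&
            wait_for_ack(sock, timeout_ms) == 0) {
            forecaster_update(f, now_seconds() - start);
            close(sock);
            return 0;                         /* acknowledged in time */
        }
        if (sock >= 0)
            close(sock);                      /* cancel; retry on new socket */
    }
    return -1;                                /* give up; abandon reduction */
}
```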
The experiment is run using several different strategies for setting the timeout values, broadly categorized as either static or dynamic. When a static timeout is used, each host uses the same timeout value for all of its links. This situation corresponds to the current implementations of distributed applications as we know them. Our experiment runs with static timeouts of two, five, fifteen, and forty-five seconds to capture a spectrum of static values. In the wide area, where round-trip latencies can be large, a two-second timeout is an eager approach. On the other hand, forty-five seconds is a very conservative estimate. We believe that any performance-oriented application would likely wish to have a retry initiated if this time limit is exceeded.

When dynamic timeouts are used, each host keeps a distinct timeout value for each of its links. These values are determined automatically throughout the experiment's execution, as described in the previous section. Because the square root of the mean square error is the standard deviation of the distribution, we chose to have one set of dynamic experiments use two factors of this value. Although we do not believe that the distribution of send times follows a normal distribution, we wanted to test the effect of timing out only the longest of the sends. We also tested one factor of the standard deviation as a more aggressive dynamic strategy.

Over the application's lifetime, the NWS adaptive timeout changes as the network performance changes. Our experimental results compare the effect of using various static timeouts to that of the NWS dynamic timeout discovery technique on overall application performance.

5 Results

We ran the experiment described in the previous section over a week-long period in late January of 2002. Run times were distributed throughout the day so that the results would span potential daily cycles of network congestion. The experiment was run in 10 sets of 100 reductions each, using the static and dynamic timeout values we described above. Nodes attempted to resend data 5 times before giving up on the reduction altogether. The calculations were done on buffers of 16K integers, for a send size of 64KB. For each reduction, we calculated the reduction time to be the maximum, over all nodes, of the elapsed time between a node sending its data to its parent and receiving the final reduction.

Figure 2 compares statistics gathered over all reductions as a function of timeout method. For each method, we show the maximum, the 95th percentile, the average, and the minimum of all reduction times. We note that only a couple of timeouts actually occurred during the 45-second timeout run. As such, the network connections and hosts involved rarely "failed" during the experiment, and the typical TCP/IP reliability mechanisms were able to make progress. In particular, we used the TCP KEEP_ALIVE mechanisms with their default settings, which will cause a timeout and a reset of the socket after some long, configuration-specific time period. In the experimental data, the KEEP_ALIVE timers never caused a reset. The effect we illustrate, then, is a performance effect in the sense that a timeout and retry of a connection changes the performance characteristics (and not the reliability) of the application. This observation surprised us, since the users of these systems (primarily members of the GrADS [1] project) often report intermittent network failures. Nonetheless, the 45-second timeout measurements represent the "non-failure" case with default TCP KEEP_ALIVE timeouts in use.

Table 1: Characteristics of reduction times (in seconds)

                        45 Second   15 Second   Dynamic
    Mean                    8.317       8.056     7.495
    Max                    51.796      51.187    29.230
    Min                     4.447       5.129     5.303
    95th Percentile        13.942      12.642    10.794
    Variance                7.888       5.261     2.722
    Standard Deviation      2.809       2.297     1.650

Figure 2: Statistics of reduction times (maximum, 95th percentile, average, and minimum completion times for the static 45-, 15-, 5-, and 2-second timeouts and the dynamic pred + 2*SD and pred + 1*SD timeouts)

Figure 3: Distribution of reduction times (percent completed vs. completion time for the dynamic, 2-second, 15-second, and 45-second strategies)

Furthermore, we can see from figure 2 and the other static timeout results that the 15-second value shows the best performance improvement in all metrics but the maximum time. We have not yet looked at the distribution of maximums for this data, so we cannot report on the significance of this particular difference. Since the other metrics are the best among the statically determined timeout values, we will consider the 15-second timeout the most performance-effective of the static methods.

Lastly, we can see that the dynamic prediction using two standard deviations is clearly the best of the dynamic timeout methods, and represents the best predicted timeout.

5.1 Performance Improvements

The most obvious benefit of overriding the default TCP/IP timeout mechanisms is a decrease in the average completion times. As we can see from table 1, the average reduction time improves slightly with even the generous 15-second timeout, and dynamic timeout prediction causes a more substantial improvement. These levels of improvement suggest that the average performance with no timeouts is affected by outlier sends that take well above the average amount of time.

However, this improvement in average completion time is not overwhelming, and is not the most significant performance improvement this strategy offers. It is significant that dynamic timeouts yield an impressively lower variance in completion time than static ones. Not only is the variance lower (as seen in table 1), but the 95th-percentile and maximum values are substantially lower. Perhaps not surprisingly, then, dynamic timeout discovery based on NWS predictions yields more reliable and predictable performance than static methods.

From figure 3 we can see the distribution of completion times for the reductions completing in less than 15 seconds. Comparing the NWS dynamic timeouts with the 15-second and 45-second static timeouts, it is clear that a greater fraction of the overall execution times are shorter using the NWS method than either static method.

Table 2: Timeout characteristics

                  Percent Retried   Avg. # of Retries
    45 Second           0.050%              1.000
    15 Second           0.312%              1.380
    5 Second            2.513%              1.542
    2 Second           45.538%              2.638
    Mean + 2SD          9.056%              1.176
    Mean + 1SD         20.851%              1.241

Figure 4: Extended distribution of reduction times (percent completed vs. completion time, out to 40 seconds)

The 2-second value, however, exhibits an interesting modality. A greater fraction of the execution times are shorter for 2-second timeouts, considering only those executions that take less than 8 seconds. However, a substantial fraction of execution times for the 2-second case are longer than 8 seconds, yielding a lower aggregate performance. Figure 4 shows the tail end of this distribution. Note the large mode in the 2-second data at approximately 16 seconds. Retrying too quickly can pay benefits, but it also induces an additional mode that is significant enough to reduce the overall performance benefit. The NWS dynamic timeout, however, generates the smoothest and narrowest range of performance values (note that the NWS curve is highest, indicating the largest fraction of consistently similar values). Clearly, the NWS-predicted timeout values yield the highest overall performance and the most predictable execution time.

5.2 Adaptability

The other significant advantage of our adaptive timeout mechanism is its ability to learn network conditions. Using this non-parametric time series analysis, we are able to choose appropriate timeout values for each connection maintained by a given node. Table 2 shows us that although dynamic timeouts may retry connections more often than static versions, they usually only retry once. This means that the dynamic method quickly discovers an appropriate timeout value for the link.

Figure 5: Send times and timeout predictions from liz.eecs.harvard.edu to torc7.cs.utk.edu

We can see this effect more clearly from figure 5, which shows the actual send times and the timeouts based on NWS network performance forecasts for a run of 100 reductions across one link. Notice from this graph that early executions cause a lot of instability in the NWS forecaster as it "learns" the performance response of an individual network link. However, once a reasonable amount of data has been gathered, the forecaster is not as susceptible to delayed sends. From this point on, it only retries those sends that occur at obvious spikes in the execution-time curve.

Another advantage of our dynamic timeout mechanism is the ability to automatically respond to the different network characteristics of each link. Figure 6 presents the send times and NWS forecast-based timeouts for two of the links used by a node at dsi.ncni.net. The performance of these two links differs significantly enough that a timeout value appropriate for the faster link would cause an unnecessary number of timeouts on the slower one. To accomplish this with static timeouts, we would have to manually select a timeout value appropriate to each link in each node. This is virtually impossible to do by hand, but is simple when handled automatically.

Figure 6: Send times and timeout predictions from dsi.ncni.net to cmajor.cs.uiuc.edu and torc2.cs.utk.edu

One final strength of using dynamic timeouts is the ability to respond to actual changes in network conditions. Although individual anomalies will not cause the forecaster to become unstable, changes in network conditions are detected fairly quickly. In figure 7 we can see the effects of a change in network behavior on the predicted timeout values. By noticing such changes and choosing an appropriate new timeout value, we retain the advantages of hand-tuned timeouts without the danger of hurting application performance by timing out too soon.

Figure 7: Send times and timeout predictions from dsi.cs.utk.edu to charcoal.cs.ucsb.edu

6 Conclusion

In a widely distributed environment like the Computational Grid, network performance is an important aspect of overall application performance. This is especially true in the wide area, where network behavior can fluctuate heavily over an application's run time. Grid services must be aware of this problem and respond, both to improve communication speed and to make send times more dependable.

In this paper, we present a simple and lightweight method to accomplish this in a generic way. We contrived a situation to test its ability to improve performance on a network with the characteristics of a Computational Grid. Our application took the form of a collective operation, a common and expensive service in distributed applications. We were able both to improve the average completion times of the reductions and to control the variance in the send times, so that reductions finished in a more predictable time quantum. Furthermore, the overhead required to do this was minor, and did not offset the improvements we realized.

Of course, there is much work left to be done on this topic. One area that warrants further examination is the effect of this methodology on different send sizes. It is likely we will not achieve any performance improvement when sending sufficiently large or small amounts of data. There is likely also a send size that takes the greatest advantage of the adaptive timeouts. If this is the case, the possibility that breaking large sends into smaller send-acknowledgement pairs will increase performance should be investigated.

Furthermore, the algorithm that we used to determine timeouts was chosen primarily for its simplicity and speed. In the course of the experiment, we realized that over fast connections this formula produced timeouts so low that sends timed out frequently, hurting performance. To prevent this problem, we manually chose a static minimum timeout value. This formula worked very well for the experiment, but it was fine-tuned by hand based on problems we observed. This suggests that there may be an algorithm that can predict more appropriate timeouts in general, and possibly produce a more pronounced improvement.
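In terms of the earlier forecaster sketch, such a guard is a simple clamp; the floor value below is illustrative, not the one used in the experiments.

```c
/* Clamp the predicted timeout to a manually chosen floor so that fast
 * links do not produce timeouts low enough to fire on ordinary sends.
 * MIN_TIMEOUT is an illustrative value, not the paper's. */
#define MIN_TIMEOUT 2.0   /* seconds */

static double clamped_timeout(const forecaster *f, double x) {
    double t = forecaster_timeout(f, x);
    return t < MIN_TIMEOUT ? MIN_TIMEOUT : t;
}
```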

Finally, it would be good to see how this methodology affects the performance of different distributed applications. We chose to test this with a binary reduction because its heavy dependence on successive communications made it easy to test our technique. However, this operation is not indicative of all of the operations common to applications in this domain. Whether other operations could benefit from adaptive timeouts needs to be more thoroughly explored.

References

[1] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, D. Reed, L. Torczon, and R. Wolski. The GrADS project: Software support for high-level grid application development. Technical Report Rice COMPTR00-355, Rice University, February 2000.

[2] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application level scheduling on distributed heterogeneous networks. In Proceedings of Supercomputing 1996, 1996.

[3] S. Floyd and K. R. Fall. Promoting the use of end-to-end congestion control in the Internet. IEEE/ACM Transactions on Networking, 7(4):458–472, 1999.

[4] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report CS-94-230, University of Tennessee, Knoxville, 1994.

[5] H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman. Heuristics for scheduling parameter sweep applications in grid environments. In Proceedings of the 9th Heterogeneous Computing Workshop, pages 349–363, 2000.

[6] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS parameter sweep template: User-level middleware for the grid. In Proceedings of Supercomputing 2000, 2000.

[7] T. Kielmann, H. Bal, S. Gorlatch, K. Verstoep, and R. Hofman. Network performance-aware collective communication for clustered wide area systems. Parallel Computing, 27:1431–1456, 2001.

[8] T. Kielmann, R. Hofman, H. Bal, A. Plaat, and R. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99), pages 131–140, 1999.

[9] B. Lowekamp and A. Beguelin. ECO: Efficient collective operations for communication on heterogeneous networks. In 10th International Parallel Processing Symposium, 1996.

[10] V. Paxson and S. Floyd. Why we don't know how to simulate the Internet, 1997.

[11] V. Paxson. End-to-end routing behavior in the Internet. IEEE/ACM Transactions on Networking, 5(5):601–615, 1997.

[12] A. Petitet, S. Blackford, J. Dongarra, B. Ellis, G. Fagg, K. Roche, and S. Vadhiyar. Numerical libraries and the grid. In Proceedings of SC01, November 2001.

[13] J. Plank, R. Wolski, and M. Allen. The effect of timeout prediction and selection on wide area collective operations. In The IEEE International Symposium on Network Computing and Applications, February 2002.

[14] N. Spring and R. Wolski. Application level scheduling: Gene sequence library comparison. In Proceedings of the ACM International Conference on Supercomputing 1998, July 1998.

[15] A. Veres and M. Boda. The chaotic nature of TCP congestion control. In INFOCOM (3), pages 1715–1723, 2000.

[16] R. Wolski. Dynamically forecasting network performance using the network weather service. Cluster Computing, 1:119–132, 1998. Also available from http://www.cs.ucsb.edu/~rich/publications/nws-.ps.gz.

[17] R. Wolski, N. Spring, and J. Hayes. The network weather service: A distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5-6):757–768, October 1999. http://www.cs.ucsb.edu/~rich/publications/nws-arch.ps.gz.
