What Lies Beneath: Understanding Internet Congestion

Leiwen Deng and Aleksandar Kuzmanovic
Northwestern University
(l-deng,akuzma)@northwestern.edu

Bruce Davie
Cisco Systems
[email protected]

ABSTRACT

Developing measurement tools that can concurrently monitor congested Internet links at a large scale would significantly help us understand how the Internet operates. While congestion at the Internet edge typically arises due to bottlenecks at a connection's last mile, congestion in the core can be more complex. This is because it may depend upon internal network policies, and it can hence reveal systematic problems such as routing pathologies, poorly-engineered network policies, or non-cooperative inter-AS relationships. Therefore, enabling tools to provide deeper insights about congestion in the core is certainly beneficial.

In this paper, we present the design and implementation of a large-scale triggered monitoring system that focuses on monitoring a subset of Internet core links that exhibit relatively strong and persistent congestion, i.e., hot spots. The system exploits triggered mechanisms to address scalability, and it automates the selection of good vantage points to counter the common measurement experience that the much more congested Internet edges often overshadow the observation of congestion in the core. Using the system, we characterize the properties of concurrently monitored hot spots. Contrary to common belief, we find that congestion events at these hot spots can be highly correlated, and that such correlated congestion events can span up to three neighboring ASes. We provide a root-cause analysis to explain this phenomenon and discuss the implications of our findings.

1. INTRODUCTION

The common wisdom is that very little to no congestion occurs in the Internet core (e.g., at Tier-1 or Tier-2 providers). Given that ISPs aggressively overprovision the capacity of their pipes, and that end-to-end data transfers are typically constrained at Internet edges, the "lightly-congested core" hypothesis makes a lot of sense. Indeed, measurements conducted within Tier-1 providers such as Sprint report almost no packet losses [27]; likewise, ISPs like Verio advertise SLAs that guarantee negligible packet-loss rates [9]. As a result, researchers have focused on characterizing congestion at access networks such as DSL and cable [24].

While numerous other measurement studies do confirm that the network edge is more congested than the core (e.g., [23]), understanding the properties of congestion events residing in the Internet core is meaningful for (at least) the following two reasons.

First, despite negligible packet losses in the core, queuing delay, which appears whenever the arrival rate at a router is larger than its service rate, may be non-negligible. Variable queuing delay leads to jitter, which can hurt the performance of delay-based congestion control algorithms (e.g., [15,18]) or of real-time applications such as VoIP. And whereas it is difficult to route around congestion in access-network congestion scenarios [24] unless a client is multihomed [13], congested links in the core can be effectively avoided in many cases [12,14]. Hence, identifying such congested locations¹, and characterizing their properties in space (how "big" they are) and time (the time scales at which they occur and repeat), is valuable for latency-sensitive applications such as VoIP [8].

Second, the Internet is a complex system composed of thousands of independently administered ASes. Events happening at one point in the network can propagate over inter-AS borders and have repercussions on other parts of the network (e.g., [37]). Consequently, measurements from each independent AS (e.g., [27]) are inherently limited: even when a congestion location is identified, establishing potential dependencies among events happening at different ASes, or revealing the underlying mechanisms responsible for propagating such events, is simply infeasible. To achieve such goals, it is essential to have global views, i.e., to concurrently monitor Internet congestion locations at a large scale across the Internet.

This paper makes two primary contributions. First, we present the design of a large-scale triggered monitoring system, the goal of which is to quantify, locate, track, correlate, and analyze congestion events happening in the Internet core. The system focuses on a subset of core links that exhibit relatively strong and persistent congestion, i.e., hot spots. Second, we use a PlanetLab [6] implementation of our system to provide, to the best of our knowledge, the first-of-its-kind measurement study that characterizes the properties of concurrently monitored hot spots across the world.

At a high level, our methodology is as follows. Using about 200 PlanetLab vantage points, we initially "jumpstart" measurements by collecting the underlying IP-level topology. Once sufficient topology information is revealed, we start light end-to-end probing from the vantage points. Whenever a path experiences excessive queuing delay over longer time scales, we accurately locate the congested link and designate it a hot spot. (Exactly what we consider "excessive" queuing and a "longer" time scale is defined in Section 2.)

¹ We define a congested location more precisely in Section 2.

In addition, we trigger large-scale coordinated measurements to explore the entire area around that hot spot, both topologically close to it and distant from it. Our system covers close to 30,000 Internet core links using over 35,000 distinct end-to-end paths. It is capable of monitoring up to 8,000 links concurrently, 350 of which can be inspected in depth using a tool we recently developed [23].

Designing a system capable of performing coordinated measurements at such a large scale is itself a challenging systems engineering problem. Some of the questions we must address are as follows. How to effectively collect the underlying topology and track routing changes? How to handle clock skew, routing alterations, and other anomalies? How to effectively balance the measurement load across vantage points? How to minimize the amount of measurement resources while still covering a large area? How to select vantage points that can achieve high measurement quality with high probability? In this paper, we provide a "from scratch" system design and answer all of these questions.

Our system has important implications for emerging protocols and systems, and these open avenues for future research. For instance, given the very light overhead on each node in our system (30 kbps on average), it could easily be ported into emerging VoIP systems to help avoid hot spots in the network. Further, our monitoring system should be viewed as a rudiment of a global Internet "debugging" system that we plan to develop. Such a system would not only track "anomalies" (i.e., hot spots or outages), but would also automate root-cause analysis (e.g., automatically detect whether a new policy implemented by an AS causes "trouble" for its neighbors). Finally, others will be able not just to leverage our data sets, but to actively use our system to gather their own data and perform independent analyses.

Our key findings are as follows. (i) Congestion events on some core links can be highly correlated. Such correlation can span up to three neighboring ASes. (ii) There are a small number of hot spots between or within large backbone networks that exhibit highly intensive, time-independent congestion. (iii) Both of these phenomena are closely related to the AS-level traffic aggregation effect, i.e., upstream traffic converging onto an aggregation point that is thin relative to its upstream traffic volume.

The implications of our findings are the following. First, the common wisdom is that the diversity of traffic and links makes large and long-lasting spatial link congestion dependence unlikely in real networks such as the Internet [20]. In this paper, we show that this is not the case: not only can the correlation between congestion events be substantial, it can also cover a large area. Still, most (if not all) network tomography models assume link congestion independence, e.g., [19,20,25,32,36,43]. This is because deriving these models without such an assumption is hard or often infeasible. Our research suggests that each such model should be carefully evaluated to understand how link-level correlations affect its accuracy. Finally, our results about the typical correlation "spreading" provide guidelines for overlay re-routing systems. To avoid an entire hot spot (not just a part of it), it makes sense to choose overlay paths that are at least 3 ASes disjoint from the one experiencing congestion, if such paths are available.

2. PRELIMINARIES

2.1 Congestion Events

Congestion has several manifestations. The most severe one induces packet losses. We detect congestion in its early stage, as indicated by increased queuing delays at links. From a microscopic view, a continuous queue building-up and draining period typically lasts from several ms to hundreds of ms [42]. On time scales of seconds to minutes, queue build-up can happen repeatedly, as shown in Figure 1. In this paper, we focus on measuring a subset of congested links in the core that exhibit repetitive congestion, as we define formally below.

[Figure 1: Congestion events on a link monitored by Pong. The Y axis shows congestion intensity, updated every 30 seconds; the X axis shows time (hour:minute) from 11:00 to 12:30.]

Definition of a congestion event. We define a congestion event over each 30-second-long epoch. If we measure high queuing delays (Section 2.2.3) on a link during a 30-second-long epoch, we regard it as a congestion event on this link. Consequently, such a link becomes a point of congestion. Figure 1 exemplifies congestion events on a link detected by our measurement tool (Section 2.2).

Congestion intensity. When we detect a congestion event, we quantify its congestion intensity as the percentage of probing samples that observe high queuing delays on the link during the 30-second-long epoch. Intuitively, the congestion intensity captures how frequently congestion occurs during a 30-second-long epoch. As an example, Figure 1 shows several congestion events on a link over time. The congestion event that appears at 11:30 has the highest intensity (22%) in this time interval. The congestion event that starts at around 11:50 has a smaller intensity; yet, it lasts longer, for approximately half an hour.

Hot spot. In this work, we exclusively focus on longer-lasting congestion events that show non-negligible congestion intensities. More precisely, we only consider events that have a congestion intensity larger than 5% and last longer than one minute. The corresponding link (the point of congestion) is then designated a hot spot.
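To make these definitions concrete, here is a small sketch (our illustration, not the authors' code): it derives congestion events, intensities, and a hot-spot check from timestamped queuing-delay samples on one link, with the delay threshold standing in for the per-path thresholds of Section 2.2.3.

```python
# Illustrative sketch of the Section 2.1 definitions (not Pong/T-Pong code).
from collections import defaultdict

EPOCH = 30.0           # seconds per epoch
MIN_INTENSITY = 0.05   # hot spots: intensity > 5% ...
MIN_EPOCHS = 3         # ... lasting > 1 minute, i.e., more than two 30 s epochs

def congestion_events(samples, high_delay_threshold):
    """samples: (timestamp_sec, queuing_delay_sec) pairs for one link.
    Returns {epoch_index: intensity}: the intensity of an epoch is the
    fraction of its probing samples that observe high queuing delay."""
    per_epoch = defaultdict(list)
    for ts, qdelay in samples:
        per_epoch[int(ts // EPOCH)].append(qdelay)
    return {epoch: sum(d > high_delay_threshold for d in delays) / len(delays)
            for epoch, delays in per_epoch.items()
            if any(d > high_delay_threshold for d in delays)}

def is_hot_spot(events):
    """One reading of the hot-spot rule: intensity above 5% in at least
    three consecutive epochs (i.e., strictly longer than one minute)."""
    strong = sorted(e for e, i in events.items() if i > MIN_INTENSITY)
    run = best = 1 if strong else 0
    for a, b in zip(strong, strong[1:]):
        run = run + 1 if b - a == 1 else 1
        best = max(best, run)
    return best >= MIN_EPOCHS
```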

2.2 Pong Background

Here, we introduce Pong, a measurement tool we recently developed [23], which is one of the main building blocks of our triggered congestion monitoring system. Below, we provide the necessary background and emphasize those elements of Pong (e.g., the link measurability score) that are essential for the vantage-point selection algorithms that we explain later in the text.

2.2.1 Pong Overview

Pong is a light-weight monitoring tool with a per-path overhead of less than 18 kbps. It can accurately locate points of congestion on a path with high resolution, at the granularity of a single link² in most cases, as we demonstrated via Emulab [3] and Internet experiments in [23]. It exploits active probing from both endpoints of the path and integrates the end-to-end probing with probing to intermediate routers in a novel way.

The key step in Pong's measurement methodology is to repeatedly divide a given unidirectional path into two halves, and to determine whether either half is congested (i.e., experiencing high queuing delay). That is, Pong seeks to determine whether the unidirectional path from the source to a given router is congested, and whether the unidirectional path from the router to the destination is congested. The challenges are as follows: (i) How to decouple congestion on the forward path from congestion on the backward path? (ii) How to set proper thresholds for queuing delays above which we consider a path segment congested? (iii) How to handle interference from clock skews, route alterations, asymmetric routing, ICMP processing delays at routers, timing issues at PlanetLab nodes, and other unknown anomalies? We answer these three questions in Sections 2.2.2, 2.2.3, and 2.2.5, respectively.

Once we determine whether either half-path of each router along a path is congested, we correlate the results of neighboring routers to identify congestion on the link in between. We exploit the repetitive congestion properties of the concerned links to improve the identification accuracy.

2.2.2 Coordinated Probing

Our active probing methodology integrates end-to-end probing with probing to intermediate routers, as shown in Figure 2. For each intermediate router, we send up to four probing packets: (i) two end-to-end probes, one from the source (f probe) and one from the destination (b probe); and (ii) two router-targeted probes, one from the source (s probe) and one from the destination (d probe). We coordinate the sending times of the four probes using negative feedback methods such that the probes travel shared path segments nearly simultaneously. Each probe measures the queuing delay³ of the path it travels and infers the congestion status of that path.

[Figure 2: Example of coordinated probing. The source S and the destination D probe around an intermediate router, correlating the probe to the upper router with the probe to the lower router. In this scenario, the s probe and the b probe both capture congestion on link 1, while the f probe and the d probe both capture congestion on link 2; this complementary pattern makes the 4-p probing approach applicable.]

We have four different probing techniques, which we call 4-p probing, fsd probing, fsb probing, and 2-p probing, respectively. Figure 2 shows an example of a 4-p probing scenario. The 4-p probing and fsd probing (involving the f, s, and d probes) approaches give the highest measurement accuracy; fsb probing (involving the f, s, and b probes) also gives high accuracy, while 2-p probing (involving the f and s probes) gives relatively low accuracy.

Figure 3 shows an experimental example on the Emulab testbed, in which we monitor congestion on a path consisting of 11 links using Pong. Initially (the left plot), we create congestion on links 1, 6, and 9 in the forward direction. Then, in case 1 (the middle plot), we add congestion on links 2 and 5 in the backward direction; and in case 2 (the right plot), we add congestion on links 5 and 11 in the forward direction. Congestion on all links is generated in a periodic on-and-off (TCP cross traffic) pattern, but with a different period for each link. Thus, we emulate concurrent but independent congestion on multiple links of a path. All three plots are congestion intensity snapshots over a 30-second-long time interval. As we can see, Pong monitors multiple points of congestion on the unidirectional path with high accuracy: it successfully denoises the interference from the two backward congested links in case 1 and detects all five forward congested links in case 2.

[Figure 3: An Emulab experiment example. Three congestion-intensity snapshots across the 11 links of the path: the initial setting, case 1 (after adding backward congestion on links 2 and 5), and case 2 (after adding forward congestion on links 5 and 11).]

² Here, "link" means an IP-level link. In addition, if some routers on the path do not respond to TTL-limited probes, a "link" is interpreted as the path segment between the two closest routers that do respond to TTL-limited probes.

³ We use a standard method to estimate queuing delays, i.e., subtracting the minimum value among a set of samples from each sample.
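Footnote 3's minimum-filter estimate is simple enough to state in code. A minimal sketch of that standard step, not an excerpt from Pong:

```python
def queuing_delays(raw_delays):
    """Estimate queuing delay per sample by subtracting the minimum of
    the sample set (footnote 3): the minimum approximates the fixed
    propagation and transmission component, so what remains above it is
    attributed to queuing."""
    base = min(raw_delays)
    return [d - base for d in raw_delays]

# Example with round-trip delays in ms: the 20.0 ms floor is removed.
print(queuing_delays([20.0, 23.5, 20.2, 31.7]))  # [0.0, 3.5, 0.2, 11.7]
```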

2.2.3 Setting Queuing Delay Thresholds

Setting the threshold at which we declare queuing delay to be "high" is important for measurement accuracy, but it is a non-trivial task for several reasons. First, queuing delay scales vary depending on link capacities and, to a lesser extent, buffer sizes. Second, queue build-up can take place on multiple links along an end-to-end path concurrently. The queuing-delay distribution of a path can therefore be very complex⁴. Some paths are more "clean", i.e., it is easy to decouple the signal (congestion) from the noise (no congestion) subspaces by setting the threshold to a moderate level, e.g., several hundred microseconds. On other paths, the end-to-end queuing delay distribution may be noisier, and the threshold must therefore be set to a higher value, e.g., several milliseconds. The higher the threshold, the smaller the probability of detecting moderate queuing, and hence the lower the threshold quality (which is taken into account when we quantify measurement quality, Section 2.2.6).

We set a threshold based on the distribution of queuing delay samples. Since we are measuring repetitive congestion, the queuing delay values for congested moments show clustering effects in the distribution, as we demonstrate below.

We divide the queuing delay range of 0.15 ms ∼ 302 ms into 30 bins, with the bin sizes being logarithmically uniform (except the first and the last bins).⁵ Using these bin settings, we categorize the queuing delay distributions we observed via a large number of Internet experiments into several patterns. For each pattern, we design a suitable threshold-setting algorithm. Figure 4 shows two examples of a dominant pattern we observed. One example is the end-to-end queuing delay distribution of the 11-link path in the aforementioned Emulab experiment; the other is that of an Internet path which also consists of 11 links. As we can see, the distribution on the Emulab path shows a clear two-mode pattern: one mode corresponds to samples observing no congestion, the other to samples observing congestion. Necessarily, the distribution on the Internet path is more blurred, but we can still observe two relatively strong clusters. In both scenarios, we set the threshold to a proper value between the two clusters.

[Figure 4: Example queuing-delay distributions (percent of samples per bin number): (a) an Emulab path and (b) an Internet path. The chosen thresholds (1.59 ms and 2.68 ms) fall between the two clusters.]

⁴ In particular, due to the heterogeneity of the links that constitute a path.

⁵ We set the bin sizes logarithmically uniform because the larger the average queuing delay, the larger the queuing delay deviation becomes.
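The binning itself is easy to reproduce. The sketch below is our illustration of the 30 log-uniform bins over 0.15 ms to 302 ms, with a toy valley-between-the-two-largest-modes rule standing in for the paper's per-pattern threshold algorithms, which are not specified in detail here:

```python
import math

LOW_MS, HIGH_MS, NBINS = 0.15, 302.0, 30

def bin_edges():
    """31 edges for 30 bins over 0.15-302 ms, logarithmically uniform
    (the paper treats the first and last bins specially; we do not)."""
    step = math.log(HIGH_MS / LOW_MS) / NBINS
    return [LOW_MS * math.exp(i * step) for i in range(NBINS + 1)]

def histogram(qdelays_ms):
    edges, counts = bin_edges(), [0] * NBINS
    for d in qdelays_ms:
        for i in range(NBINS):
            if edges[i] <= d < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def threshold_between_clusters(qdelays_ms):
    """Toy stand-in for the per-pattern threshold algorithms: put the
    threshold at the emptiest bin strictly between the two most
    populated bins (the two clusters of Figure 4)."""
    counts, edges = histogram(qdelays_ms), bin_edges()
    a, b = sorted(sorted(range(NBINS), key=counts.__getitem__)[-2:])
    if b - a < 2:
        return None  # modes are adjacent: no valley to place a threshold in
    valley = min(range(a + 1, b), key=counts.__getitem__)
    return edges[valley]
```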
2.2.4 Quality of Measurability (QoM)

Pong has a novel ability: it can accurately detect moments of its own inaccuracy by exploiting a measure called quality of measurability (QoM). This is particularly meaningful for our PlanetLab-based experiments, which are subject to timing challenges caused by issues such as CPU bursts or bandwidth rate limiting. Because we use coordinated probing from two endpoints, we are capable of checking whether the observations from these two endpoints are consistent. For example, if both endpoints measure increased queuing delay on two specific round-trip paths that share the same congested link, the observation is consistent, and hence the QoM is high. In scenarios where the two endpoints do not have a consistent observation (which can happen for a number of reasons, e.g., ICMP queuing, clock skew, and the aforementioned timing issues at PlanetLab nodes), we are capable of detecting all such moments and assigning them a low QoM. A low QoM does not mean that a given congestion event did not happen, but simply that we were unable to accurately measure that event from both endpoints.

2.2.5 Handling Anomalies

Pong can effectively detect clock skew events between endpoints and route alterations on a path. It suppresses the generation of measurement results once it detects such anomalies. Other anomalies, such as the queuing of ICMP messages at routers, CPU bursts at PlanetLab nodes, and even some unknown anomalies (that affect measured queuing delays), can easily be detected via instantaneous QoM values. Once such anomalies happen, the QoM shows a very small instantaneous value. We use the instantaneous QoM as a weight when updating final results, thereby minimizing the measurement errors caused by anomalies.

2.2.6 Link Measurability Score (LMS)

Although Pong makes a best effort to measure the congestion location on a path, its measurement accuracy can differ considerably across paths due to the underlying path conditions. However, a link, especially a link in the core, can be observed by many paths, which can give very different measurement accuracies for congestion on that link. We can therefore optimize measurement performance by selecting the paths that give the best measurement accuracy. To quantify the measurement accuracy, we introduce a measure called the link measurability score (LMS). The LMS indicates how well we are able to measure a specific link from a specific path. Different paths can have very different LMSes for the same link. As we will describe in Section 3.5.2, our triggered monitoring system always selects the path that gives the best LMS when measuring a specific link.

An LMS is computed from the following three components: (i) Node score, which takes into account the corresponding QoM achieved. For each link, the node scores of both nodes of the link are considered. This measure dominantly affects the LMS. (ii) Queuing threshold score, which takes into account the queuing delay threshold quality (Section 2.2.3) for a given path. Queuing delay threshold settings are more accurate for some queuing delay distribution patterns than for others, leading to different queuing threshold scores. (iii) Observability score. Our measurement experience shows that congestion observed on a less frequently congested link can be blurred by a much more frequently congested link on the same path. We therefore assign a small observability score to such less frequently congested links when we have recently detected a much more frequently congested link on the same path.

The value of the LMS ranges from 0 to 6 in our implementation. We describe the implications of the major LMS levels in Section 4.2.2.
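The paper gives the three LMS components and the 0 to 6 range, but not the combining formula; the sketch below is therefore only a hypothetical shape, with invented weights, to show the kind of computation involved:

```python
def link_measurability_score(qom_node_a, qom_node_b,
                             threshold_score, observability_score):
    """Hypothetical LMS combination (illustrative only; not the paper's
    formula). All inputs are taken as values in [0, 1]. The node score
    uses the QoM of both nodes of the link and dominates the result, per
    Section 2.2.6; the weights 0.6/0.25/0.15 are invented for this sketch."""
    node_score = min(qom_node_a, qom_node_b)  # both endpoints must measure well
    score = 6.0 * (0.6 * node_score
                   + 0.25 * threshold_score
                   + 0.15 * observability_score)
    return max(0.0, min(6.0, score))

# A link measured with high QoM at both nodes, a good threshold, and an
# unobstructed view scores near the top of the 0-6 range:
print(round(link_measurability_score(0.9, 0.8, 0.7, 1.0), 2))  # 4.83
```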
3. A TRIGGERED MONITORING SYSTEM

In this section, we present T-Pong, a large-scale congestion monitoring system. To accurately measure points of congestion in the Internet core, and to allocate measurement resources effectively and scalably, T-Pong deploys an on-demand triggered monitoring approach.

3.1 Design Goals

We first summarize our design goals:

Good coverage. The system should have good coverage of links. By "good" we mean: (i) High coverage, low overhead. The system should be able to monitor a large percentage of network links concurrently while keeping its traffic overhead low. (ii) Uniform coverage. Measuring paths should cover congested links in an unbiased way. The system should reduce the probability that multiple paths measure the same congested link.

High measurement quality. The system should allocate the paths that provide the best measurement quality when measuring congested links.

Online processing. To support triggering mechanisms, the system should be capable of processing raw data online. It should provide: (i) Fast algorithms. To operate in real time, algorithms must be fast enough to keep up with the data input rate. (ii) Extensible triggering logic. Triggering logic may be adjusted frequently for research purposes; the system should therefore provide a convenient interface for updating it.

Integrity and robustness. To be capable of running long-term monitoring tasks, the system must guarantee the integrity of critical data and should be able to quickly recover from errors.

3.2 System Structure

Figure 5 illustrates the structure of T-Pong. It consists of one control node and hundreds of measuring nodes. The control node performs centralized control of the entire system. It collects measurement reports from the measuring nodes and adjusts the monitoring deployment based on a set of vantage-point selection algorithms. The measuring nodes monitor network congestion from vantage points all over the world. Each of them runs two measurement processes: TMon and Pong. TMon performs end-to-end congestion monitoring, while Pong performs fine-grained link congestion measurement upon triggering. The control node communicates with the measuring nodes via UDP-based communication channels.

[Figure 5: T-Pong system structure. The control node (main program, MySQL database, and an add-on Perl script) communicates over UDP channels with measuring nodes, each of which runs a TMon process and a Pong process; TMon paths and Pong paths connect pairs of measuring nodes.]

Control Node. In our implementation, we use a Pentium 4 PC for the control node. It runs Fedora Core 5 Linux. Its application software includes the main programs written in C++, a MySQL database that stores measurement data, and an optional Perl script that allows customized triggering logic. Fault tolerance of the control node is a concern for the implementation, which we plan to address using a failover mechanism to a standby control node.

Measuring Nodes. The measuring nodes are PlanetLab nodes running the TMon and Pong programs, both of which are written in C++. TMon keeps track of path routes and monitors end-to-end congestion events. Pong measures the link-level congestion of hot spots upon triggering.

3.3 Measurement Procedure Overview

In this section, we introduce the major steps of T-Pong's measurement procedure. We summarize the techniques associated with these steps in Table 1. (i) During the system bootstrap phase, the control node collects route information from all measuring nodes and updates its topology database⁶ accordingly (Section 3.4). After that, the system keeps track of route changes and path reachability, and incrementally updates the topology database. (ii) Once the topology database is available, the control node starts a greedy algorithm (Section 3.5.1) to select a subset of paths (called TMon paths) that can cover a relatively large percentage of links. The TMon processes on the measuring nodes associated with these TMon paths then start fast-rate probing. The fast-rate probing monitors end-to-end congestion for each TMon path. (iii) Once end-to-end congestion on a TMon path is detected, the system uses its default triggering logic (a priority-based algorithm, Section 3.5.2) to decide whether it should start Pong on that path. It might also decide to start Pong on other paths based on customized triggering logic (Section 3.5.3). The measuring nodes associated with such paths then use Pong (the paths are then called Pong paths) to locate and monitor congestion on these paths at the granularity of a single link.

⁶ The topology database stores the whole set of routes. It does not need to resolve extensive topology information, such as unique nodes and unique links via IP alias approaches. In our measurement and analysis, we only need a rough estimate of the topology.

All paths: no selection (full mesh); low-rate probing (Section 3.6.2) once every 5 minutes; objective: track topology and path reachability.

TMon paths (a subset of all paths): greedy TMon path selection (Section 3.5.1); fast-rate probing (Section 3.6.3) at 5 probes/sec; objective: monitor end-to-end congestion.

Pong paths (a subset of TMon paths, upon triggering): priority-based Pong path allocation (Section 3.5.2); coordinated probing (Section 2.2.2) at 10 probes/sec for end-to-end probes and 2 probes/sec for router-targeted probes; objective: locate and monitor link-level congestion.

Table 1: A summary of T-Pong’s major measurement techniques

The system keeps track of congestion events as well as anomalies such as route changes, clock skews, and path failures. It responds to such events by (i) processing congestion events and analyzing measurement accuracy online, (ii) updating the topology database, (iii) suppressing measurement on paths affected by anomalies, and (iv) rearranging the deployment of TMon and Pong paths. In this way, the system gradually transitions into a stable state, in which it collects measurement reports, processes them, applies the triggering logic, and adjusts the measurement deployment.

3.4 Initializing Topology Database

When the system initially starts up, all measuring nodes send topology messages to the control node (after which they send incremental reports upon route changes). The control node therefore experiences a burst of topology messages during bootstrap.

As we will discuss in Section 3.5.1, updating topology information in the database is not a cheap operation, since we also have to update additional parameters used for selecting TMon paths, which incurs costly database queries. Database updates would therefore be unable to keep up with the arrival rate of topology messages. To solve this, we only perform simple pre-processing operations on each topology message in real time, while buffering all database updates. In addition, we attach a version number and an 8-byte path digest value (a fingerprint of the route) to each topology message. This helps the control node easily filter outdated or duplicate updates, thereby avoiding unnecessary database updates.

3.5 Path Selection Algorithms

3.5.1 Greedy TMon Path Selection

The control node uses a greedy algorithm to select TMon paths. The goal of this algorithm is to select only a small fraction of paths while covering a relatively large percentage of links⁷.

Algorithm. This selection algorithm runs online in an iterative way. During each iteration, it selects a new TMon path that covers the most remaining links. A remaining link is a link that is currently covered by fewer than two TMon paths.

In addition, the new TMon path must satisfy the following conditions: (i) Each of the two endpoints of this path must not participate in more than 20 TMon paths⁸. This restricts the peak probing traffic at each measuring node and hence leads to low overhead. (ii) The path must be active. A path becomes inactive when it is unreachable or experiences considerable route alterations. (iii) The path must not be suppressed. A path is suppressed from being a TMon path for a short period of time (15 minutes) when anomalies (e.g., clock skews) are detected, and for a long period of time (6 hours) when it shows poor measurement quality while running Pong.

The iterative selection stops when no links currently satisfy the conditions. The algorithm then transitions into a sleeping state for 5 minutes, after which it attempts the iterative selection again.

Making it online-capable. Enabling this algorithm to run online is a non-trivial task, since each iteration involves complex logic. If implemented improperly, the algorithm could incur formidable database query loads. To address this, we pre-compute the per-link, per-path, and per-node parameters used in this algorithm during other operations. For example, when we update the topology information of a path, we additionally compute the number of paths and the number of TMon paths that cover each link of the path. Although pre-computing adds some complexity to other operations, our implementation shows it to be a good tradeoff.

⁷ In [16], the authors show that one can see a large fraction of the Internet (in particular, the "switching core") from its edges when properly deploying a relatively small-scale end-to-end measurement infrastructure.

⁸ Restricting nodes to 20 TMon paths is a good tradeoff between link coverage and measurement overhead.
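In essence, this is weighted set cover with side conditions. A condensed in-memory sketch of the selection loop (ours; the real system evaluates these conditions against its MySQL state using the pre-computed counters, and is_active and is_suppressed are assumed callbacks):

```python
def select_tmon_paths(paths, links_of, is_active, is_suppressed,
                      max_paths_per_node=20, min_cover=2):
    """Greedy TMon path selection (Section 3.5.1): repeatedly add the
    eligible path that covers the most remaining links, where a remaining
    link is one covered by fewer than two TMon paths so far.
    paths: list of (src, dst) tuples; links_of: path -> set of link ids."""
    cover = {link: 0 for p in paths for link in links_of[p]}
    load = {}        # node -> number of TMon paths it participates in
    chosen = []

    def gain(p):     # number of remaining links this path would cover
        return sum(1 for link in links_of[p] if cover[link] < min_cover)

    while True:
        candidates = [p for p in paths if p not in chosen
                      and is_active(p) and not is_suppressed(p)
                      and load.get(p[0], 0) < max_paths_per_node
                      and load.get(p[1], 0) < max_paths_per_node
                      and gain(p) > 0]
        if not candidates:
            return chosen   # the real system now sleeps 5 min, then retries
        best = max(candidates, key=gain)
        chosen.append(best)
        for link in links_of[best]:
            cover[link] += 1
        for node in best:
            load[node] = load.get(node, 0) + 1
```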
3.5.2 Priority-based Pong Path Allocation

The control node allocates Pong paths using a priority-based algorithm. In order to minimize the interference among Pong paths, we restrict each measuring node to participating in at most one Pong path. Given this constraint, the priority-based algorithm selects the paths that provide the best measurement quality and link coverage when measuring points of congestion. In addition, it effectively reduces the chance that multiple Pong paths measure the same congested link.

Potential path score and path score. Two important measures used in this algorithm are the potential path score and the path score. Both quantify the measurement contribution a path can achieve when running Pong. The difference is that the path score is a runtime measure which also takes into account the runtime contributions of other Pong paths, while the potential path score does not.

The potential path score is the maximum link measurability score (LMS, Section 2.2.6) over all non-edge⁹ links on a path. It indicates the best measurement quality a path can achieve. The path score is computed similarly. The only difference is that, when computing the maximum LMS, the path score excludes those links currently covered by other Pong paths with a higher LMS (i.e., the path score only considers links on which this path provides the highest LMS among all currently running Pong paths; if there are no such links, its path score is zero, even though the path may have a high potential path score). The path score indicates the runtime measurement contribution of a path. When computing the potential path score and the path score, we use the LMSes of recent history in our database.

Algorithm. To allocate Pong paths, we follow these rules in sequence: (i) Give priority to a path with a higher path score. When computing a path score, use the LMS history of the most recent 24 hours for a candidate path, and the LMS history of the most recent hour for a currently running Pong path that conflicts with the candidate path. (ii) Give priority to a path that shows higher congestion intensity, unless it is filtered by the triggering logic. (iii) Do not override a Pong path that has been measuring for only a short period of time (< 30 minutes). (iv) Suppress paths showing small potential path scores (using the LMS history of the most recent 24 hours). This priority-based algorithm allows measurement resources to be effectively allocated to track hot spots.

⁹ We classify edge links and non-edge links in terms of the normalized hop number (the hop number of a link divided by the total number of hops on the path). A non-edge link is a link whose normalized hop number lies between 0.2 and 0.8.
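Under the definitions just given, together with footnote 9's edge filter, the two scores reduce to a few lines. This is our sketch, in which lms_of and best_running_lms are assumed lookup functions rather than T-Pong APIs:

```python
def is_non_edge(hop_index, total_hops):
    """Footnote 9: a non-edge link has a normalized hop number (hop index
    divided by the total hops on the path) between 0.2 and 0.8."""
    return 0.2 <= hop_index / total_hops <= 0.8

def non_edge_links(path_links):
    n = len(path_links)
    return [l for i, l in enumerate(path_links, 1) if is_non_edge(i, n)]

def potential_path_score(path_links, lms_of):
    """Best measurement quality the path can achieve: the maximum LMS
    over its non-edge links (lms_of uses recent-history LMS values)."""
    return max((lms_of(l) for l in non_edge_links(path_links)), default=0.0)

def path_score(path_links, lms_of, best_running_lms):
    """Runtime contribution: like the potential score, but a link only
    counts if this path beats every currently running Pong path on it
    (best_running_lms(link) is the highest LMS any of them achieves)."""
    return max((lms_of(l) for l in non_edge_links(path_links)
                if lms_of(l) > best_running_lms(l)), default=0.0)

# Toy usage on an 11-link path where this path measures link6 best:
links = [f"link{i}" for i in range(1, 12)]
lms_of = lambda l: {"link6": 3.0, "link3": 1.0}.get(l, 0.0)
print(potential_path_score(links, lms_of))           # 3.0
print(path_score(links, lms_of, lambda l: 2.0))      # 3.0 (link6 wins)
print(path_score(links, lms_of, lambda l: 4.0))      # 0.0 (all covered better)
```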
3.5.3 Programmable Triggering Logic

The system allows us to customize our triggering strategies on the fly through an add-on Perl script. The interface between the main program and the Perl script is the database. The main program records each congestion event in the database. The Perl script, which runs in a loop, periodically checks for new events, selects the concerned ones, and performs its triggering logic. Based on that, it issues measurement commands and enqueues them in the database. The main program checks the database for enqueued commands every minute and changes the measurement deployment according to these commands. This feature provides great flexibility for congestion measurement experiments using the T-Pong system: we can easily change the network area we focus on and the network-wide congestion pattern of interest, and easily test and verify our research hypotheses.

3.6 TMon's Active Probing

3.6.1 3-packet Probing

Both TMon's low-rate and fast-rate probing use 3-packet probing, which consists of three probing packets. The 3-packet probing measures: (i) path reachability, (ii) round-trip delay, (iii) the one-way delays of both the forward and backward paths, and (iv) the number of hops on the forward path. Figure 6 illustrates the procedure. Note that the measured one-way delays include the clock skew between the measuring nodes, but the skew cancels out when computing round-trip delays or queuing delays.

[Figure 6: Three-packet probing procedure between a source measuring node and a destination measuring node: (1) the source initiates 3-packet probing by sending packet 1; (2) the source captures send_time(packet 1); (3) the destination captures recv_time(packet 1); (4) the destination computes the path hop number based on the TTL field in packet 1; (5) the destination sends packet 2 to report recv_time(packet 1) and the path hop number; (6) the destination captures send_time(packet 2); (7) the destination sends packet 3 to report send_time(packet 2); (8) the source captures recv_time(packet 2); (9) the source computes the round-trip delay and the one-way delays of the forward and backward paths:

one-way delay of the forward path = recv_time(packet 1) − send_time(packet 1)
one-way delay of the backward path = recv_time(packet 2) − send_time(packet 2)
round-trip delay = forward one-way delay + backward one-way delay]

3.6.2 Low-rate Probing

A node sends low-rate probes to all other nodes every five minutes. Each low-rate probe consists of a 3-packet probing and an optional traceroute. The 3-packet probing is used to explore the reachability and hop number of the path. For each reachable path, the traceroute is performed. Since we know the path hop number in advance, our version of traceroute runs efficiently: it sends TTL-limited probes to all hops along the path concurrently. Low-rate probes to different nodes are paced evenly over each 5-minute period to smooth the traffic and to avoid interference between traceroutes. For paths that are temporarily unreachable, we back off the probing rate to as little as 1/16 of the normal rate (i.e., one probe every 80 minutes) to reduce unnecessary overhead.

3.6.3 Fast-rate Probing

A node sends fast-rate probes (5 Hz) on each TMon path sourced at it. Each fast-rate probe is a 3-packet probing. To measure congestion, the source node keeps track of round-trip and one-way delays. Based on these, it computes the corresponding queuing delays and infers congestion. Fast-rate probing backs off to as little as 1/64 of its normal rate when a path becomes unreachable.
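The delay arithmetic of Figure 6 in code form (our sketch; timestamps are in seconds, and, as noted above, each one-way value embeds the unknown clock skew between the nodes, which cancels in the round trip):

```python
def three_packet_delays(send1_src, recv1_dst, send2_dst, recv2_src):
    """Step 9 of Figure 6. recv1_dst and send2_dst are stamped by the
    destination's clock, so each one-way delay includes the (unknown)
    clock skew with opposite signs; the skew cancels in the sum."""
    fwd = recv1_dst - send1_src      # forward one-way delay + skew
    bwd = recv2_src - send2_dst      # backward one-way delay - skew
    return fwd, bwd, fwd + bwd       # round-trip delay is skew-free

# A destination clock 5 s ahead inflates fwd and deflates bwd equally;
# the true delays here are 0.04 s forward and 0.05 s backward:
fwd, bwd, rtt = three_packet_delays(100.00, 105.04, 105.05, 100.10)
print(round(fwd, 2), round(bwd, 2), round(rtt, 2))  # 5.04 -4.95 0.09
```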

4. EVALUATION AND MEASUREMENTS

4.1 Experimental Setup

We conducted six PlanetLab-based experiments using our congestion monitoring system over a two-month period. Each experiment lasted for 4 ∼ 7 days, so altogether we collected one month's worth of measurement data. We intentionally collected data in different time intervals in order to show that our results are time-invariant. Indeed, all of our findings hold for all the distinct traces we took. We used a total of 191 PlanetLab nodes in our experiments. Table 2 summarizes these nodes, classified by continent.

Continent                       Number of nodes
North America, South America    110
Europe                          63
Asia, Australia                 18

Table 2: PlanetLab nodes used in our experiments

4.2 Evaluation

4.2.1 Coverage and Overhead

Here, we provide statistics about the network coverage, i.e., how many end-to-end paths and internal links our system has covered. We also quantify the measurement overhead imposed by our system.

Total paths and links. We observe about 36,000 paths (N², where N = 191 PlanetLab nodes), which expose about 12,100 distinct links at any one time. In addition, due to routing changes, we are able to observe about 29,000 distinct links in total.

Coverage and overhead of TMon paths. Our measurement log shows that 1,500 ∼ 2,000 paths run TMon concurrently. These paths cover about 7,600 ∼ 8,000 distinct links concurrently. Compared with the total numbers of paths and links, this shows that we use 4.9% of the total paths to cover 65% of the total links. The discrepancy between the percentage of paths and that of links is not a surprise, since this effect has been observed previously (e.g., [16,21]).

Each PlanetLab node we use participates in 9.2 TMon paths on average, which corresponds to an average per-node traffic overhead of 30.9 kbps; the peak per-node traffic overhead is 67.2 kbps. This is because we restrict each node to participating in at most 20 TMon paths (Section 3.5.1). Our computations show that we could still improve the coverage of links considerably by relaxing this restriction, e.g., by allowing each node to participate in up to 40 TMon paths. However, this would come at the cost of higher overhead.
Coverage of Pong paths. Our measurement log shows that 25 ∼ 30 paths run Pong concurrently. This means that 50 ∼ 60 PlanetLab nodes run Pong at any moment. These paths cover 250 ∼ 350 distinct links concurrently.

4.2.2 Measurement Quality

One important principle of our design is to select paths (or vantage points) that provide the best measurement quality on link congestion. Here, we evaluate the measurement quality achieved in our experiments. We use the link measurability score (LMS, Section 2.2.6) annotated with each link congestion event as the evaluation measure. As previously described, the LMS takes values between 0 and 6. We summarize the implications of the major LMS levels as follows:

LMS=0 means that both nodes of the link are using the 2-p probing technique. We have observed end-to-end congestion on the path at some moment, but Pong is unable to accurately locate this congestion; the conclusion that the reported link is congested may not be reliable.

LMS=1 usually means that at least one endpoint of the link is using the 4-p or fsd probing technique, but the quality of measurability (QoM) is not high. This is the lowest level at which a link congestion event report is considered acceptable.

LMS=2 usually means that either (i) one endpoint of the link is using 4-p or fsd and achieves a high QoM, or (ii) both endpoints are using 4-p or fsd and have moderate QoMs. It indicates moderate measurement quality.

LMS=3 usually means that both endpoints are using 4-p or fsd and both achieve good QoMs. It indicates good measurement quality.

LMS≥4 indicates very good measurement quality.

[Figure 7: CDF of the link measurability score, over the range 1 to 6, with the levels annotated as acceptable, fair, good, and excellent.]

We use the measurement data of a one-week-long experiment to exemplify our results. During this week, we collected a total of 54,991 link congestion events; 35,728 (65%) of them have a non-zero LMS. Figure 7 shows the cumulative distribution function (CDF) of the LMS for these 35,728 events. As we can see, (i) 95% of these link congestion events have acceptable measurement quality (LMS≥1), (ii) 75% have better-than-fair measurement quality (LMS≥2), (iii) 60% have good measurement quality (LMS≥3), and (iv) 35% have excellent measurement quality (LMS≥4).

4.3 Link-level vs. End-to-end Congestion

In this section, we study the properties of end-to-end congestion events (observed by our TMon system) and of link-level congestion events (observed by triggered Pong-based measurements), and we emphasize the fundamental differences between the two. Throughout the section, we leverage the fact that traffic loads exhibit a time-of-day pattern. As a result, network congestion also exhibits a similar pattern, which our system successfully captures.

4.3.1 End-to-end Congestion Properties

Here, we evaluate the properties of the end-to-end congestion captured by the TMon paths. These results are a basis for understanding the properties of the link-level congestion that we describe in the next section. We use a 4.5-day-long measurement data set to exemplify our results. Figure 8(a) shows the end-to-end congestion events of all paths.

To quantify the congestion scale, we show the following three measures: (i) the number of congestion events, (ii) the sum of the congestion intensities of all congestion events, and (iii) the number of distinct TMon paths associated with these events. All three measures are computed over 30-minute-long time bins and are normalized by their peak values.

[Figure 8: Time-of-day effect. The three measures are plotted over 4.5 days (Tuesday through Saturday) for four path groups: (a) all paths; (b) paths with both endpoints in South or North America; (c) paths with at least one endpoint in Europe; (d) paths with at least one endpoint in Asia. The Y axis in each sub-figure is the ratio to the peak value of the corresponding curve in sub-figure (a). The American, European, and Asian peaks are marked.]

Figure 8(a) shows a typical time-of-day effect in end-to-end congestion. All three measures show a consistent time-of-day pattern with a daily peak at 12:00∼21:00 GMT. However, this is a mixed time-of-day pattern for traffic generated by users from all over the world. To get a clearer picture, we decompose it into different time zones and compare each component with its local time. To this end, we classify paths according to the continents in which their endpoints reside. Since the two endpoints of a path can reside on different continents, we make an approximation by dividing paths into the following three groups: (i) paths with both endpoints in North or South America, (ii) paths with at least one endpoint in Europe, and (iii) paths with at least one endpoint in Asia. Figures 8(b)-(d) plot the congestion events captured by the three path groups, respectively.

From the three figures we can resolve three different daily peaks which map well to the local diurnal times of three continents. We denote them the "American peak", the "European peak", and the "Asian peak", respectively, as summarized in the following table.

Peak             GMT time       Local time
American peak    16:00–22:00    10:00–16:00 (GMT-6)
European peak    11:00–17:00    12:00–18:00 (GMT+1)
Asian peak       02:00–06:00    10:00–14:00 (GMT+8)

Next, we explain the above observations. First, it is well known that links at network edges are more congested than those in the core [23]. As a result, the main component of end-to-end congestion is congestion at the edges, i.e., congestion close to the measuring endpoints. In Figure 8(b), since both endpoints of a measuring path are located in the Americas, the figure shows a clear "American peak" in the sense that it matches the local diurnal time.

Figures 8(c) and 8(d) expose the same causality, capturing the "European peak" and the "Asian peak," respectively. However, since we apply the geographic constraint to only one endpoint, these two figures capture more than one daily peak. This is most apparent in Figure 8(d). In addition to the "Asian peak," this figure captures the "European peak" and the "American peak" as well. This is because only one endpoint of each path is required to be in Asia; the other endpoint can reside in America, Europe, etc.

4.3.2 Link-level Monitoring

Here, we evaluate the properties of the link-level congestion monitored via Pong by comparing them with the properties of the end-to-end congestion monitored via TMon. We emphasize the following two fundamental differences between the two approaches and reveal the advantages of link-level congestion monitoring.

First, link-level congestion monitoring allows us to focus on congestion events at a specific location in the Internet core, which can show different properties than the congestion at the edges. By contrast, end-to-end congestion monitoring inevitably mixes congestion from all locations on a path, and it is typically dominated by congestion at the edges.

Second, when monitoring a wide network area, our link-level congestion monitoring approach makes it possible to cover the area in an unbiased way (Section 3.5.2). On the contrary, location-unaware end-to-end monitoring would inevitably cover some congested links with a number of measuring paths. Such biased coverage leads to biased measurement results, because the same congested locations are counted repeatedly.

Figure 9(a) shows the results of the link-level congestion monitored via Pong. Since we know the congestion location, we can filter out congestion events at edge links.

As a result, the figure actually shows the aggregate link-level congestion on core links. The figure is plotted in terms of three measures: (i) the number of congestion events, (ii) the number of distinct Pong paths associated with these events, and (iii) the number of distinct links associated with these events. Similarly to the way we analyze end-to-end congestion, all three measures are computed over 30-minute-long epochs and are normalized by their peak values.

[Figure 9: Comparison between link congestion and end-to-end congestion observation over the same week (Saturday through Friday): (a) link congestion; (b) end-to-end congestion. The Y axis is the ratio to the peak value of each curve.]

            Figure 9(a)          Figure 9(b)
Weekend     10,060 (279/hour)    5,684 (158/hour)
Weekdays    44,931 (365/hour)    17,552 (143/hour)
Total       54,991 (346/hour)    23,236 (146/hour)

Table 3: Number of congestion events in Figure 9

Figure 9(b) shows the results of the end-to-end congestion monitored via TMon during the same week. The two figures reveal important differences between link-level and end-to-end congestion dynamics. In particular, Figure 9(a) shows a clearer time-of-day pattern: peaks are higher and valleys are deeper. Daily peaks in Figure 9(a) do not necessarily correspond to daily peaks in Figure 9(b), though they do correspond to some minor peaks there. This is because congestion on core links necessarily leads to end-to-end congestion, but it does not dominantly affect the end-to-end congestion. For the valleys, the two figures show an even larger mismatch. For example, the valleys in Figure 9(a) can correspond to epochs in Figure 9(b) at which there is high congestion. This happens because the end-to-end congestion is dominated by congestion at the edges.

In addition to the daily differences, the two figures also show a significant weekly difference. As summarized in Table 3, Figure 9(a) shows far fewer congestion events during the weekend than during the weekdays. On the contrary, Figure 9(b) shows slightly more congestion events during the weekend than during the weekdays.

Overall, we find that congestion in the core tends to behave very differently from end-to-end congestion; likewise, our link-level monitoring approach makes possible much more accurate insights about congestion in the core. In addition, the clearer time-of-day effect observed in Figure 9(a) is also due to the unbiased coverage achieved by our algorithm. In particular, the link coverage of the Pong paths is optimized to be uniform, because our triggering logic effectively reduces the chance that multiple Pong paths concurrently measure the same congested link.

4.4 Link-level Congestion Correlation

In this section, we utilize our methodology to analyze congestion correlation across core links and describe our basic findings. We present an in-depth analysis of the underlying correlation causes in the next section.

To quantify the congestion correlation, we introduce a measure called observed correlation, which represents the pairwise congestion correlation between two links. To compute it, we first search for concurrent link congestion events during each 30-second time period and update a matrix that records how many times each link pair has been concurrently congested (we call this measure the overlap count). Then, we divide the overlap count by the smaller of the two links' total congestion event counts. The quotient is the observed correlation. We did not use the classical statistical measure of correlation due to the complexity of acquiring the overlapped measurement period of two links in our trigger-based measurement context¹⁰.

In addition, to prevent scenarios in which the smaller total congestion event count is so small that it might lead to an unreasonably high correlation, we require the overlap count to be at least 10; otherwise, we filter out that link pair. We place the requirement on the overlap count rather than on the smaller total congestion event count because we also want to filter the cases in which two links do not share many concurrent measurement periods, which would yield an unreasonably low correlation. Indeed, because concurrent measurements might not always be available, the observed correlation may slightly underestimate the correlation between links.

[Figure 10: CDF of the observed correlation between link pairs, plotted separately for weekdays and the weekend.]

¹⁰ We only know the overlapped measurement periods of paths. It is complex to resolve whether a specific path measurement period was triggered by a specific link or not.
10 to Figure 9(a), which are data of link-level congestion on most likely to happen fortifies our hypothesis (Section the core links in the one-week-long experiment. We plot 4.5.3). the CDF of observed correlations for the weekend and 4.5.1 Spatial Distribution of Correlated Links weekdays separately as shown in Figure 10. The figure To perform an in-depth analysis about congestion shows that (i) congestion correlation across core links correlation, we select top 20 links in terms that they are is not rare and there are quite a number of repetitively commonly correlated with the largest number of other congested links that bear relatively high congestion cor- links. We call them the “causal links.” Note that these relations; (ii) the observed correlations are much higher links are not necessarily the real cause, but they tend to during the weekend than during the weekdays. be the most sensitive indicators for underlying causes; The second phenomenon implies a relationship be- hence, we can roughly treat them as the cause. tween the observed correlations and overall congestion We analyze the locations of each “causal link” to- extent on the core links. Recall that in Figure 9(a) we gether with those links highly correlated with it. To observe a much lower overall congestion extent on the represent the location, we classify links into intra-AS core links during the weekend than during the week- links and inter-AS links.11 We find that the above links days. We infer that the reason we observe higher corre- can reside at diversified locations, including (i) intra- lation during the weekend is that a pairwise correlation AS links within large ISPs (including Tier-1, Tier-2 and is much easier to observe when the overall congestion other backbone networks); (ii) inter-AS links between extent is low. We hypothesize that we observe lower cor- large ISPs; and (iii) inter-AS links from a lower tier to relation during weekdays not because the same pairwise a higher tier. correlation does not exist, but because they are blurred as a result of multiple pairwise correlations when the Congestedinter-ASlink(arrowshowstrafficdirection) overall congestion extent is high. Congestedintra-ASlink Peeringbetweentwo Ases PeeringwithallTier-1ISPs AS 81 Tie-1network To understand how far a pair of correlated links can (Peeringarefullymeshed) AS 174 AS be distant from each other, we select top 25% (in terms AS 2914 19401 AS 3549 NTT America COGENT of observed correlation) link pairs in the weekdays and AS 7018 GBLX AT&T AS 293 weekend respectively. We roughly estimate the short- AS 701 ESNET UUNET AS 237 est distance between the two links in each pair based AS 1299 MERIT AS 7029 TELIANET on our topology database. We represent the distance WINDSTREAM “Causal AS 2500 AS link” JPNIC by two measures: AS distance and hop distance. The AS Tier-1orlarge 11317 7896 AS 11537 former quantifies the AS level distance, the latter quan- Tier-2ISPs AS 20965 ABILENE GEANT AS AS AS 600 tifies the router level distance. Table 4 summarizes the AS 559 AS 10578 40498 OARNET SWITCH 11714 AS average and standard deviation of estimated distances. AS 19782 AS 5661 OtherISPs In addition, our result shows that the correlated links Universitiesandacademicinstitutions 14085 can span across up to three neighboring ASes (i.e., AS A typicalexampleoftheaggregationeffect distance ≤ 3). This figure shows locations of congested links correlated with an inter-AS link between AS 11537 and AS 237. 
4.5 Aggregation Effect Hypothesis

One important finding from our experiments is that the effect of traffic aggregation tends to play an important role in congestion behavior at core links. Here, traffic aggregation describes the situation in which traffic from a number of upstream links converges at a downstream AS-level aggregation link. We make this hypothesis based on two phenomena that we measured: (i) the spatial distribution pattern of correlated links tends to result from the aggregation effect (Section 4.5.1); and (ii) some hot spots in the core exhibit time-independent high congestion, which also tends to result from the aggregation effect (Section 4.5.2). A comparison between the congestion locations related to these two phenomena and the locations where the aggregation effect is most likely to happen fortifies our hypothesis (Section 4.5.3).

4.5.1 Spatial Distribution of Correlated Links

To perform an in-depth analysis of congestion correlation, we select the top 20 links in terms of the number of other links they are correlated with. We call them the "causal links." Note that these links are not necessarily the real cause, but they tend to be the most sensitive indicators of underlying causes; hence, we can roughly treat them as the cause.

We analyze the location of each "causal link" together with those links highly correlated with it. To represent the location, we classify links into intra-AS links and inter-AS links.11 We find that these links can reside at diverse locations, including (i) intra-AS links within large ISPs (including Tier-1, Tier-2, and other backbone networks); (ii) inter-AS links between large ISPs; and (iii) inter-AS links from a lower tier to a higher tier.

[Figure 11: In-depth analysis on spatial distribution of correlated links. The figure shows the locations of congested links correlated with an inter-AS link between AS 11537 and AS 237; the lower part shows a typical example of the aggregation effect. The legend distinguishes congested inter-AS links (arrows show traffic direction), congested intra-AS links, peering between two ASes, and peering with all Tier-1 ISPs (Tier-1 peerings are fully meshed); nodes are grouped into Tier-1 networks, Tier-1 or large Tier-2 ISPs, other ISPs, and universities and academic institutions. Among the ASes shown are AS 7018 (AT&T), AS 701 (UUNET), AS 2914 (NTT America), AS 3549 (Global Crossing), AS 174 (Cogent), AS 1299 (TeliaNet), AS 237 (Merit), AS 11537 (Abilene), AS 20965 (GEANT), and AS 559 (SWITCH).]

To describe our in-depth analysis of correlated links, we use the top "causal link" as an example. It is an inter-AS link between AS 11537 (Abilene) and AS 237 (Merit). This link shows high correlation with 25 other links. Figure 11 illustrates the AS-level locations of 23 of these links that we can resolve. The 23 links cover an area that spans as many as 24 distinct ASes, all of them no more than 3 ASes away from the "causal link". As we can see from Figure 11, most links reside upstream of the "causal link". Examples are: (i) an intra-AS link within AS 11537; (ii) inter-AS links from seven universities that directly or indirectly access AS 11537; and (iii) intra-AS links within Tier-1 and Tier-2 ISPs and backbone networks that are close to AS 11537.

11 Classifying intra- and inter-AS links is a non-trivial task. We use a best-effort algorithm based on the IP ownership of the two endpoints of each link.
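The best-effort classification from footnote 11 might look roughly as follows. The prefix-ownership table is a hypothetical stand-in for a real IP-ownership database; only the endpoint-ownership comparison reflects what the footnote actually specifies.

    # Illustrative sketch of footnote 11's best-effort classifier. The
    # prefix-to-AS table is a hypothetical stand-in for a real
    # IP-ownership database; the prefixes below are documentation ranges.
    import ipaddress

    PREFIX_TO_ASN = {
        ipaddress.ip_network("192.0.2.0/24"): 11537,   # placeholder block
        ipaddress.ip_network("198.51.100.0/24"): 237,  # placeholder block
    }

    def ip_to_asn(ip):
        # Longest-prefix match of an address against the ownership table.
        addr = ipaddress.ip_address(ip)
        matches = [net for net in PREFIX_TO_ASN if addr in net]
        if not matches:
            return None  # ownership unknown; leave the link unclassified
        return PREFIX_TO_ASN[max(matches, key=lambda net: net.prefixlen)]

    def classify_link(ip_a, ip_b):
        # Label a link intra-AS or inter-AS from its endpoints' owners.
        asn_a, asn_b = ip_to_asn(ip_a), ip_to_asn(ip_b)
        if asn_a is None or asn_b is None:
            return "unknown"
        return "intra-AS" if asn_a == asn_b else "inter-AS"

    print(classify_link("192.0.2.1", "192.0.2.9"))     # intra-AS
    print(classify_link("192.0.2.1", "198.51.100.1"))  # inter-AS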

Analysis of other "causal links" shows a similar congestion distribution pattern. Such a distribution pattern fortifies our hypothesis that the congestion correlation could result from the aggregation effect: when upstream traffic converges at a relatively thin aggregation point, upstream traffic surges can cause congestion at the aggregation point, hence the high probability that congestion at the two places (the upstream network and the aggregation point) is correlated. Take Abilene for example. Its cross-country backbone is 10 Gbps, while the two aggregation points (one in Chicago, the other in ) from Abilene to Merit tend to have less provisioned bandwidth, e.g., the one in Chicago still uses an OC-12 (622 Mbps) connection [5]. In addition, Abilene's aggregate network statistics for 2007 [1] show that aggregate traffic from Abilene to Merit is usually about twice that in the reverse direction. This matches the congestion direction that we observe in Figure 11.

4.5.2 Locations of Time-Independent Hot Spots

In addition to the spatial distribution of correlated links, we observe another phenomenon that fortifies our aggregation hypothesis: a number of hot spots exhibit time-independent high congestion. By analyzing the locations of such hot spots, we find a high likelihood that they result from the aggregation effect.

[Figure 12: Effect of time-independent hot spots. Two curves are plotted against time (Tuesday noon through Saturday noon): the number of congestion events and the average congestion intensity over events. The Y axis is the ratio to the peak value of each curve.]

To understand the properties of such time-independent hot spots, we show their effect on end-to-end congestion observations in Figure 12. The figure plots the dynamics of two measures: the number of congestion events and the average per-event congestion intensity. Both measures are computed over 30-minute-long time bins and are normalized by their peak values. The figure shows that the number of congestion events follows a clear time-of-day pattern. By contrast, the average congestion intensity exhibits very different dynamics: it is largely independent of the time-of-day effect, and it even peaks at the valleys of the congestion-event-count curve, and vice versa.

This phenomenon results from the following property of the time-independent hot spots, which we found through investigation: although such hot spots exhibit the time-of-day effect in terms of the number of congestion events (fewer events when overall network congestion is low), their per-event congestion intensities always remain high. This is why we can often observe a much higher average congestion intensity when the total number of congestion events is small.
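The two curves in Figure 12 can be derived from raw event logs along the following lines, assuming each congestion event is recorded as a (timestamp, intensity) pair; that input format is our assumption rather than the system's actual interface.

    # Illustrative sketch (input format assumed): per-bin event counts
    # and average per-event congestion intensity, each normalized by its
    # peak, as plotted in Figure 12.
    from collections import defaultdict

    BIN = 30 * 60  # 30-minute time bins, in seconds

    def bin_events(events):
        # events: iterable of (unix_timestamp, intensity) pairs.
        counts, sums = defaultdict(int), defaultdict(float)
        for ts, intensity in events:
            b = int(ts) // BIN
            counts[b] += 1
            sums[b] += intensity
        bins = sorted(counts)
        n_events = [counts[b] for b in bins]
        avg_intensity = [sums[b] / counts[b] for b in bins]
        norm = lambda xs: [x / max(xs) for x in xs]  # peak-normalize
        return bins, norm(n_events), norm(avg_intensity)

    # Toy example: many mild events early, a few intense events later;
    # the count curve and the intensity curve peak in different bins.
    events = [(0, 1.0), (60, 1.2), (120, 0.9), (200, 1.1), (3600, 9.5)]
    print(bin_events(events))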
We find that such time-independent hot spots are inter-AS links between large backbone networks all over the world, as well as intra-AS links within these networks. The former are about 1.5 times as numerous as the latter. Contrary to our initial hypothesis, most of these links are not inter-continental. Table 5 shows the ASes that host the most such time-independent hot spots. We list them in descending order of the maximum congestion intensity measured on the hot spots they host, i.e., AS 174 shows the largest maximum congestion intensity. In the next section, we show that these locations are the places where the aggregation effect is most likely to happen.

    AS#    Description
    174    Cogent Communications, a large Tier-2 ISP.
    1299   TeliaNet Global Network, a large Tier-2 ISP.
    20965  GEANT, a main European multi-gigabit computer network for
           research and education purposes, Tier-2.
    4323   Time Warner Telecom, a Tier-2 ISP in the US.
    3356   Level 3 Communications, a Tier-1 ISP.
    237    Merit, a Tier-2 network in the US.
    6461   Abovenet Communications, a large Tier-2 ISP.
    27750  RedCLARA, a backbone connecting the Latin-American National
           Research and Education Networks to Europe.
    6453   Teleglobe, a Tier-2 ISP.
    2914   NTT America, a Tier-1 ISP.
    3549   Global Crossing, a Tier-1 ISP.
    11537  Abilene, a backbone network in the US.
    4538   China Education and Research Network.

Table 5: Backbone networks with strong hot spots

4.5.3 AS-level Traffic Aggregation Effect

To analyze the aggregation effect hypothesis, we compare the top 20 correlated links (those commonly correlated with the largest number of links) mentioned in Section 4.5.1 and the locations of the time-independent hot spots mentioned in the previous section with the locations where the aggregation effect is most likely to take place.

The aggregation effect is most likely to happen at (i) networks that have the largest number of peers, and (ii) ISPs that are most aggressively promoting customer access. Table 6 shows the networks within the top ten of both categories that match the locations of the top 20 correlated links and the time-independent hot spots that we measured. For the top ten networks with the largest number of peers, we use the ranks provided by FixedOrbit [4]. For the top ten ISPs most aggressively promoting customer access, we use the ranks in terms of the Renesys customer base index [7] for three different continents. The definition of the Renesys customer base index is given with the table.

Table 6 shows a remarkable location match between our results and statistics collected by others. Indeed, Table 6 shows that almost all the hot spots shown in

Table 5 and Figure 11 are located at (i) networks that have the largest number of peers [4], or (ii) ISPs that are aggressively promoting customer access. Hence, the phenomena of both the congestion correlation and the time-independent hot spots could be closely related to the AS-level traffic aggregation effect. First, an upstream link could bear congestion correlation with a downstream AS-level aggregation link. Second, time-independent hot spots usually arise at AS-level aggregation points.

    Rank  Network          Peers
    1     UUNET            2,346
    2     AT&T WorldNet    2,092
    3     Level 3 Comm.    1,742
    5     Cogent Comm.     1,642
    7     Global Crossing  1,041
    8     Time Warner        918
    9     Abovenet           798

(a) Matched locations in the top ten networks defined by the number of peers

    North America          Europe                          Asia
    Rank  ISP              Rank  ISP                       Rank  ISP
    1     Level 3 Comm.    1     Level 3 Comm.             2     NTT America
    2     UUNET            2     TeliaNet Global Network   6     UUNET
    3     AT&T WorldNet    4     Global Crossing           8     AT&T WorldNet
    6     Cogent Comm.     8     Teleglobe                 9     Level 3 Comm.
    9     Global Crossing                                  10    Teleglobe

(b) Matched locations in the top ten ISPs defined by the Renesys customer base index across three continents for the month of February 2006

The Renesys customer base index [7] is defined based on the following five criteria: (1) which service providers have the most customer networks; (2) which service providers are acquiring customer networks at the fastest rate; (3) which service providers are experiencing the least customer churn; (4) which service providers have the most customers with only one link to the Internet; and (5) which service providers connect to the most customer networks, both directly and through peering relationships.

Table 6: Matched locations in top ten networks/ISPs defined by the number of peers and the Renesys customer base index

5. RELATED WORK

Our system and findings relate to other network monitoring systems, Internet tomography, and root-cause and performance-modeling analyses, which we outline below.

Trigger-based Monitoring. One of the key features of our monitoring system is its triggered measurement nature. In this context, Zhang et al.'s PlanetSeer [41] monitoring system is closest to ours, both in spirit and design. PlanetSeer detects network path anomalies such as outages or loops using passive observations of the CoDeeN CDN [2], and then initiates active probing. Contrary to PlanetSeer, in the absence of passive traffic monitoring, we use TMon, a light mesh-like active monitoring subsystem. Also, instead of monitoring loops and outages, we focus on congestion hot spots. Understanding if and how loops or outages affect congestion is part of our intended future work.

Other systems that apply a triggered-measurement approach include Boschi et al.'s SLA-validation system [17] and Wen et al.'s on-demand monitoring system [39]. In addition to the fact that our system's goals are fundamentally different from all of the above, our system also requires measuring entire network areas, and hence triggering a large number of vantage points, concurrently.

Information Plane. Our congestion monitoring system relates to information dissemination systems that monitor and propagate important network performance inferences to end-points (e.g., [14,22,28,30,38,40]). In general, such systems can be divided into two types. The first manages information about networks or nodes under the control of the information plane (e.g., RON [14]), while the second predicts path performance at Internet scale (e.g., iPlane [30]). As a result, the first type provides information over short time scales, while the second does so over much longer time scales, e.g., 6 hours [30].

Our system takes the best of the two worlds: its light mesh-like monitoring approach enables covering Internet-wide areas, yet its trigger-based approach helps it effectively focus on a smaller number of important events [34], and consequently disseminate the information about them over shorter time scales.

Locating Internet Bottlenecks. A number of tools have been designed to detect and locate Internet bottlenecks and their properties, e.g., [11,29,31,33]. A more detailed discussion of these approaches is available in [23]. Common to most of these tools is that they are designed for monitoring a single end-to-end path, and hence may not be suitable for large-scale Internet monitoring due to their large measurement overhead. Because we depend upon a light measurement tool [23], we are capable of generating concurrent measurements over a large Internet area and revealing correlation among concurrently congested hot spots, without overloading either the network or the monitoring vantage points. In particular, each node in our system experiences a moderate measurement overhead of 30 kbps.

Internet Correlations. The Internet is a complex system composed of thousands of ASes. Inevitably, events happening in one part of the network can have repercussions on other parts. For example, Sridharan et al. [35] find correlation between route dynamics and routing loops; Feamster et al. [26] show that there exists correlation between link failures and BGP routing messages; Teixeira et al. [37] find that hot-potato routing can trigger BGP events; in the other direction, Agarwal et al. [10] show that BGP routing changes can cause traffic shifts in a single backbone network.
6. CONCLUSIONS

In this paper, we performed an in-depth study of Internet congestion behavior with a focus on the Internet core. We developed a large-scale triggered monitoring system that integrates the following unique properties: (i) it is able to locate and track congestion events concurrently at a large fraction of Internet links; (ii) it exploits triggering mechanisms to effectively allocate measurement resources to hot spots; and (iii) it deploys a set of novel online path-selection algorithms that deliver highly accurate observations of the underlying congestion. Our system's ability to directly observe distinct link-level congestion events concurrently allows a much deeper understanding of Internet-wide congestion.

Our major findings from experiments using this system are as follows: (i) Congestion events in the core can be highly correlated. Such correlation can span across up to three neighboring ASes. (ii) There are a small number of hot spots between or within large backbone networks exhibiting highly intensive time-independent congestion. (iii) The phenomena of both the congestion correlation and the time-independent hot spots could be closely related to the AS-level traffic aggregation effect. (iv) Congestion dynamics in the core is quite different from that at the edges.

7. REFERENCES
[1] Abilene aggregate network statistics. http://stryper.grnoc.iu.edu/abilene/aggregate/html/.
[2] CoDeeN. http://codeen.cs.princeton.edu/.
[3] Emulab. http://www.emulab.net/.
[4] FixedOrbit. http://fixedorbit.com/.
[5] Merit Network. http://www.merit.edu/.
[6] PlanetLab. http://www.planet-lab.org/.
[7] Renesys Corporation. http://www.renesys.com/.
[8] Skype. http://www.skype.com/.
[9] Verio backbone SLA terms and conditions.


http://www.verio.com/global-ip-guarantee/.
[10] S. Agarwal, C. Chuah, S. Bhattacharyya, and C. Diot. The impact of BGP dynamics on intra-domain traffic. In SIGMETRICS'04.
[11] A. Akella, S. Seshan, and A. Shaikh. An empirical evaluation of wide-area Internet bottlenecks. In IMC'03.
[12] A. Akella, A. Shaikh, and S. Seshan. A comparison of overlay routing and multihoming route control. In SIGCOMM'04.
[13] A. Akella, A. Shaikh, and S. Seshan. Multihoming performance benefits: An experimental evaluation of practical enterprise strategies. In USENIX Technical Conference, 2004.
[14] D. Andersen, H. Balakrishnan, M. Kaashoek, and R. Morris. Resilient overlay networks. In SOSP'01.
[15] T. Anderson, A. Collins, A. Krishnamurthy, and J. Zahorjan. PCP: Efficient endpoint congestion control. In NSDI'06.
[16] P. Barford, A. Bestavros, J. Byers, and M. Crovella. On the marginal utility of network topology measurements. In SIGCOMM IMW'01.
[17] E. Boschi, M. Bossardt, and T. Dubendorfer. Validating inter-domain SLAs with a programmable traffic control system. In IWAN'05.
[18] L. Brakmo and L. Peterson. TCP Vegas: End to end congestion avoidance on a global Internet. IEEE JSAC, 1995.
[19] T. Bu, N. Duffield, F. Presti, and D. Towsley. Network tomography on general topologies. In SIGMETRICS'02.
[20] R. Caceres, N. Duffield, J. Horowitz, and D. Towsley. Multicast-based inference of network-internal loss characteristics. IEEE Transactions on Information Theory, 1999.
[21] Y. Chen, D. Bindel, H. Song, and R. Katz. An algebraic approach to practical and scalable overlay network monitoring. In SIGCOMM'04.
[22] D. Clark, C. Partridge, J. Ramming, and J. Wroclawski. A knowledge plane for the Internet. In SIGCOMM'03.
[23] L. Deng and A. Kuzmanovic. Pong: Diagnosing spatio-temporal Internet congestion properties. In SIGMETRICS (poster), 2007. A longer version is available at http://networks.cs.northwestern.edu/pong/.
[24] M. Dischinger, A. Haeberlen, K. Gummadi, and S. Saroiu. Characterizing residential broadband networks. In IMC'07.
[25] N. Duffield, J. Horowitz, D. Towsley, W. Wei, and T. Friedman. Multicast-based loss inference with missing data. IEEE JSAC, 2002.
[26] N. Feamster, D. Andersen, H. Balakrishnan, and M. F. Kaashoek. Measuring the effects of Internet path failures on reactive routing. In SIGMETRICS'03.
[27] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockwell, T. Seely, and C. Diot. Packet-level traffic measurements from the Sprint IP backbone. IEEE Network, 2003.
[28] J. Hellerstein, V. Paxson, L. Peterson, T. Roscoe, S. Shenker, and D. Wetherall. The network oracle. IEEE Data Engineering Bulletin, 2005.
[29] N. Hu, L. Li, Z. Mao, P. Steenkiste, and J. Wang. Locating Internet bottlenecks: Algorithms, measurements, and implications. In SIGCOMM'04.
[30] H. Madhyastha, T. Isdal, M. Piatek, C. Dixon, T. Anderson, A. Krishnamurthy, and A. Venkataramani. iPlane: An information plane for distributed services. In OSDI'06.
[31] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. User-level Internet path diagnostics. In SOSP'03.
[32] V. Padmanabhan, L. Qiu, and H. Wang. Server-based inference of Internet link lossiness. In INFOCOM'03.
[33] V. Ribeiro, R. Riedi, and R. Baraniuk. Locating available bandwidth bottlenecks. IEEE Internet Computing, 2004.
[34] H. Song, L. Qiu, and Y. Zhang. NetQuest: A flexible framework for large-scale network measurement. In SIGMETRICS'06.
[35] A. Sridharan, S. Moon, and C. Diot. On the correlation between route dynamics and routing loops. In IMC'03.
[36] C. Tang and P. McKinley. On the cost-quality tradeoff in topology-aware overlay path probing. In ICNP'03.
[37] R. Teixeira, A. Shaikh, T. Griffin, and J. Rexford. Dynamics of hot-potato routing in IP networks. In SIGMETRICS'04.
[38] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: An information plane for networked systems. In HotNets'03.
[39] Z. Wen, S. Triukose, and M. Rabinovich. Facilitating focused Internet measurements. In SIGMETRICS'07.
[40] P. Yalagandula, P. Sharma, S. Banerjee, S. Basu, and S. Lee. S3: A scalable sensing service for monitoring large networked systems. In SIGCOMM INM'06.
[41] M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang. PlanetSeer: Internet path failure monitoring and characterization in wide-area services. In OSDI'04.
[42] Y. Zhang, N. Duffield, V. Paxson, and S. Shenker. On the consistency of Internet path properties. In SIGCOMM IMW'01.
[43] Y. Zhao, Y. Chen, and D. Bindel. Towards unbiased end-to-end network diagnosis. In SIGCOMM'06.
