
REQUEST ROUTING IN CONTENT DELIVERY NETWORKS

by HUSSEIN A. ALZOUBI

Submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Dissertation Advisor: Prof. Michael Rabinovich

Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY

January 2015

The Dissertation Committee for Hussein A. Alzoubi certifies that this is the approved version of the following dissertation:

Request Routing in Content Delivery Networks

Committee:

Michael Rabinovich, Supervisor

Christos Papachristou

Daniel Saab

Francis Merat

Date: 12 / 03 / 2014

To my parents for all their effort that led me to reach this point,
To my beloved wife Enaas for all her help, patience, and support,
To my kids Leen and Jawad with sincere love.

Contents

List of Tables
List of Figures
Acknowledgments
Abstract

Chapter 1  Introduction

Chapter 2  The Anatomy of LDNS Clusters: Findings and Implications for Web Content Delivery
    2.1  Introduction
    2.2  Related Work
    2.3  System Instrumentation
    2.4  The Dataset
    2.5  Cluster Size
        2.5.1  Number of Clients
        2.5.2  Cluster Activity
    2.6  TTL Effects
    2.7  Client-to-LDNS Proximity
        2.7.1  Air-Miles Between Client and LDNS
        2.7.2  Geographical Span
        2.7.3  AS Sharing
    2.8  Top-10 LDNSs and Clients
    2.9  Client Site Configurations
        2.9.1  Clients OR LDNSs?!
        2.9.2  LDNS Pools
    2.10  Implications for Web Content Delivery
    2.11  Summary

Chapter 3  A Practical Architecture for an Anycast CDN
    3.1  Introduction
    3.2  Related Work
    3.3  Architecture
        3.3.1  Load-aware Anycast CDN
        3.3.2  Objectives and Benefits
        3.3.3  Dealing with Long-Lived Sessions
        3.3.4  Dealing with Network Congestion
    3.4  Remapping Algorithm
        3.4.1  Problem Formulation
        3.4.2  Minimizing Cost
        3.4.3  Minimizing Connection Disruption
    3.5  Evaluation Methodology
        3.5.1  Data Set
        3.5.2  Simulation Environment
        3.5.3  Schemes and Metrics for Comparison
    3.6  Experimental Results
        3.6.1  Server Load Distribution
        3.6.2  Disrupted and Over-Capacity Requests
        3.6.3  Request Air Miles
        3.6.4  Computational Cost of Remapping
        3.6.5  The Effect of Remapping Interval
    3.7  Summary

Chapter 4  Performance Implications of Unilateral Enabling of IPv6
    4.1  Introduction
    4.2  Background
    4.3  Related Work
    4.4  Methodology
    4.5  The Dataset
    4.6  The Results
        4.6.1  DNS Resolution Penalty
        4.6.2  End-to-End Penalty
    4.7  Summary

Chapter 5  IPv6 Anycast CDNs
    5.1  Introduction
    5.2  Background
        5.2.1  IPv6
        5.2.2  TCP
        5.2.3  IPv6 Mobility Overview
    5.3  Related Work
    5.4  Lightweight IPv6 Anycast for Connection-Oriented Communication
    5.5  IPv6 Anycast CDN Architecture
    5.6  Summary

Chapter 6  Conclusion

Bibliography

List of Tables

2.1  High-level dataset characterization
2.2  Clients OS breakdown
2.3  Clients browsers breakdown
2.4  Activity of client-LDNS associations sharing the same AS
4.1  The basic IPv6 statistics

List of Figures

1.1  Basic architecture of CDNs
1.2  Anycast Based Redirection
2.1  Measurement Setup
2.2  Distribution of LDNS cluster sizes
2.3  Distribution of sub1 requests and client/LDNS pairs attributed to LDNS clusters of different sizes
2.4  LDNSs Activity in terms of DNS and HTTP requests
2.5  LDNS cluster sizes within TTL windows (all windows)
2.6  Average LDNS cluster sizes within a TTL window (averaged over all windows for a given LDNS)
2.7  HTTP requests within TTL windows (all windows)
2.8  Average number of HTTP requests per LDNS within a TTL window (averaged over all windows for a given LDNS)
2.9  Air miles for all client/LDNS pairs
2.10  Avg client/LDNS distance in top LDNS clusters
2.11  Avg client/LDNS distance for all LDNS clusters
2.12  CDF of LDNS clusters with a given % of clients/LDNSs outside their LDNS's/Client's autonomous system
2.13  AS sharing of top-10 LDNSs and their clients
2.14  Air miles between top-10 LDNSs and their clients
2.15  AS sharing of top-10 clients and their LDNSs
2.16  Air-miles for top 10 LDNSs and top-10 clients
2.17  Distribution of LDNS types
2.18  Cluster size distribution of LDNS groups
2.19  The number of sub1 requests issued by LDNSs of different types
2.20  Number of sub1 requests issued by One2One LDNSs
2.21  LDNS Pool
3.1  Load-aware Anycast CDN Architecture
3.2  Application level redirection for long-lived sessions
3.3  Number of concurrent requests for each scheme (Large files group)
3.4  Number of concurrent requests for each scheme (Small objects group)
3.5  Service data rate for each scheme (Large files group)
3.6  Service data rate for each scheme (Small objects group)
3.7  Disrupted and over-capacity requests for each scheme (Y-axis in log scale)
3.8  Average miles for requests calculated every 120 seconds
3.9  99th percentile of request miles calculated every 120 seconds
3.10  Execution time of the alb-a and alb-o algorithms in the trace environment
3.11  Total offered load pattern (synthetic environment)
3.12  Scalability of the alb-a and alb-o algorithms in a synthetic environment
3.13  The effect of remapping interval on disrupted connections
3.14  The effect of remapping interval on cost (common 6-hour trace period)
3.15  The effect of remapping interval on dropped requests (common 6-hour trace period)
3.16  The effect of over-provisioning on over-capacity requests (common 6-hour trace period)
4.1  Measurement Setup. Presumed interactions are marked in blue font
4.2  Time difference between A and AAAA "sub" requests
4.3  Comparison of all IPv6 and IPv4 delays
4.4  IPv4 and IPv6 delays per client
5.1  IPv6 Packet Header Format
5.2  IPv6 Destination Option Header Format
5.3  Typical TCP Connection
5.4  TCP Interaction For an IPv6 Anycast Server
5.5  TCP Interaction For an IPv6 Anycast Established Connection
5.6  IPv6 Anycast CDN
5.7  Redirection in IPv6 Anycast CDN

Acknowledgments

First and foremost I would like to take this opportunity to express my thanks and gratitude to God for all his blessings and bounties that he has bestowed upon me. I would like to thank my advisor, Professor Michael Rabinovich, for his great efforts, patience, insights, and endless guidance. I would also like to thank my dissertation committee, Professors Christos Papachristou, Daniel Saab, and Francis Merat, for their time and valuable comments. Many thanks to my friends and companions during my journey at Case. Thanks to Osama Al-khaleel, Zakaria Al-Qudah, Ahmad Al-Hammouri, Mohammad Darawad, Mohammad Al-Oqlah, Saleem Bani Hani, Khalid Al-Adeem and all my friends here in the US. Special thanks to Huthaifa Al-Omari and Abdullah Jordan; you have always been supportive and you have made my journey an enjoyable one. My parents and siblings, Mohammad Rabee, Ahmad, Ali, Sajidah and Sojood: thank you all for all your help, all your support, and all your encouragement. Last but not least, Enaas, my beloved wife, and Leen and Jawad, the joy of my life: I wish I had the words to express my thanks to you for your endless support, great patience and encouragement. Thank you from the bottom of my heart.

Hussein A. Alzoubi

Case Western Reserve University

January 2015

Request Routing in Content Delivery Networks

HUSSEIN A. ALZOUBI

The Internet has become, and continues to grow as, the main distributor of digital media content. The media content in question runs the gamut from operating system patches and gaming software to more traditional Web objects and streaming events, and more recently user-generated video content. Content Delivery Networks (CDNs) (e.g., Akamai, Limelight, AT&T ICDS) have emerged over the last decade to help content providers deliver their digital content to end users in a timely and efficient manner. The key challenge to the effective operation of any CDN is to redirect clients to the "best" service server from which to retrieve the content, a process normally referred to as "redirection" or "request routing". Most commercial CDNs use a DNS-based request routing mechanism to perform redirection. In this mechanism, the DNS system operated by the CDN receives queries, via clients' Local DNS (LDNS) servers, for hostnames of the accelerated URLs and resolves these queries into the IP address of a CDN server that the DNS system selects for a given query. DNS-based request routing, however, exhibits several well-known limitations. First, DNS-based request routing operates at the granularity of LDNS servers, and what might be a good choice for an LDNS is not necessarily a good choice for all its clients. Second, redirecting a single LDNS might cause a large number of clients

behind that LDNS to be redirected to the same CDN node, causing potential load balancing problems. In addition, DNS-based CDNs suffer from the limitation that the DNS system was not designed for very dynamic changes in the mapping between hostnames and IP addresses. Another problem facing not only CDNs but also the entire Internet apparatus is the scarcity of available IPv4 addresses. IPv4 only supports 4 billion globally routed IP addresses. Even though IPv6 was developed to deal with this long-anticipated IPv4 address exhaustion, the overall Internet transition to IPv6 is still lagging. Further, network paths between clients and Web sites commonly do not support IPv6 even if the two end-hosts are both IPv6-enabled. This dissertation quantifies the effect of the above limitations of DNS-based request routing in CDNs, and offers a practical mechanism for replacing DNS-based with anycast-based request routing. Our proposed CDN architecture effectively addresses the long-known drawbacks of anycast request routing, allowing us to reconsider the practicality of this mechanism. Further, this dissertation addresses the issue of transitioning to IPv6, by first showing that there is virtually no performance penalty for a Web site to unilaterally enable IPv6 support, and then proposing a lightweight architecture for implementing IPv6 anycast for connection-oriented transport. The proposed architecture preserves security and privacy, and facilitates anycast's inherent proximal routing. In addition, this dissertation presents an architecture of an anycast IPv6 CDN that utilizes the proposed IPv6 anycast architecture as the redirection mechanism.

Chapter 1

Introduction

As the Internet continues to grow and to become an essential utility in this Internet age, Web content providers are expected, if not required, to have their digital content delivered to end users in a timely and efficient manner. However, accomplishing this goal is challenging because of the often bursty nature of demand for such content [52], and also because content owners require their content to be highly available and delivered in a timely manner without impacting presentation quality [72]. Content delivery networks (CDNs) (e.g., Akamai, Limelight) have emerged over the last decade as an answer to this challenge and have become an essential part of the current Internet apparatus. In fact, Akamai alone claims to deliver between 15% and 30% of all Web traffic [6]. The basic architecture of most CDNs consists of a set of CDN nodes distributed across the Internet [20]. These CDN nodes serve as the content servers' surrogates, from which clients retrieve content using a number of standard protocols. The key to the effective operation of any CDN is to direct users to the "best" CDN node, a process normally referred to as "redirection" or "request routing" [17]. Throughout this dissertation, the terms "request routing" and "redirection" are used interchangeably. Request routing is challenging because not all content is available from all service servers, not all servers are operational at all times, servers can become overloaded, and a client should be directed to a server that is in close proximity to ensure satisfactory user experience.

A keystone component of not only CDNs but also today's Internet apparatus is the Domain Name System (DNS). Its primary goal is to resolve human-readable host names, such as "cnn.com", to hosts' IP addresses. Virtually all Internet interactions start with a DNS query to resolve a hostname into IP addresses. In particular, suppose a user wants to retrieve some Web content from a Web site on the Internet. The first step here is for the Web browser to send a DNS query to the user's Local DNS Server (LDNS) to resolve the hostname of the requested URL. Unless the hostname is stored in its local cache, the LDNS in turn sends the request (by navigating through the DNS infrastructure) to the Authoritative DNS Server (ADNS)

that is responsible for the requested name. The ADNS server maintains the mapping information of hostnames to IP addresses and returns the corresponding IP addresses back to the LDNS. The LDNS saves the resolution in its local cache and forwards that resolution to the client. The Web browser then stores the resolution in its own cache and proceeds with the HTTP interactions by establishing a session using the

provided IP address.

Content delivery networks fundamentally rely on DNS to re-route user communication from the servers to the CDN infrastructure. A typical technique to achieve this goal leverages the DNS protocol's name aliasing, which is done through a special response type, a CNAME record. As part of service provisioning, the origin site's ADNS is configured to respond to DNS queries for hostnames that are outsourced to the CDN not with the IP address of the origin server but with a CNAME record specifying a hostname from the CDN's domain. For instance, consider a web site foo.com that wants to outsource the delivery of an object with URL http://images.foo.com/pic.jpg to a content delivery network cdn-x.com, as shown in

Figure 1.1. When a client (say, Client 1 in the figure) tries to access the above object, it sends a DNS query for images.foo.com to its local DNS server (step 1 in the figure), which, after traversing the DNS infrastructure, ultimately forwards it to foo.com's ADNS (step 2). Upon receiving this query, the ADNS responds with a

CNAME record listing hostname images.foo.com.cdn-x.com (step 3). The requester (the LDNS that had sent the original query) will now attempt to resolve this new name using another query, which will now arrive at the ADNS for cdn-x.com, as this is the domain to which the new name belongs (step 4). The CDN’s ADNS now can resolve this query to any IP address within its platform (135.207.24.11 in the figure, step 5). The LDNS returns this response to the client (step 6), which then proceeds with the actual HTTP download from the prescribed IP address (step 7). The end result is that the HTTP download has been redirected from foo.com’s content server to the CDN platform, which will provide the desired content, either from its cache or, if the object is not locally available, first obtaining it from the origin server (and in this case storing it in its cache for future requests).

Figure 1.1: Basic architecture of CDNs
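To make the resolution chain above concrete, the following toy sketch mimics the CNAME-based redirection with an in-memory record set. It is only an illustration under assumptions: the record table, the pick_cdn_server() policy, the LDNS address 192.0.2.53, and the second server IP are invented for the example; only 135.207.24.11 comes from the text.

    # Toy model of DNS-based CDN redirection via CNAME aliasing (illustrative only).
    RECORDS = {
        # foo.com's ADNS aliases the outsourced hostname into the CDN's domain (step 3).
        "images.foo.com": ("CNAME", "images.foo.com.cdn-x.com"),
    }

    CDN_SERVERS = ["135.207.24.11", "135.207.24.12"]  # second address is made up

    def pick_cdn_server(ldns_ip):
        # Stand-in for the CDN's server-selection policy (proximity, load, ...).
        return CDN_SERVERS[hash(ldns_ip) % len(CDN_SERVERS)]

    def resolve(hostname, ldns_ip):
        """Follow CNAMEs until an IP address is produced for the querying LDNS."""
        while True:
            rtype, value = RECORDS.get(hostname, (None, None))
            if rtype == "CNAME":
                hostname = value                      # LDNS re-queries the new name (step 4)
            elif hostname.endswith(".cdn-x.com"):
                return pick_cdn_server(ldns_ip)       # CDN's ADNS answers (step 5)
            else:
                raise LookupError(hostname)

    if __name__ == "__main__":
        # A hypothetical LDNS resolving on behalf of its clients (steps 1-6).
        print(resolve("images.foo.com", "192.0.2.53"))

The point of the sketch is that the origin's ADNS never hands out a server address itself; the CNAME hands control of the final answer to the CDN's ADNS, which can return a different IP per query.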

By returning different IP addresses to different queries, the ADNS can direct different HTTP requests to different servers. This commonly forms the basis for transparent client request routing in replicated web sites, content delivery systems, and – more recently – platforms. In this mechanism, the DNS system operated by the CDN receives queries for hostnames of the accelerated URLs and resolves them into the IP address of a CDN node that the DNS system selects for a given query. While the mechanism for using DNS for these purposes is well understood, the algorithms and policies involved in request routing and load balancing are still a subject of active research. In fact, these algorithms and policies represent the "secret sauce" of various CDN providers [6, 58, 15]. DNS-based redirection, however, exhibits several limitations, which we will discuss in Chapter 3. In this dissertation, we have constructed a Web-based measurement to study the properties of the groups of clients "hiding" behind their LDNSs (referred to as LDNS clusters) from the perspective of their effect on client request routing. In this measurement we have highlighted the limitations of DNS-based redirection and investigated the challenges in reconstructing LDNS clusters from the perspective of an external observer.

As an alternative to the DNS-based redirection mechanism, this dissertation revisits IP anycast as a redirection technique. IP anycast, although examined early in the CDN evolution process [17], was considered infeasible at the time. IP anycast refers to the ability of the IP routing and forwarding architecture to allow the same IP address to be assigned to multiple endpoints, and to rely on Internet routing to select between these different endpoints. Endpoints with the same IP address are then typically configured to provide the same service. For example, IP anycast is commonly used to provide redundancy in the DNS root-server deployment [42]. Similarly, in the case of a CDN, all endpoints with the same IP anycast address can be configured to be capable of serving the same content.

Figure 1.2: Anycast Based Redirection

Figure 1.2 shows an example of a server deployment utilizing anycast as a redirection mechanism. Servers A and B in our example are deployed in the Internet with the same (anycast) IP address and are responsible for providing the same content for foo.com. Server A is connected to access routers R1 and R2, while Server B is connected to access routers R4 and R5. When a client tries to fetch the content from foo.com, the first step is to send a DNS request through its LDNS. Eventually, the DNS request reaches the ADNS of foo.com. The ADNS replies back with a single

IP regardless of the requester. Upon receiving the DNS response, the client sends its HTTP request directly to the anycast address. (All DNS interactions are omitted from Figure 1.2 for clarity.) By virtue of anycast, clients' requests will follow the most proximal path towards the anycast destination. In our example, for simplicity,

we are assuming proximity is solely based on the number of hops traversed to reach the destination. So from the perspective of client 1, the anycast destination is either a single hop away using access router R1 or two hops away using access router R6. Thus the shortest path will be through R1, and client 1 will be connected to server A.

The same applies to the other clients, in which case client 2 will choose the direct connection through R2, and so on. IP anycast fits seamlessly into the existing Internet routing mechanisms. As Figure 1.2 shows, IP anycast packets are routed "optimally" from an IP forwarding perspective. That is, for a set of anycast destinations within a particular network, packets are routed along the shortest path and thus follow a proximally optimal path from the network perspective. Packets traveling towards a network advertising an IP anycast prefix will similarly follow the most proximal path towards that destination within the constraints of the inter-provider peering agreements along the way.

Another problem facing not only CDNs but also the entire Internet apparatus is the scarcity of available IPv4 addresses. The Internet Protocol (IPv4), which is responsible for routing packets, is the principal protocol that establishes the Internet. IPv4 is currently used for routing almost all Internet traffic. However, IPv4 only supports 4 billion globally routed IP addresses, inherently limiting the number of globally routable addresses on the Internet. The Internet has grown to the extent that we are witnessing the exhaustion of IPv4 addresses. Address exhaustion is a major factor for the transition to the next major version of the IP protocol, that is, IPv6. IPv6 was developed to solve the issue of IPv4 address exhaustion. In addition to providing the world with a larger address space (3.4 × 10^38 IP addresses), IPv6 also provides classes of IP addresses that are of interest to this dissertation. One of these classes is anycast addresses.

Unfortunately, IPv6 anycast has been kept in the shadows in terms of specifications, implementation and deployment. In fact, in the past, the designers of IPv6

imposed restrictions on IPv6 anycast to prevent any packet from having an anycast address in the source field.

The main contributions of this dissertation can be summarized as follows:

This dissertation quantifies the effect of the above limitations of DNS-based request routing in CDNs by showing that the currently prevalent DNS-based request routing is fundamentally affected by the sets of hosts sharing LDNS servers (which we call LDNS clusters). Chapter 2 presents a measurement study [12] that was carried out on a busy consumer-oriented Web site. We report on a large-scale measurement of clusters of hosts sharing the same LDNS servers. Our analysis is based on a measurement that was carried out for 28 days, during which over 21 million unique client-LDNS associations were collected.

This dissertation presents a practical mechanism for replacing DNS-based with anycast-based request routing [9, 10]. In Chapter 3, we show that our proposed CDN architecture effectively addresses the long-known drawbacks of anycast request routing, allowing us to reconsider the practicality of this mechanism. We also present practical load balancing algorithms that take into consideration the practical constraints of a CDN. We use server logs from an operational production CDN to evaluate our algorithms by trace-driven simulation and illustrate their benefit by comparing with native IP anycast and an idealized load-balancing algorithm, as well as with the current DNS-based approach.

Finally, this dissertation addresses the issue of transitioning to IPv6 in Chapters 4 and 5. In Chapter 4, this dissertation shows that there is virtually no performance penalty for a Web site to unilaterally enable IPv6 support [13]. In Chapter 5, we propose a lightweight architecture for implementing IPv6 anycast for connection-oriented transport [11]. The proposed architecture preserves security and privacy, and utilizes the best of anycasting. The proposed architecture leverages IP-based binding while maintaining transport-layer awareness. We also present an architecture

of an anycast IPv6 CDN that utilizes the proposed IPv6 anycast architecture as the redirection mechanism. Our proposed CDN architecture achieves the finest-granularity control in terms of request routing. The proposed IPv6 anycast CDN can redirect clients at the granularity of a single session. This capability is of benefit for both the

CDNs and clients. And since this architecture is based on anycast, it inherently routes clients to the "best" anycast server.

Chapter 2

The Anatomy of LDNS Clusters: Findings and Implications for Web Content Delivery

2.1 Introduction

In DNS-based redirection systems, the ADNS is the entity that is responsible for selecting the service server for a given request. However, the ADNS only knows the identity of the requesting LDNS and not the client that originated the query. In other

words, the LDNS acts as the proxy for all its clients. We call the group of clients "hiding" behind a common LDNS server an "LDNS cluster". DNS-based request routing systems can only distribute client demand among data centers or servers at the granularity of entire LDNS clusters, leading to two fundamental problems:

the hidden load problem [29], which is that a single load balancing decision may lead to an unforeseen amount of load shift, and the originator problem [70], which is that when the request routing apparatus attempts to route clients to the nearest server, the apparatus only considers the location of the LDNS and not the clients behind it.

Various proposals were put forward to include the identity of the client in the LDNS queries, but they have not been adopted, presumably because these proposals are incompatible with shared LDNS caching, where the same response can be reused by multiple clients. There is another such proposal (the Faster Internet initiative) currently underway, spearheaded by [38], in which a truncated version of the client IP address is passed as an extension to the DNS request and forwarded to the CDN's DNS servers to help them better route clients. However, most CDNs, including Akamai, do not support the use of this extension, as it might allow external users to scan the CDN's infrastructure [80, 22]. In this dissertation we begin by studying properties of LDNS clusters from the perspective of their effect on client request routing. We have instrumented a measurement to characterize such LDNS clusters from the vantage point of a busy

Web site, i.e., based on the activity seen by this site. For instance, when we consider an LDNS cluster size, this size reflects the clients that visited the studied Web site over the duration of the experiment. There may be other hosts behind this LDNS that we would not have seen. Our vantage point, however, is an example of what a busy Web site may face when performing request routing.

2.2 Related Work

Among the previous client-side DNS studies, Liston et al. [59] and Ager et al. [4] measured LDNS by resolving a large number of hostnames from a limited set of client vantage points (60 in one case and 75 in the other), Pang et al. [68] used access logs from Akamai as well as active probes, and [32] based their studies on large-scale scanning for open resolvers. Our goal in this dissertation is a broad characterization of clients’ LDNS clusters from the perspective of a busy Web site. Both Ager et al. [4] and Huang et al. [47] compared the performance implica-

10 tions of using public DNS resolvers, such as Google DNS, with ISP-deployed resolvers and found the former to be at significantly greater distances from clients. Further, Huang et al. considered the geographical distance distribution between clients and their LDNSs (Figure 5 in [47]). Our measurement technique is an extension of [62],

which we augmented to allow measurement of LDNS pools. Bermudez et al. proposed a tool that combines a packet sniffer and analyzer to associate content flows with DNS queries [18]. This tool is targeted to operators of client-side access networks, in particular to help them understand which content

comes from third-party platforms, while our approach is website-centric, with the goal of characterizing LDNS clusters to inform DNS-based request routing. As an alternative to the Faster Internet initiative mentioned earlier [38], Huang et al. [46] recently proposed a different method to inform Web sites about clients behind the LDNS. This proposal does not require changes to DNS, instead it makes modification to hostnames (URLs) so that they carry augmented information about clients to facilitate server selection.

2.3 System Instrumentation

To characterize LDNS clusters (i.e., sets of hosts behind a given LDNS), we need to associate hosts with their LDNSs. We used an enhanced approach from our prior work [62] to gather our measurements. As shown in Figure 2.1, we deploy a measurement machine that runs both a custom authoritative DNS server for a domain we registered for the purpose of this experiment (dns-research.com) and a custom Web server. The Web server hosts a special image URL, dns-research.com/special.jpg.

When a user accesses this URL, the following steps occur:

Figure 2.1: Measurement Setup.

• The user's browser sends a DNS query to its local DNS server to resolve dns-research.com. We call dns-research.com a "base domain" and a query for it a "base query" (step 1 in the figure).

• The LDNS recursively resolves this query, ultimately reaching our DNS server (steps 2 and 3), and returns the result (the IP address of our measurement machine) to the client (step 4).

• The client sends the HTTP request for special.jpg to our Web server (step 5). Our server responds with an HTTP redirect (“302 Moved”) specifying another URL in the dns-research.com domain (step 6). Our server constructs this new URL dynamically by embedding the client’s IP address into the hostname of the

URL. For example, when our Web server receives a request for special.jpg from client 206.196.164.138, the Web server replies to the client with the following redirection link: 206_196_164_138.sub1.dns-research.com/special.jpg.

• Following the redirection, the client issues another DNS request to its LDNS for

a hostname that embeds its own IP address; in the example above (step 7) the requested hostname is 206_196_164_138.sub1.dns-research.com. The LDNS eventually sends this request to our DNS server (step 8), which can now record both the IP address of the LDNS that sent the query and the IP address of its associated client that had been embedded in the hostname. Thus, the association of the client and its LDNS is accomplished.
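As an illustration of this association step, the short sketch below shows how a log processor on the DNS side could recover the client IP that the Web server embedded in the sub1 hostname. The function names and the sample LDNS address are invented for the example; they are not taken from the dissertation's actual tooling.

    # Recover the client/LDNS association from a logged sub1 query (illustrative sketch).
    # Example qname format (from the text): 206_196_164_138.sub1.dns-research.com
    def client_from_qname(qname):
        """Return the client IP embedded in a *.sub1.dns-research.com query name."""
        label = qname.split(".", 1)[0]          # "206_196_164_138"
        return label.replace("_", ".")          # "206.196.164.138"

    def record_association(qname, ldns_ip, associations):
        """Store one (client IP, LDNS IP) pair, as done when the sub1 query arrives."""
        associations.add((client_from_qname(qname), ldns_ip))

    if __name__ == "__main__":
        pairs = set()
        # Hypothetical LDNS source address for the example query.
        record_association("206_196_164_138.sub1.dns-research.com", "198.51.100.7", pairs)
        print(pairs)   # {('206.196.164.138', '198.51.100.7')}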

In the original approach of [62], the ADNS server would now complete the interaction by resolving the query to the IP of our Web server, which would respond to the client's HTTP request with a 1-pixel image. We augmented this approach as follows.

Our ADNS responds to a query for *.sub1.dns-research.com with the corresponding CNAME *.sub2.dns-research.com, where * denotes the same string representing the client's IP address (step 9), forcing the client to perform another DNS resolution for the latter name. Moreover, our DNS server includes its own IP address in the authority section of its reply to the "sub1" query, which ensures that the LDNS sends the second request (the "sub2" request) directly to our DNS server. We added this functionality to discover LDNS pools (Section 2.9.2). Upon receiving the "sub2" query (step 10), our ADNS returns its own IP address (steps 11-12), which is also the IP address of our Web server, and the client performs the final HTTP download of our special image (steps 13-14). We have partnered with a high-volume consumer-oriented Web site1, which embedded the base URL for our special image into their home page. This allowed us to collect a large amount of measurement data, as discussed below. To obtain repeated measurements from a given client, we used a low TTL of 10 seconds for our DNS records – lower than any CDN we are aware of – and added a "cache-control: no-cache" HTTP header field to our HTTP responses.

1Part of the conditions for this collaboration is that we are unable to name the site.
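For concreteness, here is a minimal sketch of the redirect step (steps 5-6 above), written with Python's standard http.server module. The dissertation does not give the actual server implementation, so the handler below is an assumed illustration of the client-IP-embedding trick rather than the real measurement code.

    # Minimal sketch of the measurement Web server's redirect step (steps 5-6).
    # Assumed illustration only; the real server is not described in the text.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/special.jpg":
                # Embed the requesting client's IP into the sub1 hostname.
                embedded = self.client_address[0].replace(".", "_")
                target = "http://%s.sub1.dns-research.com/special.jpg" % embedded
                self.send_response(302)                        # "302 Moved"
                self.send_header("Location", target)
                self.send_header("Cache-Control", "no-cache")  # defeat HTTP caching
                self.end_headers()
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), RedirectHandler).serve_forever()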

2.4 The Dataset

The measurement data included the DNS and HTTP logs collected at our measure- ment host. The DNS logs contained the timestamp of the query, the IP address of the requesting LDNS, query type, and query string, and the HTTP logs contained the request time and User-Agent and Host headers.

We conducted our measurements over 28 days, from Jan 5th, 2011 to Feb 1st. During this period, we collected a total of over 67.7 million sub1 and sub2 DNS requests and around 56 million HTTP requests for the final image (steps 13/14 in Figure 2.1; we refer to these final HTTP requests as simply HTTP requests in the rest of the chapter, but stress that we do not include the initial redirected HTTP requests in steps 5-6 of the setup in any of the results). The higher number of HTTP requests compared to DNS queries (indeed, as Figure 2.1 shows, a client access should generate a sub1 and a sub2 DNS request for a final HTTP request) is due to the well-known fact that clients and LDNSs reuse DNS responses much longer than the TTL values assigned by the ADNS [66]. We verified that some HTTP accesses occur long past the 10s TTL after the preceding sub1 and sub2 queries.

Table 2.1: High-level dataset characterization

    Unique LDNS IPs                 278,559
    Unique Client IPs            11,378,020
    Unique Client/LDNS IP Pairs  21,432,181

Table 2.1 shows the overall statistics of our dataset. Our measurements include over 11.3M clients and almost 280K LDNS resolvers representing, respectively, 17,778 and 14,627 autonomous systems (ASs). We have obtained over 21M unique associations between these clients and LDNSs, where an association (or pair) connects a client and the LDNS used by this client for a DNS resolution. Tables 2.2 and 2.3 summarize the client breakdown

as observed from the User-Agent HTTP request header. (Related entries are grouped together, e.g., iOS devices are included in the Mac OS category.)

Table 2.2: Clients OS breakdown

    Operating System   # Connections   % Connections
    Windows              169,199,808           93.54
    Mac OS                10,866,529            6.01
    Linux                    635,981            0.35
    PS                        76,633            0.04
    Android                   48,847            0.03
    BlackBerry                23,618            0.01
    Google                    14,407            0.01
    Others                    14,618            0.01

The client breakdown data is generally as expected, except for a greater domination of the Microsoft Internet Explorer (IE) browser than reported by some commercial measurement firms such as [63].

We refer to all clients that used a given LDNS as the LDNS cluster. Thus, an LDNS cluster contains one LDNS IP and all clients that used that LDNS in our experiment. Note that the same client can belong to multiple LDNS clusters if it used more than one LDNS during our experiment.

Table 2.3: Clients browsers breakdown

    Browser            # Connections   % Connections
    Microsoft IE         127,294,806           70.38
    Mozilla Firefox       37,361,521           20.66
    Google Chrome          8,058,623            4.46
    Safari                 7,982,430            4.41
    Rim                       20,813            0.01
    Opera                      9,148            0.01
    Others                   153,100            0.08

2.5 Cluster Size

We begin by characterizing LDNS clusters in terms of their size. This is important to DNS-based server selection because of the hidden load problem [29]: a single DNS response to an LDNS will direct HTTP load to the selected server from all clients behind this LDNS for the TTL duration. Uneven hidden loads may lead to unexpected results from the load balancing perspective. On the other hand, knowing the activity characteristics of different clusters would allow one to take hidden loads into account during the server selection process. For example, dynamic adjustments of the TTL in DNS responses to different LDNSs can be used to compensate for different hidden loads [28, 29]. We characterize LDNS cluster sizes from two perspectives: the number of clients behind a given LDNS and the amount of activity originating from all clients in the cluster. We should stress that the former is done purely based on IP addresses, and our use of the term "client" is simply a shorthand for "client IP address". It has been shown that IP addresses may not be a good representation of individual hosts due to the presence of network address translation boxes and dynamic IP addresses [60, 25].
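As a rough illustration of how the two cluster views above can be derived from the collected logs, the following sketch groups (client IP, LDNS IP) associations into clusters and counts per-LDNS sub1 activity. The tuple layout of the log entries is an assumption made for the example, not the format of the actual dataset.

    # Sketch: derive LDNS cluster sizes and activity from sub1 association logs.
    # The tuple format below is assumed for illustration only.
    from collections import defaultdict

    def cluster_stats(dns_log):
        """dns_log: iterable of (timestamp, ldns_ip, client_ip) sub1 events."""
        clients_per_ldns = defaultdict(set)    # cluster size (unique client IPs)
        sub1_per_ldns = defaultdict(int)       # cluster activity (sub1 requests)
        for _ts, ldns_ip, client_ip in dns_log:
            clients_per_ldns[ldns_ip].add(client_ip)
            sub1_per_ldns[ldns_ip] += 1
        sizes = {ldns: len(clients) for ldns, clients in clients_per_ldns.items()}
        return sizes, dict(sub1_per_ldns)

    if __name__ == "__main__":
        log = [(0, "ldns-A", "1.2.3.4"), (5, "ldns-A", "1.2.3.5"), (9, "ldns-B", "5.6.7.8")]
        sizes, activity = cluster_stats(log)
        print(sizes)     # {'ldns-A': 2, 'ldns-B': 1}
        print(activity)  # {'ldns-A': 2, 'ldns-B': 1}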

2.5.1 Number of Clients

Figure 2.2 shows the CDF of LDNS cluster sizes, while the inset subfigure shows the sizes of the 1000 largest LDNS clusters (in increasing order of cluster size). We found that a vast majority of LDNS clusters are small: over 90% of LDNS clusters contain fewer than 10 clients. This means that most clusters do not provide much benefit of a shared DNS cache to their clients when they access our partner Web site.

Figure 2.2: Distribution of LDNS cluster sizes.

To see the potential impact on clients, Figure 2.3 shows the cumulative percentage of sub1 requests issued by LDNSs representing clusters of different sizes as well as the cumulative percentage of their client/LDNS associations. More precisely, for a given cluster size X, the corresponding points on the curves show the percentage of sub1 requests issued by LDNS clusters of size up to X, and the percentage of all client/LDNS associations belonging to these clusters. As seen in the figure, small

clusters, with fewer than 10 clients, only contribute less than 10% of all sub1 requests and comprise less than 1% of all client/LDNS associations. Thus, even though these small clusters represent over 90% of all LDNS clusters, an overwhelming majority of clients belong to larger clusters, which are also responsible for most of the activity. As a result, most clients are not affected by limited shared DNS caching in small clusters.

Moreover, despite the prevalence of DHCP-driven DNS configuration of end-hosts and – more recently – anycasted resolvers, both facilitating distributed resolver infrastructures, we still observed a few "elephant" clusters. The largest cluster (with LDNS IP 167.206.254.14) comprised 129,720 clients and it alone was responsible for

almost 1% of all sub1 requests. Elephant clusters may dramatically affect load distribution, and their small number suggests that it might be warranted and feasible to identify and handle them separately from the rest of the LDNS population. Overall, the size of LDNS clusters ranged from 1 to 129,720 clients, with the average size being 76.94 clients. We further consider the top-10 elephant LDNS clusters in Section 2.7.3.

Figure 2.3: Distribution of sub1 requests and client/LDNS pairs attributed to LDNS clusters of different sizes

2.5.2 Cluster Activity

We now turn to characterizing the activity of LDNS clusters. We characterize it by the number of their sub1 requests as well as by the number of the final HTTP requests. Since a client may belong to multiple LDNS clusters (e.g., when it used a different LDNS at different times), we associate an HTTP request with the last LDNS that was used by the client prior to the HTTP request in question.

Figure 2.4: LDNSs Activity in terms of DNS and HTTP requests.

Figure 2.4 shows the CDF of the number of sub1 queries issued by LDNSs, as well as the CDF of the number of HTTP requests issued by clients behind each LDNS during our experiment. Again, both curves in the figure indicate that there are only a small number of high-activity clusters. Indeed, 35% of LDNSs issued only one sub1 request, and 96% of all LDNSs issued fewer than 100 sub1 requests over the entire experiment duration. Yet the most active LDNS sent 303,042 sub1 requests. The HTTP activity presents similar trends, although we do observe some hidden load effects even among low-activity clusters: whereas 35% of LDNSs issued a single DNS query, fewer than 20% of their clusters issued a single HTTP request. This is due to DNS caching, which often extends beyond our low TTL of 10s. Overall, our observations of LDNS cluster sizes, both from the number-of-clients and activity perspectives, confirm that platforms using DNS-based server selection may benefit from treating different LDNSs differently. At the same time, they may

only need to concentrate on a relatively small number of “elephant” LDNSs for such special treatment.

2.6 TTL Effects

The above analysis considered the LDNS cluster activity over the duration of the experiment. However, platforms that use DNS-based server selection, such as CDNs, usually assign relatively small TTL to their DNS responses to retain an opportunity for further network control. In this section, we investigate the hidden loads of LDNS

clusters observed within typical TTL windows utilized by CDNs, specifically 20s (used by Akamai), 120s (AT&T's ICDS content delivery network) and 350s (Limelight). To obtain these numbers, we use our DNS and HTTP traces to emulate the clients' activity under a given TTL. The starting idea behind this simulation

is simple: the initial sub1 query from an LDNS starts a TTL window, and then all subsequent HTTP activity associated with this LDNS (using the procedure described in Section 2.5.2) is "charged" to this window; the next sub1 request beyond the current window starts a new window. However, two subtle points complicate this procedure. First, if, after the initial sub1 query to one LDNS, the same client sends another DNS query through a different LDNS within the emulated TTL window (which can happen since the actual TTL in our experiments was only 10s), we "charge" these subsequent queries and their associated HTTP activity to the TTL window of the first LDNS. This is because, with the longer TTL, these subsequent queries would not have occurred, since the client would have reused the initial DNS response from its cache. Second, confirming the phenomenon previously measured in [66], we have encountered a considerable number of requests that violated TTL values, with violations sometimes exceeding the largest TTL values we simulated (350s). Consequently, in reporting the hidden loads per TTL, we use two lines for each TTL value. The lines labeled "strict" reflect only the HTTP requests that actually fell into the TTL window.2

2 For simplicity of implementation, we also counted HTTP requests whose corresponding sub* queries were within the window but the HTTP requests themselves were within our real TTL of 10s past the window. There were a very small number of such requests (a few thousand out of 56M total), thus this does not materially affect our results.
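The following is a schematic sketch of the window-charging emulation described above, under simplifying assumptions: it ignores the cross-LDNS caveat and the TTL-violation ("non-strict") bookkeeping discussed in the text, and the event-tuple format is invented for the example.

    # Sketch of the TTL-window emulation: charge each HTTP request to the TTL
    # window opened by the client's most recent sub1 query (simplified).
    from collections import defaultdict

    def hidden_loads(dns_events, http_events, ttl):
        """dns_events: (time, client, ldns) sub1 queries; http_events: (time, client).
           Returns {(ldns, window_start): number of HTTP requests charged to it}."""
        windows = {}                      # client -> (ldns, window_start)
        loads = defaultdict(int)
        events = [(t, "dns", c, l) for t, c, l in dns_events] + \
                 [(t, "http", c, None) for t, c in http_events]
        for t, kind, client, ldns in sorted(events):
            if kind == "dns":
                current = windows.get(client)
                # A new window starts only if no window is still open for this client.
                if current is None or t >= current[1] + ttl:
                    windows[client] = (ldns, t)
            else:
                current = windows.get(client)
                if current is not None and t < current[1] + ttl:
                    loads[current] += 1   # "strict": request falls inside the window
        return dict(loads)

    if __name__ == "__main__":
        dns = [(0, "c1", "ldnsA"), (50, "c1", "ldnsA")]
        http = [(5, "c1"), (15, "c1"), (55, "c1")]
        print(hidden_loads(dns, http, ttl=20))
        # {('ldnsA', 0): 2, ('ldnsA', 50): 1}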

20 1

0.8

0.6 CDF 0.4

0.2 Clients in 20s TTL Clients in 120s TTL Clients in 350s TTL 0 1 10 100 Client IPs per LDNS per TTL

Figure 2.5: LDNS cluster sizes within TTL windows (all windows).

Thus, these results ignore requests that violate the TTL value. The "non-strict" lines include these violating HTTP requests and count them towards the hidden load of the TTL window to which the associated DNS query was assigned. Figure 2.5 shows the CDF of the LDNS cluster sizes observed for each LDNS in each TTL window, i.e., each LDNS contributed a separate data point to the CDF for each window. The majority of windows, across all LDNSs, contained only one client. As expected, the larger the TTL window, the larger the number of clients an LDNS serves. Still, only around 10% of TTL intervals had more than 10 clients under a TTL of 350s, and less than 2% of the intervals had more than 10 clients with

a TTL of 120s. Figure 2.6 shows the CDF of the average in-TTL cluster sizes for LDNSs across all their TTL intervals. That is, each LDNS contributes only one data point to the CDF, reflecting its average cluster size over all its TTL intervals.3

3 The average in-TTL cluster sizes per LDNS may better reflect the kind of input data available to a request routing algorithm.

Figure 2.6: Average LDNS cluster sizes within a TTL window (averaged over all windows for a given LDNS).

The average in-TTL cluster sizes are even smaller, with virtually all LDNSs exhibiting an average in-TTL cluster size below 10 clients under all TTL values. The difference between the two figures is explained by the fact that busier LDNSs (i.e., those showing up with more clients within a TTL) tend to appear more frequently in the trace, thus contributing more data points in Figure 2.5. To assess how the hidden load of LDNSs depends on TTL, Figures 2.7 and 2.8 show, respectively, the CDFs of the number of HTTP requests in all TTL windows and the average in-TTL number of HTTP requests for all LDNSs across their TTL intervals.

Figure 2.7: HTTP requests within TTL windows (all windows).

A few observations are noteworthy. First, the difference between the strict and non-strict lines in Figure 2.7 indicates violations of the TTL we considered; as expected, these violations decrease for larger TTL and, importantly, all but disappear for a TTL of 350 sec. This shows that at these TTL levels, a CDN might not need to be concerned about the unforeseen effect of these violations on hidden load.

Second, while the sizable differences in hidden loads among some LDNSs for some TTL values are significant, their absolute values are small overall: virtually all windows contain fewer than 100 requests even for the largest TTL of 350s (Figure 2.7). Thus, low TTL values are important not for proper load-balancing granularity in routine operations but mostly to react quickly to unforeseen flash crowds. It is obviously undesirable to have to pay overhead on routine operation while using it only for extraordinary scenarios. A better knob would be desirable and should be given consideration in future Internet architectures.

2.7 Client-to-LDNS Proximity

We consider the proximity of clients to their LDNS servers, which determines the severity of the originator problem and can have other implications for proximity-based request routing. Prior studies [76, 62] looked at several proximity metrics, including TCP traceroute divergence, network delay difference as seen from a given external vantage point, and autonomous system sharing (how many clients reside in the same AS as their LDNS servers). We revisit the AS-sharing metric, but instead of the other metrics, which are vantage-point dependent, we consider the air-mile distance between clients and their LDNSs. Ideally we would also have liked to know the network delay between these parties, but we have no visibility into this metric from our vantage point.

Figure 2.8: Average number of HTTP requests per LDNS within a TTL window (averaged over all windows for a given LDNS).

2.7.1 Air-Miles Between Client and LDNS

We utilized the GeoIP city database from MaxMind [64], which provides geographic location information for IP addresses, to study geographical properties of LDNS clusters. Using the database dated February 1, 2011 (so that our analysis would reflect the GeoIP map at the time of the experiment), we mapped the IP addresses of the clients and their associated LDNSs and calculated the geographical distance ("air-miles") between them.
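The "air-miles" between a client and its LDNS reduce to a great-circle distance between the two geolocated coordinates. A small sketch is shown below; the latitude/longitude values are illustrative stand-ins for what a GeoIP lookup would return, not data from the study.

    # Great-circle ("air-miles") distance between two geolocated IPs.
    # Coordinates below are illustrative; in the study they come from GeoIP lookups.
    from math import radians, sin, cos, asin, sqrt

    EARTH_RADIUS_MILES = 3958.8

    def air_miles(lat1, lon1, lat2, lon2):
        """Haversine distance in miles between two (lat, lon) points in degrees."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
        return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

    if __name__ == "__main__":
        # e.g., a client geolocated near Cleveland, OH and an LDNS near New York, NY
        print(round(air_miles(41.50, -81.69, 40.71, -74.01)))   # roughly 400 miles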

24 1

0.9

0.8

0.7

0.6

0.5 CDF 0.4

0.3

0.2

0.1

0 0 2000 4000 6000 8000 10000 12000 AirMiles between Clients and their LDNSs

Figure 2.9: Air miles for all client/LDNS pairs

Figure 2.9 shows the cumulative distribution function (CDF) of air-miles of all client/LDNS pairs. The figure shows that clients are sometimes situated surprisingly far from their LDNS servers. Only around 25% of all client/LDNS pairs were less than 100 miles apart while 30% were over 1000 miles apart. This suggests an inherent limitation to how accurate, in terms of proximity, DNS-based server selection can be. We note that our measurements show significantly greater distances than previously measured in [47] (see Section 2.2 for more details).

2.7.2 Geographical Span

We are also interested in the geographical span of LDNS clusters. Geographically compact clusters are more amenable to proximity routing than spread-out ones. If a content platform can distinguish between these kinds of clusters, it could treat them differently: requests from an LDNS representing a concentrated cluster could be preferentially resolved to a highly proximal content server, while requests from LDNSs

25 representing spread-out clusters could be used to even out load with less regard for proximity. This would result in more requests resolving to proximal content servers when it actually counts.


Figure 2.10: Avg client/LDNS distance in top LDNS clusters

In Figure 2.10 we focus on the LDNSs with more than 10 clients, which represent almost 10% of all LDNSs in the data set (results on small clusters are reported in Figure 2.11). Figure 2.10 plots, for each such LDNS server, the average air-miles from this server to all its clients and the number of clients for the same LDNS. The X-axis shows LDNSs sorted by the size of their client cluster and, within LDNSs of equal cluster size, by the average air-miles distance. The Y-axis shows the average air-miles and the number of clients for these top LDNS clusters. The "teeth" in the graph are due to the above sorting of LDNSs: each "tooth" represents a set of all LDNS clusters with the same number of clients, and the tooth-like shape reflects the fact that each such set, except for the sets comprising the largest clusters, contains clusters with an average geographical span ranging from 0 up to 10,000 miles. Even for LDNS clusters of size 1, Figure 2.11 shows that there exist LDNS clusters with more than 10,000 air miles between the LDNS and its client. The majority of 1-client clusters have an air-miles distance of 0 between the LDNS and its client. In Section 2.9 we investigate these clusters and report on the reason

for such an observation. As the number of clients increases, the variation of geographical span among clusters narrows but still remains significant, with order-of-magnitude differences between clusters. This provides evidence in support of differential treatment of LDNSs, not just with respect to their differences in size and activity, as we saw in Sections 2.5 and 2.5.2, but also with respect to proximity-based server selection.

Figure 2.11: Avg client/LDNS distance for all LDNS clusters

2.7.3 AS Sharing

Another measure of proximity is the degree of AS sharing between clients and their LDNSs. Figure 2.12 shows this information from, respectively, the LDNS and the client perspective. The LDNS perspective reflects, for a given LDNS, the percentage of its associated clients that are in the same AS as the LDNS itself. The clients' perspective considers, for a given client, the percentage of its associated LDNSs that are in the same AS as the client itself.

Figure 2.12: CDF of LDNS clusters with a given % of clients/LDNSs outside their LDNS's/Client's autonomous system.

While almost 77% of LDNSs have all their clients in the same AS as they are, 15% of LDNSs have over half of their clients outside their AS and 10% have all their clients in a different AS. From the clients' perspective, we found that more than 9 million clients have all their LDNSs in their AS, while nearly 2 million (almost 17%) have all their LDNSs in a different AS. Only a small number of clients – over 180K – had a mix of some LDNSs within and some LDNSs outside the client's AS. Such a strong dichotomy (i.e., that clients either had all or none of their LDNSs in their own

AS) is explained by the fact that most clients associate with only a small number of LDNSs.
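The per-LDNS AS-sharing percentage above can be computed directly from the association pairs and an IP-to-AS mapping; the sketch below assumes both input structures, including the private-use AS numbers and addresses, purely for illustration.

    # Sketch: fraction of each LDNS's associated clients that share the LDNS's AS.
    # ip_to_as is an assumed lookup table (in practice, an IP-to-AS database).
    from collections import defaultdict

    def as_sharing(pairs, ip_to_as):
        """pairs: iterable of unique (client_ip, ldns_ip) associations.
           Returns {ldns_ip: fraction of its clients in the same AS}."""
        same = defaultdict(int)
        total = defaultdict(int)
        for client_ip, ldns_ip in pairs:
            total[ldns_ip] += 1
            if ip_to_as.get(client_ip) == ip_to_as.get(ldns_ip):
                same[ldns_ip] += 1
        return {ldns: same[ldns] / total[ldns] for ldns in total}

    if __name__ == "__main__":
        ip_to_as = {"1.1.1.1": 64500, "1.1.1.2": 64500, "2.2.2.2": 64501, "9.9.9.9": 64500}
        pairs = [("1.1.1.1", "9.9.9.9"), ("1.1.1.2", "9.9.9.9"), ("2.2.2.2", "9.9.9.9")]
        print(as_sharing(pairs, ip_to_as))   # {'9.9.9.9': 0.666...}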

Table 2.4: Activity of client-LDNS associations sharing the same AS

    DNS requests   HTTP requests
          73.79%          81.97%

Our discussion so far concerned the prevalence of AS sharing in terms of client population. In other words, each client-LDNS association is counted once in those

statistics. However, different clients may have different activity levels, and we now consider the prevalence of AS sharing from the perspective of clients' accesses to the Web site. Table 2.4 shows the fraction of client activity stemming from client-LDNS associations that share the same AS. The first column reports the fraction of all sub1

and sub2 request pairs with both the client and LDNS belonging to the same AS. This metric reflects definitive information but it only approximates the level of client activity because of the 10s TTL we use for sub1 and sub2 responses: we do not expect the same client to issue another DNS query for 10 seconds (or longer, if the client

violates TTLs) no matter how many HTTP requests it issues within this time. The second column shows the fraction of all HTTP requests such that the preceding DNS query that originated from the same client used an LDNS in the same AS as the client. This metric reflects definitive activity levels but is not iron-clad in attributing

the activity to a given client/LDNS association. First, we note that the prevalence of AS sharing measured based on activity is somewhat lower than the prevalence of AS sharing measured based on client populations. Second, these levels of AS sharing are still significantly higher than those reported in the 10-year-old study [62] (see Table 5 there). This is an encouraging development for DNS-based request distribution. Overall, while the prevalence of AS sharing increased from 10 years ago, we found a sizable fraction (15 - 17%) of client/LDNS associations where clients and

LDNSs reside in different ASs. One of the goals in server selection by CDNs, especially those with a large number of locations such as Akamai, is to find an edge server sharing the AS with the originator of the request [69]. Our data shows fundamental limits to the benefits from this approach.

2.8 Top-10 LDNSs and Clients

We investigated the top 10 LDNSs manually through reverse DNS lookups, "whois" records, and MaxMind ISP records for their IP addresses. The top-10 LDNSs in fact all belong to just two ISPs, which we refer to as ISP1 (LDNSs ranked 10-4) and ISP2 (ranked 3-1). The top three clusters of ISP2 contributed 1.6% of all unique client-LDNS associations in our traces and 2.33% of all sub1 requests.


Figure 2.13: AS sharing of top-10 LDNSs and their clients

The extent of the AS sharing for these clusters is shown in Figure 2.13. In the

figure, the bars for each cluster represent (from left to right) the number of clients sharing the AS with the cluster's LDNS, the number of clients in other ASs, and, for the latter clients, the number of different ASs they represent. The figure shows a very low degree of AS sharing in the top clusters. Virtually all clients belong to a different AS from the one where their LDNS resides, and each cluster spans dozens of different ASs. We further verified that these ASs belong to different organizations from those owning the corresponding LDNSs. Interestingly, the AS sharing is very similar between ISP1 and ISP2.

30 ;::::"

;:::"

;::"

;:"

;"

"!$&;" "!$&<" "!$&=" "!$&>" "!$&?" "!$&@" "!$&A" "!$&B" "!$&C" "!$&;:" 05+7"0-0#-.*1" #*)-'/"0-0#-.*1" C?2,"%*0(*/3.*"0-0#-.*1" CC2,"%*0(*/3.*"0-0#-.*1" F" .-*/21"G";::"#-.*1"

Figure 2.14: Air miles between top-10 LDNSs and their clients.

We also consider the geographical span of the top 10 clusters using MaxMind

GeoIP city database. Figure 2.14 shows the average and median air-miles distance between the LDNS and its clients, as well as the 95th and 99th percentiles of these distances, for each LDNS cluster. The last bar shows the percentage of clients in that cluster that are less than 100 miles away from the LDNS.

While Figure 2.13 shows that the top-10 LDNS clusters span dozens of ASs, which suggests topologically distant pairs, Figure 2.14 shows that these clusters are very compact, with most clients less than 100 miles away from their LDNSs. Further, although both ISPs exhibit similar trends, ISP2 clearly has a more ubiquitous LDNS infrastructure, and more of its customers can expect better service from CDN-accelerated Web sites. Indeed, ISP2's LDNSs have more than 99.9% of their clients within a 100-air-mile radius, with averages ranging between 20 and 43 air miles. Turning to the top 10 clients, they represent eight different firms and the top four appear to be firewalls or proxies (as they include "firewall", "WebDefense", and

"proxy" in their reverse DNS names). Figure 2.15 shows the extent to which these clients share their AS with their LDNSs.

Figure 2.15: AS sharing of top-10 clients and their LDNSs

The bars for each client represent (from left to right) the number of LDNSs sharing the AS with the client, the number of

LDNSs in other ASs, and for the latter LDNSs, the number of different ASs they represent. Reflecting more diverse ownership of these clients, their AS sharing behavior is also more diverse than that of top LDNSs. Still, 9 out of 10 of these clients had the majority of their LDNSs outside their AS, including one client with all its LDNSs

outside its AS. This was true even for clients representing major ISP proxies (clients ranked 6 and 4) or otherwise belonging to major ISPs (clients 10 and 7). At the same time, for 7 out of 10 clients, most of these external LDNSs were concentrated in just one (or a small number of) other ASs. Again, we verified that these external ASs do

in fact belong to different organizations. From this, we speculate that these clients, or their autonomous systems, subscribe to an external DNS service that employs a large farm of LDNS servers. Finally, Figure 2.16 examines the geographical distance of top-10 clients from

their LDNS servers. Two of the curves in the figure give a CDF of these distances

for every unique association of these clients and their LDNSs, as well as for each request coming from these top clients. For comparison, the figure provides similar CDFs for the top-10 LDNSs considered earlier. We see that the top clients are often situated much farther away from their LDNSs than the clients of the top LDNSs; in particular, only around 12% of the top-10 clients are within 100 miles of their LDNSs, as compared to 80% of the top-10 LDNSs being within 100 miles of their clients. Another observation is that top clients are either co-located with their LDNSs or use LDNSs that are hundreds of miles away. Furthermore, the volume of requests arriving from the top-10 clients that used co-located LDNSs contributed around 54% of all the activity despite constituting only 12% of all top-client/LDNS associations.

Figure 2.16: Air-miles for top-10 LDNSs and top-10 clients.

2.9 Client Site Configurations

We now discuss some noteworthy client and LDNS behaviors we observed in our experiments.

2.9.1 Clients OR LDNSs?!

Our first observation is a wide-spread sharing of DNS and HTTP behavior among

clients. Out of 278,559 LDNS servers in our trace, 170,137 (61.08%) also show up among the HTTP clients. We refer to these LDNSs as the "Act-Like-Clients" group. A vast majority of these LDNSs – 166,859, or 98% – have themselves among their own associated clients. We will call these LDNSs the Self-Served group. The other 3278 LDNS IP addresses always used different LDNSs when acting as HTTP

clients. Within the Self-Served group, we found that 149,013 of these LDNSs, or 53% of all the LDNSs in our dataset, had themselves as their only client during our experiment (“self-served-one-client”) while the remaining 17,846 LDNSs had other clients as well (“Self-Served-2+Clients”). Moreover, 105,367 of the self-served-one-

client client/LDNS IP addresses never used any other LDNS. We call them the "Self-Served-One2One" group. This leaves us with 43,646 LDNSs that had themselves as their only client but, in their client role, also utilized other LDNSs. This group will be called the "Self-Served-1Client2+LDNSs". Figure 2.17 summarizes the

distribution of these types of LDNSs.
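To make this taxonomy concrete, the following minimal Python sketch derives the groups from two hypothetical precomputed maps – the set of client IPs observed behind each LDNS and the set of LDNSs each client IP was observed using. It illustrates the definitions above and is not the actual analysis code.

def classify_ldns(ldns_clients, client_ldnss):
    """ldns_clients: dict mapping LDNS IP -> set of client IPs seen behind it.
    client_ldnss:  dict mapping client IP -> set of LDNS IPs it was seen using.
    Returns a dict mapping each LDNS IP to one of the group labels used above."""
    http_clients = set(client_ldnss)
    groups = {}
    for ldns, clients in ldns_clients.items():
        if ldns not in http_clients:
            groups[ldns] = "Not-Act-Like-Clients"
        elif ldns not in clients:
            groups[ldns] = "Act-Like-Clients, not Self-Served"
        elif clients != {ldns}:
            groups[ldns] = "Self-Served-2+Clients"
        elif client_ldnss[ldns] == {ldns}:
            groups[ldns] = "Self-Served-One2One"
        else:
            groups[ldns] = "Self-Served-1Client2+LDNSs"
    return groups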


Figure 2.17: Distribution of LDNS types

While a likely explanation for the Act-Like-Clients Not-Self-Served group is the reuse of dynamic IP addresses (so that the same IP address is assigned to an HTTP client at some point and to an LDNS host at another time), the Self-Served behavior could be caused by a common middle-box shared between the LDNS and its clients. In particular, the following two cases are plausible.

• Both clients and their LDNS are behind a NAT or firewall, which exposes a common IP address to the public Internet. A particular case of this configuration is when a home network configures its wireless router to act as an LDNS. Such a configuration is easily enabled on popular wireless routers (e.g., Linksys), although these routers often resolve their DNS queries through ISP LDNS servers [74].

• Clients are behind a proxy that acts as both HTTP proxy/cache and its own LDNS resolver.

We find support for this explanation using an approach similar to [60]. We utilized the User-Agent headers to identify hosts sharing the same middle-box based on their operating system and browser footprints. We considered an IP address a possible middle-box if it showed two or more operating systems or operating system versions, or three or more different browsers or browser versions. Out of the total 11.7M clients, we flagged only 686,651 (5.87%) clients that fall into this category.4 However, 51,864 clients among them were from the Self-Served

LDNS group, out of the total of 166K such LDNSs. Thus, multi-host behavior is much more prevalent among self-serving LDNSs than among the general client population, even though our technique misses single-host NAT'ed networks (which constitute a majority of NAT networks according to [25], although not according to [60]) and NATs all of whose hosts have the same platform.

4Compared to previous studies of NAT usage, notably [60] and [25], this finding is more in line with the latter. Note that our vantage point – from the perspective of a Web site – is also closer to [25] than to [60].
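As an illustration of the flagging heuristic above, the sketch below assumes the User-Agent headers have already been parsed offline into (client IP, OS string, browser string) tuples; this record layout and the sample values are assumptions for the example, not the actual instrumentation.

from collections import defaultdict

def flag_possible_middleboxes(records):
    """records: iterable of (client_ip, os_string, browser_string) tuples.
    An IP is flagged as a possible middle-box if it exhibits >= 2 distinct
    OS/OS-version strings or >= 3 distinct browser/browser-version strings,
    mirroring the heuristic described in the text."""
    os_seen = defaultdict(set)
    browsers_seen = defaultdict(set)
    for ip, os_string, browser_string in records:
        os_seen[ip].add(os_string)
        browsers_seen[ip].add(browser_string)
    return {ip for ip in os_seen
            if len(os_seen[ip]) >= 2 or len(browsers_seen[ip]) >= 3}

# Hypothetical log tuples:
sample = [("10.0.0.1", "Windows 7", "IE 9"),
          ("10.0.0.1", "Mac OS X 10.6", "Safari 5"),
          ("10.0.0.2", "Windows XP", "Firefox 3.6")]
print(flag_possible_middleboxes(sample))   # {'10.0.0.1'}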

An important question from a CDN perspective is whether these configurations deviate from the "regular" LDNS cluster behavior, in which case they might need special treatment in DNS-based demand distribution. For example, a proxy acting as its own LDNS might show up as a small single-client cluster yet impose a disproportionately high load on a CDN node as a result of a single server-selection decision by the CDN.


Figure 2.18: Cluster size distribution of LDNS groups.

Figure 2.18 compares the cluster sizes of self-served and other LDNSs. It shows that self-served LDNS clusters are much smaller than other clusters; in fact, they overwhelmingly contain only one client IP address. This finding is consistent with the behavior expected from a middle-box-fronted network. A more revealing finding is displayed in Figure 2.19, which compares the activity (in terms of sub1 requests) of the self-served LDNSs with other groups. The figure shows that the self-served LDNS clusters exhibit lower activity levels than the not-act-like-clients clusters. Thus, while middleboxes aggregate demand from several hosts behind a single IP address, these middleboxes seem to predominantly front small networks – smaller than other LDNS clusters.


Figure 2.19: The number of sub1 requests issued by LDNSs of different types.

To confirm the presence of demand aggregation in self-served LDNS clusters,

Figure 2.20 factors out the difference in cluster sizes and compares the activity of self-served and not-self-served LDNSs only for One2One clusters. There were 105,367 LDNSs/clients in the One2One self-served group and 27,640 in the One2One not-self-served group. Figure 2.20 shows that the One2One self-served LDNSs are in general indeed more active than the not-self-served LDNSs. For instance, 66% of the not-self-served LDNSs issued a single request, while only 46% of the self-served LDNSs did so. This increased activity of the self-served LDNSs is consistent with moderate aggregation of hosts behind a middle-box.

In summary, we found a large number of LDNSs operating from within middle-box-fronted networks – they are either behind the middleboxes or operated by the middleboxes themselves. However, while these LDNSs exhibit distinct demand aggregation, their clusters are, if anything, less active than other clusters. Thus, a middle-box-fronted LDNS in itself does not seem to be an indication for separate treatment

in DNS-based request routing.

Figure 2.20: Number of sub1 requests issued by One2One LDNSs.

2.9.2 LDNS Pools

We now consider another interesting behavior. As a reminder, our sub* DNS interactions start with a sub1 request issued by the LDNS to our setup, to which we reply with a sub2 CNAME, forcing the LDNS to send another query, this time for sub2. However, we observed occurrences in our traces where these two consecutive queries

(which we can attribute to the same interaction because both embed the same client IP address) came from different LDNS servers. In other words, even though we sent our CNAME response to one LDNS, we got the subsequent sub2 query from a different LDNS. Note that this phenomenon is distinct from the resolver clusters mentioned in [55]. Indeed, those resolver sets arise when clients (or ISPs on their behalf) load-balance their original DNS queries among multiple resolvers – the measurements mentioned do not consider which resolvers might handle CNAME redirections. In



contrast, in the behavior discussed here, CNAME redirections arrive from different IP addresses. Such behavior could be caused by an LDNS server with multiple Ethernet ports (in which case the server might select different ports for different queries), or by a load-balancing LDNS server farm with shared state. An example of such a configuration, hinted at by Google in [40], is shown in Figure 2.21, where two distinct layers of LDNS servers face, respectively, clients and ADNSs, and the ADNS-facing LDNSs are not recursive. Here, client-facing servers load-balance their queries among ADNS-facing servers based on a hash of the queried hostname; ADNS-facing servers send CNAME responses back to the client-facing server, which forwards the subsequent query to a different ADNS-facing server because the hostname differs. In this dissertation we will call such behavior – for the lack of a better term – the multiport behavior, and we will refer to LDNS IP addresses that appear together within the same interaction as an LDNS pool, to indicate that they belong to the same multiport host or server farm.

Figure 2.21: LDNS Pool

In an attempt to remove fringe scenarios involving rare timeout combinations,

we only considered LDNSs L1 and L2 to be part of a pool if (1) the sub1 request for

a given client came from L1 while sub2 request for the same client came from L2; and

(2) the sub2 request from L2 came within one second of the sub1 request from L1.
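To make the filter concrete, here is a minimal sketch of the pairing logic under the one-second rule; the record format (client identifier embedded in the queried hostname, LDNS IP, timestamp) is an assumed representation of the DNS log, not the actual measurement code.

def find_pool_pairs(sub1_events, sub2_events, window=1.0):
    """sub1_events / sub2_events: lists of (client_ip, ldns_ip, timestamp) tuples,
    where client_ip is the client identifier embedded in the sub1/sub2 hostnames.
    Returns the set of (L1, L2) LDNS IP pairs satisfying the pool filter: the sub2
    query came from a different LDNS than the sub1 query for the same client, and
    arrived within `window` seconds of it."""
    pairs = set()
    # Index sub1 events by client for quick lookup.
    sub1_by_client = {}
    for client, ldns, ts in sub1_events:
        sub1_by_client.setdefault(client, []).append((ldns, ts))
    for client, l2, ts2 in sub2_events:
        for l1, ts1 in sub1_by_client.get(client, []):
            if l1 != l2 and 0 <= ts2 - ts1 <= window:
                pairs.add((l1, l2))
    return pairs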

Using the above filter, we consider the prevalence of multiport behavior. We found 5,105,467 cases of such behavior, representing 407,303 unique LDNS multiport IP address pairs and involving 36,485 unique LDNS IP addresses, or 13% of all LDNSs in our trace. Furthermore, 1,924,359 clients (17% of all clients) were found to be directly involved in LDNS multiport behavior (i.e., observed to have sub1 and sub2 requests within the same interaction coming from different LDNS IP addresses), and over 10M clients – 90% of all the clients in our trace – were associated at some point with an LDNS belonging to a pool. Overall, the 13% of LDNSs with multiport behavior were the busiest – they were responsible for over 90% of both sub* queries and subsequent HTTP requests. We conclude that multiport behavior is rather common in today's Internet. Such a significant occurrence of LDNS pools warrants a closer look at this phenomenon, as it may have important implications for DNS-based request routing. Indeed, if LDNS pools always pick a random LDNS server to forward a given query, the entire pool and all clients associated with any of its member LDNSs should be treated as a single cluster. If, however, the LDNS pools attempt to preserve client affinity when selecting LDNS servers (i.e., if the same client tends to be assigned the same LDNS for the same hostname resolution, as would be the case with hash-based assignment), then individual LDNSs in the pool and the clients associated with them could be treated as separate LDNS clusters. A careful investigation of LDNS pools is an open issue for future work.

2.10 Implications for Web Content Delivery

This section summarizes the implications of our findings for Web platforms that employ DNS-based CDNs. Obviously, these lessons were derived from the study of one busy consumer-oriented Web site. While we believe this Web site is typical of similar informational sites, sites of different nature may need to re-evaluate these lessons, in which case our study can serve as a blueprint for such assessment. The implications discussed here are necessarily qualitative; they follow logically from our findings but each would have to be carefully evaluated in a separate study in the specific target environment.

First, despite a long-held concern about the hidden load problem of DNS-based CDNs, this is not a serious issue in practice for all but a small fraction of local DNS servers. For most LDNSs, the amount of hidden load – while different from one LDNS to the next – appears small enough to provide sufficiently fine granularity for load distribution. Thus, proper request routing could achieve a desired load distribution without elaborate specialized mechanisms for dealing with hidden load such as [28, 29]. Second, due to their relatively small number, the exceptions to the above finding ("elephant" LDNS clusters) can be identified, tracked, and treated separately, perhaps even by semi-automated policy configuration. This is especially true for the very largest elephants, as they appear to be geographically compact: even though these clusters contain tens of thousands of clients, their clients are mostly situated within a hundred miles of their LDNS. Thus, these clusters both benefit significantly from being served from a proximally optimal location in the platform and are not amenable to being shifted between locations using DNS resolution, due to their large hidden load. More fine-grained demand distribution techniques, such as L4–L7 load balancers or HTTP or RTP redirection, might be needed. Third, there is a large variation in the compactness of LDNS clusters, both

in terms of the geographical distribution of their clients and the autonomous system sharing between the clients and the LDNSs. This provides rich opportunities for improved request routing policies. For instance, the ADNS of the platform can try to "pin" compact clusters to be served from the respective optimal locations in the platform, while resolving any load imbalances within the global platform by re-routing requests from non-compact clusters to the extent possible. The specific policies must be worked out; however, the amount of diversity in cluster compactness across a large range of cluster sizes makes this a promising avenue for improving the efficiency of a Web platform.

Finally, there has been a shift in client-side DNS setup. The traditional model of a stub resolver at a client host talking to a local recursive DNS server, which interacts with the rest of the DNS infrastructure, no longer applies to vast numbers of clients. Many clients appear to be behind middleboxes, which masquerade as both a client and its LDNS to the rest of the Internet. Also common are complex setups involving layers of resolvers with shared state, which we called "LDNS pools". While we find no evidence that the former setup requires special treatment from a Web platform, the implications of the wide deployment of LDNS pools are another direction for further investigation.

2.11 Summary

In this chapter, we present a study of the properties of LDNS clusters from the perspective of their effect on client request routing. In terms of the LDNS cluster size, the study shows that an overwhelming majority of the LDNS clusters, even as seen from a very high-volume web site, are very small, posing no issue with respect to hidden load. However, despite recent trends transforming busy resolvers into complex distributed infrastructures (e.g., anycast-based platforms such as [65, 40]), there remain

a few "elephant" LDNS clusters.5 Thus, a DNS-based request routing system may benefit by tracking the elephant LDNS clusters and treating them differently. We also report on the geographical and autonomous system (AS) span of LDNS clusters and show that they differ widely in this respect. Furthermore, the extent of this span does not correlate with cluster size: the busiest clusters are very compact geographically but not in terms of AS sharing. Thus, a DNS-based request routing system can benefit by treating LDNS clusters differently depending on a combination of their size and compactness: when there is a need to rebalance server load, the system may re-route requests from non-compact clusters first because they benefit less from proximity-sensitive routing anyway. In this study we also show that a large number of IPs act as both Web clients

and their own LDNSs. We find evidence that much of this phenomenon is explained by the presence of middle-boxes (NATs, firewalls, and web proxies). However, although they aggregate traffic from multiple hosts, these clusters exhibit, if anything, lower activity. Hence this aspect by itself does not appear to warrant special treatment from the request routing system.

Finally, this study provides strong evidence of LDNS pools with shared cache, where a set of “worker servers” shares work for resolving clients’ queries. While the implications of this behavior for network control remain unclear, the prevalence of this LDNS behavior warrants a careful future study.

5Note that our measurement setup is able to distinguish clients behind individual resolver nodes in these platforms as distinct clusters, so we do not conflate these platforms into an elephant cluster.

Chapter 3

A Practical Architecture for an Anycast CDN

3.1 Introduction

As we mentioned earlier, most commercial CDNs make use of a DNS-based redirection mechanism to perform server selection among their nodes. However, as Chapter 2 showed, DNS-based redirection exhibits several well-known limitations, among which are the originator problem and the hidden load problem.

Another problem that DNS-based request routing systems encounter is that the DNS system was not designed for very dynamic changes in the mapping between hostnames and IP addresses. As a consequence, the LDNS server can cache and reuse its DNS query responses for a certain period of time and for multiple clients. This complicates load distribution decisions for the CDN by limiting the granularity of its control over load balancing. This problem can be mitigated significantly by having the DNS system make use of very short time-to-live (TTL) values, which control the extent of the DNS response reuse. However, a rather common practice of caching

DNS responses by local DNS servers, and especially by certain browsers, beyond the specified

TTL means that this remains an issue [67]. Furthermore, DNS-based redirection assumes that the CDN explicitly selects a nearby CDN node for the originator of a given DNS query; knowing the distance between any two IP addresses on the Internet requires a complex measurement infrastructure.

In the previous chapter of this dissertation we also investigated clusters of hosts sharing the same LDNS server ("LDNS clusters"). We found that, of the two fundamental issues in DNS-based request routing systems – hidden load and

client-LDNS distance – hidden load plays an appreciable role only for a small number of "elephant" LDNS servers, while the client-LDNS distance is significant in many cases. We also found that LDNS clusters vary widely in both characteristics and size.

Thus, a request routing system such as a content delivery network can attempt to balance load by reassigning non-compact LDNSs first, as their clients benefit less from proximity-sensitive routing. In this chapter we revisit IP anycast as a redirection technique. Anycast request routing seems to fit seamlessly into the existing Internet routing mechanisms. From a CDN point of view, however, there are a number of problems with IP anycast redirection. First, because it is tightly coupled with the IP routing apparatus, any routing change that causes anycast traffic to be re-routed to an alternative instance of the destination IP anycast address may cause a session reset for any session-based traffic such as TCP. Second, because the IP routing infrastructure only deals with connectivity, and not with the quality of service achieved along those routes, IP anycast likewise is unaware of and cannot react to network conditions. Third, IP anycast is similarly not aware of any server (CDN node) load, and therefore cannot react to node overload conditions. For these reasons, IP anycast was originally not considered a viable approach as a CDN redirection mechanism.

Our revisiting of IP anycast as a redirection mechanism for CDNs was prompted by two recent developments. First, route control mechanisms have recently been developed that allow route selection within a given autonomous system to be informed by external intelligence [87, 36, 88]. Second, recent anycast-based measurement work [16] shed light on the behavior of IP anycast, as well as the appropriate way to deploy IP anycast to facilitate proximal routing. Based on these developments, we present a design of a practical load-aware IP anycast CDN architecture for the case when the CDN is deployed within a single global network, such as AT&T's ICDS content delivery service [15]. When anycast end-points are within the same network provider, the route control mechanism can install a route from a given network entry point to the anycast end-point deemed the most appropriate for this entry point; in particular, both CDN node load and internal network conditions can be taken into account. This addresses the load-awareness concern and, in part, the route quality-of-service concern – although the latter only within the provider domain. Route control also deals with the concern about resetting sessions because of route changes. We note that in practice there are two aspects of this problem: (i) route changes within a network that deploys

IP anycast addresses and (ii) route changes outside of the network that deploys anycast IP addresses. Route control mechanisms can easily deal with the first aspect, preventing unnecessary switching between anycast addresses within the network. As for route changes outside of the IP anycast network, the study in [16] has shown that most IP prefixes exhibit very good affinity, i.e., they would be routed along the same path towards the anycast-enabled network. An anycast CDN is free of these limitations of DNS-based CDNs: it redirects actual client demand rather than local DNS servers and thus is not affected by the distance between eyeballs and their local DNS servers; it is not impacted by DNS caching; and it obviates the need for determining proximity between CDN nodes and

external destinations. Moreover, the approach presented here addresses the flaws of using anycast in the CDN context discussed earlier. However, there are further potential limitations. First, anycast delivers client requests to the nearest entry point of the CDN network with regard to the forward path from the client to the CDN network. However, due to route asymmetry, this may not produce the optimal reverse path used by response packets. Second, while route control can effectively account for network conditions inside the CDN's autonomous system, external parts of the routes are purely the result of BGP routing and thus cannot be controlled, e.g., to avoid congestion in the external network. An obvious question, which we could not answer with the data available, is the end-to-end performance comparison between the anycast and DNS CDNs. Our contribution is rather to make the case that, contrary to the common view, an anycast CDN is a viable approach to building a content delivery platform and that it improves the operation of the CDN in comparison with an existing DNS-based approach within the CDN's network. The key aspects of this contribution are as follows:

• We present a practical anycast CDN architecture that utilizes server and network load feedback to drive route control mechanisms to realize CDN redirection (Section 3.3).

• We formulate the required load-balancing algorithm as a Generalized Assignment Problem and present practical algorithms for this NP-hard problem that take into consideration the practical constraints of a CDN (Section 3.4).

• Using server logs from an operational production CDN (Section 3.5), we evaluate our algorithms by trace-driven simulation and illustrate their benefit by comparing with native IP anycast and an idealized load-balancing algorithm, as well as with the current DNS-based approach (Section 3.6).

3.2 Related Work

IP anycast has been used in some components of content delivery. In particular, the Limelight CDN [58] utilizes anycast to route DNS queries to its DNS servers: each data center has its own DNS server, with all DNS servers sharing the same address. Whenever a DNS query arrives at a given DNS server, the server resolves it to an

edge server co-located in the same data center. Thus, even though edge servers use unicast addresses, Limelight sidesteps the need to determine the nearest data center for a client, leveraging the underlying BGP routing fabric for data center selection. However, similar to the DNS-based CDNs, clients are still directed to the data center

that is nearest to the client's DNS server and not to the client itself. CacheFly [21] is, to our knowledge, the first CDN utilizing anycast technology for the content download itself. Our approach targets a different CDN environment: while CacheFly follows the co-location approach with edge servers obtaining connectivity from multiple ISPs, we assume a single-AS CDN where the operator has control over intra-platform routing. No information on which (if any) load reassignment mechanisms CacheFly uses is available. Utilizing IPv6 for anycast request routing in CDNs has been independently proposed in [83, 3]. Our work shows that anycast can be used for CDN content delivery even in current IPv4 networks. We formulate our load balancing problem as a generalized assignment problem (GAP). One related problem is the multiple knapsack problem (MKP), where we are given a set of items with different sizes and profits, and the objective is to find a subset of items that allows a feasible packing into the bins without violating bin capacities while maximizing the total profit. Chekuri and Khanna [26] present a polynomial-time approximation scheme (PTAS) for this problem. MKP is a special case in which the profit of an item is the same for all bins; it cannot capture our setting, since the cost of serving a request in a CDN varies with the server. Aggarwal,

Motwani, and Zhu [5] consider the problem of load rebalancing. Given the current assignment of request-server pairs, they focus on minimizing the completion time of queued requests by moving up to k requests to different servers and present a linear-time 1.5-approximation algorithm for this NP-hard problem. While the limited amount of rebalancing is relevant to our case to reduce ongoing session disruptions, our work has a different objective of maintaining server load under capacity. Another related problem is the facility location problem, where the goal is to select a subset of potential sites at which to open facilities and to minimize the sum of the request service costs and the facility opening costs [79]. This problem is more relevant at the provisioning time scale, when we can determine where to place CDN servers for a content group. In our setting, we are given a set of CDN servers and load-balance between them without violating the capacity constraints.

Current CDNs predominantly use DNS-based load balancing, and a number of load-balancing algorithms for this environment have been proposed in research [56, 29, 23, 71, 19] and made available in commercial products [75, 27, 2]. Since load-balancing is done at the application layer, these algorithms are able to make load balancing decisions at the granularity of individual DNS requests. For example,

[19] uses a simple algorithm of resolving each request to the nearest non-overloaded server, while [29] proposes intricate variations in DNS response TTL to control the amount of load directed to the server. These algorithms are not applicable in our environment where load-balancing decisions are at the drastically coarser granularity of the entire PEs. Content delivery networks can benefit from peer-to-peer content sharing, which can be used to share cached content either among CDN servers (and thus reduce the need to forward requests to the origin server) [37] or among users’ local caches directly

[51]. These approaches are complementary to and can be used in conjunction with our architecture. There has also been a rise in the use of peer-to-peer content delivery

as an alternative to traditional content delivery networks, with various P2P platforms providing examples of this approach. The general scalability of this style of content delivery is considered in [81]. Our work targets traditional CDNs, which offer their subscribers content delivery from a dedicated, commercially operated platform with tight control and usage reporting. Our architecture uses IP anycast to route HTTP requests to edge servers, with a subsequent HTTP redirection of requests for particularly large downloads. Our parallel work addresses the increased penalty of disrupted connections in CDNs that deliver streaming content and very large objects [8]. That work proposes to induce connection disruption as a way to reassign a client to a different edge server if load conditions change during a long-running download. That work is complementary to our present approach: the latter can use this technique instead of HTTP redirects to deal with long-running downloads.

3.3 Architecture

In this section we first describe the workings of a load-aware anycast CDN and briefly discuss the pros and cons of this approach vis-a-vis more conventional CDN architectures. We also give an informal description of the load balancing algorithm required for our approach before describing it more formally in later sections.

3.3.1 Load-aware Anycast CDN

Figure 3.1 shows a simplified view of a load-aware anycast CDN. We assume a single autonomous system (AS) in which IP anycast is used to reach a set of CDN nodes distributed within the AS. For simplicity we show two such CDN nodes, A and B in

Figure 3.1. In the rest of the chapter, we use the terms "CDN node" and "content server" interchangeably. We further assume that the AS in question has a large footprint in the country or region in which it will be providing CDN service; for example, in the US, Tier-1 ISPs have this kind of footprint1. This dissertation investigates the synergistic benefits of having control over the PEs of a CDN. We note that these assumptions are both practical and, more importantly, supported by a recent study of IP anycast [16], which has shown this to be the ideal type of deployment to ensure good proximity properties2.

Figure 3.1: Load-aware Anycast CDN Architecture

Figure 3.1 also shows the route controller component that is central to our

approach [87, 88]. The route controller activates routes with provider edge (PE) routers in the CDN provider network. As described in [87], this mechanism involves pre-installed MPLS tunnels for a destination IP address (an anycast address in our case) from each PE to every other PE. Thus, to activate a route from a given PE, PEi, to another PE, PEj, the controller only needs to signal PEi to start using the appropriate MPLS label. In particular, a route change does not involve any other routers and in this sense is an atomic operation.

1http://www.business.att.com, http://www.level3.com
2Note that while our focus in this work is on anycast CDNs, we recognize that these conditions cannot always be met in all regions where a CDN provider might provide services, which suggests that a combination of redirection approaches might be appropriate.

The route controller can use this mechanism to influence the anycast routes selected by the ingress PEs. For example, in Figure 3.1, to direct packets entering

through PE PE1 to CDN node B, the route controller would signal PE1 to activate

the MPLS tunnel from PE1 to PE5; to send these packets to node A instead, the route

controller would similarly activate the tunnel from PE1 to PE0. For our purposes, the route controller takes as inputs the ingress load from the PEs at the edge of the network, the server load from the CDN nodes for which it is performing redirection, and the cost matrix of reaching a given CDN server from a given PE, and it computes the routes according to the algorithms described in Section 3.4. The load-aware anycast CDN then functions as follows (with reference to Figure 3.1): all CDN nodes that are configured to serve the same content (A and B) advertise the same IP anycast address into the network via BGP (respectively through

PE0 and PE5). PE0 and PE5 in turn advertise the anycast address to the route controller, which is responsible for advertising the (appropriate) route to all other PEs in the network (PE1 to PE4). These PEs in turn advertise the route via eBGP sessions with peering routers (PEa to PEd) in neighboring networks so that the anycast address becomes reachable throughout the Internet (represented in the figure by access networks I and II). Request traffic for content on a CDN node will follow the reverse path. Thus, a request will come from an access network and enter the CDN provider network via one of the ingress routers PE1 to PE4. In the simple setup depicted in Figure 3.1, such request traffic will then be forwarded to either PE0 or PE5 en route to one of the CDN nodes. Based on the two load feeds (ingress PE load and server load) provided to the route controller, it can decide which ingress PE (PE1 to PE4) to direct to which egress

PE (PE0 or PE5). By assigning different PEs to appropriate CDN nodes, the route controller can minimize the network costs of processing the demand and distribute

the load among the CDN nodes. In summary, our approach utilizes the BGP-based proximity property of IP anycast to deliver clients' packets to the nearest ingress PEs. The external portions of the paths of anycast packets are determined purely by inter-AS BGP routes. Once packets enter the provider network, it is the route controller that decides where these packets will be delivered, by mapping ingress PEs to content servers. The route controller makes these decisions taking into account both the network proximity of the internal routes and the server loads.
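As a toy illustration of the route controller's decision (and not the algorithms of Section 3.4, which replace this greedy choice), the sketch below maps each ingress PE to the closest CDN node that still has spare capacity; the function name and the dictionary-based data layout are assumptions for the example.

def assign_ingress_pes(pe_load, node_capacity, cost):
    """pe_load: dict ingress PE -> offered load (concurrent requests).
    node_capacity: dict CDN node -> capacity.
    cost: dict (node, pe) -> cost of serving that PE from that node
          (e.g., proportional to internal distance).
    Returns dict ingress PE -> CDN node; the route controller would then
    activate the corresponding MPLS tunnel for each ingress PE."""
    residual = dict(node_capacity)
    mapping = {}
    # Serve the heaviest ingress PEs first so they get their closest node.
    for pe in sorted(pe_load, key=pe_load.get, reverse=True):
        candidates = [n for n in residual if residual[n] >= pe_load[pe]]
        if not candidates:                 # no node has room: fall back to all nodes
            candidates = list(residual)
        node = min(candidates, key=lambda n: cost[(n, pe)])
        mapping[pe] = node
        residual[node] -= pe_load[pe]
    return mapping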

3.3.2 Objectives and Benefits

We can summarize the goals of the architecture described above as follows: (i) to utilize the natural IP anycast proximity properties to reduce the distance traffic is carried towards the CDN's ISP; (ii) to react to overload conditions on CDN servers by steering traffic to alternative CDN servers; and (iii) to minimize the disruption of traffic that results when ongoing sessions are re-mapped to alternative CDN servers. Note that this means that per-server "load balancing" is not a specific goal of the algorithm: while CDN servers are operating within acceptable engineering loads, the algorithm should not attempt to balance the load. On the other hand, when overload conditions are reached, the system should react to deal with them without compromising proximity. A major advantage of our approach over DNS-based redirection systems is that the actual eyeball request is redirected, as opposed to the local-DNS request. Further, with load-aware anycast, any redirection changes take effect very quickly, because PEs immediately start to route packets based on their updated routing tables. In contrast, DNS caching by clients (despite short TTLs) typically results in some delay before redirection changes have an effect. The granularity of load distribution offered by our route control approach is

at the PE level. For large tier-1 ISPs, the number of PEs is typically in the high hundreds to low thousands. A possible concern for our approach is whether PE granularity will be sufficiently fine-grained to adjust load in cases of congestion. Our results in Section 3.6 indicate that even with PE-level granularity we can achieve significant performance benefits in practice.

Figure 3.2: Application-level redirection for long-lived sessions

Obviously, with enough capacity, no load balancing would ever be required. However, a practical platform needs to have load-balancing ability to cope with un-

expected events such as flash crowds and node failures, and to flexibly react to even more gradual demand changes, because building up the physical capacity of the platform is a very coarse-grained procedure. Our experiments will show that our architecture can achieve effective load balancing even under constrained resource provisioning. Before we describe and evaluate redirection algorithms that fulfill these goals,

we briefly describe two other CDN-related functions enabled by our architecture that are not further elaborated upon in this dissertation.

3.3.3 Dealing with Long-Lived Sessions

Despite increased distribution of rich media content via the Internet, the average Web

object size remains relatively small [54]. This means that download sessions for such Web objects will be relatively short-lived, with little chance of being impacted by any anycast re-mappings in our architecture. The same is, however, not true for long-lived sessions, e.g., streaming or large file downloads [86]. (Both of these expectations are validated by our analysis of connection disruption counts in Section 3.6.)

In our architecture we deal with this by making use of an additional application-level redirection mechanism after a particular CDN node has been selected via our load-aware IP anycast redirection. This interaction is depicted in Figure 3.2. As before, an eyeball will perform a DNS request, which will be resolved to an IP anycast

address (i and ii). The eyeball will attempt to request the content using this address (iii); however, the CDN node will respond with an application-level redirect (iv) [85] containing a unicast IP address associated with this CDN node, which the eyeball will use to retrieve the content (v). This unicast address is associated only with this

CDN node, and the eyeball will therefore continue to be serviced by the same node regardless of routing changes along the way. While the additional overhead associated with application-level redirection is clearly unacceptable when downloading small Web objects, it is less of a concern for long-lived sessions where the startup overhead is

amortized. In parallel work, an alternative approach was proposed to handle extremely large downloads using anycast without relying on HTTP redirection [8]. Instead, the approach in [8] recovers from any disruption by reissuing the HTTP request for the remainder of the object as an HTTP range request. The CDN can then trigger these disruptions intentionally to switch the user to a different server mid-stream if conditions change. However, that approach requires a browser extension. Recently, some CDNs have started moving into the utility (also known as cloud)

computing arena by deploying applications at the CDN nodes. In this environment, applications often form long-lived sessions that encompass multiple HTTP requests, with individual requests requiring the entire session state to execute correctly. Commercial application servers, including both WebLogic and WebSphere, allow servers to

form a wide-area cluster where each server in the cluster can obtain the session state after successfully receiving any HTTP request in a session. Based on this feature, our approach for using anycast for request redirection can apply to this emerging CDN environment.
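Returning to the application-level redirect step (iv) above, the following is a minimal, illustrative sketch of a handler on the anycast address that answers requests for long-lived downloads with an HTTP 302 pointing at a unicast name of the same node; the hostname and path prefix are made up for the example and are not part of the actual system.

from http.server import BaseHTTPRequestHandler, HTTPServer

UNICAST_HOST = "node-a.cdn.example.com"   # hypothetical unicast name of this CDN node
LARGE_PATH_PREFIX = "/downloads/"          # hypothetical marker for long-lived sessions

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(LARGE_PATH_PREFIX):
            # Long-lived download: pin the client to this node via its unicast address.
            self.send_response(302)
            self.send_header("Location", "http://%s%s" % (UNICAST_HOST, self.path))
            self.end_headers()
        else:
            # Small object: serve directly over the anycast address.
            body = b"small object served over anycast\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectingHandler).serve_forever()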

3.3.4 Dealing with Network Congestion

As described above, the load-aware CDN architecture only takes server load into account in terms of being "load-aware". (In other words, the approach uses network load information to manage server load, but does not attempt to steer traffic away from network hotspots.) The Route Control architecture, however, does

allow for such traffic steering [87]. For example, outgoing congested peering links can be avoided by redirecting response traffic on the PE connecting to the CDN node

(e.g., PE0 in Figure 3.1), while congested incoming peering links can be avoided by exchanging BGP Multi-Exit Discriminator (MED) attributes with appropriate peers [87]. We leave the full development of these mechanisms for future work.

3.4 Remapping Algorithm

The algorithm for assigning PEs to CDN nodes has two main objectives. First, we want to minimize the service disruption due to load balancing. Second, we want to minimize the network cost of serving requests without violating server capacity constraints. In this section, after presenting an algorithm that minimizes the network cost, we describe how we use the algorithm to minimize service disruption.

3.4.1 Problem Formulation

Our system has m servers, where each server i can serve up to Si concurrent requests. A request enters the system through one of n ingress PEs, and each ingress PE j contributes rj concurrent requests. We consider a cost matrix cij for serving PE j at server i. Since cij is typically proportional to the distance between server i and PE j

as well as the traffic volume rj, the cost of serving PE j typically varies with different servers.

The first objective we consider is to minimize the overall cost without violating the capacity constraint at each server. The problem is called Generalized Assignment Problem (GAP) and can be formulated as the following integer linear optimization problem [78].

\[
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}\,x_{ij} \\
\text{subject to}\quad & \sum_{i=1}^{m} x_{ij} = 1, \qquad \forall j \\
& \sum_{j=1}^{n} r_j\,x_{ij} \le S_i, \qquad \forall i \\
& x_{ij} \in \{0,1\}, \qquad \forall i,\, j
\end{aligned}
\]

where the indicator variable xij = 1 iff server i serves PE j, and xij = 0 otherwise. Note that the above formulation reflects our "provider-centric" perspective, with the focus on minimizing the costs borne by the network operator. In particular, the model favors

overall cost reduction even if this means redirecting some load to a far-away server. In principle, one could bound the proximity degradation for any request by adding a constraint that no PE be assigned to a content server more than k times farther away than the closest server. In practice, however, as we will see later (Figure 3.9), the penalty for a vast majority of requests is very small relative to the current system.
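To make the formulation concrete, the following small sketch evaluates a candidate integer assignment against the GAP objective and the capacity constraints; it assumes dictionary-based inputs and is not part of the st-algorithm itself.

def evaluate_assignment(assign, cost, load, capacity):
    """assign: dict PE j -> server i (an integer solution x).
    cost:     dict (i, j) -> c_ij.
    load:     dict j -> r_j (concurrent requests entering through PE j).
    capacity: dict i -> S_i.
    Returns (total_cost, list of servers whose capacity is violated)."""
    total_cost = sum(cost[(i, j)] for j, i in assign.items())
    server_load = {i: 0 for i in capacity}
    for j, i in assign.items():
        server_load[i] += load[j]
    overloaded = [i for i, l in server_load.items() if l > capacity[i]]
    return total_cost, overloaded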

When xij is an integer, finding an optimal solution to GAP is NP-hard, and

even when Si is the same for all servers, no polynomial algorithm can achieve an approximation ratio better than 1.5 unless P=NP [78]. Recall that an α-approximation algorithm always finds a solution that is guaranteed to be at most α times the optimum.

Shmoys and Tardos [78] present an approximation algorithm (called st-algorithm in this dissertation) for GAP, which involves a relaxation of the integrality constraint and a rounding based on a fractional solution to the LP relaxation. It first obtains the initial total cost value C using linear programming optimization, by removing the

restriction that xij be integer (in which case the above problem formulation becomes an LP optimization problem). Then, using a rounding scheme based on the fractional solution, the algorithm finds an integer solution whose total cost is at most C and

the load on each server is at most Si + max_j rj. The st-algorithm forms the basis for the traffic control decisions in our approach, as discussed in the rest of this section. In our approach, the route controller periodically re-examines the PE-to-server assignments and computes a new assignment if necessary using a remapping algorithm. We call the period between consecutive runs of the mapping algorithm the remapping interval. We explore two remapping algorithms: one that attempts to minimize the

cost of processing the demand from the clients (thus always giving cost reduction a priority over connection disruption), and the other that attempts to minimize the connection disruptions even if this leads to cost increase.

3.4.2 Minimizing Cost

The remapping algorithm for minimizing costs is shown in pseudocode as Al-

gorithm 1. It begins by running what we refer to as the expanded st-algorithm to try to find a feasible solution as follows. We first run the st-algorithm with the given server capacity constraints, and if it cannot find a solution (which means the load is too high to be satisfied within the capacity constraints at any cost), we increase the capacity

Algorithm 1 Minimum Cost Algorithm
INPUT: CurrentLoad[i], OfferedLoad[j], Cost[i][j] for each server i and PE j
  Run expanded st-algorithm
  {Post processing}
  repeat
    Find the most overloaded server i; let Pi be the set of PEs served by i
    Map PEs from Pi to i, starting from the largest OfferedLoad, until i reaches its capacity
    Remap(i, {the set of still-unmapped PEs from Pi})
  until none of the servers is overloaded OR no further remapping would help
  return

Subroutine Remap(Server i, PE set F):
  for all j in F, in descending order of OfferedLoad do
    Find server q = argmin_k Cost[k][j] with enough residual capacity for OfferedLoad[j]
    Find server t with the highest residual capacity
    if q exists and q != i then
      Remap j to q
    else
      Map j to t    {t is less overloaded than i}
    end if
  end for

of each server by 10% and try to find a solution again.3 In our experiments, we set the maximum number of tries at 15, after which we give up on computing a new remapping and retain the existing mapping scheme for the next remapping interval. However, in our experiments, the algorithm found a solution in all cases and never skipped a remapping cycle. Note that the st-algorithm can lead to server overload (even relative to the increased server capacities), although the overload amount is bounded by max rj. In practice, the overload volume can be significant, since a single PE can contribute a large request load (e.g., 20% of server capacity). Thus, we use the following post-processing on the solution of the st-algorithm to find a feasible solution without violating the (possibly increased) server capacities. We first identify the most overloaded server i, and then among all the PEs

3Note that the capacity constraints are just parameters in the algorithm and in practice assigned to be less than the physical capacity of the servers.

served by i, find the set of PEs F (starting from the least-loaded PE) such that server i's load falls below the capacity Si after off-loading F. Then, starting with the highest-load PEs among F, we offload each PE j to a server with enough residual

capacity q, as long as the load on server i is above Si. (If there are multiple such servers for j, we choose the one with minimum cost to j, although other strategies such as best-fit are possible.) If there is no server with enough capacity, we find server t with the highest residual capacity and see if the load on t after acquiring j is lower than the current load on i. If so, we off-load PE j to server t even when the load on

t goes beyond St, which will be fixed in a later iteration. Once the overload of server i is resolved, we repeat the whole process with the next most overloaded server. Note that the overload comparison between i and t ensures the monotonic decrease of the maximum overload in the system and therefore

termination of the algorithm – either because there are no more overloaded servers in the system or because the "repeat" post-processing loop could not further offload any of the overloaded servers.
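The post-processing step can be rendered as the following simplified sketch, which assumes dictionary-based loads, capacities, and costs and omits the st-algorithm call that produces the initial mapping; it approximates Algorithm 1's overload-resolution loop and is not the production implementation.

def resolve_overloads(mapping, load, capacity, cost):
    """mapping: dict PE j -> server i (initial assignment from the st-algorithm).
    load: dict j -> offered load; capacity: dict i -> S_i; cost: dict (i, j) -> c_ij.
    Repeatedly off-loads the most overloaded server, moving its highest-load PEs
    either to the cheapest server with enough residual capacity or, failing that,
    to the server with the most residual capacity if that strictly reduces overload."""
    def server_load(i):
        return sum(load[j] for j, s in mapping.items() if s == i)

    while True:
        over = {i: server_load(i) - capacity[i]
                for i in capacity if server_load(i) > capacity[i]}
        if not over:
            break                                  # no overloaded servers remain
        i = max(over, key=over.get)                # most overloaded server
        others = [k for k in capacity if k != i]
        pes = sorted((j for j, s in mapping.items() if s == i),
                     key=load.get, reverse=True)
        moved_any = False
        for j in pes:
            if server_load(i) <= capacity[i]:
                break
            fits = [k for k in others if capacity[k] - server_load(k) >= load[j]]
            if fits:                               # cheapest server with room
                mapping[j] = min(fits, key=lambda k: cost[(k, j)])
                moved_any = True
            elif others:
                t = max(others, key=lambda k: capacity[k] - server_load(k))
                if server_load(t) + load[j] < server_load(i):
                    mapping[j] = t                 # may overload t; fixed in a later pass
                    moved_any = True
        if not moved_any:
            break                                  # no further remapping would help
    return mapping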

3.4.3 Minimizing Connection Disruption

Algorithm 2 Minimum Disruption Algorithm
INPUT: CurrentLoad[i], OfferedLoad[j], Cost[i][j] for each server i and PE j
  Let FP be the set of PEs mapped to non-overloaded servers    {these are excluded from re-mapping}
  For every non-overloaded server i, set server capacity Si ← Si − CurrentLoad[i]
  For every overloaded server i and all PEs j currently mapped to i, set Cost[i][j] ← 0
  Run expanded st-algorithm to find a server for the PEs ∉ FP
    {this remaps only PEs currently mapped to overloaded servers, and those PEs prefer their current server}
  {Post processing}
  repeat
    Find the most overloaded server i
    Map PEs ∈ Pi \ FP to i, starting from the largest OfferedLoad, until i reaches its capacity
    Remap(i, {the set of still-unmapped PEs from Pi})
  until none of the servers is overloaded OR no further remapping would help

While the above algorithm attempts to minimize the cost, it does not take the current mapping into account and can potentially lead to a large number of connection disruptions. To address this issue, we present another algorithm, which gives connection disruption a certain priority over cost. For clarity, we start by

describing an algorithm that attempts a remapping only when there is a need to off-load one or more overloaded servers. The pseudo-code of the algorithm is shown as Algorithm 2. The algorithm divides all the servers into two groups based on load: overloaded servers and non-overloaded servers. The algorithm keeps the current mapping of the non-overloaded servers and only attempts to remap the PEs assigned to the overloaded servers. Furthermore, even for the overloaded servers, we try to retain the current mappings as much as possible. Yet for the PEs that do have to be remapped due to overloads, we would like to use the st-algorithm to minimize the costs. We manipulate the input to the st-algorithm in two ways to achieve these goals. First, for each non-overloaded server i, we consider only its residual capacity as the capacity

Si in the st-algorithm. This allows us to retain the server's current PEs while optimizing costs for newly assigned PEs. Second, for each overloaded server, we set the cost

of servicing its currently assigned PEs to zero. Thus, current PEs will be reassigned only to the extent necessary to remove the overload. As described, this algorithm reassigns PEs to different servers only in overloaded scenarios. It can lead to sub-optimal operation even when the request volume has gone down significantly and simple proximity-based routing would yield a feasible solution with a lower cost. One way to address this is to exploit the typical diurnal pattern and perform a full remapping once a day at a time of low activity (e.g., 4am every day). Another possibility is to compare the current mapping and the potential lowest-cost mapping at that point, and initiate the reassignment if the cost difference is beyond a certain threshold (e.g., 70%). Our experiments do not account for these

optimizations. To summarize, in our system, we mainly use the algorithm in Section 3.4.3 to minimize the connection disruption, while we infrequently use the algorithm in Section 3.4.2 to find an (approximate) minimum-cost solution for particular operational scenarios.
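A minimal sketch of the two input manipulations used by the disruption-minimizing algorithm (residual capacities for non-overloaded servers and zero cost for PEs already on overloaded servers), again under the assumption of dictionary-based inputs; the call into the expanded st-algorithm itself is elided.

def prepare_min_disruption_inputs(mapping, load, capacity, cost):
    """Returns (pes_to_remap, adjusted_capacity, adjusted_cost) for the
    disruption-minimizing run of the expanded st-algorithm."""
    server_load = {i: 0 for i in capacity}
    for j, i in mapping.items():
        server_load[i] += load[j]
    overloaded = {i for i in capacity if server_load[i] > capacity[i]}

    # PEs on non-overloaded servers keep their mapping and are excluded (the set FP).
    pes_to_remap = [j for j, i in mapping.items() if i in overloaded]

    # Non-overloaded servers only expose their residual capacity.
    adjusted_capacity = {
        i: capacity[i] if i in overloaded else capacity[i] - server_load[i]
        for i in capacity
    }

    # An overloaded server serves its current PEs "for free", so they are
    # reassigned only to the extent needed to remove the overload.
    adjusted_cost = dict(cost)
    for j in pes_to_remap:
        adjusted_cost[(mapping[j], j)] = 0
    return pes_to_remap, adjusted_capacity, adjusted_cost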

3.5 Evaluation Methodology

This section describes the methodology of our experimental study.

3.5.1 Data Set

We obtained two types of data sets from a production single-AS CDN: netflow datasets from its ingress PEs and Web access logs from its cache servers. The access logs were collected for a weekday in July 2007. Each log entry has detailed information about an HTTP request and response, such as the client IP, cache server IP, request size, response size, and arrival time. Depending on the logging software, some servers provide the service response time for each request in the log, while others do not.

In our experiments, we first obtain sample distributions for different response size groups based on the actual data. For log entries without response time, we choose an appropriate sample distribution (based on the response size) and use a randomly generated value following the distribution.

We use the number of concurrent requests being processed by a server as the load metric that we control in our experiments. In addition, we also evaluate the data serving rate as an indication of server load. To determine the number of concurrent requests rj coming through an ingress PE j, we look at the client and server IP pair for each log entry and use netflow data to determine where the request entered the system. We then use the request time from the log and the service response time (actual or computed

as described above) to determine whether a request is currently being served. One of our objectives is to maximize network proximity in processing client requests. In particular, because we focus on reducing the costs of the CDN's network provider, our immediate goal is to maximize network proximity and minimize network delays inside the CDN's autonomous system. Since the internal response path is always degenerate regardless of our remapping (it uses hot-potato routing to leave the AS as quickly as possible), the network proximity between the client's ingress PE and the server is determined by the request path4. Thus, we use the request path as our cost

metric reflecting the proximity of request processing. Specifically, we obtained from

the CDN the distance matrix dij between every server i and every ingress PE j in terms of air miles and used it as the cost of processing a request. While we did not have access to the full topological routing distances, the latter are known to be highly

correlated with air miles within an autonomous system, since routing anomalies within an AS are avoided. Thus, using air miles does not have any significant effect on the results and at the same time makes the results independent of a particular topology and routing algorithm. Topological routing distances, if available, could equally be used in our design.

We use the product rjdij as the cost cij of serving requests from PE j at server i.

Another input required by st-algorithm is the capacity Si of each server i. To assign server capacity, we first analyze the log to determine the maximum aggregate number of concurrent requests across all servers during the entire time period in

the log. Then, we assign each server the capacity equal to the maximum aggregate

4The proximity of the request's external path (from the client to an entry point into the CDN's AS) is further provided by IP anycast. At the same time, our focus on internal proximity may result in a suboptimal external response path, since we choose the CDN node closest to the ingress PE and the reverse path could be asymmetric. In principle, the route controller could take into account the proximity of the various CDN nodes to the clients from the perspective of the overall response path. The complication, however, is that our current architecture assumes re-pinning is done at the granularity of entire ingress PEs. Thus, any server selection decision would apply to all clients that enter the network at a given PE. Whether these clients are clustered enough in the Internet to exhibit similar proximity when reached from different CDN nodes is an interesting question for future work.

63 concurrent requests divided by the number of servers. This leads to a high-load scenario for peak time, where we have sufficient aggregate server capacity to handle all the requests but only assuming ideal load distribution. Note that server capacity is simply a parameter of the load balancing algorithm, and in practice would be specified

to be below the server’s actual processing limit. We refer to the latter as the server’s physical capacity. In most of the experiments we assume the physical capacity to be 1.6 times the server capacity parameter used by the algorithms. The CDN under study classifies content into content groups and assigns each content group to a certain set of CDN nodes. We use two such content groups for our analysis: one containing Small Web Objects, assigned to 11 CDN nodes, and the other Large File Downloads, processed by 8 CDN nodes.
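The capacity parameters can be derived from the trace as sketched below; the per-second aggregate concurrency series is assumed to be precomputed from the log, and the 1.6 slack factor is the one used in most of our experiments.

def capacity_parameters(aggregate_concurrent, num_servers, slack=1.6):
    # aggregate_concurrent: per-second totals of concurrent requests across all servers.
    peak = max(aggregate_concurrent)               # maximum aggregate concurrency
    server_capacity = peak / num_servers           # parameter given to the algorithms
    physical_capacity = slack * server_capacity    # assumed actual processing limit
    return server_capacity, physical_capacity

# For the small-object group this yields the 312-request capacity parameter and
# a physical limit of roughly 500 concurrent requests (1.6 x 312).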

3.5.2 Simulation Environment

We used CSIM [31] (http://www.mesquite.com) to perform our trace-driven simulation. CSIM creates process-oriented, discrete-event simulation models. We implemented our CDN servers as a set of facilities that provide services to requests from ingress PEs, which are implemented as CSIM processes. For each request that arrives, we determine the ingress PE j, the response time t, and the response size l. We assume that the server responds to a client at a constant rate calculated as the response size divided by the response time for that request. In other words, each request causes a server to serve data at the constant rate of l/t for t seconds. Multiple requests from the same PE j can be active simultaneously on server i. Furthermore, multiple PEs can be served by the same facility at the same time.

To allow flexibility in processing arbitrary load scenarios, we configured the CSIM facilities that model servers to have infinite capacity and very large bandwidth. We then impose capacity limits at the application level in each scenario. Excessive load is handled differently in different systems. Some systems impose access control, so that servers simply return an error response to excess requests to prevent them from affecting the remaining workload. In other systems, the excessive requests are admitted and may cause overall performance degradation. Our simulation can handle both setups. In the setup with access control, each arriving request is either passed to the simulation or dropped, depending on the current load of the destination server of its connection. In the setup without access control, we admit all requests and simply count the number of over-capacity requests. An over-capacity request is a request that, at the time of its arrival, finds the number of existing concurrent requests on the server already equal to or exceeding the server's physical capacity limit. In describing our experiments below, we will specify which of the setups the various experiments follow. In general, the number of over-capacity requests in the setup without access control will exceed the number of dropped requests in the setup with access control because, as explained above, a dropped request imposes no load on the server while an over-capacity connection contributes to server load until it is processed. However, the ultimate goal in dimensioning the system is to make the number of excess requests negligible in either setup, in which case both setups will exhibit the same behavior.

The scale of our experiments required us to perform the simulation at a time granularity of one second. To ensure that each request has a non-zero duration, we round the beginning time of a request down and its ending time up to whole seconds.
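A condensed stand-in for this admission logic and the one-second rounding is sketched below; it is not the CSIM model itself, only the two decisions described above.

import math

def round_to_seconds(start, end):
    # Round the start down and the end up so every request lasts at least 1 second.
    s, e = math.floor(start), math.ceil(end)
    return s, max(e, s + 1)

def handle_arrival(server_load, physical_capacity, access_control):
    # Return (admitted, excess): `excess` marks a request arriving at or above the
    # physical limit -- dropped with access control, admitted (but counted as
    # over-capacity) without it.
    if server_load >= physical_capacity:
        return (not access_control), True
    return True, False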

3.5.3 Schemes and Metrics for Comparison

We experiment with the following schemes and compare their performance:

• Trace Playback (pb): In this scheme we replayed all requests in the trace without any modification of server mappings. In other words, pb reflects the current CDN routing configuration.

• Simple Anycast (sac): This is “native” anycast, which represents an idealized proximity routing scheme, where each request is served at the geographically closest server.

• Simple Load Balancing (slb): This scheme employs anycast to minimize the difference in load among all servers without considering the cost.

• Advanced Load Balancing, Always (alb-a): This scheme always attempts to find a minimum cost mapping as described in Section 3.4.2.

• ALB, On-overload (alb-o): This scheme aims to minimize connection disruptions as described in Section 3.4.3. Specifically, it normally only reassigns PEs currently mapped to overloaded servers and performs a full remapping only if the cost reduction from the full remapping would exceed 70%.

In sac, each PE is statically mapped to a server, and there is no change in the mappings across the entire experiment run. slb and alb-a recalculate the mappings every ∆ seconds (the remapping interval). The initial ∆ value that we used to evaluate the different algorithms is 120 seconds. Later, in Section 3.6.5, we examine various values of ∆.
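One way to read the alb-o rule above is sketched below; min_cost_mapping and remap_overloaded are placeholders for the algorithms of Sections 3.4.2 and 3.4.3 (alb-a simply invokes min_cost_mapping every interval), and the interpretation of the 70% threshold as a cost reduction relative to the incremental remapping is our own assumption.

def remap_alb_o(mapping, loads, capacities, costs,
                min_cost_mapping, remap_overloaded, full_remap_gain=0.70):
    overloaded = [s for s, load in loads.items() if load > capacities[s]]
    if not overloaded:
        return mapping                 # nothing to relieve: keep the mapping, skip the LP
    candidate = remap_overloaded(mapping, overloaded, loads, capacities, costs)
    full = min_cost_mapping(loads, capacities, costs)
    # Adopt the full remapping only if it cuts the cost by more than 70%
    # relative to the incremental remapping (our reading of the rule above).
    if total_cost(full, costs) < (1 - full_remap_gain) * total_cost(candidate, costs):
        return full
    return candidate

def total_cost(mapping, costs):
    # Sum c_ij over the current PE-to-server assignment (mapping: PE -> server).
    return sum(costs[server][pe] for pe, server in mapping.items())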

We utilize the following metrics for performance comparison.

• Server load: We use the number of concurrent requests and service data rate at each server as measures of server load. A desirable scheme should keep the number below the capacity limit all the time.

• Request air-miles: We examine the average miles a request traverses within the CDN provider network before reaching a server as a proximity metric of content delivery within the CDN's ISP. A small value for this metric denotes small network link usage in practice.

• Disrupted connections and over-capacity requests: Another metric of redirection scheme quality is the number of disrupted connections due to re-mapping. Disruption occurs when a PE is re-mapped from server A to server B: the ongoing connections arriving from that PE may be disconnected because B may not have the connection information. Finally, we use the number of over-capacity requests as a metric to compare the ability of different schemes to prevent server overloading. A request is counted as over-capacity if it arrives at a server whose number of existing concurrent requests is already at or over the physical capacity limit.
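The disruption counter can be maintained as sketched below; the bookkeeping of which PE and server each active connection is pinned to is an assumption about the simulator's internal state (the over-capacity check was already sketched in Section 3.5.2).

def count_disruptions(active_connections, new_mapping):
    # active_connections: connection id -> (ingress PE, server currently holding it).
    # A connection is disrupted when its PE is re-mapped to a different server,
    # because the new server has no state for it.
    disrupted = 0
    for conn_id, (pe, server) in list(active_connections.items()):
        if new_mapping.get(pe) != server:
            disrupted += 1
            del active_connections[conn_id]
    return disrupted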

With our redirection schemes, a request may use a server different from the one used in the trace, and its response time may change, for example, depending on the server load or capacity. In our experiments, however, we assume that the response time of each request is the same as in the trace, no matter which server our algorithms assign it to.

3.6 Experimental Results

In this section, we present our simulation results. We first consider the server load, the number of air miles for request traffic, and the number of disrupted and over-capacity requests that result for each of the redirection schemes. In all these experiments, presented in Sections 3.6.1-3.6.3, we assume all server capacities to be the same and to equal in aggregate 100% of the maximum total number of concurrent requests in the trace (as described in Section 3.5.1). Specifically, this translates to 1900 concurrent requests per server for the large-file group and 312 concurrent requests for the small-object group. The remapping interval in these experiments is fixed at 120 seconds, and we assume there is no access control in the system (i.e., excess requests are not dropped but only counted as over-capacity requests). Section 3.6.5 investigates different remapping interval values.

3.6.1 Server Load Distribution

We first present the number of concurrent requests at each server for the large files group. For clarity of presentation, we use points sampled every 60 seconds. In Figure 3.3, we plot the number of concurrent requests at each server over time. Figure 3.3(a) shows the current behavior of the CDN nodes (Trace Playback (pb)). It is clear that some servers (e.g., server 4) process a disproportionate share of the load – 4 to 5 times the load of other servers. This indicates current over-provisioning of the system and an opportunity for significant optimization.

Turning to anycast-based redirection schemes, since sac does not take load into account but always maps PEs to the closest server, we observe from Figure 3.3(b) that the load at only a few servers grows significantly, while other servers get very few requests. For example, at 8am, server 6 serves more than 55% of total requests (5845 out of 10599), while server 4 receives fewer than 10. Unless server 6 is provisioned with enough capacity to serve a significant share of the total load, it will end up dropping many requests. Thus, while reducing the peak load of the most-loaded server compared to the playback, sac still exhibits large load imbalances. At the other extreme, Figure 3.3(c) shows that slb evenly distributes the load across servers. However, slb does not take cost into account and can potentially lead to high connection cost.

In Figures 3.3(d) and 3.3(e), we present the performance of alb-a and alb-o, the two schemes that attempt to take both cost and load balancing into account in remapping decisions. According to the figures, these algorithms do not balance the load among servers as well as slb. This is expected because their main objective is to find a mapping that minimizes the cost as long as the resulting mapping does not violate the server capacity constraint. Considering alb-a (Figure 3.3(d)), in the morning (around 7am) a few servers receive only relatively few requests, while other, better-located servers run close to their capacity. As the traffic load increases (e.g., at 3pm), the load on each server becomes similar in order to serve the requests without violating the capacity constraint.

Figure 3.3: Number of concurrent requests for each scheme (Large files group). Panels: (a) pb, (b) sac, (c) slb, (d) alb-a, (e) alb-o. Axes: concurrent requests at each server vs. time (hours in GMT); curves: aggregate (pb panel) and individual servers 1-8.

alb-o initially shows a pattern similar to alb-a (Figure 3.3(e)), although the change in request count is in general more graceful. However, the difference becomes clear after the traffic peak is over (at around 4pm). This is because alb-o attempts to reassign the mapping only when there is an overloaded server. As a result, even when the peak is over and a lower-cost mapping could be found, all PEs stay with the servers they were assigned based on the peak load (e.g., at around 3pm). This property of alb-o leads to less traffic disruption at the expense of increased overall cost (as we will see later in Section 3.6.2). Overall, from the load perspective, we see that both alb-a and alb-o manage to keep the maximum server load within roughly 2000 concurrent requests, very close to the 1900-connection capacity used as a parameter for these algorithms. Within these load limits, the algorithms attempt to reduce the cost of traffic delivery.

In Figure 3.4, we present the same set of results using the logs for small object downloads. We observe a similar trend for each scheme, although the server load changes more frequently. This is because the response sizes are small, and the average service time for this content group is much shorter than that of the previous group.

We also present the serviced data rate of each server in Figure 3.5 for the large file server group and in Figure 3.6 for the small object server group. We observe a strong correlation between the number of requests (Figures 3.3 and 3.4) and the data rates (Figures 3.5 and 3.6). In particular, the data rate load metric confirms the observations we made using the concurrent requests metric.

3.6.2 Disrupted and Over-Capacity Requests

Remapping of PEs to new servers can disrupt active connections. In this subsection, we investigate the impact of each remapping scheme on connection disruption. We also study the number of over-capacity requests, assuming the physical capacity limit of the servers to be equal to 1.6 times the capacity parameter used in the remapping algorithms.

Figure 3.4: Number of concurrent requests for each scheme (Small objects group). Panels: (a) pb, (b) sac, (c) slb, (d) alb-a, (e) alb-o. Axes: concurrent requests at each server vs. time (hours in GMT); curves: aggregate (pb panel) and individual servers 1-11.

Figure 3.5: Service data rate for each scheme (Large files group). Panels: (a) pb, (b) sac, (c) slb, (d) alb-a, (e) alb-o. Axes: data rate at each server (MB/s) vs. time (hours in GMT); curves for servers 1-8.

Figure 3.6: Service data rate for each scheme (Small objects group). Panels: (a) pb, (b) sac, (c) slb, (d) alb-a, (e) alb-o. Axes: data rate at each server (MB/s) vs. time (hours in GMT); curves for servers 1-11.

Figure 3.7: Disrupted and over-capacity requests for each scheme (y-axis in log scale).

Specifically, the server physical capacity is assumed to be 2500 concurrent requests in the large file group and 500 concurrent requests in the small file group. The results are shown in Figure 3.7. Since sac only considers PE-to-server proximity in its mappings and the proximity does not change, sac mappings never change and thus connection disruption does not occur. However, by not considering load, this scheme exhibits many over-capacity requests – over 18% in the large-file group. In contrast, slb always remaps to achieve as balanced a load distribution as possible. As a result, it has no over-capacity requests but a noticeable number of connection disruptions. The overall number of negatively affected requests is much smaller than for sac, but as we will see in the next section, this comes at the cost of increased request air miles. Figure 3.7 shows a significant improvement of both alb-a and alb-o over sac and slb in the number of affected connections. Furthermore, by remapping PEs judiciously, alb-o reduces the disruptions by an order of magnitude over alb-a without affecting the number of over-capacity requests. Overall, alb-o reduces the number of negatively affected connections by two orders of magnitude over sac, by an order of magnitude over slb in the small files group, and by a factor of 5 over slb in the large file group.

Finally, Figure 3.7 shows that large file downloads are more susceptible to disruption in all the schemes performing dynamic remapping. This is because the longer service response of a large download increases its chance of being remapped during its lifetime (e.g., in the extreme, if an algorithm remapped all active connections every time, every connection lasting over 120 seconds would be disrupted). This confirms our architectural assumption concerning the need for application-level redirection for long-lived sessions. In summary, the disruption we observed in our experiments is negligible: at most 0.04% for the ALB-O algorithm (which we ultimately advocate), and even less – 0.015% – for small object downloads. Further, disruption under the ALB-O algorithm happens only when the platform is already overloaded, i.e., when the quality of service is already compromised. In fact, by pinning a client to a fixed server at the beginning of the download, DNS-based CDNs may lead to poor performance in long-running downloads (during which conditions can change). On the other hand, with a simple extension to browsers, as we show in a separate work [8], an anycast-based CDN could trigger these disruptions intentionally to switch the user to a different server on the fly.

3.6.3 Request Air Miles

This subsection considers the cost of each redirection scheme, measured as the average number of air miles a request must travel within the CDN's ISP. Figures 3.8(a) and 3.8(b) show the ratio of each scheme's average cost to the pb average cost, calculated every 120 seconds.

Figure 3.8: Average miles for requests calculated every 120 seconds. Panels: (a) Small objects group, (b) Large files group. Axes: ratio of scheme miles over playback miles vs. time (hours in GMT); curves: SLB, SAC, ALB-A, ALB-O.

In sac, a PE is always mapped to the closest server, and the average mileage for a request is always the smallest (at the cost of a high drop ratio, as previously shown). This can be viewed as the optimal cost one could achieve, and thus sac always has the lowest ratio in Figure 3.8. slb balances the load among servers without taking cost into account and leads to the highest cost. We observe in Figure 3.8(a) that alb-a is nearly optimal in cost when the load is low (e.g., at 8am) because in this case each PE can be assigned to its closest server. As the traffic load increases, however, not all PEs can be served by their closest servers without violating the capacity constraint. Then the cost grows as some PEs are re-mapped to different (farther) servers. alb-o also finds an optimal-cost mapping in the beginning, when the load is low. As the load increases, alb-o behaves differently from alb-a: alb-o attempts to maintain the current PE-server assignment as much as possible, while alb-a attempts to minimize the cost even when the resulting mapping may disrupt many connections (Figure 3.7). This restricts the solution space for alb-o compared to alb-a, which in turn increases the cost of the alb-o solution.

With our focus on optimizing the costs for the ISP, our optimization formulation does not restrict the distance for any individual request. Thus, a pertinent question is to what extent individual requests might be penalized by our schemes. Consequently, Figure 3.9 plots the ratio of the cost of the 99th-percentile requests in each scheme. Specifically, in every 120-second interval, we find the request whose cost is higher than that of 99% of all requests in a given anycast scheme and the request whose cost is higher than that of 99% of all requests in the playback, and we plot the ratio of the costs of these two requests. Note that because the possible costs for individual requests can only take discrete values, the curves are less “noisy” than those for the average costs. We can see that both adaptive anycast algorithms do not penalize individual requests excessively. The ALB-A algorithm actually reduces the cost for a 99th-percentile request compared to the playback, and ALB-O's penalty is at most 12.5% for the Large File Downloads group and 37.5% for the Small Web Objects group.

Figure 3.9: 99th percentile of request miles calculated every 120 seconds. Panels: (a) Small objects group, (b) Large files group. Axes: ratio of scheme miles over playback miles vs. time (hours in GMT); curves: SLB, SAC, ALB-A, ALB-O.

Figure 3.10: Execution time of the alb-a and alb-o algorithms in the trace environment. Panels: (a) Small objects group, (b) Large files group. Axes: execution time (seconds) vs. time (hours in GMT); curves: ALB-A, ALB-O.

3.6.4 Computational Cost of Remapping

We now consider the execution time of the remapping algorithms themselves, concentrating on alb-a and alb-o as they are the most computationally intensive and, as shown above, exhibit the best overall performance among the algorithms we considered. We first time the algorithms in the trace environment and then consider how they scale with a potential growth of the platform size. All the experiments were conducted on a single-core Intel Pentium 4 PC with a 3.2 GHz CPU and 1 GB of RAM, running the Linux 2.6.31.12 kernel.

Figure 3.10 plots the measured execution time of both remapping algorithms in our trace-driven simulation. Each data point reflects the actual measured time of an execution after a 120-second remapping interval. The figure shows that for both the small and large groups, neither algorithm ever takes more than 0.5 s to execute, and in most cases the time is much lower: the 95th percentile is 0.26 s for alb-a and 0.14 s for alb-o in the small objects group, and 0.23 s and 0.17 s, respectively, in the large files group. This is negligible compared with the expected frequency of remapping decisions.

We also observe that alb-o is more efficient than alb-a. This is because alb-o performs remapping only for overloaded servers, in effect reducing the size of the solution search space and in fact often not solving the LP problem at all (which is reflected in the seemingly zero execution times in the figure).

We now turn to the question of how our algorithms will scale with the platform size. To this end, we time the algorithms in a synthetic environment with a randomly generated workload. We consider a platform with 1000 PEs and up to 100 data centers (compared to the trace environment of around 550 PEs and 8-11 data centers). Each data center (represented as a single aggregate server) has a maximum capacity of 1000 concurrent connections. To generate the synthetic workload, we start with a given fraction of the aggregate platform capacity as the total offered load, and distribute this offered load randomly among the PEs in the following three steps (a code sketch of this procedure follows the list).

• We iterate through the PEs, and for each PE, we assign it a random load between 0 and the maximum server capacity (1000). This step results in a random load assignment, but the aggregate offered load can significantly deviate from the target level. We bring it close to the target level (within 10%) in the next two steps.

• While the total load assigned to all the PEs is below 0.9 of the target:

– Pick a random PE P and a random load value L between 0 and 250 (one-fourth of the server capacity);

– If current load(P) + L is less than the server capacity, add L to P's offered load.

• While the total load assigned to all the PEs is above 1.1 of the target:

– Pick a random PE P and a random load value L between 0 and 250 (one-fourth of the server capacity);

– If current load(P) − L > 0, subtract L from P's offered load.
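A direct transcription of the three steps above, using the parameter values from the text (1000 PEs, a per-server capacity of 1000, and adjustment chunks of up to 250):

import random

NUM_PES = 1000
SERVER_CAPACITY = 1000
CHUNK = SERVER_CAPACITY // 4                 # 250: one-fourth of the server capacity

def generate_offered_load(target_total):
    # Step 1: random initial load per PE.
    load = [random.randint(0, SERVER_CAPACITY) for _ in range(NUM_PES)]
    # Step 2: top up while the aggregate is below 90% of the target.
    while sum(load) < 0.9 * target_total:
        p, l = random.randrange(NUM_PES), random.randint(0, CHUNK)
        if load[p] + l < SERVER_CAPACITY:
            load[p] += l
    # Step 3: trim while the aggregate is above 110% of the target.
    while sum(load) > 1.1 * target_total:
        p, l = random.randrange(NUM_PES), random.randint(0, CHUNK)
        if load[p] - l > 0:
            load[p] -= l
    return load

# Example: offer 75% of the aggregate capacity of a 100-data-center platform.
offered = generate_offered_load(0.75 * 100 * SERVER_CAPACITY)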

We perform the above random load assignment every two minutes and then time our algorithms as they remap the PEs to servers. Further, to see how the running time depends on the overall load (one can expect that the higher the total load relative to the total capacity, the harder the algorithm has to work to find a solution), we increase the load target every hour. The resulting pattern of the total offered load relative to the total capacity is shown in Figure 3.11. Again, within each period of stable total load, its distribution among the PEs changes randomly every two minutes. Figure 3.12 shows the execution time of the alb-a and alb-o algorithms for different platform sizes. Understandably, larger platform sizes translate to greater execution times. However, even for 100 data centers, as long as the total load is within 75% of capacity, alb-o generally completes remapping in under 5 s, and alb-a within 10 s.

Figure 3.11: Total offered load pattern (synthetic environment). Axes: percentage of total offered load relative to system capacity vs. simulated time (hours in GMT); curves for 25, 50, 75, and 100 servers.

The more efficient execution of alb-o is again due to the fact that it performs only an incremental remapping each time5, and as we see, as the platform grows in size, the difference can be significant. This (in addition to the reduction in the number of disrupted connections) again argues in favor of alb-o. Overall, we conclude that even using our very modest machine for remapping, the execution time of our remapping algorithms, especially alb-o, is acceptable for a platform of significant scale as long as the total load does not approach the total platform capacity too closely. The fact that our algorithms slow down significantly under extreme load conditions suggests a strategy where the remapping algorithm first checks the total load and, if it is found close to (e.g., over 75% of) the platform capacity, switches to a “survival mode” whereby it no longer solves the cost optimization problem but merely redistributes excess load from overloaded to underloaded servers.
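The suggested “survival mode” amounts to a simple guard in front of the optimizer, as sketched below; solve_min_cost and shed_excess_load are placeholders for the cost-optimizing LP and for a simple pass that moves excess load off overloaded servers.

def choose_remapping(total_load, total_capacity, loads, capacities, costs,
                     solve_min_cost, shed_excess_load, threshold=0.75):
    if total_load > threshold * total_capacity:
        # Too close to saturation: skip the cost optimization entirely and just
        # redistribute excess load so that no server stays overloaded.
        return shed_excess_load(loads, capacities)
    return solve_min_cost(loads, capacities, costs)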

5 Observe that its initial remapping takes the same time as in the alb-a case.

Figure 3.12: Scalability of the alb-a and alb-o algorithms in a synthetic environment. Panels: (a) ALB-A, (b) ALB-O. Axes: execution time (seconds) vs. simulated time (hours in GMT); curves for 25, 50, 75, and 100 servers.

3.6.5 The Effect of Remapping Interval

In this section, we consider the issue of selecting the remapping interval ∆. Specifically, we consider how different values of the remapping interval affect our main performance metrics: the number of disrupted connections, the cost of operation (measured as the air miles that a request must travel within the AS), and the number of connections dropped due to server overload. Since we already showed that the alb-a and alb-o algorithms exhibit the best performance among the algorithms we considered, we concentrate on these two algorithms here.

We ran our simulations with the default server capacity (1900 and 312 concurrent requests for the large and small file groups, respectively), which we refer to as the 100% capacity scenario, and with a lower capacity equal to 75% of the default. Lowering the server capacity for the same load (which is given by the trace) allows us to investigate the behavior of the system under high-load conditions. To consider the effect of different assumptions, in this subsection we assume that over-capacity requests are dropped by the admission control mechanism upon arrival; hence they do not consume any server resources.

In the experiments of this subsection, we run our simulations for the entire trace but collect the results only for the last 6-hour trace period. This allows every scenario to experience at least one remapping (not counting the initial remapping at the end of the first second) before the results are collected. For instance, for ∆ = 6 hours, the first remapping occurs at the first second of the trace and is affected by the initially idle servers, the second remapping occurs at 6 hours, and the results are collected in the remaining 6 hours of the trace. For smaller deltas, the results are collected for the same trace period to make them comparable across different deltas. Note that this is different from the previous experiments, where the remapping interval was fixed at 120 s, which is negligible relative to the trace duration, allowing us to report the results for the entire trace.

Disruption Count

Figure 3.13: The effect of remapping interval on disrupted connections. Panels: (a) Small objects group, (b) Large files group. Axes: number of disrupted requests (log scale) vs. remapping interval (seconds); curves: ALB-Always-100, ALB-Always-75, ALB-O-100, ALB-O-75.

Figures 3.13(a) and 3.13(b) show the effect of the remapping interval on disrupted connections. As expected, the number of disrupted connections decreases as the remapping interval increases in both schemes. However, the figures confirm the superiority of alb-o in this metric: alb-o exhibits a smaller number of disrupted connections for all ∆ values. This is a consequence of the design of alb-o, which normally performs remapping only to relieve overloaded servers, resorting to a full remapping only when there is a significant potential for cost reduction. alb-a, on the other hand, performs remapping any time it can reduce the cost. Interestingly, the high-load scenario (corresponding to the 75% curves) does not significantly affect the disruptions. We speculate that this is because the trace period reported in these figures corresponds to the highest load in the trace, so even the 100% capacity scenario triggers a similar number of remappings.

Figure 3.14: The effect of remapping interval on cost (common 6-hour trace period). Panels: (a) Small objects group, (b) Large files group. Axes: average miles per request vs. remapping interval (seconds); curves: ALB-Always-100, ALB-Always-75, ALB-O-100, ALB-O-75.

Request Air Miles

We now turn to the effect of the remapping interval on the cost (in terms of average request air miles) of content delivery. Figures 3.14(a) and 3.14(b) show the results. An immediate and seemingly counter-intuitive observation is that costs generally decrease for larger remapping intervals. A closer inspection, however, reveals that this is because less frequent remappings miss overload conditions and do not rebalance the load by using suboptimal (e.g., more distant but less loaded) servers.

Indeed, the last 6-hour trace period reported in these graphs corresponds to the period of the highest load; as the load increases, higher values of ∆ retain a proximity-driven mapping from a less-loaded condition for longer. This has an especially pronounced effect for extremely large deltas, such as ∆ = 6 hours, when no remapping occurs after the load increase. Note also that these graphs reflect the costs for successful requests only. We will see how longer remapping intervals affect over-capacity requests below.

The comparison of different scenarios in Figures 3.14(a) and 3.14(b) reveals no further surprises. alb-a has a lower cost than alb-o for a given server capacity, and a lower-capacity scenario has a higher cost than the same scheme with a higher capacity. This is natural, since alb-a optimizes cost at every remapping interval while alb-o does so only when there is a compelling reason. Also, the lower the capacity of the servers, the more often the system must change mappings to relieve overloaded servers, at the expense of increased costs.

Over-Capacity Requests

Figure 3.15: The effect of remapping interval on dropped requests (common 6-hour trace period). Panels: (a) Small objects group, (b) Large files group. Axes: number of dropped requests (log scale) vs. remapping interval (seconds); curves: ALB-Always-100, ALB-Always-75, ALB-O-100, ALB-O-75.

Finally, we consider the effect of the remapping interval on over-capacity requests. Intuitively, larger remapping intervals must lead to more over-capacity requests, as the scheme would miss overload conditions between remappings. The results are shown in Figures 3.15(a) and 3.15(b). They confirm the above intuition; however, for the 100% capacity scenario, they show that no scheme exhibits any dropped requests until the remapping interval reaches 6 hours. Coupled with the previous results, this might suggest that very large values of delta, on the order of hours, should be used, as they decrease connection disruption without increasing the costs and dropped requests.

However, in setting the physical capacity limit to be 1.6 times the server capacity used by the algorithms, we provisioned a significant slack between the load level at which the algorithms attempt to rebalance load and the level at which a request is dropped.

Figure 3.16: The effect of over-provisioning on over-capacity requests (common 6-hour trace period). Panels: (a) Small objects group, (b) Large files group. Axes: percentage of over-capacity requests (log scale) vs. remapping interval (seconds); curves for over-capacity thresholds at 1.1, 1.2, 1.3, 1.4, 1.5, and 1.6 times the server capacity.

Consequently, Figures 3.16(a) and 3.16(b) show the behavior of the alb-o algorithm when the servers are provisioned with a smaller capacity slack. In these experiments, we use 100% server capacity and do not drop over-capacity requests. For the small files group, we still do not see over-capacity requests until the slack is reduced to 1.2. At 1.2x over-provisioning, over-capacity requests appear only when the remapping interval reaches 30 minutes. With an over-provisioning factor of 1.1, over-capacity requests appear at a 1-minute remapping interval and grow rapidly for larger intervals. In the large files group, reducing over-provisioning from the factor of 1.6 to 1.1 increases the over-capacity requests more smoothly. At 1.3x over-provisioning, over-capacity requests appear at remapping intervals of 5 minutes and higher. Less slack leads to over-capacity requests at deltas as small as 30 seconds. Again, once they appear, over-capacity requests increase with longer remapping intervals (with one exception in Figure 3.16(b), which we consider an aberration).

Overall, these results show the intricacies in tuning the system. Clearly, provisioning a larger capacity slack allows one to reduce the frequency of remappings: indeed, the proximity factor in remapping decisions does not change, and the load factor becomes less significant with the increased slack. This also results in a lower delivery cost, as the system rarely sends traffic to non-proximal servers because of overload. Less slack requires more frequent remappings and can result in a higher delivery cost. A proper choice of the remapping interval in this case requires careful analysis of the workload, similar to the one performed in this dissertation.

3.7 Summary

New route control mechanisms, as well as a better understanding of the behavior of IP anycast in operational settings, allowed us to revisit IP anycast as a CDN redirection mechanism.

In this chapter we present a load-aware IP anycast CDN architecture and describe algorithms that allow redirection to utilize IP anycast's inherent proximity properties without suffering the negative consequences of using IP anycast with session-based protocols.

This chapter also presents an evaluation of our algorithms using trace data from an operational CDN. We show that our algorithms perform almost as well as native IP anycast in terms of proximity. Our algorithms manage to keep server load within capacity constraints and significantly outperform other approaches in terms of the number of session disruptions. In the future we expect to gain experience with our approach in an operational deployment. We also plan to exploit the capabilities of our architecture to avoid network hotspots to further enhance our approach.

Chapter 4

Performance Implications of Unilateral Enabling of IPv6

4.1 Introduction

The address space of IPv4 is practically exhausted: the last block was allocated to the regional Internet registries in February 2011. While the registries can still distribute their allocated addresses internally, the last allocation brought the issue of IPv6 transition into stark focus. With the revived efforts for IPv6 transition, many clients are now dual-stack, that is, capable of using both the IPv4 and IPv6 protocols. Hence, another challenge for Web content providers is to properly route IPv4 vs. IPv6 clients.

High-profile Web content providers, e.g., Google, have started to deploy IPv6 platforms to serve redirected IPv6 clients [39]. However, as the overall Internet transition to IPv6 is lagging, the network paths between these clients and the content provider's servers commonly do not support IPv6, in which case the two end-hosts cannot communicate over IPv6 even if they are both IPv6-enabled. Despite a recent IETF standard1 on how end-hosts should handle this situation [89], in practice the

1 At the time of this measurement, the IETF standard [89] was in its early recommendation status.

lack of an end-to-end IPv6 path may expose the user to excessive delays or outright connectivity disruption. The possibility of such delays complicates the client redirection process and can influence the content provider's IPv6 transition strategy. For example, Google only directs clients to its IPv6 servers if they have verified end-to-end IPv6 connectivity and explicitly opted in for service over IPv6 [39].

This chapter quantifies the basis for such a conservative strategy. In other words, we try to answer an important question: what are the implications of an Internet platform, such as a Web content provider, unilaterally switching to a dual-stack mode, whereby it would simply redirect IPv6-enabled clients to an IPv6 server and IPv4 clients to an IPv4 server? In our approach, since almost every interaction on the Internet starts with a DNS request, content providers (we use a Web site as an example in this study) configure their authoritative DNS servers to resolve IPv6 requests with the IPv6 addresses of their platform servers and legacy DNS requests with the IPv4 addresses of their platform servers. Thus, clients indicating a willingness to communicate over IPv6 are allowed to do so immediately, even if an end-to-end IPv6 path between these clients and the platform servers might not exist.

We found no evidence of any performance penalty (subject to the 1-second granularity of our measurement) and only an extremely small increase in failures to download the object (from 0.0038% to 0.0064% of accesses). This suggests the feasibility of unilateral IPv6 deployment, which could in turn spur a speedier overall IPv6 transition.

4.2 Background

A user access to any Web service is usually preceded by a DNS resolution of the service domain name. An IPv6-enabled client would issue a DNS query for an IPv6 address (an AAAA-type query), while an IPv4 client would send an A-type query for an IPv4 address. Our goal is to assess the implications of unilateral enabling of dual-stack IPv4/IPv6 support by Web content providers. In this setup, the Web content provider would deploy both IPv6 and IPv4 service servers. The authoritative DNS server would then resolve AAAA DNS queries to the IPv6 address of the IPv6 server, and A-type queries to the IPv4 address of the IPv4 server. Thus, clients indicating a willingness to communicate over IPv6 are allowed to do so immediately. The danger of this approach is that, given the current state of IPv6 adoption in the core networks, a valid end-to-end IPv6 path or tunnel between a host pair may not exist, even if both end-points are IPv6-enabled.

When the IPv6 path does not exist, plausible scenarios for IPv6-enabled clients can be grouped into two categories. In the first scenario, the client follows the recent IETF recommendation [89] to avoid any delay in attempting to use an unreachable IPv6 service server. Basically, assuming our Web content is a Web site, clients would issue both AAAA and A queries to obtain both the IPv6 and IPv4 addresses of the Web site server(s), then establish IPv4 and IPv6 HTTP connections at the same time using both addresses; if the IPv6 connection advances through the TCP handshake, the IPv4 connection is abandoned through an RST segment. The other scenario is that the client attempts to use IPv6 first and then, after failing to connect, resorts to IPv4, which leads to a delay penalty. The macro-effects of a dual IPv4/IPv6 Web content server deployment are the result of complex interactions between the behaviors of user applications (browsers in the case of Web services), operating systems, and DNS resolvers, which differ widely, leading to drastically different delay penalties (see [35] for an excellent survey of different browser and OS behaviors). Consequently, to avoid the possibility of a high delay penalty, high-profile Web content providers, such as Google, only resolve AAAA DNS queries to IPv6 addresses for clients that have verified the existence of an end-to-end IPv6 path between themselves and Google and have explicitly opted in for IPv6 service.
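The first scenario can be illustrated with the following sketch of a client that opens IPv4 and IPv6 connections in parallel and keeps whichever TCP handshake completes first; the hostname, port, and timeout are illustrative, and closing the losing socket is a simplification of the RST-based abandonment described above (the sketch also waits for both attempts to finish, which a real browser would not).

import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

HOST = "sub.dns-research.com"      # placeholder; any dual-stack name works
PORT = 80

def connect(family):
    # Resolve HOST for the given address family and complete a TCP handshake.
    infos = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)
    fam, socktype, proto, _, addr = infos[0]
    sock = socket.socket(fam, socktype, proto)
    sock.settimeout(5.0)           # bounded wait instead of the long OS default
    sock.connect(addr)             # raises if, e.g., no end-to-end IPv6 path exists
    return sock

def dual_stack_connect():
    # Try IPv6 and IPv4 concurrently; keep the first handshake that succeeds.
    winner = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(connect, fam)
                   for fam in (socket.AF_INET6, socket.AF_INET)]
        for fut in as_completed(futures):
            try:
                sock = fut.result()
            except OSError:
                continue           # this family failed; wait for the other one
            if winner is None:
                winner = sock      # first successful handshake wins
            else:
                sock.close()       # abandon the slower connection
    if winner is None:
        raise OSError("neither IPv4 nor IPv6 connection succeeded")
    return winner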

This procedure is valuable as a demonstration and testbed for IPv6 migration, but it does not scale, since making client networks duplicate this procedure for every Web content provider is infeasible. Clients typically resolve DNS queries through a client-side DNS resolver (“LDNS”), which is often shared among multiple clients. It is possible that the resolver submits AAAA queries even if some (or all) of its clients are not IPv6-enabled. A Web content provider that unilaterally deploys IPv6 as described above has no way of knowing the status of IPv6 support of the actual client when the AAAA query arrives; it simply responds with the IPv6 address. Our measurement methodology captures any possible effects of this uncertainty. Thus, unless it causes confusion, we refer to all clients behind a resolver that sends AAAA queries as IPv6-enabled or, interchangeably, dual-stack.

4.3 Related Work

Much effort has been devoted to the IPv6 transition. A number of transition technologies have been proposed that help construct end-to-end IPv6 paths without the need for ubiquitous deployment of IPv6 network infrastructure (see, e.g., [24, 84, 57, 33, 48]). We look at another aspect of IPv6 migration, namely, the penalty for unilateral IPv6 enabling when the end-to-end path does not exist.

A number of studies have reported on the extent of IPv6 penetration from a variety of vantage points. In particular, Shen et al. [77] used netflow data from a Chinese tier-1 ISP, Savola [73] and Hei and Yamazki [43] analyzed data collected on 6to4 relays, Kreibich et al. [55] employed user-launched measurements, Malone [61] and Huston [49] studied IPv6 traffic attracted to IPv6-connected Web sites, and Karpilovsky et al. [53] considered IPv6 penetration from several vantage points, including netflows in core networks, address allocations, and BGP route announcements. A general conclusion of these studies is that IPv6 deployment remains low. For example, Huston found that in 2009, end-to-end IPv6 connectivity was available to only around 1% of the clients of the two Web sites he considered. These findings motivate our study by showing that most clients receiving an IPv6 address from a unilaterally IPv6-enabled Web site would have no end-to-end IPv6 connectivity to the site.

Several studies considered the performance of the current IPv6 network infrastructure. Zhou and Van Mieghem [90] compared the end-to-end delay of IPv6 and IPv4 packets between selected end-hosts and observed that IPv6 paths had higher variation in delay. Colitti et al. [30] compared the latency experienced by clients accessing the Google platform over IPv4 and IPv6 and found little difference once the effect of processing at tunnel termination points is factored out (otherwise the IPv6 latency was slightly higher). While that study considered the performance of IPv6 clients that had an end-to-end IPv6 path to the platform, we focus on the performance implications for IPv6-enabled clients that do not have this connectivity.

4.4 Methodology

We used the following methodology to measure the performance implications of unilateral IPv6 deployment when the client cannot reach the IPv6 server due to the lack of an end-to-end IPv6 path or tunnel. We utilized the same setup used in Section 2.3 and described in Figure 2.1. In other words, we used a Web site as the Web content that clients are trying to reach. We used the same domain dns-research.com and utilized the specialized DNS server to act as its authoritative DNS server (ADNS), as well as the specialized Web server to host a single object (a one-pixel image) from the subdomain sub.dns-research.com. We configured our specialized DNS server to respond to IPv6 queries (type-AAAA requests) for any hostname from the domain sub.dns-research.com with a non-existent IPv6 address, and to any IPv4 DNS queries (type-A requests) with the valid IPv4 address of our Web server.

Figure 4.1: Measurement Setup. Presumed interactions are marked in blue font. (The diagram depicts the numbered DNS and HTTP interactions, steps 1-13, between a client at 1.2.3.4, its LDNS resolver, and our instrumented ADNS/HTTP server at 5.6.7.8, as described in the text.)

Responding to IPv6 queries with a non-existent IPv6 address mimics the situation where there is no end-to-end IPv6 path between the client and the server and thus the server is unreachable2. We then measure the delay penalty as the time between when we send the unreachable IPv6 address to the client's DNS resolver and when the client falls back to IPv4 (i.e., when the corresponding HTTP request from the client arrives over IPv4). We deployed both our ADNS and Web servers on the same host so that we could measure time intervals between DNS and HTTP events without clock skew.

An illustration of our setup with the IPv6 interactions highlighted is given in Figure 4.1. As a reminder, to associate a given DNS query with the subsequent HTTP

2Indeed, attempting to communicate with our non-existent address has the same effect as an attempt to communicate with an existing IPv6 destination over a non-existent path, which is the same as a path that is not end-to-end IPv6-enabled.

request, we first associate a DNS query with the originating client using the approach from [62]. Then, when a user tries to access the hosted image, their browser first sends a DNS query for dns-research.com (the “base query”) to the user's DNS resolver (step 1 in the figure), which then sends it to our ADNS server (step 2). An IPv6-enabled client network is likely to send both A and AAAA DNS queries. Since the base DNS queries cannot be reliably associated with the clients, our ADNS responds with NXDOMAIN (“Non-Existent Domain”) to the AAAA query and with the proper IPv4 address of our Web server to the A query (step 3). The resolver forwards the response to the client (step 4), which then sends the HTTP request for this image to our server (step 5). Our Web server returns an HTTP 302 (“Moved”) response (step 6) redirecting the client to another URL in the sub.dns-research.com domain3, with a host name that embeds the client's IP address (we refer to these queries as “sub” requests). The client needs to resolve this name through its resolver again (steps 6-9). This time the DNS query can be reliably attributed to the HTTP client through the client's IP address embedded in the hostname.

Having associated the DNS query with the originating client, we measure the delay between the arrival of the AAAA query in step 8 and the first subsequent HTTP request from the same client in step 13 as the delay penalty for unilateral IPv6 deployment. To eliminate HTTP requests that utilized previously cached DNS resolutions (as their time since the preceding DNS interaction would obviously not indicate the delay penalty), we measure an IPv6 delay incident for an HTTP request only if it was immediately preceded (i.e., without another interposed HTTP request) by a full DNS interaction, including both A and AAAA requests, for that client. We contrast these delays with the delays for non-IPv6-enabled clients, whose resolvers did not send AAAA queries. We use the same technique to associate these HTTP clients with DNS queries, and measure the delays as the time between the type-A DNS query in step 2 and the first subsequent HTTP request in step 13.

As we mentioned earlier in Section 2.3, we collaborated with a major consumer-oriented Web site to embed the starting URL of our image into their home page. Whenever a Web browser visits the home page, the browser downloads the linked image and the interactions in Figure 4.1 take place. We used a low TTL of 10 seconds for our DNS records. This allowed us to obtain repeated measurements from the same client without overwhelming our setup. Further, our Web server adds a “cache-control:no-cache” header field to its HTTP responses to make sure we receive every request to our special image. Unfortunately, the conditions of this collaboration prevent us from releasing the datasets collected in the course of our experiment.

3 In reality, our setup involved more redirections, as discussed in Section 2.3; we omit these details for clarity, as they are unrelated to the measurement study in this chapter.
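The delay-penalty computation over the collected logs can be sketched as follows; the record fields (epoch timestamps, the client IP embedded in the queried hostname) are assumptions about the log format rather than the exact schema of our logs.

def client_ip_from_hostname(hostname):
    # Recover "1.2.3.4" from a "sub" hostname such as "1_2_3_4.sub.dns-research.com".
    return hostname.split(".")[0].replace("_", ".")

def ipv6_delay_penalties(dns_records, http_records):
    # Pair each AAAA "sub" query with the first later HTTP request from the same
    # client, counting a delay only if the HTTP request was immediately preceded
    # by a full A+AAAA DNS interaction for that client (no cached resolutions).
    last_aaaa, last_a = {}, {}
    events = sorted([("dns", r["time"], r) for r in dns_records] +
                    [("http", r["time"], r) for r in http_records],
                    key=lambda e: e[1])
    delays = []
    for kind, t, rec in events:
        if kind == "dns":
            client = client_ip_from_hostname(rec["qname"])
            (last_aaaa if rec["qtype"] == "AAAA" else last_a)[client] = t
        else:
            client = rec["client_ip"]
            if client in last_aaaa and client in last_a:
                delays.append(t - last_aaaa[client])
            last_aaaa.pop(client, None)    # require a fresh DNS interaction
            last_a.pop(client, None)       # before the next measurement
    return delays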

4.5 The Dataset

We have collected the DNS logs (including the timestamp of the query, LDNS IP, query type, and query string) and HTTP logs (request time, User-Agent and Host headers) resulting from the interactions described in the previous section.

Table 4.1: The basic IPv6 statistics

                     Base DNS requests    “Sub” DNS requests
  # Requests         19,945,037           2,398,367
  LDNS IP addrs      59,978               32,291
  Client IP addrs    No data              1,134,617

As a reminder of the high-level characteristics of our data set, please refer to Table 2.1 in Section 2.4. Basically, we collected over 21M client/LDNS associations between 11.3M unique client IP addresses from 17,778 autonomous systems (ASs) and almost 280K LDNS resolvers from 14,627 ASs.

Table 4.1 summarizes the general statistics about IPv6 requests, as well as the clients and LDNSs behind them. Out of the 278,559 LDNSs we observed during our experiment, almost 22% were IPv6-enabled (i.e., sent some AAAA queries). However, only around 54% of the latter sent AAAA “sub” requests, and the number of “sub” requests was much lower than that of the base queries. This is because some LDNS servers seem to cache the NXDOMAIN response (which, as discussed earlier, our DNS server returns to IPv6 queries for the base domain) and not issue queries for subdomains of the base domain, while other LDNS servers seem not to cache NXDOMAIN responses at all and send repeated base queries even when serving subsequent “sub” requests from their cache.

4.6 The Results

We now present our measurement results. We first consider whether unilateral IPv6 enabling entails any penalty in clients' DNS resolution, and then report our measurements of the overall delays.

4.6.1 DNS Resolution Penalty

Our first experiment investigates any potential delays in obtaining the IPv4 DNS resolution given that our IPv6 Web server is unreachable. If clients fail over to IPv4 only after being unable to connect to the IPv6 Web server, then it could be that the type-A DNS query would only arrive after the corresponding timeout. To test for this behavior, we consider the time between the A and AAAA “sub” request arrivals from the same client. Our immediate observation is that almost 88% of the 2.3 million AAAA “sub” requests were received after their corresponding A request. This says that not only do these clients/LDNSs perform both resolutions in parallel but, of the two DNS requests, they most likely send the A query first. For the remaining 12% of requests, Figure 4.2 shows the CDF of the time difference between the A and AAAA “sub” requests. The figure indicates that even among these requests, most clients did not wait for a failed attempt to contact the IPv6 Web server before obtaining the IPv4 address.

Figure 4.2: Time difference between A and AAAA “sub” requests. (CDF; x-axis: number of seconds between the AAAA and A sub requests, log scale.)

Indeed, even assuming the accelerated default connection timeouts used in this case by Safari and Chrome (270 ms and 300 ms, respectively [35], as opposed to hundreds of seconds for the regular TCP timeout [7]), roughly 70% of the type-A queries in these requests arrived within this timeout value. We conclude that a vast majority (roughly 88 + 0.7 × 12 ≈ 95%) of requests do not incur an extra DNS resolution penalty due to IPv6 deployment.

4.6.2 End-to-End Penalty

Our first concern is to see whether unilateral IPv6 enabling can lead to disruption of Web accesses, that is, whether the IPv6-enabled clients successfully fail over to IPv4 for HTTP downloads. We compare the rates of interactions where the HTTP request fails to arrive following the AAAA DNS query, either until the next DNS interaction from the same client or until the end of the trace.

Figure 4.3: Comparison of all IPv6 and IPv4 delays. (CDFs; x-axis: number of seconds between the DNS and HTTP sub requests, log scale; curves: all IPv4 delays, all IPv6 delays.)

For IPv6-enabled clients, these lost HTTP requests amounted to 154 out of 2,398,367 total interactions, or 0.0064%. For IPv4-only clients, this number was 1,217 lost requests out of the total (34.4M − 2.4M), or 0.0038%. Although the rate of lost requests for IPv6-enabled clients is higher, both rates are so extremely low that they can be considered insignificant.

Turning to assessing the upper bound on the overall delay for IPv6-enabled clients, we measure the time between the arrivals of the AAAA “sub” DNS request (a conservative estimate of when the client receives the unreachable IPv6 address) and the actual subsequent HTTP request by the client. As a reminder, to eliminate HTTP requests that utilized previously cached DNS resolutions, we measure the incidents of

IPv6 delays for an HTTP request only if it was immediately preceded (i.e., without another interposed HTTP request) by a full DNS interaction, including both A and AAAA sub requests for that client. Applying this condition resulted in 1,949,231

instances of IPv6 delays from 1,086,323 unique client IP addresses. Our HTTP logs provide timestamps with granularity of one second; thus we can only report our delays at this granularity.

Figure 4.3: Comparison of all IPv6 and IPv4 delays (CDF of the number of seconds between DNS and HTTP “sub” requests).

Figure 4.4: IPv4 and IPv6 delays per client (CDFs of average and maximum delays).

Figures 4.3 and 4.4 compare delays incurred by IPv6-enabled and IPv4-only clients. Figure 4.3 shows CDFs of all delays across all clients in the respective categories (i.e., multiple delay instances from the same client are counted multiple times) and Figure 4.4 shows the CDFs of average and maximum delays observed per client. Both figures concentrate on delays within 100s. There were 0.063% of IPv6 delays and 0.076% of IPv4 delays exceeding 100s, with the maximum IPv6 delay of 1.2M sec and the maximum IPv4 delay of 1.8M sec. We attribute the exceedingly long delays to a combination of clients commonly violating DNS time-to-live (as first observed in [66]) with corner cases such as duplicate DNS requests resulting from a single client interaction (a behavior that we directly observed in a different study). For instance, one HTTP

request on January 7 was surrounded by 6 DNS queries, two of which arrived after the HTTP request; since there were no more DNS requests until the next HTTP request on January 27 (presumably due to a TTL violation), this scenario contributed a delay of 1.7M sec.

Neither figure shows significant differences in delay between the two categories of clients. In fact, where one can discern a difference, the delay distributions actually show a lower delay penalty for IPv6-enabled clients. The maximum per-client delays show the most discernible difference; this could be explained by the fact that

there are an order of magnitude more IPv4-only interactions, and thus there is a higher chance of an outlier value of maximum delay. While the one-second measurement granularity is clearly a limitation of this experiment, our study finds no evidence of delay penalty and in any case provides an upper bound of 1 second for any penalty

that could not be measured.
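The filtering condition above can be made concrete with a small sketch. The per-client event stream below (DNS A/AAAA “sub” queries and HTTP requests with timestamps) is a hypothetical abstraction of the merged logs; the function records a delay sample only when an HTTP request is immediately preceded by a full DNS interaction.

```python
# Sketch of the delay measurement: an HTTP request contributes an "IPv6 delay"
# sample only if no other HTTP request intervened since a full DNS interaction
# (both A and AAAA "sub" queries) was observed for that client.

def ipv6_delays(events):
    """events: time-ordered list of ('dns', 'A'|'AAAA', ts) and ('http', None, ts) tuples."""
    delays = []
    pending = {}                    # query types seen since the last HTTP request
    for kind, qtype, ts in events:
        if kind == "dns":
            pending[qtype] = ts
        else:                       # HTTP request
            if "A" in pending and "AAAA" in pending:
                delays.append(ts - pending["AAAA"])   # time since the AAAA "sub" query
            pending = {}            # a later HTTP request needs a fresh DNS interaction
    return delays

# Example: A and AAAA at t=10.0 and t=10.5, HTTP at t=11.0 -> one sample of 0.5 s.
print(ipv6_delays([("dns", "A", 10.0), ("dns", "AAAA", 10.5), ("http", None, 11.0)]))
```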

4.7 Summary

The transition to IPv6 is imminent as the last block of IPv4 addresses is already allocated to regional Internet registries. Many end-hosts are already IPv6 enabled. However, not all Internet content providers have enabled IPv6 access to their service

platforms. High profile Web content providers such as Google only enable access to their IPv6 platform for clients who explicitly enroll for such service and prove that a valid IPv6 end-to-end path exists for them. In this chapter, we present a measurement study to assess the performance penalty for unilateral IPv6 adoption by an Internet platform. Our results show no evidence of such performance penalty and an extremely small increase in failure to download the object (from 0.0038% to 0.0064% of accesses).

Chapter 5

IPv6 Anycast CDNs

5.1 Introduction

The transition to IPv6 has been accelerating as the supply of large unallocated IPv4 address blocks has been exhausted and allocated IP address space is rapidly running out. Once IPv6 gains wider adoption, the Internet will need content delivery networks that operate in the IPv6 environment and route clients and requests properly to IPv6 or IPv4 platforms. IPv6 increased the size of the IP address space; however, it retains the spirit of the DNS protocol and IP routing, and therefore the general mechanisms behind CDN request routing considered in the previous chapters apply equally to both IPv4 and IPv6.

However, IPv6 offers additional functionality that can be leveraged to implement request routing in a more flexible manner. The designers of IPv6 reserved an addressing class for anycast addresses in the IPv6 addressing model defined by RFC 3513 [44]. Nevertheless, the same RFC restricted using anycast IPv6 addresses in the source field of any IPv6 packet. These restrictions were lifted in RFC 4291 [45]; however, that RFC does not discuss the mechanisms and approach for IPv6 anycast

to be implemented. In fact, most current operating systems do not provide the option to create an IPv6 address of type anycast. And if they do, as in the case of FreeBSD [1], the system is forced to prevent any packet from having the anycast address as a source, as per [50], which recommends disconnecting all TCP connections toward an IPv6 anycast address. In this chapter, we present a general lightweight IPv6 anycast protocol for communication utilizing connection-oriented transport and then use this protocol to design an architecture for an IPv6 CDN based on anycast request routing. This design relies heavily on IPv6 mobility support. In this dissertation we present the design details; the evaluation will be part of future work.

5.2 Background

5.2.1 IPv6

IPv6 was designed as the successor to IPv4 [34]. In addition to expanding the addressing capabilities and introducing new addressing classes, IPv6 simplified the packet header format and provided greater flexibility for introducing new extensions.

An IPv6 packet header consists of a fixed mandatory portion required for all packets and may be followed by optional extensions to support additional features. The mandatory portion of the header occupies the first 40 octets (320 bits) of the IPv6 packet. The extension headers and the data payload are of variable data length. Figure 5.1 shows the basic format of an IPv6 packet header.

Figure 5.1: IPv6 Packet Header Format.

The mandatory (main) header is required for all IPv6 datagrams. It contains the source and destination addresses as well as control information for processing and routing of the IPv6 datagram. These control fields begin with a version field, which identifies the version of the IP protocol used to generate the datagram, similar to IPv4 except that it carries the binary value 0110 for IP version 6. After the version field is an 8-bit traffic class field for traffic classification, which replaces the Type of Service (TOS) field in IPv4 and uses the differentiated services method defined in [41]. Next is the flow label field, which consists of 20 bits; it was created to provide additional support for real-time streams. Next is the payload length field, which replaces the Total Length field in IPv4 but carries only the number of bytes of the payload (which includes the extension headers). The “Next Header” field helps the receiver interpret the packet format beyond the mandatory part: it specifies the identity of the first extension header if there are extensions in the packet, and otherwise the upper-layer protocol type (same as the Protocol field in IPv4). Finally, the hop limit field replaces the TTL field of the IPv4 header; the field is renamed to better reflect its actual usage.
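As a concrete view of the field layout just described, the sketch below packs the 40-octet mandatory header with Python’s struct module. Field widths follow RFC 2460; the addresses are placeholder 16-byte values used only for illustration.

```python
# Minimal sketch of the 40-byte fixed IPv6 header.
import struct

def build_ipv6_header(payload_len, next_header, hop_limit,
                      src16, dst16, traffic_class=0, flow_label=0):
    # First 32-bit word: version (4 bits) | traffic class (8 bits) | flow label (20 bits)
    first_word = (6 << 28) | (traffic_class << 20) | flow_label
    return struct.pack("!IHBB16s16s",
                       first_word, payload_len, next_header, hop_limit,
                       src16, dst16)

hdr = build_ipv6_header(payload_len=20, next_header=6,   # 6 = TCP, no extension headers
                        hop_limit=64,
                        src16=b"\x20\x01" + b"\x00" * 14,
                        dst16=b"\x20\x01" + b"\x00" * 13 + b"\x01")
assert len(hdr) == 40   # the mandatory portion always occupies 40 octets
```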

After the main header of an IPv6 datagram, one or more extension headers may appear. These headers were created to provide both flexibility and efficiency: they are included only when needed, which allows the main datagram header to stay small and streamlined, containing only those fields that really must be present all the time. When extension headers are included in an IPv6 datagram, they appear one after the other following the main header.

One type of IPv6 extension that is of interest to this dissertation is the Destination Options extension header (DstOpt). DstOpt is used to carry options intended for the ultimate destination (DstOpt can also be used for intermediate hops, but in this dissertation we are interested in the DstOpt for the ultimate destination). Figure 5.2 shows the format of the IPv6 DstOpt extension header. The first octet indicates the “Next Header” type of the extension header that follows this extension (or the upper-layer protocol if this is the last header). Next is the header extension length (Hdr Ext Len) field, followed by the option type and option length. If more than one option is needed, the options are stacked one after the other, each preceded by its option type and data length. The Hdr Ext Len field gives the length of the extension header in 8-octet units, not counting the first 8 octets.

Figure 5.2: IPv6 Destination Option Header Format.
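The sketch below assembles one possible DstOpt header carrying the MIPv6 Home Address option (option type 201), which the mobility and anycast mechanisms later in this chapter rely on; a 4-byte PadN option aligns the Home Address option as RFC 6275 prescribes. This illustrates the TLV layout under those RFC conventions and is not code from the dissertation.

```python
# Sketch: Destination Options extension header with a Home Address option.
import struct

def dstopt_home_address(next_header, home_addr16):
    padn = bytes([1, 2, 0, 0])                  # PadN option: 4 bytes of padding
    home_opt = bytes([201, 16]) + home_addr16   # type=201, length=16, 16-byte home address
    options = padn + home_opt
    # Hdr Ext Len is in 8-octet units, not counting the first 8 octets (24 octets -> 2)
    hdr_ext_len = (2 + len(options)) // 8 - 1
    return struct.pack("!BB", next_header, hdr_ext_len) + options

ext = dstopt_home_address(next_header=6, home_addr16=b"\x20\x01" + b"\x00" * 14)
assert len(ext) == 24
```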

5.2.2 TCP

In this chapter, we focus on connection-oriented transport to present our IPv6 anycast architecture. A connection-oriented transport is essential for our architecture as it allows our design to be secure and lightweight. TCP is the dominant connection-oriented transport protocol among all Internet protocols. Our architecture will focus on TCP, although we believe that it can be easily ported to any connection-oriented transport.

Figure 5.3 shows a summary of OS-related actions at the client and server sides

for establishing a typical TCP connection. To establish a connection, the client starts by calling the connect() function, which sends a SYN packet to the server. The client enters the SYN-Sent state and waits for the SYN-ACK from the server.

Figure 5.3: Typical TCP Connection.

On the other side of the connection, assuming the server has already called the listen() and accept() system calls, the server enters the accept loop and waits for a connection request from the client. When the SYN packet is received from the client, the server creates a new socket and adds that socket to the SYN-Received

queue. The server sends the SYN-ACK packet back to the client to acknowledge the connection. At this point the connection is not yet established at the server side. However, on the client side, upon receiving the SYN-ACK packet, the client replies with an ACK packet (which might piggyback the first portion of the client's payload data along the way), and at this point the TCP connection is fully established at the client side. When the ACK packet is received at the server side, the accept() system call returns successfully. The socket is moved out of the accept queue and given to the

server process. At this point the TCP connection is established at the server side. The server can now start sending data back to the client.
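The walk-through above maps directly onto the standard socket API. The self-contained sketch below (loopback address and ephemeral port chosen arbitrarily, assuming IPv6 loopback is available) runs both sides in one process: the server’s listen()/accept() and the client’s connect() correspond to the SYN, SYN-ACK, and ACK steps handled by the operating system underneath.

```python
# Minimal TCP establishment example mirroring Figure 5.3.
import socket, threading

srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
srv.bind(("::1", 0))                      # IPv6 loopback, ephemeral port
srv.listen(5)                             # backlog bounds the pending-connection queues
port = srv.getsockname()[1]

def client():
    cli = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    cli.connect(("::1", port))            # sends SYN; returns once the SYN-ACK is ACKed
    print("client got:", cli.recv(1024))
    cli.close()

threading.Thread(target=client).start()
conn, peer = srv.accept()                 # returns once the client's ACK arrives
conn.sendall(b"hello from server")        # in some cases the server sends data first
conn.close()
srv.close()
```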

5.2.3 IPv6 Mobility Overview

IPv6 Mobility (MIPv6) allows a client (the “correspondent node”, or CN, in mobile IP parlance) to communicate with a mobile node (MN) using the mobile node's home address (the permanent address of the node in its own home network). The mechanism, at a high level, is as follows. The mobile node is represented by two addresses: the permanent home address belonging to the MN's home network and a care-of address that belongs to the currently visited (“foreign”) network and which can change as a result of node mobility. The home network maintains a special router endowed with mobility support, called the home agent. As the mobile node moves from one foreign network to another, it keeps sending updates to its home agent, keeping it informed of its current care-of address. The correspondent node starts communication by sending the first packet to the MN's home address. As this packet enters the MN's home network, it is intercepted by the home agent, which tunnels the packet to the mobile node using the MN's current care-of address. To take the home agent out of the loop for subsequent communication, MIPv6 allows the mobile node to execute route optimization with the correspondent node.

For security reasons discussed shortly, the mobile node initiates route optimization by sending two messages to the correspondent node: HoTI (“home test initiation”) and CoTI (“care-of test initiation”). HoTI is sent through the home agent and CoTI directly. The CN responds to each message with a corresponding test message: HoT

(“home test”) through the home agent and CoT (“care-of test”) directly. Each message contains a piece of crypto material, both of which are needed to construct a puzzle (“binding management key”) that the CN requires to complete the protocol. Once the MN receives both messages, it constructs the puzzle and includes it in a special binding update field in a mobility header in the next data packet to the CN. This packet also includes the MN's care-of address as its source address and the home address in its destination option (DST OPT) header. When the CN receives this packet and verifies the puzzle, it stores the binding in a special cache called the binding cache. The CN utilizes this binding cache to modify the destination address of all packets destined to the home address of the MN. For such packets, the CN changes the destination address at the IP layer from the home address of the MN to the care-of address of the MN. The CN also adds an option header that contains the home address of the MN. At the other end of this communication, a reverse transformation happens within the IP layer of the MN: upon receiving packets from the CN, the MN changes the destination address from the care-of address to the home address. Packets traveling from the MN to the CN are also manipulated at the IP layers of both

CN and MN. The MN replaces the source address of these packets from the home address to the care-of address and also includes the home address in a special option header. The CN, upon receiving such packets, replaces the care-of address in the source field with the home address. The CN also strips off the additional option header. The application code at both ends of the communication is oblivious to any mobility-related issues, including the fact that the effective address of the MN (i.e., the care-of address) changes.
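The address rewriting just described can be sketched as a toy binding cache at the correspondent node's IP layer. The packet representation (a plain dictionary) and method names are hypothetical; the sketch is meant only to show the two directions of the transformation, not an operating-system implementation.

```python
class BindingCache:
    """Toy model of the CN-side binding cache used for MIPv6 route optimization."""

    def __init__(self):
        self.home_to_careof = {}

    def add_verified_binding(self, home, careof):
        # installed only after the HoT/CoT return-routability check succeeds
        self.home_to_careof[home] = careof

    def rewrite_outgoing(self, pkt):
        # packets addressed to the MN's home address are re-addressed to its
        # care-of address, with the home address carried in a destination option
        careof = self.home_to_careof.get(pkt["dst"])
        if careof:
            pkt["home_addr_option"] = pkt["dst"]
            pkt["dst"] = careof
        return pkt

    def rewrite_incoming(self, pkt):
        # reverse transformation: restore the home address as the source so that
        # upper layers never see the care-of address
        home = pkt.get("home_addr_option")
        if home is not None and self.home_to_careof.get(home) == pkt["src"]:
            pkt["src"] = home
            del pkt["home_addr_option"]
        return pkt

cache = BindingCache()
cache.add_verified_binding(home="2001:db8::10", careof="2001:db8:1::99")
print(cache.rewrite_outgoing({"src": "2001:db8::1", "dst": "2001:db8::10"}))
print(cache.rewrite_incoming({"src": "2001:db8:1::99", "dst": "2001:db8::1",
                              "home_addr_option": "2001:db8::10"}))
```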

Observe that the correspondent node cannot simply update its binding once it receives a packet with a new home-to-care-of address mapping. Indeed, this could enable any malicious node to hijack the communication by sending to the CN a packet that maps the MN's home address to the attacker's own address as a new care-of address.

By sending HoT and CoT messages and routing one of them through the home agent, the CN verifies that the host possessing the new care-of address is properly associated with the home agent. Moreover, HoTI and CoTI messages are also important because without them, an attacker could mount a reflected denial of service attack on the MN’s home agent. The attacker could simply send packets to the CN with the MN’s home address and some fake care-of addresses, causing the CN to bombard the home agent with HoT messages. In MIPv6, the CN only sends these messages in response to HoTI messages received from the home agent, i.e., when asked by the recipient.

5.3 Related Work

Previous work proposed using IPv6 mobility support to implement CDN request routing [3] as well as general anycast [82]. For request routing, [3] uses IPv6 Mobility binding updates to direct a client to one of the CDN nodes. Specifically, in the approach sketched in [3], the client starts its Web download from the CDN platform by opening a TCP connection to the IPv6 address of a request router, which acts like a home agent. The latter tunnels this request (i.e., the TCP SYN segment) to a CDN node selected for this client, which responds to the client with the SYN-ACK TCP segment using its own IP address as the source address (which serves as the care-of address of the mobile node) but including the original address of the request router as the home agent address and also supplying a binding update (BU) in the IPv6 mobility header. The client's IP layer then remembers the binding between the two

addresses and uses this new IP address of the CDN node as the destination address for subsequent communication, while also providing the original request router address in the destination option (DST OPT) header. The request routing mechanism sketched in [3] does not address the security issues mentioned in Section 5.2.3, especially the crucial vulnerability that a malicious server can hijack a client's Web interaction. This issue has been addressed in [82], which again leveraged IPv6 mobility mechanisms, but used the full official version of MIPv6, namely the HoTI/CoTI/HoT/CoT protocol, in the context of implementing general anycast. This scheme involves a complex protocol with a two-level hand-off of communication from the home agent to the so-called contact node and then to the final anycast end-point. At each level, the full HoTI/CoTI/HoT/CoT protocol is executed for security purposes. This scheme aims at avoiding any modifications to both the correspondent node and the home agent, and also at making the anycast end-point selection oblivious to upper-layer protocols: packet delivery can switch to another anycast end-point at any time during communication. Both schemes above are free of the drawbacks of both DNS-based and IPv4 anycast-based request routing. Unlike DNS-based request routing, CDN node selection occurs for the actual client and not its LDNS, can be done individually for each request (thus there is no issue with an unpredictable amount of load being redirected), and is done at the time of the request (removing the issue of coarse granularity of control). Unlike anycast-based request routing, request routing can fully reflect both CDN node load and network path conditions, and there is no possibility for session disruption. However, the first approach does not address security issues while the second, as we will show next, is unnecessarily heavyweight for the CDN context.

108 5.4 Lightweight IPv6 Anycast for Connection-Oriented

Communication

We now describe a lightweight IPv6 anycast that derives its efficiency from leveraging connection-oriented communication, which happens to be the predominant mode of Internet communication in general and of CDN-accelerated communication in particular.

To deploy IPv6 anycast, the platform must set up a set of anycast servers that share the same anycast address and a set of unicast servers, each with its own unicast address. These sets need not be distinct. In fact, it would not be uncommon to have a single set of servers where each server has both the shared anycast address and its own unicast address. Neither would it be uncommon to have a single or a small number of anycast servers that act as request routers: they would receive the first packet from a client, select a unicast server for subsequent communication, and hand off the connection to the selected unicast server. Each anycast server has a pre-installed secure channel to each unicast server.

Figure 5.4 shows the message exchange in setting up anycast communication using TCP as a connection-oriented transport protocol, although our scheme can be adapted to any session-oriented communication that includes a session establishment phase. As in any TCP connection, the client starts by sending a SYN packet to the announced IP address of the service, in our case the anycast address. Once the anycast server receives the SYN packet, it selects the unicast server to handle the incoming connection and passes the packet to this server via a secure tunnel. The unicast server responds to the client with a SYN-ACK packet using its own unicast address as the source address and piggybacks the anycast IP address in the DST OPT header.

109 Figure 5.4: TCP Interaction For an IPv6 Anycast Server

A SYN-ACK packet with a DST OPT informs the client that it has reached an anycast service. In order to establish the connection, the client needs to verify that the source of the SYN-ACK packet (the unicast server) is an authentic representative of the originally intended service (as represented by the anycast address, to which the client addressed its initial SYN packet and which is also reported in the DST OPT

field). To this end, the client issues tests very similar to the HoT and CoT messages. It sends the CoT message to the unicast server directly and the HoT message to the anycast address, to be tunneled to the unicast server reported in the DST-OPT. Note that the connection is not yet established at the client side: a SYN-ACK packet from a server of an anycast IPv6 service does not by itself establish the session at the client side. Once the unicast server receives both HoT and CoT messages, the server combines the tests and creates a binding update (BU) message in the same way as a mobile node would in MIPv6. Once the client receives the BU message, it activates

the binding entry in the IP layer's binding cache and replies with a BA (binding acknowledgment) message along with the piggybacked first chunk of application data. At this time, the connection is established at the client side. The binding occurs fully at the IP layer, which means that it is transparent to

transport and higher layers. The connection is now established between the client and the unicast server. The upper layers at the client continue directing communication to the anycast address, which is then mapped securely to the unicast address at the client's IP layer. Figure 5.5 shows the typical TCP interactions for an established connection between a client and a server of the anycast service using its unicast IPv6 address.

Figure 5.5: TCP Interaction For an IPv6 Anycast Established Connection

The above protocol does away with HoTI and CoTI messages. The DST-OPT in the SYN-ACK packet triggers the HoT/CoT test instead. However, the denial of service attack against the home agent (the anycast server in our context) does not apply in our case. Indeed, to cause a reflected HoT message, the attacker would have

to send a well-timed SYN-ACK packet with a correct sequence number acknowledging the correspondent's SYN. Otherwise, this message would be discarded. Further, any unexpected SYN-ACKs would be discarded by the client, so the attacker can at most induce one spurious HoT message per TCP connection being opened by the client.

Even if the attacker can time its message and guess a sequence number, only a small number of such messages would be generated, which could not cause a DoS attack. The security of our approach is further enhanced by the IPv6 requirement that all addresses be provider-specific. The corollary is that the anycast address and all unicast addresses in our approach must share a common prefix. This further prevents a malicious outside node from pretending to be a member of the anycast group. By checking for the common prefix restriction, the client can often detect and discard malicious SYN-ACKs from non-members right away without issuing HoT/CoT messages (see the sketch at the end of this section). In all other aspects, our protocol provides the same protection as MIPv6. Our approach is more modest, and thus more lightweight, than the versatile anycast from [82]. We do not support in-flight TCP hand-off to another server: we believe this is overkill for the CDN case (the primary intended application of our mechanism). Indeed, one can always issue a subsequent range request using a new connection.

Thus the server can simply reset the connection to effect a handoff as proposed in [8]. We also give up another goal of versatile anycast, which is to keep the client unmodified. The reason is that, given CDN clout, requesting clients to download a patch is not unreasonable: this is already often done through an invitation to install a download manager to speed up performance. MIPv6 provides IP-layer authentication for securing control traffic between the MN and the home agent through IPsec [14] to prevent DoS, DDoS, and man-in-the-middle attacks. In our approach, a CDN client acts as a CN while a CDN server acts as an MN. The analog of a home agent can be a router or a gateway that is still part of the CDN network. Since the analogs of both the MN and the home agent are part of the

CDN network, IP-layer authentication for route optimization through IPsec is not needed, as both entities are under the control of the CDN. In our approach, the analog of route optimization occurs at the initial handshake stage of the connection. This means that whatever authentication takes place during connection establishment can still apply at this stage. Because current CDNs do not authenticate clients at the IP layer, at least in the context of CDNs, our approach to anycast does not weaken current security properties. Finally, while we overload the MIPv6 mechanism with support for lightweight anycast, it does not prevent the regular mobility support of MIPv6. Indeed, the client can always move to another network and perform regular MIPv6 route optimization via the full HoTI/CoTI/HoT/CoT protocol. The only exception is the short period while the TCP connection is being established. If the client migrates to another network within this period, it simply needs to reopen its TCP connection anew or delay the execution of route optimization until the original TCP connection is fully established. Similarly, although we do not expect the server to be mobile, it too can in principle migrate to another network and execute the full HoTI/CoTI/HoT/CoT protocol with the client, again as long as this happens outside TCP connection establishment.
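As a small illustration of the common-prefix check mentioned in this section, the function below tests whether the unicast source of a SYN-ACK falls inside the provider prefix of the anycast address the client contacted. The /48 prefix length and the example addresses are assumptions for illustration only.

```python
# Sketch of the client-side common-prefix sanity check performed before the HoT/CoT tests.
from ipaddress import IPv6Address, IPv6Network

def plausible_anycast_member(anycast_addr, unicast_src, prefix_len=48):
    provider_prefix = IPv6Network(f"{anycast_addr}/{prefix_len}", strict=False)
    return IPv6Address(unicast_src) in provider_prefix

print(plausible_anycast_member("2001:db8:100::1", "2001:db8:100:5::42"))   # True: same prefix
print(plausible_anycast_member("2001:db8:100::1", "2001:db8:999::42"))     # False: discard early
```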

5.5 IPv6 Anycast CDN Architecture

The anycast mechanism from the previous subsection can serve as the basis of a CDN with various architectural flavors. We now present one such CDN architecture as an example. This architecture assumes multiple data centers distributed across the Internet, the deployment approach exemplified by Limelight and AT&T (but not Akamai, which pursues a much more dispersed approach). A datacenter consists of an anycast gateway, which subsumes the actions of the anycast server in the anycast protocol as

described in Section 5.4, and a number of edge servers, each of which is responsible for serving content on behalf of the anycast service and which act as unicast servers in our anycast protocol. Each anycast gateway maintains information about the current load on all servers in its datacenter. The anycast gateway is also aware of and maintains secure unicast tunnels to the gateways of all other data centers as well as to every edge server in its own data center. This internal network of anycast gateways is used, as we will see shortly, for global load distribution and management. Finally, each gateway knows the address blocks used in each data center for edge server unicast addresses, so it can immediately tell which valid unicast address belongs to which data center.

Figure 5.6: IPv6 Anycast CDN

Figure 5.6 shows the basic architecture and interactions in such a platform, depicting for simplicity only two data centers, DC1 and DC2. The gateways in each data center share an anycast address A, and every host in the platform including

gateways also maintains a unicast address (U1-U8 in the figure). When a client tries to retrieve content delivered by this CDN, the client starts with a DNS request that eventually gets resolved to the anycast address A. Next, the client sends the TCP SYN packet to this address, which is delivered by virtue of anycast to the “nearest” datacenter, in our case DC2, and thus to the gateway U8. Assuming there are non-overloaded local edge servers, the gateway chooses such an edge server, say, U5, to serve the actual content to the client. The gateway then passes the SYN packet to this edge server, which responds to the client with a SYN-

ACK packet with a DST-OPT header as described in Section 5.4, thus triggering the rest of the anycast handoff.

Figure 5.7: Redirection in IPv6 Anycast CDN

If all local edge servers are overloaded, the anycast gateway will utilize the internal network of anycast gateways to tunnel the SYN to a neighbor gateway as shown in Figure 5.7. In principle, the SYN packet can be handed off from one gateway

to the next until it finds a data center with enough capacity to handle the connection, although in practice we do not expect many such handoff hops. Once the SYN packet reaches the closest datacenter with spare capacity (DC1 in the figure), its gateway (U4 in our case) forwards the SYN packet to a non-overloaded local edge server, say, U3. U3 responds to the client with a SYN-ACK carrying the anycast address A in the DST OPT header. The client will now generate HoT and CoT messages, but by virtue of anycast, the HoT message will likely be delivered to the original gateway U8 and not to the one local to edge server U3. In fact, it is possible that, due to a route change, the HoT message will be delivered to an entirely different gateway. But since the unicast address of the edge server is included in the DST-OPT field of the HoT message, the receiving gateway can map this unicast address to the data center containing it and then route the HoT message to the proper anycast gateway, in this case U4. The rest of the connection establishment with U3 has already been explained in previous sections.
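To make the gateway's role concrete, here is a schematic sketch, under assumed data structures rather than the dissertation's implementation, of its two decisions: forwarding an incoming SYN to a non-overloaded local edge server (or to a neighbor gateway when the data center is full), and routing a stray HoT message to the proper gateway via the per-datacenter unicast address blocks.

```python
# Schematic anycast-gateway decisions; all structures and addresses are illustrative.
from ipaddress import IPv6Address, IPv6Network

class AnycastGateway:
    def __init__(self, local_servers, neighbor_gateways, dc_prefixes):
        self.local_servers = local_servers          # [(unicast_addr, load, capacity), ...]
        self.neighbor_gateways = neighbor_gateways  # other gateways, nearest first
        self.dc_prefixes = dc_prefixes              # {IPv6Network(block): gateway_addr}

    def handle_syn(self, syn_pkt):
        for addr, load, capacity in self.local_servers:
            if load < capacity:                     # pick a non-overloaded local edge server
                return ("tunnel_to_edge", addr, syn_pkt)
        # all local servers overloaded: hand the SYN to a neighbor gateway
        return ("tunnel_to_gateway", self.neighbor_gateways[0], syn_pkt)

    def route_hot(self, hot_pkt):
        # the DST-OPT names the edge server's unicast address; map it to the data
        # center that owns that address block and forward to that gateway
        unicast = IPv6Address(hot_pkt["dst_opt_unicast"])
        for block, gateway in self.dc_prefixes.items():
            if unicast in block:
                return ("forward_to_gateway", gateway, hot_pkt)
        return ("drop", None, hot_pkt)

gw = AnycastGateway(local_servers=[("2001:db8:2::5", 90, 80)],           # overloaded
                    neighbor_gateways=["2001:db8:1::4"],
                    dc_prefixes={IPv6Network("2001:db8:1::/48"): "2001:db8:1::4"})
print(gw.handle_syn({"src": "2001:db8:ffff::c"}))
print(gw.route_hot({"dst_opt_unicast": "2001:db8:1::3"}))
```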

5.6 Summary

The transition to IPv6 is imminent as we are witnessing the exhaustion of IPv4 addresses. IPv6 provides, in addition to a larger addressing space, additional functionality and address types that we can leverage to implement more flexible request routing. This chapter discusses the viability of IPv6 anycast as the basis for request routing. Several mechanisms for session-stable IPv6 anycast have been previously described. In this chapter we discuss these mechanisms and also outline a new approach, which leverages the connection-oriented nature of Web traffic to make the anycast hand-off more secure than some of these mechanisms and simpler than others. We also present an architecture of an anycast IPv6 CDN that utilizes the proposed IPv6 anycast protocol as the redirection mechanism.

Chapter 6

Conclusion

In this dissertation we analyze current mechanisms for CDN request routing, we characterize and quantify their limitations, and we propose new and enhanced mechanisms to implement request routing in CDNs. In particular, in this dissertation we show through a large-scale measurement

study that the currently prevalent DNS-based request routing is fundamentally affected by the properties of the sets of hosts sharing an LDNS. We study these sets (which we call LDNS clusters) and find that, among the two fundamental issues in DNS-based request routing, hidden load and client-LDNS distance, hidden load plays

an appreciable role only for a small number of “elephant” LDNS servers, while the client-LDNS distance is significant in many cases. Further, LDNS clusters vary widely in characteristics, and the largest clusters are actually more compact than others. Thus, a request routing system such as a content delivery network can attempt to balance load by reassigning non-compact LDNSs first, as their clients benefit less from

proximity-sensitive routing anyway. In this dissertation we make the case that anycast CDNs are a practical alternative to DNS-based CDN redirection by revisiting anycast as a CDN redirection mechanism. We present a load-aware IP anycast CDN architecture and load balancing

algorithms to utilize IP anycast's inherent proximity properties without suffering the negative consequences of using IP anycast with session-based protocols. We also evaluate these algorithms using trace data from an operational CDN and show that they perform almost as well as native IP anycast in terms of proximity, manage to keep server load within capacity constraints, and significantly outperform other approaches in terms of the number of session disruptions.

Further, we consider the implications of the impending transition to IPv6 by providing a measurement study of the performance implications of a unilateral enabling of IPv6 by a web site, without requiring any verification or opt-in from the clients. The study shows no evidence of a performance penalty for such unilateral IPv6 adoption and an extremely small increase in failure to download the object (from 0.0038% to 0.0064% of accesses). While the one-second measurement granularity is clearly a limitation of this study, it in any case provides an upper bound of 1 second for any penalty that could not be measured.

Finally, this dissertation argues for the viability of IPv6 anycast as the basis for request routing. Several mechanisms for IPv6 anycast-based request routing have been described; this dissertation discusses these mechanisms and also outlines a new approach. Our proposed approach leverages the connection-oriented nature of Web traffic to make the anycast hand-off more secure than some of these mechanisms and simpler than others. We focus in our architecture on TCP as the connection-oriented transport. However, we believe that our architecture can be ported to any connection-oriented transport.

Bibliography

[1] FreeBSD project. http://www.freebsd.org.

[2] F5 Networks. http://support.f5.com/kb/en-us/archived_products/3-dns/, 2005.

[3] Arup Acharya and Anees Shaikh. Using mobility support for request routing in IPv6 CDNs. In 7th Int. Web Content Caching and Distribution Workshop (WCW), 2002.

[4] Bernhard Ager, Wolfgang Mühlbauer, Georgios Smaragdakis, and Steve Uhlig. Comparing DNS resolvers in the wild. In Proceedings of the 10th annual conference on Internet measurement, IMC ’10, pages 15–21, 2010.

[5] Gagan Aggarwal, Rajeev Motwani, and An Zhu. The load rebalancing problem. In SPAA ’03: Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures, pages 258–265, New York, NY, USA, 2003. ACM.

[6] Akamai. http://www.akamai.com/html/technology/index.html.

[7] Z. Al-Qudah, M. Rabinovich, and M. Allman. Web timeouts and their implications. In Passive and Active Measurement, pages 211–221. Springer, 2010.

[8] Zakaria Al-Qudah, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus E. van der Merwe. Anycast-aware transport for content delivery networks. In WWW, pages 301–310, 2009.

[9] Hussein A. Alzoubi, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus Van der Merwe. Anycast CDNs revisited. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 277–286, New York, NY, USA, 2008. ACM.

[10] Hussein A. Alzoubi, Seungjoon Lee, Michael Rabinovich, Oliver Spatscheck, and Jacobus Van Der Merwe. A practical architecture for an anycast CDN. ACM Trans. Web, 5(4):17:1–17:29, October 2011.

[11] Hussein A Alzoubi, Michael Rabinovich, Seungjoon Lee, Oliver Spatscheck, and

Kobus Van Der Merwe. Advanced Content Delivery, Streaming, and Cloud Services, chapter Anycast Request Routing for Content Delivery Networks. Number 5. Wiley, 2014.

[12] Hussein A. Alzoubi, Michael Rabinovich, and Oliver Spatscheck. The anatomy of

LDNS clusters: Findings and implications for web content delivery. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 83–94, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.

[13] Hussein A. Alzoubi, Michael Rabinovich, and Oliver Spatscheck. Performance implications of unilateral enabling of IPv6. In Proceedings of the 14th International Conference on Passive and Active Measurement, PAM’13, pages 115–124, Berlin, Heidelberg, 2013. Springer-Verlag.

[14] J. Arkko, V. Devarapalli, and F. Dupont. Using IPsec to Protect Mobile IPv6 Signaling Between Mobile Nodes and Home Agents. IETF RFC 3776, 2004.

[15] http://www.business.att.com/enterprise/Service/ digital-media-solutions-enterprise/, 2010.

[16] Hitesh Ballani, Paul Francis, and Sylvia Ratnasamy. A Measurement-based Deployment Proposal for IP Anycast. In Proc. ACM IMC, Oct 2006.

[17] A. Barbir, B. Cain, F. Douglis, M. Green, M. Hofmann, R. Nair, D. Potter, and O. Spatscheck. Known Content Network (CN) Request-Routing Mechanisms.

RFC 3568, July 2003.

[18] I. Bermudez, M. Mellia, M.M. Munafò, R. Keralapura, and A. Nucci. DNS to the rescue: Discerning content and services in a tangled web. In Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement, 2012.

[19] A. Biliris, C. Cranor, F. Douglis, M. Rabinovich, S. Sibal, O. Spatscheck, and W. Sturm. CDN brokering. In 6th Int. Workshop on Web Caching and Content Distribution, June 2001.

[20] Alex Biliris, C. Cranor, F. Douglis, M. Rabinovich, S.Sibal, O. Spatscheck, and

W. Sturm. CDN Brokering. Sixth International Workshop on Web Caching and Content Distribution, June 2001.

[21] Cachefly: Besthop global traffic management. http://www.cachefly.com/ video.html.

[22] Matt Calder, Xun Fan, Zi Hu, Ethan Katz-Bassett, John Heidemann, and Ramesh Govindan. Mapping the expansion of google’s serving infrastructure. In Proceedings of the 2013 Conference on Internet Measurement Conference, IMC ’13, pages 313–326, New York, NY, USA, 2013. ACM.

[23] Valeria Cardellini, Michele Colajanni, and Philip S. Yu. Request redirection algorithms for distributed web systems. IEEE Trans. Parallel Distrib. Syst., 14(4):355–368, 2003.

[24] B. Carpenter and K. Moore. Connection of IPv6 domains via IPv4 clouds. RFC 3056, 2001.

[25] M. Casado and M.J. Freedman. Peering through the shroud: The effect of edge opacity on IP-based client identification. In Proceedings of the 4th USENIX conference on Networked systems design & implementation, pages 13–13. USENIX Association, 2007.

[26] Chandra Chekuri and Sanjeev Khanna. A PTAS for the multiple knapsack problem. In SODA ’00: Proceedings of the eleventh annual ACM-SIAM symposium

on Discrete algorithms, pages 213–222, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.

[27] Cisco GSS 4400 series global site selector appliances. http://www.cisco.com/ en/US/products/hw/contnetw/ps4162/index.html, 2009.

[28] Michele Colajanni and Philip S. Yu. A performance study of robust load sharing strategies for distributed heterogeneous web server systems. IEEE Trans. Knowl. Data Eng., 14(2):398–414, 2002.

[29] Michele Colajanni, Philip S. Yu, and Valeria Cardellini. Dynamic load balancing

in geographically distributed heterogeneous web servers. In ICDCS, pages 295– 302, 1998.

[30] L. Colitti, S. Gunderson, E. Kline, and T. Refice. Evaluating IPv6 adoption in the Internet. In Passive and Active Measurement Conf., pages 141–150, 2010.

[31] http://www.mesquite.com, 2005.

[32] D. Dagon, N. Provos, C.P. Lee, and W. Lee. Corrupted DNS resolution paths: The rise of a malicious resolution authority. In Proceedings of Network and Distributed Security Symposium (NDSS), 2008.

[33] J. De Clercq, D. Ooms, S. Prevost, and F. Le Faucheur. Connecting IPv6 islands over IPv4 MPLS using IPv6 provider edge routers (6PE). RFC 4798, 2007.

[34] S. Deering and R. Hinden. Internet Protocol Version 6 (IPv6) Specification. IETF RFC 2460, 1998.

[35] Dual stack esotropia. http://labs.apnic.net/blabs/?p=47.

[36] Nick Duffield, Kartik Gopalan, Michael R. Hines, Aman Shaikh, and Jacobus E. Van der Merwe. Measurement informed route selection. Passive and Active Measurement Conference, April 2007. Extended abstract.

[37] Michael J. Freedman, Eric Freudenthal, and David Mazières. Democratizing content publication with Coral. In NSDI, pages 239–252, 2004.

[38] The global internet speedup. http://www.afasterinternet.com/.

[39] Google over IPv6. http://www.google.com/intl/en/ipv6/.

[40] Google Public DNS. Performance Benefits. https://develo- pers.google.com/speed/public-dns/docs/performance.

[41] D. Grossman. New Terminology and Clarifications for Diffserv. IETF RFC 3260, 2002.

[42] T. Hardie. Distributing Authoritative Name Servers via Shared Unicast Addresses. IETF RFC 3258, 2002.

[43] Y. Hei and K. Yamazaki. Traffic analysis and worldwide operation of open 6to4 relays for ipv6 deployment. In IEEE Int. Symp. on Applications and the Internet,

pages 265–268, 2004.

[44] R. Hinden and S. Deering. Internet Protocol Version 6 (IPv6) Addressing Architecture. IETF RFC 3513, 2003.

123 [45] R. Hinden and S. Deering. IP Version 6 Addressing Architecture. IETF RFC 4291, 2006.

[46] Cheng Huang, Ivan Batanov, and Jin Li. A practical solution to the client- LDNS mismatch problem. SIGCOMM Comput. Commun. Rev., 42(2):35–41,

April 2012.

[47] Cheng Huang, D.A. Maltz, Jin Li, and A. Greenberg. Public DNS system and global traffic management. In INFOCOM, 2011 Proceedings IEEE, pages 2615 –2623, 2011.

[48] C. Huitema. Teredo: Tunneling IPv6 over UDP through network address translations (NATs). RFC 4380, 2006.

[49] G. Huston. IPv6 Transition. http://www.potaroo.net/presentations/2009-09- 01-ipv6-transition.pdf, 2009. Presentation at the 3d Meeting of the Australian

Network Operators Group.

[50] Jun-ichiro Itoh. Disconnecting TCP connection toward IPv6 anycast address. IETF Internet-Draft draft-itojun-ipv6-tcp-to-anycast-01.txt, 2001.

[51] Sitaram Iyer, Antony Rowstron, and Peter Druschel. Squirrel: a decentralized peer-to-peer web cache. In PODC, pages 213–222, 2002.

[52] Jaeyeon Jung, Balachander Krishnamurthy, and Michael Rabinovich. Flash Crowds and Denial of Service Attacks: Characterization and Implications for

CDNs and Web Sites. In Proceedings of 11th WWW Conference, 2002.

[53] E. Karpilovsky, A. Gerber, D. Pei, J. Rexford, and A. Shaikh. Quantifying the extent of IPv6 deployment. Passive and Active Measurement Conf., pages 13–22, 2009.

[54] Andy King. The Average Web Page. http://www.optimizationweek.com/reviews/average-web-page/, Oct 2006.

[55] Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. Netalyzr: illuminating the edge network. In Proceedings of the 10th Annual Conference on

Internet Measurement, IMC ’10, pages 246–259, 2010.

[56] Thomas T. Kwan, Robert McCrath, and Daniel A. Reed. NCSA’s world wide web server: Design and performance. IEEE Computer, 28(11):68–74, 1995.

[57] Y. Lee, A. Durand, J. Woodyatt, and R. Droms. Dual-Stack Lite broadband

deployments following IPv4 exhaustion. RFC 6333, 2011.

[58] http://www.limelightnetworks.com/platform/cdn/, 2010.

[59] R. Liston, S. Srinivasan, and E. Zegura. Diversity in DNS performance measures. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurement,

pages 19–31, 2002.

[60] G. Maier, F. Schneider, and A. Feldmann. NAT usage in residential broadband networks. In 12th Passive and Active Measurement Conf., pages 32–41, 2011.

[61] D. Malone. Observations of IPv6 addresses. Passive and Active Measurement

Conf., pages 21–30, 2008.

[62] Zhuoqing Morley Mao, Charles D. Cranor, Fred Douglis, Michael Rabinovich, Oliver Spatscheck, and Jia Wang. A precise and efficient evaluation of the proximity between web clients and their local DNS servers. In USENIX Annual Technical

Conference, General Track, pages 229–242, 2002.

[63] http://marketshare.hitslink.com/report.aspx?qprid=3.

[64] Maxmind GeoIP city database. http://www.maxmind.com/app/city.

[65] OpenDNS - A Technical Overview. http://www.opendns.com/technology.

[66] J. Pang, A. Akella, A. Shaikh, B. Krishnamurthy, and S. Seshan. On the responsiveness of DNS-based network control. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 21–26, 2004.

[67] J. Pang, A. Akella, A. Shaikh, B. Krishnamurthy, and S. Seshan. On the Responsiveness of DNS-based Network Control. In Proceedings of Internet Measurement Conference (IMC), October 2004.

[68] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto De Prisco, Bruce Maggs,

and Srinivasan Seshan. Availability, usage, and deployment characteristics of the domain name system. In Proceedings of the 4th ACM SIGCOMM Conference on Internet measurement, IMC ’04, pages 1–14, 2004.

[69] I. Poese, B. Frank, B. Ager, G. Smaragdakis, and A. Feldmann. Improving

content delivery using provider-aided distance information. In The 10th ACM Internet Measurement Conf., pages 22–34, 2010.

[70] M. Rabinovich and O. Spatscheck. Web caching and replication. Addison-Wesley, 2001.

[71] Michael Rabinovich, Zhen Xiao, and Amit Aggarwal. Computing on the edge: A platform for replicating Internet applications. In Proceedings of the 8th International Workshop on Web Content Caching and Distribution, September 2003.

[72] Amy Reibman, Subhabrata Sen, and Jacobus Van der Merwe. Network Monitoring for Video Quality over IP. Picture Coding Symposium, 2004.

[73] Pekka Savola. Observations of IPv6 traffic on a 6to4 relay. SIGCOMM Comput. Commun. Rev., 35(1):23–28, January 2005.

[74] K. Schomp, T. Callahan, M. Rabinovich, and M. Allman. Assessing the security of client-side DNS infrastructure. Submitted for publication, 2012.

[75] ServerIron DNSProxy. Foundry Networks. http://www.brocade.com/products/all/switches/index.page, 2008.

[76] Anees Shaikh, Renu Tewari, and Mukesh Agrawal. On the effectiveness of DNS-based server selection. In INFOCOM, pages 1801–1810, 2001.

[77] W. Shen, Y. Chen, Q. Zhang, Y. Chen, B. Deng, X. Li, and G. Lv. Observations of IPv6 traffic. In ISECS Int. Colloq. on Computing, Communication, Control,

and Management, volume 2, pages 278–282. IEEE, 2009.

[78] D. Shmoys and E. Tardos. An Approximation Algorithm for the Generalized Assignment Problem. Mathematical Programming, 62:461–474, 1993.

[79] David B. Shmoys, Éva Tardos, and Karen Aardal. Approximation algorithms

for facility location problems (extended abstract). In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, New York, NY, USA, 1997. ACM.

[80] Florian Streibelt, Jan Böttger, Nikolaos Chatzis, Georgios Smaragdakis, and

Anja Feldmann. Exploring edns-client-subnet adopters in your free time. In Proceedings of the 2013 Conference on Internet Measurement Conference, IMC ’13, pages 305–312, New York, NY, USA, 2013. ACM.

[81] Daniel Stutzbach, Daniel Zappala, and Reza Rejaie. The scalability of swarming

peer-to-peer content delivery. In NETWORKING, pages 15–26, 2005.

[82] M. Szymaniak and G. Pierre. Enabling service adaptability with versatile anycast. Concurrency and Computation: Practice and Experience, 19(13):1837–1863, 2007.

[83] Michal Szymaniak, Guillaume Pierre, Mariana Simons-Nikolova, and Maarten van Steen. Enabling service adaptability with versatile anycast. Concurrency and Computation: Practice and Experience, 19(13):1837–1863, 2007.

[84] M. Townsley and O. Troan. IPv6 Rapid Deployment on IPv4 Infrastructures

(6rd)–Protocol Specification. RFC 5969, 2010.

[85] Jacobus Van der Merwe, Paul Gausman, Chuck Cranor, and Rustam Akhmarov. Design, Implementation and Operation of a Large Enterprise Content Distribution Network. In 8th International Workshop on Web Content Caching and

Distribution, Sept 2003.

[86] Jacobus Van der Merwe, Subhabrata Sen, and Charles Kalmanek. Streaming Video Traffic: Characterization and Network Impact. In 7th International Workshop on Web Content Caching and Distribution (WCW), Aug 2002.

[87] Jacobus E. Van der Merwe et al. Dynamic Connectivity Management with an Intelligent Route Service Control Point. Proceedings of ACM SIGCOMM INM, October 2006.

[88] Patrick Verkaik, Dan Pei, Tom Scholl, Aman Shaikh, Alex Snoeren, and Jacobus

Van der Merwe. Wresting Control from BGP: Scalable Fine-grained Route Control. In 2007 USENIX Annual Technical Conference, June 2007.

[89] D. Wing and A. Yourtchenko. Happy eyeballs: Success with dual-stack hosts. RFC 6555, April 2012.

[90] X. Zhou and P. Van Mieghem. Hopcount and E2E delay: IPv6 versus IPv4. Passive and Active Measurement Conf., pages 345–348, 2005.
