SkipNet: A Scalable Overlay Network with Practical Locality Properties

Nicholas J.A. Harvey∗† John Dunagan∗ Michael B. Jones∗ Stefan Saroiu† Marvin Theimer∗ Alec Wolman∗

Microsoft Research Technical Report MSR-TR-2002-92

∗Microsoft Research, Microsoft Corporation, Redmond, WA 98052, {nickhar, jdunagan, mbj, theimer, alecw}@microsoft.com
†Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, {nickhar, tzoompy}@cs.washington.edu

Abstract: Scalable overlay networks such as Chord, CAN, Pastry, and Tapestry have recently emerged as flexible infrastructure for building large peer-to-peer systems. In practice, such systems have two disadvantages: They provide no control over where data is stored and no guarantee that routing paths remain within an administrative domain whenever possible. SkipNet is a scalable overlay network that provides controlled data placement and guaranteed routing locality by organizing data primarily by string names. SkipNet allows for both fine-grained and coarse-grained control over data placement: Content can be placed either on a pre-determined node or distributed uniformly across the nodes of a hierarchical naming subtree. An additional useful consequence of SkipNet's locality properties is that partition failures, in which an entire organization disconnects from the rest of the system, can result in two disjoint, but well-connected overlay networks. Furthermore, SkipNet can efficiently re-merge these disjoint networks when the partition heals.

1 Introduction

Scalable overlay networks, such as Chord [34], CAN [28], Pastry [30], and Tapestry [40], have recently emerged as flexible infrastructure for building large peer-to-peer systems. A key function that these networks enable is a distributed hash table (DHT), which allows data to be uniformly diffused over all the participants in the peer-to-peer system.

While DHTs provide nice load balancing properties, they do so at the price of giving up control over where data is stored. This has at least two disadvantages: Data may be stored far from its users and it may be stored outside the administrative domain to which it belongs. This paper introduces SkipNet [14, 15], a distributed generalization of Skip Lists [26], adapted to meet the goals of peer-to-peer systems. SkipNet is a scalable overlay network that supports traditional overlay functionality as well as two locality properties that we refer to as content locality and path locality.

Content locality refers to the ability to either explicitly place data on specific overlay nodes or distribute it across nodes within a given organization. Path locality refers to the ability to guarantee that message traffic between two overlay nodes within the same organization is routed within that organization only.

Content and path locality provide a number of advantages for data retrieval, including improved availability, performance, manageability, and security. For example, nodes can store important data within their organization (content locality) and nodes will be able to reach their data through the overlay even if the organization becomes disconnected from the rest of the Internet (path locality). Storing data near the clients that use it also yields performance benefits. Placing content onto a specific overlay node (or a well-defined set of overlay nodes) enables provisioning of those nodes to reflect demand. Content placement also allows administrative control over issues such as scheduling maintenance for machines storing important data, thus improving manageability.

Content locality can improve security by allowing one to control the administrative domain in which data resides. Even when encrypted and digitally signed, data stored on an arbitrary overlay node outside the organization is susceptible to denial of service (DoS) attacks as well as traffic analysis. Although other techniques for improving the resiliency of DHTs to DoS attacks exist [3], content locality is a simple, zero-overhead technique.

Path locality provides additional security benefits to an overlay that supports content locality. Although some overlay designs [4] are likely to keep routing messages within an organization most of the time, none guarantee path locality. For example, without such a guarantee the route from explorer.ford.com to mustang.ford.com could pass through camaro.gm.com, a scenario that people at ford.com might prefer to prevent. With path locality, nodes requesting data within their organization traverse a path that never leaves the organization. This example also illustrates that path locality can be desirable even in a scenario where no content is being placed on nodes.

Controlling content placement is in direct tension with the goal of a DHT, which is to uniformly distribute data across a system in an automated fashion. A significant contribution of this paper is the concept of constrained load balancing, which is a generalization that combines these two notions: Data is uniformly distributed across a well-defined subset of the nodes in a system, such as all nodes in a single organization, all nodes residing within a given building, or all nodes residing within one or more data centers.

SkipNet supports efficient message routing between overlay nodes, content placement, path locality, and constrained load balancing. It does so by employing two separate, but related address spaces: a string name ID space as well as a numeric ID space. Node names and content identifier strings are mapped directly into the name ID space, while hashes of the node names and content identifiers are mapped into the numeric ID space. A single set of routing pointers on each overlay node enables efficient routing in either address space and a combination of routing in both address spaces provides the ability to do constrained load balancing.

A useful consequence of SkipNet's locality properties is resiliency against a common form of Internet failure. Because SkipNet clusters nodes according to their name ID ordering, nodes within a single organization gracefully survive failures that disconnect the organization from the rest of the Internet. Furthermore, the organization's SkipNet segment can be efficiently re-merged with the external SkipNet when connectivity is restored. In the case of uncorrelated, independent failures, SkipNet has similar resiliency to previous overlay networks [30, 34].

The basic SkipNet design, not including its enhancements to support constrained load balancing, network proximity-aware routing, reduced overhead for virtual nodes, or merge algorithms, has been concurrently and independently invented by Aspnes and Shah [1]. As described in Section 2, their work has a substantially different focus than our work and the two efforts are complementary to each other while still starting from the same underlying inspiration.

The rest of this paper is organized as follows: Section 2 describes related work, Section 3 describes SkipNet's basic design, Section 4 discusses SkipNet's locality properties, Section 5 presents enhancements to the basic design, Section 6 presents the ring merge algorithms, Section 7 discusses design alternatives to SkipNet, Section 8 presents a theoretical analysis of SkipNet, Section 9 presents an experimental evaluation, and Section 10 concludes the paper.

2 Related Work

A large number of peer-to-peer overlay network designs have been proposed recently, such as CAN [28], Chord [34], [6], [12], [23], Pastry [30], Salad [10], Tapestry [40], and Viceroy [22]. SkipNet is designed to provide the same functionality as existing peer-to-peer overlay networks, and additionally to provide improved content availability through explicit control over content placement.

One key feature provided by systems such as CAN, Chord, Pastry, and Tapestry is scalable routing performance while maintaining a scalable amount of routing state at each node. By scalable routing paths we mean that the expected number of forwarding hops between any two communicating nodes is small with respect to the total number of nodes in the system. Chord, Pastry, and

Tapestry scale with log N, where N is the system size, while maintaining log N routing state at each overlay node. CAN scales with D · N^(1/D), where D is a dimensionality parameter with a typical value of 6, while maintaining an amount of per-node routing state proportional to D.

A second key feature of these systems is that they are able to route to destination addresses that do not equal the address of any existing node. Each message is routed to the node whose address is "closest" to that specified in the destination field of a message; we interchangeably use the terms "route" and "search" to mean routing to the closest node to the specified destination. This feature enables implementation of a distributed hash table (DHT) [13], in which content is stored at an overlay node whose node ID is closest to the result of applying a collision-resistant hash function to that content's name (i.e. consistent hashing [18]).

Distributed hash tables have been used, for instance, in constructing the PAST [31] and CFS [8] distributed filesystems, the Overlook [37] scalable name service, the Squirrel [16] cooperative web cache, and scalable application-level multicast [5, 32, 29]. For most of these systems, if not all of them, the overlay network on which they were designed can easily be substituted with SkipNet.

SkipNet has a fundamental philosophical difference from existing overlay networks, such as Chord and Pastry, whose goal is to implement a DHT. The basic philosophy of systems like Chord and Pastry is to diffuse content randomly throughout an overlay in order to obtain uniform, load-balanced, peer-to-peer behavior. The basic philosophy of SkipNet is to enable systems to preserve useful content and path locality, while still enabling load balancing over constrained subsets of participating nodes.

This paper is not the first to observe that locality properties are important in peer-to-peer systems. Keleher et al. [19] make two main points: locality is a good thing, and DHTs destroy locality. Vahdat et al. [38] raise the locality issue as well. SkipNet addresses this problem directly: By using names rather than hashed identifiers to order nodes in the overlay, natural locality based on the names of objects is preserved. Furthermore, by arranging content in name order rather than dispersing it, efficient operations on ranges of names are possible in SkipNet, enabling, among other things, constrained load balancing.

Aspnes and Shah [1] have independently invented the same basic data structure that defines a SkipNet, which they call a Skip Graph. Beyond that, they investigate questions that are mostly orthogonal to those addressed in this paper. In particular, they describe and analyze different search and insertion algorithms and they focus on formal characterization of Skip Graph invariants. In contrast, our work is focused primarily on the content and path locality properties of the design, and we describe several extensions that are important in building a practical system: network proximity-aware routing is obtained by means of two auxiliary routing tables; constrained load balancing is supported through a combination of searches in both the string name and numeric address spaces that SkipNet defines; efficient algorithms are used to re-merge disjoint SkipNet segments that result from network partitions; and multiple virtual nodes can be hosted on a single physical node with substantially less overhead than the schemes described in previous work.

3 Basic SkipNet Structure

In this section, we introduce the basic design of SkipNet. We present the SkipNet architecture, including how to route in SkipNet, and how to join and leave a SkipNet.

3.1 Analogy to Skip Lists

A Skip List, first described in Pugh [26], is a dictionary data structure typically stored in-memory. A Skip List is a sorted linked list in which some nodes are supplemented with pointers that skip over many list elements. A "perfect" Skip List is one where the height of the ith node is the exponent of the largest power-of-two that divides i. Figure 1 depicts a perfect Skip List. Note that pointers at level h have length 2^h (i.e., they traverse 2^h nodes). A perfect Skip List supports searches in O(log N) time.

Because it is prohibitively expensive to perform insertions and deletions in a perfect Skip List, Pugh suggests a probabilistic scheme for determining node heights while maintaining O(log N) searches with high probability. Briefly, each node chooses a

height such that the probability of choosing height h is 1/2^h. Thus, with probability 1/2 a node has height 1, with probability 1/4 it has height 2, etc. Figure 2 depicts a probabilistic Skip List.

[Figure 1. A perfect Skip List.]

[Figure 2. A probabilistic Skip List.]

Whereas Skip Lists are an in-memory data structure that is traversed from its head node, we desire a data structure that links together distributed computer nodes and supports traversals that may start from any node in the system. Furthermore, because peers should have uniform roles and responsibilities in a peer-to-peer system, we desire that the state and processing overhead of all nodes be roughly the same. In contrast, Skip Lists maintain a highly variable number of pointers per data record and experience a substantially different amount of traversal traffic at each data record.

3.2 The SkipNet Structure

The key idea we take from Skip Lists is the notion of maintaining a sorted list of all data records as well as pointers that "skip" over varying numbers of records. We transform the concept of a Skip List to a distributed system setting by replacing data records with computer nodes, using the string name IDs of the nodes as the data record keys, and forming a ring instead of a list. The ring must be doubly-linked to enable path locality, as is explained in Section 3.3.

Rather than having nodes store a highly variable number of pointers, as in Skip Lists, each SkipNet node stores 2 log N pointers, where N is the number of nodes in the overlay system. Each node's set of pointers is called its routing table, or R-Table, since the pointers are used to route message traffic between nodes. The pointers at level h of a given node's routing table point to nodes that are roughly 2^h nodes to the left and right of the given node. Figure 3 depicts a SkipNet containing eight nodes and shows the routing table pointers that nodes A and V maintain.

[Figure 3. SkipNet nodes ordered by name ID. Routing tables of nodes A and V are shown. Node A: level 0: D, Z; level 1: M, X; level 2: T, T. Node V: level 0: X, T; level 1: Z, O; level 2: D, D.]

[Figure 4. The full SkipNet routing infrastructure for an 8 node system, including the ring labels.]

The SkipNet in Figure 3 is a "perfect" SkipNet: each level h pointer traverses exactly 2^h nodes. Figure 4 depicts the same SkipNet of Figure 3, arranged to show all node interconnections at every level simultaneously. All nodes are connected by the root ring formed by each node's pointers at level 0. The pointers at level 1 point to nodes that are 2 nodes away and hence the overlay nodes are implicitly divided into two disjoint rings. Similarly, pointers at level 2 form four disjoint rings of nodes, and so forth. Note that rings at level h + 1 are obtained by splitting a ring at level h into two disjoint sets, each ring containing every second member of the level h ring.

Maintaining a perfect SkipNet in the presence of insertions and deletions is impractical, as is the case with perfect Skip Lists. To facilitate efficient insertions and deletions, we derive a probabilistic SkipNet design. Each ring at level h is split into two rings at level h + 1 by having each node randomly and uniformly choose to which of the two rings it belongs. With this probabilistic scheme, [...] Each node's random choice of ring memberships can be encoded as a unique binary number, which we refer to as the node's numeric ID. As illustrated in Figure 4, the first h bits of the number determine ring membership at level h. For example, node X's numeric ID is 011 and its membership at level 2 is determined by taking the first 2 bits of 011, which designate Ring 01. As described in [34], there are advantages to using a collision-resistant hash (such as SHA-1) of the node's DNS name as the numeric ID. The SkipNet design does not require the use of hashing to generate nodes' numeric IDs; we only require that numeric IDs are random and unique.

Because the numeric IDs of nodes are unique they can be thought of as a second address space that is maintained by the same SkipNet data structure. Whereas SkipNet's string address space is populated by node name IDs that are not uniformly distributed throughout the space, SkipNet's numeric address space is populated by node numeric IDs that are uniformly distributed. The uniform distribution of numeric IDs in the numeric space is what ensures that our routing table construction yields routing table entries that skip over the appropriate number of nodes.

Readers familiar with Chord may have observed that SkipNet's routing pointers are exponentially distributed in a manner similar to Chord's: The pointer at level h hops over 2^h nodes in expectation. The fundamental difference is that Chord's routing pointers skip over 2^h nodes in the numeric space. In contrast SkipNet's pointers, when considered from level 0 upward, skip over 2^h nodes in the name ID space and, when considered from the top level downward, skip over 2^h nodes in the numeric ID space. Chord guarantees O(log N) routing and node insertion performance by uniformly distributing node identifiers in its numeric address space. SkipNet guarantees O(log N) performance of node insertion and routing in both the name ID and numeric ID spaces by uniformly distributing numeric IDs and leveraging the sorted order of name IDs.

3.3 Routing by Name ID

Routing/searching by name ID in SkipNet is based on the same basic principle as searching in Skip Lists: Follow pointers that route closest to the intended destination. At each node, a message will be routed along the highest-level pointer that does not point past the destination value. Routing terminates when the message arrives at a node whose name ID is closest to the destination. Figure 5 presents this algorithm in pseudocode.

    SendMsg(nameID, msg) {
      if (LongestPrefix(nameID, localNode.nameID) == 0)
        msg.dir = RandomDirection();
      else if (nameID < localNode.nameID)
        // Assumed from the text of this section: a smaller
        // destination name lies to the left (counter-clockwise).
        msg.dir = counterClockWise;
      else
        msg.dir = clockWise;
      msg.nameID = nameID;
      RouteByNameID(msg);
    }

    // Invoked at every node along the routing path.
    RouteByNameID(msg) {
      // Start from the highest-level pointer
      // (maxHeight is an assumed field name).
      h = localNode.maxHeight;
      while (h >= 0) {
        nbr = localNode.RouteTable[msg.dir][h];
        if (LiesBetween(localNode.nameID, nbr.nameID,
                        msg.nameID, msg.dir)) {
          SendToNode(msg, nbr);
          return;
        }
        h = h - 1;
      }
      // h<0 implies we are the closest node.
      DeliverMessage(msg.msg);
    }

Figure 5. Algorithm for routing by name ID in SkipNet.

Since nodes are ordered by name ID along each ring and a message is never forwarded past its destination, all nodes encountered during routing have name IDs between the source and the destination. Thus, when a message originates at a node whose name ID shares a common prefix with the destination, all nodes traversed by the message have name IDs that share that same prefix. Because rings are doubly-linked, this scheme can route using either right or left pointers depending upon whether the source's name ID is smaller or greater than the destination's. The key observation of this scheme is that routing by name ID traverses only nodes whose name IDs share a non-decreasing prefix with the destination ID. Section 8.5 proves that node stress is well-balanced even under this scheme.

If the source name ID and the destination name ID share no common prefix, a message can be routed in either direction. For the sake of fairness, we randomly pick a direction so that nodes whose name IDs are near the middle of the sorted ordering do not get a disproportionately large share of the forwarding traffic.

The number of message hops when routing by name ID is O(log N) with high probability. For a proof see Section 8.1.

3.4 Routing by Numeric ID

It is also possible to route messages efficiently to a given numeric ID. In brief, the routing operation begins by examining nodes in the level 0 (root) ring until a node is found whose numeric ID matches the destination numeric ID in the first digit. At this point the routing operation jumps up to this node's level 1 ring, which also contains the destination node. The routing operation then examines nodes in this level 1 ring until a node is found whose numeric ID matches the destination numeric ID in the second digit. As before, we know that this node's level 2 ring must also contain the destination node, and thus the routing operation proceeds in this level 2 ring.

This procedure repeats until we cannot make any more progress: we have reached a ring at some level h such that none of the nodes in that ring share h + 1 digits with the destination numeric ID. We must now deterministically choose one of the nodes in this ring to be the destination node. Our algorithm defines the destination node to be the node whose numeric ID is numerically closest to the destination numeric ID amongst all nodes in this highest ring. Figure 6 presents this algorithm in pseudocode.

    // Invoked at all nodes (including the source and
    // destination nodes) along the routing path.
    // Initially:
    //   msg.ringLvl = -1
    //   msg.startNode = msg.bestNode = null
    //   msg.finalDestination = false
    RouteByNumericID(msg) {
      if (msg.numID == localNode.numID ||
          msg.finalDestination) {
        DeliverMessage(msg.msg);
        return;
      }
      if (localNode == msg.startNode) {
        // Done traversing current ring.
        msg.finalDestination = true;
        SendToNode(msg, msg.bestNode);
        return;
      }

      h = CommonPrefixLen(msg.numID, localNode.numID);
      if (h > msg.ringLvl) {
        // Found a higher ring.
        msg.ringLvl = h;
        msg.startNode = msg.bestNode = localNode;
      } else if (abs(localNode.numID - msg.numID) <
                 abs(msg.bestNode.numID - msg.numID)) {
        // Found a better candidate for current ring.
        msg.bestNode = localNode;
      }
      // Forward along current ring.
      nbr = localNode.RouteTable[clockWise][msg.ringLvl];
      SendToNode(msg, nbr);
    }

Figure 6. Algorithm to route by numeric ID in SkipNet.

As an example, imagine that the numeric IDs in Figure 4 are 4 bits long and that node Z's ID is 1000 and node O's ID is 1001. If we want to route a message from node A to destination 1011 then A will first forward the message to node D because D is in ring 1. D will then forward the message to node O because O is in ring 10. O will forward the message to Z because it is not in ring 101. Z will forward the message onward around the ring (and hence back) to O for the same reason. Since none of the members of ring 10 belong to ring 101, node O will be picked as the final message destination because its numeric ID is closest to 1011 of all ring 10 members.

The number of message hops when routing by numeric ID is O(log N) with high probability. For a proof see Section 8.3.

Some intuition for why SkipNet can support efficient routing by both name ID and numeric ID with the same data structure is illustrated in Figure 4. Note that the root ring, at the bottom, is sorted by name ID and, collectively, the top-level rings are sorted by numeric ID. For any given node, the SkipNet rings to which it belongs precisely form a Skip List. Thus efficient searches by name ID are possible. Furthermore, if you construct a trie on all nodes' numeric IDs, the nodes of the resulting trie would be in one-to-one correspondence with the SkipNet rings. This suggests that efficient searches by numeric ID are also possible.

3.5 Node Join and Departure

To join a SkipNet, a newcomer must first find the top-level ring that corresponds to the newcomer's numeric ID. This amounts to routing a message to the newcomer's numeric ID, as described in Section 3.4.

The newcomer then finds its neighbors in this top-level ring, using a search by name ID within this

ring only. Starting from one of these neighbors, the newcomer searches for its name ID at the next lower level and thus finds its neighbors at this lower level. This process is repeated for each level until the newcomer reaches the root ring. For correctness, the existing nodes only point to the newcomer after it has joined the root ring; the newcomer then notifies its neighbors in each ring that it should be inserted next to them. Figure 7 presents this algorithm in pseudocode.

    InsertNode(nameID, numID) {
      msg = new JoinMessage();
      msg.operation = findTopLevelRing;
      // The fields below are read later on; setting them
      // here is an assumption about the intended setup.
      msg.nameID = nameID;
      msg.numID = numID;
      msg.joiningNode = localNode;
      RouteByNumericID(msg);
    }

    DeliverMessage(msg) {
      ...
      else if (msg.operation == findTopLevelRing) {
        msg.ringLvl = CommonPrefixLen(localNode.numID, msg.numID);
        // One slot per level 0..ringLvl.
        msg.ringNbrClockWise = new Node[msg.ringLvl + 1];
        msg.ringNbrCClockWise = new Node[msg.ringLvl + 1];
        msg.doInsertions = false;
        CollectRingInsertionNeighbors(msg);
      }
      else ...
    }

    // Invoked at every intermediate routing hop.
    CollectRingInsertionNeighbors(msg) {
      if (msg.doInsertions) {
        InsertIntoRings(msg.ringNbrClockWise,
                        msg.ringNbrCClockWise);
        return;
      }
      while (msg.ringLvl >= 0) {
        nbr = localNode.RouteTable[clockWise][msg.ringLvl];
        if (LiesBetween(localNode.nameID, msg.nameID,
                        nbr.nameID, clockWise)) {
          // Found an insertion neighbor.
          msg.ringNbrClockWise[msg.ringLvl] = nbr;
          msg.ringNbrCClockWise[msg.ringLvl] = localNode;
          msg.ringLvl = msg.ringLvl - 1;
        } else {
          // Keep looking.
          SendToNode(msg, nbr);
          return;
        }
      }
      msg.doInsertions = true;
      SendToNode(msg, msg.joiningNode);
    }

Figure 7. Algorithm to insert a SkipNet node.

As an example, imagine inserting node O into the SkipNet of Figure 4. Node O initiates a search by numeric ID for its own ID (101) and the resulting insertion message ends up at node Z in ring 10 since that is the highest non-empty ring that shares a prefix with node O's numeric ID. Since Z is the only node in ring 10, Z concludes that it is both the clockwise and counter-clockwise neighbor of node O in this ring.

In order to find node O's neighbors in the next lower ring (ring 1), node Z forwards the insertion message to node D. Node D then concludes that D and V are the neighbors of node O in ring 1. Similarly, node D forwards the insertion message to node M in the root ring, who concludes that node O's level 0 neighbors must be M and T. The insertion message is returned to node O, who then instructs all of its neighbors to insert it into the rings.

The key observation for this algorithm's efficiency is that a newcomer searches for its neighbors at a certain level only after finding its neighbors at all higher levels. As a result, the search by name ID will traverse only a few nodes within each ring to be joined: The range of nodes traversed at each level is limited to the range between the newcomer's neighbors at the next higher level. Therefore, with high probability, a node join in SkipNet will traverse O(log N) hops. For a proof see Section 8.4.

The basic observation in handling node departures is that SkipNet can route correctly as long as the root ring is maintained. All pointers but the root ring ones can be regarded as routing optimization hints, and thus are not necessary to maintain routing protocol correctness. Therefore, like Chord and Pastry, SkipNet maintains and repairs the upper-level ring memberships by means of a background repair process. In addition, when a node voluntarily departs from the SkipNet, it can proactively notify all of its neighbors to repair their pointers immediately.

To maintain the root ring correctly, each SkipNet node maintains a leaf set that points to additional nodes along the root ring, for redundancy. We describe the leaf set next.

3.6 Leaf Set

Every SkipNet node maintains a set of pointers to the L/2 nodes closest in name ID on the left side and similarly on the right side. We call this set of pointers a leaf set. Several previous peer-to-peer systems [30] incorporate a similar architectural feature; in Chord [35] they refer to this as a successor list.

These additional pointers in the root ring provide two benefits. First, the leaf set increases fault tolerance. If a search operation encounters a failed node, a node adjacent to the failed node will contain a leaf set pointer with a destination on the other side of the failed node, and so the search will eventually move past the failed node. Repair is also facilitated by repairing the root ring first, and recursively relying on the accuracy of lower rings to repair higher rings. Without a leaf set, it is not clear that higher level pointers (that point past a failed node) sufficiently enable repair. If two nodes fail, it may be that some node in the middle of them becomes invisible to other nodes looking for it using only higher level pointers. Additionally, in the node failure scenario of an organizational disconnect, the leaf set pointers on most nodes are more likely to remain intact than higher level pointers. The resiliency to node failure that leaf sets provide (with the exception of the organizational disconnect scenario) was also noted by [35].

A second benefit of the leaf set is to increase search performance by subtracting a noticeable additive constant from the required number of search hops. When a search message is within L/2 of its destination, the search message will be immediately forwarded to the destination. In our current implementation we use a leaf set of size L = 16, just as Pastry does.

3.7 Background Repair

SkipNet uses the leaf set to ensure with good probability that the neighbor pointers in the root ring point to the correct node. As is the case in Chord [34], this is all that is required to guarantee correct, if possibly inefficient, routing by name ID. For an intuitive argument of why this is true, suppose that some higher-level pointer does not point to the correct node, and that the search algorithm tries to use this pointer. There are two cases. In the first case, the incorrect pointer points further around the ring than the routing destination. In this case the pointer will not be used, as it goes past the destination. In the second case, the incorrect pointer points to a location between the current location and the destination. In this case the pointer can be safely followed and routing will proceed from wherever it points. The only potential loss is routing efficiency. In the worst case, correct routing will occur using the root ring.

Nonetheless, for efficient routing, it is important to ensure as much as possible that the other pointers are correct. SkipNet employs two background algorithms to detect and repair incorrect ring pointers.

The first of these algorithms builds upon the invariant that a correct set of ring pointers at level h can be used to build a correct set of pointers in the ring above it at level h + 1. Each node periodically routes a message a short distance around each ring that it belongs to, starting at level 0, verifying that the pointers in the ring above it point to the correct node and adjusting them if necessary. Once the pointers at level h have been verified, this algorithm iteratively verifies and repairs the pointers one level higher. At each level, verification and repair of a pointer requires only a constant amount of work in expectation.

The second of these algorithms performs local repairs to rings whose nodes may have been inconsistently inserted or whose members may have disappeared. In this algorithm nodes periodically contact their neighbors at each level saying "I believe that I am your left (right) neighbor at level h". If the neighbor agrees with this information no reply is necessary. If it doesn't, the neighbor replies saying who it believes its left (right) neighbor is, and a reconciliation is performed based upon this information to correct any local ring inconsistencies discovered.

4 Useful Locality Properties of SkipNet

In this section we discuss some of the useful locality properties that SkipNet is able to provide, and their consequences.

4.1 Content and Routing Path Locality

Given the basic structure of SkipNet, describing how SkipNet supports content and path locality is straightforward. Incorporating a node's name ID into a content name guarantees that the content will be hosted on that node. As an example, to store a document doc-name on the node john.microsoft.com, naming it john.microsoft.com/doc-name is sufficient.

SkipNet is oblivious to the naming convention used for nodes' name IDs. Our simulations and deployments of SkipNet use DNS names for name IDs, after suitably reversing the components of the DNS name. In this scheme, john.microsoft.com becomes com.microsoft.john, and thus all nodes within microsoft.com share the com.microsoft prefix in their name IDs. This yields path locality for

organizations in which all nodes share a single DNS suffix (and hence share a single name ID prefix).

4.2 Constrained Load Balancing

As mentioned in the Introduction, SkipNet supports Constrained Load Balancing (CLB). To implement CLB, we divide a data object's name into two parts: a part that specifies the set of nodes over which DHT load balancing should be performed (the CLB domain) and a part that is used as input to the DHT's hash function (the CLB suffix). In SkipNet the special character '!' is used as a delimiter between the two parts of the name.

For example, suppose we stored a document using the name msn.com/DataCenter!TopStories.html. The CLB domain indicates that load balancing should occur over all nodes whose names begin with the prefix msn.com/DataCenter. The CLB suffix, TopStories.html, is used as input to the DHT hash function, and this determines the specific node within msn.com/DataCenter on which the document will be placed. Note that storing a document with CLB results in the document being placed on precisely one node within the CLB domain (although it would be possible to store replicas on other nodes). If numerous other documents were also stored in the msn.com/DataCenter CLB domain, then the documents would be uniformly distributed across all nodes in that domain.

To search for a data object that has been stored using CLB, we first search for any node within the CLB domain using search by name ID. To find the specific node within the domain that stores the data object, we perform a search by numeric ID within the CLB domain for the hash of the CLB suffix.

The search by name ID is unmodified from the description in Section 3.3, and takes O(log N) message hops. The search by numeric ID is constrained by a name ID prefix and thus at any level must effectively step through a doubly-linked list rather than a ring. Upon encountering the right boundary of the list (as determined by the name ID prefix boundary), the search must reverse direction in order to ensure that no node is overlooked. Reversing directions in this manner affects the performance of the search by numeric ID by at most a factor of two, and thus O(log N) message hops are required in total.

Note that both traditional system-wide DHT semantics and explicit content placement are special cases of constrained load balancing: system-wide DHT semantics are obtained by placing the '!' hashing delimiter at the beginning of a document name. Omitting the hashing delimiter and choosing the name of a data object to have a prefix that matches the name of a particular SkipNet node will result in that data object being placed on that SkipNet node.

Constrained load balancing can be performed over any naming subtree of the SkipNet but not over an arbitrary subset of the nodes of the overlay network. Another limitation is that the CLB domain is encoded in the name of a data object. Thus, transparent remapping to a different load balancing domain is not possible.

4.3 Fault Tolerance

Previous studies [21, 24] indicate that network connectivity failures in the Internet today are due primarily to Border Gateway Protocol (BGP) misconfigurations and faults. Other hardware, software and human failures play a lesser role. As a result, node failures in overlay networks are not independent; instead, nodes belonging to the same organization or AS tend to fail together. We consider both correlated and independent failure cases in this section.

4.3.1 Independent Failures

SkipNet's tolerance to uncorrelated, independent failures is much the same as previous overlay designs' (e.g., Chord and Pastry), and is achieved through similar mechanisms. The key observation in failure recovery is that maintaining correct neighbor pointers in the root ring is enough to ensure correct functioning of the overlay. Since each node maintains a leaf set of 16 neighbors at level 0, the root ring pointers can be repaired by replacing them with the leaf set entries that point to the nearest live nodes following the failed node. The live nodes in the leaf set may be contacted to repopulate the leaf set fully.

As described in Section 3.7, SkipNet also employs a background stabilization mechanism that gradually updates all necessary routing table entries when a node fails. Any query to a live, reachable

node will still succeed during this time; the stabilization mechanism simply restores optimal routing.

4.3.2 Failures along Organization Boundaries

In previous peer-to-peer overlay designs [28, 34, 30, 40], node placement in the overlay topology is determined by a randomly chosen numeric ID. As a result, nodes within a single organization are placed uniformly throughout the address space of the overlay. While a uniform distribution facilitates the O(log N) routing performance of the overlay, it makes it difficult to control the effect of physical link failures on the overlay network. In particular, the failure of an inter-organizational network link may manifest itself as multiple, scattered link failures in the overlay. Indeed, it is possible for each node within a single organization that has lost connectivity to the Internet to become disconnected from the entire overlay and from all other nodes within the organization. Section 9.4 reports experimental results that confirm this observation.

Since SkipNet name IDs tend to encode organizational membership, and nodes with common name ID prefixes are contiguous in the overlay, failures along organization boundaries do not completely fragment the overlay, but instead result in ring segment partitions. Consequently, a significant fraction of routing table entries of nodes within the disconnected organization still point to live nodes within the same network partition. This property allows SkipNet to gracefully survive failures along organization boundaries. Furthermore, the disconnected organization's SkipNet segment can be efficiently re-merged with the external SkipNet when connectivity is restored, as described in Section 6.

4.4 Security

In this section, we discuss some security consequences of SkipNet's content and path locality properties. Recent work [3] on improving the security of peer-to-peer systems has focused on certification of node identifiers and the use of redundant routing paths. The security advantages of content and path locality depend on an access control mechanism for creation of name IDs. SkipNet does not directly provide this mechanism but rather assumes that it is provided at another layer. Our use of DNS names for name IDs does provide this mechanism: Arbitrary nodes cannot create global DNS names containing the suffix of a registered organization without its permission.

Path locality allows SkipNet to guarantee that messages between two machines within a single administrative domain that uses a single name ID prefix will never leave the administrative domain. Thus, these messages are not susceptible to traffic analysis or denial-of-service attacks by machines located outside of the administrative domain. Furthermore, traffic that is internal to an organization is not susceptible to a Sybil attack [9] originating from a foreign organization: Creating an unbounded number of nodes outside microsoft.com will not allow the attacker to see any traffic internal to microsoft.com, nor allow the attacker to usurp control over documents placed specifically within microsoft.com.

In Chord, the nodes belonging to an administrative domain (for example, microsoft.com) are uniformly dispersed throughout the overlay. Thus, intercepting a significant portion of the traffic to microsoft.com may require that an attacker create a large number of nodes. In SkipNet, the nodes belonging to an administrative domain form a contiguous segment of the overlay. Thus, an attacker might attempt to target microsoft.com by creating nodes (for example, microsofa.com) that are adjacent to the target domain. Thus, a security disadvantage of SkipNet is that it may be possible to target traffic between an administrative domain and the outside world with fewer attacking nodes than would be necessary in systems such as Chord. We believe that susceptibility to these kinds of attacks is a small price to pay in return for the benefits provided by path and content locality.

4.5 Range Queries

Since SkipNet's design is based on and inspired by Skip Lists, it inherits their functionality and flexibility in supporting efficient range queries. In particular, since nodes and data are stored in name ID order, documents sharing common prefixes are stored over contiguous ring segments. Performing range queries in SkipNet is therefore equivalent to routing along the corresponding ring segment. Because our current focus is on SkipNet's architecture

and locality properties, we do not discuss the use of range queries for implementing various higher-level data query operators further in this paper.

5 SkipNet Enhancements

This section presents several optimizations and enhancements to the basic SkipNet design.

5.1 Sparse and Dense Routing Tables

The basic SkipNet design may be modified in order to improve routing performance. Thus far in our discussions, SkipNet numeric IDs consist of 128 random binary digits. However, the random digits need not be binary. Indeed, Skip Lists using non-binary random digits are well-known [26].

We can also use non-binary random digits for the numeric IDs in SkipNet, which changes the ring structure depicted in Figure 4, the number of pointers stored per node, and the expected routing cost. We denote the number of different possibilities for a digit by k; in the binary digit case, k = 2. If k = 3, the root ring of SkipNet remains a single ring, but there are now three level 1 rings, nine level 2 rings, etc. As k increases, the total number of pointers in the R-Table will decrease. Because there are fewer pointers, it will take more routing hops to get to any particular node. For increasing values of k, the number of pointers decreases to O(log_k n) while the number of hops required for search increases to O(k log_k n). We call the routing table that results from this modification a sparse R-Table with parameter k.

It is also possible to build a dense R-Table by additionally storing k − 1 pointers to contiguous nodes at each level of the routing table and in both directions. In this case, the expected number of search hops decreases while the expected number of pointers at a node increases.

Increasing k makes the sparse R-Table sparser and the dense R-Table denser. The density parameter k and choice of sparse or dense construction can be used to control the amount of routing state used by all SkipNet routing tables, and in Section 9 we examine the relationship between routing performance and the amount of routing table state maintained.

Our density parameter, k, bears some similarity to Pastry's density parameter, b. Pastry always generates binary numeric IDs but divides bits into groups of b. This is analogous to our scheme for choosing numeric IDs with k = 2^b.

Implementing node join and departure in the case of sparse R-Tables requires no modification to our previous algorithms. For dense R-Tables, the node join message must traverse and gather information about at least k − 1 nodes in both directions in every ring containing the newcomer, before descending to the next ring. As before, node departure merely requires notifying every neighbor.

If k = 2, the sparse and dense constructions are identical. Increasing k makes the sparse R-Table sparser and the dense R-Table denser. Any given degree of sparsity/density can be well-approximated by appropriate choice of k and either a sparse or a dense R-Table. Our implementation chooses k = 8 to achieve a good balance between state per node and routing performance.

5.2 Duplicate Pointer Elimination

Two nodes that are neighbors in a ring at level h may also be neighbors in a ring at level h + 1. In this case, these two nodes maintain "duplicate" pointers to each other at levels h and h + 1. Intuitively, routing tables with more distinct pointers yield better routing performance than tables with fewer distinct pointers, and hence duplicate pointers reduce the effectiveness of a routing table. Replacing a duplicate pointer with a suitable alternative, such as the following neighbor in the higher ring, improves routing performance by a moderate amount (our experiments indicate improvements typically around 25%). Routing table entries adjusted in this fashion can only be used when routing by name ID since they violate the invariant that a node point to its closest neighbor on a ring, which is required for correct routing by numeric ID.

5.3 Incorporating Network Proximity for Routing by Name ID

In SkipNet, a node's neighbors are determined by a random choice of ring memberships (i.e., numeric IDs) and by the ordering of name IDs within those rings. Accordingly, the SkipNet overlay is constructed without direct consideration of the physical network topology, potentially hurting routing

performance. For example, when sending a message from the node saturn.com/nodeA to the node chrysler.com/nodeB, both in the USA, the message might get routed through the intermediate node jaguar.com/nodeC in the UK. This would result in a much longer path than if the message had been routed through another intermediate node in the USA.

To deal with this problem, we introduce a second routing table called the P-Table, which is short for proximity table. The goal of the P-Table is to maintain routing in O(log N) hops, while also ensuring that each hop has low cost in terms of network latency. Our P-Table design is inspired by Pastry's proximity-aware routing tables [4]. To incorporate network proximity into SkipNet, the key observation is that any node that is roughly the right distance away in name ID space can be used as an acceptable routing table entry that will maintain the underlying O(log N) routing performance. For example, it doesn't matter whether a P-Table entry at level 3 points to the node that is exactly 8 nodes away or to one that is 7 or 9 nodes away; statistically the number of forwarding hops that messages will take will end up being the same. However, if the 7th or 9th node is nearby in network distance then using it as the P-Table entry can yield substantially better routing performance. In fact, the P-Table entry at level h can be anywhere between 2^h and 2^(h+1) nodes away while maintaining O(log N) routing performance.

To construct its P-Table, a node needs to locate a set of candidate nodes that are close in terms of network distance and whose name IDs are appropriately distributed around the root ring. Unlike Chord and Pastry, in SkipNet it is difficult to estimate distance along the root ring simply by looking at a candidate node's name ID. We solve this problem by observing that a node's basic routing table (the R-Table) conveniently divides the root ring into intervals of exponentially increasing size. Thus, two pointers at adjacent levels in the R-Table provide the name ID boundaries of a contiguous interval along the root ring. Given a node, we examine these intervals to determine which P-Table entry it is a candidate for. We discover candidate nodes that are nearby using a recursive process: we start at a nearby seed node and discover other nearby nodes by querying the P-Table of the seed node. Finally, we determine that two nodes are near each other by estimating the round-trip latency between them.

The following section provides a detailed description of the algorithm that a SkipNet node uses to construct its P-Table. After the initial P-Table is constructed, SkipNet constantly tries to improve the quality of its P-Table entries, as well as adjust to node joins and departures, by means of a periodic stabilization algorithm. The periodic stabilization algorithm is very similar to the initial construction algorithm presented below. Finally, in Section 8.8 we argue that P-Table routing performance and P-Table construction are efficient.

5.3.1 P-Table Construction

Recall that the R-Table has only two configuration parameters: the value of k and either sparse or dense construction. The P-Table inherits these parameters from the R-Table upon which it is based. In certain cases it is possible to construct a P-Table with parameters that differ from the R-Table's by first constructing a temporary R-Table with the desired parameters. For example, if the R-Table is sparse, one may construct a dense P-Table by first constructing a temporary dense R-Table to use as input to the P-Table construction algorithm.

When a node joins SkipNet it first constructs its R-Table. P-Table construction is then initiated by copying the entries of the R-Table to a separate list, where they are sorted by name ID and then duplicate entries are eliminated. Duplicates and out-of-order entries can arise in this list due to the probabilistic nature of constructing the R-Table.

The joining node then constructs a P-Table join message that contains the sorted list of endpoints: a list of j nodes defining j − 1 intervals. The joining node sends this P-Table join message to a seed node, that is, a node that should be nearby in terms of network distance.

Every node that receives a P-Table join message uses its own P-Table entries to fill in the intervals with "candidate" nodes. As a practical consideration, we limit the maximum number of candidates per interval to 10 in order to avoid accumulating too many nodes. After filling in any possible intervals, the node checks whether any of the intervals are still empty. If so, the node must forward the join message to another node in order to fill the remaining empty intervals.

Assuming that all intervals are initially empty, the expected number of hops required to find a candidate for the jth farthest interval from the joining node is O(j). Thus, in order to find candidates that are close to the joining node in terms of network proximity, we use the following strategy: Nodes that receive the join message use their own P-Table entries to forward the message towards the unfilled interval that is the farthest from the joining node. If all the intervals have at least one candidate, the node sends the completed join message back to the original joining node. The expected total number of hops to fill all intervals is O(log N).

When the original node receives its own join message, it chooses one candidate node per interval as its P-Table entry. The choice between candidate nodes is performed by estimating the network latency to each candidate and choosing the closest node.

We now summarize a few remaining key details of P-Table construction. Since SkipNet can route either clockwise or counter-clockwise, the P-Table contains intervals that cover the address space in both directions from the joining node. Thus two join messages are sent from the same starting node.

The effectiveness of P-Table routing entries is dependent to a great extent on finding nearby nodes. The basis of this process is finding a good seed node. In our simulator, we implemented two strategies for locating a seed node. Our first strategy uses global knowledge from the simulator topology model to find the closest node in the entire system. The second and more realistic strategy is that we choose the seed node at random, and then run the P-Table join algorithm twice. We use the first run of the P-Table join algorithm to locate a nearby seed, and the second run to construct a better P-Table based on the nearby seed. Section 9.6 summarizes a performance evaluation of these two approaches.

For a real implementation, we make the following simple proposal: The seed node should be determined by estimating the network latency to all nodes in the leaf set and choosing the closest leaf set node. Since SkipNet name IDs incorporate naming locality, a node is likely to be close in terms of network proximity to the nodes in its leaf set. Thus the closest leaf set node is likely to be an excellent choice for a seed node.

After the initial P-Table is constructed, SkipNet constantly tries to improve the quality of its P-Table entries, and adjusts to node joins and departures, by means of a periodic stabilization algorithm. The P-Table is updated periodically so that the P-Table segment endpoints accurately reflect the distribution of name IDs in the SkipNet, which may change over time. The periodic mechanism used to update P-Table entries is very similar to the initial construction algorithm presented above. One key difference between the update mechanism and the initial construction mechanism is that for update, the current P-Table entries are considered as candidate nodes in addition to the candidates returned by the P-Table join message. The other difference is that for update, the seed node is chosen as the best candidate from the existing P-Table. Finally, the P-Table entries may also be incrementally updated as node joins and departures are discovered through ordinary message traffic.

5.4 Incorporating Network Proximity for Routing by Numeric ID

We add a third routing table, the C-Table, to incorporate network proximity when searching by numeric ID. Constrained Load Balancing (CLB), because it involves searches by both name ID and numeric ID, takes advantage of both the P-Table and the C-Table. Because search by numeric ID as part of a CLB search must stay within the CLB domain, C-Table entries that step outside the domain cannot be used. When such an entry is encountered, the CLB search must revert to using the R-Table.

The C-Table has identical functionality and design to the routing table that Pastry maintains [30]. The suggested parameter choice for Pastry's routing table is b = 4 (i.e., k = 16), while our implementation chooses k = 8, as mentioned in Section 5.1. As is the case with searching by numeric ID using the R-Table, and as is the case with Pastry, searching by numeric ID with the C-Table requires at most O(log N) message hops.

For concreteness, we describe the C-Table in the case that k = 8, although this description could be inferred from [30]. At each node the C-Table consists of a set of arrays of node pointers, one array

per numeric ID digit, each array having an entry for each of the eight possible digit values. Each entry of the first array points to a node whose first numeric ID digit matches the array index value. Each entry of the second array points to a node whose first digit matches the first digit of the current node and whose second digit matches the array index value. This construction is repeated until we arrive at an empty array.

5.4.1 C-Table Construction and Update

The details of C-Table construction can be found in [4]. The key idea is: For each array in the C-Table, route to a nearby node with the necessary numeric ID prefix, obtaining its C-Table entries at that level, and then populate the joining node's array with those entries. Since several candidate nodes may be available for a particular table entry, the candidate with the best network proximity is selected. Section 8.8 shows that the cost of constructing a C-Table is O(log N) in terms of message traffic. As in Pastry, the C-Table is updated lazily, by means of a background stabilization algorithm.

We report experiments in Section 9.5 showing that use of the C-Table during CLB search reduces the RDP (Relative Delay Penalty). An adaptation of the argument presented in [4] for Pastry explains why this should be the case.

5.5 Virtual Nodes

Economies of scale and the ability to multiplex hardware resources among distinct web sites have led to the emergence of hosting services in the World Wide Web. We anticipate a similar demand for hosting virtual nodes on a single hardware platform in peer-to-peer systems. In this section, we describe a scheme for scalably supporting virtual nodes within the SkipNet design. For ease of exposition, we describe only the changes to the R-Table; the corresponding changes to the P-Table and C-Table are obvious and hence omitted.

Nothing in the SkipNet design prevents multiple nodes from co-existing on a single machine; however, scalability becomes a concern as the number of virtual nodes increases. As shown in Section 8.2, a single SkipNet node's R-Table will probably contain roughly log N pointers. If a single physical machine hosts v virtual nodes, the total number of R-Table pointers for all virtual nodes is therefore roughly v log N. As v increases, the periodic maintenance traffic required for each of those pointers poses a scalability concern. To alleviate this potential bottleneck, the present section describes a variation on the SkipNet design that reduces the expected number of pointers required for v virtual nodes to O(v + log n), while maintaining logarithmic expected path lengths for searches by name ID. In Section 8.6 we provide mathematical proofs for the performance of this virtual node scheme.

Although Skip Lists have routing path lengths comparable to SkipNet's, Section 3 mentioned two fundamental drawbacks of Skip Lists as an overlay routing data structure:

• Nodes in a Skip List experience markedly disproportionate routing loads.

• Nodes in a Skip List have low average edge connectivity.

Our key insight is that neither of these two Skip List drawbacks applies to virtual nodes. In the context of virtual nodes, we desire that:

• A peer-to-peer system must avoid imposing a disproportionate amount of work on any given physical machine. It is less important that virtual nodes on a single physical machine do proportionate amounts of work.

• Similarly, each physical machine should have high edge connectivity. It is less important that virtual nodes on a single physical machine have high edge connectivity.

In light of these revised objectives, we can relax the requirement that each virtual node has roughly log n pointers. Instead, we allow the number of pointers per virtual node to have a similar distribution to the number of pointers per data record in a Skip List. More precisely, all but one of the virtual nodes independently truncate their numeric IDs such that they have length i ≥ 0 with probability 1/2^(i+1). The one remaining virtual node keeps its full-length numeric ID, in order to ensure that the physical machine has at least log n expected neighbors. As a result, in this scheme, the expected number of total pointers for a set of v virtual nodes is 2v + log n + O(1).
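The truncation rule above can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the authors' implementation: the names `sample_truncated_length` and `assign_id_lengths` are hypothetical, binary digits (k = 2) and 128-digit numeric IDs are assumed as in the basic design, and the sketch models only the ID lengths, not the resulting ring memberships or pointers.

```python
import random

FULL_ID_BITS = 128  # assumed: numeric IDs of 128 random binary digits (k = 2)


def sample_truncated_length(rng, full_bits=FULL_ID_BITS):
    """Draw a truncated numeric ID length i >= 0 with probability 1/2**(i+1).

    Equivalent to counting fair-coin "heads" before the first "tails".
    The cap at full_bits only affects an astronomically unlikely tail.
    """
    length = 0
    while length < full_bits and rng.random() < 0.5:
        length += 1
    return length


def assign_id_lengths(v, rng, full_bits=FULL_ID_BITS):
    """Return numeric ID lengths for the v virtual nodes of one machine.

    Exactly one virtual node (index 0 here, an arbitrary choice) keeps its
    full-length ID; the other v - 1 nodes truncate independently.
    """
    if v < 1:
        raise ValueError("a physical machine hosts at least one node")
    truncated = [sample_truncated_length(rng, full_bits) for _ in range(v - 1)]
    return [full_bits] + truncated


if __name__ == "__main__":
    rng = random.Random(2002)  # fixed seed so the run is repeatable
    lengths = assign_id_lengths(8, rng)
    print("numeric ID lengths per virtual node:", lengths)
```

Under this rule the truncated nodes keep one digit on average, since the sum over i ≥ 0 of i/2^(i+1) equals 1, which is consistent with the per-machine pointer count growing linearly in v rather than as v log N.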

When a virtual node routes a message, it can use any pointer in the R-Table of any co-located virtual node. Simply using the pointer that gets closest to the destination (without going past it) will maintain path locality and logarithmic expected routing performance.

The interaction between virtual nodes and DHT functionality is more complicated. DHT functionality involves searching for a given numeric ID. Search by numeric ID terminates when it reaches a ring from which it cannot go any higher; this is likely to occur in a relatively high-level ring. By construction, virtual nodes are likely only to be members of low-level rings, and thus they are likely not to shoulder an equal portion of the DHT storage burden. However, because at least one node per physical machine is not virtualized, the storage burden of the physical machine is no less than it would be without any virtual nodes.

6 Recovery from Organizational Disconnects

In this section, we characterize the behavior of SkipNet with respect to a common failure mode: when organizations become disconnected from the Internet. We describe and evaluate the recovery algorithms used to repair the SkipNet overlay when such failures occur. One key benefit of SkipNet's locality properties is graceful degradation in response to disconnection, one of the more common forms of Internet failure, which can be caused by misconfigurations and link and router faults [21, 24]. Because SkipNet orders nodes according to their names, and assuming that organizations assign node names with one or a few organizational prefixes, an organization's nodes are naturally arranged into a few contiguous overlay segments. Should an organization become disconnected, its segments remain internally well-connected and intra-segment traffic can be routed with the same O(log M) hop efficiency as before, where M is the maximum number of nodes in any segment.

By repairing only a few key routing pointers between the "edge" nodes of each segment, the entire organization can be connected into a separate SkipNet that can route traffic with similar efficiency: Intra-segment traffic is still routed in O(log M) hops, but inter-segment traffic may initially require O(log M) hops for every segment that it traverses. In total, O(S log M) hops may be required for inter-segment traffic, where S is the number of segments in the organization.

A background process repairs the additional routing pointers, thereby eliminating the cross-segment penalty. SkipNet's structure enables this repair process to be done in a manner that avoids unnecessary duplication of work. When the organization reconnects to the Internet, these same repair operations can be used to merge the organization's segments back into the global SkipNet.

In contrast, most previous scalable, peer-to-peer overlay designs [28, 30, 34, 40] place nodes in the overlay topology according to a unique random numeric ID only. Disconnection of an organization in most of these systems will result in its nodes fragmenting into many disjoint overlay pieces. During the time that these fragments are reforming into a single overlay, network routing efficiency may be poor or unbalanced, or may even fail.

6.1 Recovery Algorithms

When an organization is disconnected from the Internet, its nodes will be able to communicate with each other over IP but will not be able to communicate with nodes outside the organization. If the organization's nodes' names employ only a few organizational prefixes then the nodes are mostly contiguous in SkipNet, and hence the global SkipNet will partition itself into several disjoint, but internally well-connected, segments. This is illustrated in Figure 8.

Because of SkipNet's path locality property, message traffic within each segment will be unaffected by disconnection and will continue to be routed with O(log M) efficiency, where M is the number of nodes within the segment. Assuming that the disconnecting organization constitutes a small fraction of the global SkipNet, cross-segment traffic among the global portions of the SkipNet will also remain largely unaffected because most cross-segment pointers among global segments will remain valid. This will not be true for the segments of the disconnected organization.

Gracefully handling a partition in the underlying IP network has two aspects: continuing to provide internal connectivity for the duration of the partition, and efficiently repairing the overlay when the underlying IP network partition heals. Maintaining internal connectivity of the overlay requires that communications be possible both within each overlay segment and across segments that still have IP connectivity to each other. Repairing the overlay when the partition heals involves reestablishing communications between overlay segments that were formerly unreachable by IP. Thus, the primary repair task after both disconnection and reconnection is the merging of overlay segments.

The algorithms employed in both the disconnection and reconnection cases are very similar: SkipNet segments must discover each other and then be merged together. For the disconnect case, the organization segments are merged into a separate SkipNet and the global segments are merged to reform the global SkipNet. For the reconnect case, all segments from the two separate SkipNets are merged into a single SkipNet.

6.2 Discovery Techniques

When an organization disconnects from the Internet there is no guarantee that the resulting non-contiguous segments will have pointers into each other. Therefore its segments may not be able to find each other using only SkipNet pointers. To solve this discovery problem we assume that organizations will divide their nodes into a relatively small number of name segments and that they designate some number of nodes in each segment as "well-known". For instance, Microsoft might maintain well-known members of segments with name prefixes microsoft.com, hotmail.com, xbox.jp, etc. Each node in an organization maintains a list of these well-known nodes and uses them as contact points between the various overlay segments.

When an organization reconnects to the Internet, the organizational and global SkipNets discover each other through their segment edge nodes. Since each node maintains a leaf set, if a node discovers that one side of its leaf set, but not the other, is completely unreachable then it concludes that a disconnect event has occurred and that it is an edge node of a segment. These edge nodes keep track of their unreachable leaf set pointers and periodically ping them for reachability; should a pointer become reachable, the node initiates the merge process. Note that merging two previously independent SkipNets together (for example, when a new organization joins the system) is functionally equivalent to reconnecting a previously connected one, except that an alternate means of discovery is needed.

[Figure 8. Two partitioned SkipNets to be merged. Diagram labels: Global SkipNet and Microsoft SkipNet; nodes n1, n2, s0 to s3, d0 to d3; name ID prefixes at.ac.tuwien, com.amazon, com.google, com.hotmail, com.ibm, com.intel, com.microsoft, com.sun, edu.cmu, jp.sony, za.gov.]

    ConnectRootLevel(n1, n2) {
      edgeNodes = GatherEdgeNodeInfo(n1, n2, null)
      Connect edge node pairs.
    }

    GatherEdgeNodeInfo(n1, n2, msg) {
      n2 routes msg to n1 in its SkipNet.
      Msg will arrive at d1.
      d1 appends d1 and next neighbor, d0, to msg contents.
      d1 sends msg directly to n1 over IP.
      n1 routes msg to d0 in its SkipNet.
      Msg will arrive at s1.
      if (memberOf(s0, msg contents))  // => all segments traversed
        return msg contents
      else  // => Message needs to discover more edge nodes
        s1 appends s1 and next neighbor, s0, to msg contents.
        return GatherEdgeNodeInfo(s0, d0, msg)
    }

Figure 9. SkipNet root ring connection algorithm.

6.3 Connecting Root Ring Segments

The segment merge process consists of two steps: repair of the root ring pointers and repair of the pointers for all higher-level rings. The first step can be done quickly, as it only involves repair of the root ring pointers of the edge nodes of each segment. Once the first step has been done it will be possible to route messages correctly among nodes in different segments and to do so with O(S log M) efficiency, where S is the total number of segments and M is the maximum number of nodes within a segment. As a consequence, the second, more expensive step can be done as a background task, as described in Section 6.4.

The key idea for connecting SkipNet root ring segments is to discover the relevant edge nodes by having a node in one segment route a message to-

16 wards the name ID of a node in the other segment. Left Segment Being Connected Right Segment Being Connected This message will be routed to the edge node in ... the first segment that is lexicographically nearest to Level 2 11 00 01 10 00 10 01 11 the other node’s name ID. By repeating this process Pointers one can enumerate all edge nodes and hence all seg- Level 1 0 1 0 1 ments. Pointers
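The enumeration of edge nodes can be illustrated with a small simulation. The sketch below is not the paper's implementation: it models each SkipNet as a sorted list of name IDs, assumes that routing towards a name ends at the node with the largest name ID not exceeding it (wrapping around the ring), and uses the hypothetical helpers `route` and `right_neighbor`. It follows the alternating pattern of the root ring connection pseudo-code, collecting edge nodes until a previously seen node reappears.

```python
from bisect import bisect_right

def route(ring, name):
    """Assumed routing outcome: the node in sorted `ring` whose name ID is
    the largest one not exceeding `name`, wrapping around the ring."""
    return ring[bisect_right(ring, name) - 1]  # index -1 wraps to the last node

def right_neighbor(ring, node):
    """Root ring successor of `node` in sorted `ring`."""
    return ring[bisect_right(ring, node) % len(ring)]

def gather_edge_nodes(ring_s, ring_d, n1):
    """Enumerate segment edge nodes, starting from node n1 of ring_s."""
    found = []

    def remember(*nodes):
        for node in nodes:
            if node not in found:
                found.append(node)

    while True:
        d1 = route(ring_d, n1)           # message routed towards n1 in the other SkipNet
        d0 = right_neighbor(ring_d, d1)
        remember(d1, d0)
        s1 = route(ring_s, d0)           # message routed towards d0 in n1's SkipNet
        s0 = right_neighbor(ring_s, s1)
        if s0 in found:                  # every segment has been traversed
            return found
        remember(s1, s0)
        n1 = s0

# Two SkipNets whose name IDs interleave into four segments (made-up names).
ring_s = ["c1", "c2", "c3", "g1", "g2", "g3"]
ring_d = ["a1", "e1", "e2", "e3", "k1", "k2"]
edges = gather_edge_nodes(ring_s, ring_d, "c2")
```

With these inputs, `edges` contains the eight nodes that sit at segment boundaries (for example c1 and c3) and none of the interior nodes such as c2.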

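The edge-node test of Section 6.2 reduces to a check on leaf set reachability. A minimal sketch, assuming each side of the leaf set is represented as a list of booleans produced by some liveness probe (the function name and this representation are illustrative, not taken from the paper):

```python
def is_segment_edge(left_reachable, right_reachable):
    """Return True if exactly one side of the leaf set is completely unreachable.

    Each argument is a non-empty list of booleans, one per leaf set pointer
    on that side, recording whether the pointer currently responds.
    """
    left_cut = not any(left_reachable)
    right_cut = not any(right_reachable)
    return left_cut != right_cut
```

A node for which both sides are unreachable is not classified as an edge node by this rule; how such a node should behave is outside what the text specifies.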
The actual inter-segment pointer updates are then done as a single atomic operation among the segment edge nodes, using distributed two-phase commit. This avoids routing inconsistencies where a message destined for a specific node on one segment inadvertently ends up at a different node in another overlay segment because the segments to be merged do not yet form a fully connected root ring.

To illustrate, Figure 8 shows two SkipNets to be merged, a Microsoft SkipNet and a global SkipNet, each containing two different name segments. Suppose that node n1 knows of node n2's existence. Node n1 will send a message to node n2 (over IP) asking it to route a search message towards n1 in the global SkipNet. n2's message will end up at node d1 and, furthermore, d1's neighbor on the global SkipNet will be d0. d1 sends a reply to n1 (over IP) telling it about d0 and d1. n1 routes a search message towards d0 on the Microsoft SkipNet to discover s1 and s0 in the same manner. The procedure is iteratively invoked using s0 and d0 to gain information about s2, s3, d2, and d3. Figure 9 presents the algorithm in pseudo-code.

Immediately following root ring connection, messages sent to cross-segment destinations will be routed efficiently. Cross-segment messages will be routed to the edge of each segment they traverse and will then hop to the next segment using the root ring pointer connecting the segments. This leads to O(S log M) routing efficiency. When an organization reconnects its fully repaired SkipNet root ring to the global one, traffic destined for nodes external to the organization will be routed in O(log M) hops to an edge node of the organization's SkipNet. The root ring pointer connecting the two SkipNets will be traversed and then O(log N) hops will be needed to route traffic within the global SkipNet. Note that traffic that does not have to cross between the two SkipNets will not incur this routing penalty.

[Figure 10. Nodes whose pointers have been repaired at the boundary of two SkipNet segments. The figure shows the level 0, level 1, and level 2 pointers of a left segment whose nodes have numeric IDs 11..., 00..., 01..., 10... and a right segment whose nodes have numeric IDs 00..., 10..., 01..., 11....]

    // Called initially with level h=0 at node
    // to the left of the merge point
    PostMergeRepair(h) {
      Find closest node to left whose numeric ID matches
      mine in the first h bits and whose ID differs from
      mine in the next bit, by following level h pointers to the left.
      On my node:
        cont = FixMyRightPointer(h+1)
        if (cont) PostMergeRepair(h+1)
      In parallel, on the other node:
        cont2 = FixMyRightPointer(h+1)
        if (cont2) PostMergeRepair(h+1)
    }

    FixMyRightPointer(h) {
      Search right using level h-1 pointers until a node is
      found that matches my numeric id in h bits.
      Connect our level h pointers.
      if (pointers are already equal) return false
      else return true
    }

Figure 11. Level h ring repair algorithm for a single inter-segment boundary.

6.4 Repairing Routing Pointers following Root Ring Connection

Once the root ring connection phase has completed we can update all remaining pointers that need repair using a background task. We present here an algorithm for doing this that avoids unnecessary duplication of work through appropriate ordering of repair activities.

The key idea is that we recursively repair pointers at one level by using correct pointers at the level below to find the desired nodes in each segment. All pointers at one level must be repaired across a segment boundary before repair of a higher level can be initiated. To illustrate, consider Figure 10, which depicts a single boundary between two SkipNet segments after pointers have been repaired. Figure 11 presents an algorithm in pseudo-code for repairing

pointers above the root ring across a single boundary. We begin by discussing the single boundary case, and later we extend our algorithm to handle the multiple boundary case.

Assume that the root ring pointers have already been correctly connected. There are two sets of two pointers to connect between the segments at level 1: the ones for the routing ring labeled 0 and the ones for the routing ring labeled 1 (see Figure 4). We can repair the level 1 ring labeled 0 by traversing the root (level 0) ring from one of the edge nodes until we find nodes in each segment belonging to the ring labeled 0. The same procedure is followed to correctly connect the level 1 ring labeled 1. After the level 1 rings, we use the same approach to repair the four level 2 rings.

Because rings at higher levels are nested within rings at lower levels, repair of a ring at level h+1 can be initiated by one of the nodes that had its pointer repaired for the enclosing ring at level h. A repair at level h+1 is unnecessary if the level h ring (a) contains only a single member or (b) does not have an inter-segment pointer that required repair. The latter termination condition implies that most rings, and hence most nodes, in the global SkipNet will not, in fact, need to be examined for potential repair.

The total work involved in this repair algorithm is O(M log(N/M)), where M is the size of the disconnecting/reconnecting SkipNet segment and N is the size of the external SkipNet. Note that rings at level h+1 can be repaired in parallel once their enclosing rings at level h have been repaired across all segment boundaries. Thus, the repair process for a given segment boundary parallelizes to the extent supported by the underlying network infrastructure. We provide a theoretical analysis of the total work and total time to complete repair in Section 8.7.

To repair multiple segment boundaries, we simply call the algorithm described above once for each segment boundary. In the current implementation, we perform this process iteratively, waiting for the repair operation to complete on one boundary before initiating the repair at the next boundary. In future work, we plan to investigate initiating the segment repair operations in parallel; the open question is how to prevent repair operations from different boundaries from interfering with each other.

6.5 Repairing P-Table and C-Table Entries

In normal operation, both a node's P-Table and C-Table entries are updated periodically using the information gathered from the node's R-Table. Once the R-Table repair algorithms above have run, these periodic update processes will likewise repair the node's P-Table and C-Table with no resulting increase in maintenance traffic.

7 Design Alternatives

SkipNet's locality properties can be obtained to a limited degree by suitable extensions to existing overlay network designs. We explore several such extensions in this section. However, none of these design alternatives provides all of SkipNet's locality advantages.

The space of alternative design choices can be divided into three cases: Rely on the inherent locality properties of the underlying IP network and DNS naming instead of using an overlay network; use a single overlay network, possibly augmented, that supports locality properties; or use multiple overlay networks that provide locality by spanning different sets of member nodes.

7.1 IP routing and DNS naming

A simple alternative to SkipNet's content placement scheme is to route directly using IP after a DNS lookup. This approach would also arguably provide path locality since most organizations structure their internal networks in a path-local manner. However, discarding the overlay network also discards all of its advantages, including:

• Implicit support for DHTs, and in the case of SkipNet, support for constrained load balancing.

• Seamless reassignment of traffic to well-defined alternative nodes in the presence of node failures.

• Better support for higher level abstractions, such as application-level multicast [5, 32, 29] and load-aware replication [37].

• The ability to reach named destinations independent of the availability of the DNS name lookup service.
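Returning to the pointer repair of Section 6.4, its key idea of finding each level h pointer by walking correct pointers at level h−1 can be sketched as follows. This is an illustrative, centralized reconstruction of all right pointers rather than the distributed algorithm of Figure 11; the function name, the dictionary representation, and the example node names are assumptions. Numeric IDs are bit strings (k = 2) and match the numeric ID prefixes shown in Figure 10.

```python
def build_right_pointers(numeric_ids, max_level):
    """Return right[h][name]: the level-h right pointer of each node.

    numeric_ids maps a name ID to a bit string of at least max_level bits.
    Level 0 is the root ring in name ID order; each higher level is derived
    only from the pointers of the level below.
    """
    order = sorted(numeric_ids)
    n = len(order)
    right = [{order[i]: order[(i + 1) % n] for i in range(n)}]
    for h in range(1, max_level + 1):
        level = {}
        for name in order:
            prefix = numeric_ids[name][:h]
            # Search right using level h-1 pointers until a node matches
            # this node's numeric ID in its first h bits. The walk stays in
            # the enclosing level h-1 ring, so it ends at `name` at the latest.
            cur = right[h - 1][name]
            while numeric_ids[cur][:h] != prefix:
                cur = right[h - 1][cur]
            level[name] = cur
        right.append(level)
    return right

ids = {"a": "11", "b": "00", "c": "01", "d": "10",   # left segment
       "e": "00", "f": "10", "g": "01", "h": "11"}   # right segment
right = build_right_pointers(ids, 2)
```

In this example `right[1]["c"]` is `"e"`: the level 1 ring labeled 0 crosses the segment boundary from c to e, which is found by following level 0 pointers from c past d.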

7.2 Single Overlay Networks

Existing overlays are based on DHTs and depend on random assignment of node IDs in order to obtain a uniform distribution of nodes within their address spaces. To support explicit content placement onto a particular node requires changing either node or data naming. One could name a node with the hash of the data object's name, or some portion of its name. This scheme effectively virtualizes overlay nodes so that each node joins the overlay once per data object.

The drawback of this solution is that separate routing tables are required for each local data object. This will result in a prohibitive cost whenever a single node needs to store more than a few hundred data objects due to the network traffic overhead of building and maintaining large numbers of routing table entries.

Alternatively, one could change object names to use a two-part naming scheme, much like in SkipNet, where content names consist of unique node addresses concatenated to local, node-relative, names. Although this approach supports content placement, it does not support guaranteed path locality nor constrained load balancing (including continued content locality in the event of failover to a neighbor node).

One might imagine providing path locality by adding routing constraints to messages, so that messages are not allowed to be forwarded outside of a given organizational boundary. Unfortunately, such constraints would also prevent routing from being consistent. That is, messages sent to the same destination ID from two different source nodes would not be guaranteed to end up at the same destination node.

Overlay networks such as Pastry can partially mitigate the path locality problem using their support for network proximity [4]. However, Pastry's network proximity support depends on having a nearby node to use as a “seed” node when joining an overlay. If the nearby node is not within the same organization as the joining node, Pastry might not be able to provide good, let alone guaranteed, path locality. This problem is exacerbated for organizations that consist of multiple separate “islands” of nodes that are far apart in terms of network distance. In contrast, SkipNet is able to guarantee path locality, even across organizations that consist of separate clusters of nodes, as long as they are contiguous in name ID space.

An alternative to virtualizing node names would be to lengthen node IDs and partition them into separate, concatenated parts. For example, in a two-part scheme, node names would consist of two concatenated IDs and content names would also consist of two parts: a numeric ID value and a string name. The numeric ID would map to the first part of an overlay ID while the hash of the string name would map to the second part. The result is a static form of constrained load balancing: The numeric ID of a data object's name selects the DHT formed by all nodes sharing the same numeric ID and the string name determines which node to map to within the selected DHT. Furthermore, combining this approach with node virtualization provides explicit content placement.

This approach comes close to providing the same locality semantics as SkipNet: it provides explicit content placement, a static form of constrained load balancing, and path locality within each numeric ID domain. The major drawbacks of this approach are that the granularity of the hierarchy is frozen at the time of overlay creation by human decision; every layer of the hierarchy incurs an additional cost in the length of the numeric ID and in the size of the routing table that must be maintained; and the path locality guarantee is only with respect to boundaries in the static hierarchy.

7.3 Multiple Overlay Networks

Instead of using a single DHT-based overlay one might consider employing multiple overlays with different memberships. These multiple overlays can be arranged either as a static set of networks reflecting the desired locality requirements or as a dynamic set of overlays reflecting the participation of nodes in particular applications. In the static overlay case, a node could belong to just one of several alternative overlays, or belong to multiple overlays at different levels of a hierarchy.

In the case where each node belongs to only one of several overlays, one could imagine accessing other overlays by gateways. These gateways need not be a single point of failure if we give the backup

gateway an appropriate neighboring numeric ID. One could either route directly to well-known gateways, or the gateways could organize an overlay network amongst themselves (imagine an overlay network of overlay networks). In either case, inter-domain routing requires serial traversal of the domain hierarchy, resulting in potentially large latencies when routing between domains.

If instead each node belonged to multiple overlays (for example, to a global overlay, an organization-wide overlay, and perhaps also a divisional or building-wide overlay), the associated overhead would correspondingly grow. Explicit content placement would still require extension of the overlay design. Furthermore, in this scheme, data that is constrained load balanced within a single overlay is not readily accessible to clients outside that overlay network, although it could be made so by introducing gateways in this design.

A final design alternative involving multiple overlays is to define an overlay network per application. This lets applications dynamically define the set of participating nodes, and thus ensure that application-specific messages stay within this overlay. It does not provide any notion of locality within a subset of the overlay, and therefore fails to provide much of SkipNet's functionality, such as constrained load balancing.

In contrast, SkipNet provides explicit content placement, allows clients to dynamically define new DHTs over any name prefix scope, and guarantees path locality within any shared name prefix, all within a single shared infrastructure.

8 Analysis of SkipNet

In this section we analyze various properties of and costs of operations in SkipNet. Each subsection begins with a summary of the main results followed by a brief, intuitive explanation. The remainder of each subsection proves the results formally.

8.1 Searching by Name ID

Searches by name ID in a dense SkipNet take O(log_k N) hops in expectation, and O(k log_k N) hops in a sparse SkipNet. Furthermore, these bounds hold with high probability. (Refer to Section 5.1 for the definition of ‘sparse’, ‘dense’, and parameter k; the basic SkipNet design described in Section 3 is a sparse SkipNet with k = 2). We formally prove these results in Theorem 8.5 and Theorem 8.2, respectively. Intuitively, searches in SkipNet require this many hops for the same reason that Skip List searches do: every node's pointers are approximately exponentially distributed, and hence there will most likely be some pointer that halves the remaining distance to the destination. A dense SkipNet maintains roughly a factor of k more pointers and makes roughly a factor of k more progress on every hop.

For the formal analysis, we will consider a sparse R-Table first, and then extend our analysis to the dense R-Table. It will be helpful to have the following definitions: The node from which the search operation begins is called the source node and the node at which the search operation terminates is called the destination node. The search operation visits a sequence of nodes, until the destination node is found; this sequence is called the search path. Each step along the search path from one node to the next is called a hop. Throughout this subsection we will refer to nodes by their name IDs, and we will denote the name ID of the source by s, and the name ID of the destination by d.

The rings to which s belongs induce a Skip List structure on all nodes, with s at the head. To analyze the search path in SkipNet, we consider the path that the Skip List search algorithm would use on the induced Skip List; we then prove that the SkipNet search path is no longer than the Skip List search path. Let P be the SkipNet search path from s to d using a sparse R-Table. Let Q be the path that the Skip List search algorithm would use in the Skip List induced by node s. Note that both search paths begin with s and end with d, and all the nodes in the paths lie between s and d. To see that P and Q need not be identical, note that the levels of the pointers traversed in a Skip List search path are monotonically non-increasing; in a SkipNet search path this is not necessarily true.

To characterize the paths P and Q, it will be helpful to let F(x, y) denote the length of the longest common prefix of x's and y's numeric IDs. The following useful identities follow immediately from the definition of

20 F : Q, the Skip List search path (a contradiction). Re- ferring back to the Skip List search path invariant, F (x, y)=F (y, x) (1) y ∈ [w,d] such that F (s, y) >F(s, w). Combin- F (x, y) F(s, x), and therefore x is in Q. F (x, y) >f,F(x, z) >f⇒ F (y, z) >f (4) We consider the last case F (s, x) F(s, x). the same number of digits with w as it does with The SkipNet search path P contains every node s. By the assumption that x is not in Q, the Skip ∈ between s and d such that no node closer to d List search path, there exists y [x, d] satisfying has more digits in common with the previous node F (s, y) >F(s, x). Combining F (s, y) >F(s, x) on the path. This uniquely defines P by speci- with the case assumption, F (s, w) >F(s, x) and fying the nodes in order; the node following s is applying Identity 4 yields F (w,y) >F(s, x). uniquely defined, and this uniquely defines the sub- Since F (s, x)=F (w,x), this y also satisfies ∈ sequent node, etc. Formally, x ∈ [s, d] immedi- F (w,y) >F(w,x). Combining this with y [x, d] ately follows w in P if and only if it is the closest implies that y violates the SkipNet search path in-  node following w such that y ∈ [x, d] satisfying variant for x; x is not in P . F (w,y) >F(w,x).

Lemma 8.1. Let P be the SkipNet search path from A consequence of Lemma 8.1 is that the length s to d using a sparse R-Table and let Q be the path of the Skip List search path bounds the length of the that the Skip List search algorithm would use in SkipNet search path. In the following theorem, we the induced Skip List. Then P is a subsequence of prove a bound on the length of the SkipNet search D Q. That is, every node encountered in the SkipNet path as a function of , the distance between the s d search is also encountered in the Skip List search. source and the destination , by analyzing the Skip List search path. Note that our high-probability re- Proof: Suppose for the purpose of showing a con- sult holds for arbitrary values of D; to the best of tradiction that some node x in P does not appear in our knowledge, analyses of Skip Lists and of other Q. Let x be the first such node. Clearly x = s be- overlay networks [35, 30] prove bounds that hold cause s must appear in both P and Q. Let w denote with high probability for large N. Because of the x’s predecessor in P ; since x = s, x is not the first SkipNet design, we expect that D N will be a node in P and so w is indeed well-defined. Node w common case. There is no reason to expect this in must belong to Q because x was the first node in P Skip Lists or other overlay networks. that is not in Q. It will be convenient to define some standard We first consider the case that F (s, x) > probability distribution functions. Let fn,1/k(g) be F (s, w), i.e., x shares more digits with s than w the distribution function of the binomial distribu- does. We show that this implies that w is not in tion: if each experiment succeeds with probability Q, the Skip List search path (a contradiction). Re- 1/k, then fn,1/k(g) is the probability that we see ex- ferring back to the Skip List search path invariant, actly g successes after n experiments. 
Let Fn,1/k(g) x ∈ [w,d] plays the role of y, thereby showing that be the cumulative distribution function of the bino- w is not in Q. mial distribution: Fn,1/k(g) is the probability that We next consider the case that F (s, x)= we see at most g successes after n experiments. Let F (s, w), i.e., x shares equally many digits with s Gg,1/k(n) be the cumulative distribution function of as w does. We show that this implies that x is in the negative binomial distribution: Gg,1/k(n) is the 21 probability that we see g successes after at most n It remains to show that the probability the search experiments. takes more than m hops without traversing a level We use the following two identities below: g pointer is small. The classical Skip List analy- sis [27] upper bounds this probability using the neg- G 1 (n)=1− F 1 (g − 1) (5) g, k n, k ative binomial distribution, showing that Pr[Y> 1−α 1 F 1 (αn − 1) < f 1 (αn) for α< (6) m and Xm] ≤ Pr[Y>mand X t ≤ ≥ − t than m hops via: m] < 3/z0. That is, Pr[Y m] 1 3/z0. The expectation bound straightforwardly follows.  Pr[Y>m]= Pr[Y>mand Xmand X g] ID in a SkipNet using a dense R-Table. Recall that a ≤ Pr[Y>mand XF(s, x) (from the characteri- Skip List’, but such a structure would not be useful zation of the Skip List search path). Since w ∈ Q because in a Skip List, comparisons are typically and y ∈ [w,d], it cannot be the case that F (s, y) > more expensive than hops. Whenever we refer to a F (s, w), otherwise that would contradict the fact Skip List, we are always referring to a sparse Skip that w ∈ Q (using the Skip List search path char- List. Define P to be the SkipNet search path with acterization again). Therefore F (s, y) ≤ F (s, w), a dense R-Table and, as before, let Q be the path and Identity 3 yields that F (w,y) ≥ F (s, y). Ap- that the Skip List search algorithm would use in the plying Identity 2 to F (s, x) let G(x, y, h) denote to be the number of hops be- F (s, x)=F (w,x). 
We apply the conclusion, tween nodes x and y in the ring that contains them F (w,y) >F(w,x), in the rest of the proof to de- both at level h.Ifh>F(x, y) (meaning nodes x rive a contradiction. and y are not in the same ring at level h), we define Consider the ring containing w at level F (w,y). G(x, y, h)=∞. Note that node x has a pointer to Node y must be in this ring but node x cannot be node y at level h if and only if G(x, y, h) F(w,x). Starting at w, con- each intermediate node on the SkipNet search path sider traversing this ring until we encounter z, the we hop using the pointer that takes us as close to the first node on this ring with x F (s, qi). Recall that F (s, qi) is monotonically non- F (s, w),F(s, x)=F (s, w),F(s, x) 0 number of search hops is then they must also belong to the same ring at level i−1, which we refer to as the parent ring of ring R. O(logk D) Moreover, every ring R at level i ≥ 0 is partitioned into at most k disjoint rings at level i +1, which to arrive at a node distance D away from√ the source. we refer to as the child rings of ring R. Thus, the More precisely, for constants z0 = e and t0 = rings naturally form a Ring Tree which is rooted at ≥ 9, and for t t0, the search completes in at most the root ring. 2 (2t +1)logk D +2t + t +1hops with probability Given a Ring Tree, one can construct a trie as fol- − t at least 1 3/z0. lows. First, remove all rings whose parent ring con- Proof: As in the proof of Theorem 8.2, with prob- tains a single node — this will collapse any subtree ability at least 1 − 3/zt the number of levels in the of the trie that contains only a single node. Every 0 remaining ring that contains a single node is called Skip List search path is at most g = t +logk D, and the number of hops is at most m = tkg = a leaf ring; label the leaf ring with the numeric ID of 2 its single node. The resulting structure on the rings (tk logk D + t k). 
Applying Lemma 8.4, the num- ber of hops in the dense SkipNet search path is is a trie containing all the numeric IDs of the nodes in the SkipNet. m tkg Let Y be the random variable denoting the num- + g +1= + g +1 N k − 1 k − 1 ber of non-null right (equivalently, left) pointers at a ≤ 2tg + g +1=(2t +1)g +1 particular node in a SkipNet containing N nodes. Papadakis defines DN to be the random variable =(2t +1)(t +logk D)+1 2 giving the depth of a node in a k-ary trie with keys =(2t +1)logk D +2t + t +1 drawn from the uniform [0, 1] distribution. Note that  YN is identical to the random variable giving the depth of a node’s numeric ID in the trie constructed above, and thus we have Y = D . 8.2 Correspondence between SkipNet and N N We may use this correspondence and Papadakis’ Tries analysis to show that E[YN ]=1+V 1 (N), where k The pointers of a SkipNet effectively make every V 1 (N) is (as defined in [20]): node the head of a Skip List ordered by the nodes’ k   name IDs. Simultaneously, every node is also the N · g−1 1 N − g g (1/k) root of a trie [11] on the nodes’ numeric IDs. Thus V 1 (N)= ( 1) − k N g 1 − (1/k)g 1 the SkipNet simultaneously implements two distinct g=2 data structures in a single structure. One implica- Knuth proves in [20, Ex. 6.3.19] that V 1 (N)= tion is that we can reuse the trie analysis to deter- k mine the expected number of non-null pointers in logk N + O(1), and thus the expected number of the sparse R-Table of a SkipNet node. This extends right (equivalently, left) non-null pointers is given previous work relating Skip Lists and tries by Pa- by E[YN ]=logk N + O(1). padakis in [25, pp. 38]: The expected height of 8.3 Searching by Numeric ID a Skip List with N nodes and parameter p corre- 1 sponds exactly to the expected height of a p -ary trie SkipNet supports searches by numeric ID as well with N +1keys drawn from the uniform [0, 1] dis- as searches by name ID. Searches by numeric ID tribution. 
in a dense SkipNet take O(logk N) hops in expec- Recall that ring membership in a SkipNet is de- tation, and O(k logk N) in a sparse SkipNet. We termined as follows: For i ≥ 0, two nodes belong to formally prove these results in Theorem 8.6. Intu- the same ring at level i if the first i digits of their nu- itively, search by numeric ID corrects digits one at meric ID match exactly. All nodes belong to the one a time and needs to correct at most O(logk N) dig- ring at level 0, which is called the root ring. Note its. In the sparse SkipNet correcting a single digit

requires about O(k) hops, while in the dense case only O(1) hops are required.

Theorem 8.6. The expected number of hops in a search by numeric ID using a sparse R-Table is O(k log_k N). In a dense R-Table, the expected number of hops is O(log_k N). Additionally, these bounds hold with high probability (i.e., the number of hops is close to the expectation).

Proof: We use the same upper bound as in the proof of Theorem 8.2,

    Pr[search takes more than m hops]
        ≤ Pr[more than m hops and at most g levels]
          + Pr[more than g levels]

and bound the two terms separately. In Theorem 8.2 we showed that the maximum number of digits needed to uniquely identify a node is g = O(log_k N) with high probability, and thus no search by numeric ID will need to climb more than this many levels. This upper bounds the right-hand term.

The number of hops necessary on any given level in the sparse R-Table before the next matching digit is found is upper bounded by a geometric random variable with parameter 1/k. The sum of g of these random variables has expectation gk, and this random variable is close to its expectation with high probability (by standard arguments). Thus the expected number of hops in a search by numeric ID using a sparse R-Table is O(k log_k N), and additionally the bound holds with high probability.

For a search by numeric ID using a dense R-Table, we upper bound the number of hops necessary on any given level differently. Informally, instead of performing one experiment that succeeds with probability 1/k repeatedly, we perform k − 1 such experiments simultaneously. Formally, the probability of finding a matching digit in one hop is now 1 − (1 − 1/k)^(k−1) ≥ 1/2. Therefore the analysis in the case of a sparse R-Table need only be modified by replacing the parameter 1/k with 1/2. Thus the expected number of hops in a search by numeric ID using a dense R-Table is O(log_k N), and additionally the bound holds with high probability. □

8.4 Node Joins and Departure

We now analyze node join and departure operations using the analysis of both search by name ID and by numeric ID from the previous sections. As described in Section 3.5, a node join can be implemented using a search by numeric ID followed by a search by name ID, and will require O(k log_k N) hops in either a sparse or a dense SkipNet. Implementing node departure is even easier: As described in Section 3.5, a departing node need only notify its left and right neighbors at every level that it is leaving, and that the left and right neighbors of the departing node should point to each other. This yields a bound of O(log_k N) hops for the sparse SkipNet and O(k log_k N) for the dense SkipNet, where hops measure the total number of hops traversed by messages since these messages may be sent in parallel.

Theorem 8.7. The number of hops required by a node join operation is O(k log_k N) in expectation and with high probability in either a sparse or a dense SkipNet.

Proof: The join operation can be decomposed into a search by numeric ID, followed by a Skip List search by name ID. Because of this, the bound on the number of hops follows immediately from Theorem 8.2 and Theorem 8.6. It only remains to establish that the join operation finds all required neighbors of the joining node.

For a sparse SkipNet, the joining node needs a pointer at each level h to the node whose numeric ID matches in h digits that is closest to the right or closest to the left in the order on the name IDs. For a dense SkipNet, the joining node must find the same nodes as in the sparse SkipNet case, and then notify k − 2 additional neighbors at each level.

The join operation begins with a search for a node with the most numeric ID digits in common with the joining node. The search by name ID operation for the joining node starts at this node, and it is implemented as a Skip List search by name ID; the pointers traversed are monotonically decreasing in height, in contrast to the normal SkipNet search by name ID. Whenever the Skip List search path drops a level, it is because the current node at level h points to a node beyond the joining node. Therefore this last node at level h on the Skip List search path

is the closest node that matches the joining node in h digits. This gives the level h neighbor on one side, and the joining node’s level h neighbor on the other side is that node’s former neighbor. The message traversing the Skip List search path accumulates this information about all the required neighbors on its way to the joining node. This establishes the correctness of the join operation. □

8.5 Node Stress

We now analyze the distribution of load when performing searches by name ID using R-Tables. To analyze the routing load, we must assume some distribution of routing traffic. We assume a uniform distribution on both the source and the destination of all routing traffic. This assumption may or may not seem plausible, but its plausibility is increased if SkipNet uses an obvious optimization. If the destination of a SkipNet routing query is cached at the search originator, then subsequent searches to the same destination could be routed directly over IP. Servicing repeated queries directly from the cache would increase the randomness of the queries that SkipNet must handle.

Under some routing algorithms (which happen not to preserve path locality), the distribution of routing load is obviously uniform. For example, if routing traffic were always routed to the right, the load would be uniform. If the source and destination name ID do not share a common prefix, then path locality is not an issue and the SkipNet routing algorithm randomly chooses a direction in which to route; such traffic is uniformly distributed.

If the SkipNet routing algorithm can preserve path locality, it does so by always routing in the direction of the destination (i.e., if the destination is to the right of the source, routing proceeds to the right). We show that in this case load is approximately balanced: very few nodes’ loads are much smaller than the average load. We also show that no node’s load exceeds the average load by more than a constant factor with high probability; this result is relevant whether the routing algorithm preserves path locality or not. In the interest of simplicity, our proof assumes that k = 2; a similar result holds for arbitrary k. Also, we have previously given an upper bound of O(log d) on the number of hops between two nodes at distance d. In order to estimate the average load, we assume a tight bound of Θ(log d) without proof.

Theorem 8.8. Consider an interval on which we preserve path locality containing N nodes. Then the u-th node of the interval bears a $\Theta\left(\frac{\log \min\{u, N-u\}}{\log N}\right)$ fraction of the average load in expectation.

Proof: We first establish the expected load on node u due to routing traffic between a particular source l and destination r. The search path can only encounter u if, for some h, the numeric IDs of l and u have a common prefix of length h but no node between u and r has a longer common prefix with l. We observe that every node’s random choice of numeric ID digits is independent, and apply a union bound over h to obtain the following upper bound on the probability that the search encounters u. Denote the distance from u to r by d.

\[
\begin{aligned}
\Pr[\text{search encounters } u]
&\le \sum_{h \ge 0} \Pr[u \text{ and } l \text{ share } h \text{ digits}] \cdot \Pr[\text{no node between } u \text{ and } r \text{ shares more}] \\
&= \sum_{h \ge 0} \frac{1}{2^h} \cdot \left(1 - \frac{1}{2^{h+1}}\right)^d
\end{aligned}
\]

Denote the term in the above summation by H(h). Because H(h) falls by at most a factor of 2 when h increases by 1, we can upper bound the summation using:

\[
\sum_{h \ge 0} H(h) \le 2 \cdot \int_{h \ge 0} H(h)\, dh
\]

Making the change of variables $\alpha = 1 - \frac{1}{2^{h+1}}$, and hence $d\alpha = \frac{\ln 2}{2^{h+1}}\, dh$, we obtain:

\[
\int_{h \ge 0} H(h)\, dh = \int_{\alpha = 1/2}^{1} \frac{2}{\ln 2} \cdot \alpha^d\, d\alpha = \frac{2}{\ln 2} \cdot \frac{1^{d+1} - (1/2)^{d+1}}{d+1} = O(1/d)
\]

This completes the analysis of a single source/destination pair. A similar single pair analysis was also noted in [1]. We complete our theorem by considering all source/destination pairs.
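As a numerical sanity check on the single-pair bound above, the following sketch evaluates the sum of H(h) directly and compares it with twice the closed-form integral. It is an illustration only, not part of the original analysis; the function names and the truncation point `max_h` are assumptions made for this example.

```python
import math

def encounter_sum(d, max_h=200):
    """Sum over h >= 0 of H(h) = 2^-h * (1 - 2^-(h+1))^d, truncated at max_h terms.

    The truncation is an assumption of this sketch; the omitted tail is
    smaller than 2^-(max_h - 1) and therefore negligible.
    """
    return sum(0.5 ** h * (1.0 - 0.5 ** (h + 1)) ** d for h in range(max_h))

def integral_bound(d):
    """Twice the closed-form integral: (4 / ln 2) * (1 - 2^-(d+1)) / (d + 1)."""
    return (4.0 / math.log(2)) * (1.0 - 0.5 ** (d + 1)) / (d + 1)

if __name__ == "__main__":
    for d in (1, 10, 100, 1000, 10000):
        s = encounter_sum(d)
        # d * sum staying bounded is what O(1/d) predicts.
        print(f"d={d:>5}  sum={s:.6f}  bound={integral_bound(d):.6f}  d*sum={d * s:.3f}")
```

If the derivation holds, the `sum` column should never exceed the `bound` column, and `d*sum` should stay below a fixed constant as d grows.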

Our bound on the average load of a node is given by the total number of source/destination pairs multiplied by the bound on search hops divided by the total number of nodes. Summing over all the routing traffic that passes through u and dividing by the average load yields the proportion of the average load that u carries. To within a constant factor, this is:

\[
\frac{\sum_{l \in [1,u-1]} \sum_{r \in [u+1,N]} \frac{1}{|u-l| + |u-r|}}{\left(\binom{N}{2} \log N\right)/N}
= \frac{u \log(N-u) + (N-u) \log u}{((N-1) \log N)/2}
= \Theta\left(\frac{\log \min\{u, N-u\}}{\log N}\right)
\]
□

Corollary 8.9. The number of nodes with expected load less than Θ(α · average load) is $N^{\alpha}$.

Proof: Apply Theorem 8.8 and note that $\frac{\log u}{\log N} < \alpha$ implies that u

u and r shares exactly h bits with u. (Note that if r shares exactly h bits with u, it must share more than h bits with l, and thus routing traffic from l to r does not pass through u.) The analysis in the previous paragraph implies that the load on u is exactly $\sum_h L_h R_h$. We desire to show that this quantity is O(N log N) with high probability.

The random variable $L_h$ has the binomial distribution with parameter $1/2^{h+1}$. From this observation, standard arguments (that we have made explicit in earlier proofs in this section) show that $L_h$ has expectation $N/2^{h+1}$, and for $h \in [0, \log N - \log\log N]$, $L_h = O(N/2^{h+1})$ with high probability. The number of l that share more than log N − log log N bits with u is log N in expectation, and is O(log N) with high probability; these l (whose number of common bits with u we do not bound) can contribute at most O(N log N) to the final total.

To analyze the random variables $R_h$, we introduce new random variables $R'_h$ that stochastically dominate $R_h$. In particular, let $R'_h$ be the dis-

distributed like a geometric random variable with parameter $1/2^h$ multiplied by $O(N/2^h)$, and thus has expectation O(N). This yields the O(N log N) bound with high probability. □

8.6 Virtual Node Analysis

We outlined in Section 5.5 a scheme by which a single physical node could host multiple virtual nodes. Using this scheme, the bounds on search hops are unaffected, and the number of pointers per physical node is only $O(k \log_k N + kv)$ in the dense case, where v is the number of virtual nodes. In the sparse case, the number of pointers is just $O(\log_k N + v)$.

Intuitively, we obtain this by relaxing the requirement that nodes after the first have height $O(\log_k N)$. We instead allow node heights to be randomly distributed as they are in a Skip List. Because Skip List nodes maintain a constant number of pointers in expectation, we add only O(k) pointers per virtual node in the dense case, and O(1) in the sparse case. Searches are still efficient, just as they are in a Skip List.

Theorem 8.11. Consider a single physical node supporting v virtual nodes using the scheme of Section 5.5. In the dense case, searches require $O(\log_k D)$ hops, and the number of pointers is $O(k \log_k N + kv)$. In the sparse case, searches require $O(k \log_k D)$ hops, and the number of pointers is $O(\log_k N + v)$. All these bounds hold in expectation and with high probability.

Proof: The bound on the number of pointers is by construction. Consider the sparse case. The leading term in the bound, $O(\log_k N)$, is due to the one virtual node that is given all of its SkipNet pointers. The additional virtual nodes have heights given by geometric random variables with parameter 1/2, which is O(1) in expectation. The claimed bound on the number of pointers immediately follows, and the dense case follows by an identical argument with an additional factor of k.

We now analyze the number of search hops, focusing first on the sparse case. Because we might begin the search at a virtual node that does not have full height, we will break the analysis into two phases. During the first phase, the search path uses pointers of increasing level. At some point, we encounter a node whose highest pointer goes beyond the destination. From this point on (the second phase), we consider the Skip List search path to the destination that begins at this node. As in Theorem 8.2, the rest of the actual search path will be a subsequence of this Skip List path.

As in Theorem 8.2, the maximum level of any pointer in this interval of D nodes is $O(\log_k D)$ with high probability. Suppose that some particular node t is the first node encountered whose highest pointer points beyond the destination. In this case, the first phase is exactly a search by numeric ID for t’s numeric ID, and therefore the high probability bound of Theorem 8.6 on the number of hops applies. The second phase is a search from t for d, and the high probability bound of Theorem 8.2 on the number of hops applies. There is a subtlety to this second argument: although some or all of the intermediate nodes may be virtual, the actual search path is necessarily a subset of the search path in the Skip List induced by t (by the arguments of Lemma 8.1 and Lemma 8.3). We previously supposed that t was fixed; because there are at most D possibilities for t, considering all such possibilities increases the probability of requiring more than $O(k \log_k D)$ hops by at most a factor of D. Because the bound held with high probability initially, the probability of exceeding this bound remains negligible.

This yields the result in the sparse case. An identical argument holds in the dense case. □

8.7 Ring Merge

We now analyze the performance of the proactive algorithm for merging disjoint SkipNet segments, as described in Section 6. Consider the merge of a single SkipNet segment containing M nodes with a larger SkipNet segment containing N nodes. In the interest of simplicity, our discussion assumes that k = 2; a similar analysis applies for arbitrary k. Recall that the expected maximum level of a ring in the merged SkipNet is O(log N) with high probability (Section 8.2). Intuitively, the expected time to repair a ring at a given level after having reached that level is O(1) and ring repair occurs in parallel across all rings at a given level. This suggests that

the expected time required to perform the merge operation is O(log N), and we will show this formally in Theorem 8.12 under the assumption that the underlying network accommodates unbounded parallelization of the repair traffic. In practice, the bandwidth of the network may impose a limit: doing many repairs in parallel may saturate the network and hence take more time.

The expected amount of work required by the merge is O(M log(N/M)) = O(N). We first give an intuitive justification for this. The merge operation repairs at most four pointers per SkipNet ring. Since the total number of rings in the merged SkipNet is O(N) and the expected work required to repair a ring is O(1), the expected total work performed by the merge operation is O(N). Additionally, if M is much less than N, the bound O(M log(N/M)) proved in Theorem 8.13 is much less than O(N).

Now consider an organization consisting of S disjoint SkipNet segments, each of size at most M, merging into a global SkipNet of size N. In this case, the merge algorithm sequentially merges each segment of the organization one at a time into the global SkipNet. The total time required in this case is O(S log N) and the total work performed is O(SM log(N/M)); these are straightforward corollaries of Theorem 8.12 and Theorem 8.13.

Theorem 8.12. The time to merge a SkipNet segment of size M with a larger SkipNet segment of size N is O(log N) with high probability, assuming sufficient bandwidth in the underlying network.

Proof: After repairing a ring, the merge operation branches to repair both child rings in parallel, until there are no more child rings. Using the analogy with tries from Section 8.2, consider any path along the branches from the root ring to a ring with no children. We show that this path uses O(log N) hops with high probability. Union bounding over all such paths will complete the theorem.

We can assume that the height of any pointer is at most $c_1 \log N$. The number of hops to traverse this path is then upper bounded by a sum of $c_1 \log N$ geometric random variables with parameter 1/2. We now show that this sum is at most $c_2 \log N = O(\log N)$ with high probability. Applying the same reduction as in Section 8.1, using Identity 5 and Identity 6, we obtain the following upper bound on the probability of taking more than $c_2 \log N$ hops:

\[
\begin{aligned}
F_{c_2 \log N,\,1/2}(c_1 \log N)
&\le \frac{1 - c_1/c_2}{1 - 2c_1/c_2}\; f_{c_2 \log N,\,1/2}(c_1 \log N) \\
&= \frac{1 - c_1/c_2}{1 - 2c_1/c_2} \binom{c_2 \log N}{c_1 \log N} (1/2)^{c_2 \log N} \\
&\le \frac{1 - c_1/c_2}{1 - 2c_1/c_2} \cdot \frac{(c_2 \log N)^{c_1 \log N}}{(c_1 \log N)!} \cdot (1/2)^{c_2 \log N} \\
&\le \frac{1 - c_1/c_2}{1 - 2c_1/c_2} \cdot \frac{(c_2 \log N)^{c_1 \log N}}{\left(\frac{c_1 \log N}{e}\right)^{c_1 \log N}} \cdot 2^{-c_2 \log N} \\
&< \frac{1 - c_1/c_2}{1 - 2c_1/c_2} \left(\frac{c_2 \cdot e}{c_1}\right)^{c_1 \log N} 2^{-c_2 \log N}
\end{aligned}
\]

Choosing $c_2 = \max\{7c_1, 7\}$, this is at most $2N^{-2}$. Applying a union bound over the N possible paths completes the proof. □

Theorem 8.13. The expected total work to merge a SkipNet segment of size M with a larger SkipNet segment of size N is O(M log(N/M)).

Proof: Suppose all the pointers at level i have been repaired and consider any two level i + 1 rings that are children of a single level i ring. To repair the pointers in these two child rings, the nodes adjacent to the segment boundary at level i must each find the first node in the direction away from the segment boundary who differs in the i-th bit. The number of hops necessary to find either node is upper bounded by a geometric random variable with parameter 1/2. Only O(1) additional hops are necessary to finish the repair operation.

By considering a particular order on the random bit choices, we show that the number of additional hops incurred in every ring repair operation are independent random variables. Let all the level i bits be chosen before the level i + 1 bits. Then the number of hops incurred in fixing any two level i + 1 rings that are children of the same level i ring depends only on the level i + 1 random bits of those two rings. Also, only rings that require repair initiate a repair operation on their children. Therefore

we can assume that the level i rings from which we will continue the merge operation are fixed before we choose the level i + 1 bits. Hence the number of hops incurred in repairing these two child rings is independent of the number of hops incurred in the repair of any other ring.

We now consider the levels of the pointers that require repair. For low levels, we use the bound that the number of pointers needing repair at level i is at most $2^i$ because there are at most $2^i$ rings at this level. For higher levels, we prove a high probability bound on the total number of pointers that need to be repaired, showing that the total number is M(log N + O(1)) with high probability in M.

A node of height i cannot contribute more than i pointers to the total number needing repair. We upper bound the probability that a particular node’s height exceeds h by:

\[
\Pr[\text{height} > h] \le \frac{N + M}{2^h} \le \frac{2N}{2^h} = \frac{1}{2^{h - \log N - 1}}
\]

Thus each node’s height is upper bounded by a geometric random variable starting at (log N + 1) with parameter 1/2, and these random variables are independent. By standard arguments, their sum is at most M(log N + 3) with high probability in M.

The contribution of the first log M levels is at most 2M pointers, while the remaining levels contribute at most M(log N + 3 − log M) with high probability. In total, the number of pointers is O(M log(N/M)). The total number of hops is bounded by the sum of this many geometric random variables. This sum has expectation O(M log(N/M)) and is close to this expectation with high probability, again by standard arguments. □

8.8 Incorporating the P-Table and the C-Table

We first argue that our bounds on search by numeric ID, node join, and node departure continue to hold with the addition of C-Tables to SkipNet. Search by numeric ID corrects at least one digit on each hop, and there are never more than $O(\log_k N)$ digits to correct (Section 8.2). Construction of a C-Table during node join amounts to a search by numeric ID, using C-Tables, from an arbitrary SkipNet node to the joining node. This yields the same bound on node join as on search by numeric ID. During node departure, no work is performed to maintain the C-Table.

We only give an informal argument that search by name ID, node join, and departure continue to be efficient with the addition of P-Tables. Intuitively, search by name ID using P-Tables encounters nodes that interleave the R-Table nodes and since the R-Table nodes are exponentially distributed in expectation, we expect the P-Table nodes to be approximately exponentially distributed as well. Thus search should still approximately divide the distance to the destination by k on each hop.

P-Table construction during node join is more involved. Suppose that the intervals defined by the R-Table are perfectly exponentially distributed. Finding a node in the furthest interval is essentially a single search by name ID, and thus takes $O(\log_k N)$ time. Suppose the interval we are currently in contains g nodes. Finding a node in the next closest interval (containing at least g/k nodes) has at least constant probability of requiring only one hop. If we don’t arrive in the next closest interval after the first hop, we expect to be much closer, and we expect the second hop to succeed in arriving in the next closest interval with good probability. Iterating over all intervals, the total number of hops is $O(k \log_k N)$ to fill in every P-Table entry.

This completes the informal argument for construction of P-Tables during node join. As with C-Tables, no work is performed to maintain the P-Table during node departure.

9 Experimental Evaluation

To understand and evaluate SkipNet’s design and performance, we used a simple packet-level, discrete event simulator that counts the number of packets sent over a physical link and assigns either a unit hop count or a specified delay for each link, depending upon the topology used. It does not model either queuing delay or packet losses because modelling these would prevent simulation of large networks.

Our simulator implements three overlay network designs: Pastry, Chord, and SkipNet. The Pastry implementation is described in [30]. Our Chord implementation is based on the one available on the

MIT Chord web site [17], adapted to operate within our simulator. The corresponding algorithms are described in [34]. For our simulations, we run the Chord stabilization algorithm until no finger pointers need updating after all nodes have joined. We use two different implementations of SkipNet: a “basic” implementation that uses only the R-Table with duplicate pointer elimination, and a “full” implementation that includes the P-Table and C-Table as well. The full SkipNet implementation uses a sparse R-Table, and a dense P-Table with density parameter k = 8. For full SkipNet, we run two rounds of stabilization for P-Table entries before each experiment.

All our experiments were run both on a Mercator topology [36] and a GT-ITM topology [39]. The Mercator topology has 102,639 nodes and 142,303 links. Each node is assigned to one of 2,662 Autonomous Systems (ASs). There are 4,851 links between ASs in the topology. The Mercator topology assigns a unit hop count for each link. All figures shown in this section are for the Mercator topology. The experiments based on the GT-ITM topology produced similar results.

Our GT-ITM topology has 5050 core routers generated using the Georgia Tech random graph generator according to a transit-stub model. Application nodes were assigned to core routers with uniform probability. Each end system was directly attached by a LAN link to its assigned router (as was done in [5]). We used the routing policy weights generated by the Georgia Tech random graph generator [39] to perform IP unicast routing. The delay of each LAN link was set to 1ms and the average delay of core links was 40.5ms.

9.1 Methodology

We measured the performance characteristics of lookups using the following evaluation criteria:

Relative Delay Penalty (RDP): The ratio of the latency of the overlay network path between two nodes to the latency of the IP-level path between them.

Physical network distance: The absolute length of the overlay path between two nodes, in terms of the underlying network distance. For the Mercator topology we measure latency in terms of physical network hops since the Mercator topology does not provide link latencies. For the GT-ITM topology we measure latency in terms of milliseconds. In contrast, RDP measures the penalty of using an overlay network relative to IP. However, since part of SkipNet’s goal is to enable the placement of data near its clients, we also care about the absolute latency that a DHT lookup request incurs.

Number of failed lookups: The number of unsuccessful lookup requests in the presence of failures.

We also model the presence of organizations within the overlay network; each participating node belongs to a single organization. The number of organizations is a parameter to the experiment, as is the total number of nodes in the overlay. For each experiment, the total number of client lookups is ten times the number of nodes in the overlay.

The format of the names of participating nodes is org-name/node-name. The format of data object names is org-name/node-name/random-obj-name. Therefore we assume that the “owner” of a particular data object will name it with the owner node’s name followed by a node-local object name. In SkipNet, this results in a data object being placed on the owner’s node; in Chord and Pastry, the object is placed on a node corresponding to the SHA-1 hash of the object’s name. For constrained load balancing experiments we use data object names that include the ‘!’ delimiter following the name of the organization.

We model organization sizes two ways: a uniform model and a Zipf-like model.

• In the uniform model the size of each organization is uniformly distributed between 1 and N, the total number of application nodes in the overlay network.

• In the Zipf-like model, the size of an organization is determined according to a distribution governed by $x^{-1.25} + 0.5$ and normalized to the total number of overlay nodes in the system. All other Zipf-like distributions mentioned in this section are defined in a similar manner.

We model three kinds of node locality: uniform, clustered, and Zipf-clustered.

• In the uniform model, nodes are uniformly spread throughout the overlay.

• In the clustered model, the nodes of an organization are uniformly spread throughout a single randomly chosen autonomous system in the Mercator topology and throughout a randomly chosen stub network in GT-ITM. In Mercator we ensure that the selected AS has at least 1/10-th as many core router nodes as overlay nodes. For GT-ITM, if an organization has 1000 or fewer member nodes, then we spread it across a single stub network, otherwise we spread it across a “stub cluster”, a set of stub networks that all connect to the same transit link.

• For Zipf-clustered, we place organizations within ASes or stub networks, as before. However, the nodes of an organization are spread throughout its AS or stub network as follows: A “root” physical node is randomly placed within the AS or stub network and all overlay nodes are placed relative to this root, at distances modeled by a Zipf-like distribution. In this configuration most of the overlay nodes of an organization will be closely clustered together within their AS or stub network. This configuration is especially relevant to the Mercator topology, in which some ASes span large portions of the entire topology.

Data object names, and therefore data placement, are modelled similarly. In a uniform model, data names are generated by randomly selecting an organization and then a random node within that organization. In a clustered model, data names are generated by selecting an organization according to a Zipf-like distribution and then a random member node within that organization. For Zipf-clustered, data names are generated by randomly selecting an organization according to a Zipf-like distribution and then selecting a member node according to a Zipf-like distribution of its distance from the “root” node of the organization. Note that for Chord and Pastry, but not SkipNet, hashing spreads data objects uniformly among all overlay nodes in all of these three models.

[Figure 12 plot: Relative Delay Penalty (RDP) (y-axis, 0 to 9) versus number of nodes (x-axis, 1,000 to 100,000) for Chord, Pastry, Basic SkipNet, and Full SkipNet.]

Figure 12. RDP as a function of network size. Configuration: 1000 organizations with Zipf-like sizes, nodes and data names are Zipf-clustered.

Chord    Basic SkipNet    Full SkipNet    Pastry
16.3     41.7             102.2           63.2

Table 1. Average number of unique routing entries per node in an overlay with $2^{16}$ nodes.

For SkipNet, the actual node names used in our simulations may impact performance, so we used realistic distributions for both host names and organization names. Our distribution of organization names was derived from a list of 5,608 unique organizations which had at least one peer participating in Gnutella in March 2001 [33]. The host name distribution was obtained from a list of 177,000 internal host names in use at Microsoft Corporation.

We model locality of data access by specifying what fraction of all data lookups will be forced to request data local to the requestor’s organization. Finally, we model system behavior under Internet-like failures and study document availability within a disconnected organization. We simulate domain isolation by failing the links connecting the organization’s AS to the rest of the network in Mercator and by failing the relevant transit links in GT-ITM.

Each experiment is run ten times, with different random seeds, and the mean values are presented. SkipNet uses 128-bit numeric IDs and a leaf set of 16 nodes. Chord and Pastry use their default configurations [34, 30].

Our experiments measured the costs of sending overlay messages to overlay nodes using the different overlays under various distributions of nodes and content. Data gathered included:

Application Hops: The number of application-level hops required to route a message via the overlay to the destination.

Relative Delay Penalty (RDP): The ratio between the average delay using overlay routing and the average delay using IP routing.

Experimental parameters varied included:

Overlay Type: Chord, Pastry, Basic SkipNet, or Full SkipNet.

Topology: Mercator (the default) or GT-ITM.

Message Type: Either DHT Lookup (the default), indicating that messages are DHT lookups, or Send, indicating that messages are being sent to randomly chosen overlay nodes.

Nodes (N): Number of overlay nodes. Most experiments vary N from $2^8$ through $2^{16}$ increasing by powers of two. Some fix N at $2^{16}$.

Lookups: Number of lookup requests routed per experiment. Usually 10 × N.

Trials: The number of times each experiment is run, each with different random seed values. Usually 10. Results reported are the average of all runs.

Organizations: Number of distinct organization names content is located within. Typical values include 1, 10, 100, and 1000 organizations. Nodes within an organization are located within the same region of the simulated network topology. For Mercator topologies they are located within the same Autonomous System (AS). In a GT-ITM topology for small organizations they are all nodes attached to the same stub network and for large organizations they are all nodes connected to the same stub cluster, a set of stub networks that all connect to the same transit link.

Organization Sizes: One of Uniform, indicating randomly chosen organization sizes between 1 and N in size, or Zipf, indicating organization sizes chosen using a $1/x^{1.25}$ Zipf distribution with the largest organization size being $\frac{1}{2}N$.

Node Locality: One of Uniform or Zipf. Controls how node locations cluster within each organization. Uniform spreads nodes randomly among the nodes within an organization’s topology. Zipf sorts candidate nodes by distance from a chosen root node within an organization’s topology and clusters nodes using a Zipf distribution near that node.

Document Locality: One of Uniform, By Org, or By Node. Uniform spreads document names uniformly across all nodes. By Org applies a Zipf-like distribution causing larger organizations to have a larger share of documents, with documents uniformly distributed across nodes within each organization. By Node is used in conjunction with a Zipf-like distribution of nodes within an organization to distribute documents within the organization with the same distribution as the nodes themselves.

% Local: Fraction of lookups that are constrained to be local to documents within the client’s organization. Non-local lookups are distributed among all documents in the experiment.

Overlay-specific parameter defaults were:

Chord: NodeID Bits = 40.

Pastry: NodeID Bits = 128, Bits per Digit (b) = 4, Leaf Set size = 16.

SkipNet: Basic configuration: Random ID Bits = 128, Leaf Set size = 16, ring branching factor (k) = 2. Full configuration: Same as basic, except k = 8 and adds use of P-Table for proximity awareness and C-Table for efficient numeric routing.

9.2 Basic Routing Costs

To understand SkipNet’s routing performance we simulated overlay networks varying the number of nodes from 1,024 to 65,536. We ran experiments with 10, 100, and 1000 organizations and with all the permutations obtainable for organization size distribution, node placement, and data placement. The intent was to see how RDP behaves under various configurations. We were especially curious to see whether the non-uniform distribution of data object names would adversely affect the performance of SkipNet lookups, as compared to Chord and Pastry.

Figure 12 presents the RDPs measured for both implementations of SkipNet, as well as Chord and Pastry. Table 1 shows the average number of unique routing table entries per node in an overlay with $2^{16}$ nodes. All other configurations, including the completely uniform ones, exhibited similar results to those shown here.

Our conclusion is that basic SkipNet performs similarly to Chord and full SkipNet performs similarly to Pastry. This is not surprising since both

[Figure 13 plot: physical network hops (y-axis, 0 to 120) versus fraction of forced local lookups (x-axis, 0% to 100%) for Chord, Pastry, Basic SkipNet, and Full SkipNet.]

[Figure 14 plot: percentage of failed lookups (y-axis, 0% to 100%) versus fraction of forced local lookups (x-axis, 0% to 100%) for Chord, Pastry, Basic SkipNet, and Full SkipNet.]

Figure 13. Absolute latency (in network hops) for lookups as a function of data access locality (percentage of lookups forced to be within a single organization). Configuration: $2^{16}$ nodes, 100 organizations with Zipf-like sizes, nodes and data names are Zipf-clustered.

Figure 14. Number of failed lookup requests as a function of data access locality (percentage of lookup requests forced to be within a single organization) for a disconnected organization. Configuration: $2^{16}$ nodes, 100 organizations with Zipf-like sizes, nodes and data names are Zipf-clustered.

basic SkipNet and Chord do not support network proximity-aware routing whereas full SkipNet and Pastry do. Since all our other configurations produced similar results, we conclude that SkipNet’s performance is not adversely affected by non-uniform distributions of names.

9.3 Exploiting Locality of Placement

RDP only measures performance relative to IP-based routing. However, one of SkipNet’s key benefits is that it enables localized placement of data. Figure 13 shows the average number of physical network hops for lookup requests. The x-axis indicates what fraction of lookups were forced to be to local data (i.e., the data object names that were looked up were from the same organization as the requesting client). The y-axis shows the number of physical network hops for lookup requests.

As expected, both Chord and Pastry are oblivious to the locality of data references since they diffuse data throughout their overlay network. On the other hand, both versions of SkipNet show significant performance improvements as the locality of data references increases. It should be noted that Figure 13 actually understates the benefits gained by SkipNet because, in our Mercator topology, inter-domain links have the same cost as intra-domain links. In an equivalent experiment run on the GT-ITM topology, SkipNet end-to-end lookup latencies were over a factor of seven less than Pastry’s for 100% local lookups.

9.4 Fault Tolerance

Content locality also improves fault tolerance. Figure 14 shows the number of lookups that failed when an organization was disconnected from the rest of the network.

This (common) Internet-like failure had catastrophic consequences for Chord and Pastry. The size of the isolated organization in this experiment was roughly 15% of the total nodes in the system. Consequently, Chord and Pastry will both place roughly 85% of the organization’s data on nodes outside the organization. Furthermore, they must also attempt to route lookup requests with 85% of the overlay network’s nodes effectively failed (from the disconnected organization’s point-of-view). At this level of failures, routing is effectively impossible. The net result is a failed lookups ratio of very close to 100%.

In contrast, both versions of SkipNet do better the more locality of reference there is. When no lookups are forced to be local, SkipNet fails to access the 85% of data that is non-local to the organization. As the percentage of local lookups is increased to 100%, the percentage of failed lookups goes to 0.

To experimentally confirm the behavior of SkipNet’s disconnection and merge algorithms described in Section 6, we extended the simulator to support disconnection of AS subnetworks. Figure 15 shows the routing performance we observed between a previously disconnected organization and the rest of the system once the organization’s SkipNet root ring has been connected to the global SkipNet root ring. We also show the routing performance observed when all higher level pointers have been repaired.

[Figure 15 plot: routing hops (y-axis, 0 to 16) versus number of nodes (x-axis, 0 to 10000) for two curves, After Root Ring Merge and After All Levels Merged.]

Figure 15. Number of routing hops taken to route inter-organizational messages, as a function of network size, after an organization’s internal SkipNet has been reconnected to the global SkipNet root ring and after the merge has been fully completed.

9.5 Constrained Load Balancing

Figure 16 explores the routing performance of two different CLB configurations, and compares their performance with Pastry. For each system, all lookup traffic is organization-local data. The organization sizes as well as node and data placement

[Figure 16 plot: Relative Delay Penalty (RDP) (y-axis, 0 to 8) versus number of nodes (x-axis, 100 to 100,000) for Pastry, Basic CLB, and Full CLB.]

Figure 16. RDP of lookups for data that is constrained load balanced (CLB) as a function of network size. Configuration: 100 organizations with Zipf-like sizes, nodes and data names are Zipf-clustered.

want to load-balance across.

9.6 Network Proximity

Figure 17 shows the performance of SkipNet routing using the P-Table. The x-axis varies the configuration parameter k which controls the density of P-Table pointers. The y-axis shows the routing performance in terms of RDP, and each data point is labelled with the average number of unique pointers per node. Note that the C-Table was not enabled so the pointers are from the R-Table, P-Table and leaf set. Figure 17 shows that for small values of k, increasing k yields a large RDP improvement with a small increase in the number of pointers. As k grows, we see minimal improvement in RDP but significantly more pointers.
This suggests are clustered with a Zipf-like distribution. The Ba- that choosing k =8provides most of the RDP ben- sic CLB configuration uses only the R-Table de- efit with a reasonable number of pointers. scribed in Section 3, whereas Full CLB makes use We also analyzed the sensitivity of P-Table per- of the R-Table and the C-Table, as described in Sec- formance to the choice of the initial seed node. We tion 5.4. compared the performance when choosing a seed The Full CLB curve shows a significant perfor- node at random with choosing the seed as the clos- mance improvement over Basic CLB, justifying the est node in the system. Our results show virtu- cost of maintaining the extra routing tables. How- ally identical performance, which indicates that the ever, even with the additional tables, the Full CLB P-Table join mechanism is effective at locating a performance trails Pastry’s performance. We plan to nearby seed. investigate further techniques to reduce the latency of CLB. The key observation, however, is that in 10 Conclusion order to mimic the CLB functionality with a tradi- tional peer-to-peer overlay network, multiple rout- To become broadly acceptable application infras- ing tables are required, one for each domain that you tructure, peer-to-peer systems need to support both

content and path locality: the ability to control where data is stored and to guarantee that routing paths remain local within an administrative domain whenever possible. These properties provide a number of advantages, including improved availability, performance, manageability, and security. To our knowledge, SkipNet is the first peer-to-peer system design that achieves both content and routing path locality. SkipNet achieves this without sacrificing the performance goals of previous peer-to-peer systems: Nodes maintain a logarithmic amount of state and operations require a logarithmic number of message hops.

Figure 17. RDP for Full SkipNet as a function of the density configuration parameter k. The labels next to each point represent the average number of unique pointers per node. Configuration: 2^16 nodes, 1000 organizations with Zipf-like sizes, nodes and data names are Zipf-clustered. [x-axis: Density Parameter K; y-axis: Relative Delay Penalty (RDP); curve: Full SkipNet.]

SkipNet provides content locality at any desired degree of granularity. Constrained load balancing encompasses placing data on a particular node, as well as traditional DHT functionality, and any intermediate level of granularity. This granularity is only limited by the hierarchy encoded in nodes' name IDs.

Clustering node names by organization allows SkipNet to perform gracefully in the face of a common type of Internet failure: When an organization loses connectivity to the rest of the network, SkipNet fragments into two segments that are still able to route efficiently internally. SkipNet also provides a mechanism to efficiently re-merge these segments with the global SkipNet when the network partition heals. With uncorrelated and independent node failures, SkipNet behaves comparably to other peer-to-peer systems.

Our evaluation has demonstrated that SkipNet's performance is similar to other peer-to-peer systems such as Chord and Pastry under uniform access patterns. Under access patterns where intra-organizational traffic predominates, SkipNet performs better. Our experiments show that SkipNet is significantly more resilient to organizational network partitions than other peer-to-peer systems.

In future work, we plan to deploy SkipNet across a testbed of 2000 machines emulating a WAN. This deployment should further our understanding of SkipNet's behavior in the face of dynamic node joins and departures, network congestion, and other real-world scenarios. We also plan to evaluate SkipNet as infrastructure for implementing a scalable event notification service [2].

Acknowledgements

We thank Antony Rowstron, Miguel Castro, and Anne-Marie Kermarrec for allowing us to use their Pastry implementation and network simulator. We thank Atul Adya, who independently observed that Chord's structure suggested the possibility of a Skip List-based distributed data structure, and provided helpful feedback on drafts of this paper. We also thank Scott Sheffield for his insights on the analysis of searching by name.

References

[1] J. Aspnes and G. Shah. Skip Graphs. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 2003.
[2] L. F. Cabrera, M. B. Jones, and M. Theimer. Herald: Achieving a Global Event Notification Service. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems (HotOS-VIII), May 2001.
[3] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. Wallach. Security for peer-to-peer routing overlays. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (OSDI). USENIX, December 2002.
[4] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Topology-aware routing in structured peer-to-peer overlay networks. Technical Report MSR-TR-2002-82, Microsoft Research, 2002.
[5] Y.-H. Chu, S. G. Rao, and H. Zhang. A case for end system multicast. In ACM SIGMETRICS 2000, pages 1–12, Santa Clara, CA, June 2000.
[6] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A Distributed Anonymous Information Storage and Retrieval System. In Workshop on Design Issues in Anonymity and Unobservability, pages 311–320, July 2000. ICSI, Berkeley, CA, USA.
[7] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[8] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-area cooperative storage with CFS. In 18th ACM Symposium on Operating Systems Principles, Oct. 2001.
[9] J. R. Douceur. The Sybil Attack. In Proceedings of First International Workshop on Peer-to-Peer Systems (IPTPS '02), March 2002.
[10] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proceedings of the 22nd ICDCS, July 2002.
[11] E. Fredkin. Trie Memory. Communications of the ACM, 3(9):490–499, Sept. 1960.
[12] Gnutella. http://www.gnutelliums.com/.
[13] S. Gribble, E. Brewer, J. Hellerstein, and D. Culler. Scalable, distributed data structures for Internet service construction. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation (OSDI 2000), October 2000.
[14] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A Scalable Overlay Network with Practical Locality Properties. In Proceedings of Fourth USENIX Symposium on Internet Technologies and Systems (USITS '03), Mar. 2003.
[15] N. J. A. Harvey, M. B. Jones, M. Theimer, and A. Wolman. Efficient Recovery From Organizational Disconnects in SkipNet. In Proceedings of Second International Workshop on Peer-to-Peer Systems (IPTPS '03), Feb. 2003.
[16] S. Iyer, A. Rowstron, and P. Druschel. Squirrel: A decentralized, peer-to-peer web cache. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing (PODC). ACM, July 2002.
[17] F. Kaashoek, R. Morris, F. Dabek, I. Stoica, E. Brunskill, D. Karger, R. Cox, and A. Muthitacharoen. The Chord Project, 2002. http://www.pdos.lcs.mit.edu/chord/.
[18] D. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 654–663, May 1997.
[19] P. Keleher, S. Bhattacharjee, and B. Silaghi. Are Virtualized Overlay Networks Too Much of a Good Thing? In Proceedings of First International Workshop on Peer-to-Peer Systems (IPTPS '02), March 2002.
[20] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1973.
[21] C. Labovitz and A. Ahuja. Experimental Study of Internet Stability and Wide-Area Backbone Failures. In Fault-Tolerant Computing Symposium (FTCS), June 1999.
[22] D. Malkhi, M. Naor, and D. Ratajczak. Viceroy: A Scalable and Dynamic Emulation of the Butterfly. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed Computing (PODC), July 2002.
[23] P. Maymounkov and D. Mazières. Kademlia: A Peer-to-peer Information System Based on the XOR Metric. In Proceedings of the First International Workshop on Peer-to-Peer Systems (IPTPS '02), MIT, March 2002.
[24] D. Oppenheimer, A. Ganapathi, and D. A. Patterson. Why do Internet services fail, and what can be done about it? In Proceedings of Fourth USENIX Symposium on Internet Technologies and Systems (USITS '03), Mar. 2003.
[25] T. Papadakis. Skip Lists and Probabilistic Analysis of Algorithms. PhD thesis, University of Waterloo, 1993. Also available as Technical Report CS93-28.
[26] W. Pugh. Skip Lists: A Probabilistic Alternative to Balanced Trees. In Workshop on Algorithms and Data Structures, pages 437–449, 1989.
[27] W. Pugh. A Skip List Cookbook. Technical Report CS-TR-2286.1, University of Maryland, 1990.
[28] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A Scalable Content-Addressable Network. In Proceedings of ACM SIGCOMM, Aug. 2001.
[29] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level Multicast using Content-Addressable Networks. In Proceedings of the Third International Workshop on Networked Group Communication, Nov. 2001.
[30] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In International Conference on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg, Germany, Nov. 2001.
[31] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In 18th ACM Symposium on Operating Systems Principles, Oct. 2001.
[32] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. Scribe: The design of a large-scale event notification infrastructure. In Third International Workshop on Networked Group Communications, Nov. 2001.
[33] S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking, San Jose, CA, USA, Jan. 2002.
[34] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of the ACM SIGCOMM '01 Conference, pages 149–160, San Diego, California, August 2001.
[35] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. Technical Report TR-819, MIT, March 2001.
[36] H. Tangmunarunkit, R. Govindan, S. Shenker, and D. Estrin. The Impact of Routing Policy on Internet Paths. In INFOCOM, pages 736–742, April 2001.
[37] M. Theimer and M. B. Jones. Overlook: Scalable Name Service on an Overlay Network. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS). IEEE Computer Society, July 2002.
[38] A. Vahdat, J. Chase, R. Braynard, D. Kostic, and A. Rodriguez. Self-Organizing Subsets: From Each According to His Abilities, To Each According to His Needs. In Proceedings of First International Workshop on Peer-to-Peer Systems (IPTPS '02), March 2002.
[39] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee. How to Model an Internetwork. In Proceedings of IEEE Infocom '96, April 1996.
[40] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An Infrastructure for Fault-Resilient Wide-area Location and Routing. Technical Report UCB//CSD-01-1141, U. C. Berkeley, April 2001.
