Distributed Hash Tables (DHTs): Chord & CAN
CS 699/IT 818
Sanjeev Setia

Acknowledgements
The following slides are borrowed or adapted from talks by Robert Morris (MIT) and Sylvia Ratnasamy (ICSI).
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Robert Morris
Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan
MIT and Berkeley
SIGCOMM Proceedings, 2001

Simplicity
- Resolution entails participation by O(log(N)) nodes
- Resolution is efficient when each node enjoys accurate information about O(log(N)) other nodes
- Resolution is possible when each node enjoys accurate information about 1 other node
- "Degrades gracefully"
Chord Algorithms
- Basic Lookup
- Node Joins
- Stabilization
- Failures and Replication

Chord Properties
- Efficient: O(log(N)) messages per lookup
  - N is the total number of servers
- Scalable: O(log(N)) state per node
- Robust: survives massive failures
- Proofs are in paper / tech report
  - Assuming no malicious participants
Chord IDs
- Key identifier = SHA-1(key)
- Node identifier = SHA-1(IP address)
- Both are uniformly distributed
- Both exist in the same ID space
- How to map key IDs to node IDs?

Consistent Hashing [Karger 97]
- Target: web page caching
- Like normal hashing, assigns items to buckets so that each bucket receives roughly the same number of items
- Unlike normal hashing, a small change in the bucket set does not induce a total remapping of items to buckets
Consistent Hashing [Karger 97]
[Figure: circular 7-bit ID space with nodes N10, N32, N60, N90, N105, N120 and keys K5, K20, K80; Key 5 hashes to K5, Node 105 hashes to N105]
- A key is stored at its successor: the node with the next higher ID

Basic lookup
[Figure: N32 asks "Where is key 80?"; the query travels around the ring until it finds K80's successor, and the answer "N90 has K80" is returned]
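To make the key-to-node mapping concrete, here is a minimal Python sketch of consistent hashing on the 7-bit ring from the figure (the helper names and example addresses are illustrative, not from the Chord code):

    import hashlib
    from bisect import bisect_left

    ID_BITS = 7  # the figure's circular 7-bit ID space

    def chord_id(text):
        # Keys and node addresses hash into the same circular ID space.
        digest = hashlib.sha1(text.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** ID_BITS)

    def successor(sorted_node_ids, key_id):
        # A key is stored at its successor: the first node ID at or
        # after the key ID, wrapping around the ring.
        i = bisect_left(sorted_node_ids, key_id)
        return sorted_node_ids[i % len(sorted_node_ids)]

    nodes = sorted(chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"))
    print(successor(nodes, chord_id("some-key")))

Adding or removing one node ID moves only the keys between that node and its predecessor, which is exactly the "small change, small remapping" property claimed above.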
Simple lookup algorithm

Lookup(my-id, key-id)
  n = my successor
  if my-id < n < key-id
    call Lookup(key-id) on node n  // next hop
  else
    return my successor            // done

- Correctness depends only on successors

"Finger table" allows log(N)-time lookups
[Figure: N80's fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
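Note that "my-id < n < key-id" is an interval test on the ring (modulo 2^m), not an ordinary comparison. A hedged Python rendering of the slide's pseudocode, with made-up helper names:

    RING = 2 ** 7  # 7-bit ID space, as in the figures

    def between(x, a, b):
        # True iff x lies on the arc strictly after a and before b,
        # walking clockwise around the ring.
        return (x - a) % RING < (b - a) % RING

    def lookup(succ, my_id, key_id):
        n = succ[my_id]                     # my successor
        if between(n, my_id, key_id):       # key lies beyond my successor
            return lookup(succ, n, key_id)  # next hop
        return n                            # done: n is the key's successor

    succ = {10: 32, 32: 60, 60: 90, 90: 105, 105: 10}  # toy ring
    print(lookup(succ, 10, 80))  # -> 90: N90 stores K80

This version takes O(N) hops, which is why the finger table on the next slide matters.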
Lookup with fingers

Lookup(my-id, key-id)
  look in local finger table for
    highest node n s.t. my-id < n < key-id
  if n exists
    call Lookup(key-id) on node n  // next hop
  else
    return my successor            // done

Finger i points to successor of n + 2^(i-1)
[Figure: from N80, the finger for point 112 (= N80 + 2^5) points to N120, the successor of 112]
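A sketch of building the finger table per this formula, on a toy ring (standalone; succ_of plays the role of the ring's successor function):

    from bisect import bisect_left

    BITS = 7
    RING = 2 ** BITS
    NODES = sorted([10, 32, 60, 80, 90, 105, 120])  # toy ring

    def succ_of(point):
        # First node at or after 'point', wrapping around the ring.
        i = bisect_left(NODES, point % RING)
        return NODES[i % len(NODES)]

    def finger_table(n):
        # Finger i (1-based) points to the successor of n + 2^(i-1).
        return [succ_of(n + 2 ** (i - 1)) for i in range(1, BITS + 1)]

    print(finger_table(80))  # -> [90, 90, 90, 90, 105, 120, 32]

N80's sixth finger matches the figure: the finger for 112 points to N120. Routing via the highest finger that still precedes the key halves the remaining ID distance each hop, giving O(log(N)) hops.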
Lookups take O(log(N)) hops
[Figure: Lookup(K19) on a ring with nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; each finger hop roughly halves the remaining distance to K19's successor]

Node Join - Linked List Insert
[Figure: N36 joins a ring where N25's successor is N40, which holds K30 and K38]
1. Lookup(36)
Node Join (2)
2. N36 sets its own successor pointer to N40
[Figure: N25, N36, N40; K30 and K38 still at N40]

Node Join (3)
3. Copy keys 26..36 from N40 to N36
[Figure: K30 moves to N36; K38 stays at N40]
Node Join (4)
4. Set N25's successor pointer to N36
[Figure: N25 -> N36 -> N40; K30 at N36, K38 at N40]

Stabilization
- Case 1: finger tables are reasonably fresh
- Case 2: successor pointers are correct; fingers are inaccurate
- Case 3: successor pointers are inaccurate or key migration is incomplete
- Stabilization algorithm periodically verifies and refreshes node knowledge
  - Successor pointers
  - Predecessor pointers
  - Finger tables
- Update finger pointers in the background
- Correct successors produce correct lookups
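A minimal sketch of the periodic stabilize/notify step, following the pseudocode in the Chord paper (the class layout is illustrative; fingers and key migration are omitted):

    RING = 2 ** 7

    def between(x, a, b):
        return (x - a) % RING < (b - a) % RING

    class Node:
        def __init__(self, ident):
            self.id = ident
            self.successor = self
            self.predecessor = None

        def stabilize(self):
            # Run periodically: adopt a node that joined between me and
            # my successor, then remind my successor that I exist.
            x = self.successor.predecessor
            if x not in (None, self, self.successor) and \
                    between(x.id, self.id, self.successor.id):
                self.successor = x
            self.successor.notify(self)

        def notify(self, other):
            # 'other' believes it may be my predecessor.
            if self.predecessor is None or \
                    between(other.id, self.predecessor.id, self.id):
                self.predecessor = other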
Failures and Replication
[Figure: ring with N10, N80, N85, N102, N113, N120; several of N80's successors have failed, and a Lookup(90) goes wrong: N80 doesn't know its correct successor, so the lookup is incorrect]

Solution: successor lists
- Each node knows its r immediate successors
  - After a failure, it will know the first live successor
  - Correct successors guarantee correct lookups
- Guarantee is with some probability
Choosing the successor list length
- Assume 1/2 of the nodes fail
- P(successor list all dead) = (1/2)^r
  - I.e. P(this node breaks the Chord ring)
  - Depends on independent failure
- P(no broken nodes) = (1 - (1/2)^r)^N
- r = 2 log(N) makes this probability = 1 - 1/N

Chord status
- Working implementation as part of CFS
- Chord library: 3,000 lines of C++
- Deployed in small Internet testbed
- Includes:
  - Correct concurrent join/fail
  - Proximity-based routing for low delay
  - Load control for heterogeneous nodes
  - Resistance to spoofed node IDs
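The successor-list arithmetic from the slide above, checked numerically (a throwaway sketch):

    import math

    N = 1000                      # nodes, half of which fail
    r = round(2 * math.log2(N))   # successor list length r = 2 log(N)
    p_dead = 0.5 ** r             # P(one node's successor list all dead)
    p_ok = (1 - p_dead) ** N      # P(no node breaks the ring)
    print(r, p_dead, p_ok)        # p_ok comes out close to 1 - 1/N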
Experimental overview
- Quick lookup in large systems
- Low variation in lookup costs
- Robust despite massive failure
- Experiments confirm theoretical results
- See paper for more results

Chord lookup cost is O(log N)
[Figure: average messages per lookup vs. number of nodes; cost grows logarithmically, with constant 1/2]
Failure experimental setup
- Start 1,000 CFS/Chord servers
  - Successor list has 20 entries
- Wait until they stabilize
- Insert 1,000 key/value pairs
  - Five replicas of each
- Stop X% of the servers
- Immediately perform 1,000 lookups

Massive failures have little impact
[Figure: failed lookups (percent) vs. failed nodes (percent, 5 to 50); even at 50% failed nodes only about 1.6% of lookups fail, matching (1/2)^6]
Latency Measurements
- 180 nodes at 10 sites in US testbed
- 16 queries per physical site, for random keys
- Maximum < 600 ms
- Minimum > 40 ms
- Median = 285 ms
- Expected value = 300 ms (5 round trips)

Chord Summary
- Chord provides peer-to-peer hash lookup
- Efficient: O(log(N)) messages per lookup
- Robust as nodes fail and join
- Good primitive for peer-to-peer systems

http://www.pdos.lcs.mit.edu/chord
A Scalable, Content-Addressable Network
Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker

Content-Addressable Network (CAN)
- CAN: Internet-scale hash table
- Interface
  - insert(key, value)
  - value = retrieve(key)
- Properties
  - scalable
  - operationally simple
  - good performance
- Related systems: Chord / Pastry / Tapestry / Buzz / Plaxton ...
Problem Scope
- Design a system that provides the interface
  - scalability
  - robustness
  - performance
  - security
- Application-specific, higher-level primitives
  - keyword searching
  - mutable content
  - anonymity

CAN: basic idea
[Figure: a collection of nodes jointly storing the (K,V) entries of one big hash table]
CAN: basic idea
[Figures: insert(K1,V1) is issued at one node and forwarded to the node responsible for K1, which stores (K1,V1); later, retrieve(K1) issued at another node is routed to that same node and returns (K1,V1)]
CAN: solution
- virtual Cartesian coordinate space
- entire space is partitioned amongst all the nodes
  - every node "owns" a zone in the overall space
- abstraction
  - can store data at "points" in the space
  - can route from one "point" to another
  - point = the node that owns the enclosing zone

CAN: simple example
[Figure: a 2-d coordinate space, initially owned entirely by one node]
CAN: simple example
[Figures: nodes 1, 2, 3, and 4 join in turn; each join splits an existing zone in half, so every node ends up owning one rectangular zone of the space]
CAN: simple example

node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
(3) (a,b) stores (K,V)

[Figures: node I hashes K to the point (a,b) on the x and y axes, routes (K,V) toward that point, and the node owning the zone containing (a,b) stores (K,V)]
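A sketch of the insert path in Python; hx and hy are two independent hashes of the key onto the coordinate axes (the hash construction and names here are mine, not the paper's):

    import hashlib

    def _axis_hash(key, salt):
        d = hashlib.sha1((salt + key).encode()).digest()
        return int.from_bytes(d, "big") / 2 ** 160  # map into [0, 1)

    def can_point(key):
        # (a, b) = (hx(K), hy(K)): the key's home point in the 2-d space
        return (_axis_hash(key, "x"), _axis_hash(key, "y"))

    def insert(route, key, value):
        owner = route(can_point(key))  # node whose zone contains (a, b)
        owner.store[key] = value

    class WholeSpace:  # degenerate one-node CAN for a quick test
        store = {}

    insert(lambda point: WholeSpace, "K1", "V1")
    print(WholeSpace.store)  # {'K1': 'V1'}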
CAN: simple example

node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route "retrieve(K)" to (a,b)

CAN
- Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
CAN: routing table
[Figure: a node's zone with coordinates (x,y); the node keeps the IP addresses and zone coordinates of its immediate neighbors]

CAN: routing
[Figure: a message for point (a,b) is forwarded from zone to zone toward the zone containing (a,b)]
CAN: routing
- A node only maintains state for its immediate neighboring nodes

CAN: node insertion
1) Discover some node "I" already in CAN
[Figure: the new node contacts a bootstrap node, which tells it about node I]
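Since a node knows only its neighbors, forwarding is greedy: at each hop the message moves to the neighbor whose zone is closest to the destination point. A sketch under made-up helper names (owns, neighbors, and center describe the zone geometry; failures are ignored here and handled on the resilience slides later):

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def route(node, dest, owns, neighbors, center):
        # Follow the neighbor closest to dest until reaching the owner
        # of the zone containing dest. Assumes no failed nodes.
        while not owns(node, dest):
            node = min(neighbors(node), key=lambda nb: dist(center(nb), dest))
        return node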
CAN: node insertion
1) Discover some node "I" already in CAN
2) Pick a random point (p,q) in the space
[Figure: the new node, having discovered I, picks the random point (p,q)]
CAN: node insertion
3) I routes to (p,q) and discovers node J, the owner of the zone containing (p,q)
4) Split J's zone in half... the new node owns one half
[Figure: the new node takes over half of J's zone]
CAN: node insertion
- Inserting a new node affects only a single other node and its immediate neighbors

CAN: node failures
- Need to repair the space
  - recover database
    - soft-state updates
    - use replication, rebuild database from replicas
  - repair routing
    - takeover algorithm
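A sketch of the zone bookkeeping behind steps 3-4 and the claim above: J halves its zone and hands the new node one half, plus the (key, value) entries whose points fall there (the zone representation is mine):

    def split_zone(zone):
        # zone = ((x_lo, x_hi), (y_lo, y_hi)); halve the longer side
        (xl, xh), (yl, yh) = zone
        if xh - xl >= yh - yl:
            mid = (xl + xh) / 2
            return ((xl, mid), (yl, yh)), ((mid, xh), (yl, yh))
        mid = (yl + yh) / 2
        return ((xl, xh), (yl, mid)), ((xl, xh), (mid, yh))

    def contains(zone, p):
        (xl, xh), (yl, yh) = zone
        return xl <= p[0] < xh and yl <= p[1] < yh

    def hand_over(j_zone, j_store, point_of):
        # J keeps one half; the joiner gets the other half and the
        # entries now located in it. Only J and the joiner's new
        # neighbors need to learn about the change.
        keep, give = split_zone(j_zone)
        moved = {k: v for k, v in j_store.items()
                 if contains(give, point_of(k))}
        for k in moved:
            del j_store[k]
        return keep, give, moved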
CAN: takeover algorithm
- Simple failures
  - know your neighbor's neighbors
  - when a node fails, one of its neighbors takes over its zone
- More complex failure modes
  - simultaneous failure of multiple adjacent nodes
  - scoped flooding to discover neighbors
  - hopefully, a rare event

CAN: node failures
- Only the failed node's immediate neighbors are required for recovery
Design recap
- Basic CAN
  - completely distributed
  - self-organizing
  - nodes only maintain state for their immediate neighbors
- Additional design features
  - multiple, independent spaces (realities)
  - background load-balancing algorithm
  - simple heuristics to improve performance

Evaluation
- Scalability
- Low latency
- Load balancing
- Robustness
CAN: scalability
- For a uniformly partitioned space with n nodes and d dimensions
  - per node, number of neighbors is 2d
  - average routing path is (d * n^(1/d))/4 hops
  - simulations show that the above results hold in practice
- Can scale the network without increasing per-node state
- Compare Chord/Plaxton/Tapestry/Buzz: log(n) neighbors with log(n) hops

CAN: low-latency
- Problem
  - latency stretch = (CAN routing delay) / (IP routing delay)
  - application-level routing may lead to high stretch
- Solution
  - increase dimensions
  - heuristics
    - RTT-weighted routing
    - multiple nodes per zone (peer nodes)
    - deterministically replicate entries
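Plugging numbers into these formulas (a back-of-the-envelope check, not a figure from the paper):

    n = 65536
    for d in (2, 10):
        # neighbors = 2d, average path = (d * n^(1/d)) / 4
        print(d, 2 * d, d * n ** (1 / d) / 4)
    # d=2:  4 neighbors, 128.0 hops
    # d=10: 20 neighbors, ~7.6 hops

which is why increasing d is the first lever against path length and hence latency stretch.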
CAN: low-latency
[Figures: latency stretch vs. number of nodes (16K, 32K, 65K, 131K) for #dimensions = 2 and #dimensions = 10, each with and without heuristics; the heuristics reduce stretch substantially in both cases]
CAN: load balancing
- Two pieces
- Dealing with hot-spots
  - popular (key,value) pairs
  - nodes cache recently requested entries
  - overloaded node replicates popular entries at neighbors
- Uniform coordinate space partitioning
  - uniformly spread (key,value) entries
  - uniformly spread out routing load

Uniform Partitioning
- Added check
  - at join time, pick a zone
  - check neighboring zones
  - pick the largest zone and split that one
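A sketch of that join-time check (helper names are mine): rather than splitting whatever zone the random point lands in, compare it with its neighbors and split the largest.

    def zone_volume(zone):
        (xl, xh), (yl, yh) = zone
        return (xh - xl) * (yh - yl)

    def pick_zone_to_split(landed_on, neighbors_of):
        # 'landed_on' owns the zone containing the random join point;
        # split the largest zone in its neighborhood instead.
        candidates = [landed_on] + list(neighbors_of(landed_on))
        return max(candidates, key=lambda node: zone_volume(node.zone))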
Uniform Partitioning
[Figure: 65,000 nodes, 3 dimensions; percentage of nodes vs. zone volume, from V/16 to 8V where V = total volume / n; with the join-time check, zone volumes cluster far more tightly around V than without it]

CAN: Robustness
- Completely distributed
  - no single point of failure
- Not exploring database recovery
- Resilience of routing
  - can route around trouble
Routing resilience
[Figures: a message routed from source to destination across the zones; when nodes along the greedy path fail, the message is routed around the failed region and still reaches the destination]
Routing resilience

Node X::route(D)
  if (X cannot make progress to D)
    check if any neighbor of X can make progress
    if yes, forward message to one such neighbor
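A Python rendering of this fallback, extending the greedy forwarding sketched earlier (all helper names are mine): if no live neighbor is closer to D, hand the message to a live neighbor that can itself make progress.

    import math

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def next_hop(x, dest, neighbors, center, alive):
        live = [nb for nb in neighbors(x) if alive(nb)]
        closer = [nb for nb in live
                  if dist(center(nb), dest) < dist(center(x), dest)]
        if closer:  # normal greedy progress
            return min(closer, key=lambda nb: dist(center(nb), dest))
        # X cannot make progress: pick a neighbor that can.
        for nb in live:
            if any(dist(center(m), dest) < dist(center(x), dest)
                   for m in neighbors(nb) if alive(m)):
                return nb
        return None  # routing failed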
Routing resilience
[Figures: probability of successful routing in a 16K-node CAN; one panel plots Pr(successful routing) vs. Pr(node failure) from 0 to 0.75 at #dimensions = 10, the other plots it vs. the number of dimensions (2 to 10) at Pr(node failure) = 0.25]
Summary
- CAN
  - an Internet-scale hash table
  - potential building block in Internet applications
- Scalability
  - O(d) per-node state
- Low-latency routing
  - simple heuristics help a lot
- Robust
  - decentralized, can route around trouble