Distributed Hash Tables (DHTs): Chord & CAN
CS 699/IT 818

Acknowledgements

 The following slides are borrowed or adapted from talks by Robert Morris (MIT) and Sanjeev Setia (ICSI)

Chord

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Robert Morris, Ion Stoica, David Karger, M. Frans Kaashoek, Hari Balakrishnan
MIT and Berkeley
SIGCOMM Proceedings, 2001

Simplicity

 Resolution entails participation by O(log(N)) nodes
 Resolution is efficient when each node enjoys accurate information about O(log(N)) other nodes
 Resolution is possible when each node enjoys accurate information about 1 other node
 “Degrades gracefully”

Chord Algorithms

 Basic Lookup
 Node Joins
 Stabilization
 Failures and Replication

Chord Properties

 Efficient: O(log(N)) messages per lookup
   N is the total number of servers
 Scalable: O(log(N)) state per node
 Robust: survives massive failures
 Proofs are in paper / tech report
   Assuming no malicious participants


Chord IDs

 Key identifier = SHA-1(key)
 Node identifier = SHA-1(IP address)
 Both are uniformly distributed
 Both exist in the same ID space
 How to map key IDs to node IDs?

Consistent Hashing [Karger 97]

 Target: web page caching
 Like normal hashing, assigns items to buckets so that each bucket receives roughly the same number of items
 Unlike normal hashing, a small change in the bucket set does not induce a total remapping of items to buckets
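To make the ID mapping concrete, here is a minimal Python sketch (the node IP list, key name, and helper names are illustrative, not from the paper): keys and node addresses are hashed with SHA-1 into the same circular ID space, and a key is assigned to the first node ID at or after it.

    import hashlib

    M = 160                       # Chord uses the full 160-bit SHA-1 output as the ID space

    def chord_id(data: str) -> int:
        """Hash a key or an IP address into the circular ID space [0, 2^M)."""
        return int(hashlib.sha1(data.encode()).hexdigest(), 16) % (2 ** M)

    def successor(key_id: int, node_ids: list[int]) -> int:
        """Map a key ID to the node with the next-higher ID (wrapping around the ring)."""
        for nid in sorted(node_ids):
            if nid >= key_id:
                return nid
        return min(node_ids)      # wrap around to the lowest node ID

    # Illustrative use: three nodes identified by IP address, one key
    nodes = [chord_id(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]]
    print(successor(chord_id("some-file.txt"), nodes))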


Consistent Hashing [Karger 97]

[Figure: circular 7-bit ID space with nodes (e.g. N10, N32, N90, N105) and keys K5, K20, K80 mapped to their successors]

 A key is stored at its successor: the node with the next-higher ID

Basic lookup

[Figure: the query “Where is key 80?” is forwarded from node to node around the ring until it reaches K80’s successor, which replies “N90 has K80”]

Simple lookup algorithm

  Lookup(my-id, key-id)
    n = my successor
    if my-id < n < key-id
      call Lookup(key-id) on node n   // next hop
    else
      return my successor             // done

 Correctness depends only on successors

“Finger table” allows log(N)-time lookups

[Figure: node N80’s fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring]
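A runnable sketch of this successor-only lookup, assuming a tiny in-memory ring (the Node class, in_interval helper, and the three-node example are illustrative): the query simply follows successor pointers until the key falls between a node and its successor.

    class Node:
        def __init__(self, node_id: int):
            self.node_id = node_id
            self.successor: "Node" = self   # set properly once the ring is built

        def find_successor(self, key_id: int) -> "Node":
            # If the key falls between me and my successor, my successor owns it.
            if in_interval(key_id, self.node_id, self.successor.node_id):
                return self.successor
            # Otherwise hand the query to the next node around the ring.
            return self.successor.find_successor(key_id)

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular interval (a, b] of the ID space."""
        if a < b:
            return a < x <= b
        return x > a or x <= b      # interval wraps past zero

    # Illustrative three-node ring: 10 -> 32 -> 90 -> 10
    n10, n32, n90 = Node(10), Node(32), Node(90)
    n10.successor, n32.successor, n90.successor = n32, n90, n10
    print(n10.find_successor(80).node_id)   # -> 90, since 80 lies in (32, 90]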


Lookup with fingers

  Lookup(my-id, key-id)
    look in local finger table for
      highest node n s.t. my-id < n < key-id
    if n exists
      call Lookup(key-id) on node n   // next hop
    else
      return my successor             // done

Finger i points to the successor of n + 2^(i-1)

[Figure: from node N80, the finger target 112 has successor N120; fingers cover 1/2, 1/4, …, 1/128 of the ring]
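A sketch of the finger-based variant, extending the illustrative Node and in_interval from the previous sketch (FingerNode and its table layout are assumptions, not the paper's code): each step jumps to the highest finger that precedes the key, which is what yields the O(log N) hop count.

    class FingerNode(Node):
        def __init__(self, node_id: int):
            super().__init__(node_id)
            # fingers[i] is meant to hold the successor of (node_id + 2**i) mod ring size;
            # it is filled in when the ring is built.
            self.fingers: list["FingerNode"] = []

        def find_successor(self, key_id: int) -> "Node":
            if in_interval(key_id, self.node_id, self.successor.node_id):
                return self.successor
            # Jump as far as possible without overshooting the key, i.e. the
            # highest finger n with my-id < n < key-id.
            for finger in reversed(self.fingers):
                if (in_interval(finger.node_id, self.node_id, key_id)
                        and finger.node_id != key_id):
                    return finger.find_successor(key_id)
            return self.successor.find_successor(key_id)   # fall back to the plain walk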


Lookups take O(log(N)) hops

[Figure: Lookup(K19) issued at N32 reaches K19’s successor N20 in a few finger hops, on a ring with nodes N5, N10, N20, N60, N80, N99, N110]

Node Join - Linked List Insert

[Figure: new node N36 joins a ring containing N25 and N40 (which stores K30 and K38); step 1: N36 does Lookup(36) to find its place]

Node Join (2)

[Figure] 2. N36 sets its own successor pointer to N40

Node Join (3)

[Figure] 3. Keys in 26..36 (here K30) are copied from N40 to N36

Node Join (4)

[Figure] 4. Set N25’s successor pointer to N36

 Update finger pointers in the background
 Correct successors produce correct lookups

Stabilization

 Case 1: finger tables are reasonably fresh
 Case 2: successor pointers are correct; fingers are inaccurate
 Case 3: successor pointers are inaccurate or key migration is incomplete
 Stabilization algorithm periodically verifies and refreshes node knowledge
   Successor pointers
   Predecessor pointers
   Finger tables
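The slide lists what gets refreshed but not how; below is a minimal Python sketch in the spirit of the periodic stabilize/notify scheme described in the Chord paper, reusing in_interval from the earlier sketch (the RingNode type and field names are illustrative).

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RingNode:
        node_id: int
        successor: Optional["RingNode"] = None      # assumed wired when the ring is built
        predecessor: Optional["RingNode"] = None

    def stabilize(node: RingNode) -> None:
        """Run periodically: if a node has joined between us and our successor, adopt it,
        then remind our successor that we exist."""
        x = node.successor.predecessor
        if (x is not None and x is not node.successor
                and in_interval(x.node_id, node.node_id, node.successor.node_id)):
            node.successor = x
        notify(node.successor, node)

    def notify(node: RingNode, candidate: RingNode) -> None:
        """candidate believes it may be node's predecessor."""
        if node.predecessor is None or in_interval(
                candidate.node_id, node.predecessor.node_id, node.node_id):
            node.predecessor = candidate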


Failures and Replication

[Figure: after a failure, N80 doesn’t know its correct successor, so Lookup(90) on the ring N10, N80, N85, N102, N113, N120 returns an incorrect result]

Solution: successor lists

 Each node knows its r immediate successors
 After a failure, it will know the first live successor
 Correct successors guarantee correct lookups
 Guarantee is with some probability

Choosing the successor list length

 Assume 1/2 of nodes fail
 P(successor list all dead) = (1/2)^r
   i.e., P(this node breaks the Chord ring)
   Depends on independent failures
 P(no broken nodes) = (1 – (1/2)^r)^N
 r = 2 log(N) makes this prob. = 1 – 1/N

Chord status

 Working implementation as part of CFS
 Chord library: 3,000 lines of C++
 Deployed in small Internet testbed
 Includes:
   Correct concurrent join/fail
   Proximity-based routing for low delay
   Load control for heterogeneous nodes
   Resistance to spoofed node IDs
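To see why r = 2 log(N) gives the stated bound (assuming log base 2 and independent failures, as the slide does): (1/2)^(2 log2 N) = 1/N^2, so P(no broken nodes) = (1 − 1/N^2)^N ≈ 1 − 1/N. The snippet below just checks this numerically; the node counts are illustrative.

    import math

    def p_ring_survives(num_nodes: int) -> float:
        """P(no node's entire successor list is dead) when half the nodes fail."""
        r = 2 * math.log2(num_nodes)            # successor list length from the slide
        p_list_all_dead = 0.5 ** r              # = 1 / N^2
        return (1 - p_list_all_dead) ** num_nodes

    for n in (100, 1_000, 10_000):
        print(n, round(p_ring_survives(n), 6), round(1 - 1 / n, 6))
    # The two columns agree closely, matching the 1 - 1/N claim.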


Experimental overview

 Quick lookup in large systems
 Low variation in lookup costs
 Robust despite massive failure
 See paper for more results

 Experiments confirm theoretical results

Chord lookup cost is O(log N)

[Figure: average messages per lookup vs. number of nodes; the cost grows logarithmically, with a constant of 1/2]

Failure experimental setup

 Start 1,000 CFS/Chord servers
   Successor list has 20 entries
 Wait until they stabilize
 Insert 1,000 key/value pairs
   Five replicas of each
 Stop X% of the servers
 Immediately perform 1,000 lookups

Massive failures have little impact

[Figure: failed lookups (percent, 0–1.4) vs. failed nodes (percent, 5–50); annotated “(1/2)^6 is 1.6%”]


Latency Measurements

 180 nodes at 10 sites in US testbed
 16 queries per physical site (sic) for random keys
 Maximum < 600 ms
 Minimum > 40 ms
 Median = 285 ms
 Expected value = 300 ms (5 round trips)

Chord Summary

 Chord provides peer-to-peer hash lookup
 Efficient: O(log(n)) messages per lookup
 Robust as nodes fail and join
 Good primitive for peer-to-peer systems

http://www.pdos.lcs.mit.edu/chord


Content-Addressable Network (CAN)

 CAN: Internet-scale hash table
 Interface
   insert(key, value)
   value = retrieve(key)
 Properties
   scalable
   operationally simple
   good performance
 Related systems: Chord/Pastry/Tapestry/Buzz/Plaxton ...

A Scalable, Content-Addressable Network

Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker


Problem Scope

 Design a system that provides the interface
   scalability
   robustness
   performance
   security
 Application-specific, higher-level primitives
   keyword searching
   mutable content
   anonymity

CAN: basic idea

[Figure: a distributed hash table pictured as a collection of (K,V) entries spread across nodes]

[Figure: a node calls insert(K1,V1); the (K1,V1) entry is routed to and stored at the node responsible for K1]

[Figure: another node calls retrieve(K1); the lookup is routed to the node holding (K1,V1)]

CAN: solution

 virtual Cartesian coordinate space
 entire space is partitioned amongst all the nodes
   every node “owns” a zone in the overall space
 abstraction
   can store data at “points” in the space
   can route from one “point” to another
   point = the node that owns the enclosing zone

CAN: simple example

[Figure: a 2-d coordinate space owned entirely by a single node, node 1]

[Figure: as nodes 2, 3, and 4 join, the space is repeatedly split so that each of the four nodes owns one zone]

CAN: simple example

  node I::insert(K,V)
  (1) a = hx(K)
      b = hy(K)
  (2) route(K,V) -> (a,b)
  (3) node at (a,b) stores (K,V)

[Figure: node I hashes K to the point (a,b), routes the (K,V) pair through the grid, and the node owning the zone containing (a,b) stores it]

CAN: simple example

  node J::retrieve(K)
  (1) a = hx(K)
      b = hy(K)
  (2) route “retrieve(K)” to (a,b)

[Figure: node J hashes K to (a,b); the query is routed to the node holding (K,V)]

CAN

 Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
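hx and hy are simply two independent uniform hash functions; the sketch below (the hash construction, Zone class, and owner helper are illustrative, not the paper's code) shows how insert and retrieve both reduce to finding the zone that encloses the hashed point.

    import hashlib

    def hash_point(key: str) -> tuple[float, float]:
        """Map a key to a point (a, b) in the unit square using two independent hashes."""
        def h(data: str) -> float:
            digest = hashlib.sha1(data.encode()).digest()
            return int.from_bytes(digest[:8], "big") / 2 ** 64   # uniform in [0, 1)
        return h("x:" + key), h("y:" + key)

    class Zone:
        """A rectangular region of the coordinate space owned by one node."""
        def __init__(self, x_lo, x_hi, y_lo, y_hi):
            self.bounds = (x_lo, x_hi, y_lo, y_hi)
            self.store: dict[str, object] = {}

        def contains(self, point) -> bool:
            (a, b), (x_lo, x_hi, y_lo, y_hi) = point, self.bounds
            return x_lo <= a < x_hi and y_lo <= b < y_hi

    def owner(zones: list[Zone], point) -> Zone:
        """Stand-in for CAN routing: find the zone whose region encloses the point."""
        return next(z for z in zones if z.contains(point))

    # Illustrative use: two nodes split the unit square in half along x
    zones = [Zone(0.0, 0.5, 0.0, 1.0), Zone(0.5, 1.0, 0.0, 1.0)]
    point = hash_point("some-file.txt")
    owner(zones, point).store["some-file.txt"] = b"...data..."      # insert(K,V)
    print(owner(zones, point).store.get("some-file.txt"))           # retrieve(K)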

CAN: routing table

[Figure: a node’s routing table is the set of its immediate neighbors and their zones]

CAN: routing

[Figure: a message for the point (a,b) is forwarded from the node at (x,y) through successive neighboring zones toward (a,b)]

CAN: routing

 A node only maintains state for its immediate neighboring nodes

CAN: node insertion

[Figure: a new node contacts a bootstrap node to join the CAN]

 1) Discover some node “I” already in CAN

CAN: node insertion

 2) New node picks a random point (p,q) in the space
 3) I routes to (p,q), discovers node J
 4) Split J’s zone in half… new node owns one half

[Figure: the new node’s chosen point (p,q) falls inside J’s zone; after the split, J and the new node each own half of the old zone]
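A sketch of step 4, reusing the illustrative Zone and hash_point from the earlier sketch: J's zone is halved and the entries that now fall in the new half are handed over. (CAN proper cycles through the dimensions when choosing the split axis; splitting the longer side here is a simplification.)

    def split_zone(j: Zone) -> Zone:
        """Hand half of zone j to a newly joined node (simplified: split the longer side)."""
        x_lo, x_hi, y_lo, y_hi = j.bounds
        if (x_hi - x_lo) >= (y_hi - y_lo):                  # split vertically
            mid = (x_lo + x_hi) / 2
            new = Zone(mid, x_hi, y_lo, y_hi)
            j.bounds = (x_lo, mid, y_lo, y_hi)
        else:                                               # split horizontally
            mid = (y_lo + y_hi) / 2
            new = Zone(x_lo, x_hi, mid, y_hi)
            j.bounds = (x_lo, x_hi, y_lo, mid)
        # Hand over the (K,V) entries whose points now fall in the new half.
        for key in list(j.store):
            if new.contains(hash_point(key)):
                new.store[key] = j.store.pop(key)
        return new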

CAN: node insertion

 Inserting a new node affects only a single other node and its immediate neighbors

CAN: node failures

 Need to repair the space
   recover database
    . soft-state updates
    . use replication, rebuild database from replicas
   repair routing
    . takeover algorithm


CAN: takeover algorithm

 Simple failures
   know your neighbor’s neighbors
   when a node fails, one of its neighbors takes over its zone
 More complex failure modes
   simultaneous failure of multiple adjacent nodes
   scoped flooding to discover neighbors
   hopefully, a rare event

CAN: node failures

 Only the failed node’s immediate neighbors are required for recovery

Design recap

 Basic CAN
   completely distributed
   self-organizing
   nodes only maintain state for their immediate neighbors
 Additional design features
   multiple, independent spaces (realities)
   background load balancing algorithm
   simple heuristics to improve performance

Evaluation

 Scalability
 Low-latency
 Load balancing
 Robustness

CAN: scalability

 For a uniformly partitioned space with n nodes and d dimensions
   per node, number of neighbors is 2d
   average routing path is (d/4)(n^(1/d)) hops
   simulations show that the above results hold in practice
 Can scale the network without increasing per-node state
 Chord/Plaxton/Tapestry/Buzz
   log(n) neighbors with log(n) hops

CAN: low-latency

 Problem
   latency stretch = (CAN routing delay) / (IP routing delay)
   application-level routing may lead to high stretch
 Solution
   increase dimensions
   heuristics
    . RTT-weighted routing
    . multiple nodes per zone (peer nodes)
    . deterministically replicate entries
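To get a feel for these formulas, the short calculation below prints the neighbor count 2d and the average path length (d/4)·n^(1/d) for a few dimensions at a fixed, illustrative network size.

    # Illustrative numbers for the CAN scaling formulas: 2d neighbors per node,
    # average path length (d/4) * n^(1/d) hops.
    n = 131_072           # number of nodes (the upper end of the simulated range)
    for d in (2, 3, 6, 10):
        neighbors = 2 * d
        avg_hops = (d / 4) * n ** (1 / d)
        print(f"d={d:2d}  neighbors={neighbors:2d}  avg hops ~ {avg_hops:.1f}")

Raising d shrinks the path length sharply while the routing table only grows linearly, which is the trade-off the low-latency slide exploits.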


CAN: low-latency

[Figure: latency stretch vs. number of nodes (16K–131K), with and without heuristics, for #dimensions = 2 and #dimensions = 10; the heuristics sharply reduce stretch]

CAN: load balancing

 Two pieces
   Dealing with hot-spots
    . popular (key,value) pairs
    . nodes cache recently requested entries
    . overloaded node replicates popular entries at neighbors
   Uniform coordinate space partitioning
    . uniformly spread (key,value) entries
    . uniformly spread out routing load

Uniform Partitioning

 Added check (see the sketch below)
   at join time, pick a zone
   check neighboring zones
   pick the largest zone and split that one
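A sketch of that check, reusing the illustrative Zone class from the earlier sketch (zone_volume and pick_zone_to_split are assumed helper names): instead of always splitting the zone it landed in, the joining node splits the largest zone among that zone and its neighbors, which keeps zone volumes, and hence key and routing load, closer to uniform.

    def zone_volume(z: Zone) -> float:
        x_lo, x_hi, y_lo, y_hi = z.bounds
        return (x_hi - x_lo) * (y_hi - y_lo)

    def pick_zone_to_split(candidate: Zone, neighbors: list[Zone]) -> Zone:
        """Uniform-partitioning check at join time: split the largest nearby zone."""
        return max([candidate, *neighbors], key=zone_volume)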


Uniform Partitioning

[Figure: 65,000 nodes, 3 dimensions; percentage of nodes owning zones of volume V/16 … 8V (where V = total volume / n), with and without the join-time check]

CAN: Robustness

 Completely distributed
   no single point of failure
 Not exploring database recovery
 Resilience of routing
   can route around trouble

Routing resilience

[Figure: a route from a source zone to a destination zone across the coordinate space; subsequent figures show the route detouring around failed nodes]

Routing resilience

  Node X::route(D)
    if (X cannot make progress toward D)
      check if any neighbor of X can make progress
      if yes, forward message to one such neighbor
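One way to read this rule as code (the coordinates, the neighbor tuples, and the "can make progress" test are illustrative assumptions, not the paper's algorithm): prefer the live neighbor closest to the destination; if none is closer than X itself, fall back to a live neighbor whose own neighbors can still make progress.

    import math
    import random

    def dist(p, q) -> float:
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def next_hop(x_center, dest, neighbors):
        """Pick X's next hop toward the point dest.

        'neighbors' is an illustrative list of (center, alive, their_neighbor_centers)
        tuples. Normally forward to the live neighbor closest to dest; if none of them
        makes progress (e.g. the nodes in that direction have failed), fall back to any
        live neighbor that can itself make progress via its own neighbors.
        """
        live = [(c, nn) for c, alive, nn in neighbors if alive]
        progress = [(c, nn) for c, nn in live if dist(c, dest) < dist(x_center, dest)]
        if progress:                                  # the usual greedy step
            return min(progress, key=lambda cn: dist(cn[0], dest))[0]
        detour = [c for c, nn in live
                  if any(dist(m, dest) < dist(x_center, dest) for m in nn)]
        return random.choice(detour) if detour else None   # None: routing is stuck

    # Illustrative call: X at (0.5, 0.5), destination (0.9, 0.5); the neighbor at
    # (0.7, 0.5) is dead, so the detour goes through (0.5, 0.7), whose own neighbor
    # at (0.7, 0.7) is closer to the destination than X is.
    print(next_hop((0.5, 0.5), (0.9, 0.5),
                   [((0.7, 0.5), False, []),
                    ((0.5, 0.7), True, [(0.7, 0.7)]),
                    ((0.3, 0.5), True, [(0.1, 0.5)])]))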


Routing resilience

[Figure: Pr(successful routing) for a 16K-node CAN: one panel varies Pr(node failure) from 0 to 0.75 (with #dimensions = 10), the other varies the number of dimensions from 2 to 10 (with Pr(node failure) = 0.25)]

Summary

 CAN
   an Internet-scale hash table
   potential building block in Internet applications
 Scalability
   O(d) per-node state
 Low-latency routing
   simple heuristics help a lot
 Robust
   decentralized, can route around trouble
